Tutorial: Overfitting/Underfitting and Bias/Variance#


Tutorial for the class on Overfitting/Underfitting and Bias/Variance

Tutorial Objectives
  • Evaluate model performance by estimating the Expected Prediction Error (EPE) using test data.

  • Same as above but with cross-validation.

  • Compute and plot learning curves.

  • Estimate the irreducible error and bias error.

  • Improve the models by modifying the input features.

We are going to study the temperature variations in the upper part of the equatorial ocean. Our goal is to get an estimate of the temperature as a function of depth \(T = f(z)\). We are going to suppose that \(f\) is linear such that \(T = \alpha z + \beta\).

Argo profiles#

To measure temperature and salinity in the ocean, the scientific community has deployed Argo floats in all ocean basins. Argo floats look like cylinders that can adjust their density (like a submarine). They are programmed to go up and down in the ocean and measure temperature, salinity, and pressure along their path. The figure below illustrates a typical work cycle of such a float: each time it moves up, it records a profile.

Figure: typical work cycle of an Argo float (from Walczowski et al., 2020).

Download the data#

We are going to analyze the data taken from a single Argo float.

Question

  • Check in the data folder that you have a file named nodc_13858_prof.nc.

  • If not, you can download it from https://data.nodc.noaa.gov/argo/gadr/data/aoml/13858/nodc_13858_prof.nc and place it in the data folder (as in Tutorial 1).

Information about the data#

  • The dataset contains 48 profiles of temperature and salinity recorded at predefined pressure levels near the Equator in the Atlantic Ocean (latitude: +2°, longitude: −14°). It is a subset of the Argo dataset.

  • In our subset, there are n_prof = 48 vertical profiles recorded between July 1997 and December 1998. At the equator there is no seasonal cycle, so all profiles can be considered as drawn from the same distribution. Each profile should be considered as one coherent sample.

  • We are going to suppose that pressure in decibar is equivalent to depth in meters.

  • We will focus on the upper part of the ocean (upper 100 m), which corresponds to the first 15 measurements of each profile.

Getting ready#

In the cell below, we import the main libraries and load the dataset into the ds object. We extract a subsample of pressures (depths) and temperatures that we store in the variables x_pres and y_temp, respectively. These are 2D arrays where each row corresponds to one individual profile.

# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt
# read netCDF files
import xarray as xr


# Set data directory
data_dir = Path('data')
argo_filename = 'nodc_13858_prof.nc'
argo_filepath = Path(data_dir, argo_filename)

# name of the temperature and pressure variables in the netcdf file
var_temp = 'temp' 
var_pres = 'pres'

# load the data
ds = xr.open_dataset(argo_filepath)

# number of profiles (the n_prof coordinate runs from 0 to n_prof - 1)
n_prof = int(ds['n_prof'][-1]) + 1
print(f'There are {n_prof} profiles in this dataset')

n_max = 15
print(f'For each profile, we are going to keep only the first {n_max} measurements')

x_pres = ds[var_pres].values[:,:n_max]
y_temp = ds[var_temp].values[:,:n_max]
There are 48 profiles in this dataset
For each profile, we are going to keep only the first 15 measurements

Question

  • Check in the ds object that the unit of temperature is degree Celsius and the unit of pressure is decibar. In the remainder of the tutorial, we will consider that the pressure in decibar is equivalent to the depth in meters.

  • Pick the first profile ip = 0, and plot the temperature as a function of depth. Don’t forget the labels!

# your answer here
ip = 0
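
One possible sketch, plotting depth on the vertical axis increasing downward, as is customary in oceanography (the axis orientation is our choice, not prescribed by the question):

plt.plot(y_temp[ip], x_pres[ip], 'o-')
plt.gca().invert_yaxis()  # depth increases downward
plt.xlabel('Temperature (°C)')
plt.ylabel('Depth (m)')
plt.title(f'Profile {ip}')
plt.show()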

Question

  • Do a linear regression to estimate the coefficients \(\alpha\) and \(\beta\) (as explained above) with this profile as training data.

# your answer here
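
A minimal sketch using scikit-learn's LinearRegression, assuming this profile has no missing values (the variable names X_train, y_train and reg are our own). Scikit-learn expects inputs of shape (n_samples, n_features), hence the reshape:

from sklearn.linear_model import LinearRegression

X_train = x_pres[ip].reshape(-1, 1)  # shape (n_samples, 1)
y_train = y_temp[ip]

reg = LinearRegression().fit(X_train, y_train)
print(f'alpha = {reg.coef_[0]:.4f} °C/m, beta = {reg.intercept_:.2f} °C')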

Question

  • Plot your linear regression on top of the raw data. Don’t forget the labels!

# your answer here
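
One possible sketch, reusing reg, X_train and y_train from the fit above:

plt.plot(y_train, x_pres[ip], 'o', label='data')
plt.plot(reg.predict(X_train), x_pres[ip], '-', label='linear fit')
plt.gca().invert_yaxis()
plt.xlabel('Temperature (°C)')
plt.ylabel('Depth (m)')
plt.legend()
plt.show()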

Question

  • What is the training score \(R^2\) for this linear regression?

  • What do you think of this value of \(R^2\)?

# your answer here
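
One way to get it, with the reg fitted above (the score method of a regressor returns R² by default):

r2_train = reg.score(X_train, y_train)
print(f'Training R² = {r2_train:.3f}')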

your answer here

Question

  • Select a random profile ip2 between 1 and n_prof - 1.

  • We are going to use this new profile as the testing set. What is the testing score \(R^2\)?

  • Are you overfitting? Justify your answer.

# your answer here
ip2 = np.random.randint(1, n_prof)  # random integer in [1, n_prof - 1]

print(f"Working with profile number: ip2 = {ip2}")
Working with profile number: ip2 = 10
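
A minimal sketch, reusing the reg trained on profile ip above and assuming profile ip2 has no missing values:

X_test = x_pres[ip2].reshape(-1, 1)
y_test = y_temp[ip2]

r2_test = reg.score(X_test, y_test)
print(f'Testing R² = {r2_test:.3f}')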

Learning curve#

In general, there are two possible reasons why we may be overfitting the data: either we do not have enough data, or the model is too complicated. Since linear regression is one of the simplest possible models, we will focus on the amount of data needed to extract a general law for this problem. This can be studied with a learning curve.

What is a sample?#

A key assumption about datasets is that the points are independent and identically distributed (i.i.d.). In practice this is rarely true, and for our dataset it is clearly not the case: the points within a profile are correlated (i.e. not independent). There are then two possible approaches:

  • Treat each profile as a coherent block of data (the sample unit is then a whole profile rather than a single data point)

  • Shuffle the entire dataset

In this tutorial, we will adopt the first strategy to design well-balanced training and testing sets. So keep in mind that, for this example, one sample = one profile. Note that this type of consideration is not specific to this dataset: if you are aware of a pre-existing structure (seasonality, spatial proximity, etc.), you will need to keep it in mind when building your train and test sets.
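
For instance, keeping each profile whole simply means slicing along the profile axis (the split size below is an arbitrary illustration):

# Profile-wise split: a profile goes entirely to train or to test, never both
n_train = 36  # arbitrary choice for illustration
x_train, x_test = x_pres[:n_train], x_pres[n_train:]
y_train, y_test = y_temp[:n_train], y_temp[n_train:]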

Question

  • Compute and plot a learning curve. To do so:

    • Partition your dataset into n_train samples for training and n_test = n_prof - n_train samples for testing.

    • Define a list of training sets of increasing size

    • Loop over these training sets to iteratively:

      • Select the data for this training set

      • Train the model

      • Compute the train error from the train data

      • Compute the test error from the test data

      • Save both errors

    • Plot both error curves

  • Interpret the results.

# your answer here
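
A minimal sketch, assuming the upper-ocean subset has no missing values and holding out the last 12 profiles for testing (the split and the use of mean squared error are our choices):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

n_train_max = 36
X_test = x_pres[n_train_max:].reshape(-1, 1)
y_test = y_temp[n_train_max:].ravel()

train_sizes = np.arange(1, n_train_max + 1)
train_errors, test_errors = [], []

for n in train_sizes:
    # Train on the first n profiles only
    X_train = x_pres[:n].reshape(-1, 1)
    y_train = y_temp[:n].ravel()
    reg = LinearRegression().fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, reg.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, reg.predict(X_test)))

plt.plot(train_sizes, train_errors, label='train error')
plt.plot(train_sizes, test_errors, label='test error')
plt.xlabel('Number of training profiles')
plt.ylabel('Mean squared error')
plt.legend()
plt.show()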

Estimating the expected prediction error with cross-validation#

Question

  • Perform a \(k\)-fold cross-validation of your own by repeating the above estimation of the test error on all samples. To do so:

    • Use the split method of a sklearn.model_selection.KFold object initialized with the n_splits option to get a sequence of train and test indices over which to loop.

    • For each pair of train and test indices:

      • Select the train and test data from the input and output data;

      • Fit the model using the train data;

      • Use the fitted model to predict the target from the test inputs;

      • Estimate the \(R^2\) from the test output.

    • Average the \(R^2\) estimates.

# your answer here
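
A minimal sketch, assuming six folds (any divisor of 48 keeps the folds balanced). Splitting on the first axis of x_pres keeps each profile whole:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=6)  # assumed number of folds; 48 profiles -> 8 per fold
scores = []

# kf.split iterates over the first axis, so one sample = one profile
for train_idx, test_idx in kf.split(x_pres):
    X_train = x_pres[train_idx].reshape(-1, 1)
    y_train = y_temp[train_idx].ravel()
    X_test = x_pres[test_idx].reshape(-1, 1)
    y_test = y_temp[test_idx].ravel()
    reg = LinearRegression().fit(X_train, y_train)
    scores.append(r2_score(y_test, reg.predict(X_test)))

print(f'Mean R² over {kf.get_n_splits()} folds: {np.mean(scores):.3f}')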

Question

  • Verify your results using the cross_val_score function of sklearn.model_selection with the appropriate value for the cv option.

  • How does the \(R^2\) estimate from the cross-validation compare to your estimation above?

# your answer here
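
One possible sketch with cv=6 (our assumed fold count). Note that with 720 points and no shuffling, each contiguous fold holds 120 points, i.e. 8 whole profiles, so profiles are not split across folds:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = x_pres.reshape(-1, 1)
y = y_temp.ravel()

# default scoring for a regressor is R²
scores = cross_val_score(LinearRegression(), X, y, cv=6)
print(f'R² per fold: {np.round(scores, 3)}')
print(f'Mean R²: {scores.mean():.3f}')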

Answer:

Irreducible error, Bias error#

This dataset is special because all profiles are recorded at the same predefined depth levels. We can use this property to estimate specific quantities such as the irreducible error and the bias error.

Question

  • In one single figure, plot:

    • All sample points with small dots

    • The mean temperature at each depth

    • Your best linear regression

  • Propose a graphical interpretation for the irreducible error and the bias error

  • Where can you read the irreducible error on the learning curve?

# your answer here
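
A minimal sketch, taking "best linear regression" to mean a fit on all profiles at once (our interpretation):

from sklearn.linear_model import LinearRegression

X_all = x_pres.reshape(-1, 1)
y_all = y_temp.ravel()
reg = LinearRegression().fit(X_all, y_all)

z = x_pres[0]                  # depth levels, identical for every profile
T_mean = y_temp.mean(axis=0)   # mean temperature at each depth

plt.plot(y_all, X_all.ravel(), '.', markersize=2, label='all samples')
plt.plot(T_mean, z, 'o-', label='mean at each depth')
plt.plot(reg.predict(z.reshape(-1, 1)), z, '-', label='linear fit')
plt.gca().invert_yaxis()
plt.xlabel('Temperature (°C)')
plt.ylabel('Depth (m)')
plt.legend()
plt.show()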

Question

  • In order to verify that the data are independent and identically distributed, plot the variance of the temperature at each depth.

  • What do you conclude?
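
A minimal sketch, where the variance is taken across profiles at each depth level:

z = x_pres[0]                 # depth levels
T_var = y_temp.var(axis=0)    # temperature variance at each depth

plt.plot(T_var, z, 'o-')
plt.gca().invert_yaxis()
plt.xlabel('Temperature variance (°C²)')
plt.ylabel('Depth (m)')
plt.show()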

Beyond a linear model#

We now consider the full depth of the ocean, so we no longer restrict ourselves to the first 100 m.

Question

  • Look at the whole temperature data. Two profiles have a lot of NaNs. Remove them. What would be a good depth to keep as many data points as possible?

  • Do a linear regression for the full dataset and plot it. Do you observe a bias error?

  • Add input features in the form of powers of \(z\) and use the linear regression method of scikit-learn to fit a function of the type \(f(z) = a_0 + a_1 z + a_2 z^2 ...\). Do you observe a reduction of the bias?

  • Do you think that, with this dataset, going from a linear fit to a polynomial fit will increase the variance error? Why?

# your answer here
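
A sketch of the polynomial fit on the full-depth arrays. The choice of which profiles to drop and of a cutoff depth is left to your exploration, so this sketch simply masks out NaNs; the degree is a hypothetical choice:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Full-depth arrays (not restricted to the first n_max levels)
x_full = ds[var_pres].values
y_full = ds[var_temp].values
mask = np.isfinite(x_full.ravel()) & np.isfinite(y_full.ravel())
X = x_full.ravel()[mask].reshape(-1, 1)
y = y_full.ravel()[mask]

degree = 3  # hypothetical choice; try other degrees
poly = PolynomialFeatures(degree=degree, include_bias=False)
reg = LinearRegression().fit(poly.fit_transform(X), y)

# Plot the polynomial fit over the raw data
z_plot = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.plot(y, X.ravel(), '.', markersize=2, label='data')
plt.plot(reg.predict(poly.transform(z_plot)), z_plot.ravel(), '-',
         label=f'degree-{degree} fit')
plt.gca().invert_yaxis()
plt.xlabel('Temperature (°C)')
plt.ylabel('Depth (m)')
plt.legend()
plt.show()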

Credit#

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.

