Tutorial: Overfitting/Underfitting and Bias/Variance#
Tutorial accompanying the class on Overfitting/Underfitting and Bias/Variance. In this tutorial, you will:

- Evaluate model performance by estimating the Expected Prediction Error (EPE) using test data.
- Do the same with cross-validation.
- Compute and plot learning curves.
- Estimate the irreducible error and the bias error.
- Improve the models by modifying the input features.
We are going to study the temperature variations in the upper part of the equatorial ocean. Our goal is to get an estimate of the temperature as a function of depth \(T = f(z)\). We are going to suppose that \(f\) is linear such that \(T = \alpha z + \beta\).
Argo profiles#
To measure temperature and salinity in the ocean, the scientific community has deployed Argo Floats in all ocean basins. Argo floats look like cylinders that can adjust their density (like a submarine). They are programmed to go up and down in the ocean and measure temperature, salinity and pressure along their path. The figure below is an illustration of a typical work cycle of such a float. Each time the Argo float moves up, it records a profile, as illustrated in the figure below.
Figure from Walczowski et al. (2020)

Download the data#
We are going to analyze the data taken from one Argo float.
Question

Check in the `data` folder that you have a file named `nodc_13858_prof.nc`. If not, you can download it from
https://data.nodc.noaa.gov/argo/gadr/data/aoml/13858/nodc_13858_prof.nc
and place it in the `data` folder (as in Tutorial 1).
Information about the data#
The dataset contains 48 profiles of temperature and salinity recorded at predefined pressure levels near the Equator in the Atlantic Ocean (latitude: +2°, longitude: −14°). It is a subset of the Argo dataset.
In our subset, there are `n_prof = 48` vertical profiles recorded between July 1997 and December 1998. At the Equator, there is no seasonal cycle, so all profiles can be considered to come from the same distribution. Each profile should be considered as one coherent sample. We are going to suppose that pressure in decibar is equivalent to depth in meters.

We will focus on the upper part of the ocean (upper 100 m), which corresponds to the first 15 measurements of each profile.
Getting ready#
In the cell below we import the main libraries and load the dataset into the `ds` object. We extract a subsample of pressures (depths) and temperatures that we store in the variables `x_pres` and `y_temp`, respectively. These are 2D arrays where each row corresponds to one individual profile.
# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt
# read netCDF files
import xarray as xr
# Set data directory
data_dir = Path('data')
argo_filename = 'nodc_13858_prof.nc'
argo_filepath = Path(data_dir, argo_filename)
# name of the temperature and pressure variables in the netcdf file
var_temp = 'temp'
var_pres = 'pres'
# load the data
ds = xr.open_dataset(argo_filepath)
n_prof = int(ds['n_prof'][-1]) + 1
print(f'There are {n_prof} profiles in this dataset')
n_max = 15
print(f'For each profile, we are going to keep only the first {n_max} measurements')
x_pres = ds[var_pres].values[:,:n_max]
y_temp = ds[var_temp].values[:,:n_max]
There are 48 profiles in this dataset
For each profile, we are going to keep only the first 15 measurements
Question

Check in the `ds` object that the unit of temperature is degree Celsius and the unit of pressure is decibar. In the remainder of the tutorial, we will consider that the pressure in decibar is equivalent to the depth in meters.

Pick the first profile, `ip = 0`, and plot the temperature as a function of depth. Don’t forget the labels!
# your answer here
ip = 0
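A possible way to build this figure is sketched below. Since `x_pres` and `y_temp` only exist after loading the netCDF file, the sketch uses a synthetic profile with the same shape; with the real data you would replace `z` and `t` by `x_pres[ip]` and `y_temp[ip]`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for one profile (replace with x_pres[ip], y_temp[ip])
z = np.linspace(5, 100, 15)   # pressure in decibar ~ depth in m
t = 27.0 - 0.05 * z           # temperature in degrees Celsius

fig, ax = plt.subplots()
ax.plot(t, z, marker='o')
ax.invert_yaxis()             # depth increases downward
ax.set_xlabel('Temperature (°C)')
ax.set_ylabel('Depth (m)')
ax.set_title('Argo profile ip = 0')
```

Inverting the vertical axis is a common convention for oceanographic profiles, so that the surface appears at the top of the figure.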
Question
Do a linear regression to estimate the coefficients \(\alpha\) and \(\beta\) (as explained above) with this profile as training data.
# your answer here
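A minimal sketch of the fit with scikit-learn is shown below, again on a synthetic stand-in profile (replace `z` and `t` by the real `x_pres[ip]` and `y_temp[ip]`). Note that scikit-learn expects a 2D array of input features, hence the `reshape`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for one profile (replace with x_pres[ip], y_temp[ip])
z = np.linspace(5, 100, 15)
t = 27.0 - 0.05 * z

# scikit-learn expects a 2D array of features: one column for depth
reg = LinearRegression().fit(z.reshape(-1, 1), t)
alpha, beta = reg.coef_[0], reg.intercept_   # T = alpha * z + beta
```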
Question
Plot your linear regression on top of the raw data. Don’t forget the labels!
# your answer here
Question
What is the training score \(R^2\) for this linear regression?
What do you think of this value of \(R^2\)?
# your answer here
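The training score can be obtained with the `score` method of the fitted model. The sketch below uses a noisy synthetic profile, so the score is high but strictly below 1; with the real data you would call `reg.score` on the training inputs and outputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
z = np.linspace(5, 100, 15)
# Noisy synthetic profile (replace with x_pres[ip], y_temp[ip])
t = 27.0 - 0.05 * z + rng.normal(0.0, 0.1, z.size)

reg = LinearRegression().fit(z.reshape(-1, 1), t)
r2_train = reg.score(z.reshape(-1, 1), t)   # R^2 on the training data
```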
your answer here
Question

Select a random profile `ip2` between `1` and `n_prof - 1`. We are going to use this new profile as a testing set.

What is the testing score \(R^2\)? Are you overfitting? Justify your answer.
# your answer here
# Draw a random profile index between 1 and n_prof - 1 (inclusive)
ip2 = np.random.randint(1, n_prof)
print(f"Working with profile number: ip2 = {ip2}")
Working with profile number: ip2 = 2
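Evaluating the model trained on one profile against a second, independent profile can be sketched as below, using two synthetic profiles drawn from the same law in place of `y_temp[ip]` and `y_temp[ip2]`. If the testing score is much lower than the training score, the model is overfitting the training profile.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
z = np.linspace(5, 100, 15)
# Two noisy synthetic profiles from the same linear law
# (replace with y_temp[ip] and y_temp[ip2] from the real data)
t_train = 27.0 - 0.05 * z + rng.normal(0.0, 0.1, z.size)
t_test = 27.0 - 0.05 * z + rng.normal(0.0, 0.1, z.size)

reg = LinearRegression().fit(z.reshape(-1, 1), t_train)
r2_test = reg.score(z.reshape(-1, 1), t_test)   # testing score
```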
Learning curve#
In general, there are two possible reasons for overfitting: either we do not have enough data, or the model is too complicated. Since linear regression is one of the simplest possible models, we will focus on the amount of data needed to extract a general law for this problem. This can be achieved with a learning curve.
What is a sample?#
A key assumption regarding datasets is that data points should be independent and identically distributed (i.i.d.). In practice this is rarely true and, for our dataset, it is clearly not the case because the points within a profile are correlated (i.e. not independent). There are then two possible approaches:

- Treat each profile as a coherent block of data (the new sample size is then a whole profile rather than one data point);
- Shuffle the entire dataset.
In this tutorial, we will adopt the first strategy to design well-balanced training and testing sets. So keep in mind that, for this example, one sample = one profile. Note that this type of consideration is not specific to this dataset: if you are aware of a pre-existing structure (seasonality, spatial proximity, etc.), you will need to keep it in mind when building your train and test data.
Question

Compute and plot a learning curve. To do so:

1. Partition your dataset into `n_train` samples for training and `n_test = n_prof - n_train` samples for testing.
2. Define a list of train periods of increasing length.
3. Loop over these train periods to iteratively:
   - Select the data for this train period;
   - Train the model;
   - Compute the train error from the train data for the train period;
   - Compute the test error from the test data for the test period;
   - Save both errors.
4. Plot both error curves.
5. Interpret the results.
# your answer here
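The steps above can be sketched as follows, using a synthetic dataset of 48 noisy profiles in place of `x_pres`/`y_temp` (one sample = one profile, as discussed above); the size of the test set and the list of train sizes are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_prof, n_max = 48, 15
z = np.linspace(5, 100, n_max)
# Synthetic dataset: n_prof noisy profiles sharing one linear law
temps = 27.0 - 0.05 * z + rng.normal(0.0, 0.5, (n_prof, n_max))

# Keep the last n_test profiles for testing
n_test = 8
x_test = np.tile(z, n_test).reshape(-1, 1)
y_test = temps[-n_test:].ravel()

train_sizes = [1, 2, 5, 10, 20, 40]
train_err, test_err = [], []
for n_train in train_sizes:
    x_train = np.tile(z, n_train).reshape(-1, 1)
    y_train = temps[:n_train].ravel()
    reg = LinearRegression().fit(x_train, y_train)
    train_err.append(1.0 - reg.score(x_train, y_train))  # train error
    test_err.append(1.0 - reg.score(x_test, y_test))     # test error

fig, ax = plt.subplots()
ax.plot(train_sizes, train_err, label='train error')
ax.plot(train_sizes, test_err, label='test error')
ax.set_xlabel('Number of training profiles')
ax.set_ylabel('$1 - R^2$')
ax.legend()
```

As the number of training profiles grows, the two error curves should converge toward a common plateau.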
Estimating the expected prediction error with cross-validation#
Question

Perform a \(k\)-fold cross-validation of your own by repeating the above estimation of the test error on all samples. To do so:

1. Use the `split` method of a `sklearn.model_selection.KFold` object initialized with the `n_splits` option to get a sequence of train and test indices over which to loop.
2. For each pair of train and test indices:
   - Select the train and test data from the input and output data;
   - Fit the model using the train data;
   - Use the fitted model to predict the target from the test inputs;
   - Estimate the \(R^2\) from the test output.
3. Average the \(R^2\) estimates.
# your answer here
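The loop described above can be sketched as follows, with `KFold` splitting over profiles and a synthetic stand-in for `y_temp`; the number of folds here is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_prof, n_max = 48, 15
z = np.linspace(5, 100, n_max)
# Synthetic dataset standing in for y_temp (one row per profile)
temps = 27.0 - 0.05 * z + rng.normal(0.0, 0.5, (n_prof, n_max))

kf = KFold(n_splits=6)                       # illustrative number of folds
scores = []
for train_idx, test_idx in kf.split(temps):  # indices refer to profiles
    x_train = np.tile(z, len(train_idx)).reshape(-1, 1)
    y_train = temps[train_idx].ravel()
    x_test = np.tile(z, len(test_idx)).reshape(-1, 1)
    y_test = temps[test_idx].ravel()
    reg = LinearRegression().fit(x_train, y_train)
    scores.append(reg.score(x_test, y_test))
r2_cv = np.mean(scores)                      # average R^2 over the folds
```

Splitting over profiles (rather than over individual points) keeps each sample intact, consistent with the one sample = one profile convention.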
Question

Verify your results using the `cross_val_score` function of `sklearn.model_selection` with the appropriate value for the `cv` option.

How does the \(R^2\) estimate from the cross-validation compare to your estimate above?
# answer cell
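For reference, a sketch using `cross_val_score` on a synthetic stand-in dataset is given below. The data are flattened profile by profile so that, with `cv=n_prof` and no shuffling, each contiguous fold of 15 points is exactly one profile (leave-one-profile-out); with the real data you would flatten `x_pres` and `y_temp` the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_prof, n_max = 48, 15
z = np.linspace(5, 100, n_max)
temps = 27.0 - 0.05 * z + rng.normal(0.0, 0.5, (n_prof, n_max))

# Flatten profile by profile: with cv=n_prof, each contiguous fold
# of 15 points is exactly one profile (leave-one-profile-out)
X = np.tile(z, n_prof).reshape(-1, 1)
y = temps.ravel()
scores = cross_val_score(LinearRegression(), X, y, cv=n_prof)
r2_cv = scores.mean()
```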
Answer:
Irreducible error, Bias error#
This data set is special because all measurements are at the same depth. We can use this property to measure specific quantities such as the irreducible error or the bias error.
Question
In one single figure, plot:

- All sample points with small dots;
- The mean temperature at each depth;
- Your best linear regression.

Propose a graphical interpretation for the irreducible error and the bias error. Where can you read the irreducible error on the learning curve?
# your answer here
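The figure can be sketched as below on a synthetic stand-in for `y_temp`. The spread of the dots around the mean curve at each depth illustrates the irreducible error, while the gap between the mean curve and the linear fit illustrates the bias error.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_prof, n_max = 48, 15
z = np.linspace(5, 100, n_max)
# Synthetic stand-in for y_temp (one row per profile)
temps = 27.0 - 0.05 * z + rng.normal(0.0, 0.5, (n_prof, n_max))

t_mean = temps.mean(axis=0)   # mean temperature at each depth
reg = LinearRegression().fit(np.tile(z, n_prof).reshape(-1, 1), temps.ravel())
t_fit = reg.predict(z.reshape(-1, 1))

fig, ax = plt.subplots()
# Spread of the dots around the mean curve ~ irreducible error
ax.scatter(temps.ravel(), np.tile(z, n_prof), s=2, alpha=0.3, label='samples')
ax.plot(t_mean, z, 'k-', label='mean at each depth')
# Gap between the mean curve and the fit ~ bias error
ax.plot(t_fit, z, 'r--', label='linear regression')
ax.invert_yaxis()
ax.set_xlabel('Temperature (°C)')
ax.set_ylabel('Depth (m)')
ax.legend()
```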
Question
In order to verify that the data are independent and identically distributed, plot the variance of the temperature at each depth.
What do you conclude?
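A minimal sketch, again on a synthetic stand-in with a constant noise level (with the real data, compute the variance of `y_temp` across profiles): if the variance is roughly constant with depth, this is consistent with identically distributed fluctuations.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_prof, n_max = 48, 15
z = np.linspace(5, 100, n_max)
# Synthetic stand-in for y_temp with a constant noise level
temps = 27.0 - 0.05 * z + rng.normal(0.0, 0.5, (n_prof, n_max))

t_var = temps.var(axis=0)     # sample variance at each depth

fig, ax = plt.subplots()
ax.plot(t_var, z, marker='o')
ax.invert_yaxis()
ax.set_xlabel('Temperature variance (°C$^2$)')
ax.set_ylabel('Depth (m)')
```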
Beyond a linear model#
We now consider the full depth of the ocean, and so do not restrict ourselves to the first 100 m.
Question
Look at the whole temperature data. Two profiles have a lot of NaNs. Remove them. What would be a good depth to keep as many data points as possible?
Do a linear regression for the full dataset and plot it. Do you observe a bias error?
Add input features in the form of powers of \(z\) and use the linear regression method of scikit-learn to fit a function of the type \(f(z) = a_0 + a_1 z + a_2 z^2 ...\). Do you observe a reduction of the bias?
Do you think that, with this dataset, going from a linear fit to a polynomial fit will increase the variance error? Why?
# your answer here
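Adding powers of \(z\) as input features can be sketched with `sklearn.preprocessing.PolynomialFeatures`. The sketch below uses a synthetic full-depth profile (warm surface layer, thermocline, cold deep water) as an illustrative stand-in for the real data; the degree is an illustrative choice, and depth is rescaled to kilometers to keep the powers well conditioned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Synthetic deep-ocean-like profile: warm surface layer, thermocline,
# cold deep water (illustrative stand-in for the full-depth data)
z = np.linspace(5, 1800, 60)
t = 4.0 + 23.0 * np.exp(-z / 300.0) + rng.normal(0.0, 0.1, z.size)

Z = (z / 1000.0).reshape(-1, 1)   # rescale depth for conditioning
r2_lin = LinearRegression().fit(Z, t).score(Z, t)

# Add input features z, z^2, ..., z^4 and fit a linear model on them
Z_poly = PolynomialFeatures(degree=4, include_bias=False).fit_transform(Z)
r2_poly = LinearRegression().fit(Z_poly, t).score(Z_poly, t)
```

The polynomial fit can capture the curvature of the thermocline that a straight line misses, which is how adding features reduces the bias error.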
Credit#
Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.