Regularization, Model Selection and Evaluation#
Understand the overfitting-underfitting and bias-variance tradeoffs
Estimate the expected prediction error using cross-validation
Use regularization to prevent overfitting
Be aware of underlying statistical assumptions (identical distribution, independence)
Cross-Validation#
Data is often scarce, so that leaving aside test data to estimate the Prediction Error (PE) is a problem
Cross-validation is used to estimate the Expected PE (EPE) while avoiding leaving aside test data
It uses part of the data to train, part of the data to test, repeating the operation on different subset selections
\(K\)-Fold Cross Validation#
Slice the training data into \(K\) roughly equal-sized parts;
Keep the \(k\)th part for validation and train on the \(K - 1\) other parts;
Compute the prediction error on the held-out \(k\)th part;
Repeat the operation for all \(K\) parts;
Average the prediction errors to estimate the EPE.
Cross-validation allows one to estimate expected prediction error rather than the prediction error conditioned on a particular training set (see Chap. 7.12 in Hastie et al. 2009).
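For instance, a minimal sketch of \(K\)-fold cross-validation with scikit-learn; the synthetic data and the linear model below are only illustrative:
# Estimate the EPE of a linear model with 5-fold cross-validation
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=N)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring='neg_mean_squared_error')
print("Fold MSEs:", -scores)
print("Estimated EPE:", -scores.mean())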
Choosing the number \(K\) of parts#
\(K\) small (e.g. \(K = 2\)):
biases towards large error (CV training sets of size \((K - 1) / K \times N\) are much smaller than the original training set size \(N\));
\(K\) large (e.g. \(K = N\), leave-one-out CV):
unbiased estimate of the EPE, but with high variance (the \(N\) “training sets” are very similar to each other);
\(K = 5\) or \(10\):
recommended by Hastie et al. (2009) as a good compromise, but depends on the case study.
Adapting Cross-Validation to the Data Structure#
For instance:
By choosing the number of parts \(K\) to adapt to cycles (e.g. avoid splitting years);
By grouping (e.g. if measurements for different people should be kept together);
By shuffling splits (e.g. if the ordering of the data is special);
By taking serial correlations into account in time series.
See https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators
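For instance, a minimal sketch of structure-aware splitters available in scikit-learn; the arrays and group labels below are placeholders chosen for illustration:
# Structure-aware cross-validation iterators
import numpy as np
from sklearn.model_selection import GroupKFold, ShuffleSplit, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)        # placeholder features
y = np.arange(12)                       # placeholder target
groups = np.repeat([0, 1, 2, 3], 3)     # e.g. measurements from 4 people
# Keep all measurements of a given group in the same fold
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    print("group-aware split:", train_idx, test_idx)
# Shuffle before splitting when the ordering is special but samples independent
for train_idx, test_idx in ShuffleSplit(
        n_splits=2, test_size=0.25, random_state=0).split(X):
    print("shuffled split:", train_idx, test_idx)
# Respect temporal ordering: always train on the past, validate on the future
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("time-series split:", train_idx, test_idx)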
Regularizing to Avoid Overfitting: Linear Models#
Linear Models Can Overfit#
Linear models are simpler than alternatives
\(\rightarrow\) they tend to overfit less than alternatives
They even often underfit when:
\(p\) is small compared to \(N\) and \(\sigma^{-2}\)
the problem is not linearly separable
But linear models can also overfit, when:
\(p\) is large compared to \(N\) and \(\sigma^{-2}\) (see Overfitting/Underfitting and Bias/Variance)
Many uninformative features (that depend a lot on other features)
Effect of uninformative features#
Recall from Ordinary Least Squares that, if we assume that the generating model is \(Y = \mathbf{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are uncorrelated, with mean zero and constant variance \(\sigma^2\), then the covariance matrix of the coefficient estimates is \(\mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\).
2D example#
Assume that \(Y = X_0 + \epsilon\).
OLS with \(\boldsymbol{X} = (X_0, X_1)\) centered, where \(X_1 = c X_0 + (1 - c) Z_1\) with \(Z_1\) independent of \(X_0\) and \(\epsilon\), so that the correlation between \(X_0\) and \(X_1\) increases with \(c\).
Thus,
\[\begin{equation} \mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} = \frac{\sigma^2}{(N - 1)\left(1 - \hat{\rho}^2\right)} \begin{pmatrix} 1 / \hat{\sigma}_0^2 & -\hat{\rho} / (\hat{\sigma}_0 \hat{\sigma}_1) \\ -\hat{\rho} / (\hat{\sigma}_0 \hat{\sigma}_1) & 1 / \hat{\sigma}_1^2 \end{pmatrix}, \end{equation}\]
where \(\hat{\rho}\) is the sample correlation between the two inputs and \(\hat{\sigma}_0^2\), \(\hat{\sigma}_1^2\) their sample variances.
So, a large absolute value of the correlation between the inputs implies:
a large coefficient variance, i.e. large coefficient errors;
a large (negative) coefficient covariance, i.e. errors in one coefficient associated with errors in the other.
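A small Monte Carlo sketch of this effect (synthetic data; the values of \(c\), \(N\) and the noise level are arbitrary): for a fixed design, the OLS coefficients are re-estimated under repeated noise draws, and their empirical variance and covariance grow with the correlation between the two inputs.
# Fixed design with two inputs, X_1 correlated with X_0 through c
import numpy as np

rng = np.random.default_rng(0)
N, sigma, n_repeat = 100, 1.0, 2000
x0 = rng.normal(size=N)
z1 = rng.normal(size=N)
for c in (0.0, 0.5, 0.95):
    x1 = c * x0 + (1 - c) * z1
    X = np.column_stack([x0, x1])
    X -= X.mean(axis=0)
    # Re-estimate the OLS coefficients for many noise realizations
    coefs = np.empty((n_repeat, 2))
    for r in range(n_repeat):
        y = x0 + sigma * rng.normal(size=N)
        coefs[r] = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
    cov = np.cov(coefs, rowvar=False)
    print(f"c = {c:.2f}: Var(beta_0) = {cov[0, 0]:.3f}, "
          f"Cov(beta_0, beta_1) = {cov[0, 1]:.3f}")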
Question (optional)
Interpret the role of covariances between inputs in OLS for \(p > 2\) using an eigendecomposition of the covariance matrix.
Looking for a solution to the confusion problem#
Many correlated variables \(\rightarrow\) high variance.
A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.
Limit the size of the coefficients to reduce the variance of the coefficients and predictions.
Ridge Regression#
Ridge Regression Problem#
A very common form of regularization for linear regression;
Shrink coefficients by imposing a penalty on their size;
Minimize a penalized RSS:
\[\begin{equation} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \underset{\boldsymbol{\beta}}{\mathrm{argmin}} \left\{ \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)^\top \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right) + \lambda \boldsymbol{\beta}^\top \boldsymbol{\beta} \right\}, \end{equation}\]
where \(\lambda \ge 0\) controls the amount of shrinkage.
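In scikit-learn, this penalized problem corresponds to the Ridge estimator, whose alpha parameter plays the role of \(\lambda\); a minimal sketch on synthetic data:
# Shrinkage of the coefficients as the regularization parameter increases
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, p = 50, 10
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.5 * rng.normal(size=N)
for alpha in (0.01, 1.0, 100.0):        # alpha plays the role of lambda
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha = {alpha:6.2f}, coefficient norm = "
          f"{np.linalg.norm(ridge.coef_):.3f}")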
Ridge Regression as a Constrained Optimization Problem#
For any \(\lambda\) there exists a \(t \ge 0\) such that the ridge regression problem is equivalent to the constrained problem:
\[\begin{equation} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \underset{\boldsymbol{\beta}}{\mathrm{argmin}} \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)^\top \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right) \quad \text{subject to} \quad \boldsymbol{\beta}^\top \boldsymbol{\beta} \le t. \end{equation}\]
Contrary to OLS and as for many regularization methods, ridge solutions are not equivariant under scaling of the inputs:
If different inputs represent different kinds of variables, one normally standardizes the inputs before solving.
If different inputs represent measurements of the same variable in different situations (location, epoch…), it may be interesting to keep the scales.
Penalization of the intercept would make the procedure depend on the origin chosen for \(Y\).
After centering the inputs by replacing each \(x_{ij}\) by \(x_{ij} - \bar{x}_j\), OLS gives \(\hat{\beta}_0 = \bar{y}\).
We can thus first solve the ridge for centered inputs and outputs and then add \(\hat{\beta}_0 = \bar{y}\) to the predictions.
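A minimal sketch of the standardization step using a scikit-learn pipeline (synthetic data with deliberately mixed scales):
# Standardize the inputs before applying the ridge penalty
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1., 10., 1000.])  # mixed scales
y = X[:, 0] + 0.01 * X[:, 2] + rng.normal(size=100)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps['ridge'].coef_)   # coefficients of the scaled inputs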
Effect of Ridge Penalty on Coefficients#
Question (optional)
Show that the solution to the ridge-penalized RSS is
(40)#\[\begin{equation} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\right)^{-1} \left(\mathbf{X}^\top \mathbf{y}\right), \end{equation}\]where \(\mathbf{I}\) is the \(p\times p\) identity matrix.
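A quick numerical check of this closed form against scikit-learn's Ridge, using fit_intercept=False since the formula assumes no unpenalized intercept (synthetic data):
# Check the closed-form ridge solution against scikit-learn
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, p, lam = 60, 4, 2.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))   # expected: True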
Question
How does this formula differ from the OLS solution for the coefficients?
When are these solutions unique?
Question (optional)
Show that the ridge estimates for orthogonal inputs with variances \(\sigma_j^2\) are given by:
\[\begin{equation} \hat{\beta}^\mathrm{Ridge}_j = \hat{\beta}^\mathrm{OLS}_j \frac{\sigma_j^2}{\sigma_j^2 + \frac{\lambda}{N - 1}}. \end{equation}\]
Question
Use these formulas to interpret the effect of the Ridge regularization on the coefficients for orthonormal inputs.
The regularization coefficient \(\lambda\) scales with \(N\) as can be seen from the formulas for \(\hat{\boldsymbol{\beta}}^\mathrm{Ridge}\) and from the fact that the weight penalty is added to the residual sum of squares and not to the mean squared error.
This means that, all other things remaining the same, \(\lambda\) must be increased when increasing the number of points in order to obtain the same effect from the regularization.
Ridge Penalty Effect on Variance#
Assume again that the generating model is \(Y = \mathbf{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are uncorrelated, with mean zero and constant variance \(\sigma^2\).
Question (optional)
Show that
\[\begin{equation} \mathrm{Var}(\hat{\boldsymbol{\beta}}^\mathrm{ridge} | \mathbf{X}) = \sigma^2 \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\right)^{-1} \mathbf{X}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\right)^{-1}, \end{equation}\]
and compare it to the OLS variance \(\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\).
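A Monte Carlo sketch checking this formula (fixed synthetic design, repeated noise draws; the values of \(N\), \(p\), \(\lambda\) and \(\sigma\) are arbitrary):
# Monte Carlo check of the ridge coefficient covariance matrix
import numpy as np

rng = np.random.default_rng(0)
N, p, lam, sigma, n_repeat = 50, 3, 5.0, 1.0, 20000
X = rng.normal(size=(N, p))                      # fixed design
beta = rng.normal(size=p)                        # true coefficients
A = np.linalg.inv(X.T @ X + lam * np.eye(p))
# Re-estimate the ridge coefficients for many noise realizations
betas = np.empty((n_repeat, p))
for r in range(n_repeat):
    y = X @ beta + sigma * rng.normal(size=N)
    betas[r] = A @ (X.T @ y)
var_mc = np.cov(betas, rowvar=False)             # Monte Carlo estimate
var_th = sigma ** 2 * A @ X.T @ X @ A            # formula above
print(np.abs(var_mc - var_th).max())             # should be close to zero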
Singular Value Decomposition Interpretation (optional)#
Let the SVD of \(\mathbf{X}\) be \(\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^\top\) with:
\(\mathbf{U}\) an \(N \times p\) orthogonal matrix with columns spanning the column space of \(\mathbf{X}\);
\(\mathbf{V}\) a \(p \times p\) orthogonal matrix with columns spanning the row space of \(\mathbf{X}\);
\(\mathbf{D}\) a \(p \times p\) diagonal matrix with diagonal entries \(d_1 \ge d_2 \ge \cdots \ge d_p \ge 0\) called the singular values of \(\mathbf{X}\).
Question (optional)
Show that the OLS estimates are such that \(\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{U} \mathbf{U}^\top \mathbf{y} = \sum_{j = 1}^p \mathbf{u}_j \mathbf{u}_j^\top \mathbf{y}\).
Show that the ridge estimates are such that \(\mathbf{X} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \sum_{j = 1}^p \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^\top \mathbf{y}\).
Question
Interpret how the ridge shrinks the coefficients \(\hat{\boldsymbol{\beta}}\) with respect to the orthonormal basis \(\mathbf{V}\).
What is the role of \(\mathrm{df}(\lambda) := \mathrm{tr}\left[\mathbf{X}(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top\right] = \sum_{j = 1}^p \frac{d_j^2}{d_j^2 + \lambda}\)?
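A short numerical sketch of the shrinkage factors \(d_j^2 / (d_j^2 + \lambda)\) and of \(\mathrm{df}(\lambda)\) on synthetic centered inputs:
# SVD of the (centered) input matrix and ridge shrinkage factors
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 80, 5, 10.0
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)
U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V^T
# Shrinkage factor applied along each column of U
shrink = d ** 2 / (d ** 2 + lam)
print("shrinkage factors:", np.round(shrink, 3))
# Effective degrees of freedom: trace formula vs. sum of shrinkage factors
df_trace = np.trace(X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T)
print("df(lambda):", round(df_trace, 3), "=", round(shrink.sum(), 3))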
Regularization on a Simple Example#
Small training set
Fit a linear model without regularization
Training points sampled at random
Can overfit if the data is noisy!
Regularization on a Simple Example: Random Training Sets#
Bias-Variance Tradeoff in Ridge Regression#
Figure: ridge fits ranging from too much variance (small \(\lambda\)), to the best tradeoff, to too much bias (large \(\lambda\)).
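A sketch of this tradeoff on synthetic data (the degree-9 polynomial features, the training-set size and the alpha values below are arbitrary illustrative choices):
# Ridge fits on many small noisy training sets for three values of alpha
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

def sample(n):
    """Draw a small noisy sample from a smooth target function."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.cos(3 * x[:, 0]) + 0.3 * rng.normal(size=n)
    return x, y

x_test, y_test = sample(1000)
for alpha in (1e-6, 1e-2, 1e2):   # weak / intermediate / strong regularization
    test_errors = []
    for _ in range(100):          # repeat over random small training sets
        x_train, y_train = sample(15)
        model = make_pipeline(PolynomialFeatures(degree=9),
                              StandardScaler(), Ridge(alpha=alpha))
        model.fit(x_train, y_train)
        test_errors.append(np.mean((model.predict(x_test) - y_test) ** 2))
    print(f"alpha = {alpha:.0e}: mean test MSE = {np.mean(test_errors):.3f}")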
Partial Summary#
Can overfit when:
\(N\) is too small compared to \(p\) and \(\sigma^2\)
In particular with non-informative features
Regularization for regression:
From linear regression to ridge regression \(\rightarrow\) less overfit
large regularization parameter \(\rightarrow\) strong regularization \(\rightarrow\) smaller coefficients \(\rightarrow\) smaller variance
Tuning Hyperparameters (model selection)#
Some hyperparameters need to be estimated in addition to coefficients (regularization parameter, nonlinear coefficients in linear models, etc.)
Validation data or Cross-Validation (CV) may be used to select the best hyperparameters by minimizing the estimated validation error (grid search, random search, etc.)
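A minimal sketch of such a grid search over the ridge regularization parameter with scikit-learn (synthetic data):
# Select the regularization parameter by grid search with 5-fold CV
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
search = GridSearchCV(Ridge(), {'alpha': np.logspace(-3, 3, 13)},
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print("best alpha:", search.best_params_['alpha'])
print("best CV MSE:", -search.best_score_)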
Testing after validation#
The best validation error is obtained for hyperparameters that were selected using the validation data: it is not the EPE for this choice of hyperparameters!
\(\rightarrow\) Distinguish between train, validation and test data
Training and validation can be combined using CV, but test data should be kept aside to evaluate the PE conditioned on the CV set for the final choice of coefficients and hyperparameters.
To maximize the use of the data and estimate the EPE, nested CV can be used.
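A minimal sketch of nested cross-validation with scikit-learn, where the inner loop selects the hyperparameter and the outer loop estimates the EPE of the whole tuning procedure (synthetic data):
# Nested cross-validation: tune alpha in the inner loop, estimate the EPE
# of the whole tuning procedure in the outer loop
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(Ridge(), {'alpha': np.logspace(-3, 3, 13)},
                      cv=inner_cv, scoring='neg_mean_squared_error')
nested_scores = cross_val_score(search, X, y, cv=outer_cv,
                                scoring='neg_mean_squared_error')
print("Nested-CV estimate of the EPE:", -nested_scores.mean())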
Law of Large Numbers?#
Estimates attempt to minimize a function of the training error \(\overline{\mathrm{err}}\).
For the estimates to converge with the sample size, the training error \(\overline{\mathrm{err}}\) should converge too.
\(\rightarrow\) We need some Law of Large Numbers to be applicable.
Basic assumptions: independent and identically distributed.
What could go wrong?#
So far, we have assumed that the joint distribution \(f_{\boldsymbol{X}, Y}\) is independent of time.
In particular, we have assumed that the joint process is statistically stationary.
Yet, in the natural and engineering sciences, variables and their distributions often depend on time (cycles, trends).
Variations in time can rarely be considered purely random:
\(\rightarrow\) some dependence persists between realizations
Yet, we are fine if we can show that:
there is a stationary distribution
realizations sufficiently distant in time no longer correlate
Violation of Statistical Stationarity#
Surface air temperature variability can be decomposed into:
(pseudo-)periodic cycles (diurnal, annual, Milankovitch);
a continuous spectrum of frequencies due to chaotic dynamics;
an increasing trend due to global warming;
other non-equilibrium variations (effect of volcanoes, solar activity, …).
Violation of Independence#
Climate perspective: Weather conditions tend to persist for a while so that observations that are close in time are more likely to be similar than observations distant in time.
Dynamical system perspective: The evolution of the state of the atmosphere can be described by a system of differential equations whose solutions are relatively smooth (not completely erratic).
Probabilistic perspective: This memory corresponds to a (serial) dependence between an observation at one time step and an observation at a previous time step. The auto-correlation for a given lag is a measure of linear dependence.
Example: Serial dependence in the Paris climate (from ERA5).
# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)
# Set data directory
data_dir = Path('data')
# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(header=0, index_col=0, parse_dates=True)
# Set first and last years
FIRST_YEAR = 2000
LAST_YEAR = 2009
# Define temperature filepath
filename = 'era5_paris_sf_{}_{}.csv'.format(
FIRST_YEAR, LAST_YEAR)
filepath = Path(data_dir, filename)
# Read ERA5 time series for Paris
df = pd.read_csv(filepath, **kwargs_read_csv).resample('D').mean()
df.index = pd.DatetimeIndex(df.index.date, name='time')
da = df.to_xarray()
# Get anomalies
gp = da.groupby('time.dayofyear')
da_clim = gp.mean('time')
da_anom = gp - da_clim
df_anom = da_anom.to_pandas()
del df_anom['dayofyear']
# Variable selection
VARIABLE_NAMES = ['t2m', 'u10', 'v10', 'sp', 'ssrd', 'tp']
VARIABLE_LONG_NAMES = {
't2m': '2m Temperature',
'u10': '10m Zonal Wind',
'v10': '10m Meridional Wind',
'sp': 'Surface Pressure',
'ssrd': 'Surf. Solar Radiation Down.',
'tp': 'Total Precipitation'
}
# Plot time series
NDAYS_SERIES = 30
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
(df_anom / df_anom.std())[VARIABLE_NAMES].iloc[:NDAYS_SERIES].plot(ax=ax[0])
ax[0].set_ylabel('Normalized Anomalies')
ax[0].set_xlabel('')
ax[0].get_legend().remove()
# Plot auto-correlation functions
NDAYS_ACF = 21
nt = len(df_anom)
for variable_name in VARIABLE_NAMES:
    s = df_anom[variable_name]
    variable_long_name = VARIABLE_LONG_NAMES[variable_name]
    acf = np.correlate(s, s, 'full')[nt - 1:] / s.var() / (nt - 1)
    ax[1].plot(acf[:NDAYS_ACF], label=variable_long_name)
ax[1].set_xlabel('Lag (days)')
ax[1].set_ylabel('Sample Auto-Correlation Function')
ax[1].set_xlim(0, NDAYS_ACF - 1)
labels = [VARIABLE_LONG_NAMES[variable_name]
for variable_name in VARIABLE_NAMES]
plt.legend(labels=labels, bbox_to_anchor=(0.7, -0.2), ncols=2)
plt.subplots_adjust(wspace=0.3)
_ = ax[1].grid(True)
Partial Summary#
In statistics, we always make assumptions about the probability distribution of the data (stationarity, independence, parametric form, etc.)
The quality of statistics (predictions, error estimates, etc.) depends on the validity of these assumptions
Even the best validation could be wrong about predictions if the new data does not satisfy these assumptions!
To go further#
Introduction of the evaluation metrics in Scikit-learn course;
Generalized CV approximation of leave-one-out CV (Chap. 7.10 in Hastie et al. 2009);
The bootstrap as a way of assessing the accuracy of a parameter estimate or a prediction (a resampling alternative to cross-validation; Chap. 7.11 and 8 in Hastie et al. 2009).
Avoiding validation thanks to Bayesian approach (Chap. 3 in Bishop 2006).
References#
Chap. 2, 3 and 7 in Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning, 2nd ed. Springer, New York.
For a Bayesian perspective, Chap. 3 in Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer-Verlag
Chap. 5 and 7 in Wilks, D.S., 2019. Statistical Methods in the Atmospheric Sciences, 4th ed. Elsevier, Amsterdam.
Credit#
Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.