Regularization, Model Selection and Evaluation#


Prerequisites
Learning Outcomes
  • Estimate the expected prediction error using cross-validation

  • Use regularization to prevent overfitting

  • Be aware of underlying statistical assumptions (identity, independence)

Cross-Validation#

  • Data is often scarce, so that setting aside test data to estimate the Prediction Error (PE) is a problem

  • Cross-validation is used to estimate the Expected PE (EPE) while avoiding setting aside test data

  • It uses part of the data to train, part of the data to test, repeating the operation on different subset selections

\(K\)-Fold Cross Validation#

  1. Slice the train data into \(K\) roughly equal-sized parts;

  2. Keep the \(k\)th part for validation and train on the \(K - 1\) other parts;

  3. Compute the prediction error \(\mathrm{err}_k\) using the \(k\)th part;

  4. Repeat the operation for all \(K\) parts;

  5. Average the prediction errors to estimate the EPE.

../_images/K-fold_cross_validation_EN.svg
Warning

Cross-validation allows one to estimate expected prediction error rather than the prediction error conditioned on a particular training set (see Chap. 7.12 in Hastie et al. 2009).
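
As a minimal illustration of this procedure, the sketch below uses scikit-learn's `KFold` and `cross_val_score` on synthetic placeholder data (the data and model are illustrative, not those of the course):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (placeholder for the course data)
rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.1 * rng.normal(size=N)

# Steps 1-4: slice into K parts, train on K - 1 parts,
# compute the prediction error on the held-out part, repeat for all parts
K = 5
scores = cross_val_score(LinearRegression(), X, y, cv=KFold(n_splits=K),
                         scoring='neg_mean_squared_error')

# Step 5: average the K prediction errors to estimate the EPE
print('CV estimate of the EPE:', -scores.mean())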

Choosing the number \(K\) of parts#

  • \(K\) small (e.g. \(K = 2\)):
    biases towards large error (CV training sets size \((K - 1) / K \times N\) much smaller than original training set size \(N\));

  • \(K\) large (e.g. \(K = N\), leave-one-out CV):
    unbiased estimate of EPE, but with high variance (\(N\) “training sets” are similar to each other);

  • \(K = 5\) or 10:
    recommended by Hastie et al. (2009) as a good compromise, but depends on case study.

Adapting Cross-Validation to the Data Structure#

For instance:

  • By choosing the number of parts \(K\) to adapt to cycles (e.g. avoid splitting years)

  • By grouping (e.g. if measurements for different people should be kept together)

  • By shuffling before splitting (e.g. if the ordering of the data is special)

  • By taking serial correlations into account in time series.

See https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators
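
For instance, scikit-learn provides iterators for each of these situations; a minimal sketch (the group labels and gap value are illustrative assumptions):

from sklearn.model_selection import GroupKFold, KFold, TimeSeriesSplit

# Shuffle before splitting if the ordering of the data is special
cv_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

# Keep all measurements of the same person (group) in the same part,
# e.g. cross_val_score(model, X, y, cv=cv_grouped, groups=person_id)
cv_grouped = GroupKFold(n_splits=5)

# Respect the time ordering (and optionally leave a gap between
# train and validation parts to limit serial correlation)
cv_time = TimeSeriesSplit(n_splits=5, gap=10)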

Regularizing to Avoid Overfitting: Linear Models#

Linear Models Can Overfit#

  • Linear models are simpler than alternatives

    \(\rightarrow\) they tend to overfit less than alternatives

  • They even often underfit when:

    • \(p\) is small compared to \(N\) and \(\sigma^{-2}\)

    • the problem is not linearly separable

  • But linear models can also overfit, e.g. when \(N\) is small compared to \(p\) or when many features are uninformative or correlated, as illustrated below.

Effect of uninformative features#

Recall from Ordinary Least Squares that, if we assume that the generating model is \(Y = \mathbf{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are uncorrelated, with mean zero and constant variance \(\sigma^2\), then,

\[\begin{split} \begin{aligned} \mathbb{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \boldsymbol{\beta}\\ \mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}. \end{aligned} \end{split}\]

2D example#

Assume that \(Y = X_0 + \epsilon\).

OLS with \(\boldsymbol{X} = (X_0, X_1)\) centered, \(X_1 = c X_0 + \sqrt{1 - c^2} Z_1\) and

(37)#\[\begin{align} \begin{cases} \mathrm{Var}(X_0) = 1, &\mathrm{Cov}(X_0, Y) = 1, \\ \mathrm{Var}(Z_1) = 1, &\mathrm{Cov}(Z_1, Y) = 0, &\mathrm{Cov}(Z_1, X_0) = 0. \end{cases} \end{align}\]
\[\begin{split} \mathrm{Then,~} \mathbb{E}\left(\frac{\mathbf{X}^\top \mathbf{X}}{N - 1}\right) = \begin{pmatrix} 1 & c\\ c & 1 \end{pmatrix} \mathrm{~and~} \mathbb{E}\left(\frac{\mathbf{X}^\top \mathbf{y}}{N - 1}\right) = \begin{pmatrix} 1\\ c \end{pmatrix}, \end{split}\]
\[\begin{split} \mathrm{Thus,~} \mathbb{E}\left(\left(\frac{\mathbf{X}^\top \mathbf{X}}{N - 1}\right)^{-1}\right) = \frac{1}{1 - c^2} \begin{pmatrix} 1 & -c\\ -c & 1 \end{pmatrix} \mathrm{~and~} \mathbb{E}\left(\hat{\boldsymbol{\beta}}\right) = \begin{pmatrix} 1\\ 0 \end{pmatrix}. \end{split}\]

Thus,

\[\begin{split} \begin{aligned} \mathbb{E}(\mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X})) &= \frac{\sigma^2}{N - 1} \cdot \frac{1}{1 - c^2} \begin{pmatrix} 1 & -c\\ -c & 1 \end{pmatrix}. \end{aligned} \end{split}\]

So, a large absolute value of the correlation between inputs implies:

  • a large coefficient variance, i.e. large coefficient errors;

  • a large coefficient covariance (in absolute value), i.e. errors in one coefficient associated with errors in the other.
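
A minimal Monte Carlo check of this 2D example (synthetic data generated with the construction above; sample sizes and correlation values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, n_repeats = 50, 1000

def ols_coefs(c):
    """Draw one training set of the 2D example and return the OLS coefficients."""
    x0 = rng.normal(size=N)
    z1 = rng.normal(size=N)
    x1 = c * x0 + np.sqrt(1 - c**2) * z1
    X = np.column_stack([x0, x1])
    y = x0 + rng.normal(size=N)
    return np.linalg.lstsq(X, y, rcond=None)[0]

for c in (0.0, 0.9, 0.99):
    betas = np.array([ols_coefs(c) for _ in range(n_repeats)])
    print(f'c = {c:4.2f}: Var(beta_0) = {betas[:, 0].var():.3f}, '
          f'Cov(beta_0, beta_1) = {np.cov(betas.T)[0, 1]:.3f}')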

Question (optional)

  • Interpret the role of covariances between inputs in OLS for \(p > 2\) using an eigendecomposition of the covariance matrix.

Looking for a solution to the confusion problem#

Many correlated variables \(\rightarrow\) high variance.

A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.

Idea: weight-decay penalization

Limit the size of the coefficients to reduce the variance of the coefficients and predictions.

Ridge Regression#

Ridge Regression Problem#

  • A very common form of regularization for linear regression;

  • Shrink coefficients by imposing a penalty on their size;

  • Minimize a penalized RSS:

(38)#\[\begin{equation} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \underset{\boldsymbol{\beta}}{\mathrm{argmin}} \left\{\sum_{i = 1}^N \left(y_i - \beta_0 - \sum_{j = 1}^p x_{ij} \beta_j \right)^2 + \lambda \sum_{j = 1}^p \beta_j^2 \right\}. \end{equation}\]

\(\lambda \ge 0\) controls the amount of shrinkage.
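
In scikit-learn, this penalized RSS is minimized by `Ridge`, whose `alpha` parameter plays the role of \(\lambda\); a sketch on placeholder data with two correlated inputs:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Placeholder data with two strongly correlated inputs
rng = np.random.default_rng(0)
N = 50
x0 = rng.normal(size=N)
X = np.column_stack([x0, 0.95 * x0 + 0.3 * rng.normal(size=N)])
y = x0 + rng.normal(size=N)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha plays the role of lambda
print('OLS coefficients:  ', ols.coef_)
print('Ridge coefficients:', ridge.coef_)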

Ridge Regression as a Constrained Optimization Problem#

For any \(\lambda\) there exists a \(t \ge 0\) such that the ridge regression problem is equivalent to:

(39)#\[\begin{equation} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \underset{\boldsymbol{\beta}}{\mathrm{argmin}} \sum_{i = 1}^N \left(y_i - \beta_0 - \sum_{j = 1}^p x_{ij} \beta_j \right)^2\\ \mathrm{subject~to~} \sum_{j = 1}^p \beta_j^2 \le t. \end{equation}\]
Warning 1

Contrary to OLS, and as for many regularization methods, Ridge solutions are not equivariant under scaling of the inputs:

  • If different inputs represent different kinds of variables, one normally standardizes the inputs before solving.

  • If different inputs represent measurements of the same variable in different situations (location, epoch…), it may be interesting to keep the scales.

Warning 2
  • Penalization of the intercept would make the procedure depend on the origin chosen for \(Y\).

  • After centering the inputs by replacing each \(x_{ij}\) by \(x_{ij} - \bar{x}_j\), OLS gives \(\hat{\beta}_0 = \bar{y}\).

  • We can thus first solve the ridge for centered inputs and outputs and then add \(\hat{\beta}_0 = \bar{y}\) to the predictions.
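
In practice, both warnings can be handled with a pipeline that standardizes the inputs and lets `Ridge` fit an unpenalized intercept; a minimal sketch (whether to standardize depends on the case, as noted above; `X_train`, `y_train` are placeholders):

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the inputs, then fit a ridge regression; by default
# Ridge(fit_intercept=True) estimates beta_0 without penalizing it
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# model.fit(X_train, y_train); model.predict(X_test)  # placeholders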

Effect of Ridge Penalty on Coefficients#

Question (optional)

  • Show that the solution to the ridge-penalized RSS is

(40)#\[\begin{equation} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\right)^{-1} \left(\mathbf{X}^\top \mathbf{y}\right), \end{equation}\]

where \(\mathbf{I}\) is the \(p\times p\) identity matrix.
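
Equation (40) can be checked numerically against scikit-learn's `Ridge` without intercept, which minimizes the same penalized RSS; a sketch on placeholder data:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, p, lam = 50, 3, 5.0
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=N)

# Closed form of eq. (40): (X'X + lambda I)^{-1} X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
# scikit-learn's Ridge minimizes ||y - X beta||^2 + alpha ||beta||^2
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))  # expect True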

Question

  • How does this formula differ from the OLS solution for the coefficients?

  • When are these solutions unique?

Question (optional)

  • Show that the ridge estimates for orthogonal inputs with variances \(\sigma_j^2\) are given by:

\[\hat{\beta}^\mathrm{ridge}_j = \hat{\beta}^\mathrm{OLS}_j \frac{\sigma_j^2}{\sigma_j^2 + \frac{\lambda}{N - 1}}.\]

Question

  • Use these formulas to interpret the effect of the Ridge regularization on the coefficients for orthonormal inputs.

Warning 3

The regularization coefficient \(\lambda\) scales with \(N\) as can be seen from the formulas for \(\hat{\boldsymbol{\beta}}^\mathrm{Ridge}\) and from the fact that the weight penalty is added to the residual sum of squares and not to the mean squared error.

This means that, all other things remaining the same, \(\lambda\) must be increased when increasing the number of points in order to obtain the same effect from the regularization.
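
A quick numerical check of this scaling, using the closed form of eq. (40) on placeholder data: duplicating every data point doubles \(N\), and the original solution is recovered only if \(\lambda\) is doubled as well.

import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 30, 2, 5.0
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.0]) + rng.normal(size=N)

def ridge_coefs(X, y, lam):
    """Closed-form ridge solution (X'X + lambda I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Duplicate every point: N doubles, X'X and X'y double
X2, y2 = np.vstack([X, X]), np.concatenate([y, y])
print(ridge_coefs(X, y, lam))        # original fit
print(ridge_coefs(X2, y2, lam))      # same lambda: weaker regularization
print(ridge_coefs(X2, y2, 2 * lam))  # doubled lambda: same fit as the original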

Ridge Penalty Effect on Variance#

Assume again that the generating model is \(Y = \mathbf{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are uncorrelated, with mean zero and constant variance \(\sigma^2\).

Question (optional)

  • Show that

\[\mathrm{Var}(\hat{\boldsymbol{\beta}}^\mathrm{ridge} | \mathbf{X}) = \sigma^2 \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\right)^{-1} \mathbf{X}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\right)^{-1}\]

and compare it to the OLS variance \(\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\).

Singular Value Decomposition Interpretation (optional)#

Let the SVD of \(\mathbf{X}\) be \(\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^\top\) with:

  • \(\mathbf{U}\) an \(N \times p\) orthogonal matrix with columns spanning the column space of \(\mathbf{X}\);

  • \(\mathbf{V}\) a \(p \times p\) orthogonal matrix with columns spanning the row space of \(\mathbf{X}\);

  • \(\mathbf{D}\) a \(p \times p\) diagonal matrix with diagonal entries \(d_1 \ge d_2 \ge \cdots \ge d_p \ge 0\) called the singular values of \(\mathbf{X}\).

Question (optional)

  • Show that the OLS estimates are such that \(\mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{U} \mathbf{U}^\top \mathbf{y} = \sum_{j = 1}^p \mathbf{u}_j \mathbf{u}_j^\top \mathbf{y}\).

  • Show that the ridge estimates are such that \(\mathbf{X} \hat{\boldsymbol{\beta}}^\mathrm{ridge} = \sum_{j = 1}^p \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^\top \mathbf{y}\).

Question

  • Interpret how the ridge shrinks the coefficients \(\hat{\boldsymbol{\beta}}\) with respect to the orthonormal basis \(\mathbf{V}\).

  • What is the role of \(\mathrm{df}(\lambda) := \mathrm{tr}\left[\mathbf{X}(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top\right] = \sum_{j = 1}^p \frac{d_j^2}{d_j^2 + \lambda}\)?
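
The two expressions of \(\mathrm{df}(\lambda)\) can be checked numerically on placeholder data (a sketch using the thin SVD from `numpy.linalg.svd`):

import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 50, 4, 10.0
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)  # centered inputs

# Thin SVD: X = U D V', with d the vector of singular values
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Effective degrees of freedom: trace expression vs singular-value expression
df_trace = np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T))
df_svd = np.sum(d**2 / (d**2 + lam))
print(df_trace, df_svd)  # the two expressions agree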

Regularization on a Simple Example#

../_images/linreg_noreg_0_nogrey.svg
  • Small training set

  • Fit a linear model without regularization

Regularization on a Simple Example#

../_images/linreg_noreg_0.svg
  • Small training set

  • Fit a linear model without regularization

  • Training points sampled at random

  • Can overfit if the data is noisy!

Regularization on a Simple Example: Random Training Sets#

../_images/linreg_noreg_0.svg

../_images/linreg_noreg_1.svg

../_images/linreg_noreg_2.svg

../_images/linreg_noreg_3.svg

../_images/linreg_noreg_4.svg

../_images/linreg_noreg_5.svg

Bias-Variance Tradeoff in Ridge Regression#

../_images/linreg_noreg_0.svg

Linear regression (OLS)
High variance, no bias

../_images/ridge_0_withreg.svg

Ridge regression
Low variance, but biased!

Bias-Variance Tradeoff in Ridge Regression#

../_images/ridge_alpha_0.svg

../_images/ridge_alpha_50.svg

../_images/ridge_alpha_500.svg

Too much variance

Best tradeoff

Too much bias

Partial Summary#

  • Can overfit when:

    • \(N\) is too small compared to \(p\) and \(\sigma^2\)

    • In particular with non-informative features

  • Regularization for regression:

    • From linear regression to ridge regression \(\rightarrow\) less overfit

    • large regularization parameter \(\rightarrow\) strong regularization \(\rightarrow\) smaller coefficients \(\rightarrow\) smaller variance

Tuning Hyperparameters (model selection)#

  • Some hyperparameters need to be estimated in addition to coefficients (regularization parameter, nonlinear coefficients in linear models, etc.)

  • Validation data or Cross-Validation (CV) may be used to select the best hyperparameters by minimizing the estimated validation error (grid search, random search, etc.)

../_images/grid_search_workflow.png
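
For instance, the regularization parameter of a ridge regression can be tuned by grid search over a cross-validated score; a minimal sketch on placeholder data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Placeholder data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] + rng.normal(size=100)

# 5-fold CV over a grid of regularization parameters
search = GridSearchCV(Ridge(), param_grid={'alpha': np.logspace(-3, 3, 13)},
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_, -search.best_score_)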

Testing after validation#

The best validation error is achieved by hyperparameters that were selected using the validation data: it is therefore not an estimate of the EPE for this choice of hyperparameters!

\(\rightarrow\) Distinguish between train, validation and test data

Training and validation can be combined using CV, but test data should be kept to evaluate the PE conditioned on the CV set for the final choice of coefficients and hyperparameters.

To maximize the use of the data and estimate the EPE, nested CV can be used.
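
A minimal sketch of nested CV on placeholder data: the inner loop (`GridSearchCV`) selects the hyperparameters, while the outer loop (`cross_val_score`) estimates the EPE of the whole tuning procedure.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Placeholder data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] + rng.normal(size=100)

# Inner CV: hyperparameter selection
inner = GridSearchCV(Ridge(), param_grid={'alpha': np.logspace(-3, 3, 13)},
                     cv=5)
# Outer CV: estimate of the EPE of the whole tuning procedure
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring='neg_mean_squared_error')
print(-outer_scores.mean())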

Law of Large Numbers?#

Estimates attempt to minimize a function of the training error \(\overline{\mathrm{err}}\).

For the estimates to converge as the sample size grows, the training error \(\overline{\mathrm{err}}\) must converge as well.

\(\rightarrow\) We need some Law of Large Numbers to be applicable.

Basic assumptions: independent and identically distributed.

What could go wrong?#

So far, we have assumed that the joint distribution \(f_{\boldsymbol{X}, Y}\) is independent of time.

In particular, we have assumed that the joint process is statistically stationary.

Yet, in the natural and engineering sciences, variables, as well as their distributions, often depend on time (cycles, trends).

Variations in time can rarely be considered purely random:

\(\rightarrow\) some dependence persists between realizations

Yet, we are fine if we can show that:

  • there is a stationary distribution

  • realizations sufficiently distant in time no longer correlate

Violation of Statistical Stationarity#

Surface air temperature variability can be decomposed into:

  • (pseudo-)periodic cycles (diurnal, annual, Milankovitch)

  • a continuous spectrum of frequencies due to chaotic dynamics

  • an increasing trend due to global warming

  • other non-equilibrium variations (effect of volcanoes, solar activity, …)

Violation of Independence#

Climate perspective: Weather conditions tend to persist for a while so that observations that are close in time are more likely to be similar than observations distant in time.

Dynamical system perspective: The evolution of the state of the atmosphere can be described by a system of differential equations whose solutions are relatively smooth (not completely erratic).

Probabilistic perspective: This memory corresponds to a (serial) dependence between an observation at one time step and an observation at a previous time step. The auto-correlation for a given lag is a measure of linear dependence.

Example: Serial dependence in the Paris climate (from ERA5).

# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)

# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(header=0, index_col=0, parse_dates=True)

# Set first and last years
FIRST_YEAR = 2000
LAST_YEAR = 2009

# Define temperature filepath
filename = 'era5_paris_sf_{}_{}.csv'.format(
    FIRST_YEAR, LAST_YEAR)
filepath = Path(data_dir, filename)

# Read ERA5 time series for Paris
df = pd.read_csv(filepath, **kwargs_read_csv).resample('D').mean()
df.index = pd.DatetimeIndex(df.index.date, name='time')
da = df.to_xarray()

# Get anomalies
gp = da.groupby('time.dayofyear')
da_clim = gp.mean('time')
da_anom = gp - da_clim
df_anom = da_anom.to_pandas()
del df_anom['dayofyear']

# Variable selection
VARIABLE_NAMES = ['t2m', 'u10', 'v10', 'sp', 'ssrd', 'tp']
VARIABLE_LONG_NAMES = {
    't2m': '2m Temperature',
    'u10': '10m Zonal Wind',
    'v10': '10m Meridional Wind',
    'sp': 'Surface Pressure',
    'ssrd': 'Surf. Solar Radiation Down.',
    'tp': 'Total Precipitation'
}

# Plot time series
NDAYS_SERIES = 30


fig, ax = plt.subplots(1, 2, figsize=(12, 4))
(df_anom / df_anom.std())[VARIABLE_NAMES].iloc[:NDAYS_SERIES].plot(ax=ax[0])
ax[0].set_ylabel('Normalized Anomalies')
ax[0].set_xlabel('')
ax[0].get_legend().remove()

# Plot auto-correlation functions
NDAYS_ACF = 21

nt = len(df_anom)
for variable_name in VARIABLE_NAMES:
    s = df_anom[variable_name]
    variable_long_name = VARIABLE_LONG_NAMES[variable_name]
    acf = np.correlate(s, s, 'full')[nt - 1:] / s.var() / (nt - 1)  
    ax[1].plot(acf[:NDAYS_ACF], label=variable_long_name)
ax[1].set_xlabel('Lag (days)')
ax[1].set_ylabel('Sample Auto-Correlation Function')
ax[1].set_xlim(0, NDAYS_ACF - 1)
labels = [VARIABLE_LONG_NAMES[variable_name]
          for variable_name in VARIABLE_NAMES]
plt.legend(labels=labels, bbox_to_anchor=(0.7, -0.2), ncols=2)
plt.subplots_adjust(wspace=0.3)
_ = ax[1].grid(True)
../_images/44c1b801471a0f2c5825f23f6f8d0eb7ba4e8a7fec773edc67f24823398ad9e4.png

Partial Summary#

  • In statistics, we always make assumptions about the probability distribution of the data (stationarity, independence, parametric form, etc.)

  • The quality of statistics (predictions, error estimates, etc.) depends on the validity of these assumptions

  • Even the best validation could be wrong about predictions if the new data does not satisfy these assumptions!

To go further#

References#


Credit#

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.

