# Tutorial: Regularization, Model Selection and Evaluation#

Tutorial to the class Regularization, Model Selection and Evaluation.

**Tutorial Objectives**

Apply regularization methods: Ridge and Lasso

Compute and plot validation curves

Compare \(k\)-nearest neighbors to Ridge/Lasso

## Scientific objective#

To predict the France average wind capacity factor from the geopotential height at 500hPa over the Euro-Atlantic sector.

## Dataset presentation#

Input:

Geopotential height at 500hPa

Domain: North Atlantic

Spatial resolution: \(0.5° \times 0.625°\)

Time resolution: monthly

Period: 1980-2021

Units: m

Source: MERRA-2 reanalysis

Target:

Onshore wind capacity factors

Domain: Metropolitan France

Spatial resolution: regional mean

Time resolution: daily

Period: 2014-2021

Units:

Source: RTE

## Getting ready#

### Reading the wind capacity factor data#

Let us follow the same procedure as in Tutorial: Supervised Learning Problem and Least Squares to import the required modules, read the data and select the domain but for the onshore wind capacity factors instead of the electricity demand.

We also compute monthly averages from the daily data since the we are only interested in predicting the monthly wind capacity factors.

```
# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Structured dataset analysis module
import xarray as xr
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)
# Set data directory
data_dir = Path('data')
# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(index_col=0, parse_dates=True)
# Define electricity demand filepath and label
windcf_filename = 'reseaux_energies_capacityfactor_wind-onshore.csv'
windcf_filepath = Path(data_dir, windcf_filename)
windcf_label = 'Wind capacity factor'
# Read windcf data with pandas
df_windcf_daily = pd.read_csv(windcf_filepath, **kwargs_read_csv)
# Select domain
# REGION_NAME = 'Bretagne'
REGION_NAME = 'National'
if REGION_NAME == 'National':
df_windcf_daily_reg = df_windcf_daily.mean('columns')
df_windcf_daily_reg.name = REGION_NAME
else:
df_windcf_daily_reg = df_windcf_daily[REGION_NAME]
# Resample wind capacity factor from daily to monthly means
df_windcf_reg = df_windcf_daily_reg.resample('MS').mean()
```

### Reading the geopotential height data#

The geopotential height data is in the NetCDF format (structured binary data) and cannot be read as text. We use the

`xarray`

module instead to read it.We also divide the resolution of the gridded data by a factor 4 in both horizontal dimensions.

```
# Define temperature filepath and label
START_DATE = '19800101'
END_DATE = '20220101'
z500_filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)
z500_filepath = Path(data_dir, z500_filename)
z500_label = 'Geopotential height (m)'
# Read geopotential height dataset with xarray
ds = xr.load_dataset(z500_filepath)
# Select geopotential height variable
z500_name = 'height_500'
da_z500_hr = ds[z500_name]
# Downsample geopotential height
N_GRID_AVG = 8
da_z500 = da_z500_hr.coarsen(lat=N_GRID_AVG, boundary='trim').mean().coarsen(
lon=N_GRID_AVG, boundary='trim').mean()
```

Question

Print the geopotential height

`DataArray`

and make sense of what you read.How many variables constitute the geopotential height field?

```
# answer cell
```

### Representing the first moments of the geopotential height field#

Question

Compute the mean and the variance of the geopotential height with the

`mean`

and`var`

methods.Plot the mean with the

`plot`

method.Do a filled-contour plot of the variance with the

`plot.contourf`

method.

```
# answer cell
```

A strong part of the wind capacity factor signals is the seasonal cycle. The latter can be estimated from historical data and does not need to be predicted from the geopotential height. We thus remove it from both the geopotential height and the wind capacity factor.

We also keep only the dates that are common to both datasets so as to have pairs of feature and target data points used to solve the supervised learning problem.

Finally, we save the time index and the number of complete years in the dataset for later.

```
# Remove seasonal cycle from wind capacity factor
da_windcf_reg = df_windcf_reg.to_xarray()
gp_windcf_cycle = da_windcf_reg.groupby('time.month')
da_windcf_anom = gp_windcf_cycle - gp_windcf_cycle.mean('time')
df_windcf_anom = da_windcf_anom.drop('month').to_dataframe()[REGION_NAME]
# Remove seasonal cycle from geopotential height or not
gp_z500_cycle = da_z500.groupby('time.month')
da_z500_anom = gp_z500_cycle - gp_z500_cycle.mean('time')
# Convert to bandas with grid points as columns
df_z500_anom = da_z500_anom.stack(latlon=('lat', 'lon')).to_dataframe()[
z500_name].unstack(0).transpose()
# Select common index
idx = df_z500_anom.index.intersection(df_windcf_anom.index)
df_z500_anom = df_z500_anom.loc[idx]
df_windcf_anom = df_windcf_anom.loc[idx]
# Number of years in dataset
time = df_windcf_anom.index
n_years = time.year.max() - time.year.min() + 1
```

Question

How many years of common data is there?

Plot the time series of the spatially averaged geopotential height and of the national wind capacity factor.

Compute the correlation between the spatial average of the geopotential height and the wind capacity factor.

```
# answer cell
```

## Regularization#

We want to learn to predict the national wind capacity factor from the geopotential height at different grid points using a linear model.

Since the number of input variables, we need to take care of avoiding overfitting.

To do so, we apply different regularization techniques.

### Ridge regression#

Question

Apply the ridge regression using

`Ridge`

from`sklearn.linear_model`

to predict the national wind capacity factor from the geopotential height field for an arbitrary regularization parameter value of`10**5`

.Plot the histogram estimate of the probability density function of the model coefficients and print the intercept.

How does the distribution of the coefficients change when you increase or decrease the regularization coefficient?

Explain this behavior based on what you have learned in class.

```
# answer cell
```

Answer:

Question

Compute the corresponding validation curves. To do so:

Leave aside a year or more of test data (it will be used later to test the model with regularization).

Compute and plot the train and validation error (using cross-validation) for varying values of the regularization parameter.

What is the best value of the regularization parameter according to your estimations?

Is there any overfitting and/or underfitting occurring here?

```
# answer cell
```

Answer:

Question

Compare the distribution of the model coefficients for the best value of the regularization parameter to that of the ordinary least squares coefficients.

```
# answer cell
```

Answer:

Question

Give an estimate the prediction error conditioned on some train dataset for the best choice or regularization parameter value.

Compare the test value of the score to its validation value.

Evaluate and discuss the sensitivity of the test score to the number of test years retained.

```
# answer cell
```

Answer:

Question (optional)

Use nested cross-validation to estimated the expected prediction error for the best choice of regularization parameter value.

How does the this value of the test score compare with the one conditionned on a few years of data computed in the previous section? Discuss this result.

```
# answer cell
```

Answer:

### Lasso regression#

Question

Same questions as for the ridge but for the Lasso (using

`Lasso`

from`sklearn.linear_model`

).How do the coefficients evolve compared to the Ridge?

```
# answer cell
```

Answer:

## \(K\)-nearest neighbor model#

Question (optional)

Compute and plot the validation curve for the \(k\)-nearest neighbor model using

`KNeighborsRegressor`

from`sklearn.neighbors`

for varying number of neighbors.For which value of \(k\) is the score best? Represent the best prediction above the scatter plot of the train data.

Is there any overfitting occurring?

How does the best \(k\)-nearest neighbor model performs compared to the linear models analyzed so far?

```
# answer cell
```

Answer:

## Credit#

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.