Tutorial: Regularization, Model Selection and Evaluation

Tutorial: Regularization, Model Selection and Evaluation#

Tutorial to the class Regularization, Model Selection and Evaluation.

Tutorial Objectives

Apply regularization methods: Ridge and Lasso
Compute and plot validation curves
Compare \(k\)-nearest neighbors to Ridge/Lasso

Scientific objective#

To predict the France average wind capacity factor from the geopotential height at 500hPa over the Euro-Atlantic sector.

Dataset presentation#

Input:
- Geopotential height at 500hPa
  - Domain: North Atlantic
  - Spatial resolution: \(0.5° \times 0.625°\)
  - Time resolution: monthly
  - Period: 1980-2021
  - Units: m
  - Source: MERRA-2 reanalysis
Target:
- Onshore wind capacity factors
  - Domain: Metropolitan France
  - Spatial resolution: regional mean
  - Time resolution: daily
  - Period: 2014-2021
  - Units:
  - Source: RTE

Getting ready#

Reading the wind capacity factor data#

Let us follow the same procedure as in Tutorial: Supervised Learning Problem and Least Squares to import the required modules, read the data and select the domain but for the onshore wind capacity factors instead of the electricity demand.
We also compute monthly averages from the daily data since the we are only interested in predicting the monthly wind capacity factors.

# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Structured dataset analysis module
import xarray as xr
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)

# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(index_col=0, parse_dates=True) 

# Define electricity demand filepath and label
windcf_filename = 'reseaux_energies_capacityfactor_wind-onshore.csv'
windcf_filepath = Path(data_dir, windcf_filename)
windcf_label = 'Wind capacity factor'

# Read windcf data with pandas                                                                                                                                                
df_windcf_daily = pd.read_csv(windcf_filepath, **kwargs_read_csv)

# Select domain
# REGION_NAME = 'Bretagne'
REGION_NAME = 'National'
if REGION_NAME == 'National':
    df_windcf_daily_reg = df_windcf_daily.mean('columns')
    df_windcf_daily_reg.name = REGION_NAME
else:
    df_windcf_daily_reg = df_windcf_daily[REGION_NAME]

# Resample wind capacity factor from daily to monthly means
df_windcf_reg = df_windcf_daily_reg.resample('MS').mean()

Reading the geopotential height data#

The geopotential height data is in the NetCDF format (structured binary data) and cannot be read as text. We use the xarray module instead to read it.
We also divide the resolution of the gridded data by a factor 4 in both horizontal dimensions.

# Define temperature filepath and label
START_DATE = '19800101'
END_DATE = '20220101'
z500_filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)
z500_filepath = Path(data_dir, z500_filename)
z500_label = 'Geopotential height (m)'

# Read geopotential height dataset with xarray                                                                                                                                
ds = xr.load_dataset(z500_filepath)

# Select geopotential height variable                                                                                                                                         
z500_name = 'height_500'                                                                                                                                                     
da_z500_hr = ds[z500_name]

# Downsample geopotential height
N_GRID_AVG = 8
da_z500 = da_z500_hr.coarsen(lat=N_GRID_AVG, boundary='trim').mean().coarsen(                                                                                                 
    lon=N_GRID_AVG, boundary='trim').mean()

Question

Print the geopotential height DataArray and make sense of what you read.

How many variables constitute the geopotential height field?

# answer cell

Representing the first moments of the geopotential height field#

Question

Compute the mean and the variance of the geopotential height with the mean and var methods.

Plot the mean with the plot method.

Do a filled-contour plot of the variance with the plot.contourf method.

# answer cell

A strong part of the wind capacity factor signals is the seasonal cycle. The latter can be estimated from historical data and does not need to be predicted from the geopotential height. We thus remove it from both the geopotential height and the wind capacity factor.
We also keep only the dates that are common to both datasets so as to have pairs of feature and target data points used to solve the supervised learning problem.
Finally, we save the time index and the number of complete years in the dataset for later.

# Remove seasonal cycle from wind capacity factor
da_windcf_reg = df_windcf_reg.to_xarray()                                                                                                                                 
gp_windcf_cycle = da_windcf_reg.groupby('time.month')                                                                                                                     
da_windcf_anom = gp_windcf_cycle - gp_windcf_cycle.mean('time')                                                                                                           
df_windcf_anom = da_windcf_anom.drop('month').to_dataframe()[REGION_NAME]

# Remove seasonal cycle from geopotential height or not
gp_z500_cycle = da_z500.groupby('time.month')                                                                                                                             
da_z500_anom = gp_z500_cycle - gp_z500_cycle.mean('time')

# Convert to bandas with grid points as columns                                                                                                                               
df_z500_anom = da_z500_anom.stack(latlon=('lat', 'lon')).to_dataframe()[
    z500_name].unstack(0).transpose()
    
# Select common index                                                                                                                                                         
idx = df_z500_anom.index.intersection(df_windcf_anom.index)                                                                                                                   
df_z500_anom = df_z500_anom.loc[idx]                                                                                                                                          
df_windcf_anom = df_windcf_anom.loc[idx]                                                                                                                                      
                                                                                                                                                                              
# Number of years in dataset                                                                                                                                                  
time = df_windcf_anom.index                                                                                                                                                   
n_years = time.year.max() - time.year.min() + 1

Question

How many years of common data is there?

Plot the time series of the spatially averaged geopotential height and of the national wind capacity factor.

Compute the correlation between the spatial average of the geopotential height and the wind capacity factor.

# answer cell

Regularization#

We want to learn to predict the national wind capacity factor from the geopotential height at different grid points using a linear model.

Since the number of input variables, we need to take care of avoiding overfitting.

To do so, we apply different regularization techniques.

Ridge regression#

Question

Apply the ridge regression using Ridge from sklearn.linear_model to predict the national wind capacity factor from the geopotential height field for an arbitrary regularization parameter value of 10**5.

Plot the histogram estimate of the probability density function of the model coefficients and print the intercept.

How does the distribution of the coefficients change when you increase or decrease the regularization coefficient?

Explain this behavior based on what you have learned in class.

# answer cell

Answer:

Question

Compute the corresponding validation curves. To do so:

Leave aside a year or more of test data (it will be used later to test the model with regularization).

Compute and plot the train and validation error (using cross-validation) for varying values of the regularization parameter.

What is the best value of the regularization parameter according to your estimations?

Is there any overfitting and/or underfitting occurring here?

# answer cell

Answer:

Question

Compare the distribution of the model coefficients for the best value of the regularization parameter to that of the ordinary least squares coefficients.

# answer cell

Answer:

Question

Give an estimate the prediction error conditioned on some train dataset for the best choice or regularization parameter value.

Compare the test value of the score to its validation value.

Evaluate and discuss the sensitivity of the test score to the number of test years retained.

# answer cell

Answer:

Question (optional)

Use nested cross-validation to estimated the expected prediction error for the best choice of regularization parameter value.

How does the this value of the test score compare with the one conditionned on a few years of data computed in the previous section? Discuss this result.

# answer cell

Answer:

Lasso regression#

Question

Same questions as for the ridge but for the Lasso (using Lasso from sklearn.linear_model).

How do the coefficients evolve compared to the Ridge?

# answer cell

Answer:

\(K\)-nearest neighbor model#

Question (optional)

Compute and plot the validation curve for the \(k\)-nearest neighbor model using KNeighborsRegressor from sklearn.neighbors for varying number of neighbors.

For which value of \(k\) is the score best? Represent the best prediction above the scatter plot of the train data.

Is there any overfitting occurring?

How does the best \(k\)-nearest neighbor model performs compared to the linear models analyzed so far?