Tutorial: Regularization, Model Selection and Evaluation#
Tutorial for the class Regularization, Model Selection and Evaluation. In this tutorial, you will:
Apply regularization methods: Ridge and Lasso
Compute and plot validation curves
Compare \(k\)-nearest neighbors to Ridge/Lasso
Scientific objective#
To predict the France average wind capacity factor from the geopotential height at 500hPa over the Euro-Atlantic sector.
Dataset presentation#
Input:
Geopotential height at 500hPa
Domain: North Atlantic
Spatial resolution: \(0.5° \times 0.625°\)
Time resolution: monthly
Period: 1980-2021
Units: m
Source: MERRA-2 reanalysis
Target:
Onshore wind capacity factors
Domain: Metropolitan France
Spatial resolution: regional mean
Time resolution: daily
Period: 2014-2021
Units: dimensionless (ratio of produced to installed power)
Source: RTE
Getting ready#
Reading the wind capacity factor data#
Let us follow the same procedure as in Tutorial: Supervised Learning Problem and Least Squares to import the required modules, read the data and select the domain but for the onshore wind capacity factors instead of the electricity demand.
We also compute monthly averages from the daily data, since we are only interested in predicting the monthly wind capacity factors.
# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Structured dataset analysis module
import xarray as xr
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)
# Set data directory
data_dir = Path('data')
# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(index_col=0, parse_dates=True)
# Define wind capacity factor filepath and label
windcf_filename = 'reseaux_energies_capacityfactor_wind-onshore.csv'
windcf_filepath = Path(data_dir, windcf_filename)
windcf_label = 'Wind capacity factor'
# Read windcf data with pandas
df_windcf_daily = pd.read_csv(windcf_filepath, **kwargs_read_csv)
# Select domain
# REGION_NAME = 'Bretagne'
REGION_NAME = 'National'
if REGION_NAME == 'National':
    df_windcf_daily_reg = df_windcf_daily.mean('columns')
    df_windcf_daily_reg.name = REGION_NAME
else:
    df_windcf_daily_reg = df_windcf_daily[REGION_NAME]
# Resample wind capacity factor from daily to monthly means
df_windcf_reg = df_windcf_daily_reg.resample('MS').mean()
Reading the geopotential height data#
The geopotential height data is in the NetCDF format (structured binary data) and cannot be read as text. We use the xarray module instead to read it.
We also reduce the resolution of the gridded data by a factor of 8 in both horizontal dimensions by averaging over blocks of neighboring grid points.
# Define geopotential height filepath and label
START_DATE = '19800101'
END_DATE = '20220101'
z500_filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)
z500_filepath = Path(data_dir, z500_filename)
z500_label = 'Geopotential height (m)'
# Read geopotential height dataset with xarray
ds = xr.load_dataset(z500_filepath)
# Select geopotential height variable
z500_name = 'height_500'
da_z500_hr = ds[z500_name]
# Downsample geopotential height
N_GRID_AVG = 8
da_z500 = da_z500_hr.coarsen(lat=N_GRID_AVG, boundary='trim').mean().coarsen(
lon=N_GRID_AVG, boundary='trim').mean()
Question
Print the geopotential height DataArray and make sense of what you read.
How many variables constitute the geopotential height field?
# answer cell
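As a hint, a minimal sketch of such an inspection, reusing the da_z500 variable defined above (the exact printout depends on your xarray version):
# Print the full DataArray representation: dimensions, coordinates,
# a preview of the values and the attributes
print(da_z500)
# Dimensions and shape of the field (one value per time, latitude and longitude)
print(da_z500.dims)
print(da_z500.shape)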
Representing the first moments of the geopotential height field#
Question
Compute the mean and the variance of the geopotential height with the mean and var methods.
Plot the mean with the plot method.
Do a filled-contour plot of the variance with the plot.contourf method.
# answer cell
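One way this could look, as a sketch rather than a reference solution, reusing da_z500 and the matplotlib import from above:
# Time mean and variance of the geopotential height at each grid point
da_z500_mean = da_z500.mean('time')
da_z500_var = da_z500.var('time')
# Plot the mean (xarray draws a 2D color plot for a (lat, lon) array)
plt.figure()
da_z500_mean.plot()
# Filled-contour plot of the variance
plt.figure()
da_z500_var.plot.contourf(levels=20)
plt.show()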
A large part of the wind capacity factor signal comes from the seasonal cycle. The latter can be estimated from historical data and does not need to be predicted from the geopotential height. We thus remove it from both the geopotential height and the wind capacity factor.
We also keep only the dates that are common to both datasets so as to have pairs of feature and target data points used to solve the supervised learning problem.
Finally, we save the time index and the number of complete years in the dataset for later.
# Remove seasonal cycle from wind capacity factor
da_windcf_reg = df_windcf_reg.to_xarray()
gp_windcf_cycle = da_windcf_reg.groupby('time.month')
da_windcf_anom = gp_windcf_cycle - gp_windcf_cycle.mean('time')
df_windcf_anom = da_windcf_anom.drop('month').to_dataframe()[REGION_NAME]
# Remove seasonal cycle from geopotential height
gp_z500_cycle = da_z500.groupby('time.month')
da_z500_anom = gp_z500_cycle - gp_z500_cycle.mean('time')
# Convert to a pandas DataFrame with grid points as columns
df_z500_anom = da_z500_anom.stack(latlon=('lat', 'lon')).to_dataframe()[
z500_name].unstack(0).transpose()
# Select common index
idx = df_z500_anom.index.intersection(df_windcf_anom.index)
df_z500_anom = df_z500_anom.loc[idx]
df_windcf_anom = df_windcf_anom.loc[idx]
# Number of years in dataset
time = df_windcf_anom.index
n_years = time.year.max() - time.year.min() + 1
Question
How many years of common data are there?
Plot the time series of the spatially averaged geopotential height and of the national wind capacity factor.
Compute the correlation between the spatial average of the geopotential height and the wind capacity factor.
# answer cell
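A possible sketch, assuming the variables defined above (n_years, df_z500_anom, df_windcf_anom, z500_label, windcf_label) are in scope:
# Number of years covered by the common index
print(n_years, 'years of common data')
# Spatial average of the geopotential height anomaly as a pandas Series
s_z500_mean = df_z500_anom.mean(axis=1)
# Time series of both signals on a shared time axis
fig, (ax0, ax1) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
s_z500_mean.plot(ax=ax0)
ax0.set_ylabel(z500_label)
df_windcf_anom.plot(ax=ax1)
ax1.set_ylabel(windcf_label)
plt.show()
# Correlation between the spatial average and the wind capacity factor
print('Correlation:', s_z500_mean.corr(df_windcf_anom))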
Regularization#
We want to learn to predict the national wind capacity factor from the geopotential height at different grid points using a linear model.
Since the number of input variables is large, we need to take care to avoid overfitting.
To do so, we apply different regularization techniques.
Ridge regression#
Question
Apply the ridge regression using Ridge from sklearn.linear_model to predict the national wind capacity factor from the geopotential height field for an arbitrary regularization parameter value of 10**5.
Plot the histogram estimate of the probability density function of the model coefficients and print the intercept.
How does the distribution of the coefficients change when you increase or decrease the regularization coefficient?
Explain this behavior based on what you have learned in class.
# answer cell
Answer:
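As a hint, a minimal sketch of what the fit could look like; the feature matrix df_z500_anom and target df_windcf_anom come from the cells above, and the histogram settings are arbitrary:
from sklearn.linear_model import Ridge
# Fit a ridge regression with an arbitrary regularization parameter
alpha = 1e5
ridge = Ridge(alpha=alpha)
ridge.fit(df_z500_anom, df_windcf_anom)
# Histogram estimate of the distribution of the model coefficients
plt.figure()
plt.hist(ridge.coef_, bins=30, density=True)
plt.xlabel('Coefficient value')
plt.ylabel('Density')
plt.show()
print('Intercept:', ridge.intercept_)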
Question
Compute the corresponding validation curves. To do so:
Leave aside a year or more of test data (it will be used later to test the model with regularization).
Compute and plot the train and validation error (using cross-validation) for varying values of the regularization parameter.
What is the best value of the regularization parameter according to your estimations?
Is there any overfitting and/or underfitting occurring here?
# answer cell
Answer:
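One possible approach, sketched below with scikit-learn's validation_curve; the number of test years, the alpha grid and the 5-fold cross-validation are illustrative choices, not the tutorial's reference values:
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
# Leave aside the last years as a test set (kept for the later questions)
N_TEST_YEARS = 2
n_test = N_TEST_YEARS * 12
X_train, y_train = df_z500_anom.iloc[:-n_test], df_windcf_anom.iloc[:-n_test]
X_test, y_test = df_z500_anom.iloc[-n_test:], df_windcf_anom.iloc[-n_test:]
# Train and validation scores for a range of regularization parameters
alphas = np.logspace(0, 8, 17)
train_scores, valid_scores = validation_curve(
    Ridge(), X_train, y_train, param_name='alpha', param_range=alphas, cv=5)
# Plot the validation curve (mean score over the folds)
plt.figure()
plt.semilogx(alphas, train_scores.mean(axis=1), label='Train')
plt.semilogx(alphas, valid_scores.mean(axis=1), label='Validation')
plt.xlabel('Regularization parameter')
plt.ylabel('$R^2$ score')
plt.legend()
plt.show()
# Best regularization parameter according to the validation score
best_alpha = alphas[valid_scores.mean(axis=1).argmax()]
print('Best alpha:', best_alpha)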
Question
Compare the distribution of the model coefficients for the best value of the regularization parameter to that of the ordinary least squares coefficients.
# answer cell
Answer:
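A sketch of the comparison, reusing X_train, y_train and best_alpha from the previous sketch; ordinary least squares is obtained here with LinearRegression:
from sklearn.linear_model import LinearRegression, Ridge
# Fit the regularized and the ordinary least squares models on the same data
ridge_best = Ridge(alpha=best_alpha).fit(X_train, y_train)
ols = LinearRegression().fit(X_train, y_train)
# Compare the two coefficient distributions (note the very different spreads)
plt.figure()
plt.hist(ols.coef_, bins=30, density=True, alpha=0.5, label='OLS')
plt.hist(ridge_best.coef_, bins=30, density=True, alpha=0.5, label='Ridge')
plt.xlabel('Coefficient value')
plt.ylabel('Density')
plt.legend()
plt.show()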
Question
Give an estimate of the prediction error conditioned on some train dataset for the best choice of regularization parameter value.
Compare the test value of the score to its validation value.
Evaluate and discuss the sensitivity of the test score to the number of test years retained.
# answer cell
Answer:
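One way to sketch this, again reusing the train/test split and best_alpha from the earlier sketch; the list of test sizes is arbitrary:
from sklearn.linear_model import Ridge
# Score the regularized model on the held-out test years
ridge_best = Ridge(alpha=best_alpha).fit(X_train, y_train)
print('Test R2:', ridge_best.score(X_test, y_test))
# Sensitivity of the test score to the number of test years retained
for n_years_test in (1, 2, 3, 4):
    n = n_years_test * 12
    model = Ridge(alpha=best_alpha).fit(
        df_z500_anom.iloc[:-n], df_windcf_anom.iloc[:-n])
    score = model.score(df_z500_anom.iloc[-n:], df_windcf_anom.iloc[-n:])
    print(n_years_test, 'test year(s): R2 =', round(score, 2))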
Question (optional)
Use nested cross-validation to estimate the expected prediction error for the best choice of regularization parameter value.
How does this value of the test score compare with the one conditioned on a few years of data computed in the previous question? Discuss this result.
# answer cell
Answer:
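A compact sketch of nested cross-validation, using RidgeCV for the inner loop and cross_val_score for the outer loop; the alpha grid and the 5-fold splits are illustrative:
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
# Inner loop: RidgeCV selects the regularization parameter by cross-validation
inner_model = RidgeCV(alphas=np.logspace(0, 8, 17))
# Outer loop: cross-validated estimate of the expected prediction error
outer_scores = cross_val_score(inner_model, df_z500_anom, df_windcf_anom, cv=5)
print('Nested CV R2 scores:', outer_scores)
print('Mean R2:', outer_scores.mean())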
Lasso regression#
Question
Same questions as for the ridge but for the Lasso (using Lasso from sklearn.linear_model).
How do the coefficients evolve compared to the Ridge?
# answer cell
Answer:
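A sketch following the same steps as for the ridge, reusing X_train and y_train from the earlier sketch; the Lasso alpha grid is a guess that may need adjusting to the scale of the data, and max_iter is raised to help convergence:
from sklearn.linear_model import Lasso
from sklearn.model_selection import validation_curve
# Validation curve for the Lasso
alphas_lasso = np.logspace(-5, -1, 17)
train_scores, valid_scores = validation_curve(
    Lasso(max_iter=10000), X_train, y_train,
    param_name='alpha', param_range=alphas_lasso, cv=5)
best_alpha_lasso = alphas_lasso[valid_scores.mean(axis=1).argmax()]
# Coefficients at the best alpha: many are driven exactly to zero
lasso_best = Lasso(alpha=best_alpha_lasso, max_iter=10000).fit(X_train, y_train)
plt.figure()
plt.hist(lasso_best.coef_, bins=30, density=True)
plt.xlabel('Coefficient value')
plt.ylabel('Density')
plt.show()
print('Non-zero coefficients:', int((lasso_best.coef_ != 0).sum()),
      'out of', lasso_best.coef_.size)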
\(K\)-nearest neighbor model#
Question (optional)
Compute and plot the validation curve for the \(k\)-nearest neighbor model using KNeighborsRegressor from sklearn.neighbors for a varying number of neighbors.
For which value of \(k\) is the score best? Represent the best prediction above the scatter plot of the train data.
Is there any overfitting occurring?
How does the best \(k\)-nearest neighbor model perform compared to the linear models analyzed so far?
# answer cell
Answer:
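A sketch of the validation curve for the number of neighbors, reusing the same train set as in the ridge sketch; the range of \(k\) is an arbitrary choice:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import validation_curve
# Train and validation scores for a varying number of neighbors
n_neighbors_range = np.arange(1, 31)
train_scores, valid_scores = validation_curve(
    KNeighborsRegressor(), X_train, y_train,
    param_name='n_neighbors', param_range=n_neighbors_range, cv=5)
# Plot the validation curve and pick the best number of neighbors
plt.figure()
plt.plot(n_neighbors_range, train_scores.mean(axis=1), label='Train')
plt.plot(n_neighbors_range, valid_scores.mean(axis=1), label='Validation')
plt.xlabel('Number of neighbors k')
plt.ylabel('$R^2$ score')
plt.legend()
plt.show()
best_k = n_neighbors_range[valid_scores.mean(axis=1).argmax()]
print('Best k:', best_k)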
Credit#
Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.