Tutorial: Ensemble Methods for Wind Capacity Factor Prediction

Tutorial: Ensemble Methods for Wind Capacity Factor Prediction#

Tutorial to the class Ensemble Methods based on the same case study as in Tutorial: Regularization, Model Selection and Evaluation and Tutorial: Introduction to Unsupervised Learning with a Focus on PCA.

Tutorial Objectives

Use ensemble methods to best predict the France-average wind capacity factor from a large number of input variables;
Control the complexity parameter(s) of each ensemble method to avoid overfitting;
Compare the skills of different methods.

Scientific objective#

To predict the France average wind capacity factor from the geopotential height at 500hPa over the Euro-Atlantic sector.

Dataset presentation#

Input:
- Geopotential height at 500hPa
  - Domain: North Atlantic
  - Spatial resolution: \(0.5° \times 0.625°\)
  - Time resolution: monthly
  - Period: 1980-2021
  - Units: m
  - Source: MERRA-2 reanalysis
Target:
- Onshore wind capacity factors
  - Domain: Metropolitan France
  - Spatial resolution: regional mean
  - Time resolution: daily
  - Period: 2014-2021
  - Units:
  - Source: RTE

Getting ready#

Reading the wind capacity factor and geopotential height data#

We follow the same procedure as in # Tutorial: Regularization, Model Selection and Evaluation.

# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Structured dataset analysis module
import xarray as xr
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)

# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(index_col=0, parse_dates=True) 

# Define electricity demand filepath and label
windcf_filename = 'reseaux_energies_capacityfactor_wind-onshore.csv'
windcf_filepath = Path(data_dir, windcf_filename)
windcf_label = 'Wind capacity factor'

# Read windcf data with pandas                                                                                                                                                
df_windcf_daily = pd.read_csv(windcf_filepath, **kwargs_read_csv)

# Select domain
# REGION_NAME = 'Bretagne'
REGION_NAME = 'National'
if REGION_NAME == 'National':
    df_windcf_daily_reg = df_windcf_daily.mean('columns')
    df_windcf_daily_reg.name = REGION_NAME
else:
    df_windcf_daily_reg = df_windcf_daily[REGION_NAME]

# Resample wind capacity factor from daily to monthly means
df_windcf_reg = df_windcf_daily_reg.resample('MS').mean()

# Define temperature filepath and label
START_DATE = '19800101'
END_DATE = '20220101'
z500_filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)
z500_filepath = Path(data_dir, z500_filename)
z500_label = 'Geopotential height (m)'

# Read geopotential height dataset with xarray                                                                                                                                
ds = xr.load_dataset(z500_filepath)

# Select geopotential height variable                                                                                                                                         
z500_name = 'height_500'                                                                                                                                                     
da_z500_hr = ds[z500_name]

# Downsample geopotential height
N_GRID_AVG = 8
da_z500 = da_z500_hr.coarsen(lat=N_GRID_AVG, boundary='trim').mean().coarsen(                                                                                                 
    lon=N_GRID_AVG, boundary='trim').mean()

# Remove seasonal cycle from wind capacity factor
da_windcf_reg = df_windcf_reg.to_xarray()                                                                                                                                 
gp_windcf_cycle = da_windcf_reg.groupby('time.month')                                                                                                                     
da_windcf_anom = gp_windcf_cycle - gp_windcf_cycle.mean('time')                                                                                                           
df_windcf_anom = da_windcf_anom.drop('month').to_dataframe()[REGION_NAME]

# Remove seasonal cycle from geopotential height or not
gp_z500_cycle = da_z500.groupby('time.month')                                                                                                                             
da_z500_anom = gp_z500_cycle - gp_z500_cycle.mean('time')

# Convert to bandas with grid points as columns                                                                                                                               
df_z500_anom = da_z500_anom.stack(latlon=('lat', 'lon')).to_dataframe()[
    z500_name].unstack(0).transpose()
    
# Select common index                                                                                                                                                         
idx = df_z500_anom.index.intersection(df_windcf_anom.index)                                                                                                                   
df_z500_anom = df_z500_anom.loc[idx]                                                                                                                                          
df_windcf_anom = df_windcf_anom.loc[idx]                                                                                                                                      
                                                                                                                                                                              
# Number of years in dataset                                                                                                                                                  
time = df_windcf_anom.index                                                                                                                                                   
n_years = time.year.max() - time.year.min() + 1

Ensemble methods#

Regressions evaluation function#

We also define a function to evaluate any of our regressions.

FIRST_TEST_YEAR controls the number of years at the end of the time series to keep as test data.

from sklearn import model_selection

# Default number of test days
FIRST_TEST_YEAR = 2020

POLY_DEGREE = 1

# Get input matrix and output vector for whole time series
X = df_z500_anom.values
y = df_windcf_anom.values

def evaluate_regressor(
    reg, reg_kwargs, param_name, param_range, first_test_year=FIRST_TEST_YEAR,
    n_splits=None, cv_iterator=model_selection.KFold,
    plot_validation=True):
    
    if n_splits is None:
        n_splits = len(np.unique(time[time.year < first_test_year].year))
    
    # Get test data keeping last years
    index_test = time.year >= first_test_year
    X_test = X[index_test]
    y_test = y[index_test]
    
    # Select train data from first years and first days in month 
    index_cv = time.year < first_test_year
    X_cv = X[index_cv]
    y_cv = y[index_cv]
    
    if plot_validation:
        # Set cross-validation iterator
        cv = cv_iterator(n_splits=n_splits)
        groups = time[index_cv].year

        # Get train and validation scores from cross-validation
        train_scores, validation_scores = model_selection.validation_curve(
            reg, X_cv, y_cv, param_name=param_name,
            param_range=param_range, cv=cv, groups=groups)

        # Get train curve
        train_scores_mean = train_scores.mean(1)
        train_scores_max = train_scores.max(1)
        train_scores_min = train_scores.min(1)

        # Get validation curve
        validation_scores_mean = validation_scores.mean(1)
        validation_scores_max = validation_scores.max(1)
        validation_scores_min = validation_scores.min(1)

        # Get best value of the regularization parameter
        i_best = np.argmax(validation_scores_mean)
        param_best = param_range[i_best]
        score_best = validation_scores_mean[i_best]

    
        # Plot validation curve
        lw = 2
        plt.figure()
        plt.semilogx(param_range, train_scores_mean, label="Training score",
                     color="darkorange", lw=lw)
        plt.fill_between(param_range, train_scores_min, train_scores_max,
                         alpha=0.2, color="darkorange", lw=lw)
        plt.semilogx(param_range, validation_scores_mean,
                     label="Cross-validation score", color="navy", lw=lw)
        plt.fill_between(
            param_range, validation_scores_min, validation_scores_max,
            alpha=0.2, color="navy", lw=lw)
        plt.xlabel(param_name)
        plt.ylabel(r'Validation $R^2$')
        plt.xlim(param_range[[0, -1]])
        plt.title(REGION_NAME + r'. Best $R^2$: {:.2} for {} = {:.1e}'.format(
            score_best, param_name, param_best))
        # plt.ylim(0.0, 1.1)
        plt.legend(loc="best")
        
        # Set best parameter
        reg.set_params(**{param_name: param_best})
    else:
        param_best = reg.get_params(param_name)
        
    # Compute prediction error conditioned on first 5 years of data
    reg.fit(X_cv, y_cv)
    test_score = reg.score(X_test, y_test)
    print('\nTest R2: {:.2f}'.format(test_score))

    # Predict for work days and off days
    y_pred = reg.predict(X_test)

    # Scatter plot of prediction against target
    fig, ax = plt.subplots()
    ax.scatter(y_test, y_pred, s=10, alpha=0.5)
    xlim = ax.get_xlim()
    ax.plot(xlim, xlim, '--k', linewidth=1)
    ax.set_xlabel('Target ' + windcf_label)
    ax.set_ylabel('Predicted ' + windcf_label)
    ax.set_xlim(xlim)
    ax.set_title('Wind capacity factor')
    
    return {param_name: param_best}

Question

If 2021 is given as the first test years, how many training years and test years will be available for this dataset?

Identify the lines where:

The validation and train scores are computed;

The test score is computed;

The target is predicted from the test input features.

Answer:

Individual models#

First, make sure that only the heating and cooling temperatures are selected as features, that the monthly factorization is activated and that no polynomial transformation is performed.

Lasso regression#

The following code:

Creates a Lasso regressor with positive coefficients;
Evaluate the regressor over a range of regularization parameter values;
Represent the importance of the coefficients.

The evaluation function:

Plots the training and validation curves with the lines representing the mean score and the shading the minimum and maximum scores;
Plots test predictions against the test inputs over the train data;
Plots the test predictions against the test targets.

Question

Is there an overfit/underfit tradeoff?

Explain the difference between the training curve and the validation curve?

Explain the evolution the variability of the validation curve.

Is the test score in agreement with the best validation score?

How does the the importance of the coefficients vary with climate variable and the month?

from sklearn import linear_model

# Define array of complexity coordinate, regressor and options
# for Lasso regression
param_name, param_range = 'alpha', np.logspace(-4, 1, 50)
reg_kwargs_lasso = dict(max_iter=10000)
reg_lasso = linear_model.Lasso(
    **{param_name: param_range[0]}, **reg_kwargs_lasso)

# Evaluate regressor
param_best_lasso = evaluate_regressor(
    reg_lasso, reg_kwargs_lasso, param_name, param_range)
df_coef = pd.Series(reg_lasso.coef_)
plt.figure()
df_coef.plot(kind='barh', figsize=(8, 8))
plt.xlabel(r'Coefficient value (m$^{-1}$)')
plt.ylabel('Grid point')
_ = plt.title('Model coefficients')

Test R2: 0.31

../_images/7af1c72b9f4f85611eded71ce229783160f6d8ca138797fdfc1815ec31713e2e.png

../_images/929dfe3df42122050676c6369a5b1dc1a2261737d36be19af22e91f0c876b4cf.png

../_images/e64b6fbf6f0e0c51bdef9b598ebd84d7a964b755af617ac3c6111dc7b4e00f86.png

Answer:

Decision-tree regression#

The following code:

Creates a decision-tree regressor;
Evaluate the regressor over a range of maximum tree depth;

Question

Is there an overfit/underfit tradeoff?

How does the tree perform compared to the Lasso?

from sklearn import tree

# Define array of complexity coordinate, regressor and options
# for decision-tree regression
param_name, param_range = 'max_depth', np.arange(1, 20, 1)
reg_kwargs_dt = dict()
reg_dt = tree.DecisionTreeRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_dt)

# Evaluate regressor
param_best_dt = evaluate_regressor(
    reg_dt, reg_kwargs_dt, param_name, param_range)

Test R2: -0.01

../_images/9445cf9c176b1af298651a992483a3aaf5129d4e125df549005e4db19ab444eb.png

../_images/cfa9dc20b981fe6edb922137a66fd74c5ebd8418a09ece1729e5426e6c60acc2.png

Answer:

Ensemble models#

Bagging regressor#

The following code:

Creates a bagging regressor with a decision tree as the base estimator;
Evaluate the regressor over a range of number of estimators;

Question

Use the Scikit-learn documentation to give the parameters defining this bagging regressor with a decision tree as the base estimator.

Question

Is there an overfit/underfit tradeoff?

In this case, how to choose the number of estimators?

Same question without using validation curves.

How does the bagging perform compared to the individual regressors?

Same question but with the Lasso as base estimator.

from sklearn import ensemble

# Define array of complexity coordinate, regressor and options
# for Bagging regression
param_name, param_range = 'n_estimators', np.arange(1, 50, 2)
# estimator = None
estimator = linear_model.Lasso(alpha=0.15)
reg_kwargs_br = dict(estimator=estimator)
reg_br = ensemble.BaggingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_br)

# Evaluate regressor
param_best_br = evaluate_regressor(
    reg_br, reg_kwargs_br, param_name, param_range)

Test R2: 0.32

../_images/cfeffdc8d981571f4a408ac37361dabb765e5719556915c7bbd69152104c872c.png

../_images/e8471155129a7619b57224d992bf5720d6d9d532070935ac20c8ff3df762ec73.png

Answer:

Random-forest regressor#

The following code:

Creates a random-forest regressor;
Evaluate the regressor over a range of number of estimators;

Question

Use the Scikit-learn documentation to give the parameters defining this random-forest regressor.

Question

How does the random forest perform compared to the bagging regressor?

Compare the varibility of the validation score of the random forest to that of the bagging regressor.

# Define array of complexity coordinate, regressor and options
# for random-forest regression
param_name, param_range = 'n_estimators', np.arange(1, 202, 25)
reg_kwargs_rf = dict()
reg_rf = ensemble.RandomForestRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_rf)

# Evaluate regressor
param_best_rf = evaluate_regressor(
    reg_rf, reg_kwargs_rf, param_name, param_range)

Test R2: 0.06

../_images/c5c28a93c192a7f338fd49070cb14047381e4af4f1dc254eefea0f7407385eb0.png

../_images/12b8378200f2a5d248c8acb7b3edc44c8dcfb990ae46ccdae665698a199f1ac2.png

Answer:

The following plot represents the mean and standard deviation of the importance given to the features by the trees in the random forest (see Feature importance with a forest of trees in Scikit-learn User guide).

Question

How does the the importance of the features vary with climate variable and the month?

Compare the importance of the random-forest features with that of the Lasso.

# Plot importance
importances = reg_rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in reg_rf.estimators_], axis=0)
forest_importances = pd.Series(importances)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
_ = ax.set_ylabel("Mean decrease in impurity")

../_images/273072152f8677730c47130bf97a74c3200ddb0904fe1f383c4febfe8980b79b.png

Answer:

Voting regressor#

The following code creates a voting regressor with the Lasso, the decision tree and the random forest as base estimators and tests it.

Question

Use the Scikit-learn documentation to give the parameters defining this voting regressor.

Question

How does the voting regressor perform compared to the base estimators?

# Define array of complexity coordinate, regressor and options
# for voting regression
param_name, param_range = 'estimators', [
    [('Lasso', reg_lasso), ('DecisionTree', reg_dt), ('RandomForest', reg_rf)]
]
reg_kwargs_vr = dict()
reg_vr = ensemble.VotingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_vr)

# Evaluate regressor
param_best_vr = evaluate_regressor(
    reg_vr, reg_kwargs_vr, param_name, param_range,
    plot_validation=False)

Test R2: 0.23

../_images/e92bbd4113d22e8db91d3fa2379c611a4ec91cd63ed98f4c10615b794c255518.png

Answer:

Stacking regressor#

The following code creates a stacking regressor with the Lasso, the decision tree and the random forest as base estimators and tests it.
The final regressor is a OLS with positive coefficients.
The weights given to the base estimators by the stacking are also given.

Question

Use the Scikit-learn documentation to give the parameters defining this stacking regressor.

Question

How does the stacking regressor perform compared to the voting regressor? Why?

# Define array of complexity coordinate, regressor and options
# for voting regression
param_name, param_range = 'estimators', [
    [('Lasso', reg_lasso), ('DecisionTree', reg_dt), ('RandomForest', reg_rf)]
]
reg_kwargs_sr = dict(final_estimator=linear_model.LinearRegression(
    positive=True))
reg_sr = ensemble.StackingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_sr)

# Evaluate regressor
param_best_sr = evaluate_regressor(
    reg_sr, reg_kwargs_sr, param_name, param_range, plot_validation=False)
weights = reg_sr.final_estimator_.coef_
index = [r[0] for r in param_range[0]]
df_weights = pd.Series(weights, index=index)
print('Weights:')
print(df_weights)

Test R2: 0.23
Weights:
Lasso           0.877058
DecisionTree    0.000000
RandomForest    0.076595
dtype: float64

../_images/f8883afd7b2348ba2a6eac4f9e71e4d781fc0a6543a6c39dafeb8faf3f871cf4.png

Answer:

AdaBoost regressor#

The following code:

Creates an AdaBoost regressor;
Evaluate the regressor over a range of number of estimators;

Question

Use the Scikit-learn documentation to give the parameters defining this AdaBoost regressor.

Question

How does the AdaBoost compared to the other regressors?

# Define array of complexity coordinate, regressor and options
# for AdaBoost regression
param_name, param_range = 'n_estimators', np.arange(1, 50, 2)
estimator = linear_model.LinearRegression(fit_intercept=True)
reg_kwargs_abr = dict(estimator=estimator)
reg_abr = ensemble.AdaBoostRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_abr)

# Evaluate regressor
param_best_abr = evaluate_regressor(
    reg_abr, reg_kwargs_abr, param_name, param_range)