Tutorial: Ensemble Methods for Wind Capacity Factor Prediction#

Binder

Tutorial to the class Ensemble Methods based on the same case study as in Tutorial: Regularization, Model Selection and Evaluation and Tutorial: Introduction to Unsupervised Learning with a Focus on PCA.

Tutorial Objectives
  • Use ensemble methods to best predict the France-average wind capacity factor from a large number of input variables;

  • Control the complexity parameter(s) of each ensemble method to avoid overfitting;

  • Compare the skills of different methods.

Scientific objective#

To predict the France average wind capacity factor from the geopotential height at 500hPa over the Euro-Atlantic sector.

Dataset presentation#

  • Input:

  • Target:

    • Onshore wind capacity factors

      • Domain: Metropolitan France

      • Spatial resolution: regional mean

      • Time resolution: daily

      • Period: 2014-2021

      • Units:

      • Source: RTE

Getting ready#

Reading the wind capacity factor and geopotential height data#

We follow the same procedure as in # Tutorial: Regularization, Model Selection and Evaluation.

# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Structured dataset analysis module
import xarray as xr
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)

# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(index_col=0, parse_dates=True) 

# Define electricity demand filepath and label
windcf_filename = 'reseaux_energies_capacityfactor_wind-onshore.csv'
windcf_filepath = Path(data_dir, windcf_filename)
windcf_label = 'Wind capacity factor'

# Read windcf data with pandas                                                                                                                                                
df_windcf_daily = pd.read_csv(windcf_filepath, **kwargs_read_csv)

# Select domain
# REGION_NAME = 'Bretagne'
REGION_NAME = 'National'
if REGION_NAME == 'National':
    df_windcf_daily_reg = df_windcf_daily.mean('columns')
    df_windcf_daily_reg.name = REGION_NAME
else:
    df_windcf_daily_reg = df_windcf_daily[REGION_NAME]

# Resample wind capacity factor from daily to monthly means
df_windcf_reg = df_windcf_daily_reg.resample('MS').mean()

# Define temperature filepath and label
START_DATE = '19800101'
END_DATE = '20220101'
z500_filename = 'merra2_analyze_height_500_month_{}-{}.nc'.format(START_DATE, END_DATE)
z500_filepath = Path(data_dir, z500_filename)
z500_label = 'Geopotential height (m)'

# Read geopotential height dataset with xarray                                                                                                                                
ds = xr.load_dataset(z500_filepath)

# Select geopotential height variable                                                                                                                                         
z500_name = 'height_500'                                                                                                                                                     
da_z500_hr = ds[z500_name]

# Downsample geopotential height
N_GRID_AVG = 8
da_z500 = da_z500_hr.coarsen(lat=N_GRID_AVG, boundary='trim').mean().coarsen(                                                                                                 
    lon=N_GRID_AVG, boundary='trim').mean()

# Remove seasonal cycle from wind capacity factor
da_windcf_reg = df_windcf_reg.to_xarray()                                                                                                                                 
gp_windcf_cycle = da_windcf_reg.groupby('time.month')                                                                                                                     
da_windcf_anom = gp_windcf_cycle - gp_windcf_cycle.mean('time')                                                                                                           
df_windcf_anom = da_windcf_anom.drop('month').to_dataframe()[REGION_NAME]

# Remove seasonal cycle from geopotential height or not
gp_z500_cycle = da_z500.groupby('time.month')                                                                                                                             
da_z500_anom = gp_z500_cycle - gp_z500_cycle.mean('time')

# Convert to bandas with grid points as columns                                                                                                                               
df_z500_anom = da_z500_anom.stack(latlon=('lat', 'lon')).to_dataframe()[
    z500_name].unstack(0).transpose()
    
# Select common index                                                                                                                                                         
idx = df_z500_anom.index.intersection(df_windcf_anom.index)                                                                                                                   
df_z500_anom = df_z500_anom.loc[idx]                                                                                                                                          
df_windcf_anom = df_windcf_anom.loc[idx]                                                                                                                                      
                                                                                                                                                                              
# Number of years in dataset                                                                                                                                                  
time = df_windcf_anom.index                                                                                                                                                   
n_years = time.year.max() - time.year.min() + 1

Ensemble methods#

Regressions evaluation function#

We also define a function to evaluate any of our regressions.

FIRST_TEST_YEAR controls the number of years at the end of the time series to keep as test data.

from sklearn import model_selection

# Default number of test days
FIRST_TEST_YEAR = 2020

POLY_DEGREE = 1

# Get input matrix and output vector for whole time series
X = df_z500_anom.values
y = df_windcf_anom.values

def evaluate_regressor(
    reg, reg_kwargs, param_name, param_range, first_test_year=FIRST_TEST_YEAR,
    n_splits=None, cv_iterator=model_selection.KFold,
    plot_validation=True):
    
    if n_splits is None:
        n_splits = len(np.unique(time[time.year < first_test_year].year))
    
    # Get test data keeping last years
    index_test = time.year >= first_test_year
    X_test = X[index_test]
    y_test = y[index_test]
    
    # Select train data from first years and first days in month 
    index_cv = time.year < first_test_year
    X_cv = X[index_cv]
    y_cv = y[index_cv]
    
    if plot_validation:
        # Set cross-validation iterator
        cv = cv_iterator(n_splits=n_splits)
        groups = time[index_cv].year

        # Get train and validation scores from cross-validation
        train_scores, validation_scores = model_selection.validation_curve(
            reg, X_cv, y_cv, param_name=param_name,
            param_range=param_range, cv=cv, groups=groups)

        # Get train curve
        train_scores_mean = train_scores.mean(1)
        train_scores_max = train_scores.max(1)
        train_scores_min = train_scores.min(1)

        # Get validation curve
        validation_scores_mean = validation_scores.mean(1)
        validation_scores_max = validation_scores.max(1)
        validation_scores_min = validation_scores.min(1)

        # Get best value of the regularization parameter
        i_best = np.argmax(validation_scores_mean)
        param_best = param_range[i_best]
        score_best = validation_scores_mean[i_best]

    
        # Plot validation curve
        lw = 2
        plt.figure()
        plt.semilogx(param_range, train_scores_mean, label="Training score",
                     color="darkorange", lw=lw)
        plt.fill_between(param_range, train_scores_min, train_scores_max,
                         alpha=0.2, color="darkorange", lw=lw)
        plt.semilogx(param_range, validation_scores_mean,
                     label="Cross-validation score", color="navy", lw=lw)
        plt.fill_between(
            param_range, validation_scores_min, validation_scores_max,
            alpha=0.2, color="navy", lw=lw)
        plt.xlabel(param_name)
        plt.ylabel(r'Validation $R^2$')
        plt.xlim(param_range[[0, -1]])
        plt.title(REGION_NAME + r'. Best $R^2$: {:.2} for {} = {:.1e}'.format(
            score_best, param_name, param_best))
        # plt.ylim(0.0, 1.1)
        plt.legend(loc="best")
        
        # Set best parameter
        reg.set_params(**{param_name: param_best})
    else:
        param_best = reg.get_params(param_name)
        
    # Compute prediction error conditioned on first 5 years of data
    reg.fit(X_cv, y_cv)
    test_score = reg.score(X_test, y_test)
    print('\nTest R2: {:.2f}'.format(test_score))

    # Predict for work days and off days
    y_pred = reg.predict(X_test)

    # Scatter plot of prediction against target
    fig, ax = plt.subplots()
    ax.scatter(y_test, y_pred, s=10, alpha=0.5)
    xlim = ax.get_xlim()
    ax.plot(xlim, xlim, '--k', linewidth=1)
    ax.set_xlabel('Target ' + windcf_label)
    ax.set_ylabel('Predicted ' + windcf_label)
    ax.set_xlim(xlim)
    ax.set_title('Wind capacity factor')
    
    return {param_name: param_best}

Question

  • If 2021 is given as the first test years, how many training years and test years will be available for this dataset?

  • Identify the lines where:

    • The validation and train scores are computed;

    • The test score is computed;

    • The target is predicted from the test input features.

Answer:

Individual models#

First, make sure that only the heating and cooling temperatures are selected as features, that the monthly factorization is activated and that no polynomial transformation is performed.

Lasso regression#

The following code:

  • Creates a Lasso regressor with positive coefficients;

  • Evaluate the regressor over a range of regularization parameter values;

  • Represent the importance of the coefficients.

The evaluation function:

  • Plots the training and validation curves with the lines representing the mean score and the shading the minimum and maximum scores;

  • Plots test predictions against the test inputs over the train data;

  • Plots the test predictions against the test targets.

Question

  • Is there an overfit/underfit tradeoff?

  • Explain the difference between the training curve and the validation curve?

  • Explain the evolution the variability of the validation curve.

  • Is the test score in agreement with the best validation score?

  • How does the the importance of the coefficients vary with climate variable and the month?

from sklearn import linear_model

# Define array of complexity coordinate, regressor and options
# for Lasso regression
param_name, param_range = 'alpha', np.logspace(-4, 1, 50)
reg_kwargs_lasso = dict(max_iter=10000)
reg_lasso = linear_model.Lasso(
    **{param_name: param_range[0]}, **reg_kwargs_lasso)

# Evaluate regressor
param_best_lasso = evaluate_regressor(
    reg_lasso, reg_kwargs_lasso, param_name, param_range)
df_coef = pd.Series(reg_lasso.coef_)
plt.figure()
df_coef.plot(kind='barh', figsize=(8, 8))
plt.xlabel(r'Coefficient value (m$^{-1}$)')
plt.ylabel('Grid point')
_ = plt.title('Model coefficients')
Test R2: 0.31
../_images/7af1c72b9f4f85611eded71ce229783160f6d8ca138797fdfc1815ec31713e2e.png ../_images/929dfe3df42122050676c6369a5b1dc1a2261737d36be19af22e91f0c876b4cf.png ../_images/e64b6fbf6f0e0c51bdef9b598ebd84d7a964b755af617ac3c6111dc7b4e00f86.png

Answer:

Decision-tree regression#

The following code:

  • Creates a decision-tree regressor;

  • Evaluate the regressor over a range of maximum tree depth;

Question

  • Is there an overfit/underfit tradeoff?

  • How does the tree perform compared to the Lasso?

from sklearn import tree

# Define array of complexity coordinate, regressor and options
# for decision-tree regression
param_name, param_range = 'max_depth', np.arange(1, 20, 1)
reg_kwargs_dt = dict()
reg_dt = tree.DecisionTreeRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_dt)

# Evaluate regressor
param_best_dt = evaluate_regressor(
    reg_dt, reg_kwargs_dt, param_name, param_range)
Test R2: -0.01
../_images/9445cf9c176b1af298651a992483a3aaf5129d4e125df549005e4db19ab444eb.png ../_images/cfa9dc20b981fe6edb922137a66fd74c5ebd8418a09ece1729e5426e6c60acc2.png

Answer:

Ensemble models#

Bagging regressor#

The following code:

  • Creates a bagging regressor with a decision tree as the base estimator;

  • Evaluate the regressor over a range of number of estimators;

Question

  • Use the Scikit-learn documentation to give the parameters defining this bagging regressor with a decision tree as the base estimator.

Question

  • Is there an overfit/underfit tradeoff?

  • In this case, how to choose the number of estimators?

  • Same question without using validation curves.

  • How does the bagging perform compared to the individual regressors?

  • Same question but with the Lasso as base estimator.

from sklearn import ensemble

# Define array of complexity coordinate, regressor and options
# for Bagging regression
param_name, param_range = 'n_estimators', np.arange(1, 50, 2)
# estimator = None
estimator = linear_model.Lasso(alpha=0.15)
reg_kwargs_br = dict(estimator=estimator)
reg_br = ensemble.BaggingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_br)

# Evaluate regressor
param_best_br = evaluate_regressor(
    reg_br, reg_kwargs_br, param_name, param_range)
Test R2: 0.32
../_images/cfeffdc8d981571f4a408ac37361dabb765e5719556915c7bbd69152104c872c.png ../_images/e8471155129a7619b57224d992bf5720d6d9d532070935ac20c8ff3df762ec73.png

Answer:

Random-forest regressor#

The following code:

  • Creates a random-forest regressor;

  • Evaluate the regressor over a range of number of estimators;

Question

  • Use the Scikit-learn documentation to give the parameters defining this random-forest regressor.

Question

  • How does the random forest perform compared to the bagging regressor?

  • Compare the varibility of the validation score of the random forest to that of the bagging regressor.

# Define array of complexity coordinate, regressor and options
# for random-forest regression
param_name, param_range = 'n_estimators', np.arange(1, 202, 25)
reg_kwargs_rf = dict()
reg_rf = ensemble.RandomForestRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_rf)

# Evaluate regressor
param_best_rf = evaluate_regressor(
    reg_rf, reg_kwargs_rf, param_name, param_range)
Test R2: 0.06
../_images/c5c28a93c192a7f338fd49070cb14047381e4af4f1dc254eefea0f7407385eb0.png ../_images/12b8378200f2a5d248c8acb7b3edc44c8dcfb990ae46ccdae665698a199f1ac2.png

Answer:

The following plot represents the mean and standard deviation of the importance given to the features by the trees in the random forest (see Feature importance with a forest of trees in Scikit-learn User guide).

Question

  • How does the the importance of the features vary with climate variable and the month?

  • Compare the importance of the random-forest features with that of the Lasso.

# Plot importance
importances = reg_rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in reg_rf.estimators_], axis=0)
forest_importances = pd.Series(importances)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
_ = ax.set_ylabel("Mean decrease in impurity")
../_images/273072152f8677730c47130bf97a74c3200ddb0904fe1f383c4febfe8980b79b.png

Answer:

Voting regressor#

The following code creates a voting regressor with the Lasso, the decision tree and the random forest as base estimators and tests it.

Question

  • Use the Scikit-learn documentation to give the parameters defining this voting regressor.

Question

  • How does the voting regressor perform compared to the base estimators?

# Define array of complexity coordinate, regressor and options
# for voting regression
param_name, param_range = 'estimators', [
    [('Lasso', reg_lasso), ('DecisionTree', reg_dt), ('RandomForest', reg_rf)]
]
reg_kwargs_vr = dict()
reg_vr = ensemble.VotingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_vr)

# Evaluate regressor
param_best_vr = evaluate_regressor(
    reg_vr, reg_kwargs_vr, param_name, param_range,
    plot_validation=False)
Test R2: 0.23
../_images/e92bbd4113d22e8db91d3fa2379c611a4ec91cd63ed98f4c10615b794c255518.png

Answer:

Stacking regressor#

  • The following code creates a stacking regressor with the Lasso, the decision tree and the random forest as base estimators and tests it.

  • The final regressor is a OLS with positive coefficients.

  • The weights given to the base estimators by the stacking are also given.

Question

  • Use the Scikit-learn documentation to give the parameters defining this stacking regressor.

Question

  • How does the stacking regressor perform compared to the voting regressor? Why?

# Define array of complexity coordinate, regressor and options
# for voting regression
param_name, param_range = 'estimators', [
    [('Lasso', reg_lasso), ('DecisionTree', reg_dt), ('RandomForest', reg_rf)]
]
reg_kwargs_sr = dict(final_estimator=linear_model.LinearRegression(
    positive=True))
reg_sr = ensemble.StackingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_sr)

# Evaluate regressor
param_best_sr = evaluate_regressor(
    reg_sr, reg_kwargs_sr, param_name, param_range, plot_validation=False)
weights = reg_sr.final_estimator_.coef_
index = [r[0] for r in param_range[0]]
df_weights = pd.Series(weights, index=index)
print('Weights:')
print(df_weights)
Test R2: 0.23
Weights:
Lasso           0.877058
DecisionTree    0.000000
RandomForest    0.076595
dtype: float64
../_images/f8883afd7b2348ba2a6eac4f9e71e4d781fc0a6543a6c39dafeb8faf3f871cf4.png

Answer:

AdaBoost regressor#

The following code:

  • Creates an AdaBoost regressor;

  • Evaluate the regressor over a range of number of estimators;

Question

  • Use the Scikit-learn documentation to give the parameters defining this AdaBoost regressor.

Question

  • How does the AdaBoost compared to the other regressors?

# Define array of complexity coordinate, regressor and options
# for AdaBoost regression
param_name, param_range = 'n_estimators', np.arange(1, 50, 2)
estimator = linear_model.LinearRegression(fit_intercept=True)
reg_kwargs_abr = dict(estimator=estimator)
reg_abr = ensemble.AdaBoostRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_abr)

# Evaluate regressor
param_best_abr = evaluate_regressor(
    reg_abr, reg_kwargs_abr, param_name, param_range)
Test R2: -0.21
../_images/21cbd18364948433b054fc10bd512fa60f80d994d012380e24e1cb88c3f6c0af.png ../_images/a9c9f989b605844152d3b4b44b46191e1244ef7e8e5ba96a18d3469efec47930.png

Question (Optional)

  • Reevaluate your results for a different scoring metric.

Answer:


Credit#

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.


Logo LMD Logo IPSL Logo E4C Logo EP Logo SU Logo ENS Logo CNRS