Tutorial: Ensemble Methods for Electricity Demand Prediction#


Tutorial to the class Ensemble Methods.

Tutorial Objectives
  • Use ensemble methods to best predict the regional electricity demand in France from a large number of input variables;

  • Control the complexity parameter(s) of each ensemble method to avoid overfitting;

  • Compare the skills of different methods.

Getting ready#

Loading the demand data#

Let us load the regional electricity demand data as in Tutorial: Supervised Learning Problem and Least Squares.

# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis modules
import pandas as pd
import xarray as xr
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)

# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict(index_col=0, header=0, parse_dates=True)

# Set first and last years
LAST_YEAR = 2021

# Define electricity demand filepath and label
dem_filename = 'reseaux_energies_demand_demand.csv'
dem_filepath = Path(data_dir, dem_filename)
dem_label = 'Electricity consumption (MWh)'

# Read hourly demand data averaged over each region
df_dem = pd.read_csv(dem_filepath, **kwargs_read_csv)
# region_rename = {
#     'Auvergne-Rhône-Alpes': 'Auvergne-Rhône-Alpes',
#     'Bourgogne-Franche-Comté': 'Bourgogne-Franche-Comté',
#     'Bretagne': 'Bretagne',
#     'Centre-Val de Loire': 'Centre-Val de Loire',
#     'Grand Est': 'Grand Est',
#     'Hauts-de-France': 'Hauts-de-France',
#     'Île-de-France': 'Ile-de-France',
#     'Normandie': 'Normandie',
#     'Nouvelle-Aquitaine': 'Nouvelle-Aquitaine',
#     'Occitanie': 'Occitanie',
#     'Pays de la Loire': 'Pays-de-la-Loire',
#     "Provence-Alpes-Côte d'Azur": 'PACA'
# }
# df_dem.columns = [region_rename[c] for c in df_dem.columns]
df_dem.index = df_dem.index.tz_localize(None)
df_dem = df_dem.resample('D').sum()

Loading the climate data#

Let us load climate data for France and average it over regions to use it as input to predict the demand.

In addition to the surface temperature (in °C), we also load the:

  • surface density (kg/m3)

  • surface downward radiation (in W/m2),

  • surface specific humidity and

  • surface wind (in m/s).

# Directories where you saved the data
filename_climate = (
temp_label = 'Temperature (°C)'

def get_regional_climate(variable_name):
    filename = filename_climate.format(variable_name)
    filepath = Path(data_dir, filename)
    df_climate = pd.read_csv(filepath, **kwargs_read_csv)
    da_climate = df_climate.to_xarray().to_array('region').transpose(
        'time', 'region').to_dataset(name=variable_name)
    return da_climate

# Read a climate variable and plot its mean over time
ds_clim = get_regional_climate('surface_temperature') - 273.15
ds_clim = ds_clim.merge(get_regional_climate('surface_density'))
ds_clim = ds_clim.merge(get_regional_climate('surface_downward_radiation'))
ds_clim = ds_clim.merge(get_regional_climate('surface_specific_humidity'))
ds_clim = ds_clim.merge(get_regional_climate('zonal_wind'))
ds_clim = ds_clim.merge(get_regional_climate('meridional_wind'))
ds_clim['wind_speed'] = np.sqrt(ds_clim['zonal_wind']**2 +
ds_clim['wind_direction'] = np.arctan2(ds_clim['meridional_wind'],
variable_names = list(ds_clim)

Preprocessing the inputs#

We now select data for a given region of France, subsample it to daily frequency and keep only the dates that are common to the demand times series and the climate time series.

# REGION_NAME = 'Ile-de-France'
REGION_NAME = 'Occitanie'

# Select region
df_dem_reg = df_dem[REGION_NAME]
ds_clim_reg = ds_clim.sel(region=REGION_NAME, drop=True)
# Select common index
idx = df_dem_reg.index.intersection(ds_clim_reg.indexes['time'])
df_dem_reg = df_dem_reg.loc[idx]
ds_clim_reg = ds_clim_reg.sel(time=idx, method='nearest')

print('First common date: \t{}\nLast common date: \t{}'.format(
    df_dem_reg.index[0], df_dem_reg.index[-1]))
First common date: 	2014-01-01 00:00:00
Last common date: 	2019-12-31 00:00:00
# Add noise to demand (optional)
df_dem_reg += np.random.normal(0., df_dem_reg.std(), df_dem_reg.shape) * NOISE_LEVEL
# Deteriorate dataset by removing extreme temperatures
T_MIN = -100.
T_MAX = 100.
extremes = (ds_clim_reg['surface_temperature'] < T_MIN) | (ds_clim_reg['surface_temperature'] > T_MAX)
ds_clim_reg = ds_clim_reg.where(~extremes).dropna(dim='time')
df_dem_reg = df_dem_reg.where(~extremes).dropna()
# Deteriorate by selecting a few days per month
step = int(365.25 / 12 / N_DAYS_PER_MONTH)
ds_clim_reg = ds_clim_reg.isel(time=slice(None, None, step))
df_dem_reg = df_dem_reg.iloc[slice(None, None, step)]

Analyzing the relationship between the climate variables and the demand#

The code below:

  • does a scatter plot the demand as a function of each climate variable on separate figures,

  • computes the correlation between the demand and each climate variable,

  • computes the correlation matrix between climate variables removing values smaller than 0.3 in absolute value.


  • Does their seem to be redundancies between climate variables?

  • Which climate variables seem to be most relevant to predict the demand?

  • Discuss the limits of this analysis using correlations alone.

for variable_name, da in ds_clim_reg.items():
    plt.scatter(da.values, df_dem_reg.values, s=10, alpha=0.2)
    c = np.corrcoef(da.values, df_dem_reg.values)[0, 1]
    plt.title('{}, Correlation = {:.2f}'.format(variable_name, c))
../_images/0d0af0e78b049005297c4987a13b57add312a76f222887bbca7d8bd2ab42640a.png ../_images/19fdb0b8196e12a9d8b0155f795cbe27de6eaf315f304006e049952026abcccc.png ../_images/99d85c7be5b89c6488b36c39562934a57d880eb92a35305f3dbb64fea719431d.png ../_images/3a1ffd4454562b67e62ecce6c67ef287d248e1aada32e7f71938d99c6f428ca2.png ../_images/efcb4aed97e0a80b7e440c89ee91b996893e95e57710fff6bdc62ebca87a1106.png ../_images/3998ba774d940e74dc123c9b3f06f367c5db7da9b3e277bd0ee62de492cea237.png ../_images/71f30cf1e00b184190921a27c1acde2f0555bc98e45e49fb8dbbf61f56698c36.png ../_images/aa43413f7216e1accd135928fe0b709aadc0abf27e69e3ff231acb298bef35fe.png
corr_clim = np.corrcoef(ds_clim_reg.to_array().values)
fig, ax = plt.subplots()
plt.imshow(np.where(np.abs(corr_clim) > 0.3, corr_clim, np.nan),
           vmin=-1., vmax=1.)
ticks = range(len(variable_names))
ticklabels = variable_names
ax.set_xticklabels(ticklabels, rotation=90.)
_ = plt.title('Correlation matrix (abs. val. > 0.3)')


Preparing feature extraction#

As in the previous tutorials, we define nonlinear features from the temperature based on heating- and cooling-temperatures.

# Define heating and cooling temperature thresholds

# Define function returning a dictionary from variable name
# to variable train data with base, heating and cooling variables
def get_heat_cool(x):
    return {
        'heat': (TEMP_HEAT - x) * (x < TEMP_HEAT).astype(float),
        'cool': (x - TEMP_COOL) * (x > TEMP_COOL).astype(float)

We also define a factorization by month. In other words, for each variable, we define 12 new variables for which the values are that of the original variable for dates in the corresponding month, zero otherwise.

def factorize_monthly(variables, index):
    new_variables = {}
    for variable_name, variable_data in variables.items():
        for month in range(1, 13):
            new_variable_name = '{}_m{:02d}'.format(variable_name, month)
            new_variables[new_variable_name] = variable_data * (
                index.month == month).astype(float)
    return new_variables


  • If 4 climate variables are selected, how may features will result from the monthly factorization?


Regressions evaluation function#

We also define a function to evaluate any of our regressions.

FIRST_TEST_YEAR controls the number of years at the end of the time series to keep as test data.

from sklearn import model_selection

# Default number of test days

TIME = ds_clim_reg.indexes['time']
def evaluate_regressor(
    reg, reg_kwargs, param_name, param_range, first_test_year=FIRST_TEST_YEAR,
    n_splits=None, cv_iterator=model_selection.GroupKFold,
    if n_splits is None:
        n_splits = len(np.unique(TIME[TIME.year < first_test_year].year))
    # Get test data keeping last years
    index_test = TIME.year >= first_test_year
    x_test = x[index_test]
    X_test = X[index_test]
    y_test = y[index_test]
    # Select train data from first years and first days in month 
    index_train = TIME.year < first_test_year
    X_train = X[index_train]
    y_train = y[index_train]
    if plot_validation:
        # Set cross-validation iterator
        cv = cv_iterator(n_splits=n_splits)
        groups = TIME[index_train].year

        # Get train and validation scores from cross-validation
        train_scores, validation_scores = model_selection.validation_curve(
            reg, X_train, y_train, param_name=param_name,
            param_range=param_range, cv=cv, groups=groups)

        # Get train curve
        train_scores_mean = train_scores.mean(1)
        train_scores_max = train_scores.max(1)
        train_scores_min = train_scores.min(1)

        # Get validation curve
        validation_scores_mean = validation_scores.mean(1)
        validation_scores_max = validation_scores.max(1)
        validation_scores_min = validation_scores.min(1)

        # Get best value of the regularization parameter
        i_best = np.argmax(validation_scores_mean)
        param_best = param_range[i_best]
        score_best = validation_scores_mean[i_best]

        # Plot validation curve
        lw = 2
        plt.semilogx(param_range, train_scores_mean, label="Training score",
                     color="darkorange", lw=lw)
        plt.fill_between(param_range, train_scores_min, train_scores_max,
                         alpha=0.2, color="darkorange", lw=lw)
        plt.semilogx(param_range, validation_scores_mean,
                     label="Cross-validation score", color="navy", lw=lw)
            param_range, validation_scores_min, validation_scores_max,
            alpha=0.2, color="navy", lw=lw)
        plt.ylabel(r'Validation $R^2$')
        plt.title(REGION_NAME + r'. Best $R^2$: {:.2} for {} = {:.1e}'.format(
            score_best, param_name, param_best))
        plt.ylim(0.0, 1.1)
        # Set best parameter
        reg.set_params(**{param_name: param_best})
        param_best = reg.get_params(param_name)
    # Compute prediction error conditioned on first 5 years of data
    reg.fit(X_train, y_train)
    test_score = reg.score(X_test, y_test)
    print('\nTest R2: {:.2f}'.format(test_score))

    # Predict for work days and off days
    y_pred = reg.predict(X_test)

    # Plot prediction on top of train data versus temperature
    plt.scatter(x_test, y_test, s=10, alpha=0.5)
    plt.scatter(x_test, y_pred, s=10)

    # Scatter plot of prediction against target
    plt.scatter(y_test, y_pred, s=10, alpha=0.5)
    plt.xlabel('Target ' + dem_label)
    plt.ylabel('Predicted ' + dem_label)
    return {param_name: param_best}


  • If 2018 is given as the first test years, how many training years and test years will be available for this dataset?

  • Identify the lines where:

    • The validation and train scores are computed;

    • The test score is computed;

    • The target is predicted from the test input features.


Feature selection#

The following code select features, by controlling:

  • The climate variables to select in addition to the heating- and cooling-temperatures;

  • Whether a monthly factorization is performed or not;

  • The degree of polynomial interactions;

  • The standardization of the features.

It also plots the time series features associated with each selected climate variable.


  • Identify the lines where:

    • The heating and cooling temperatures are obtained;

    • The other climate variables are added;

    • The monthly factorization is applied;

    • The polynomial features are computed;

    • The standardization is performed.

  • Are the feature time series in agreement with your expectations?

from sklearn import preprocessing

# Variables in addition to temperature features
#     'surface_density', 'surface_downward_radiation',
#     'surface_specific_humidity', 'wind_speed', 'wind_direction'
# ]

# Degree of polynomial interactions

# Get features
x = ds_clim_reg['surface_temperature'].values
y = df_dem_reg.values
features = get_heat_cool(x)
    features[var_name] = ds_clim_reg[var_name].values
feature_names_nofact = list(features)

# Factorize monthly
features = factorize_monthly(features, TIME)

# Plot features before polynomial transform and standardization
feature_names = list(features)
df_features = pd.DataFrame(features, index=TIME)
n_feat_nofact = len(feature_names_nofact)
fig, axs = plt.subplots(n_feat_nofact, figsize=[12, 5 * n_feat_nofact])
for k, feature_name in enumerate(feature_names_nofact):
        ax = axs[k]
    except TypeError:
        ax = axs
    feat_variables = [feature_name in v for v in feature_names]
    df_features.iloc[:, feat_variables].plot(ax=ax, ylabel=feature_name)
    ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Features')

# Get input matrix
X = np.array(list(features.values())).T
    poly = preprocessing.PolynomialFeatures(POLY_DEGREE)
    X =  poly.fit_transform(X)
# Standardize
X = preprocessing.StandardScaler().fit_transform(X)


Individual models#

First, make sure that only the heating and cooling temperatures are selected as features, that the monthly factorization is activated and that no polynomial transformation is performed.

Lasso regression#

The following code:

  • Creates a Lasso regressor with positive coefficients;

  • Evaluate the regressor over a range of regularization parameter values;

  • Represent the importance of the coefficients.

The evaluation function:

  • Plots the training and validation curves with the lines representing the mean score and the shading the minimum and maximum scores;

  • Plots test predictions against the test inputs over the train data;

  • Plots the test predictions against the test targets.


  • Is there an overfit/underfit tradeoff?

  • Explain the difference between the training curve and the validation curve?

  • Explain the evolution the variability of the validation curve.

  • Is the test score in agreement with the best validation score?

  • How does the the importance of the coefficients vary with climate variable and the month?

from sklearn import linear_model

# Define array of complexity coordinate, regressor and options
# for Lasso regression
param_name, param_range = 'alpha', np.logspace(-4, 5, 10)
reg_kwargs_lasso = dict(fit_intercept=True, warm_start=True,
                        positive=True, max_iter=10000)
reg_lasso = linear_model.Lasso(
    **{param_name: param_range[0]}, **reg_kwargs_lasso)

# Evaluate regressor
param_best_lasso = evaluate_regressor(
    reg_lasso, reg_kwargs_lasso, param_name, param_range)
index = feature_names if POLY_DEGREE == 1 else None
df_coef = pd.Series(reg_lasso.coef_, index=index)
_ = df_coef.plot(kind='barh', figsize=(8, 8))
Test R2: 0.79
../_images/4c557294da4c8472b0a5246f7ae724120f45d162aa8209a9cf462ba32eee2a34.png ../_images/7553a3d177320f83759e03ffb146108d29c9efefa7e7d860682d2c5b9718d26e.png ../_images/de622b1df3c6a64db7feff9307f8ee352887ccd2ec562d068ca8e29865442e22.png ../_images/8a67421f15ebb231f0dcdf2181292d9b02bcfbdc8df2d3109f119be34ae31c68.png


Decision-tree regression#

The following code:

  • Creates a decision-tree regressor;

  • Evaluate the regressor over a range of maximum tree depth;


  • Is there an overfit/underfit tradeoff?

  • How does the tree perform compared to the Lasso?

from sklearn import tree

# Define array of complexity coordinate, regressor and options
# for decision-tree regression
param_name, param_range = 'max_depth', np.arange(1, 150, 1)
reg_kwargs_dt = dict()
reg_dt = tree.DecisionTreeRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_dt)

# Evaluate regressor
param_best_dt = evaluate_regressor(
    reg_dt, reg_kwargs_dt, param_name, param_range)
Test R2: 0.66
../_images/da94809b9e7f27ca55ff756e3f111d2dae8e312f112d7106bc005bc57e39aba1.png ../_images/7883e1115266c24c239d3ee5e268140ebdd368eb621e2c4cd520b6ccff116859.png ../_images/662fc2866b28c947bd3b43fb26959a38fe734e57b5c351014bd044b83ad2843d.png


Ensemble models#

Bagging regressor#

The following code:

  • Creates a bagging regressor with a decision tree as a base estimator;

  • Evaluate the regressor over a range of number of estimators;


  • Use the Scikit-learn documentation to give the parameters defining this bagging regressor.


  • Is there an overfit/underfit tradeoff?

  • In this case, how to choose the number of estimators?

  • Same question without using validation curves.

  • How does the bagging perform compared to the individual regressors?

from sklearn import ensemble

# Define array of complexity coordinate, regressor and options
# for Bagging regression
param_name, param_range = 'n_estimators', np.arange(1, 50, 2)
estimator = None
reg_kwargs_br = dict(estimator=estimator)
reg_br = ensemble.BaggingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_br)

# Evaluate regressor
param_best_br = evaluate_regressor(
    reg_br, reg_kwargs_br, param_name, param_range)
Test R2: 0.65
../_images/1ab26ad846ec1309e4001aa6ff837159939dea51e488955267a35ef8d0338829.png ../_images/efc12ca5aa77b1ab005d7648e1b20a84b3a7d0b15364946eeb5d8b990c9ae242.png ../_images/c0491215ff0de16779e663d0b3e9d4c19cf22dc9aa3a1b6a41335c446387ad41.png


Random-forest regressor#

The following code:

  • Creates a random-forest regressor;

  • Evaluate the regressor over a range of number of estimators;


  • Use the Scikit-learn documentation to give the parameters defining this random-forest regressor.


  • How does the random forest perform compared to the bagging regressor?

  • Compare the varibility of the validation score of the random forest to that of the bagging regressor.

# Define array of complexity coordinate, regressor and options
# for random-forest regression
param_name, param_range = 'n_estimators', np.arange(1, 202, 25)
reg_kwargs_rf = dict()
reg_rf = ensemble.RandomForestRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_rf)

# Evaluate regressor
param_best_rf = evaluate_regressor(
    reg_rf, reg_kwargs_rf, param_name, param_range)
Test R2: 0.62
../_images/14c35ef74a3b11be2a38668dfaf5d84797e1af8b95071e3bbe9c19d2d88bdc10.png ../_images/9a5c1188054dd447ff94e66314e6f15857473060131e9219cc10c4caac411c02.png ../_images/ffa2632f3f7977c8cc4b276e7b23f5a26a2b2ac58e308287bd131504ca5b06ee.png


The following plot represents the mean and standard deviation of the importance given to the features by the trees in the random forest (see Feature importance with a forest of trees in Scikit-learn User guide).


  • How does the the importance of the features vary with climate variable and the month?

  • Compare the importance of the random-forest features with that of the Lasso.

# Plot importance
importances = reg_rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in reg_rf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=index)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
_ = ax.set_ylabel("Mean decrease in impurity")


Voting regressor#

The following code creates a voting regressor with the Lasso, the decision tree and the random forest as base estimators and tests it.


  • Use the Scikit-learn documentation to give the parameters defining this voting regressor.


  • How does the voting regressor perform compared to the base estimators?

# Define array of complexity coordinate, regressor and options
# for voting regression
param_name, param_range = 'estimators', [
    [('Lasso', reg_lasso), ('DecisionTree', reg_dt), ('RandomForest', reg_rf)]
reg_kwargs_vr = dict()
reg_vr = ensemble.VotingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_vr)

# Evaluate regressor
param_best_vr = evaluate_regressor(
    reg_vr, reg_kwargs_vr, param_name, param_range,
Test R2: 0.72
../_images/f8f54f685953e8afd666f55034b5578cd6d7dbed97c19098f8266f335788e032.png ../_images/96c160de9233ebda66d9b71a0bbc221e49eec54f3c684f130e82c974e181feba.png


Stacking regressor#

  • The following code creates a stacking regressor with the Lasso, the decision tree and the random forest as base estimators and tests it.

  • The final regressor is a OLS with positive coefficients.

  • The weights given to the base estimators by the stacking are also given.


  • Use the Scikit-learn documentation to give the parameters defining this stacking regressor.


  • How does the stacking regressor perform compared to the voting regressor? Why?

# Define array of complexity coordinate, regressor and options
# for voting regression
param_name, param_range = 'estimators', [
    [('Lasso', reg_lasso), ('DecisionTree', reg_dt), ('RandomForest', reg_rf)]
reg_kwargs_sr = dict(final_estimator=linear_model.LinearRegression(
reg_sr = ensemble.StackingRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_sr)

# Evaluate regressor
param_best_sr = evaluate_regressor(
    reg_sr, reg_kwargs_sr, param_name, param_range, plot_validation=False)
weights = reg_sr.final_estimator_.coef_
index = [r[0] for r in param_range[0]]
df_weights = pd.Series(weights, index=index)
Test R2: 0.72
Lasso           0.301353
DecisionTree    0.553085
RandomForest    0.093935
dtype: float64
../_images/c1fb261f7482bc65be06e2bb81958ca0204a2cde1e327e0590fa73a175d6ea8d.png ../_images/e093c13be241e5cdabdfee6863433f068be0b85143e691c2be173e6598427918.png


AdaBoost regressor#

The following code:

  • Creates an AdaBoost regressor;

  • Evaluate the regressor over a range of number of estimators;


  • Use the Scikit-learn documentation to give the parameters defining this AdaBoost regressor.


  • How does the AdaBoost compared to the other regressors?

# Define array of complexity coordinate, regressor and options
# for AdaBoost regression
param_name, param_range = 'n_estimators', np.arange(1, 50, 2)
estimator = linear_model.LinearRegression(fit_intercept=True)
reg_kwargs_abr = dict(estimator=estimator)
reg_abr = ensemble.AdaBoostRegressor(
    **{param_name: param_range[0]}, **reg_kwargs_abr)

# Evaluate regressor
param_best_abr = evaluate_regressor(
    reg_abr, reg_kwargs_abr, param_name, param_range)
Test R2: 0.54
../_images/af939a18bf67cdc807789c01819901d3c64ded1a3836f067073a6943ec3c653c.png ../_images/db9200c1ec14b58288051c08bdcb6df2a4fa8e5648e6c5714856183db4f8a1a1.png ../_images/4b2b29e0d67929673b0b4c462dcf056ce258ad9a3846868237214e41a2e878f8.png


Question (optional)

  • Reevaluate your results when changing:

    • The combination of selected climate variables;

    • The degree of the polynomial features;

    • The activation of the monthly factorization.


Question (Optional)

  • Reevaluate your results for different regions.


Question (Optional)

  • Reevaluate your results for a different scoring metric.



Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.

Logo LMD Logo IPSL Logo E4C Logo EP Logo SU Logo ENS Logo CNRS