Tutorial: Supervised Learning Problem and Least Squares#


This tutorial accompanies the classes Supervised Learning Problem and Least Squares and Ordinary Least Squares.

Tutorial Objectives
  • Read, plot and analyze train data

  • Use supervised learning to predict the regional electricity consumption of France in response to electric heating, based on temperature data

  • Test the ordinary least squares (OLS) model

  • Evaluate its performance by estimating its Expected Prediction Error (EPE) using test data

Dataset presentation#

  • Input:

    • 2m-temperature

      • Domain: Metropolitan France

      • Spatial resolution: regional average

      • Time resolution: hourly

      • Period: 2014-2021

      • Units: °C

      • Source: MERRA-2 reanalysis

  • Target:

    • Electricity demand

      • Domain: Metropolitan France

      • Spatial resolution: regional sum

      • Time resolution: hourly

      • Period: 2014-2021

      • Units: MWh

      • Source: RTE

Reading and pre-analysis of the input and output data#

Import data-analysis and plot modules and define paths#

# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)
# Set data directory
data_dir = Path('data')

# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict()

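# Set first and last years
# (FIRST_YEAR is assumed from the 2014-2021 period given in the dataset presentation)
FIRST_YEAR = 2014
LAST_YEAR = 2021

# Define temperature filepath
temp_filename = 'surface_temperature_merra2_{}-{}.csv'.format(
    FIRST_YEAR, LAST_YEAR)
temp_filepath = Path(data_dir, temp_filename)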
temp_label = 'Temperature (°C)'

# Define electricity demand filepath
dem_filename = 'reseaux_energies_demand_demand.csv'
dem_filepath = Path(data_dir, dem_filename)
dem_label = 'Electricity consumption (MWh)'

Reading and plotting the raw temperature data#

Question (code cells below)

  • Use pd.read_csv with the filepath and appropriate options so that the column names are read from the file and the index is parsed as dates (DatetimeIndex).

  • Use the resample method from the data frame to compute daily means.

  • Plot the 'Île-de-France' daily-mean temperature time series for (a) the whole period, (b) one year, (c) one month in winter and (d) one month in summer on 4 different figures (use plt.figure) using plt.plot or, preferably, the plot method from data frames.

  • Use the mean and var methods to get mean and variance of the daily-mean temperature.
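The reading and resampling steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the tutorial's solution: the in-memory CSV, the hourly values and the second region name are made up, and the sketch assumes that the first column of the file holds the timestamps, which may not match the layout of the actual temperature file.

```python
import io

import numpy as np
import pandas as pd

# Synthetic hourly temperatures for two days (hypothetical values, not MERRA-2 data)
rng = np.random.default_rng(0)
hours = pd.date_range('2014-01-01', periods=48, freq=pd.Timedelta(hours=1))
df_raw = pd.DataFrame(
    {'Île-de-France': rng.normal(5.0, 2.0, size=48),
     'Bretagne': rng.normal(8.0, 2.0, size=48)},
    index=hours)

# Write to an in-memory CSV to mimic a file whose first column holds timestamps
csv_text = df_raw.to_csv()

# index_col=0 uses the first column as index; parse_dates=True parses it as dates
df_temp = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Daily means computed from the hourly values
df_temp_daily = df_temp.resample('D').mean()

# Mean and variance of the daily means (pandas var uses ddof=1 by default)
temp_mean = df_temp_daily['Île-de-France'].mean()
temp_var = df_temp_daily['Île-de-France'].var()

print(df_temp_daily)
print(temp_mean, temp_var)
```

With the real file, the same options could be collected in kwargs_read_csv and passed as pd.read_csv(temp_filepath, **kwargs_read_csv).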

# answer cell

Reading and plotting the demand data#


  • Same question for the demand but with daily sums instead of daily means

# answer cell

Analyzing the input and target data and their relationships#

Question (write your answer in text box below)

  • Describe the seasonality of the temperature in Île-de-France.

  • Are all years the same?

  • Describe the seasonal and weekly demand patterns.



  • Select the temperature and demand data for their largest common period using the intersection method of the index attribute of the data frames.

  • Represent a scatter plot of the daily demand versus the daily temperature using plt.scatter.
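Selecting the common period can be sketched as below. The two small data frames and their date ranges are hypothetical; only the use of the intersection method of the index comes from the question.

```python
import numpy as np
import pandas as pd

# Two hypothetical daily data frames covering partly overlapping periods
idx_temp = pd.date_range('2014-01-01', '2014-01-10', freq='D')
idx_dem = pd.date_range('2014-01-05', '2014-01-15', freq='D')
df_temp = pd.DataFrame(
    {'Île-de-France': np.linspace(0.0, 9.0, len(idx_temp))}, index=idx_temp)
df_dem = pd.DataFrame(
    {'Île-de-France': np.linspace(500.0, 400.0, len(idx_dem))}, index=idx_dem)

# Dates present in both indexes
common_index = df_temp.index.intersection(df_dem.index)

# Restrict both data frames to the common dates
df_temp_common = df_temp.loc[common_index]
df_dem_common = df_dem.loc[common_index]

print(common_index[0], common_index[-1], len(common_index))
```

The two aligned columns can then be passed directly to plt.scatter as the x and y arguments.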

# answer cell


  • Compute the correlation between the daily temperature and the daily demand in Île-de-France using np.corrcoef.

  • Compute the correlation between the monthly temperature and the monthly demand using the resample method.

  • What do you think explains the difference between the daily and the monthly correlation?
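The two correlation computations can be sketched on synthetic series. Everything about the data here is an assumption made for illustration (a seasonal cycle plus independent day-to-day noise, and a demand that decreases linearly with temperature), so the numbers say nothing about the real dataset; the sketch only shows the calls and one mechanism by which time averaging can change a correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: seasonal cycle plus independent daily noise
rng = np.random.default_rng(0)
index = pd.date_range('2014-01-01', '2017-12-31', freq='D')
day = np.arange(len(index))
temp = 12.0 - 8.0 * np.cos(2 * np.pi * day / 365.25) + rng.normal(0, 3, len(index))
dem = 1000.0 - 20.0 * temp + rng.normal(0, 100, len(index))
s_temp = pd.Series(temp, index=index)
s_dem = pd.Series(dem, index=index)

# Daily correlation: off-diagonal element of the 2x2 correlation matrix
corr_daily = np.corrcoef(s_temp.to_numpy(), s_dem.to_numpy())[0, 1]

# Monthly mean temperature and monthly total demand ('MS' labels bins by month start)
temp_monthly = s_temp.resample('MS').mean()
dem_monthly = s_dem.resample('MS').sum()
corr_monthly = np.corrcoef(temp_monthly.to_numpy(), dem_monthly.to_numpy())[0, 1]

print(corr_daily, corr_monthly)
```

In this synthetic setting the monthly aggregation averages out the independent daily noise, so the monthly correlation is stronger in magnitude than the daily one. Whether the same explanation holds for the real data is for you to assess.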

# answer cell


Ordinary Least Squares#


  • Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature, using the formula for the optimal coefficients derived in Supervised Learning Problem and Least Squares (without Scikit-Learn). To do so:

    • Prepare the input matrix and output vector with the np.concatenate function (for instance);

    • Use the matrix-multiplication operator seen in Introduction and the np.linalg.inv function to compute the optimal coefficients and print them.

  • Use the estimated coefficients to predict the target from the input train data.

  • Overlay your prediction on the scatter plot of the train data.

  • Compute the train Mean Squared Error (MSE) and the train coefficient of determination (\(R^2\)) and print them.
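The steps above can be sketched with the standard normal-equation formula, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\), assuming this is the formula derived in class. The temperatures and demands below are synthetic and hypothetical, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical train data: demand decreasing linearly with temperature plus noise
rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(-5.0, 25.0, n)
dem = 1000.0 - 20.0 * temp + rng.normal(0, 30, n)

# Input matrix with a column of ones (intercept) and a column of temperatures
X = np.concatenate([np.ones((n, 1)), temp[:, np.newaxis]], axis=1)
y = dem

# Optimal coefficients from the normal equations: (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# Prediction on the train inputs
y_pred = X @ beta

# Train mean squared error and coefficient of determination
mse = np.mean((y - y_pred) ** 2)
r2 = 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print('intercept, slope:', beta)
print('MSE:', mse, 'R2:', r2)
```

Explicitly inverting \(X^\top X\) is done here because the exercise asks for it; for ill-conditioned problems, np.linalg.lstsq or np.linalg.solve is usually a more stable choice.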

# answer cell


  • Compute the optimal coefficients using centered input temperatures.

  • Compute the optimal intercept alone using a single-column input matrix.

  • Compare the resulting two estimations of the intercept with the sample mean of the target train data.
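A minimal sketch of this comparison, again on hypothetical synthetic data: with centered inputs, and with a single column of ones, the least-squares intercept coincides with the sample mean of the target.

```python
import numpy as np

# Hypothetical train data
rng = np.random.default_rng(1)
n = 200
temp = rng.uniform(-5.0, 25.0, n)
dem = 1000.0 - 20.0 * temp + rng.normal(0, 30, n)

# (a) OLS with intercept on centered temperatures
temp_centered = temp - temp.mean()
X_c = np.concatenate([np.ones((n, 1)), temp_centered[:, np.newaxis]], axis=1)
beta_c = np.linalg.inv(X_c.T @ X_c) @ X_c.T @ dem

# (b) Intercept alone: the input matrix is a single column of ones
X_1 = np.ones((n, 1))
beta_1 = np.linalg.inv(X_1.T @ X_1) @ X_1.T @ dem

print('intercept with centered input:', beta_c[0])
print('intercept alone:', beta_1[0])
print('sample mean of the target:', dem.mean())
```

Centering makes the column of ones orthogonal to the temperature column, which is why the intercept decouples from the slope in case (a).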

# answer cell


  • Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature using Scikit-learn. To do so:

    • Import the linear_model module from sklearn (Scikit-Learn)

    • Define a regressor using linear_model.LinearRegression (by default, the regressor is configured to fit an intercept in addition to the features, see fit_intercept option)

    • Prepare the input matrix and output vector for the fit method of the regressor

    • Apply the fit method to the input and output

  • Print the fitted coefficients using the coef_ attribute of the regressor.

  • Compute the train \(R^2\) coefficient using the score method of the regressor.

  • Compare the resulting coefficients and score to those obtained above by applying the formulas yourself.
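The Scikit-learn workflow described above can be sketched as follows, on hypothetical synthetic data. The comparison with the normal equations at the end is included only to show that both routes give the same coefficients in this setting.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical train data
rng = np.random.default_rng(2)
n = 200
temp = rng.uniform(-5.0, 25.0, n)
dem = 1000.0 - 20.0 * temp + rng.normal(0, 30, n)

# Scikit-learn expects a 2D input of shape (n_samples, n_features)
X = temp[:, np.newaxis]
y = dem

# By default fit_intercept=True, so no column of ones is needed
reg = linear_model.LinearRegression()
reg.fit(X, y)

# Fitted slope(s) and intercept are stored in separate attributes
print('coef_:', reg.coef_, 'intercept_:', reg.intercept_)

# Train coefficient of determination
r2 = reg.score(X, y)
print('R2:', r2)

# Same fit from the normal equations, for comparison
X1 = np.concatenate([np.ones((n, 1)), X], axis=1)
beta = np.linalg.inv(X1.T @ X1) @ X1.T @ y
print('normal equations:', beta)
```

Note that coef_ contains only the feature coefficients; the intercept is available separately as intercept_.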

# answer cell



  • Define an array of 100 temperatures ranging from -5 to 35°C with np.linspace.

  • Make a prediction of the demand for these temperatures using the trained OLS model with the predict method of the regressor.

  • Plot this prediction over the scatter plot of the train data.

  • Does the demand prediction seem satisfactory over the whole range of temperatures?
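Predicting over a temperature grid can be sketched as below. The train data are synthetic, with a hypothetical demand that responds to temperature only below 15°C; this shape is an assumption chosen for illustration, not a description of the real data.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical train data: demand rises only when temperature drops below 15°C
rng = np.random.default_rng(3)
n = 300
temp = rng.uniform(-5.0, 35.0, n)
dem = 700.0 + 20.0 * np.maximum(15.0 - temp, 0.0) + rng.normal(0, 20, n)

# Fit the OLS model on the train data
reg = linear_model.LinearRegression()
reg.fit(temp[:, np.newaxis], dem)

# Grid of 100 temperatures from -5 to 35°C (both endpoints included)
temp_grid = np.linspace(-5.0, 35.0, 100)

# predict expects the same 2D layout as fit
dem_grid = reg.predict(temp_grid[:, np.newaxis])

print(temp_grid[:3], dem_grid[:3])
```

Since the model is linear, dem_grid is a straight line in temperature: in this synthetic case it keeps decreasing at high temperatures even though the simulated demand is flat there. Plotting dem_grid against temp_grid over the scatter plot makes such a mismatch visible.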

# answer cell



Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.
