Tutorial: Supervised Learning Problem and Least Squares#
Tutorial for the classes Supervised Learning Problem and Least Squares and Ordinary Least Squares.
Read, plot and analyze train data
Use supervised learning to predict the regional electricity consumption of France in response to electric heating, based on temperature data
Test the ordinary least squares (OLS) model
Evaluate its performance by estimating its Expected Prediction Error (EPE) using test data
Dataset presentation#
Input:
2m-temperature
Domain: Metropolitan France
Spatial resolution: regional average
Time resolution: hourly
Period: 2014-2021
Units: °C
Source: MERRA-2 reanalysis
Target:
Electricity demand
Domain: Metropolitan France
Spatial resolution: regional sum
Time resolution: hourly
Period: 2014-2021
Units: MWh
Source: RTE
Reading and pre-analysis of the input and output data#
Import data-analysis and plot modules and define paths#
# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)
# Set data directory
data_dir = Path('data')
# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict()
# Set first and last years
FIRST_YEAR = 2014
LAST_YEAR = 2021
# Define temperature filepath
temp_filename = 'surface_temperature_merra2_{}-{}.csv'.format(
    FIRST_YEAR, LAST_YEAR)
temp_filepath = Path(data_dir, temp_filename)
temp_label = 'Temperature (°C)'
# Define electricity demand filepath
dem_filename = 'reseaux_energies_demand_demand.csv'
dem_filepath = Path(data_dir, dem_filename)
dem_label = 'Electricity consumption (MWh)'
Reading and plotting the raw temperature data#
Question (code cells below)
Use pd.read_csv with the filepath and appropriate options to make sure to get the column names and the index as dates (DatetimeIndex).
Use the resample method from the data frame to compute daily means.
Plot the 'Île-de-France' daily-mean temperature time series for (a) the whole period, (b) one year, (c) one month in winter and (d) one month in summer on 4 different figures (use plt.figure) using plt.plot or the plot method from data frames (preferably).
Use the mean and var methods to get the mean and variance of the daily-mean temperature.
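As a hint, the steps above can be sketched on synthetic data. The hourly values, the in-memory CSV buffer and the variable names below are assumptions made for illustration only; with the real file, pd.read_csv would be called on temp_filepath with similar options.

```python
import io

import numpy as np
import pandas as pd

# Synthetic hourly temperatures (hypothetical values) written to an
# in-memory CSV that stands in for the real temperature file.
time = pd.date_range('2014-01-01', periods=24 * 365, freq='h')
rng = np.random.default_rng(0)
seasonal = 12 - 8 * np.cos(2 * np.pi * time.dayofyear.to_numpy() / 365)
df_synth = pd.DataFrame(
    {'Île-de-France': seasonal + rng.normal(0, 2, time.size)}, index=time)
buffer = io.StringIO()
df_synth.to_csv(buffer)
buffer.seek(0)

# Read the CSV: first column used as index and parsed as dates
df_temp = pd.read_csv(buffer, index_col=0, parse_dates=True)

# Daily means computed from the hourly values
df_temp_day = df_temp.resample('D').mean()

# Select the region, then sub-periods with partial date strings
temp_idf = df_temp_day['Île-de-France']
temp_year = temp_idf.loc['2014']
temp_winter_month = temp_idf.loc['2014-01']
temp_summer_month = temp_idf.loc['2014-07']

# Mean and variance of the daily-mean temperature
print('Mean (°C):', temp_idf.mean())
print('Variance (°C²):', temp_idf.var())
```

Calling plt.figure() and then the plot method on each selection (for instance temp_winter_month.plot()) would draw the corresponding figure.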
# answer cell
Reading and plotting the demand data#
Question
Same question for the demand, but with daily sums instead of daily means.
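The only change with respect to the temperature is the aggregation: hourly energy values add up over a day. A minimal sketch on synthetic hourly demand (hypothetical values and variable names, not the RTE data):

```python
import numpy as np
import pandas as pd

# Synthetic hourly demand in MWh (hypothetical values, not the RTE data)
time = pd.date_range('2014-01-01', periods=24 * 14, freq='h')
rng = np.random.default_rng(1)
df_dem = pd.DataFrame(
    {'Île-de-France': 8000 + rng.normal(0, 500, time.size)}, index=time)

# Hourly values are summed over each day instead of averaged
df_dem_day = df_dem.resample('D').sum()
dem_idf = df_dem_day['Île-de-France']

print(dem_idf.head())
print('Mean (MWh):', dem_idf.mean())
print('Variance (MWh²):', dem_idf.var())
```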
# answer cell
Analyzing the input and target data and their relationships#
Question (write your answer in text box below)
Describe the seasonality of the temperature in Île-de-France.
Are all years the same?
Describe the seasonal and weekly demand patterns.
Answer:
Question
Select the temperature and demand data for their largest common period using the intersection method of the index attribute of the data frames.
Draw a scatter plot of the daily demand versus the daily temperature using plt.scatter.
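A possible way to do the selection, sketched on two synthetic daily series that only partly overlap in time. All values and variable names here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily series covering different periods (hypothetical values)
rng = np.random.default_rng(2)
idx_temp = pd.date_range('2014-01-01', '2014-12-31', freq='D')
idx_dem = pd.date_range('2014-07-01', '2015-06-30', freq='D')
temp_day = pd.Series(
    12 - 8 * np.cos(2 * np.pi * idx_temp.dayofyear.to_numpy() / 365)
    + rng.normal(0, 2, idx_temp.size), index=idx_temp)
dem_day = pd.Series(
    200000 + 50000 * np.cos(2 * np.pi * idx_dem.dayofyear.to_numpy() / 365)
    + rng.normal(0, 10000, idx_dem.size), index=idx_dem)

# Dates present in both indices
common_index = temp_day.index.intersection(dem_day.index)

# Restrict both series to the common period
temp_common = temp_day.loc[common_index]
dem_common = dem_day.loc[common_index]

# Scatter plot of the demand versus the temperature
plt.figure()
plt.scatter(temp_common, dem_common, s=10)
plt.xlabel('Temperature (°C)')
plt.ylabel('Electricity consumption (MWh)')
```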
# answer cell
Question
Compute the correlation between the daily temperature and the daily demand in Île-de-France using np.corrcoef.
Compute the correlation between the monthly temperature and the monthly demand using the resample method.
What do you think explains the difference between the daily and the monthly correlation?
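The two computations can be sketched as follows on synthetic daily data. The values, the weekend effect and the choice of monthly means for both series are assumptions of this sketch; the real data may call for monthly sums of the demand instead.

```python
import numpy as np
import pandas as pd

# Synthetic daily data over two years (hypothetical values): demand falls
# with temperature, with a weekend drop and random noise on top.
rng = np.random.default_rng(3)
time = pd.date_range('2014-01-01', '2015-12-31', freq='D')
temp_day = pd.Series(
    12 - 8 * np.cos(2 * np.pi * time.dayofyear.to_numpy() / 365)
    + rng.normal(0, 3, time.size), index=time)
weekend = time.dayofweek.to_numpy() >= 5
dem_day = pd.Series(
    1500 - 30 * temp_day.to_numpy() - 150 * weekend
    + rng.normal(0, 100, time.size), index=time)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# element is the correlation between the two series
corr_day = np.corrcoef(temp_day, dem_day)[0, 1]

# Monthly aggregation ('MS' labels each month by its first day)
temp_month = temp_day.resample('MS').mean()
dem_month = dem_day.resample('MS').mean()
corr_month = np.corrcoef(temp_month, dem_month)[0, 1]

print('Daily correlation:', corr_day)
print('Monthly correlation:', corr_month)
```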
# answer cell
Answer:
Ordinary Least Squares#
Question
Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature, using the formula for the optimal coefficients derived in Supervised Learning Problem and Least Squares (without Scikit-Learn). To do so:
Prepare the input matrix and output vector with the np.concatenate function (for instance).
Use the matrix-multiplication operator seen in Introduction and the np.linalg.inv function to compute the optimal coefficients and print them.
Use the estimated coefficients to predict the target from the input train data.
Overlay your prediction on the scatter plot of the train data.
Compute the train Mean Squared Error (MSE) and the train coefficient of determination (\(R^2\)) and print them.
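A minimal sketch of these steps, assuming the standard normal-equations solution \(\hat{\beta} = (X^\top X)^{-1} X^\top y\) and using synthetic data in place of the temperature and demand series. The variable names and the synthetic values are hypothetical.

```python
import numpy as np

# Synthetic train data (hypothetical values): demand decreasing with temperature
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-5, 30, n)
y = 2000 - 40 * x + rng.normal(0, 100, n)

# Input matrix with a column of ones for the intercept, shape (n, 2)
X = np.concatenate([np.ones((n, 1)), x[:, np.newaxis]], axis=1)

# Optimal coefficients from the normal equations
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print('Intercept, slope:', beta)

# Prediction of the target from the train inputs
y_pred = X @ beta

# Train Mean Squared Error and coefficient of determination
mse = np.mean((y - y_pred) ** 2)
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)
print('Train MSE:', mse)
print('Train R2:', r2)
```

To overlay the prediction, plt.scatter(x, y) followed by plt.plot(x, y_pred) on the same figure is one option.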
# answer cell
Question
Compute the optimal coefficients using centered input temperatures.
Compute the optimal intercept alone using a single-column input matrix.
Compare the resulting two estimates of the intercept with the sample mean of the target train data.
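The two fits can be set up as in the sketch below, again on synthetic data. The helper ols is a hypothetical name introduced here for convenience, not a function from the course.

```python
import numpy as np

# Synthetic train data (hypothetical values)
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-5, 30, n)
y = 2000 - 40 * x + rng.normal(0, 100, n)


def ols(X, y):
    """Return the OLS coefficients from the normal equations."""
    return np.linalg.inv(X.T @ X) @ X.T @ y


ones = np.ones((n, 1))

# Fit with intercept on centered input temperatures
x_centered = x - x.mean()
X_centered = np.concatenate([ones, x_centered[:, np.newaxis]], axis=1)
beta_centered = ols(X_centered, y)

# Fit of the intercept alone (single column of ones)
beta_alone = ols(ones, y)

print('Intercept (centered inputs):', beta_centered[0])
print('Intercept (alone):', beta_alone[0])
print('Sample mean of the target:', y.mean())
```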
# answer cell
Question
Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature using Scikit-learn. To do so:
Import the linear_model module from sklearn (Scikit-Learn).
Define a regressor using linear_model.LinearRegression (by default, the regressor is configured to fit an intercept in addition to the features, see the fit_intercept option).
Prepare the input matrix and output vector for the fit method of the regressor.
Apply the fit method to the input and output.
Print the fitted coefficients using the coef_ attribute of the regressor.
Compute the train \(R^2\) coefficient using the score method of the regressor.
Compare the resulting coefficients and score to those obtained above by applying the formulas yourself.
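These steps can be sketched as follows on synthetic data (hypothetical values and variable names). Note that the fit method expects a two-dimensional input of shape (n_samples, n_features), and that the intercept is stored separately in the intercept_ attribute.

```python
import numpy as np
from sklearn import linear_model

# Synthetic train data (hypothetical values)
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-5, 30, n)
y = 2000 - 40 * x + rng.normal(0, 100, n)

# Input matrix of shape (n_samples, n_features); no column of ones is
# needed since the regressor fits the intercept itself by default
X = x[:, np.newaxis]

# Define and fit the regressor
reg = linear_model.LinearRegression()
reg.fit(X, y)

# Fitted coefficients and train R2
print('Slope:', reg.coef_)
print('Intercept:', reg.intercept_)
r2 = reg.score(X, y)
print('Train R2:', r2)
```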
# answer cell
Answer:
Question
Define an array of 100 temperatures ranging from -5 to 35°C with np.linspace.
Make a prediction of the demand for these temperatures using the trained OLS model with the predict method of the regressor.
Plot this prediction over the scatter plot of the train data.
Does the demand prediction seem satisfactory over the whole range of temperatures?
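A sketch of the mechanics, on synthetic data that is linear by construction. The values and variable names are hypothetical, and the synthetic data says nothing about the shape of the real temperature-demand relationship, which is what the question asks you to judge.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Synthetic train data (hypothetical values, linear by construction)
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(-5, 30, n)
y = 2000 - 40 * x + rng.normal(0, 100, n)

# Train the OLS model
reg = linear_model.LinearRegression()
reg.fit(x[:, np.newaxis], y)

# 100 temperatures from -5 to 35°C, both ends included
x_grid = np.linspace(-5, 35, 100)

# Predict the demand; the input must have shape (n_samples, n_features)
y_grid = reg.predict(x_grid[:, np.newaxis])

# Prediction drawn over the scatter plot of the train data
plt.figure()
plt.scatter(x, y, s=10, label='Train data')
plt.plot(x_grid, y_grid, color='k', label='OLS prediction')
plt.xlabel('Temperature (°C)')
plt.ylabel('Electricity consumption (MWh)')
plt.legend()
```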
# answer cell
Answer:
Credit#
Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.