# Tutorial: Supervised Learning Problem and Least Squares

Tutorial for the classes Supervised Learning Problem and Least Squares, and Ordinary Least Squares.

**Tutorial Objectives**

- Read, plot and analyze train data
- Use supervised learning to predict the regional electricity consumption of France in response to electric heating based on temperature data
- Test the ordinary least squares (OLS) model
- Evaluate its performance by estimating its Expected Prediction Error (EPE) using test data

## Dataset presentation

- Input: 2m-temperature
  - Domain: Metropolitan France
  - Spatial resolution: regional average
  - Time resolution: hourly
  - Period: 2014-2021
  - Units: °C
  - Source: MERRA-2 reanalysis
- Target: Electricity demand
  - Domain: Metropolitan France
  - Spatial resolution: regional sum
  - Time resolution: hourly
  - Period: 2014-2021
  - Units: MWh
  - Source: RTE

## Reading and pre-analysis of the input and output data

### Import data-analysis and plot modules and define paths

```
# Path manipulation module
from pathlib import Path
# Numerical analysis module
import numpy as np
# Formatted numerical analysis module
import pandas as pd
# Plot module
import matplotlib.pyplot as plt
# Default colors
RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']
# Matplotlib configuration
plt.rc('font', size=14)
```

```
# Set data directory
data_dir = Path('data')
# Set keyword arguments for pd.read_csv
kwargs_read_csv = dict()
# Set first and last years
FIRST_YEAR = 2014
LAST_YEAR = 2021
# Define temperature filepath
temp_filename = 'surface_temperature_merra2_{}-{}.csv'.format(
    FIRST_YEAR, LAST_YEAR)
temp_filepath = Path(data_dir, temp_filename)
temp_label = 'Temperature (°C)'
# Define electricity demand filepath
dem_filename = 'reseaux_energies_demand_demand.csv'
dem_filepath = Path(data_dir, dem_filename)
dem_label = 'Electricity consumption (MWh)'
```

### Reading and plotting the raw temperature data

Question (code cells below)

- Use `pd.read_csv` with the filepath and appropriate options to get the column names and the index as dates (`DatetimeIndex`).
- Use the `resample` method of the data frame to compute daily means.
- Plot the `'Île-de-France'` daily-mean temperature time series for (a) the whole period, (b) one year, (c) one month in winter and (d) one month in summer on 4 different figures (use `plt.figure`) using `plt.plot` or, preferably, the `plot` method of data frames.
- Use the `mean` and `var` methods to get the mean and variance of the daily-mean temperature.
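
As a warm-up before working on the real file, the reading and resampling steps can be sketched on a tiny synthetic table. The CSV layout assumed here (timestamps in the first column, one column per region) and the values themselves are illustrative assumptions, not the actual content of the temperature file.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the temperature file (NOT the real data):
# 48 hourly values for two regions, written to CSV text.
hours = pd.date_range('2014-01-01', periods=48, freq='h')
synthetic = pd.DataFrame(
    {'Île-de-France': np.arange(48, dtype=float),
     'Bretagne': np.arange(48, dtype=float) + 2.0},
    index=hours)
csv_text = synthetic.to_csv()

# Read with the first column as index and parse it as dates
df_temp = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Daily means computed from the hourly values
temp_daily = df_temp.resample('D').mean()

# Mean and variance of the daily means
# (pandas uses the unbiased estimator, ddof=1, by default)
temp_mean = temp_daily['Île-de-France'].mean()
temp_var = temp_daily['Île-de-France'].var()
print(type(df_temp.index).__name__, temp_mean, temp_var)
```

With the real data, `temp_daily['Île-de-France'].plot()` or a slice such as `temp_daily.loc['2015', 'Île-de-France'].plot()` would be one way to produce the requested figures.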

```
# answer cell
```

### Reading and plotting the demand data

Question

Repeat the previous question for the demand, using daily sums instead of daily means.

```
# answer cell
```

### Analyzing the input and target data and their relationships

Question (write your answer in the text box below)

- Describe the seasonality of the temperature in Île-de-France.
- Are all years the same?
- Describe the seasonal and weekly demand patterns.

Answer:

Question

- Select the temperature and demand data for their largest common period using the `intersection` method of the `index` attribute of the data frames.
- Represent a scatter plot of the daily demand versus the daily temperature using `plt.scatter`.
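
The index alignment step can be sketched as follows, using two short synthetic daily series whose periods only partly overlap. The dates and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two synthetic daily series covering different periods (NOT the real data)
days_temp = pd.date_range('2014-01-01', '2014-01-10', freq='D')
days_dem = pd.date_range('2014-01-05', '2014-01-15', freq='D')
temp = pd.Series(np.linspace(0.0, 9.0, len(days_temp)), index=days_temp)
dem = pd.Series(np.linspace(100.0, 110.0, len(days_dem)), index=days_dem)

# Dates present in both indexes
common = temp.index.intersection(dem.index)

# Restrict both series to the common period so that they align row by row
x = temp.loc[common]
y = dem.loc[common]
print(common[0], common[-1], len(common))
```

Once both series share the same index, `plt.scatter(x, y)` plots the demand against the temperature sample by sample.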

```
# answer cell
```

Question

- Compute the correlation between the daily temperature and the daily demand in Île-de-France using `np.corrcoef`.
- Compute the correlation between the monthly temperature and the monthly demand using the `resample` method.
- What do you think explains the difference between the daily and the monthly correlation?
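
The two correlation computations might look like the sketch below. The series are synthetic (a shared seasonal cycle plus independent daily noise, an assumption made only to have something to compute on), and monthly means are used for both variables to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (NOT the real data): a common seasonal cycle
# plus independent day-to-day noise in each variable.
rng = np.random.default_rng(0)
days = pd.date_range('2014-01-01', '2015-12-31', freq='D')
season = np.cos(2 * np.pi * (days.dayofyear.to_numpy() - 15) / 365.25)
temp = pd.Series(12.0 - 8.0 * season + rng.normal(0.0, 3.0, len(days)),
                 index=days)
dem = pd.Series(200.0 + 40.0 * season + rng.normal(0.0, 15.0, len(days)),
                index=days)

# np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal term
corr_daily = np.corrcoef(temp, dem)[0, 1]

# Monthly aggregation ('MS' labels each month by its first day)
temp_monthly = temp.resample('MS').mean()
dem_monthly = dem.resample('MS').mean()
corr_monthly = np.corrcoef(temp_monthly, dem_monthly)[0, 1]

print(corr_daily, corr_monthly)
```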

```
# answer cell
```

Answer:

## Ordinary Least Squares

Question

Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature, using the formula for the optimal coefficients derived in Supervised Learning Problem and Least Squares (without Scikit-Learn). To do so:

- Prepare the input matrix and output vector with the `np.concatenate` function (for instance).
- Use the matrix-multiplication operator seen in Introduction and the `np.linalg.inv` function to compute the optimal coefficients and print them.
- Use the estimated coefficients to predict the target from the input train data.
- Overlay your prediction on the scatter plot of the train data.
- Compute the train Mean Squared Error (MSE) and the train coefficient of determination (\(R^2\)) and print them.
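
The steps above can be sketched on synthetic data as follows, assuming the formula referred to is the usual normal-equations solution \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). The sample size, slope, intercept and noise level below are arbitrary illustration values.

```python
import numpy as np

# Synthetic train data (NOT the real data): linear trend plus noise
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-5.0, 30.0, n)
y = 300.0 - 6.0 * x + rng.normal(0.0, 10.0, n)

# Input matrix with a column of ones for the intercept
X = np.concatenate([np.ones((n, 1)), x[:, None]], axis=1)

# Optimal coefficients from the normal equations: [intercept, slope]
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# Prediction on the train inputs
y_pred = X @ beta

# Train MSE and coefficient of determination
mse = np.mean((y - y_pred) ** 2)
r2 = 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta, mse, r2)
```

Inverting \(X^\top X\) explicitly follows the exercise statement; in practice `np.linalg.solve` or `np.linalg.lstsq` is usually preferred for numerical stability.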

```
# answer cell
```

Question

Compute the optimal coefficients using centered input temperatures.

Compute the optimal intercept alone using a single-column input matrix.

Compare the two resulting estimates of the intercept with the sample mean of the target train data.
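
A possible way to set up this comparison, again on synthetic data with arbitrary illustration values, is sketched below.

```python
import numpy as np

# Synthetic train data (NOT the real data)
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-5.0, 30.0, n)
y = 300.0 - 6.0 * x + rng.normal(0.0, 10.0, n)

# (1) OLS with intercept on centered inputs
x_c = x - x.mean()
X_c = np.concatenate([np.ones((n, 1)), x_c[:, None]], axis=1)
beta_c = np.linalg.inv(X_c.T @ X_c) @ X_c.T @ y

# (2) OLS with the intercept alone: a single column of ones
ones = np.ones((n, 1))
beta_0 = np.linalg.inv(ones.T @ ones) @ ones.T @ y

# Both intercepts next to the sample mean of the target
print(beta_c[0], beta_0[0], y.mean())
```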

```
# answer cell
```

Question

Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature using Scikit-Learn. To do so:

- Import the `linear_model` module from `sklearn` (Scikit-Learn).
- Define a regressor using `linear_model.LinearRegression` (by default, the regressor is configured to fit an intercept in addition to the feature coefficients, see the `fit_intercept` option).
- Prepare the input matrix and output vector for the `fit` method of the regressor.
- Apply the `fit` method to the input and output.
- Print the fitted coefficients using the `coef_` attribute of the regressor.
- Compute the train \(R^2\) coefficient using the `score` method of the regressor.
- Compare the resulting coefficients and score to those obtained above by applying the formulas yourself.
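
The Scikit-Learn workflow listed above can be sketched as follows. The data are synthetic, and the manual normal-equations fit is repeated here only so that the block runs on its own.

```python
import numpy as np
from sklearn import linear_model

# Synthetic train data (NOT the real data)
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-5.0, 30.0, n)
y = 300.0 - 6.0 * x + rng.normal(0.0, 10.0, n)

# Scikit-Learn expects a 2D input of shape (n_samples, n_features);
# no column of ones is needed since fit_intercept=True by default
X = x[:, None]
reg = linear_model.LinearRegression()
reg.fit(X, y)

# Fitted intercept, feature coefficients and train R^2
print(reg.intercept_, reg.coef_, reg.score(X, y))

# Manual fit for comparison: [intercept, slope]
X1 = np.concatenate([np.ones((n, 1)), X], axis=1)
beta = np.linalg.inv(X1.T @ X1) @ X1.T @ y
print(beta)
```

Note that the intercept is stored separately in `intercept_`, while `coef_` only holds the coefficients of the features.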

```
# answer cell
```

Answer:

Question

- Define an array of 100 temperatures ranging from -5 to 35°C with `np.linspace`.
- Make a prediction of the demand for these temperatures using the trained OLS model with the `predict` method of the regressor.
- Plot this prediction over the scatter plot of the train data.
- Does the demand prediction seem satisfactory over the whole range of temperatures?
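
Predicting on a regular temperature grid might look like the sketch below. The regressor is trained on synthetic data with arbitrary illustration values, so the resulting curve says nothing about the real demand.

```python
import numpy as np
from sklearn import linear_model

# Synthetic train data (NOT the real data)
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-5.0, 30.0, n)
y = 300.0 - 6.0 * x + rng.normal(0.0, 10.0, n)

reg = linear_model.LinearRegression()
reg.fit(x[:, None], y)

# 100 evenly spaced temperatures from -5 to 35 (both ends included)
x_grid = np.linspace(-5.0, 35.0, 100)

# predict expects the same 2D layout as fit: (n_samples, n_features)
y_grid = reg.predict(x_grid[:, None])
print(x_grid[:3], y_grid[:3])
```

Calling `plt.scatter(x, y)` followed by `plt.plot(x_grid, y_grid)` would overlay the prediction on the train data.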

```
# answer cell
```

Answer:

## Credit

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.