{ "cells": [ { "cell_type": "markdown", "id": "9bc1a928", "metadata": {}, "source": [ "# Tutorial: Supervised Learning Problem and Least Squares\n", "\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2F02_tutorial_supervised_learning_problem_ols.ipynb)\n", "\n", "Tutorial to the classes [Supervised Learning Problem and Least Squares](02_supervised_learning_problem.ipynb) and [Ordinary Least Squares](03_ordinary_least_squares.ipynb)." ] }, { "cell_type": "markdown", "id": "a9e82f9b", "metadata": {}, "source": [ "
\n", " Tutorial Objectives\n", " \n", "- Read, plot and analyze train data\n", "- Use supervised learning to predict the regional electricity consumption of France in response electric heating based on temperature data\n", "- Test the linear least squares (OLS) model\n", "- Evaluate their performance by estimating their Expected Prediction Errors (EPE) using test data\n", "
" ] }, { "cell_type": "markdown", "id": "083fce0a", "metadata": {}, "source": [ "## Dataset presentation\n", "\n", "- Input:\n", " - 2m-temperature\n", " - Domain: Metropolitan France\n", " - Spatial resolution: regional average\n", " - Time resolution: hourly\n", " - Period: 2014-2021\n", " - Units: °C\n", " - Source: [MERRA-2 reanalysis](https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/)\n", "- Target:\n", " - Electricity demand\n", " - Domain: Metropolitan France\n", " - Spatial resolution: regional sum\n", " - Time resolution: hourly\n", " - Period: 2014-2021\n", " - Units: MWh\n", " - Source: [RTE](https://opendata.reseaux-energies.fr/)" ] }, { "cell_type": "markdown", "id": "2ea67a13", "metadata": {}, "source": [ "## Reading and pre-analysis of the input and output data" ] }, { "cell_type": "markdown", "id": "c36e8f5c", "metadata": {}, "source": [ "### Import data-analysis and plot modules and define paths" ] }, { "cell_type": "code", "execution_count": null, "id": "f665256a", "metadata": {}, "outputs": [], "source": [ "# Path manipulation module\n", "from pathlib import Path\n", "# Numerical analysis module\n", "import numpy as np\n", "# Formatted numerical analysis module\n", "import pandas as pd\n", "# Plot module\n", "import matplotlib.pyplot as plt\n", "# Default colors\n", "RC_COLORS = plt.rcParams['axes.prop_cycle'].by_key()['color']\n", "# Matplotlib configuration\n", "plt.rc('font', size=14)" ] }, { "cell_type": "code", "execution_count": null, "id": "00db59f1", "metadata": {}, "outputs": [], "source": [ "# Set data directory\n", "data_dir = Path('data')\n", "\n", "# Set keyword arguments for pd.read_csv\n", "kwargs_read_csv = dict()\n", "\n", "# Set first and last years\n", "FIRST_YEAR = 2014\n", "LAST_YEAR = 2021\n", "\n", "# Define temperature filepath\n", "temp_filename = 'surface_temperature_merra2_{}-{}.csv'.format(\n", " FIRST_YEAR, LAST_YEAR)\n", "temp_filepath = Path(data_dir, temp_filename)\n", "temp_label = 'Temperature (°C)'\n", "\n", "# Define electricity demand filepath\n", "dem_filename = 'reseaux_energies_demand_demand.csv'\n", "dem_filepath = Path(data_dir, dem_filename)\n", "dem_label = 'Electricity consumption (MWh)'" ] }, { "cell_type": "markdown", "id": "87097463", "metadata": {}, "source": [ "### Reading and plotting the raw temperature data" ] }, { "cell_type": "markdown", "id": "65bf41ef", "metadata": {}, "source": [ "> ***Question (code cells below)***\n", "> - Use `pd.read_csv` with the filepath and appropriate options to make sure to get the column names and the index as dates (`DatetimeIndex`).\n", "> - Use the `resample` method from the data frame to compute daily means.\n", "> - Plot the `'Île-de-France'` daily-mean temperature time series for (a) the whole period, (b) one year, (c) one month in winter and (d) one month in summer on 4 different figures (use `plt.figure`) using `plt.plot` or the `plot` method from data frames (preferably).\n", "> - Use the `mean` and `var` methods to get mean and variance of the daily-mean temperature." ] }, { "cell_type": "code", "execution_count": null, "id": "706b1be4", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "c19747cc", "metadata": {}, "source": [ "### Reading and plotting the demand data" ] }, { "cell_type": "markdown", "id": "5b376b53", "metadata": {}, "source": [ "> ***Question***\n", "> - Same question for the demand but with daily sums instead of daily means" ] }, { "cell_type": "code", "execution_count": null, "id": "571cbb34", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "6a5f17ce", "metadata": {}, "source": [ "### Analyzing the input and target data and their relationships" ] }, { "cell_type": "markdown", "id": "555ff7b5", "metadata": {}, "source": [ "> ***Question (write your answer in text box below)***\n", "> - Describe the seasonality of the temperature in Île-de-France.\n", "> - Are all years the same?\n", "> - Describe the seasonal and weakly demand patterns." ] }, { "cell_type": "markdown", "id": "9b6e901c", "metadata": {}, "source": [ "Answer:" ] }, { "cell_type": "markdown", "id": "36516633", "metadata": {}, "source": [ "> ***Question***\n", "> - Select the temperature and demand data for their largest common period using the `intersection` method of the `index` attribute of the data frames.\n", "> - Represent a scatter plot of the daily demand versus the daily temperature using `plt.scatter`." ] }, { "cell_type": "code", "execution_count": null, "id": "7ddccc23", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "35edda21", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute the correlation between the daily temperature and the daily demand in Île-de-France using `np.corrcoef`.\n", "> - Compute the correlation between the monthly temperature and the monthly demand using the `resample` method.\n", "> - What do you think explains the difference between the daily and the monthly correlation?" ] }, { "cell_type": "code", "execution_count": null, "id": "0a0f6f9d", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "460e3bf4-b1f2-4d0d-b34b-44753ebd6029", "metadata": {}, "source": [ "Answer:" ] }, { "cell_type": "markdown", "id": "38ff4637", "metadata": {}, "source": [ "## Ordinary Least Squares" ] }, { "cell_type": "markdown", "id": "17781be6", "metadata": {}, "source": [ "> ***Question***\n", "> - Perform an OLS with intercept using the entire dataset from the temperature using the formula for the optimal coefficients derived in [Supervised Learning Problem and Least Squares](2_supervised_learning_problem_ols.ipynb) (without Scikit-Learn). To do so:\n", "> - Prepare the input matrix and output vector with the `np.concatenate` function (for instance);\n", "> - Use the matrix-multiplication operator seen in [Introduction](1_introduction.ipynb) and the `np.linalg.inv` function to compute the optimal coefficients and print them.\n", "> - Use the estimated coefficents to predict the target from the input train data.\n", "> - Overlay your prediction to the scatter plot of the train data.\n", "> - Compute the train Mean Squared Error (MSE) and the train coefficient of determination ($R^2$) and print them." ] }, { "cell_type": "code", "execution_count": null, "id": "cde14e85", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "499b3fb1", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute the optimal coefficients using centered input temperatures.\n", "> - Compute the optimal intercept alone using a single-column input matrix.\n", "> - Compare the resulting two estimations of the intercept with the sample mean of the target train data." ] }, { "cell_type": "code", "execution_count": null, "id": "4bf783c5", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "af1100d4", "metadata": {}, "source": [ "> ***Question***\n", "> - Perform an OLS fit with intercept using the entire dataset to predict the demand from the temperature using Scikit-learn. To do so:\n", "> - Import the `linear_model` module from `sklearn` (Scikit-Learn)\n", "> - Define a regressor using `linear_model.LinearRegression` (by default, the regressor is configured to fit an intercept in addition to the features, see `fit_intercept` option)\n", "> - Prepare the input matrix and output vector for the `fit` method of the regressor\n", "> - Apply the `fit` method to the input and output\n", "> - Print the fitted coefficients using the `coef_` attribute of the regressor.\n", "> - Compute the train $R^2$ coefficient using the `score` method of the regressor.\n", "> - Compare the resulting coefficients and score to those obtained above by applying the formulas yourself." ] }, { "cell_type": "code", "execution_count": null, "id": "a7149924", "metadata": { "code_folding": [] }, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "a9159306", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "59513617", "metadata": {}, "source": [ "> ***Question***\n", "> - Define and array of 100 temperatures ranging from -5 to 35°C with `np.linspace`.\n", "> - Make a prediction of the demand for these temperatures using the trained OLS model with the `predict` method of the regressor.\n", "> - Plot this prediction over the scatter plot of the train data.\n", "> - Does the demand prediction seem satisfactory over the whole range of temperatures?" ] }, { "cell_type": "code", "execution_count": null, "id": "5d27a60c", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "7fe141ec", "metadata": {}, "source": [ "Answer:" ] }, { "cell_type": "markdown", "id": "fab5ec4a", "metadata": {}, "source": [ "***\n", "## Credit\n", "\n", "[//]: # \"This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education).\"\n", "Contributors include Bruno Deremble and Alexis Tantet.\n", "Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).\n", "\n", "
\n", "\n", "
\n", " \n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " \"Creative\n", "
This work is licensed under a   Creative Commons Attribution-ShareAlike 4.0 International License.\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }