{ "cells": [ { "cell_type": "markdown", "id": "f7b500a8", "metadata": {}, "source": [ "# Tutorial: Overfitting/Underfitting and Bias/Variance\n", "\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2F03_tutorial_overfitting_underfitting_bias_variance.ipynb)\n", "\n", "Tutorial to the class [Overfitting/Underfitting and Bias/Variance](03_overfitting_underfitting_bias_variance.ipynb) " ] }, { "cell_type": "markdown", "id": "884267e1", "metadata": {}, "source": [ "
\n", " Tutorial Objectives\n", " \n", "- Evaluate model performance by estimating the Expected Prediction Errors (EPE) using test data.\n", "- Same as above but with cross-validation.\n", "- Compute and plot learning curves.\n", "- Estimate the irreducible error and bias error.\n", "- Improve the models by modifying the input features.\n", "
" ] }, { "cell_type": "markdown", "id": "859b5e19-f950-4f68-9aa3-324eb7804885", "metadata": {}, "source": [ "
\n", "\n", "We are going to study the temperature variations in the upper part of the equatorial ocean. Our goal is to get an estimate of the temperature as a function of depth $T = f(z)$. We are going to suppose that $f$ is linear such that $T = \\alpha z + \\beta$. \n", "\n", "
" ] }, { "cell_type": "markdown", "id": "79c90125-341f-44c6-879f-cc4ff3eff4f4", "metadata": {}, "source": [ "## Argo profiles\n", "\n", "To measure temperature and salinity in the ocean, the scientific community has deployed Argo Floats in all ocean basins. Argo floats look like cylinders that can adjust their density (like a submarine). They are programmed to go up and down in the ocean and measure temperature, salinity and pressure along their path. The figure below is an illustration of a typical work cycle of such a float. **Each time the Argo float moves up, it records a profile**, as illustrated in the figure below.\n", "\n", "\"Argo\"\n", "Figure from Walczowski et al (2020)" ] }, { "cell_type": "markdown", "id": "e3e75e3f-6a80-405d-9116-91b206d2dc90", "metadata": {}, "source": [ "### Download the data\n", "\n", "We are going to analyze the data taken from one argo float\n", "\n", "> ***Question***\n", "> - Check in the `data` folder that you have a file named `nodc_13858_prof.nc`.\n", "> - If not, you can download it from `https://data.nodc.noaa.gov/argo/gadr/data/aoml/13858/nodc_13858_prof.nc` and place it in the `data` folder. (as in [Tutorial 1](01_introduction.ipynb))\n" ] }, { "cell_type": "markdown", "id": "1d27006e-e595-4284-b361-95b1c191aacd", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "### Information about the data\n", "\n", "- The dataset contains 48 profiles of temperature, salinity recorded at predefined pressure levels near the Equator in the Atlantic ocean : Lat: +2 deg, longitude; Lon: -14 deg. It is a subset of the [Argo dataset](https://argo.ucsd.edu/data/)\n", "\n", "\n", "- In our subset, there are `n_prof = 48` vertical profiles recorded between July 1997 and December 1998. At the equator, there is *no seasonal cycle* so that *all profiles are from the same distribution*. Each profile should be considered as *one coherent sample*.\n", "\n", "\n", "- We are going to suppose that pressure in decibar is equivalent to depth in meters.\n", "\n", "\n", "- We will focus on the upper part of the ocean (upper 100 m) which corresponds to the first 15 measurements of each profile\n" ] }, { "cell_type": "markdown", "id": "a00b50e8", "metadata": {}, "source": [ "## Getting ready\n", "\n", "In the cell below we import the main libraries and load the dataset in the `ds` object. We extract a subsample of pressures (depths) and temperatures that we store in the variables `x_pres` and `y_temp` respectively. These are 2d arrays where each row correspond to one individual profile." ] }, { "cell_type": "code", "execution_count": 1, "id": "c4b6fdce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 48 profiles in this dataset\n", "For each profile, we are going to keep only the first 15 measurements\n" ] } ], "source": [ "# Path manipulation module\n", "from pathlib import Path\n", "# Numerical analysis module\n", "import numpy as np\n", "# Formatted numerical analysis module\n", "import pandas as pd\n", "# Plot module\n", "import matplotlib.pyplot as plt\n", "# read netCDF files\n", "import xarray as xr\n", "\n", "\n", "# Set data directory\n", "data_dir = Path('data')\n", "argo_filename = 'nodc_13858_prof.nc'\n", "argo_filepath = Path(data_dir, argo_filename)\n", "\n", "# name of the temperature and pressure variables in the netcdf file\n", "var_temp = 'temp' \n", "var_pres = 'pres'\n", "\n", "# load the data\n", "ds = xr.open_dataset(argo_filepath)\n", "\n", "n_prof = int(ds['n_prof'][-1]) + 1\n", "print(f'There are {n_prof} profiles in this dataset')\n", "\n", "n_max = 15\n", "print(f'For each profile, we are going to keep only the first {n_max} measurements')\n", "\n", "x_pres = ds[var_pres].values[:,:n_max]\n", "y_temp = ds[var_temp].values[:,:n_max]" ] }, { "cell_type": "markdown", "id": "a7e3d5af-0fe9-4013-b290-63597b97cf8d", "metadata": {}, "source": [ "\n", "> ***Question***\n", "> - Check in the `ds` object that the unit of temperature is degree Celsius and the unit of pressure is decibar. In the remainder of the tutorial, we will consider that the pressure in decibar is equivalent to the depth in meters.\n", "> - Pick the first profile `ip = 0`, and plot the temperature as a function of depth. Don't forget the labels!" ] }, { "cell_type": "code", "execution_count": 2, "id": "a80dd67e-e0c3-4ca2-a8de-5777c8679d5d", "metadata": {}, "outputs": [], "source": [ "# your answer here\n", "ip = 0" ] }, { "cell_type": "markdown", "id": "49f27972-d1d6-4400-a13d-8d20e144609c", "metadata": {}, "source": [ "> ***Question***\n", "> - Do a linear regression to estimate the coefficients $\\alpha$ and $\\beta$ (as explained above) with this profile as training data." ] }, { "cell_type": "code", "execution_count": 3, "id": "26a0bca5-48d5-4634-abd6-b0e4f7b505e8", "metadata": {}, "outputs": [], "source": [ "# your answer here" ] }, { "cell_type": "markdown", "id": "f8b3132b-34f4-4bfe-9e19-19effa285274", "metadata": {}, "source": [ "> ***Question***\n", "> - Plot your linear regression on top of the raw data. Don't forget the labels" ] }, { "cell_type": "code", "execution_count": 4, "id": "22db37d4-cd06-457c-93d7-5e72a1961638", "metadata": {}, "outputs": [], "source": [ "# your answer here" ] }, { "cell_type": "markdown", "id": "c2da0d64-2add-48ac-8657-54b16bc799db", "metadata": {}, "source": [ "> ***Question***\n", "> - What is the training score $R^2$ for this linear regression?\n", "> - What do you think of this value of $R^2$?" ] }, { "cell_type": "code", "execution_count": 5, "id": "9a9d6c7e-770c-4518-b41f-ce3de5cfa7e3", "metadata": {}, "outputs": [], "source": [ "# your answer here" ] }, { "cell_type": "markdown", "id": "4ead2e41-3b88-43a9-bcfb-611f06a2e917", "metadata": {}, "source": [ "your answer here" ] }, { "cell_type": "markdown", "id": "f82e300d-0e37-44fe-901b-fe80c4a9f268", "metadata": {}, "source": [ "> ***Question***\n", "> - Select a random profile `ip2` between `1` and `n_prof -1`.\n", "> - We are going to use this new profile as testing set. What is the testing score $R^2$?\n", "> - Are you overfitting? Justify your answer" ] }, { "cell_type": "code", "execution_count": 6, "id": "65c31e65-6de4-4373-bcc6-8b99221dcb67", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Working with profile number: ip2 = 2\n" ] } ], "source": [ "# your answer here\n", "ip2 = int(np.random.rand()*(n_prof - 2)) + 1\n", "\n", "print(f\"Working with profile number: ip2 = {ip2}\")" ] }, { "cell_type": "markdown", "id": "ccb769a0-1494-4881-915f-ddeca3af1090", "metadata": {}, "source": [ "## Learning curve\n", "\n", "In general, there are two possible reasons that can explain the fact that we are overfitting the data. Either we do not have enough data or the model is to complicated. Since linear regression is one of the simplest possible models, we will focus more on the amount of data needed to extract a general law for this problem. This can be achieved with a learning curve." ] }, { "cell_type": "markdown", "id": "4dc6d098-993b-4d42-8e70-4c94e61b7bfe", "metadata": {}, "source": [ "### What is a sample?\n", "\n", "A key assumption regarding datasets is that each point should be Independent Identically Distributed (also noted [*iid*](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables)). This is in practice rarely true and in fact, for our dataset, this is clearly not the case because the points within a profile are correlated (i.e. not independent). There are then two possible approaches\n", "\n", "- Treat each profile as a coherent block of data (the new sample size is then a whole profile rather than one data point)\n", "- Shuffle the entire data set\n", "\n", "In this tutorial, we will adopt the first strategy to design well balanced **training** and **testing** sets. So keep in mind that for this example ***one sample = one profile***. Note that this type of consideration is not specific to this dataset: if you are aware of a pre-existing structure (seasonality, space proximity, etc), you will need to keep it in mind to build your train and test data.\n", "\n" ] }, { "cell_type": "markdown", "id": "87c34eed-aa34-4fe5-983e-849326797fdf", "metadata": {}, "source": [ "> ***Question***\n", "> - Compute and plot a learning curve. To do so:\n", "> - Partition your dataset into `n_train` samples for training and `n_test = n_prof - n_train` samples for testing.\n", "> - Define a list of train period of increasing lengths\n", "> - Loop over these train periods to iteratively:\n", "> - Select data for this train period\n", "> - Train the model\n", "> - Compute the train error from the train data for the train period\n", "> - Compute the test error from the test data for the test period\n", "> - Save both errors\n", "> - Plot both errors curves\n", "> - Interpret the results." ] }, { "cell_type": "code", "execution_count": 7, "id": "c36cca5e-b19b-447b-bfe5-9adec383ab7a", "metadata": {}, "outputs": [], "source": [ "# your answer here" ] }, { "cell_type": "markdown", "id": "bd55a194-b508-4adb-90f9-c1e1a1c536a2", "metadata": {}, "source": [ "## Estimating the expected prediction error with cross-validation\n", "\n", "> ***Question***\n", "> - Perform a $k$-fold cross-validation of your own by repeating the above estimation of the test error on all samples. To do so:\n", "> - Use the `split` method of a `sklearn.model_selection.KFold` object initialized with the `n_splits` option to get a sequence train and test indices over which to loop.\n", "> - For each pair of train and test indices:\n", "> - Select the train and test data from the input and output data;\n", "> - Fit the model using the train data;\n", "> - Use the fitted model to predict the target from the test inputs;\n", "> - Estimate the $R^2$ from the test output.\n", "> - Average the $R^2$ estimates." ] }, { "cell_type": "code", "execution_count": 8, "id": "c29b9803-8065-4a8a-9f5a-6808bf8e27d6", "metadata": {}, "outputs": [], "source": [ "# your answer here" ] }, { "cell_type": "markdown", "id": "e35e1b0d", "metadata": {}, "source": [ "> ***Question***\n", "> - Verify your results using the `cross_val_score` function of `sklearn.model_selection` with the appropriate value for the `cv` option.\n", "> - How does the $R^2$ estimate from the cross-validation compare to your estimation above?" ] }, { "cell_type": "code", "execution_count": 9, "id": "fadc17f9", "metadata": {}, "outputs": [], "source": [ "# answer cell\n" ] }, { "cell_type": "markdown", "id": "87b2cb6c", "metadata": {}, "source": [ "Answer: " ] }, { "cell_type": "markdown", "id": "d93e83de-bfae-43c5-9812-6409463e09fb", "metadata": {}, "source": [ "## Irreducible error, Bias error\n", "\n", "This data set is special because all measurements are at the same depth. We can use this property to measure specific quantities such as the irreducible error or the bias error.\n", "\n", "> ***Question***\n", "> - In one single figure, plot:\n", "> - All sample points with small dots\n", "> - The mean temperature at each depth\n", "> - your best linear regression\n", "> - Propose a graphical interpretation for the irreducible error and the bias error\n", "> - Where can you read the irreducible error on the learning curve?" ] }, { "cell_type": "code", "execution_count": 10, "id": "86e3cfb1-2ff4-4fbf-801c-e43d27d1e27d", "metadata": {}, "outputs": [], "source": [ "# your answer here" ] }, { "cell_type": "markdown", "id": "a29b2825-2eff-47f1-a632-7c3934793a8e", "metadata": {}, "source": [ "> ***Question***\n", "> - In order to verify that the data is Independent Identically Distributed, plot the variance of the temperature at each depth.\n", "> - What do you conclude?" ] }, { "cell_type": "markdown", "id": "aa6ac18b", "metadata": {}, "source": [ "## Beyond a linear model\n", "\n", "We consider the full depth of the ocean and so do no restrict ourselves to the first 100 m.\n", "\n", "> ***Question***\n", "> - Look at the whole temperature data. Two profiles have a lot of NaNs. Remove them. What would be a good depth to keep as many data points as possible?\n", "> - Do a linear regression for the full dataset and plot it. Do you observe a bias error?\n", "> - Add input features in the form of powers of $z$ and use the linear regression method of scikit-learn to fit a function of the type $f(z) = a_0 + a_1 z + a_2 z^2 ...$. Do you observe a reduction of the bias?\n", "> - Do you think that with this dataset, going from a linear fit to a polynomial fit will increase the variance error? why? " ] }, { "cell_type": "code", "execution_count": 11, "id": "b6a777e1-6877-458b-8980-e6a79e927558", "metadata": {}, "outputs": [], "source": [ "# your answer here" ] }, { "cell_type": "markdown", "id": "c467e84a", "metadata": {}, "source": [ "***\n", "## Credit\n", "\n", "[//]: # \"This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education).\"\n", "Contributors include Bruno Deremble and Alexis Tantet.\n", "Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).\n", "\n", "
\n", "\n", "
\n", " \n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " \"Creative\n", "
This work is licensed under a   Creative Commons Attribution-ShareAlike 4.0 International License.\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }