{ "cells": [ { "cell_type": "markdown", "id": "92234cfe", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regularization, Model Selection and Evaluation\n", "\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2F04_regularization_selection_validation.ipynb)" ] }, { "cell_type": "markdown", "id": "42edfa17", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " Prerequisites\n", " \n", "- [Elements of Probability Theory](appendix_elements_of_probability_theory.ipynb) \n", "- [Understand the overfitting-underfitting and bias-variance tradeoffs](3_overfitting_underfitting_bias_variance)\n", "
" ] }, { "cell_type": "markdown", "id": "75768246", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " Learning Outcomes\n", " \n", "- Estimate the expected prediction error using cross-validation\n", "- Use regularization to prevent overfitting\n", "- Be aware of underlying statistical assumptions (identity, independence)\n", "
" ] }, { "cell_type": "markdown", "id": "56a78dba", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Cross-Validation\n", "\n", "- Data is often scarce so that living aside test data to estimat the Prediction Error (PE) is a problem\n", "- *Cross-validation* is used to estimate the Expected PE (EPE), while avoiding living aside test data\n", "- It uses part of the data to train, part of the data to test, repeating the operation on different subset selections" ] }, { "cell_type": "markdown", "id": "146b2247", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### $K$-Fold Cross Validation\n", "\n", "1. Slice the train data into $K$ roughly equal-sized parts;\n", "2. Keep the $K$th part for validation and train on the $K - 1$ other parts;\n", "3. Compute the prediction error $K$ using the $K$th part;\n", "4. Repeat the operation for all $K$ parts;\n", "5. Average the prediction errors to estimate the EPE.\n", "\n", "" ] }, { "cell_type": "markdown", "id": "867149e3", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "
\n", " Warning\n", " \n", "Cross-validation allows one to estimate **expected** prediction error rather than the prediction error conditioned on a particular training set (see Chap. 7.12 in Hastie *et al.* 2009).\n", "
" ] }, { "cell_type": "markdown", "id": "d19b6666", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Choosing the number $K$ of parts\n", "\n", "- **$K$ small** (e.g. $K = 2$):\n", "
biases the estimate towards large errors (CV training sets of size $(K - 1) / K \times N$ are much smaller than the original training set of size $N$);\n", "- **$K$ large** (e.g. $K = N$, leave-one-out CV):\n", "
approximately unbiased estimate of the EPE, but with high variance (the $N$ \"training sets\" are very similar to one another);\n", "- **$K = 5$ or 10**:\n", "
recommended by Hastie *et al.* (2009) as a good compromise, but this depends on the case study." ] }, { "cell_type": "markdown", "id": "5224f338", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Adapting Cross-Validation to the Data Structure\n", "\n", "For instance:\n", "- By choosing the number of parts $K$ to adapt to cycles (e.g. avoid splitting years)\n", "- By grouping (e.g. if measurements for different people should be kept together)\n", "- By shuffling splits (e.g. if the ordering of the data is special)\n", "- By taking serial correlations into account in time series.\n", "\n", "See https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators" ] }, { "cell_type": "markdown", "id": "f3d01d2f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regularizing to Avoid Overfitting: Linear Models" ] }, { "cell_type": "markdown", "id": "a87795b4", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Linear Models Can Overfit\n", "\n", "- Linear models are simpler than alternatives\n", "- $\\rightarrow$ they tend to overfit less than alternatives\n", "- They often even underfit when:\n", " - $p$ is small\n", " - the problem is not linearly separable\n", " " ] }, { "cell_type": "markdown", "id": "52ef481a", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- But linear models can also overfit when:\n", " - $N$ is small\n", " - there are many uninformative features (features that depend strongly on other features)" ] }, { "cell_type": "markdown", "id": "849a2a8e", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Effect of uninformative features\n", "\n", "Recall that, if we assume that the generating model is $Y = \\boldsymbol{X}^\\top \\boldsymbol{\\beta} + \\epsilon$, where the observations of $\\epsilon$ are *uncorrelated*, with *mean zero* and *constant variance* $\\sigma^2$, then\n", "\n", "$$\n", "\\begin{aligned}\n", "\\mathbb{E}(\\hat{\\boldsymbol{\\beta}} | \\mathbf{X}) &= \\boldsymbol{\\beta}\\\\\n", "\\mathrm{Var}(\\hat{\\boldsymbol{\\beta}} | \\mathbf{X}) &= \\sigma^2 (\\mathbf{X}^\\top \\mathbf{X})^{-1}.\n", "\\end{aligned}\n", "$$" ] }, { "cell_type": "markdown", "id": "210a3dd5", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**2D example:** Assume that\n", "$$\n", "X_1 = c X_0 + \\sqrt{1 - c^2} Z\\\\\n", "\\mathrm{~with~}\n", "\\mathrm{Var}(X_0) = \\mathrm{Var}(Z) = 1\n", "\\mathrm{~and~}\n", "\\mathrm{Cov}(Z, X_0) = 0,\\\\\n", "\\mathrm{Cov}(X_0, Y) = 1\n", "\\mathrm{~and~}\n", "\\mathrm{Cov}(Z, Y) = 0.\n", "$$" ] }, { "cell_type": "markdown", "id": "12660c41", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "\\mathrm{Then,~}\n", "\\frac{\\mathbf{X}^\\top \\mathbf{X}}{N - 1} \\to\n", "\\begin{pmatrix}\n", " 1 & c\\\\\n", " c & 1\n", "\\end{pmatrix}\n", "\\mathrm{~and~}\n", "\\frac{\\mathbf{X}^\\top \\mathbf{y}}{N - 1} \\to\n", "\\begin{pmatrix}\n", " 1\\\\\n", " c\n", "\\end{pmatrix} \\mathrm{~when~} N \\to \\infty.\n", "$$" ] }, { "cell_type": "markdown", "id": "9d887a4c", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "\\mathrm{Thus,~}\n", "\\left(\\frac{\\mathbf{X}^\\top \\mathbf{X}}{N - 1}\\right)^{-1} \\to \\frac{1}{1 - c^2}\n", "\\begin{pmatrix}\n", " 1 & -c\\\\\n", " -c & 1\n", "\\end{pmatrix}\n", "\\mathrm{~and~}\n", "\\hat{\\boldsymbol{\\beta}} \\to\n", "\\begin{pmatrix}\n", " 1\\\\\n", " 0\n", "\\end{pmatrix}.\n", "$$" ] }, { "cell_type": "markdown", "id":
"85368880", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Example interpretation:**\n", "\n", "- Absolute correlation large $\\rightarrow$ large coefficient variance;\n", "- Large positive correlation $\\rightarrow$ one coefficient large at expense of other.\n", "\n", "**To go further:**\n", "\n", "Interpret the role of covariances between inputs in OLS for $M > 2$ using an eigendecomposition of the covariance matrix. " ] }, { "cell_type": "markdown", "id": "635c72fb", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Problem:** Many correlated variables $\\rightarrow$ high variance.\n", "\n", "**Origin of the problem:** A widely large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin $\\rightarrow$ large coefficients.\n", "\n", "
\n", " Idea: weight-decay penalization\n", " \n", "Limit the size of the coefficients to reduce the variance of the predictions.\n", "
" ] }, { "cell_type": "markdown", "id": "dbac8ae2", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Ridge Regression\n", "\n", "- A very common form of regularization for linear regression;\n", "- Shrink coefficients by imposing a penalty on their size;\n", "- Minimize a penalized RSS:\n", "\n", "\\begin{equation}\n", "\\hat{\\boldsymbol{\\beta}}^\\mathrm{ridge} = \\underset{\\boldsymbol{\\beta}}{\\mathrm{argmin}} \\left\\{\\sum_{i = 1}^N \\left(y_i - \\beta_0 - \\sum_{j = 1}^p x_{ij} \\beta_j \\right)^2 + \\lambda \\sum_{j = 1}^p \\beta_j^2 \\right\\}.\n", "\\end{equation}\n", "\n", "$\\lambda \\ge 0$ controls the amount of shrinkage." ] }, { "cell_type": "markdown", "id": "4747739a", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Ridge Regression as a Constrained Optimization\n", "\n", "For any $\\lambda$ there exists a $t \\ge 0$ such that the ridge regression problem is equivalent to:\n", "\n", "\\begin{equation}\n", "\\hat{\\boldsymbol{\\beta}}^\\mathrm{ridge} = \\underset{\\boldsymbol{\\beta}}{\\mathrm{argmin}} \\sum_{i = 1}^N \\left(y_i - \\beta_0 - \\sum_{j = 1}^p x_{ij} \\beta_j \\right)^2\\\\\n", "\\mathrm{subject~to~} \\sum_{j = 1}^p \\beta_j^2 \\le t.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "d51f0961", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " Warning 1\n", " \n", "Contrary to OLS and like in many regularizations, Ridge solutions are not equivarient under scaling:\n", " \n", "- If different inputs represent different kinds of variables, one normally standardizes the inputs before solving.\n", "- If different inputs represent measurements of the same variable in different situations (location, epoch...), it may be interesting to keep the scales.\n", "
" ] }, { "cell_type": "markdown", "id": "f199c51a", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " Warning 2\n", " \n", "- Penalization of the intercept would make the procedure depend on the origin\n", "chosen for $Y$.\n", "- After centering the inputs by replacing each $x_{ij}$ by $x_{ij} - \\bar{x}_j$, OLS gives $\\hat{\\beta}_0 = \\bar{y}$.\n", "- We can thus first solve the ridge for centered inputs and outputs and then add $\\hat{\\beta}_0 = \\bar{y}$ to the predictions.\n", "
" ] }, { "cell_type": "markdown", "id": "0501ec1c", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> ***Question (optional)***\n", "> - Show that the solution to the ridge-penalized RSS is\n", "> \\begin{equation}\n", " \\hat{\\boldsymbol{\\beta}}^\\mathrm{ridge} = \\left(\\mathbf{X}^\\top \\mathbf{X} + \\lambda \\mathbf{I}\\right)^{-1} \\left(\\mathbf{X}^\\top \\mathbf{y}\\right),\n", "\\end{equation}\n", "> where $\\mathbf{I}$ is the $p\\times p$ identity matrix.\n", "\n", "> ***Question***\n", "> - How does this formula differ from the OLS solution for the coefficients?\n", "> - When are these solutions unique?" ] }, { "cell_type": "markdown", "id": "2e090895", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> ***Question (optional)***\n", "> - Show that in the case of **orthonormal inputs**, the ridge estimates are just a scaled version of the OLS estimates: $\\hat{\\boldsymbol{\\beta}}^\\mathrm{ridge} = \\hat{\\boldsymbol{\\beta}} / (1 + \\lambda)$.\n", "\n", "> ***Question***\n", "> - Use this formula to interpret the effect of the Ridge regularization on the coefficients for orthonormal inputs." ] }, { "cell_type": "markdown", "id": "a5512e8e", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Singular Value Decomposition Interpretation (optional)\n", "\n", "Let the SVD of $\\mathbf{X}$ be $\\mathbf{X} = \\mathbf{U} \\mathbf{D} \\mathbf{V}^\\top$ with:\n", "- $\\mathbf{U}$ an $N \\times p$ orthogonal matrix with columns spanning the column space of $\\mathbf{X}$;\n", "- $\\mathbf{V}$ a $p \\times p$ orthogonal matrix with columns spanning the row space of $\\mathbf{X}$;\n", "- $\\mathbf{D}$ a $p \\times p$ diagonal matrix with diagonal entries $d_1 \\ge d_2 \\ge \\cdots d_p \\ge 0$ called the singular values of $\\mathbf{X}$." ] }, { "cell_type": "markdown", "id": "60257172", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> ***Question (optional)***\n", "> - Show that the OLS estimates are such that $\\mathbf{X} \\hat{\\boldsymbol{\\beta}} = \\mathbf{U} \\mathbf{U}^\\top \\mathbf{y}$.\n", "> - Show that the ridge estimates are such that $\\mathbf{X} \\hat{\\boldsymbol{\\beta}}^\\mathrm{ridge} = \\sum_{j = 1}^p \\mathbf{u}_j \\frac{d_j^2}{d_j^2 + \\lambda} \\mathbf{u}_j^\\top \\mathbf{y}$." ] }, { "cell_type": "markdown", "id": "e83baa27", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> ***Question***\n", "> - Interpret how the ridge shrinks the coordinates of $\\mathbf{y}$ with respect to the orthonormal basis $\\mathbf{U}$.\n", "> - What is the role of $\\mathrm{df}(\\lambda) := \\mathrm{tr}\\left[\\mathbf{X}(\\mathbf{X}^\\top \\mathbf{X} + \\lambda \\mathbf{I})^{-1} \\mathbf{X}^\\top\\right] = \\sum_{j = 1}^p \\frac{d_j^2}{d_j^2 + \\lambda}$." ] }, { "cell_type": "markdown", "id": "f8f412aa", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Regularization on a Simple Example\n", "\n", "\n", "\n", "- Small training set\n", "- Fit a linear model without regularization" ] }, { "cell_type": "markdown", "id": "55cba17d", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Regularization on a Simple Example\n", "\n", "\n", "\n", "- Small training set\n", "- Fit a linear model without regularization\n", "- Training points sampled at random\n", "- Can overfit if the data is noisy!" 
] }, { "cell_type": "markdown", "id": "de6d7f9c", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Regularization on a Simple Example: Random Training Sets\n", "\n", "| | | |\n", "|---|---|---|\n", "| | | |" ] }, { "cell_type": "markdown", "id": "96135a45", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Bias-Variance Tradeoff in Ridge Regression\n", "\n", "
\n", " \n", "\n", " \n", "
\n", "\n", "**Linear** regression (OLS)\n", "
\n", "High variance, no bias\n", "
\n", "
\n", "
\n", " \n", "\n", " \n", "
\n", "\n", "**Ridge** regression\n", "
\n", "Low variance, but biased!\n", "
\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "7736c1cb", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Bias-Variance Tradeoff in Ridge Regression\n", "\n", "| | | |\n", "|---|---|---|\n", "| Too much variance | Best tradeoff | Too much bias |" ] }, { "cell_type": "markdown", "id": "babfae57", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Partial Summary\n", "\n", "- Can overfit when:\n", " - $N$ is too small and $p$ is large\n", " - In particular with non-informative features\n", "- Regularization for regression:\n", " - From linear regression to ridge regression $\\rightarrow$ less overfit\n", " - large regularization parameter $\\rightarrow$ strong regularization $\\rightarrow$ smaller coefficients" ] }, { "cell_type": "markdown", "id": "e76f6dbd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tuning Hyperparameters (model selection)\n", "\n", "- Some **hyperparameters** need to be estimated in addition to coefficients (regularization parameter, nonlinear coefficients in linear models, etc.)\n", "- Validation data or Cross-Validation (CV) may be used to **select the best hyperparameters** by minimizing the estimated validation error (grid search, random search, etc.)\n", "\n", "" ] }, { "cell_type": "markdown", "id": "e6a24cd7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Testing after validation\n", "\n", "\n", "The best validation error is for hyperparameters that were **trained using the validation data**: it is **not the EPE** for this choice of hyperparameters!\n", "\n", "$\\rightarrow$ Distinguish between **train**, **validation** and **test data**\n", "\n", "Training and validation can be combined using CV but **test data should be kept to evaluate the PE conditionned on the CV set** for the final choice of coefficients and hyperparameters.\n", "\n", "To maximize the use of the data and estimate the EPE, **nested CV** can be used." ] }, { "cell_type": "markdown", "id": "190839ab", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Law of Large Numbers?\n", "\n", "Estimates attempt to minimize a function of the training error $\\overline{\\mathrm{err}}$.\n", "\n", "For estimates to converge with the sample size, so should $\\overline{\\mathrm{err}}$." ] }, { "cell_type": "markdown", "id": "aaaae356", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$\\rightarrow$ We need some **Law of Large Numbers** to be applicable.\n", "\n", "Basic assumptions: **independent** and **identically distributed**." ] }, { "cell_type": "markdown", "id": "354d07c8", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### What could go wrong?\n", "\n", "In the natural and engineering sciences many problems depend on **time**.\n", "\n", "So far, we have assumed that the joint distribution $f_{\\boldsymbol{X}, Y}$ is **independent of time**.\n", "\n", "In particular, we have assumed that the joint process is **statistically stationary**." 
] }, { "cell_type": "markdown", "id": "36e33e9e", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Variations in time can rarely be considered purely random:\n", "\n", "$\\rightarrow$ some **dependence** persist between realizations" ] }, { "cell_type": "markdown", "id": "00f1bd2c", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Yet, we are fine if we can show that:\n", "- there is a **stationary distribution**\n", "- realizations sufficiently distant in time **no longer correlate**" ] }, { "cell_type": "markdown", "id": "e08e20cf", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, distributions may change with **cycles** and **trends**." ] }, { "cell_type": "markdown", "id": "c2944fc2", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Violation of Statistical Stationarity\n", "\n", "
\n", " \n", "\n", "\n", "[By RCraig09 - Own work, CC BY-SA 4.0](https://commons.wikimedia.org/w/index.php?curid=88535596)\n", "
\n", "\n", "Surface air temperature variability can be decomposed into:\n", "\n", "$-$ (pseudo-)periodic **cycles** (diurnal, annual, Milankovitch)\n", "\n", "$-$ a **continuous spectrum** of frequencies due to chaotic dynamics\n", "\n", "$-$ an increasing **trend** due to global warming\n", "\n", "$-$ other non-equilibrium variations (effect of volcanoes, solar activity, ...)" ] }, { "cell_type": "markdown", "id": "4e1afd14", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Partial Summary\n", "\n", "- In statistics, we **always** make assumptions about the probability distribution of the data (stationarity, independence, parametric form, etc.)\n", "- The quality of statistics (predictions, error estimates, etc.) depends on the validity of these assumptions\n", "- Even the best validation could be wrong about predictions if the new data does not satisfy these assumptions!" ] }, { "cell_type": "markdown", "id": "33262c38", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## To go further\n", "\n", "- [Linear model inspection in Scikit-learn course](https://inria.github.io/scikit-learn-mooc/python_scripts/dev_features_importance.html#linear-model-inspection);\n", "- [Introduction of the evaluation metrics in Scikit-learn course](https://inria.github.io/scikit-learn-mooc/evaluation/02_metrics.html#introduction-of-the-evaluation-metrics);\n", "- *Generalized CV* approximation of leave-one out CV (Chap. 7.10 in Hastie *et al.* 2009);\n", "- The *bootstrap* as a way of assessing the accuracy of a parameter estimate or a prediction (shuffled version of cross-validation, Chap. 7.11 and 8 in Hastie *et al.* 2009).\n", "- Avoiding validation thanks to Bayesian approach." ] }, { "cell_type": "markdown", "id": "64018dbd", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## References\n", "\n", "- [James, G., Witten, D., Hastie, T., Tibshirani, R., n.d. *An Introduction to Statistical Learning*, 2st ed. Springer, New York, NY.](https://www.statlearning.com/)\n", "- Chap. 2, 3 and 7 in [Hastie, T., Tibshirani, R., Friedman, J., 2009. *The Elements of Statistical Learning*, 2nd ed. Springer, New York.](https://doi.org/10.1007/978-0-387-84858-7)\n", "- Chap. 5 and 7 in [Wilks, D.S., 2019. *Statistical Methods in the Atmospheric Sciences*, 4th ed. Elsevier, Amsterdam.](https://doi.org/10.1016/C2017-0-03921-6)" ] }, { "cell_type": "markdown", "id": "126964e5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "## Credit\n", "\n", "[//]: # \"This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education).\"\n", "Contributors include Bruno Deremble and Alexis Tantet.\n", "Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).\n", "\n", "
\n", "\n", "
\n", " \n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " \"Creative\n", "
This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": true, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }