{ "cells": [ { "cell_type": "markdown", "id": "ef7bc077", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Ordinary Least Squares\n", "\n", "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2F03_ordinary_least_squares.ipynb)" ] }, { "cell_type": "markdown", "id": "ccdfea6e-5bff-4e31-b35a-6ba7a1554a52", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " Prerequisites\n", " \n", "- Define a supervised learning problem.\n", "
\n", "\n", "
\n", " Learning Outcomes\n", " \n", "- Apply the supervised learning methodology to a multiple linear regression\n", "
" ] }, { "cell_type": "markdown", "id": "eae3c60c", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Strengths of the OLS\n", "\n", "- Simple to use;\n", "- Easily interpretable in terms of variances and covariances;\n", "- Can outperform fancier nonlinear models for prediction, especially in situations with:\n", " - small training samples,\n", " - low signal-to-noise ratio,\n", " - sparse data.\n", "- Expandable to nonlinear transformations of the inputs;\n", "- Can be used as a simple reference to learn about machine learning methodologies (supervised learning, in particular)." ] }, { "cell_type": "markdown", "id": "94862dfd", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Linear Model\n", "\n", "\\begin{equation}\n", "f_{\\boldsymbol{\\beta}}(X) = \\underbrace{\\beta_0}_{\\mathrm{intercept}} + \\sum_{j = 1}^p X_j \\beta_j\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "ad85fa16", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "$X_j$ can come from :\n", "- quantitative inputs;\n", "- transformations of quantitative inputs, such as log or square;\n", "- basis expansions, such as $X_2 = X_1^2$, $X_3 = X_1^3$;\n", "- interactions between variables, for example, $X_3 = X_1 \\cdot X_2$;\n", "- numeric or \"dummy\" coding of the levels of qualitative inputs. For example, $X_j, j = 1, \\ldots, 5$, such that $X_j = I(G = j)$." ] }, { "cell_type": "markdown", "id": "9b9392d1", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Residual Sum of Squares\n", "\n", "The sample-mean estimate of the Expected Training Error with Squared Error Loss gives the *Residual Sum of Squares* (RSS) depending on the parameters:\n", "\n", "\\begin{equation}\n", "\\mathrm{RSS}(\\beta)\n", "= \\sum_{i = 1}^N \\left(y_i - f(x_i)\\right)^2\n", " = \\sum_{i = 1}^N \\left(y_i - \\beta_0 - \\sum_{j = 1}^p x_{ij} \\beta_j\\right)^2.\n", "\\end{equation}\n", "\n", "\"Linear\n", "\"Linear" ] }, { "cell_type": "markdown", "id": "4b8487bb", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> ***Question***\n", "> - Assume that $f(x) = \\bar{y}$ (the sample mean of the target).\n", "The corresponding RSS is called the Total Sum of Squares (TSS).\n", "How does the TSS relate to the sample variance $s_Y^2$ of $Y$?" 
] }, { "cell_type": "markdown", "id": "267951b5", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The *coefficient of determination* $R^2$ relates to the RSS as follows:\n", "\n", "\\begin{equation}\n", "R^2(\\boldsymbol{\\beta}) = 1 - \\frac{\\mathrm{RSS}(\\boldsymbol{\\beta})}{\\mathrm{TSS}}.\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "0f429234", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How to Minimize the RSS?\n", "\n", "Denote by $\\mathbf{X}$ the $N \\times (p + 1)$ input-data matrix.\n", "\n", "The first column of $\\mathbf{X}$ is associated with the intercept and is given by the $N$-dimensional vector $\\mathbf{1}$ with all elements equal to 1.\n", "\n", "Then,\n", "\\begin{equation}\n", "\\mathrm{RSS}(\\boldsymbol{\\beta}) = \\left(\\mathbf{y} - \\mathbf{X} \\boldsymbol{\\beta}\\right)^\\top \\left(\\mathbf{y} - \\mathbf{X} \\boldsymbol{\\beta}\\right).\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "a38d8c9c", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> ***Question (optional)***\n", "> - Show that the following parameter estimate minimizes the RSS.\n", "> - Show that this solution is unique if and only if $\\mathbf{X}^\\top \\mathbf{X}$ is positive definite.\n", "> - When could this condition not be fulfilled?\n", "\n", "\\begin{equation}\n", "\\hat{\\boldsymbol{\\beta}} = \\left(\\mathbf{X}^\\top \\mathbf{X}\\right)^{-1} \\left(\\mathbf{X}^\\top \\mathbf{y}\\right)\n", "\\end{equation}" ] }, { "cell_type": "markdown", "id": "ce6d258f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> ***Question (optional)***\n", "> - Express $R^2(\\hat{\\boldsymbol{\\beta}})$ (above) in terms of explained variance.\n", "> - Show that $R^2(\\hat{\\boldsymbol{\\beta}})$ is invariant under linear transformations of the target." ] }, { "cell_type": "markdown", "id": "be577b50", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " Remark\n", " \n", "The formula for the optimal coefficients is in closed form, meaning that it can be directly computed using a finite number of standard operations.\n", " \n", "Nonlinear models (e.g. neural networks) will instead require solving numerical problems iteratively with a finite precision.\n", "
" ] }, { "cell_type": "markdown", "id": "397f6d19", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Suppose that the inputs $\\mathbf{x}_1, \\ldots, \\mathbf{x}_p$ (the columns of the data matrix $\\mathbf{X}$) are orthogonal; that is $\\mathbf{x}_j^\\top \\mathbf{x}_k = 0$ for all $j \\ne k$.\n", "\n", "> ***Question***\n", "> - Show that $\\hat{\\beta} = \\mathbf{x}_j^\\top \\mathbf{y} / (\\mathbf{x}_j^\\top \\mathbf{x}_j)$ for all $j$.\n", "> - Interpret these coefficients in terms of correlations and variances.\n", "> - How do the inputs influence each other's parameter estimates in the model?\n", "> - Find a simple expression of $R^2(\\hat{\\beta})$ in that case." ] }, { "cell_type": "markdown", "id": "2f4d0f27", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We now assume that the target is generated by this model $Y = \\boldsymbol{X}^\\top \\boldsymbol{\\beta} + \\epsilon$, where the observations of $\\epsilon$ are *uncorrelated* and with *mean zero* and *constant variance* $\\sigma^2$.\n", "\n", "> ***Question (optional)***\n", "> - Knowing that $\\boldsymbol{X} = \\boldsymbol{x}$, show that the observations of $y$ are uncorrelated, with mean $\\boldsymbol{x}^\\top \\boldsymbol{\\beta}$ and variance $\\sigma^2$.\n", "> - Show that $\\mathbb{E}(\\hat{\\boldsymbol{\\beta}} | \\mathbf{X}) = \\boldsymbol{\\beta}$ and $\\mathrm{Var}(\\hat{\\boldsymbol{\\beta}} | \\mathbf{X}) = \\sigma^2 (\\mathbf{X}^\\top \\mathbf{X})^{-1}$.\n", "> - Show that $\\hat{\\sigma}^2 = \\sum_{i = 1}^N (y_i - \\hat{y}_i)^2 / (N - p - 1)$ is an unbiased estimate of $\\sigma^2$, i.e $\\mathbb{E}(\\hat{\\sigma}^2) = \\sigma^2$." ] }, { "cell_type": "markdown", "id": "bc2ef5ab", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Confidence Intervals\n", "\n", "We now assume that the error $\\epsilon$ is a Gaussian random variable, i.e $\\epsilon \\sim N(0, \\sigma^2)$ and would like to test the null hypothesis that $\\beta_j = 0$.\n", "\n", "> ***Question (optional)***\n", "> - Show that the $1 - 2 \\alpha$ confidence interval for $\\beta_j$ is\n", ">\n", "> $(\\hat{\\beta}_j - z^{(1 - \\alpha)}_{N - p - 1} \\hat{\\sigma} \\sqrt{v_j}, \\ \\ \\ \\ \\hat{\\beta}_j + z^{(1 - \\alpha)}_{N - p - 1} \\hat{\\sigma} \\sqrt{v_j})$,\n", ">\n", "> where $v_j = [(\\mathbf{X}^\\top \\mathbf{X})^{-1}]_{jj}$ and $z^{(1 - \\alpha)}_{N - p - 1}$ is the $(1 - \\alpha)$ percentile of $t_{N - p - 1}$ (see [Supplementary Material](appendix_supplementary_matrial.ipynb))." ] }, { "cell_type": "markdown", "id": "db5011ed", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## To go further\n", "\n", "- Basis expansion models : polynomials, splines, etc. (Chap. 5 in Hastie *et al.* 2009)" ] }, { "cell_type": "markdown", "id": "7005fab6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## References\n", "\n", "- [James, G., Witten, D., Hastie, T., Tibshirani, R., n.d. *An Introduction to Statistical Learning*, 2st ed. Springer, New York, NY.](https://www.statlearning.com/)\n", "- Chap. 2, 3 and 7 in [Hastie, T., Tibshirani, R., Friedman, J., 2009. *The Elements of Statistical Learning*, 2nd ed. Springer, New York.](https://doi.org/10.1007/978-0-387-84858-7)\n", "- Chap. 5 and 7 in [Wilks, D.S., 2019. *Statistical Methods in the Atmospheric Sciences*, 4th ed. 
Elsevier, Amsterdam.](https://doi.org/10.1016/C2017-0-03921-6)" ] }, { "cell_type": "markdown", "id": "5e186998", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "## Credit\n", "\n", "[//]: # \"This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education).\"\n", "Contributors include Bruno Deremble and Alexis Tantet.\n", "Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).\n", "\n", "
\n", "\n", "
\n", " \n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", "\n", "\"Logo\n", " \n", "
\n", "\n", "
\n", "\n", "
\n", " \"Creative\n", "
This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "latex_envs": { "LaTeX_envs_menu_present": false, "autoclose": true, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }