# Ordinary Least Squares

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2F03_ordinary_least_squares.ipynb)

<div class="alert alert-block alert-warning">
    <b>Prerequisites</b>
    
- Define a supervised learning problem.
</div>

<div class="alert alert-block alert-info">
    <b>Learning Outcomes</b>
    
- Apply the supervised learning methodology to a multiple linear regression.
</div>

## Strengths of the OLS

- Simple to use;
- Easily interpretable in terms of variances and covariances;
- Can outperform fancier nonlinear models for prediction, especially in situations with:
  - small training samples,
  - low signal-to-noise ratio,
  - sparse data.
- Expandable to nonlinear transformations of the inputs;
- Can be used as a simple reference to learn about machine learning methodologies (supervised learning, in particular).

## Linear Model

\begin{equation}
f_{\boldsymbol{\beta}}(X) = \underbrace{\beta_0}_{\mathrm{intercept}} + \sum_{j = 1}^p X_j \beta_j
\end{equation}

$X_j$ can come from:
- quantitative inputs;
- transformations of quantitative inputs, such as log or square;
- basis expansions, such as $X_2 = X_1^2$, $X_3 = X_1^3$;
- interactions between variables, for example, $X_3 = X_1 \cdot X_2$;
- numeric or "dummy" coding of the levels of qualitative inputs. For example, $X_j, j = 1, \ldots, 5$, such that $X_j = I(G = j)$.
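
As an illustration, the kinds of inputs listed above can be assembled into feature columns with NumPy. This is only a minimal sketch on synthetic data: the variable names (`x1`, `x2`, `g`, `features`) and the three-level qualitative input are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
x1 = rng.uniform(1.0, 2.0, size=N)  # quantitative input (positive, so log is defined)
x2 = rng.uniform(1.0, 2.0, size=N)  # second quantitative input
g = rng.integers(1, 4, size=N)      # hypothetical qualitative input with levels 1, 2, 3

# Dummy coding: one indicator column I(G = j) per level j
levels = np.array([1, 2, 3])
dummies = (g[:, None] == levels[None, :]).astype(float)

features = np.column_stack([
    x1,          # quantitative input
    np.log(x1),  # transformation of a quantitative input
    x1**2,       # basis expansion
    x1 * x2,     # interaction between two variables
    dummies,     # dummy coding of the qualitative input
])
print(features.shape)  # (6, 7)
```

Each row of `dummies` contains exactly one 1, so the indicator columns sum to a column of ones. When the model also has an intercept, one level is therefore usually dropped to avoid linearly dependent columns.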

## Residual Sum of Squares

Estimating the expected training error under squared-error loss by a sample mean gives, up to the factor $1/N$, the *Residual Sum of Squares* (RSS) as a function of the parameters:

\begin{equation}
\mathrm{RSS}(\boldsymbol{\beta})
= \sum_{i = 1}^N \left(y_i - f_{\boldsymbol{\beta}}(x_i)\right)^2
 = \sum_{i = 1}^N \left(y_i - \beta_0 - \sum_{j = 1}^p x_{ij} \beta_j\right)^2.
\end{equation}

<img alt="Linear fit" src="images/linear_fit_red.svg" width="360" style="float:left">
<img alt="Linear fit" src="images/lin_reg_3D.svg" width="400" style="float:right">

> ***Question***
> - Assume that $f(x) = \bar{y}$ (the sample mean of the target).
The corresponding RSS is called the Total Sum of Squares (TSS).
How does the TSS relate to the sample variance $s_Y^2$ of $Y$?

The *coefficient of determination* $R^2$ is defined from the RSS as

\begin{equation}
R^2(\boldsymbol{\beta}) = 1 - \frac{\mathrm{RSS}(\boldsymbol{\beta})}{\mathrm{TSS}}.
\end{equation}
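
The two definitions above translate directly into a few lines of NumPy. The sketch below uses hypothetical helper names (`rss`, `r2`) and a tiny hand-made data set chosen so that the arithmetic can be checked by hand.

```python
import numpy as np

def rss(y, y_pred):
    """Residual sum of squares between targets and predictions."""
    return float(np.sum((y - y_pred) ** 2))

def r2(y, y_pred):
    """Coefficient of determination, 1 - RSS / TSS."""
    tss = rss(y, np.full_like(y, y.mean()))  # RSS of the constant prediction y-bar
    return 1.0 - rss(y, y_pred) / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])  # illustrative predictions

print(rss(y, y_pred))  # 1.0: four residuals of magnitude 0.5
print(r2(y, y_pred))   # 0.8 up to rounding, since TSS = 5.0
```

Predicting the sample mean everywhere gives $R^2 = 0$, and a perfect fit gives $R^2 = 1$.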

## How to Minimize the RSS?

Denote by $\mathbf{X}$ the $N \times (p + 1)$ input-data matrix.

The 1st column of $\mathbf{X}$ is associated with the intercept and is given by the $N$-dimensional vector $\mathbf{1}$ with all elements equal to 1.

Then,
\begin{equation}
\mathrm{RSS}(\boldsymbol{\beta}) = \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)^\top \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right).
\end{equation}

> ***Question (optional)***
> - Show that the following parameter estimate minimizes the RSS.
> - Show that this solution is unique if and only if $\mathbf{X}^\top\mathbf{X}$ is positive definite.
> - When could this condition not be fulfilled?

\begin{equation}
    \hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \left(\mathbf{X}^\top \mathbf{y}\right)
\end{equation}
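
The closed-form estimate can be evaluated numerically as follows. This is a sketch on synthetic data: the sample size, the coefficients `beta_true` and the noise level are arbitrary assumptions. The linear system $\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}$ is solved directly rather than forming the inverse explicitly, and the result is compared with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 200, 3
inputs = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), inputs])    # first column of ones for the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # assumed coefficients, intercept first
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

At the solution, the residual vector is orthogonal to every column of $\mathbf{X}$, which can be checked with `X.T @ (y - X @ beta_hat)`.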

> ***Question (optional)***
> - Express $R^2(\hat{\boldsymbol{\beta}})$ (above) in terms of explained variance.
> - Show that $R^2(\hat{\boldsymbol{\beta}})$ is invariant under linear transformations of the target.

<div class="alert alert-block alert-warning">
    <b>Remark</b>
    
The formula for the optimal coefficients is in closed form, meaning that it can be directly computed using a finite number of standard operations.
    
Nonlinear models (e.g. neural networks) instead generally require iterative numerical solvers, which reach the optimum only to finite precision.
</div>

Suppose that the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_p$ (the columns of the data matrix $\mathbf{X}$) are orthogonal; that is, $\mathbf{x}_j^\top \mathbf{x}_k = 0$ for all $j \ne k$.

> ***Question***
> - Show that $\hat{\beta}_j = \mathbf{x}_j^\top \mathbf{y} / (\mathbf{x}_j^\top \mathbf{x}_j)$ for all $j$.
> - Interpret these coefficients in terms of correlations and variances.
> - How do the inputs influence each other's parameter estimates in the model?
> - Find a simple expression of $R^2(\hat{\boldsymbol{\beta}})$ in that case.
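
The first statement can be checked numerically before proving it. The sketch below makes some simplifying assumptions: the orthogonal columns are built from a QR factorization of random data, no separate intercept column is included, and the target is pure noise.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
Q, _ = np.linalg.qr(rng.normal(size=(N, p)))  # N x p matrix with orthonormal columns
X = Q * np.array([1.0, 2.0, 3.0])             # rescale: orthogonal but not unit-norm
y = rng.normal(size=N)

# Multiple regression on all columns at once
beta_multi = np.linalg.solve(X.T @ X, X.T @ y)

# One univariate regression per column: x_j^T y / (x_j^T x_j)
beta_single = (X.T @ y) / np.sum(X**2, axis=0)

print(np.allclose(beta_multi, beta_single))
```

Because $\mathbf{X}^\top \mathbf{X}$ is diagonal here, each coefficient can be computed from its own column alone.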

We now assume that the target is generated by the model $Y = \boldsymbol{X}^\top \boldsymbol{\beta} + \epsilon$, where the observations of $\epsilon$ are *uncorrelated*, with *mean zero* and *constant variance* $\sigma^2$.

> ***Question (optional)***
> - Knowing that $\boldsymbol{X} = \boldsymbol{x}$, show that the observations of $y$ are uncorrelated, with mean $\boldsymbol{x}^\top \boldsymbol{\beta}$ and variance $\sigma^2$.
> - Show that $\mathbb{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \boldsymbol{\beta}$ and $\mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$.
> - Show that $\hat{\sigma}^2 = \sum_{i = 1}^N (y_i - \hat{y}_i)^2 / (N - p - 1)$ is an unbiased estimate of $\sigma^2$, i.e., $\mathbb{E}(\hat{\sigma}^2) = \sigma^2$.
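
These properties can be illustrated with a small Monte Carlo experiment, which is not a substitute for the proofs. The sketch assumes a fixed design matrix, Gaussian noise (only one possible choice satisfying the assumptions above) and arbitrary values for `beta`, `sigma` and the number of repetitions `n_rep`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 30, 2, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # design held fixed
beta = np.array([1.0, -2.0, 0.5])                           # assumed true parameters

n_rep = 5000
beta_hats = np.empty((n_rep, p + 1))
sigma2_hats = np.empty(n_rep)
for r in range(n_rep):
    # Draw a new noise realization for the same inputs
    y = X @ beta + sigma * rng.normal(size=N)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    beta_hats[r] = b
    sigma2_hats[r] = np.sum((y - X @ b) ** 2) / (N - p - 1)

mean_beta = beta_hats.mean(axis=0)         # compare with beta
emp_cov = np.cov(beta_hats, rowvar=False)  # compare with sigma^2 (X^T X)^{-1}
theo_cov = sigma**2 * np.linalg.inv(X.T @ X)
mean_sigma2 = sigma2_hats.mean()           # compare with sigma^2

print(mean_beta, beta)
print(mean_sigma2, sigma**2)
```

The averages over repetitions should approach the theoretical values as `n_rep` grows.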

## Confidence Intervals

We now assume that the error $\epsilon$ is a Gaussian random variable, i.e., $\epsilon \sim N(0, \sigma^2)$, and would like to test the null hypothesis that $\beta_j = 0$.

> ***Question (optional)***
> - Show that the $1 - 2 \alpha$ confidence interval for $\beta_j$ is
>
> $(\hat{\beta}_j - z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j}, \ \ \ \ \hat{\beta}_j + z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j})$,
>
> where $v_j = [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}$ is the $j$th diagonal element of $(\mathbf{X}^\top \mathbf{X})^{-1}$ and $z^{(1 - \alpha)}_{N - p - 1}$ is the $(1 - \alpha)$ quantile of the $t_{N - p - 1}$ distribution (see [Supplementary Material](appendix_supplementary_matrial.ipynb)).
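
The interval above can be computed as in the following sketch, which assumes synthetic Gaussian data and uses `scipy.stats.t.ppf` for the Student-$t$ quantile. The parameter values, including a third coefficient set to zero, are chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, p, sigma = 100, 2, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([0.5, 1.0, 0.0])  # assumed true parameters
y = X @ beta + sigma * rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
dof = N - p - 1
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / dof)
v = np.diag(XtX_inv)              # v_j, diagonal of (X^T X)^{-1}

alpha = 0.025                     # gives a 1 - 2 * alpha = 95% interval
t_quantile = stats.t.ppf(1 - alpha, df=dof)
half_width = t_quantile * sigma_hat * np.sqrt(v)
ci_lower = beta_hat - half_width
ci_upper = beta_hat + half_width

for j in range(p + 1):
    print(f"beta_{j}: [{ci_lower[j]:.3f}, {ci_upper[j]:.3f}]")
```

At this level, the null hypothesis $\beta_j = 0$ is rejected when 0 lies outside the interval for coefficient $j$.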

## To go further

- Basis expansion models: polynomials, splines, etc. (Chap. 5 in Hastie *et al.* 2009)

## References

- [James, G., Witten, D., Hastie, T., Tibshirani, R., 2021. *An Introduction to Statistical Learning*, 2nd ed. Springer, New York, NY.](https://www.statlearning.com/)
- Chap. 2, 3 and 7 in [Hastie, T., Tibshirani, R., Friedman, J., 2009. *The Elements of Statistical Learning*, 2nd ed. Springer, New York.](https://doi.org/10.1007/978-0-387-84858-7)
- Chap. 5 and 7 in [Wilks, D.S., 2019. *Statistical Methods in the Atmospheric Sciences*, 4th ed. Elsevier, Amsterdam.](https://doi.org/10.1016/C2017-0-03921-6)

***
## Credit

[//]: # "This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education)."
Contributors include Bruno Deremble and Alexis Tantet.
Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).

<br>

<div style="display: flex; height: 70px">
    
<img alt="Logo LMD" src="images/logos/logo_lmd.jpg" style="display: inline-block"/>

<img alt="Logo IPSL" src="images/logos/logo_ipsl.png" style="display: inline-block"/>

<img alt="Logo E4C" src="images/logos/logo_e4c_final.png" style="display: inline-block"/>

<img alt="Logo EP" src="images/logos/logo_ep.png" style="display: inline-block"/>

<img alt="Logo SU" src="images/logos/logo_su.png" style="display: inline-block"/>

<img alt="Logo ENS" src="images/logos/logo_ens.jpg" style="display: inline-block"/>

<img alt="Logo CNRS" src="images/logos/logo_cnrs.png" style="display: inline-block"/>
    
</div>

<hr>

<div style="display: flex">
    <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0; margin-right: 10px" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>
    <br>This work is licensed under a &nbsp; <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div>