# Ordinary Least Squares

**Prerequisites**

Define a supervised learning problem.

**Learning Outcomes**

Apply the supervised learning methodology to multiple linear regression.

## Strengths of OLS

- Simple to use;
- Easily interpretable in terms of variances and covariances;
- Can outperform fancier nonlinear models for prediction, especially in situations with:
  - small training samples,
  - low signal-to-noise ratio,
  - sparse data;
- Expandable to nonlinear transformations of the inputs;
- Can be used as a simple reference to learn about machine learning methodologies (supervised learning, in particular).

## Linear Model

\(X_j\) can come from:

- quantitative inputs;
- transformations of quantitative inputs, such as log or square;
- basis expansions, such as \(X_2 = X_1^2\), \(X_3 = X_1^3\);
- interactions between variables, for example, \(X_3 = X_1 \cdot X_2\);
- numeric or “dummy” coding of the levels of qualitative inputs. For example, \(X_j, j = 1, \ldots, 5\), such that \(X_j = I(G = j)\).
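As an illustration, here is a minimal NumPy sketch (with made-up data) of dummy coding a qualitative input and building a polynomial basis expansion:

```python
import numpy as np

# Hypothetical qualitative input G with levels 1, 2, 3;
# dummy coding builds indicator columns X_j = I(G = j).
G = np.array([1, 3, 2, 2, 1])
dummies = (G[:, None] == np.arange(1, 4)).astype(float)  # shape (5, 3)

# Basis expansion of a quantitative input: X, X^2, X^3.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
expansion = np.column_stack([x, x**2, x**3])  # shape (5, 3)
```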

## Residual Sum of Squares

The sample-mean estimate of the Expected Training Error with Squared Error Loss gives the *Residual Sum of Squares* (RSS) as a function of the parameters:

\[
\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i = 1}^N \Big( y_i - \beta_0 - \sum_{j = 1}^p x_{ij} \beta_j \Big)^2.
\]

**Question**

Assume that \(f(x) = \bar{y}\) (the sample mean of the target). The corresponding RSS is called the Total Sum of Squares (TSS). How does the TSS relate to the sample variance \(s_Y^2\) of \(Y\)?

The *coefficient of determination* \(R^2\) relates to the RSS as follows:

\[
R^2(\boldsymbol{\beta}) = 1 - \frac{\mathrm{RSS}(\boldsymbol{\beta})}{\mathrm{TSS}}.
\]
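As a sketch of these quantities, using NumPy on a small made-up dataset (values chosen for illustration only):

```python
import numpy as np

# Made-up data: one input, five observations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Least-squares fit of y ≈ beta0 + beta1 * x.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

rss = np.sum((y - y_hat) ** 2)     # Residual Sum of Squares: 3.6
tss = np.sum((y - y.mean()) ** 2)  # Total Sum of Squares (f(x) = ȳ): 10.0
r2 = 1.0 - rss / tss               # coefficient of determination: 0.64
```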

## How to Minimize the RSS?

Denote by \(\mathbf{X}\) the \(N \times (p + 1)\) input-data matrix.

The 1st column of \(\mathbf{X}\) is associated with the intercept and is given by the \(N\)-dimensional vector \(\mathbf{1}\) with all elements equal to 1.

Then the RSS can be written in matrix form:

\[
\mathrm{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}).
\]

**Question (optional)**

Show that the following parameter estimate minimizes the RSS:

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.
\]

Show that this solution is unique if and only if \(\mathbf{X}^\top\mathbf{X}\) is positive definite.

When could this condition not be fulfilled?
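A minimal NumPy sketch (synthetic data, hypothetical values) of the closed-form estimate, and of how collinear columns break uniqueness:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + inputs
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form solution of the normal equations (assumes XᵀX invertible).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the numerically safer least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Duplicating a column makes XᵀX singular (rank 4 < 5 columns),
# so the RSS minimizer is no longer unique.
X_collinear = np.column_stack([X, X[:, 1]])
rank = np.linalg.matrix_rank(X_collinear.T @ X_collinear)
```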

**Question (optional)**

Express \(R^2(\hat{\boldsymbol{\beta}})\) (above) in terms of explained variance.

Show that \(R^2(\hat{\boldsymbol{\beta}})\) is invariant under linear transformations of the target.

**Remark**

The formula for the optimal coefficients is in closed form, meaning that it can be directly computed using a finite number of standard operations.

Nonlinear models (e.g., neural networks) instead require solving numerical problems iteratively, with finite precision.

Suppose that the inputs \(\mathbf{x}_1, \ldots, \mathbf{x}_p\) (the columns of the data matrix \(\mathbf{X}\)) are orthogonal; that is, \(\mathbf{x}_j^\top \mathbf{x}_k = 0\) for all \(j \ne k\).

**Question**

Show that \(\hat{\beta}_j = \mathbf{x}_j^\top \mathbf{y} / (\mathbf{x}_j^\top \mathbf{x}_j)\) for all \(j\).

Interpret these coefficients in terms of correlations and variances.

How do the inputs influence each other’s parameter estimates in the model?

Find a simple expression of \(R^2(\hat{\beta})\) in that case.
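A quick numerical check (toy data) that with mutually orthogonal input columns, the joint OLS coefficients coincide with the univariate formula \(\hat{\beta}_j = \mathbf{x}_j^\top \mathbf{y} / (\mathbf{x}_j^\top \mathbf{x}_j)\):

```python
import numpy as np

# Toy orthogonal design (no intercept): x1ᵀx2 = 0.
x1 = np.array([-1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0])
X = np.column_stack([x1, x2])

y = np.array([0.0, 1.0, 2.0, 5.0])

# Joint multiple-regression fit.
beta_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

# Univariate coefficients computed column by column.
beta_uni = np.array([xj @ y / (xj @ xj) for xj in X.T])
```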

We now assume that the target is generated by the model \(Y = \boldsymbol{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are *uncorrelated*, with *mean zero* and *constant variance* \(\sigma^2\).

**Question (optional)**

Conditional on \(\boldsymbol{X} = \boldsymbol{x}\), show that the observations of \(Y\) are uncorrelated, with mean \(\boldsymbol{x}^\top \boldsymbol{\beta}\) and variance \(\sigma^2\).

Show that \(\mathbb{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \boldsymbol{\beta}\) and \(\mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\).

Show that \(\hat{\sigma}^2 = \sum_{i = 1}^N (y_i - \hat{y}_i)^2 / (N - p - 1)\) is an unbiased estimate of \(\sigma^2\), i.e., \(\mathbb{E}(\hat{\sigma}^2) = \sigma^2\).
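A Monte Carlo sanity check of this unbiasedness claim (synthetic data with a fixed seed; an illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 30, 2
sigma2 = 4.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, -2.0, 0.5])

estimates = []
for _ in range(5000):
    # Simulate y from the linear model with noise variance sigma2.
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # Unbiased variance estimate with N - p - 1 degrees of freedom.
    estimates.append(resid @ resid / (N - p - 1))

mean_estimate = np.mean(estimates)  # should be close to sigma2 = 4
```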

## Confidence Intervals

We now assume that the error \(\epsilon\) is a Gaussian random variable, i.e., \(\epsilon \sim N(0, \sigma^2)\), and we would like to test the null hypothesis that \(\beta_j = 0\).

**Question (optional)**

Show that the \(1 - 2 \alpha\) confidence interval for \(\beta_j\) is

\((\hat{\beta}_j - z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j}, \ \ \ \ \hat{\beta}_j + z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j})\),

where \(v_j = [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}\) and \(z^{(1 - \alpha)}_{N - p - 1}\) is the \((1 - \alpha)\) percentile of \(t_{N - p - 1}\) (see Supplementary Material).
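The interval can be computed numerically; here is a sketch using NumPy and SciPy's Student-\(t\) quantile on synthetic data (all names and values hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 40, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([0.5, 2.0, 0.0])
y = X @ beta + rng.normal(size=N)  # sigma = 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))

alpha = 0.025                             # 1 - 2*alpha = 95% interval
z = stats.t.ppf(1 - alpha, df=N - p - 1)  # t percentile, not standard normal
v = np.diag(XtX_inv)                      # v_j = [(XᵀX)^{-1}]_{jj}
lower = beta_hat - z * sigma_hat * np.sqrt(v)
upper = beta_hat + z * sigma_hat * np.sqrt(v)
```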

## To go further

Basis expansion models: polynomials, splines, etc. (Chap. 5 in Hastie *et al.* 2009).

## References

- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer.

## Credit

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.