Ordinary Least Squares#


  • Define a supervised learning problem.

Learning Outcomes
  • Apply the supervised learning methodology to a multiple linear regression.

Strengths of OLS#

  • Simple to use;

  • Easily interpretable in terms of variances and covariances;

  • Can outperform fancier nonlinear models for prediction, especially in situations with:

    • small training samples,

    • low signal-to-noise ratio,

    • sparse data.

  • Expandable to nonlinear transformations of the inputs;

  • Can be used as a simple reference to learn about machine learning methodologies (supervised learning, in particular).

Linear Model#

\[\begin{equation} f_{\boldsymbol{\beta}}(X) = \underbrace{\beta_0}_{\mathrm{intercept}} + \sum_{j = 1}^p X_j \beta_j \tag{30} \end{equation}\]

The inputs \(X_j\) can come from:

  • quantitative inputs;

  • transformations of quantitative inputs, such as log or square;

  • basis expansions, such as \(X_2 = X_1^2\), \(X_3 = X_1^3\);

  • interactions between variables, for example, \(X_3 = X_1 \cdot X_2\);

  • numeric or “dummy” coding of the levels of qualitative inputs. For example, for a qualitative input \(G\) with five levels, \(X_j = I(G = j)\), \(j = 1, \ldots, 5\), where \(I\) is the indicator function.
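The kinds of inputs listed above can be assembled into a single feature matrix. The following is a minimal sketch with NumPy; the variable names (`x1`, `x2`, `g`, `features`) and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Hypothetical raw data: two quantitative inputs and one qualitative input.
x1 = rng.uniform(1.0, 2.0, size=n)
x2 = rng.uniform(1.0, 2.0, size=n)
g = rng.integers(1, 4, size=n)  # levels 1, 2, 3

# Dummy coding: column j is the indicator I(G = j).
levels = np.array([1, 2, 3])
dummies = (g[:, None] == levels[None, :]).astype(float)

features = np.column_stack([
    x1,           # quantitative input
    np.log(x2),   # transformation of a quantitative input
    x1 ** 2,      # basis expansion
    x1 * x2,      # interaction between variables
    dummies,      # dummy coding of the qualitative input
])
print(features.shape)
```

Note that the dummy columns sum to one in every row. When the model also contains an intercept, one level is usually dropped so that the columns of the input-data matrix stay linearly independent.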

Residual Sum of Squares#

Estimating the expected training error under squared-error loss by its sample mean gives, up to a factor \(1/N\), the Residual Sum of Squares (RSS), viewed as a function of the parameters:

\[\begin{equation} \mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i = 1}^N \left(y_i - f_{\boldsymbol{\beta}}(x_i)\right)^2 = \sum_{i = 1}^N \left(y_i - \beta_0 - \sum_{j = 1}^p x_{ij} \beta_j\right)^2. \tag{31} \end{equation}\]
[Figure: Linear fit]


  • Assume that \(f(x) = \bar{y}\) (the sample mean of the target). The corresponding RSS is called the Total Sum of Squares (TSS). How does the TSS relate to the sample variance \(s_Y^2\) of \(Y\)?

The coefficient of determination \(R^2\) is related to the RSS by

\[\begin{equation} R^2(\boldsymbol{\beta}) = 1 - \frac{\mathrm{RSS}(\boldsymbol{\beta})}{\mathrm{TSS}}. \tag{32} \end{equation}\]
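As a small worked example, the RSS, the TSS and \(R^2\) can be computed directly from their definitions. This is a sketch only; the helper names `rss` and `r_squared` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def rss(y, y_pred):
    """Residual sum of squares between targets and predictions."""
    residuals = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(residuals @ residuals)

def r_squared(y, y_pred):
    """Coefficient of determination, 1 - RSS / TSS."""
    y = np.asarray(y, dtype=float)
    # The TSS is the RSS of the constant prediction equal to the sample mean.
    tss = rss(y, np.full_like(y, y.mean()))
    return 1.0 - rss(y, y_pred) / tss

# Hypothetical targets and predictions.
y = np.array([1.0, 2.0, 4.0, 5.0])
y_pred = np.array([1.5, 2.0, 3.5, 5.0])

print(rss(y, y_pred))        # RSS of the predictions
print(r_squared(y, y_pred))  # corresponding R^2
```

Predicting the sample mean everywhere gives \(\mathrm{RSS} = \mathrm{TSS}\) and therefore \(R^2 = 0\), which is the reference against which any other fit is compared.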

How to Minimize the RSS?#

Denote by \(\mathbf{X}\) the \(N \times (p + 1)\) input-data matrix.

The 1st column of \(\mathbf{X}\) is associated with the intercept and is given by the \(N\)-dimensional vector \(\mathbf{1}\) with all elements equal to 1.


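Denote also by \(\mathbf{y}\) the \(N\)-dimensional vector of target observations. In matrix form, the RSS reads

\[\begin{equation} \mathrm{RSS}(\boldsymbol{\beta}) = \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right)^\top \left(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\right). \tag{33} \end{equation}\]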

Question (optional)

  • Show that the following parameter estimate minimizes the RSS.

  • Show that this solution is unique if and only if \(\mathbf{X}^\top\mathbf{X}\) is positive definite.

  • When could this condition not be fulfilled?

\[\begin{equation} \hat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \left(\mathbf{X}^\top \mathbf{y}\right) \tag{34} \end{equation}\]

Question (optional)

  • Express \(R^2(\hat{\boldsymbol{\beta}})\) (above) in terms of explained variance.

  • Show that \(R^2(\hat{\boldsymbol{\beta}})\) is invariant under linear transformations of the target.


The formula for the optimal coefficients is in closed form, meaning that it can be directly computed using a finite number of standard operations.

Nonlinear models (e.g. neural networks) instead require solving an optimization problem iteratively, to finite precision.
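To illustrate the closed-form solution, the sketch below computes the OLS coefficients on synthetic data by solving the normal equations \(\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}\), and compares the result with NumPy's least-squares solver. The data, the "true" coefficients `beta_true` and the noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 200, 3

# Input-data matrix: a first column of ones for the intercept, then p inputs.
inputs = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), inputs])

# Hypothetical coefficients used to generate a noisy target.
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed form: solve the normal equations instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate from a general least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system rather than inverting \(\mathbf{X}^\top \mathbf{X}\) explicitly is a common numerical choice; both correspond to the same closed-form estimate when \(\mathbf{X}^\top \mathbf{X}\) is invertible.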

Suppose that the inputs \(\mathbf{x}_1, \ldots, \mathbf{x}_p\) (the columns of the data matrix \(\mathbf{X}\)) are orthogonal; that is \(\mathbf{x}_j^\top \mathbf{x}_k = 0\) for all \(j \ne k\).


  • Show that \(\hat{\beta}_j = \mathbf{x}_j^\top \mathbf{y} / (\mathbf{x}_j^\top \mathbf{x}_j)\) for all \(j\).

  • Interpret these coefficients in terms of correlations and variances.

  • How do the inputs influence each other’s parameter estimates in the model?

  • Find a simple expression of \(R^2(\hat{\boldsymbol{\beta}})\) in that case.
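The orthogonal case can be checked numerically before working through the questions. The sketch below is not a proof: it builds a hypothetical data matrix with orthogonal columns from a QR decomposition, then compares the joint least-squares fit with the one-input-at-a-time formula.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3

# Orthonormal columns from a reduced QR decomposition, rescaled so that
# the columns are orthogonal but do not have unit norm.
Q, _ = np.linalg.qr(rng.normal(size=(N, p)))
X = Q * np.array([1.0, 2.0, 3.0])
y = rng.normal(size=N)

# Joint least-squares fit on all columns.
beta_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of each column computed separately: x_j^T y / (x_j^T x_j).
beta_univariate = (X.T @ y) / np.sum(X ** 2, axis=0)

# Refit after dropping the last column.
beta_reduced, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)

print(np.allclose(beta_joint, beta_univariate))
print(np.allclose(beta_reduced, beta_joint[:2]))
```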

We now assume that the target is generated by the model \(Y = \boldsymbol{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are uncorrelated, with zero mean and constant variance \(\sigma^2\).

Question (optional)

  • Given that \(\boldsymbol{X} = \boldsymbol{x}\), show that the observations of \(y\) are uncorrelated, with mean \(\boldsymbol{x}^\top \boldsymbol{\beta}\) and variance \(\sigma^2\).

  • Show that \(\mathbb{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \boldsymbol{\beta}\) and \(\mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\).

  • Show that \(\hat{\sigma}^2 = \sum_{i = 1}^N (y_i - \hat{y}_i)^2 / (N - p - 1)\) is an unbiased estimate of \(\sigma^2\), i.e., \(\mathbb{E}(\hat{\sigma}^2) = \sigma^2\).
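The quantities in these questions can be computed on simulated data. The sketch below generates a target from the linear model with hypothetical coefficients `beta` and noise level `sigma`, then evaluates \(\hat{\sigma}^2\) and the plug-in estimate \(\hat{\sigma}^2 (\mathbf{X}^\top \mathbf{X})^{-1}\) of the covariance of \(\hat{\boldsymbol{\beta}}\). It illustrates the formulas on one sample and does not replace the derivations.

```python
import numpy as np

rng = np.random.default_rng(7)
N, p, sigma = 500, 2, 0.5

# Input-data matrix with intercept column and hypothetical true coefficients.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([0.5, 1.0, -2.0])
y = X @ beta + sigma * rng.normal(size=N)

# OLS estimate and residuals.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

# Noise-variance estimate with N - p - 1 degrees of freedom.
sigma2_hat = residuals @ residuals / (N - p - 1)

# Estimated covariance matrix of beta_hat and its standard errors.
cov_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(cov_beta_hat))

print(sigma2_hat, sigma ** 2)
print(std_errors)
```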

Confidence Intervals#

We now assume that the error \(\epsilon\) is a Gaussian random variable, i.e., \(\epsilon \sim N(0, \sigma^2)\), and would like to test the null hypothesis that \(\beta_j = 0\).

Question (optional)

  • Show that the \(1 - 2 \alpha\) confidence interval for \(\beta_j\) is

\[\left(\hat{\beta}_j - z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j}, \quad \hat{\beta}_j + z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j}\right),\]

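where \(v_j = [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}\) is the diagonal element of \((\mathbf{X}^\top \mathbf{X})^{-1}\) associated with \(\beta_j\), and \(z^{(1 - \alpha)}_{N - p - 1}\) is the \((1 - \alpha)\) percentile of the Student \(t\) distribution with \(N - p - 1\) degrees of freedom, \(t_{N - p - 1}\) (see Supplementary Material).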
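The interval above can be evaluated numerically. The following sketch assumes SciPy is available and uses `scipy.stats.t.ppf` for the percentile of the \(t\) distribution; the function name `ols_confidence_intervals` and the synthetic data are hypothetical choices made for this example.

```python
import numpy as np
from scipy import stats

def ols_confidence_intervals(X, y, alpha=0.025):
    """Return the (1 - 2 * alpha) confidence bounds for each OLS coefficient.

    X is the N x (p + 1) input-data matrix, including the column of ones.
    """
    N, n_params = X.shape          # n_params = p + 1
    dof = N - n_params             # N - p - 1 degrees of freedom
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    sigma_hat = np.sqrt(residuals @ residuals / dof)
    # (1 - alpha) percentile of the t distribution with dof degrees of freedom.
    t_quantile = stats.t.ppf(1.0 - alpha, df=dof)
    half_width = t_quantile * sigma_hat * np.sqrt(np.diag(XtX_inv))
    return beta_hat - half_width, beta_hat + half_width

# Hypothetical data: intercept 1, slope 2, unit-variance Gaussian noise.
rng = np.random.default_rng(3)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

lower, upper = ols_confidence_intervals(X, y, alpha=0.025)
print(np.column_stack([lower, upper]))  # one row per coefficient
```

With `alpha=0.025` this gives 95% intervals. If the interval for \(\beta_j\) does not contain zero, the null hypothesis \(\beta_j = 0\) is rejected at the corresponding level.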

To go further#

  • Basis expansion models: polynomials, splines, etc. (Chap. 5 in Hastie et al. 2009)



Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.
