Appendix: Supplementary Material#

On Ordinary Least Squares (OLS)#

How to Minimize the Residual Sum of Squares (RSS)?#

The predictions with parameters \(\hat{\boldsymbol{\beta}}\) from the input data are given by

(193)#\[\begin{equation} \hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X} \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \left(\mathbf{X}^\top \mathbf{y}\right). \end{equation}\]

The residual vector is given by \(\hat{\mathbf{z}} = \mathbf{y} - \hat{\mathbf{y}}\).

Question (optional)

  • Show that \(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto the subspace of \(\mathbb{R}^N\) spanned by the columns of \(\mathbf{X}\) (i.e., the column space of \(\mathbf{X}\)), and that \(\hat{\mathbf{z}}\) is orthogonal to this space.
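As a sanity check, the projection property can be verified numerically. A minimal sketch with NumPy on synthetic data (the dimensions and random inputs below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p inputs
y = rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
y_hat = X @ beta_hat                          # projection of y onto col(X)
z_hat = y - y_hat                             # residual vector

# z_hat is orthogonal to every column of X, hence to the whole column space
print(np.allclose(X.T @ z_hat, 0))  # True
```

In practice `np.linalg.lstsq` is preferred over forming the normal equations explicitly, but solving \(\mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{y}\) mirrors the formula above directly.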

Graphical Interpretation and Gram-Schmidt Algorithm#

By regressing \(\mathbf{b}\) on \(\mathbf{a}\) we mean fitting a least-squares model with input \(\mathbf{a}\) and target \(\mathbf{b}\). Consider a single input \(\mathbf{x}\), the constant vector \(\mathbf{1}\), and a target \(\mathbf{y}\):


  • Regress \(\mathbf{x}\) on \(\mathbf{1}\) and compute the resulting residual \(\hat{\mathbf{z}}_1\).

  • Regress \(\mathbf{y}\) on \(\hat{\mathbf{z}}_1\). The result should be familiar.

  • Interpret the above procedure graphically.

  • Generalize this procedure to the case of \(p\) inputs and express the \(j\)th estimate in terms of some residual \(\hat{\mathbf{z}}_j\) as \(\hat{\beta}_j = \hat{\mathbf{z}}_j^\top \mathbf{y} / (\hat{\mathbf{z}}_j^\top \hat{\mathbf{z}}_j)\) (optional).
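The two-step procedure for a single input can be sketched as follows (synthetic data; the true intercept and slope are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)

# Regress x on 1: the fitted value is the mean of x, so the
# residual z1 is simply the centered input
z1 = x - x.mean()

# Regress y on z1 (no intercept): this is the familiar
# simple-regression slope cov(x, y) / var(x)
beta1 = (z1 @ y) / (z1 @ z1)

# It matches the slope from the full OLS fit with an intercept
X = np.column_stack([np.ones(N), x])
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.isclose(beta1, beta_ols[1]))  # True
```

The generalization to \(p\) inputs is successive orthogonalization (a Gram-Schmidt pass over the columns of \(\mathbf{X}\)), where each \(\hat{\mathbf{z}}_j\) is the residual of \(\mathbf{x}_j\) after regressing it on the previously orthogonalized inputs.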

Gauss-Markov Theorem#

We now assume that \(Y = \boldsymbol{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are uncorrelated, with mean zero and constant variance \(\sigma^2\).

Question (optional)

  • Express the variances of the parameter estimates in terms of the orthogonal basis of the column space of \(\mathbf{X}\) constructed above.

  • How does the precision of \(\hat{\beta}_j\) depend on the input data?

Gauss-Markov Theorem

The least-squares estimates of the parameters have the smallest variance among all linear unbiased estimates: OLS is the Best Linear Unbiased Estimator (BLUE).

Let \(\tilde{\boldsymbol{\beta}}\) be any linear unbiased estimate of the parameters. By this we mean that, for any linear combination defined by a vector \(\boldsymbol{a}\),

(194)#\[\begin{equation} \mathrm{Var}(\boldsymbol{a}^\top \hat{\boldsymbol{\beta}}) \le \mathrm{Var}(\boldsymbol{a}^\top \tilde{\boldsymbol{\beta}}). \end{equation}\]

Question (optional)

  • Prove this theorem.
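The theorem can be illustrated by simulation: fix the inputs, redraw the noise many times, and compare the least-squares slope with another linear unbiased estimator. The group-means ("Wald"-type) slope estimator below is one such competitor; its weights depend only on \(x\), it is unbiased for the slope, yet its variance exceeds that of OLS. All data and parameter values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma, beta = 40, 1.0, (1.0, 2.0)  # illustrative true parameters
x = rng.normal(size=N)
z = x - x.mean()

# Competing linear unbiased slope estimate: difference of group means
# above/below the median of x, scaled so the weights sum against x to 1
hi = x > np.median(x)
w = np.where(hi, 1 / hi.sum(), -1 / (~hi).sum()) / (x[hi].mean() - x[~hi].mean())

ols, wald = [], []
for _ in range(5000):
    y = beta[0] + beta[1] * x + sigma * rng.normal(size=N)
    ols.append((z @ y) / (z @ z))  # least-squares slope
    wald.append(w @ y)             # competing linear unbiased estimate

# Both estimators are unbiased, but OLS has the smaller variance
print(np.var(ols) <= np.var(wald))  # True
```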

Confidence Intervals#

We now assume that the error \(\epsilon\) is a Gaussian random variable, i.e., \(\epsilon \sim N(0, \sigma^2)\), and we would like to test the null hypothesis that \(\beta_j = 0\).

Question (optional)

  • Show that \(\hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, (\mathbf{X}^\top \mathbf{X})^{-1} \sigma^2)\).

  • Show that \((N - p - 1) \hat{\sigma}^2 \sim \sigma^2 \ \chi^2_{N - p - 1}\), a chi-squared distribution with \(N - p - 1\) degrees of freedom.

  • Show that \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are statistically independent.
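The sampling distribution of \(\hat{\boldsymbol{\beta}}\) can be checked by Monte Carlo: holding \(\mathbf{X}\) fixed and redrawing the noise, the empirical mean and covariance of the estimates should match \(\boldsymbol{\beta}\) and \(\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\). A simulation sketch (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, sigma = 30, 2, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, -1.0, 2.0])  # illustrative true parameters
XtX_inv = np.linalg.inv(X.T @ X)

# Many OLS fits, each with a fresh noise draw
draws = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=N)))
    for _ in range(20000)
])

# Empirical mean and covariance match beta and sigma^2 (X^T X)^{-1}
print(np.allclose(draws.mean(axis=0), beta, atol=0.02))   # True
print(np.allclose(np.cov(draws.T), sigma**2 * XtX_inv, atol=0.02))  # True
```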

With \(v_j = [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}\), we define the standardized coefficient or Z-score

(195)#\[\begin{equation} z_j = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}}. \end{equation}\]
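The Z-score is straightforward to compute from a fitted model. A sketch on synthetic data (the true coefficients are illustrative; \(\beta_1\) is set to zero, so its Z-score should be small while the others are large):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 60, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p inputs
y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=N)      # beta_1 is truly zero

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat2 = resid @ resid / (N - p - 1)      # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))           # v_j = [(X^T X)^{-1}]_{jj}
z = beta_hat / np.sqrt(sigma_hat2 * v)        # Z-scores
print(z.round(2))
```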

Question (optional)

  • Show that \(z_j\) is distributed as \(t_{N - p - 1}\) (a Student’s-\(t\) distribution with \(N - p - 1\) degrees of freedom).

  • Show that the \(1 - 2 \alpha\) confidence interval for \(\beta_j\) is \((\hat{\beta}_j - z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j}, \hat{\beta}_j + z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j})\), where \(z^{(1 - \alpha)}_{N - p - 1}\) is the \((1 - \alpha)\) percentile of \(t_{N - p - 1}\).
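Putting the pieces together, the interval can be computed with the \(t\) percentile from `scipy.stats` (assuming SciPy is available; the data below are synthetic and reuse the quantities defined above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N, p = 60, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 0.5, 2.0]) + rng.normal(size=N)  # illustrative coefficients

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))
v = np.diag(np.linalg.inv(X.T @ X))

alpha = 0.025                                  # gives a 95% interval
t_crit = stats.t.ppf(1 - alpha, df=N - p - 1)  # (1 - alpha) percentile of t_{N-p-1}
half = t_crit * sigma_hat * np.sqrt(v)         # half-width for each coefficient
for j, (b, h) in enumerate(zip(beta_hat, half)):
    print(f"beta_{j}: [{b - h:.3f}, {b + h:.3f}]")
```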