# Ordinary Least Squares

**Prerequisites**

Define a supervised learning problem.

**Learning Outcomes**

Apply the supervised learning methodology to multiple linear regression.

## Strengths of OLS

- Simple to use;
- Easily interpretable in terms of variances and covariances;
- Can outperform fancier nonlinear models for prediction, especially in situations with:
  - small training samples,
  - low signal-to-noise ratio,
  - sparse data;
- Expandable to nonlinear transformations of the inputs;
- Can be used as a simple reference to learn about machine learning methodologies (supervised learning, in particular).

## Linear Model

\(X_j\) can come from:

- quantitative inputs;
- transformations of quantitative inputs, such as log or square;
- basis expansions, such as \(X_2 = X_1^2\), \(X_3 = X_1^3\);
- interactions between variables, for example, \(X_3 = X_1 \cdot X_2\);
- numeric or “dummy” coding of the levels of qualitative inputs. For example, \(X_j, j = 1, \ldots, 5\), such that \(X_j = I(G = j)\).
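As an illustration, here is a minimal NumPy sketch (with made-up data) of dummy coding a qualitative input and building a polynomial basis expansion:

```python
import numpy as np

# Hypothetical qualitative input G with levels 1, 2, 3;
# dummy coding builds indicator columns X_j = I(G = j).
G = np.array([1, 3, 2, 2, 1])
dummies = (G[:, None] == np.arange(1, 4)).astype(float)  # shape (5, 3)

# Basis expansion of a quantitative input: X, X^2, X^3.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
expansion = np.column_stack([x, x**2, x**3])  # shape (5, 3)
```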

## Residual Sum of Squares

The sample-mean estimate of the Expected Training Error with Squared Error Loss gives the *Residual Sum of Squares* (RSS) as a function of the parameters:

\[
\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i = 1}^N \Big( y_i - \beta_0 - \sum_{j = 1}^p x_{ij} \beta_j \Big)^2.
\]

**Question**

Assume that \(f(x) = \bar{y}\) (the sample mean of the target). The corresponding RSS is called the Total Sum of Squares (TSS). How does the TSS relate to the sample variance \(s_Y^2\) of \(Y\)?

The *coefficient of determination* \(R^2\) relates to the RSS as follows:

\[
R^2(\boldsymbol{\beta}) = 1 - \frac{\mathrm{RSS}(\boldsymbol{\beta})}{\mathrm{TSS}}.
\]
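As a sketch of these quantities, using NumPy on a small made-up dataset (values chosen for illustration only):

```python
import numpy as np

# Made-up data: one input, five observations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Least-squares fit of y ≈ beta0 + beta1 * x.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

rss = np.sum((y - y_hat) ** 2)     # Residual Sum of Squares: 3.6
tss = np.sum((y - y.mean()) ** 2)  # Total Sum of Squares (f(x) = ȳ): 10.0
r2 = 1.0 - rss / tss               # coefficient of determination: 0.64
```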

## How to Minimize the RSS?

Denote by \(\mathbf{X}\) the \(N \times (p + 1)\) input-data matrix.

The 1st column of \(\mathbf{X}\) is associated with the intercept and is given by the \(N\)-dimensional vector \(\mathbf{1}\) with all elements equal to 1.

Then the RSS can be written in matrix form:

\[
\mathrm{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}).
\]

**Question (optional)**

Show that the following parameter estimate minimizes the RSS:

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.
\]

Show that this solution is unique if and only if \(\mathbf{X}^\top\mathbf{X}\) is positive definite.

When could this condition not be fulfilled?
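A minimal NumPy sketch (synthetic data, hypothetical values) of the closed-form estimate, and of how collinear columns break uniqueness:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + inputs
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form solution of the normal equations (assumes XᵀX invertible).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the numerically safer least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Duplicating a column makes XᵀX singular (rank 4 < 5 columns),
# so the RSS minimizer is no longer unique.
X_collinear = np.column_stack([X, X[:, 1]])
rank = np.linalg.matrix_rank(X_collinear.T @ X_collinear)
```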

**Question (optional)**

Express \(R^2(\hat{\boldsymbol{\beta}})\) (above) in terms of explained variance.

Show that \(R^2(\hat{\boldsymbol{\beta}})\) is invariant under linear transformations of the target.

**Remark**

The formula for the optimal coefficients is in closed form, meaning that it can be directly computed using a finite number of standard operations.

Nonlinear models (e.g., neural networks) instead require solving numerical problems iteratively, with finite precision.

Suppose that the inputs \(\mathbf{x}_1, \ldots, \mathbf{x}_p\) (the columns of the data matrix \(\mathbf{X}\)) are orthogonal; that is, \(\mathbf{x}_j^\top \mathbf{x}_k = 0\) for all \(j \ne k\).

**Question**

Show that \(\hat{\beta}_j = \mathbf{x}_j^\top \mathbf{y} / (\mathbf{x}_j^\top \mathbf{x}_j)\) for all \(j\).

Interpret these coefficients in terms of correlations and variances.

How do the inputs influence each other’s parameter estimates in the model?

Find a simple expression of \(R^2(\hat{\beta})\) in that case.
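A quick numerical check (toy data) that with mutually orthogonal input columns, the joint OLS coefficients coincide with the univariate formula \(\hat{\beta}_j = \mathbf{x}_j^\top \mathbf{y} / (\mathbf{x}_j^\top \mathbf{x}_j)\):

```python
import numpy as np

# Toy orthogonal design (no intercept): x1ᵀx2 = 0.
x1 = np.array([-1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0])
X = np.column_stack([x1, x2])

y = np.array([0.0, 1.0, 2.0, 5.0])

# Joint multiple-regression fit.
beta_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

# Univariate coefficients computed column by column.
beta_uni = np.array([xj @ y / (xj @ xj) for xj in X.T])
```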

We now assume that the target is generated by the model \(Y = \boldsymbol{X}^\top \boldsymbol{\beta} + \epsilon\), where the observations of \(\epsilon\) are *uncorrelated*, with *mean zero* and *constant variance* \(\sigma^2\).

**Question (optional)**

Conditional on \(\boldsymbol{X} = \boldsymbol{x}\), show that the observations of \(Y\) are uncorrelated, with mean \(\boldsymbol{x}^\top \boldsymbol{\beta}\) and variance \(\sigma^2\).

Show that \(\mathbb{E}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \boldsymbol{\beta}\) and \(\mathrm{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\).

Show that \(\hat{\sigma}^2 = \sum_{i = 1}^N (y_i - \hat{y}_i)^2 / (N - p - 1)\) is an unbiased estimate of \(\sigma^2\), i.e., \(\mathbb{E}(\hat{\sigma}^2) = \sigma^2\).
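A Monte Carlo sanity check of this unbiasedness claim (synthetic data with a fixed seed; an illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 30, 2
sigma2 = 4.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, -2.0, 0.5])

estimates = []
for _ in range(5000):
    # Simulate y from the linear model with noise variance sigma2.
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # Unbiased variance estimate with N - p - 1 degrees of freedom.
    estimates.append(resid @ resid / (N - p - 1))

mean_estimate = np.mean(estimates)  # should be close to sigma2 = 4
```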

## Confidence Intervals

We now assume that the error \(\epsilon\) is a Gaussian random variable, i.e., \(\epsilon \sim N(0, \sigma^2)\), and we would like to test the null hypothesis that \(\beta_j = 0\).

**Question (optional)**

Show that the \(1 - 2 \alpha\) confidence interval for \(\beta_j\) is

\((\hat{\beta}_j - z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j}, \ \ \ \ \hat{\beta}_j + z^{(1 - \alpha)}_{N - p - 1} \hat{\sigma} \sqrt{v_j})\),

where \(v_j = [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}\) and \(z^{(1 - \alpha)}_{N - p - 1}\) is the \((1 - \alpha)\) percentile of \(t_{N - p - 1}\) (see Supplementary Material).
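The interval can be computed numerically; here is a sketch using NumPy and SciPy's Student-\(t\) quantile on synthetic data (all names and values hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 40, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([0.5, 2.0, 0.0])
y = X @ beta + rng.normal(size=N)  # sigma = 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))

alpha = 0.025                             # 1 - 2*alpha = 95% interval
z = stats.t.ppf(1 - alpha, df=N - p - 1)  # t percentile, not standard normal
v = np.diag(XtX_inv)                      # v_j = [(XᵀX)^{-1}]_{jj}
lower = beta_hat - z * sigma_hat * np.sqrt(v)
upper = beta_hat + z * sigma_hat * np.sqrt(v)
```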

## To go further

Basis expansion models: polynomials, splines, etc. (Chap. 5 in Hastie *et al.* 2009).

## References

- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer.

## Credit

Contributors include Bruno Deremble and Alexis Tantet. Several slides and images are taken from the very good Scikit-learn course.