Flashcards in L2 Linear Regression Deck (6)


1

## Linear Regression, definition

###
Predict yˆ ∈ R (label, response) from x ∈ R^d (features, covariates)

Least squares model: yˆ = w_1^⊤ x + w_2 (bias), where w_1 \in R^d and w_2 \in R (that is, w \in R^{d+1})

Learning: choose (w_1, w_2) based on data {(x^{(i)}, y^{(i)})}_{i=1}^N.

Prediction: given x, predict yˆ = w_1^⊤ x + w_2.

- Closed form solution

- Gaussian probability model

- Ideal for regression, often not well suited for classification
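The model above can be sketched in a few lines of numpy; the weights and data here are hypothetical, chosen only to illustrate the prediction rule yˆ = w_1^⊤ x + w_2:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 100
X = rng.normal(size=(N, d))           # features x^{(i)} in R^d
w1_true = np.array([1.0, -2.0, 0.5])  # hypothetical weights w_1 in R^d
w2_true = 0.3                         # hypothetical bias w_2 in R
y = X @ w1_true + w2_true             # noiseless labels for illustration

# Prediction: given a new x, predict y_hat = w_1^T x + w_2
x_new = np.array([1.0, 1.0, 1.0])
y_hat = w1_true @ x_new + w2_true     # -> 1 - 2 + 0.5 + 0.3 = -0.2
```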

2

## Linear Regression, learning

###
arg min_{w_1 ∈ R^d, w_2 ∈ R} (1/N) Σ_{i=1}^N (1/2) (w_1^⊤ x^{(i)} + w_2 − y^{(i)})^2

Simplification: append a constant-1 feature to each x^{(i)} and stack the augmented inputs as rows of X ∈ R^{N×(d+1)}; then arg min_{w ∈ R^{d+1}} (1/2) ‖Xw − y‖_2^2

Solving (setting the gradient to zero) gives the OLS (ordinary least squares) estimator wˆ = (X^⊤ X)^{−1} X^⊤ y (when the inverse exists)
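A minimal numpy sketch of the OLS estimator, on hypothetical data (note that `np.linalg.solve` on the normal equations is preferred to explicitly inverting X^⊤X):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 2
X_raw = rng.normal(size=(N, d))
# Augment with a column of ones so the bias w_2 is the last coordinate of w
X = np.hstack([X_raw, np.ones((N, 1))])
w_true = np.array([2.0, -1.0, 0.5])       # hypothetical ground truth
y = X @ w_true + 0.01 * rng.normal(size=N)

# OLS: w_hat solves the normal equations (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise and N ≫ d, `w_hat` recovers `w_true` closely.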

3

## Linear Regression, problems/solutions

###
No closed-form solution if X^⊤ X is not invertible; this is guaranteed to happen when N < d + 1 (fewer samples than parameters)

1. Pseudoinverse: wˆ = (X^⊤ X)^† X^⊤ y = X^† y – still satisfies the “derivative condition”, i.e. (X^⊤ X)wˆ = X^⊤ y

2. Ridge regression (regularisation, to ensure there are no zero eigenvalues): arg min_{w ∈ R^{d+1}} (1/2) ‖Xw − y‖_2^2 + (λ/2) ‖w‖_2^2, giving w̃ = (X^⊤ X + λI)^{−1} X^⊤ y

If λ → ∞, then w̃_i → 0 for every coordinate i.
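Both fixes can be sketched in numpy on a deliberately underdetermined problem (N < d + 1, so X^⊤X is singular); the data here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5, 10                      # N < d + 1: X^T X is singular
X = np.hstack([rng.normal(size=(N, d)), np.ones((N, 1))])
y = rng.normal(size=N)

# 1. Pseudoinverse: minimum-norm solution of the normal equations
w_pinv = np.linalg.pinv(X) @ y

# 2. Ridge: X^T X + lam*I is invertible for any lam > 0
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ y)
```

The pseudoinverse solution still satisfies the derivative condition X^⊤(Xw − y) = 0, and the ridge solution has norm no larger than the minimum-norm least squares solution.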

4

## Linear Regression, justifications/interpretations

###
- Geometric interpretation: the residual Xwˆ − y is orthogonal to span(z_1, …, z_{d+1}), the span of the columns of X, because X^⊤(Xwˆ − y) = 0.

- Probabilistic model: y | x ~ N(w^⊤ x, σ^2) – maximizing the likelihood is equivalent to minimizing the least squares objective

- Loss minimization (ERM)
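The geometric interpretation can be checked numerically: the OLS residual is perpendicular to every column of X. A sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 3
X = np.hstack([rng.normal(size=(N, d)), np.ones((N, 1))])
y = rng.normal(size=N)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = X @ w_hat - y

# Orthogonality: X^T (X w_hat - y) should be numerically zero
print(np.abs(X.T @ residual).max())
```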

5

## Empirical Risk Minimization

###
l_ls(y, yˆ) = (1/2)(y − yˆ)^2 is the least squares loss

ERM: arg min_f (1/N) Σ_{i=1}^N l(y^{(i)}, f(x^{(i)}))
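A minimal sketch of ERM with the least squares loss, assuming the predictor is linear, f(x) = w^⊤ x (function names here are illustrative, not standard API):

```python
import numpy as np

def l_ls(y, y_hat):
    """Least squares loss: 1/2 (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2

def empirical_risk(w, X, y, loss=l_ls):
    """Average loss of the linear predictor f(x) = w^T x over the data."""
    return np.mean(loss(y, X @ w))
```

For the least squares loss and a linear predictor class, the ERM objective is exactly the linear regression objective from card 2, so the OLS estimator is its minimizer.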

6