Regularized Linear Regression (CS229)

Recall

For linear regression, we've learned two learning algorithms: one based on gradient descent and the other based on the normal equation.

Now we'll take those two algorithms and generalize them to regularized linear regression.

Referring to the previous post, we know that

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2\right]\tag{1}$$

Our goal is to find the values of $\theta$ that minimize $J(\theta)$.
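
As a rough NumPy sketch of equation $(1)$ (assuming the design matrix `X` already includes the intercept column of ones so that $h_\theta(x)=X\theta$; the function name is mine, not from the course):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Compute J(theta) from equation (1).

    X   : (m, n+1) design matrix whose first column is all ones
    y   : (m,) target vector
    lam : regularization parameter lambda
    """
    m = len(y)
    residuals = X @ theta - y               # h_theta(x^(i)) - y^(i)
    squared_error = residuals @ residuals   # sum of squared errors
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return (squared_error + penalty) / (2 * m)
```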

Rewrite Gradient Descent

Original

$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})$$
$$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$$

Modify

$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)}\tag{2}$$
$$\theta_j:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\tag{3}$$

We write the update for $\theta_0$ separately as $(2)$ because we don't penalize $\theta_0$ (the $j=0$ term) in regularized linear regression, but we do penalize the other $\theta_j$ ($j=1,2,3,\dots,n$).

Note that the term in square brackets in $(3)$ is $\frac{\partial}{\partial\theta_j}J(\theta)$; similarly, the corresponding term in $(2)$ is $\frac{\partial}{\partial\theta_0}J(\theta)$.

Grouping the $\theta_j$ terms, $(3)$ can be rewritten as
$$\theta_j:=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}\tag{4}$$
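A minimal sketch of one regularized gradient-descent step implementing $(2)$ and $(3)$, under the same assumptions as before (NumPy, intercept column of ones in `X`; the function name is my own):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One update of equations (2) and (3); theta_0 is not regularized."""
    m = len(y)
    residuals = X @ theta - y       # h_theta(x^(i)) - y^(i)
    grad = (X.T @ residuals) / m    # unregularized gradient for every theta_j
    reg = (lam / m) * theta         # regularization term lambda/m * theta_j
    reg[0] = 0.0                    # do not penalize theta_0
    return theta - alpha * (grad + reg)
```

Repeating this step until convergence gives the regularized least-squares solution.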

Rewrite Normal Equation

$$
\mathbf{X}=
\begin{bmatrix}
\mathbf{(x^{(1)})^\top} \\
\mathbf{(x^{(2)})^\top} \\
\mathbf{(x^{(3)})^\top} \\
\vdots \\
\mathbf{(x^{(m)})^\top} \\
\end{bmatrix}_{m\times(n+1)}
$$

$$\mathbf{y}=
\begin{bmatrix}
y^{(1)}\\
y^{(2)}\\
y^{(3)}\\
\vdots \\
y^{(m)}\\
\end{bmatrix}_{m\times1}
\in\mathbb{R}^{m}$$
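
For concreteness, here is a toy construction of $\mathbf{X}$ (with the column of ones for $x_0$) and $\mathbf{y}$ in NumPy; the data values are made up purely for illustration:

```python
import numpy as np

# m = 4 examples, n = 2 features (values are arbitrary)
features = np.array([[2.0, 3.0],
                     [1.0, 0.5],
                     [4.0, 1.0],
                     [3.0, 2.0]])
y = np.array([5.0, 2.0, 6.0, 5.5])

# Prepend the x_0 = 1 column so X has shape (m, n+1)
X = np.hstack([np.ones((features.shape[0], 1)), features])
```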

To solve for $\theta$ with the normal equation, set each partial derivative to zero:
$$\frac{\partial}{\partial\theta_j}J(\theta)=0$$

then we obtain
$$\theta=(\mathbf{X^\top}\mathbf{X})^{-1}\mathbf{X^\top}\mathbf{y}$$

Adding the regularization term, the regularized normal equation becomes
$$\theta=\left(\mathbf{X^\top}\mathbf{X}+\lambda
\begin{bmatrix}
0 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
\end{bmatrix}_{(n+1)\times{(n+1)}}\right)^{-1}\mathbf{X^\top}\mathbf{y}\tag{5}$$
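
Equation $(5)$ translates almost directly into NumPy. A sketch, assuming `X` and `y` are built as in the toy example above and using `np.linalg.solve` rather than forming the inverse explicitly:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """Solve equation (5) for theta."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0                       # do not regularize theta_0
    A = X.T @ X + lam * L
    return np.linalg.solve(A, X.T @ y)  # theta = A^{-1} X^T y
```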

Non-invertibility issue

Suppose $m\le n$, i.e. the number of training examples is less than or equal to the number of features. Then in

$$\theta=(\mathbf{X^\top}\mathbf{X})^{-1}\mathbf{X^\top}\mathbf{y}$$

the matrix $\mathbf{X^\top}\mathbf{X}$ is singular (non-invertible), so the unregularized normal equation cannot be used.

If $\lambda\gt0$,
$$\theta=\left(\mathbf{X^\top}\mathbf{X}+\lambda
\begin{bmatrix}
0 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
\end{bmatrix}_{(n+1)\times{(n+1)}}\right)^{-1}\mathbf{X^\top}\mathbf{y}$$

With $\lambda\gt0$, the matrix in parentheses is invertible, so the regularized normal equation can always be solved.
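
A small numeric check of this claim (contrived numbers: $m=2$ examples and $n=3$ features, so $\mathbf{X^\top}\mathbf{X}$ is $4\times4$ with rank at most 2 and therefore singular):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0, 4.0],   # m = 2 rows, n + 1 = 4 columns
              [1.0, 1.0, 0.0, 2.0]])
lam = 1.0

L = np.eye(4)
L[0, 0] = 0.0                          # theta_0 entry left unregularized

print(np.linalg.matrix_rank(X.T @ X))            # 2 -> singular
print(np.linalg.matrix_rank(X.T @ X + lam * L))  # 4 -> invertible
```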