Normal Equation (CS229)

Normal equation

Method to solve for $\theta$ analytically.

1-D

$\theta\in\mathbb{R},$
$$J(\theta)=a\theta^2+b\theta+c$$
Let
$$\frac{d}{d\theta}J(\theta)=0$$
Solve for $\theta$.
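
For this quadratic (with $a>0$ so the cost is convex), setting the derivative to zero gives a closed-form minimizer:

$$\frac{d}{d\theta}J(\theta)=2a\theta+b=0\quad\Longrightarrow\quad\theta=-\frac{b}{2a}$$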

n-D

$\theta\in\mathbb{R^{n+1}},$
$$J(\theta_0,\theta_1,…,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$$
for every $j$, let
$$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,…,\theta_n)=…=0$$
Solve for $\theta_0,\theta_1,…,\theta_n$.

$$\theta=(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\mathbf y,\qquad \theta\in\mathbb{R^{(n+1)\times1}},\ \mathbf X\in\mathbb{R^{m\times(n+1)}},\ \mathbf y\in\mathbb{R^{m\times1}}$$
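
Where this formula comes from (a sketch, assuming the linear hypothesis $h_\theta(x)=\theta^\top x$, with the design matrix $\mathbf X$ and target vector $\mathbf y$ as defined below): write the cost in matrix form and set its gradient to zero,

$$J(\theta)=\frac{1}{2m}\|\mathbf X\theta-\mathbf y\|^2,\qquad \nabla_\theta J(\theta)=\frac{1}{m}\mathbf X^\top(\mathbf X\theta-\mathbf y)=0\;\Longrightarrow\;\mathbf X^\top\mathbf X\,\theta=\mathbf X^\top\mathbf y$$

provided $\mathbf X^\top\mathbf X$ is invertible.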

For $m$ training examples, say $(x^{(1)},y^{(1)}),…,(x^{(m)},y^{(m)})$, with $n$ features each,

$$\mathbf{x^{(i)}}=
\begin{bmatrix}
x_0^{(i)}\\
x_1^{(i)}\\
x_2^{(i)}\\
\vdots \\
x_n^{(i)}\\
\end{bmatrix}_{(n+1)\times1}
\in\mathbb{R^{n+1}}$$

Design matrix $\mathbf{X}$:

$$
\mathbf{X}=
\begin{bmatrix}
\mathbf{(x^{(1)})^\top} \\
\mathbf{(x^{(2)})^\top} \\
\mathbf{(x^{(3)})^\top} \\
\vdots \\
\mathbf{(x^{(m)})^\top} \\
\end{bmatrix}_{m\times(n+1)}
$$
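
A minimal NumPy sketch of the above, assuming a raw feature matrix `data` of shape $m\times n$ and a target vector `y` of length $m$ (the names and numbers are illustrative, not from the lecture):

```python
import numpy as np

def normal_equation(data, y):
    """Solve theta = (X^T X)^{-1} X^T y for linear regression."""
    m = data.shape[0]
    # Design matrix: prepend a column of ones for the intercept term x_0 = 1.
    X = np.hstack([np.ones((m, 1)), data])
    # Solve the linear system X^T X theta = X^T y instead of forming the
    # inverse explicitly (numerically more stable).
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return theta

# Tiny usage example with made-up numbers, roughly y = 1 + 2x:
data = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(normal_equation(data, y))  # approximately [1.0, 2.0]
```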

Gradient descent vs Normal equation

  • Feature scaling is important for gradient descent, but not needed for the normal equation.

  • Gradient descent needs a learning rate $\alpha$ and many iterations. The normal equation, by contrast, needs neither a learning rate nor iteration (see the sketch after this list).

  • For $m$ training examples and $n$ features, gradient descent works well even when $n$ is large. The normal equation, in contrast, requires computing $(\mathbf{X}^\top\mathbf{X})^{-1}$, an $(n+1)\times(n+1)$ inverse that takes roughly $O(n^3)$ time, so it slows down when $n$ is large; if $n\ge 10^4$ or so, gradient descent is usually the better choice.
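
For contrast, a minimal batch gradient descent sketch under the same assumptions (linear hypothesis, `X` already containing the $x_0=1$ column); `alpha` and `num_iters` are illustrative values that need tuning in practice:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression; X includes the x_0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        # Gradient of J(theta) = (1/(2m)) * ||X theta - y||^2
        grad = (X.T @ (X @ theta - y)) / m
        theta -= alpha * grad
    return theta
```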