Optimization algorithm
To compute $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ for a given $\theta$ more efficiently, there are several algorithms besides gradient descent, namely conjugate gradient, BFGS, and L-BFGS.
For the three algorithms beyond gradient descent, the advantages are that there is no need to manually pick $\alpha$ and that they are often faster than gradient descent, but they are also more sophisticated.
These three algorithms have a clever inner loop called a line search algorithm, which automatically tries out different learning rates $\alpha$ and picks a good one for every iteration, so we do not have to choose it ourselves; a minimal sketch of such a line search follows below.
Because a better $\alpha$ is chosen at each iteration, these algorithms end up converging much faster than gradient descent.
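As an illustration only (this code is not from the original notes), here is a minimal Octave sketch of a backtracking line search inner loop: it starts from a large trial $\alpha$ and shrinks it until stepping along the negative gradient gives a sufficient decrease in $J$. The names backtrackingLineSearch and costFun are hypothetical.

```octave
% Illustrative sketch (assumed, not from the notes): shrink alpha until the
% step along the negative gradient sufficiently decreases J (Armijo condition).
% costFun returns J(theta); grad is the gradient column vector at theta.
function alpha = backtrackingLineSearch(costFun, theta, grad)
  alpha = 1;        % initial trial learning rate
  beta  = 0.5;      % shrink factor
  c     = 1e-4;     % sufficient-decrease constant
  J0 = costFun(theta);
  while costFun(theta - alpha * grad) > J0 - c * alpha * (grad' * grad)
    alpha = beta * alpha;   % try a smaller learning rate
  end
end
```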
Example
For $n=2$
Given
$$\mathbf{\theta}=
\begin{Bmatrix}
\theta_1 \\
\theta_2
\end{Bmatrix}$$
$$J(\theta)=(\theta_1-5)^2+(\theta_2-5)^2\tag{1}$$
Derivative terms
$$\frac{\partial}{\partial\theta_1}J(\theta)=2(\theta_1-5)$$
$$\frac{\partial}{\partial\theta_2}J(\theta)=2(\theta_2-5)$$
Since $J(\theta)$ in $(1)$ is a sum of two squared terms, each minimized at zero, we can read off $\min_{\theta}J(\theta)$ directly from $(1)$, i.e.
$$\mathbf{\theta}=
\begin{Bmatrix}
5 \\
5
\end{Bmatrix}$$
Implementation
costFunction returns two values: the cost function value jVal, and the derivative terms as a vector assigned to gradient.
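A minimal Octave sketch of such a costFunction for the example above (the exact code is not preserved in these notes, so treat this as an assumption that follows the usual fminunc convention):

```octave
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;  % J(theta) from (1)
  gradient = zeros(2, 1);                      % 2 x 1 vector of derivative terms
  gradient(1) = 2 * (theta(1) - 5);            % d/dtheta_1 J(theta)
  gradient(2) = 2 * (theta(2) - 5);            % d/dtheta_2 J(theta)
end
```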
fminunc stands for function minimization unconstrained in Octave.
Keep in mind that $\theta\in\mathbb{R}^d$ with $d\ge2$ when calling fminunc.
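A sketch of calling fminunc with the costFunction above; 'GradObj' and 'MaxIter' are standard optimset options, and the variable names initialTheta and optTheta are only illustrative:

```octave
options = optimset('GradObj', 'on', 'MaxIter', 100);  % use our gradient, cap iterations
initialTheta = zeros(2, 1);                           % starting point
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
% optTheta should approach [5; 5] and functionVal should approach 0;
% exitFlag = 1 indicates that fminunc converged.
```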
For general $n$ (parameters $\theta_0,\dots,\theta_n$)
Implementation
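A sketch of the costFunction in this general case, assuming a logistic-regression hypothesis $h_\theta(x)=1/(1+e^{-\theta^Tx})$, an $m\times(n+1)$ design matrix X whose first column is all ones, and an $m\times1$ label vector y (these are assumptions for illustration); the gradient line implements the derivative terms given below.

```octave
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % hypothesis values for all examples
  jVal = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)); % logistic regression cost (assumed)
  gradient = (1 / m) * X' * (h - y);                       % derivative terms, one entry per theta_j
end

% fminunc expects a function of theta only, so wrap X and y in an anonymous function:
options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);                       % (n+1) x 1 starting point
[optTheta, functionVal, exitFlag] = ...
  fminunc(@(t) costFunction(t, X, y), initialTheta, options);
```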
Where
$$\frac{\partial}{\partial\theta_0}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$
$$\frac{\partial}{\partial\theta_1}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_1^{(i)}$$