Advanced Optimization of Logistic Regression (CS229)

Optimization algorithm

Given code that computes $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ for a given $\theta$, there are several algorithms we can use to minimize $J(\theta)$:

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS

The three algorithms listed after gradient descent share two advantages: there is no need to manually pick a learning rate $\alpha$, and they often converge faster than gradient descent. The trade-off is that they are more sophisticated and harder to implement by hand.

Intuitively, these three algorithms have a clever inner loop called a line search algorithm, which automatically tries out different learning rates $\alpha$ and picks a good one for every iteration, so we don't have to choose it ourselves.

Because a better $\alpha$ is chosen at every iteration, these algorithms often end up converging much faster than gradient descent.
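
As an illustration of the idea (not the exact inner loop used by conjugate gradient, BFGS, or L-BFGS), here is a minimal sketch of a backtracking line search in Octave; the function name, the constants rho and c, and the direction argument are illustrative assumptions, with direction expected to be a descent direction such as the negative gradient:

function alpha = backtrackingLineSearch(costFunction, theta, direction)
  % Illustrative sketch only: shrink alpha until the step gives a
  % "sufficient decrease" in the cost (Armijo condition).
  alpha = 1;        % start with a large trial step
  rho   = 0.5;      % shrink factor
  c     = 1e-4;     % sufficient-decrease constant
  [jVal, gradient] = costFunction(theta);
  while costFunction(theta + alpha*direction) > jVal + c*alpha*(gradient'*direction)
    alpha = rho*alpha;   % step too big, shrink and try again
  end
end

For example, with direction = -gradient this picks a step size for one steepest-descent update without us hand-tuning $\alpha$.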

Example

For $n=2$

Given
$$\mathbf{\theta}=
\begin{Bmatrix}
\theta_1 \\
\theta_2
\end{Bmatrix}$$

$$J(\theta)=(\theta_1-5)^2+(\theta_2-5)^2\tag{1}$$

Derivative terms
$$\frac{\partial}{\partial\theta_1}J(\theta)=2(\theta_1-5)$$
$$\frac{\partial}{\partial\theta_2}J(\theta)=2(\theta_2-5)$$

We can find the minimizer of $J(\theta)$ easily by inspecting $(1)$, i.e.
$$\mathbf{\theta}=
\begin{Bmatrix}
5 \\
5
\end{Bmatrix}$$

Implementation

function [jVal, gradient] = costFunction(theta)
  % Cost: J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  % Gradient: partial derivatives of J(theta)
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);
end

costFunction returns two values: the cost jVal and a column vector of the derivative terms assigned to gradient.
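
As a quick sanity check (the test point here is just an illustrative choice), evaluating the function at $\theta = [0; 0]$ should give a cost of $25 + 25 = 50$ and a gradient of $[-10; -10]$:

% Expected: jVal = 50, gradient = [-10; -10]
[jVal, gradient] = costFunction([0; 0])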

options = optimset('GradObj', 'on', 'MaxIter', 100);   % we supply the gradient; cap at 100 iterations
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] ...
    = fminunc(@costFunction, initialTheta, options);

fminunc stands for "function minimization unconstrained" in Octave.
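
On this example the optimizer should recover the analytic minimum; roughly (exact values depend on the Octave version):

% optTheta    -> approximately [5; 5]
% functionVal -> approximately 0
% exitFlag    -> a positive value (1) indicates convergence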

Keep in mind that $\theta\in\mathbb{R}^d$ with $d\ge2$ (fminunc expects a parameter vector, not a single scalar).

For general $n$ (parameters $\theta_0,\theta_1,\dots,\theta_n$)

Implementation

theta = [theta0; theta1; theta2; ...; thetaN];   % theta in R^(n+1)

function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1)   = [code to compute the partial derivative of J(theta) w.r.t. theta0];
  gradient(2)   = [code to compute the partial derivative of J(theta) w.r.t. theta1];
  ...
  gradient(n+1) = [code to compute the partial derivative of J(theta) w.r.t. thetaN];
end

Where

$$\frac{\partial}{\partial\theta_0}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$
$$\frac{\partial}{\partial\theta_1}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_1^{(i)}$$
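
For logistic regression specifically, the placeholders above can be filled in with a vectorized computation. Below is a sketch under the assumption that the training data are available as a design matrix X (size m x (n+1), with $x_0^{(i)}=1$ in the first column) and a label vector y (size m x 1); the function name logisticCostFunction and the wrapper shown afterwards are illustrative, not part of the original notes:

function [jVal, gradient] = logisticCostFunction(theta, X, y)
  % Sketch assuming X is m x (n+1) with a leading column of ones, y is m x 1.
  m = length(y);
  h = 1 ./ (1 + exp(-X*theta));                    % h_theta(x^(i)) for all i at once
  jVal = -(1/m) * (y'*log(h) + (1-y)'*log(1-h));   % logistic regression cost J(theta)
  gradient = (1/m) * X' * (h - y);                 % all n+1 partial derivatives at once
end

% fminunc expects a function of theta only, so wrap the extra arguments:
% options = optimset('GradObj', 'on', 'MaxIter', 100);
% [optTheta, functionVal, exitFlag] = ...
%     fminunc(@(t) logisticCostFunction(t, X, y), initialTheta, options);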