Backpropagation Algorithm for Minimizing the Cost Function (CS229)

Recall the neural network cost function

$$J(\Theta)=-\frac{1}{m}\left[\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log\left(h_{\Theta}(x^{(i)})\right)_k+\left(1-y_k^{(i)}\right)\log\left(1-h_{\Theta}(x^{(i)})_k\right)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

Goal

Find the parameters $\Theta$ that minimize the cost: $\min_{\Theta}J(\Theta)$

To minimize over $\Theta$, we need to compute the cost value $J(\Theta)$ and the partial derivative terms $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$.
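As a concrete illustration, here is a minimal NumPy sketch of evaluating $J(\Theta)$ from the formula above, assuming the hypothesis outputs have already been computed by forward propagation; the names `nn_cost`, `H`, `Y`, `Thetas`, `lam` are illustrative, not from the course.

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized neural-network cost J(Theta).

    H      : (m, K) matrix whose rows are the hypothesis outputs h_Theta(x^(i))
    Y      : (m, K) matrix of one-hot target labels y^(i)
    Thetas : list of weight matrices Theta^(l); column 0 multiplies the bias unit
    lam    : regularization parameter lambda
    """
    m = Y.shape[0]
    # Cross-entropy term: double sum over the m examples and K output units
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization term: squared weights, excluding the bias column of each Theta^(l)
    reg = sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return cost + lam / (2.0 * m) * reg
```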

Gradient computation

One training example

Consider a single training example $(x,y)$.

Recall forward propagation:

[Figure: forward propagation through the network (source: https://www.coursera.org/learn/machine-learning)]
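For reference, a sketch of the forward propagation equations, using the course's convention of a bias unit $a_0^{(l)}=1$ appended to the input and to each hidden layer, with activation function $g$:

$$a^{(1)}=x,\qquad z^{(l+1)}=\Theta^{(l)}a^{(l)},\qquad a^{(l+1)}=g\left(z^{(l+1)}\right),\qquad h_{\Theta}(x)=a^{(L)}$$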

The backpropagation algorithm

Notation: $\delta_j^{(l)}$

Error of node $j$ in layer $l$.
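The errors are computed layer by layer, starting from the output. A sketch of the standard recurrence used in the course, with $.*$ denoting element-wise multiplication:

$$\delta^{(L)}=a^{(L)}-y,\qquad \delta^{(l)}=\left(\Theta^{(l)}\right)^\top\delta^{(l+1)}.*g'\left(z^{(l)}\right)\quad\text{for }l=L-1,\dots,2$$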

The term $g'(z)$ denotes the derivative of $g(z)$; it can be proved that $g'(z)=\mathbf{a}.*(\mathbf{1}-\mathbf{a})$.
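Assuming $g$ is the sigmoid activation $g(z)=\frac{1}{1+e^{-z}}$ used in the course, this identity follows from a short calculation:

$$g'(z)=\frac{e^{-z}}{\left(1+e^{-z}\right)^2}=\frac{1}{1+e^{-z}}\cdot\left(1-\frac{1}{1+e^{-z}}\right)=g(z)\left(1-g(z)\right)$$

Applying this element-wise with $\mathbf{a}=g(\mathbf{z})$ gives $\mathbf{a}.*(\mathbf{1}-\mathbf{a})$.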

There is no $\delta^{(1)}$ term because the inputs $x_j=a_j^{(1)}$ are exactly the given training data, so there is no error to compute for them.

It can also be proved that, ignoring the regularization term (i.e. $\lambda=0$),
$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=a_j^{(l)}\delta_i^{(l+1)}$$

Backpropagation algorithm for $m$ training examples

Given a training set of $m$ examples $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),…,(x^{(m)},y^{(m)})$

Initialize (for all $l,i,j$)
$$\Delta_{ij}^{(l)}=0\tag{1}$$

The accumulators in $(1)$ will be used to compute $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$.

Implementation steps

  • For i = 1 : m
    • Set $a^{(1)}=x^{(i)}$
    • Perform forward propagation to compute $a^{(l)}$ for $l=2,3,…,L$
    • Using $y^{(i)}$, compute $\delta^{(L)}=a^{(L)}-y^{(i)}$, i.e. the hypothesis output minus the target label
    • Compute $\delta^{(L-1)},\delta^{(L-2)},…,\delta^{(2)}$ using the recurrence above
    • $\Delta_{ij}^{(l)}:=\Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}\tag{2}$

The vectorization of $(2)$ is
$$\mathbf{\Delta}^{(l)}:=\mathbf{\Delta}^{(l)}+\mathbf{\delta}^{(l+1)}\left(\mathbf{a}^{(l)}\right)^\top$$
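As a shape check, with the usual convention that $\Theta^{(l)}$ maps layer $l$ to layer $l+1$: $\mathbf{\Delta}^{(l)}$ and $\Theta^{(l)}$ are $s_{l+1}\times(s_l+1)$ matrices, $\mathbf{\delta}^{(l+1)}$ is an $s_{l+1}\times1$ vector, and $\mathbf{a}^{(l)}$ (including its bias unit) is an $(s_l+1)\times1$ vector, so the outer product $\mathbf{\delta}^{(l+1)}\left(\mathbf{a}^{(l)}\right)^\top$ has exactly the dimensions of $\mathbf{\Delta}^{(l)}$.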

  • $D_{ij}^{(l)}:=\frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)}$ if $j\neq0$
  • $D_{ij}^{(l)}:=\frac{1}{m}\Delta_{ij}^{(l)}$ if $j=0$ (the bias column is not regularized)

Recall that $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=D_{ij}^{(l)}$; the sketch below walks through the whole procedure.
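To tie the steps together, here is a minimal NumPy sketch of the full loop, assuming sigmoid activations and one-hot label vectors; the names `backprop_gradients`, `Thetas`, `X`, `Y`, `lam` are illustrative rather than taken from the course. It initializes the accumulators as in $(1)$, runs forward propagation, computes $\delta^{(L)},\dots,\delta^{(2)}$, accumulates as in $(2)$, and finally forms $D^{(l)}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1.0 - a)                            # g'(z) = a .* (1 - a)

def backprop_gradients(Thetas, X, Y, lam):
    """Gradients D^(l) = dJ/dTheta^(l) computed by backpropagation.

    Thetas : list [Theta^(1), ..., Theta^(L-1)], Theta^(l) of shape (s_{l+1}, s_l + 1)
    X      : (m, n) matrix of training inputs x^(i)
    Y      : (m, K) matrix of one-hot labels y^(i)
    lam    : regularization parameter lambda
    """
    m = X.shape[0]
    L = len(Thetas) + 1                             # number of layers
    # (1) initialize the accumulators Delta^(l) to zero
    Deltas = [np.zeros_like(T) for T in Thetas]

    for i in range(m):
        # Forward propagation: a[l-1] stores a^(l), z[l-1] stores z^(l)
        a = [np.concatenate(([1.0], X[i]))]         # a^(1) = [1; x^(i)]
        z = [None]                                  # there is no z^(1)
        for idx, T in enumerate(Thetas):
            z_next = T @ a[idx]                     # z^(l+1) = Theta^(l) a^(l)
            act = sigmoid(z_next)                   # a^(l+1) = g(z^(l+1))
            if idx < len(Thetas) - 1:
                act = np.concatenate(([1.0], act))  # bias unit on hidden layers
            z.append(z_next)
            a.append(act)

        # Backward pass: delta[l-1] stores delta^(l)
        delta = [None] * L
        delta[L - 1] = a[L - 1] - Y[i]              # delta^(L) = a^(L) - y^(i)
        for l in range(L - 1, 1, -1):               # l = L-1, ..., 2
            d = Thetas[l - 1].T @ delta[l]          # (Theta^(l))^T delta^(l+1)
            delta[l - 1] = d[1:] * sigmoid_grad(z[l - 1])   # drop bias row, apply g'

        # (2) accumulate: Delta^(l) += delta^(l+1) (a^(l))^T
        for l in range(1, L):
            Deltas[l - 1] += np.outer(delta[l], a[l - 1])

    # D^(l): average over m and regularize, skipping the bias column j = 0
    Ds = []
    for T, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]
        Ds.append(D)
    return Ds
```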