Cost Function in Neural Networks (CS229)

Notations

$(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),…,(x^{(m)},y^{(m)})$

$m$ training examples.

$L$

Total number of layers in the network.

$s_l$

Number of units (not counting the bias unit) in layer $l$ of the network. So $s_L$ is the number of units in the output layer.

For example, in a binary classification problem, $s_L=1$ and $y$ can only be $0$ or $1$ for the single output unit, meaning the output indicates whether or not the input belongs to a specific class. In a multi-class problem with $K$ distinct classes, $y\in\mathbb{R}^{K}$, there are $K$ output units, so $s_L=K$ and $h_{\Theta}(x)\in\mathbb{R}^{K}$.

As a note, we only use the one-vs-all encoding when the number of classes is at least three, i.e. $K\ge3$ in a multi-class problem (see the encoding sketch at the end of this section).

$K$

The number of units in the output layer; $K=s_L$.

$(h_{\Theta}(x))_i$

The $i^{th}$ component of the output $h_{\Theta}(x)$.
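
In the multi-class case, each label $y^{(i)}$ is represented as a $K$-dimensional one-hot vector with a $1$ in the position of its class. Here is a minimal NumPy sketch of this encoding, assuming $K=3$ classes labelled $0$, $1$, $2$ (the variable names are illustrative):

```python
import numpy as np

# One-vs-all (one-hot) encoding of the labels, assuming K = 3 classes.
labels = np.array([0, 2, 1, 2])   # m = 4 training labels (illustrative)
K = 3
Y = np.eye(K)[labels]             # each y^{(i)} becomes a K-dimensional vector
# Y = [[1, 0, 0],
#      [0, 0, 1],
#      [0, 1, 0],
#      [0, 0, 1]]
```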

Cost function

Regularized Logistic regression

$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log\left(h_{\theta}(x^{(i)})\right)+(1-y^{(i)})\log\left(1-h_{\theta}(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$
where
$j=1,2,3,…,n$, i.e. the bias term $\theta_0$ is not regularized.
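
This cost maps directly to code. Here is a minimal NumPy sketch, assuming a design matrix `X` whose first column is all ones (the bias feature) and labels `y` in $\{0,1\}$; the names and shapes are illustrative, not part of the notes:

```python
import numpy as np

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    Assumed shapes (illustrative):
      X     : (m, n+1) design matrix, first column is the bias 1s
      y     : (m,)     labels in {0, 1}
      theta : (n+1,)   parameters, theta[0] is the bias term
      lam   : scalar   regularization strength lambda
    """
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X @ theta))          # h_theta(x) for all examples
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg = lam / (2 * m) * np.sum(theta[1:] ** 2)  # j = 1..n: skip theta_0
    return cross_entropy + reg
```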

Generalization of Regularized Logistic regression

For $h_{\Theta}(x)\in\mathbb{R}^K$, i.e. $h_{\Theta}(x)$ is a $K$ dimensional vector,

$$J(\Theta)=-\frac{1}{m}\left[\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log\left(h_{\Theta}(x^{(i)})\right)_k+\left(1-y_k^{(i)}\right)\log\left(1-\left(h_{\Theta}(x^{(i)})\right)_k\right)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

Same as above, $i$ starts from $1$, so the bias weights $\Theta_{j0}^{(l)}$ are not regularized.
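
Here is a minimal NumPy sketch of this generalized cost, assuming the weights are stored as a list `Thetas` where `Thetas[l-1]` has shape $(s_{l+1},\,s_l+1)$ with the bias column first, and `Y` holds one-hot labels; all names and shapes are illustrative assumptions:

```python
import numpy as np

def nn_cost(Thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta) with K output units.

    Assumed shapes (illustrative):
      Thetas : list of L-1 matrices, Thetas[l-1] has shape (s_{l+1}, s_l + 1),
               where column 0 multiplies the bias unit
      X      : (m, s_1) inputs
      Y      : (m, K)   one-hot labels
      lam    : scalar   regularization strength lambda
    """
    m = X.shape[0]
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Forward propagation: prepend a bias column before applying each layer's weights.
    A = X
    for Theta in Thetas:
        A = np.hstack([np.ones((m, 1)), A])
        A = sigmoid(A @ Theta.T)
    H = A                                          # (m, K): h_Theta(x) for each example

    # Double sum over examples i and output units k.
    cross_entropy = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

    # Triple sum over layers l, rows j, columns i >= 1 (bias column excluded).
    reg = lam / (2 * m) * sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return cross_entropy + reg
```

For $K=1$ this reduces to the regularized logistic regression cost above, with the network's single output unit playing the role of $h_{\theta}(x)$.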