Simplified Cost Function and Gradient Descent for Logistic Regression (CS229)

Recall

Recall from the previous post that

$$J(\theta)=\frac{1}{m}\sum_{i=1}^m\frac{1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2\\=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})$$

where the cost function was defined as

$$
Cost(h_{\theta}(x),y)=
\begin{cases}
-\log(h_{\theta}(x)),\text{if $y=1$} \\
-\log(1-h_{\theta}(x)),\text{if $y=0$}
\end{cases} \tag{1}
$$

Note that $y$ is always either $0$ or $1$.

Combine the two cases of equation $(1)$

We obtain

$$Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\tag{2}$$

Equation $(2)$ is equivalent to $(1)$. To verify, we substitute the two cases $y=0$ and $y=1$ into $(2)$:

If $y=0$,
$$Cost(h_{\theta}(x),y)=-0\cdot\log(h_{\theta}(x))-(1-0)\log(1-h_{\theta}(x)) \\
= -\log(1-h_{\theta}(x))$$

If $y=1$,
$$Cost(h_{\theta}(x),y)=-1\cdot\log(h_{\theta}(x))-(1-1)\log(1-h_{\theta}(x)) \\
= -\log(h_{\theta}(x))$$
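
As a quick numerical check, here is a minimal Python sketch (the helper names and the sample value of $h_{\theta}(x)$ are mine, not from the lecture) showing that both forms agree for $y=0$ and $y=1$:

```python
import numpy as np

def cost_piecewise(h, y):
    # Equation (1): two separate cases for y = 1 and y = 0
    return -np.log(h) if y == 1 else -np.log(1 - h)

def cost_combined(h, y):
    # Equation (2): single expression covering both cases
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

h = 0.7  # an arbitrary hypothesis output in (0, 1)
for y in (0, 1):
    print(y, cost_piecewise(h, y), cost_combined(h, y))  # the two values match
```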

Rewrite Logistic Regression Cost Function

$$J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)}) \\
=-\frac{1}{m}\left[\sum_{i=1}^my^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]\tag{3}$$

The reason for choosing this particular cost function is not derived in this class; it comes from statistics, using the principle of maximum likelihood estimation. It can be shown that the cost function we have selected is both efficient and convex, so gradient descent does not get stuck in local optima.
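
For illustration, a vectorized NumPy sketch of equation $(3)$ might look like the following (the names `sigmoid`, `compute_cost`, `X`, `y`, and `theta` are assumptions for this example, not part of the course material):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    """Equation (3): J(theta) for logistic regression.

    X is an (m, n+1) design matrix whose first column is all ones,
    y is an (m,) vector of 0/1 labels, theta is an (n+1,) parameter vector.
    """
    m = y.size
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every example at once
    return -(1.0 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))
```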

Fitting the parameters $\theta$

$$\min_{\theta}J(\theta)$$

Make a prediction given a new $x$

Output
$$h_{\theta}(x) = \frac{1}{1+e^{-\mathbf{\theta^\top x}}}$$
which is interpreted as the estimated probability that $y=1$.
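
As a concrete sketch (the parameter values, the helper names, and the $0.5$ decision threshold below are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x_new):
    """Estimated probability that y = 1 for a new input x_new."""
    return sigmoid(theta @ x_new)  # h_theta(x) = 1 / (1 + e^{-theta^T x})

theta = np.array([-1.0, 2.0])      # hypothetical learned parameters
x_new = np.array([1.0, 0.8])       # first entry is the intercept term x_0 = 1
p = predict_proba(theta, x_new)
label = 1 if p >= 0.5 else 0       # threshold the probability to get a class label
print(p, label)
```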

Gradient descent

The way we’re going to minimize the cost function $J(\theta)$ is gradient descent.

Repeat until convergence (updating all $\theta_j$ simultaneously) to get $\min_{\theta}J(\theta)$:

$$\theta_j:=\theta_j-\alpha\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})\,x_j^{(i)}\tag{4}$$

The update rule $(4)$ above looks identical to the one for linear regression! The only difference between linear regression and logistic regression is the hypothesis $h_{\theta}(x)$:

For linear regression,

$$h_{\theta}(x)=\mathbf{\theta^\top x}$$

On the other hand, for logistic regression,

$$h_{\theta}(x)=\frac{1}{1+e^{\mathbf{-\theta^\top x}}}$$

where

$$\mathbf{\theta}=
\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_n
\end{bmatrix}$$
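
A direct, loop-based translation of update rule $(4)$ might look like the sketch below (the learning rate `alpha` and the design matrix `X`, with one training example per row, are assumptions; note that every $\theta_j$ is updated simultaneously from the same predictions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha=0.01):
    """One pass of update rule (4), updating every theta_j simultaneously."""
    m, n = X.shape
    h = sigmoid(X @ theta)                     # predictions for all m examples
    new_theta = theta.copy()
    for j in range(n):
        grad_j = np.sum((h - y) * X[:, j])     # sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        new_theta[j] = theta[j] - alpha * grad_j
    return new_theta
```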

Rather than iterating over each $\theta_j$ in a for loop as above, we prefer a vectorized implementation.

Vectorized implementation
$$\mathbf{\theta}:=\mathbf{\theta}-\alpha\sum_{i=1}^m\left[\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x^{(i)}\right]$$
which updates the entire parameter vector $\mathbf{\theta}$ in a single step.
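
In NumPy this can be sketched as a single matrix expression, assuming the training examples are stacked as rows of a design matrix `X` (the learning rate and the fixed iteration count below are placeholders, not values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(theta, X, y, alpha=0.01, num_iters=1000):
    """Repeatedly apply theta := theta - alpha * X^T (h_theta(X) - y)."""
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                   # all hypotheses in one shot
        theta = theta - alpha * (X.T @ (h - y))  # updates every theta_j together
    return theta
```

Here `X.T @ (h - y)` computes the sum in $(4)$ for every component of $\theta$ at once, so no explicit loop over $j$ is needed.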