Regularization of Cost Function (CS229)

Small values for parameters $\theta_0,\theta_1,\theta_2,…,\theta_n$

  • simpler hypothesis
  • less prone to overfitting

    Example

    Housing

  • features : $x_1,x_2,x_3,..,x_{100}$
  • parameters : $\theta_0,\theta_1,\theta_2,…,\theta_{100}$

Because we do not know in advance which parameters could be shrunk toward zero $(\approx0)$ without affecting the hypothesis, we add a term that penalizes every parameter $\theta_j$, so that the optimization itself weighs how much each one contributes to the value of $J(\theta)$.

Rewrite $J(\theta)$
$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2\right]\tag{1}$$

where
$$\theta=\theta_1,\theta_2,…,\theta_{100}$$

$\lambda\sum_{j=1}^n\theta_j^2$ is called the regularization term, and $\lambda$ is called the regularization parameter. Note that the sum starts at $j=1$, so by convention $\theta_0$ is not penalized.
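As a minimal sketch of equation (1), assuming NumPy, a design matrix `X` whose first column is all ones, and hypothetical names such as `regularized_cost` and `lam`:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Compute J(theta) from equation (1) for linear regression.

    theta : (n+1,) parameter vector, theta[0] is the intercept
    X     : (m, n+1) design matrix whose first column is all ones
    y     : (m,) vector of targets
    lam   : regularization parameter lambda
    """
    m = len(y)
    residuals = X @ theta - y                 # h_theta(x^(i)) - y^(i)
    fit_term = np.sum(residuals ** 2)         # sum of squared errors
    reg_term = lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return (fit_term + reg_term) / (2 * m)
```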

What $\lambda$ does is control a trade-off between two different goals. The first goal is to fit the training data well; the second goal is to keep the parameters small, and therefore keep the hypothesis simple, to avoid overfitting.

If $\lambda$ is set to an extremely large value, say $10^{10}$, and the hypothesis is $h_{\theta}(x)=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3+\theta_4x^4$, then minimizing $J(\theta)$ drives all the penalized parameters toward zero $(\theta_1\approx0,\theta_2\approx0,\theta_3\approx0,\theta_4\approx0)$.

Then $h_{\theta}(x)$ will be

$$h_{\theta}(x)=\theta_0$$

So the result is underfitting: the hypothesis is just a flat horizontal line.
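To illustrate, here is a small sketch (not from the note) that fits the degree-4 polynomial above on synthetic data using the regularized normal equation; with $\lambda=10^{10}$ the fitted $\theta_1,…,\theta_4$ come out essentially zero, so $h_{\theta}(x)\approx\theta_0$:

```python
import numpy as np

# Illustrative only: synthetic data and the regularized normal equation
# theta = (X^T X + lambda * L)^(-1) X^T y, where L is the identity with
# L[0, 0] = 0 so that theta_0 is not penalized.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.5 + 2.0 * x - 0.5 * x**3 + 0.1 * rng.standard_normal(50)

X = np.column_stack([x**p for p in range(5)])  # columns: 1, x, x^2, x^3, x^4
lam = 1e10

L = np.eye(X.shape[1])
L[0, 0] = 0
theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)
print(theta)  # theta_1..theta_4 are ~0, so h_theta(x) is roughly the flat line theta_0
```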

This situation leads to the idea of how to choose an appropriate model, which is discussed in the next post.