Machine Learning (Stanford CS229)

Regression problem vs Classification problem

Regression problem – Predict real-valued output

  • We can treat the data as a set of real values.
  • For example, we have a large inventory of identical items and we want to predict how many of these items will sell over the next 3 months.

Linear regression problem

One variable – Squared error function

  • Expression:
    • $m$ = number (size) of training examples
    • $x$ = input variable / features
    • $y$ = output variable / target variable
    • $(x,y)$ = one training example
    • $(x^{(i)}, y^{(i)})$ = $i^{th}$ training example
  • Training flow: Training set -> Learning Algorithm -> $h$(hypothesis)
  • Predicting flow: $x$ (input) -> $h$ -> $y$ (output)
  • $h_{\theta}(x) = \theta_0 + \theta_1x$, written in shorthand: $h(x) = \theta_0+\theta_1x$
  • $h(x)$ maps from $x$’s to $y$’s
  • We consider $h(x)$ to be a linear function, i.e. this is linear regression with one variable.
  • Choosing parameters $\theta_0$ and $\theta_1$: choose them so that $h_{\theta}(x)$ is close to $y$ for our training examples $(x, y)$.

$$ J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m [h_{\theta}(x^{(i)}) - y^{(i)}]^2$$

$\min_{(\theta_0,\theta_1)}J(\theta_0,\theta_1)$ means that we’ll find the $(\theta_0,\theta_1)$ values that minimize the above expression.

We call $J(\theta_0,\theta_1)$ the cost function in this example; it is also known as the squared error function. We’ll discuss this function in more depth in a later paragraph.

The squared error function is the most common cost function for regression problems.
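
As a concrete illustration, here is a minimal Python sketch of the hypothesis $h_{\theta}(x)$ and the cost function $J(\theta_0,\theta_1)$ defined above; the names `h`, `cost_J`, `xs`, and `ys` are illustrative assumptions, not from the lecture.

```python
def h(theta0, theta1, x):
    """Hypothesis for one-variable linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost_J(theta0, theta1, xs, ys):
    """Squared error cost: J(theta0, theta1) = (1/2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(xs)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Example: the points below lie exactly on the line y = 1 + 2x,
# so the cost is zero at (theta0, theta1) = (1, 2) and larger anywhere else.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
print(cost_J(1.0, 2.0, xs, ys))  # 0.0
print(cost_J(0.0, 0.0, xs, ys))  # ~13.83, since h(x) = 0 everywhere
```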

One variable – Gradient descent

  • General case
    • $\theta_j \in \Re, \alpha > 0$
    • Function $J(\theta_0,\theta_1,…,\theta_n)$
    • Goal $\min_{\theta_0,\theta_1,…,\theta_n}J(\theta_0,\theta_1,…,\theta_n)$
  • Consider the case of two parameters, $J(\theta_0,\theta_1)$
    • Start with some $\theta_0,\theta_1$
    • Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum.
    • If we plot $J(\theta_0,\theta_1)$, we get a 3-D surface with three axes: $\theta_0$, $\theta_1$ and $J(\theta_0,\theta_1)$. Pick a point on the surface, look around, and take a step in the direction of steepest descent, i.e. the direction along which $J(\theta_0,\theta_1)$ decreases fastest. Repeat this rule again and again until we reach a minimum of $J(\theta_0,\theta_1)$.
    • Repeat until convergence, for $j=0,1$
      $${\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)}$$
      • $\alpha$ is the learning rate; it represents how aggressively we step downhill to the next point.
      • $\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ is the slope (tangent) of the cost curve along $\theta_j$. When the slope is positive, the update decreases $\theta_j$.
      • $:=$ stands for assignment; for example, $a:=b$ sets the value of $a$ to the value of $b$.
      • $=$ stands for a truth assertion; $a=b$ asserts that $a$ and $b$ are equal.
      • update $\theta_0$ and $\theta_1$ simultaneously (see the sketch after this list), i.e.
        $$temp0:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) \tag{1}$$
        $$temp1:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) \tag{2}$$
        $$\theta_0:=temp0 \tag{3}$$
        $$\theta_1:=temp1 \tag{4}$$
      • If $\alpha$ is too small, we’ll take tiny steps toward the next point, and gradient descent can be slow. On the other hand, if $\alpha$ is too large, we’ll take big steps toward the minimum of $J(\theta_0,\theta_1)$; this can overshoot the minimum, and gradient descent may fail to converge or even diverge.
      • Notice that the assignment equations $(3)$ and $(4)$ must come after both temporary values have been computed. If we swapped equations $(2)$ and $(3)$, we would use the updated $\theta_0$ to calculate $temp1$, whereas the correct update uses the original $\theta_0$.
    • Gradient descent may converge to different minima of $J(\theta_0,\theta_1)$ depending on which point we choose as the initial point.
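
For the one-variable cost function above, the partial derivatives in the update rule work out to

$$\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m [h_{\theta}(x^{(i)}) - y^{(i)}]$$

$$\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m [h_{\theta}(x^{(i)}) - y^{(i)}]\,x^{(i)}$$

Below is a minimal Python sketch of the full update loop, including the simultaneous update through temporaries; the function name `gradient_descent`, the starting values $\theta_0=\theta_1=0$, and the fixed iteration count are illustrative assumptions, not from the lecture.

```python
def gradient_descent(xs, ys, theta0=0.0, theta1=0.0, alpha=0.01, num_iters=1000):
    """Batch gradient descent for one-variable linear regression (a sketch)."""
    m = len(xs)
    for _ in range(num_iters):
        # Partial derivatives of J(theta0, theta1) with respect to theta0 and theta1
        d0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m
        d1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys)) / m
        # Simultaneous update: compute both temporaries before assigning either theta
        temp0 = theta0 - alpha * d0
        temp1 = theta1 - alpha * d1
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Example: recover the line y = 1 + 2x from noiseless data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
print(gradient_descent(xs, ys, alpha=0.05, num_iters=5000))  # approximately (1.0, 2.0)
```

With a small enough learning rate this converges toward the least-squares line; setting `alpha` too large in the same sketch reproduces the overshooting/divergence behaviour described above.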

Classification problem – Discrete-valued output

  • We map each situation to a discrete value (label).
  • For example, we would like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised.
    • In this example, we can label an account 0 if it has not been hacked and 1 if it has.

Supervised Learning

  • A set of “right answers” is given for the training data. We have to “tell” the computer what “the right answer” is for each example and let it figure out how to predict the output on its own.
    E.g. classification problems (mapping inputs to labels)
  • We put labels on the data so the algorithm can learn to recognize them.

Unsupervised Learning

  • We give data WITHOUT labels.
  • We do not say what is right and what is wrong; we ONLY provide the data and ask the computer to find some structure in it.
  • For example, given a set of news articles found on the web, group them into sets of articles about the same story.
