(Week 3) SVM

Machine Learning Course, Week 3


Support Vector Machine

Margin and Max-Margin Classifier

  • Margin: Smallest gap between the two classes
  • Max-Margin Classifier: A classifier that maximizes the margin.
  • What do we need?
  1. How to measure the distance from a point to a plane?
  2. How to formulate a max margin problem?
  3. How to solve the max margin problem?

Recall: Linear Discriminant Function

  • In high dimensions, the decision boundary of
\[g(x)={\bf{w}}^\top x + {\bf{w}}_0\]

is a hyperplane.

  • Separating Hyperplane:
\[\mathcal{H}=\{x \mid g(x)=0 \}=\{x \mid {\bf{w}}^{\top}x+{\bf{w}}_0=0\}\]
  • $x\in\mathcal{H}$ means $x$ is on the decision boundary.

  • ${\bf{w}}/\|{\bf{w}}\|_2$ is the unit normal vector of $\mathcal{H}$.
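
A minimal numpy sketch of these definitions; the values of `w`, `w0`, and the test point `x` below are made up for illustration.

```python
import numpy as np

# Hypothetical parameters of a linear discriminant g(x) = w^T x + w_0.
w = np.array([1.0, -1.0])
w0 = -1.0

def g(x):
    """Evaluate the linear discriminant at x."""
    return w @ x + w0

# Unit normal vector of the separating hyperplane H = {x : g(x) = 0}.
unit_normal = w / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(g(x))          # 0.0 -> this x lies on the decision boundary H
print(unit_normal)   # [ 0.7071 -0.7071]
```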

Recall: Distance from $x_0$ to $g(x)=0$

  • Pick a point $x_p$ on $\mathcal{H}$
  • $x_p$ is the closest point to $x_0$.
  • $x_0-x_p$ is the normal direction.
  • So, for some scalar $\eta>0$,
\[x_0-x_p=\eta{w\over{\|w\|}_2}\]
  • $x_p$ is on $\mathcal{H}$. So
\[g(x_p)={\bf{w}}^\top x_p + {\bf{w}}_0=0\]

Therefore, we can show that

\[g(x_0)={\bf{w}}^\top x_0+{\bf{w}}_0={\bf{w}}^\top\Big(x_p+\eta{{\bf{w}}\over\|{\bf{w}}\|_2}\Big)+{\bf{w}}_0=g(x_p)+\eta\|{\bf{w}}\|_2=\eta\|{\bf{w}}\|_2\]
  • So distance is
\[\eta={g(x_0) \over {\|w\|_2}}\]

Conclusion:

  • Distance from $x_0$ to $\mathcal{H}$: $g(x_0)/\|w\|_2$, measured along the normal direction $w/\|w\|_2$.

  • Projection of $x_0$ onto $\mathcal{H}$:

\[x_p=x_0-{g(x_0)\over\|w\|_2}\cdot{w\over\|w\|_2}\]
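
A quick numerical check of the distance formula and the projection $x_p$, again using made-up values of $(w, w_0)$ and a point $x_0$:

```python
import numpy as np

w = np.array([1.0, -1.0])   # hypothetical hyperplane parameters
w0 = -1.0

def g(x):
    """Linear discriminant g(x) = w^T x + w_0."""
    return w @ x + w0

x0 = np.array([3.0, 0.0])
norm_w = np.linalg.norm(w)

eta = g(x0) / norm_w                        # distance from x0 to H
xp = x0 - (g(x0) / norm_w) * (w / norm_w)   # projection of x0 onto H

print(eta)     # 1.4142... = 2 / sqrt(2)
print(g(xp))   # ~0.0, so xp indeed lies on the hyperplane
```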

Unsigned Distance

  • We define the distance between a data point $x_j$ and a separating hyperplane as
\[d(x_j, \mathcal{H})={\mid g(x_j) \mid \over\|{\bf{w}}\|_2}={\mid {\bf{w}}^\top x_j+{\bf{w}}_0\mid \over\|{\bf{w}}\|_2}\]
  • $d(x_j, \mathcal{H})$ is called the unsigned distance.

Margin

  • Among all the unsigned distances, pick the smallest one.

  • Margin: $\gamma$ such that

\[\gamma=\min_{j=1,...,N}d(x_j, \mathcal{H})\]
  • Without loss of generality, assume $x_N$ is the closest point.
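
A short sketch of this definition: compute the unsigned distance of every training point and take the smallest one. The data points and the hyperplane parameters below are made-up values.

```python
import numpy as np

# Made-up training points and a hypothetical hyperplane (w, w0).
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
w = np.array([1.0, -1.0])
w0 = -1.0

# Unsigned distances d(x_j, H) = |w^T x_j + w_0| / ||w||_2 for every point.
unsigned = np.abs(X @ w + w0) / np.linalg.norm(w)

gamma = unsigned.min()        # the margin: smallest unsigned distance
closest = unsigned.argmin()   # index of the closest training point

print(unsigned)   # [0.7071 0.7071 0.7071 1.4142]
print(gamma)      # 0.7071 = 1/sqrt(2)
```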

More about Margin

  • Margin: $\gamma$ such that
\[\gamma=\min_{j=1,...,N}d(x_j, \mathcal{H})\]
  • $\gamma$ depends on $(w, w_0)$.

  • $\gamma$ always exists because the training set is finite.

  • $\gamma \geq 0$, and is zero when $x_N$ is on the boundary.

Signed Distance

  • $d(x_j, \mathcal{H})$ is unsigned

  • So $\gamma$ does not tell whether a point $x_j$ is correctly classified or not

  • Assume that the labels are defined as $y_j \in \{-1, +1\}$

  • Then define a signed distance

\[d_{\text{signed}}(x_j, \mathcal{H})=y_j\left({w^\top x_j + w_0 \over \|w\|_2}\right)\]

  • If $d_{\text{signed}}(x_j, \mathcal{H}) \geq 0$, then $x_j$ is correctly classified.

  • If $d_{\text{signed}}(x_j, \mathcal{H}) < 0$, then $x_j$ is incorrectly classified.

  • Recall perceptron loss:
\[\mathcal{L}(x_j)=\max \{-y_j(w^\top x_j+w_0), 0 \}\]
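
A sketch contrasting the signed distance with the perceptron loss on made-up data; for a separating hyperplane, all signed distances are positive and the perceptron loss is zero everywhere.

```python
import numpy as np

# Made-up data, labels in {-1, +1}, and a hypothetical separating hyperplane.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = np.array([1.0, -1.0])
w0 = -1.0

scores = X @ w + w0                              # w^T x_j + w_0
signed = y * scores / np.linalg.norm(w)          # signed distances d_signed(x_j, H)
perceptron_loss = np.maximum(-y * scores, 0.0)   # max{-y_j(w^T x_j + w_0), 0}

print(signed)            # all positive -> every point is on its correct side
print(perceptron_loss)   # all zeros, since this hyperplane separates the data
```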

### Unsigned vs. Signed Distance

  • Unsigned distance: Just the distance
  • Signed distance: Distance plus whether the point is on the correct side of the boundary

### Max-Margin Objective

  • Assumptions: Linearly separable
  • This means
\[y_j({w^\top x_j + w_0 \over \|w\|_2}) \geq \gamma, j=1,...,N\]
  • All training samples are correctly classified.

  • All training samples are at least $\gamma$ away from the boundary.

  • So the max-margin classifier is

\[\text{maximize}_{w, w_0} \gamma\]

subject to $y_j({w^\top x_j + w_0\over \|w\|_2}) \geq \gamma, j=1,...,N$

### Unfortunately…

  • If I can solve the optimization problem
\[\text{maximize}_{w, w_0} \gamma\]

subject to

\[y_j({w^\top x_j + w_0\over \|w\|_2}) \geq \gamma, j=1,...,N\]

Then I can obtain a good SVM.

  • But solving the optimization is not easy!

  • $\gamma$ depends on $(w, w_0)$. If you change $(w, w_0)$, you also change $\gamma$.

  • There is a term $1/\|w\|_2$, which makes the constraint nonlinear.

Trick 1: Scaling

  • The optimization is
\[\text{maximize}_{w, w_0} \gamma\]

subject to

\[y_j({w^\top x_j + w_0\over \|w\|_2}) \geq \gamma, j=1,...,N\]
  • Let $x_N$ be the point closest to the boundary

  • Define $\tilde{\gamma}$ as the smallest value of $\mid w^\top x_j + w_0 \mid$, which is attained at $x_N$:

\[\tilde{\gamma}=\mid w^\top x_N+w_0\mid\]
  • Then, we can show that
\[\gamma={\mid w^\top x_N + w_0 \mid \over \|w\|_2}={\tilde{\gamma}\over\|w\|_2}\]
  • So we can turn this optimization
\[\text{maximize}_{w, w_0} \gamma\]

subject to

\[y_j({w^\top x_j + w_0\over \|w\|_2}) \geq \gamma, j=1,...,N\]
  • into this optimization
\[\text{maximize}_{w, w_0} {\tilde{\gamma}\over \|w\|_2}\]

subject to

\[y_j({w^\top x_j + w_0 \over \|w\|_2}) \geq {\tilde{\gamma} \over \|w\|_2}, \quad j=1,...,N\]
  • $1/\|w\|_2$ goes into the objective function!

Eliminate $\tilde{\gamma}$

  • How about we turn this optimization
\[\text{maximize}_{w, w_0}{\tilde{\gamma} \over \|w\|_2 }\]

subject to $y_j(w^\top x_j + w_0) \geq \tilde{\gamma}$, $j=1,…,N$

  • Into this optimization?
\[\text{maximize}_{w / \tilde{\gamma}, w_0 / \tilde{\gamma}}{1 \over \|w / \tilde{\gamma} \|_2 }\]

subject to

\[y_j\left(\left({w \over \tilde{\gamma}}\right)^\top x_j + {w_0 \over \tilde{\gamma}}\right) \geq 1, \quad j=1,...,N\]
  • You can redefine the variables $w \leftarrow {w \over \tilde{\gamma}}$ and $w_0 \leftarrow {w_0 \over \tilde{\gamma}}$

  • This gives you

\[\text{maximize}_{w, w_0} {1 \over \|w\|_2}\]

subject to $y_j(w^\top x_j + w_0) \geq 1$, $j=1, …, N$

  • So $\tilde{\gamma}$ is eliminated!

  • Geometrically, this is just a rescaling of $(w, w_0)$; the hyperplane itself is unchanged.
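
A numerical check of the scaling argument on made-up data: dividing $(w, w_0)$ by $\tilde{\gamma}$ leaves the hyperplane and the margin unchanged, while the closest points end up satisfying $y_j(w^\top x_j + w_0) = 1$.

```python
import numpy as np

# Made-up separable data and a hypothetical separating hyperplane.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = np.array([3.0, -3.0])    # any positive multiple describes the same hyperplane
w0 = -3.0

gamma_tilde = np.min(np.abs(X @ w + w0))        # smallest |w^T x_j + w_0|
w_s, w0_s = w / gamma_tilde, w0 / gamma_tilde   # rescaled parameters

# The geometric margin is unchanged by the rescaling ...
print(np.min(np.abs(X @ w + w0)) / np.linalg.norm(w))        # 0.7071...
print(np.min(np.abs(X @ w_s + w0_s)) / np.linalg.norm(w_s))  # 0.7071...
# ... but after rescaling the closest points sit exactly at y_j(w^T x_j + w_0) = 1.
print(np.min(y * (X @ w_s + w0_s)))                          # 1.0
```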

Trick 2. Max to Min

  • You want to solve
\[\text{maximize}_{w, w_0} {1 \over \|w\|_2}\]

subject to

\[y_j(w^\top x_j + w_0) \geq 1, j=1, ..., N\]
  • How about
\[\text{minimize}_{w, w_0} {1 \over 2} \|w\|^2\]

subject to

\[y_j(w^\top x_j + w_0) \geq 1, j=1, ..., N\]
  • This is a quadratic minimization with linear constraints.

  • Convex

  • Solution is called a support vector machine.
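
This QP can be handed to any convex solver. Below is a minimal sketch using cvxpy (assuming it is installed) on a small made-up, linearly separable dataset; the variable names are arbitrary.

```python
import numpy as np
import cvxpy as cp

# Made-up linearly separable 2-D data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(20, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# Hard-margin SVM: minimize (1/2)||w||^2  s.t.  y_j (w^T x_j + w_0) >= 1.
w = cp.Variable(2)
w0 = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value)               # learned hyperplane parameters
print(1.0 / np.linalg.norm(w.value))   # margin of the trained classifier
```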

Hand-Crafted Example

  • You have four data points and
\[x_1=\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad x_2=\begin{bmatrix} 2 \\ 2 \end{bmatrix}, \quad x_3=\begin{bmatrix} 2 \\ 0 \end{bmatrix}, \quad x_4=\begin{bmatrix} 3 \\ 0 \end{bmatrix}\]
  • Labels are
\[y_1=-1, y_2=-1,y_3=+1,y_4=+1\]
  • Weight vector $w=(w_1, w_2)$ and offset $w_0$.

  • The constraints are $y_j(w^\top x_j + w_0) \geq 1$:

(1) $-w_0 \geq 1$

(2) $-(2w_1+2w_2+w_0) \geq 1$

(3) $2w_1+w_0 \geq 1$

(4) $3w_1+w_0 \geq 1$

  • Combine (1) and (3): $w_1 \geq 1$
  • Combine (2) and (3): $w_2 \leq -1$

  • Objective function is ${1\over2}\|w\|^2$
  • Since $w_1 \geq 1$ and $w_2 \leq -1$ imply $w_1^2 \geq 1$ and $w_2^2 \geq 1$, we can show that
\[{1 \over 2} \|w\|^2={1\over 2}(w^2_1+w^2_2) \geq 1\]
  • Equality holds when
\[w^*_1=1, w^*_2=-1\]
  • So $(w^*_1, w^*_2)=(1,-1)$ is a minimizer of the objective.

  • Can further show that $w^*_0=-1$

  • All constraints are satisfied at $(w^*_1, w^*_2, w^*_0)=(1, -1, -1)$

  • Separating hyperplane:

\[h(x)=\text{sign}(x_1-x_2-1)\]
  • Margin
\[{1\over\|w^*\|_2}={1\over\sqrt{2}}\]
  • Boxed data points ($x_1$, $x_2$, $x_3$): meet constraints (1), (2), and (3) with equality.

  • These are the support vectors.
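
A quick numerical check of this hand-crafted solution, using plain numpy (no solver needed):

```python
import numpy as np

# The four data points and labels from the hand-crafted example.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Claimed minimizer.
w_star = np.array([1.0, -1.0])
w0_star = -1.0

activations = y * (X @ w_star + w0_star)   # y_j (w^T x_j + w_0)
print(activations)                          # [1. 1. 1. 2.] -> all constraints satisfied
print(np.isclose(activations, 1.0))         # tight for x1, x2, x3: the support vectors
print(1.0 / np.linalg.norm(w_star))         # margin = 1/sqrt(2) ~= 0.7071
```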