Machine Learning Course, Week 3
Support Vector Machine
Margin and Max-Margin Classifier
- Margin: Smallest gap between the two classes
- Max-Margin Classifier: A classifier that maximizes the margin.
- What do we need?
- How to measure the distance from a point to a plane?
- How to formulate a max margin problem?
- How to solve the max margin problem?
Recall: Linear Discriminant Function
- In high dimensions, the linear discriminant function is $g(x)=w^\top x+w_0$, and the set $\{x : g(x)=0\}$ is a hyperplane. How do we measure the distance from a point to it?
- Separating Hyperplane:
\[\mathcal{H}=\{x : w^\top x+w_0=0\}\]
- $x\in\mathcal{H}$ means $x$ is on the decision boundary.
- ${w}/\|{w}\|_2$ is the normal vector of $\mathcal{H}$.
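A minimal numpy sketch (my own illustration, not from the lecture) of these two facts: points $x$ with $g(x)=0$ lie on $\mathcal{H}$, and the difference of any two such points is orthogonal to $w$, so $w/\|w\|_2$ is the unit normal. The specific $w$, $w_0$, and points are made-up values.

```python
import numpy as np

# Hypothetical hyperplane parameters (for illustration only)
w = np.array([3.0, 4.0])
w0 = -5.0

# Two points on H = {x : w^T x + w0 = 0}
x1 = np.array([1.0, 0.5])    # 3*1 + 4*0.5 - 5 = 0
x2 = np.array([3.0, -1.0])   # 3*3 + 4*(-1) - 5 = 0

g = lambda x: w @ x + w0
print(g(x1), g(x2))          # both 0: the points lie on H

# The difference of two points on H is orthogonal to w,
# so w / ||w||_2 is the (unit) normal vector of H.
print(w @ (x2 - x1))             # 0
print(w / np.linalg.norm(w))     # unit normal [0.6, 0.8]
```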
Recall: Distance from $x_0$ to $g(x)=0$
- Pick a point $x_p$ on $\mathcal{H}$
- $x_p$ is the closest point to $x_0$.
- $x_0-x_p$ is the normal direction.
- So, for some scalar $\eta>0$, $x_0=x_p+\eta{w\over\|w\|_2}$.
- $x_p$ is on $\mathcal{H}$. So $g(x_p)=w^\top x_p+w_0=0$.
Therefore, we can show that
\[g(x_0)={w}^\top x_0+w_0={w}^\top\Big(x_p+\eta{w\over\|w\|_2}\Big)+w_0=g(x_p)+\eta\|w\|_2=\eta\|w\|_2\]
- So the distance is $\eta={g(x_0)\over\|w\|_2}$.
Conclusion:
- Distance: ${|g(x_0)|\over\|w\|_2}$, normal vector: ${w\over\|w\|_2}$
- Projection onto $\mathcal{H}$:
\[x_p=x_0-{g(x_0)\over\|w\|_2}\cdot{w\over\|w\|_2}\]
Unsigned Distance
- We define the distance between a data point $x_j$ and a separating hyperplane as
\[d(x_j,\mathcal{H})={|w^\top x_j+w_0|\over\|w\|_2}\]
- $d(x_j, \mathcal{H})$ is called unsigned distance.
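A small numpy sketch (toy numbers of my own) of the formulas above: the unsigned distance $|g(x_0)|/\|w\|_2$ and the projection $x_p$ onto $\mathcal{H}$.

```python
import numpy as np

w = np.array([3.0, 4.0])    # arbitrary hyperplane, for illustration
w0 = -5.0
x0 = np.array([4.0, 3.0])   # arbitrary query point

g = lambda x: w @ x + w0
wnorm = np.linalg.norm(w)

# Unsigned distance d(x0, H) = |g(x0)| / ||w||_2
d = abs(g(x0)) / wnorm
print(d)                         # 3.8

# Closest point on H: x_p = x0 - (g(x0)/||w||_2) * (w/||w||_2)
xp = x0 - (g(x0) / wnorm) * (w / wnorm)
print(g(xp))                     # ~0, so x_p is on H
print(np.linalg.norm(x0 - xp))   # equals d
```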
Margin
- Among all the unsigned distances, pick the smallest one.
- Margin: $\gamma$ such that
\[\gamma=\min_{j=1,\dots,N} d(x_j,\mathcal{H})\]
- Without loss of generality, assume $x_N$ is the closest point, so $\gamma=d(x_N,\mathcal{H})$.
More about Margin
- Margin: $\gamma$ such that
\[\gamma=\min_{j=1,\dots,N} d(x_j,\mathcal{H})=d(x_N,\mathcal{H})={|w^\top x_N+w_0|\over\|w\|_2}\]
- $\gamma$ depends on $(w, w_0)$.
- $\gamma$ always exists because the training set is finite.
- $\gamma \geq 0$, and it is zero when $x_N$ is on the boundary.
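A sketch (toy data of my own) of the margin computation: take the unsigned distance of every training point and keep the smallest one.

```python
import numpy as np

# Toy data and a candidate hyperplane (made up for illustration)
X = np.array([[3.0, 1.0], [0.0, 2.0], [4.0, 0.0]])
w = np.array([1.0, -1.0])
w0 = -1.0

# Unsigned distances d(x_j, H) = |w^T x_j + w0| / ||w||_2
d = np.abs(X @ w + w0) / np.linalg.norm(w)
print(d)                  # distance of each point to H

# Margin: the smallest unsigned distance over the training set
gamma = d.min()
j_closest = d.argmin()
print(gamma, j_closest)
```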
Signed Distance
- $d(x_j, \mathcal{H})$ is unsigned.
- So $\gamma$ does not tell whether a point $x_j$ is correctly classified or not.
- Assume that the labels are defined as $y_j \in \{-1, +1\}$.
- Then define a signed distance
\[d_{\text{signed}}(x_j,\mathcal{H})=y_j\,{w^\top x_j+w_0\over\|w\|_2}\]
If $\geq0$, we correctly classify $x_j$.
If $<0$, we incorrectly classify $x_j$.
- Recall the perceptron loss: it sums $-y_j(w^\top x_j + w_0)$ over the misclassified points, i.e., those with negative signed distance.
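A numpy sketch (again with made-up toy numbers) contrasting the signed distance with the unsigned one: a nonnegative signed distance means the point is on the correct side, and summing $-y_j(w^\top x_j+w_0)$ over the violated points gives the perceptron-style penalty recalled above.

```python
import numpy as np

# Toy data (made up): two features, labels in {-1, +1}
X = np.array([[3.0, 1.0], [0.0, 2.0], [4.0, 0.0]])
y = np.array([+1.0, -1.0, -1.0])
w = np.array([1.0, -1.0])
w0 = -1.0

# Signed distance: y_j * (w^T x_j + w0) / ||w||_2
signed = y * (X @ w + w0) / np.linalg.norm(w)
print(signed)             # >= 0 means correctly classified
print(signed >= 0)        # which points are on the correct side

# Perceptron-style penalty: -y_j (w^T x_j + w0) on misclassified points only
perceptron_loss = np.maximum(0.0, -y * (X @ w + w0)).sum()
print(perceptron_loss)
```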
### Unsigned VS Signed Distance
- Unsigned distance: Just the distance
- Signed distance: Distance plus whether on the correct side
### Max-Margin Objective
- Assumption: the training data are linearly separable.
- This means
- All training samples are correctly classified.
- All training samples are at least $\gamma$ away from the boundary.
- So the max-margin classifier is
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N\]
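To make the objective concrete, here is a small sketch (my own toy data and candidate hyperplanes) that evaluates the quantity being maximized, $\min_j y_j(w^\top x_j+w_0)/\|w\|_2$, for two separating hyperplanes; the max-margin classifier is the one for which this worst-case value is larger.

```python
import numpy as np

# Toy linearly separable data (made up for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

def worst_case_margin(w, w0):
    """min_j y_j (w^T x_j + w0) / ||w||_2, the max-margin objective."""
    return np.min(y * (X @ w + w0) / np.linalg.norm(w))

# Two candidate separating hyperplanes: the first has the larger margin
print(worst_case_margin(np.array([1.0, 0.0]), -2.0))   # 1.0
print(worst_case_margin(np.array([1.0, 0.5]), -2.0))   # ~0.894, worse
```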
### Unfortunately…
- If I can solve the optimization problem
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N,\]
then I can obtain a good SVM.
- But solving the optimization is not easy!
- $\gamma$ depends on $(w, w_0)$. If you change $(w, w_0)$, you also change $\gamma$.
- There is a term $1/\|w\|_2$, which makes the problem nonlinear.
Trick 1: Scaling
- The optimization is
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N\]
Let $x_N$ be the point closest to the boundary
- Define $\tilde{\gamma}$ as the numerator of the smallest unsigned distance:
\[\tilde{\gamma}=|w^\top x_N+w_0|\]
- Then, we can show that
\[\gamma=d(x_N,\mathcal{H})={|w^\top x_N+w_0|\over\|w\|_2}={\tilde{\gamma}\over\|w\|_2}\]
- So we can turn this optimization
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N\]
- into this optimization
\[\max_{w,\,w_0,\,\tilde{\gamma}}\ {\tilde{\gamma}\over\|w\|_2} \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0 \over \|w\|_2}\Big) \geq {\tilde{\gamma} \over \|w\|_2}, \quad j=1,\dots,N\]
- $1/\|w\|_2$ goes to the objective function!
- Multiplying both sides of the constraint by $\|w\|_2$ gives the equivalent form $y_j(w^\top x_j + w_0) \geq \tilde{\gamma}$.
Eliminate $\tilde{\gamma}$
- How about we turn this optimization
\[\max_{w,\,w_0,\,\tilde{\gamma}}\ {\tilde{\gamma}\over\|w\|_2} \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq \tilde{\gamma}, \quad j=1,\dots,N\]
- Into this optimization?
\[\max_{w,\,w_0,\,\tilde{\gamma}}\ {1\over\|w/\tilde{\gamma}\|_2} \quad \text{subject to}\quad y_j\Big(\big({w \over \tilde{\gamma}}\big)^{\!\top} x_j + {w_0 \over \tilde{\gamma}}\Big) \geq 1, \quad j=1,\dots,N\]
(note that ${\tilde{\gamma}\over\|w\|_2}={1\over\|w/\tilde{\gamma}\|_2}$, and the constraint was divided by $\tilde{\gamma}>0$)
- You can redefine the variables $w \leftarrow {w \over \tilde{\gamma}}$ and $w_0 \leftarrow {w_0 \over \tilde{\gamma}}$.
- This gives you
\[\max_{w,\,w_0}\ {1\over\|w\|_2} \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq 1, \quad j=1,\dots,N\]
- So $\tilde{\gamma}$ is eliminated!
- Geometrically: a scaling.
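A quick numerical sanity check (my own toy example) of the scaling trick: multiplying $(w, w_0)$ by any $c>0$ leaves every signed distance unchanged, so we may rescale until the closest point satisfies $y_N(w^\top x_N+w_0)=1$, which is exactly the normalization that removes $\tilde{\gamma}$.

```python
import numpy as np

# Toy data (made up) and a separating hyperplane
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])
w, w0 = np.array([2.0, 0.0]), -4.0

signed = lambda w, w0: y * (X @ w + w0) / np.linalg.norm(w)

# 1) Scaling (w, w0) by any c > 0 does not change the signed distances
print(signed(w, w0))
print(signed(5.0 * w, 5.0 * w0))        # identical

# 2) Rescale so that the closest point hits y_N (w^T x_N + w0) = 1
tilde_gamma = np.min(y * (X @ w + w0))  # functional value at the closest point
w_s, w0_s = w / tilde_gamma, w0 / tilde_gamma
print(np.min(y * (X @ w_s + w0_s)))     # 1.0
print(1.0 / np.linalg.norm(w_s))        # the geometric margin gamma
```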
Trick 2. Max to Min
- You want to solve
\[\max_{w,\,w_0}\ {1\over\|w\|_2} \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq 1, \quad j=1,\dots,N\]
- How about
\[\min_{w,\,w_0}\ {1\over2}\|w\|_2^2 \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq 1, \quad j=1,\dots,N\,?\]
- Maximizing $1/\|w\|_2$ is the same as minimizing $\|w\|_2$, and hence the same as minimizing ${1\over2}\|w\|_2^2$.
- This is a quadratic minimization with linear constraints.
- Convex.
- The solution is called a support vector machine.
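Since the final problem is a small convex QP, any generic QP solver can handle it. Below is a sketch using cvxpy (my choice of tool, not something prescribed by the lecture) on made-up separable data.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (made up for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# min (1/2)||w||_2^2   subject to   y_j (w^T x_j + w0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
prob = cp.Problem(objective, constraints)
prob.solve()

print(w.value, w0.value)               # for this toy data, roughly w = [1, 0], w0 = -2
print(1.0 / np.linalg.norm(w.value))   # margin of the max-margin classifier
```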
Hand Crafted Example
- You have four data points $x_1=(0,0)^\top$, $x_2=(2,2)^\top$, $x_3=(2,0)^\top$, $x_4=(3,0)^\top$, and
- Labels are $y_1=y_2=-1$ and $y_3=y_4=+1$.
- Weight vector $w=(w_1, w_2)$ and offset $w_0$.
- The constraints $y_j(w^\top x_j + w_0) \geq 1$ are:
(1) $-w_0 \geq 1$
(2) $-(2w_1+2w_2+w_0) \geq 1$
(3) $2w_1+w_0 \geq 1$
(4) $3w_1+w_0 \geq 1$
- Combine (1) and (3): $w_1 \geq 1$
- Combine (2) and (3): $w_2 \leq -1$
- Objective function is ${1\over2}\|w\|_2^2$
- Can show that ${1\over2}\|w\|_2^2={1\over2}(w_1^2+w_2^2)\geq{1\over2}(1+1)=1$, since $w_1\geq1$ and $w_2\leq-1$ imply $w_1^2\geq1$ and $w_2^2\geq1$.
- Equality holds when $w_1=1$ and $w_2=-1$.
- So $(w_1^*, w_2^*)=(1,-1)$ is a minimizer of the objective.
- Can further show that $w_0^*=-1$: constraint (1) gives $w_0\leq-1$, and constraint (3) with $w_1^*=1$ gives $w_0\geq-1$.
- All constraints are satisfied at $(w_1^*, w_2^*, w_0^*)=(1, -1, -1)$.
- Separating hyperplane: $\mathcal{H}=\{x : (w^*)^\top x + w_0^* = 0\}$ with $w^*=(1,-1)$ and $w_0^*=-1$.
- Margin: $\gamma = 1/\|w^*\|_2 = 1/\sqrt{2}$.
- Boxed data points: those that meet constraints (1), (2), and (3) with equality.
- These are the support vectors.
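A short numpy check of the hand-crafted example, using the four points and labels as read off from constraints (1)-(4): the claimed solution $(w_1^*, w_2^*, w_0^*)=(1,-1,-1)$ is feasible, constraints (1)-(3) are active (the support vectors), and the margin is $1/\sqrt{2}$.

```python
import numpy as np

# Data points and labels as reconstructed from constraints (1)-(4) above
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

w = np.array([1.0, -1.0])   # claimed minimizer (w1*, w2*)
w0 = -1.0                   # claimed offset w0*

margins = y * (X @ w + w0)
print(margins)                      # [1. 1. 1. 2.]: all constraints hold
print(np.isclose(margins, 1.0))     # (1), (2), (3) active -> support vectors
print(1.0 / np.linalg.norm(w))      # margin = 1/sqrt(2) ~ 0.707
```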