Machine Learning Course, Week 3
Support Vector Machine
Margin and Max-Margin Classifier
- Margin: Smallest gap between the two classes
- Max-Margin Classifier: A classifier that maximizes the margin.
- What do we need?
- How to measure the distance from a point to a plane?
- How to formulate a max margin problem?
- How to solve the max margin problem?
Recall: Linear Discriminant Function
- In high dimensions, the linear discriminant function is $g(x)=w^\top x+w_0$, and the set $\{x : g(x)=0\}$ is a hyperplane. How do we measure the distance from a point to it?
- Separating Hyperplane:
\[\mathcal{H}=\{x : w^\top x+w_0=0\}\]
- $x\in\mathcal{H}$ means $x$ is on the decision boundary.
- ${w}/\|{w}\|_2$ is the normal vector of $\mathcal{H}$.
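A minimal numpy sketch (my own illustration, not from the lecture) of these two facts: points $x$ with $g(x)=0$ lie on $\mathcal{H}$, and the difference of any two such points is orthogonal to $w$, so $w/\|w\|_2$ is the unit normal. The specific $w$, $w_0$, and points are made-up values.

```python
import numpy as np

# Hypothetical hyperplane parameters (for illustration only)
w = np.array([3.0, 4.0])
w0 = -5.0

# Two points on H = {x : w^T x + w0 = 0}
x1 = np.array([1.0, 0.5])    # 3*1 + 4*0.5 - 5 = 0
x2 = np.array([3.0, -1.0])   # 3*3 + 4*(-1) - 5 = 0

g = lambda x: w @ x + w0
print(g(x1), g(x2))          # both 0: the points lie on H

# The difference of two points on H is orthogonal to w,
# so w / ||w||_2 is the (unit) normal vector of H.
print(w @ (x2 - x1))             # 0
print(w / np.linalg.norm(w))     # unit normal [0.6, 0.8]
```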
Recall: Distance from $x_0$ to $g(x)=0$
- Pick a point $x_p$ on $\mathcal{H}$
- $x_p$ is the closest point to $x_0$.
- $x_0-x_p$ is the normal direction.
- So, for some scalar $\eta>0$, $x_0=x_p+\eta{w\over\|w\|_2}$.
- $x_p$ is on $\mathcal{H}$. So $g(x_p)=w^\top x_p+w_0=0$.
Therefore, we can show that
\[g(x_0)={w}^\top x_0+w_0={w}^\top\Big(x_p+\eta{w\over\|w\|_2}\Big)+w_0=g(x_p)+\eta\|w\|_2=\eta\|w\|_2\]
- So the distance is $\eta={g(x_0)\over\|w\|_2}$.
Conclusion:
- Distance: ${|g(x_0)|\over\|w\|_2}$, normal vector: ${w\over\|w\|_2}$
- Projection onto $\mathcal{H}$:
\[x_p=x_0-{g(x_0)\over\|w\|_2}\cdot{w\over\|w\|_2}\]
Unsigned Distance
- We define the distance between a data point $x_j$ and a separating hyperplane as
\[d(x_j,\mathcal{H})={|w^\top x_j+w_0|\over\|w\|_2}\]
- $d(x_j, \mathcal{H})$ is called unsigned distance.
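A small numpy sketch (toy numbers of my own) of the formulas above: the unsigned distance $|g(x_0)|/\|w\|_2$ and the projection $x_p$ onto $\mathcal{H}$.

```python
import numpy as np

w = np.array([3.0, 4.0])    # arbitrary hyperplane, for illustration
w0 = -5.0
x0 = np.array([4.0, 3.0])   # arbitrary query point

g = lambda x: w @ x + w0
wnorm = np.linalg.norm(w)

# Unsigned distance d(x0, H) = |g(x0)| / ||w||_2
d = abs(g(x0)) / wnorm
print(d)                         # 3.8

# Closest point on H: x_p = x0 - (g(x0)/||w||_2) * (w/||w||_2)
xp = x0 - (g(x0) / wnorm) * (w / wnorm)
print(g(xp))                     # ~0, so x_p is on H
print(np.linalg.norm(x0 - xp))   # equals d
```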
Margin
- Among all the unsigned distances, pick the smallest one.
- Margin: $\gamma$ such that
\[\gamma=\min_{j=1,\dots,N} d(x_j,\mathcal{H})\]
- Without loss of generality, assume $x_N$ is the closest point, so $\gamma=d(x_N,\mathcal{H})$.
More about Margin
- Margin: $\gamma$ such that
\[\gamma=\min_{j=1,\dots,N} d(x_j,\mathcal{H})=d(x_N,\mathcal{H})={|w^\top x_N+w_0|\over\|w\|_2}\]
- $\gamma$ depends on $(w, w_0)$.
- $\gamma$ always exists because the training set is finite.
- $\gamma \geq 0$, and it is zero when $x_N$ is on the boundary.
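A sketch (toy data of my own) of the margin computation: take the unsigned distance of every training point and keep the smallest one.

```python
import numpy as np

# Toy data and a candidate hyperplane (made up for illustration)
X = np.array([[3.0, 1.0], [0.0, 2.0], [4.0, 0.0]])
w = np.array([1.0, -1.0])
w0 = -1.0

# Unsigned distances d(x_j, H) = |w^T x_j + w0| / ||w||_2
d = np.abs(X @ w + w0) / np.linalg.norm(w)
print(d)                  # distance of each point to H

# Margin: the smallest unsigned distance over the training set
gamma = d.min()
j_closest = d.argmin()
print(gamma, j_closest)
```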
Signed Distance
- $d(x_j, \mathcal{H})$ is unsigned.
- So $\gamma$ does not tell whether a point $x_j$ is correctly classified or not.
- Assume that the labels are defined as $y_j \in \{-1, +1\}$.
- Then define a signed distance
\[d_{\text{signed}}(x_j,\mathcal{H})=y_j\,{w^\top x_j+w_0\over\|w\|_2}\]
If $\geq0$, we correctly classify $x_j$.
If $<0$, we incorrectly classify $x_j$.
- Recall the perceptron loss: it sums $-y_j(w^\top x_j + w_0)$ over the misclassified points, i.e., those with negative signed distance.
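A numpy sketch (again with made-up toy numbers) contrasting the signed distance with the unsigned one: a nonnegative signed distance means the point is on the correct side, and summing $-y_j(w^\top x_j+w_0)$ over the violated points gives the perceptron-style penalty recalled above.

```python
import numpy as np

# Toy data (made up): two features, labels in {-1, +1}
X = np.array([[3.0, 1.0], [0.0, 2.0], [4.0, 0.0]])
y = np.array([+1.0, -1.0, -1.0])
w = np.array([1.0, -1.0])
w0 = -1.0

# Signed distance: y_j * (w^T x_j + w0) / ||w||_2
signed = y * (X @ w + w0) / np.linalg.norm(w)
print(signed)             # >= 0 means correctly classified
print(signed >= 0)        # which points are on the correct side

# Perceptron-style penalty: -y_j (w^T x_j + w0) on misclassified points only
perceptron_loss = np.maximum(0.0, -y * (X @ w + w0)).sum()
print(perceptron_loss)
```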
### Unsigned VS Signed Distance
- Unsigned distance: Just the distance
- Signed distance: Distance plus whether on the correct side
### Max-Margin Objective
- Assumption: the training data are linearly separable.
- This means
- All training samples are correctly classified.
- All training samples are at least $\gamma$ away from the boundary.
- So the max-margin classifier is
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N\]
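To make the objective concrete, here is a small sketch (my own toy data and candidate hyperplanes) that evaluates the quantity being maximized, $\min_j y_j(w^\top x_j+w_0)/\|w\|_2$, for two separating hyperplanes; the max-margin classifier is the one for which this worst-case value is larger.

```python
import numpy as np

# Toy linearly separable data (made up for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

def worst_case_margin(w, w0):
    """min_j y_j (w^T x_j + w0) / ||w||_2, the max-margin objective."""
    return np.min(y * (X @ w + w0) / np.linalg.norm(w))

# Two candidate separating hyperplanes: the first has the larger margin
print(worst_case_margin(np.array([1.0, 0.0]), -2.0))   # 1.0
print(worst_case_margin(np.array([1.0, 0.5]), -2.0))   # ~0.894, worse
```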
### Unfortunately…
- If I can solve the optimization problem
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N,\]
then I can obtain a good SVM.
- But solving the optimization is not easy!
- $\gamma$ depends on $(w, w_0)$. If you change $(w, w_0)$, you also change $\gamma$.
- There is a term $1/\|w\|_2$, which makes the problem nonlinear.
Trick 1: Scaling
- The optimization is
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N\]
Let $x_N$ be the point closest to the boundary
- Define $\tilde{\gamma}$ as the numerator of the smallest unsigned distance:
\[\tilde{\gamma}=|w^\top x_N+w_0|\]
- Then, we can show that
\[\gamma=d(x_N,\mathcal{H})={|w^\top x_N+w_0|\over\|w\|_2}={\tilde{\gamma}\over\|w\|_2}\]
- So we can turn this optimization
\[\max_{w,\,w_0,\,\gamma}\ \gamma \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0\over \|w\|_2}\Big) \geq \gamma, \quad j=1,\dots,N\]
- into this optimization
\[\max_{w,\,w_0,\,\tilde{\gamma}}\ {\tilde{\gamma}\over\|w\|_2} \quad \text{subject to}\quad y_j\Big({w^\top x_j + w_0 \over \|w\|_2}\Big) \geq {\tilde{\gamma} \over \|w\|_2}, \quad j=1,\dots,N\]
- $1/\|w\|_2$ goes to the objective function!
- Multiplying both sides of the constraint by $\|w\|_2$ gives the equivalent form $y_j(w^\top x_j + w_0) \geq \tilde{\gamma}$.
Eliminate $\tilde{\gamma}$
- How about we turn this optimization
\[\max_{w,\,w_0,\,\tilde{\gamma}}\ {\tilde{\gamma}\over\|w\|_2} \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq \tilde{\gamma}, \quad j=1,\dots,N\]
- Into this optimization?
\[\max_{w,\,w_0,\,\tilde{\gamma}}\ {1\over\|w/\tilde{\gamma}\|_2} \quad \text{subject to}\quad y_j\Big(\big({w \over \tilde{\gamma}}\big)^{\!\top} x_j + {w_0 \over \tilde{\gamma}}\Big) \geq 1, \quad j=1,\dots,N\]
(note that ${\tilde{\gamma}\over\|w\|_2}={1\over\|w/\tilde{\gamma}\|_2}$, and the constraint was divided by $\tilde{\gamma}>0$)
- You can redefine the variables $w \leftarrow {w \over \tilde{\gamma}}$ and $w_0 \leftarrow {w_0 \over \tilde{\gamma}}$.
- This gives you
\[\max_{w,\,w_0}\ {1\over\|w\|_2} \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq 1, \quad j=1,\dots,N\]
- So $\tilde{\gamma}$ is eliminated!
- Geometrically: a scaling.
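A quick numerical sanity check (my own toy example) of the scaling trick: multiplying $(w, w_0)$ by any $c>0$ leaves every signed distance unchanged, so we may rescale until the closest point satisfies $y_N(w^\top x_N+w_0)=1$, which is exactly the normalization that removes $\tilde{\gamma}$.

```python
import numpy as np

# Toy data (made up) and a separating hyperplane
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])
w, w0 = np.array([2.0, 0.0]), -4.0

signed = lambda w, w0: y * (X @ w + w0) / np.linalg.norm(w)

# 1) Scaling (w, w0) by any c > 0 does not change the signed distances
print(signed(w, w0))
print(signed(5.0 * w, 5.0 * w0))        # identical

# 2) Rescale so that the closest point hits y_N (w^T x_N + w0) = 1
tilde_gamma = np.min(y * (X @ w + w0))  # functional value at the closest point
w_s, w0_s = w / tilde_gamma, w0 / tilde_gamma
print(np.min(y * (X @ w_s + w0_s)))     # 1.0
print(1.0 / np.linalg.norm(w_s))        # the geometric margin gamma
```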
Trick 2. Max to Min
- You want to solve
\[\max_{w,\,w_0}\ {1\over\|w\|_2} \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq 1, \quad j=1,\dots,N\]
- How about
\[\min_{w,\,w_0}\ {1\over2}\|w\|_2^2 \quad \text{subject to}\quad y_j(w^\top x_j + w_0) \geq 1, \quad j=1,\dots,N\,?\]
- Maximizing $1/\|w\|_2$ is the same as minimizing $\|w\|_2$, and hence the same as minimizing ${1\over2}\|w\|_2^2$.
- This is a quadratic minimization with linear constraints.
- Convex.
- The solution is called a support vector machine.
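Since the final problem is a small convex QP, any generic QP solver can handle it. Below is a sketch using cvxpy (my choice of tool, not something prescribed by the lecture) on made-up separable data.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (made up for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# min (1/2)||w||_2^2   subject to   y_j (w^T x_j + w0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
prob = cp.Problem(objective, constraints)
prob.solve()

print(w.value, w0.value)               # for this toy data, roughly w = [1, 0], w0 = -2
print(1.0 / np.linalg.norm(w.value))   # margin of the max-margin classifier
```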
Hand Crafted Example
- You have four data points $x_1=(0,0)^\top$, $x_2=(2,2)^\top$, $x_3=(2,0)^\top$, $x_4=(3,0)^\top$, and
- Labels are $y_1=y_2=-1$ and $y_3=y_4=+1$.
- Weight vector $w=(w_1, w_2)$ and offset $w_0$.
- The constraints $y_j(w^\top x_j + w_0) \geq 1$ are:
(1) $-w_0 \geq 1$
(2) $-(2w_1+2w_2+w_0) \geq 1$
(3) $2w_1+w_0 \geq 1$
(4) $3w_1+w_0 \geq 1$
- Combine (1) and (3): $w_1 \geq 1$
- Combine (2) and (3): $w_2 \leq -1$
- Objective function is ${1\over2}\|w\|_2^2$
- Can show that ${1\over2}\|w\|_2^2={1\over2}(w_1^2+w_2^2)\geq{1\over2}(1+1)=1$, since $w_1\geq1$ and $w_2\leq-1$ imply $w_1^2\geq1$ and $w_2^2\geq1$.
- Equality holds when $w_1=1$ and $w_2=-1$.
- So $(w_1^*, w_2^*)=(1,-1)$ is a minimizer of the objective.
- Can further show that $w_0^*=-1$: constraint (1) gives $w_0\leq-1$, and constraint (3) with $w_1^*=1$ gives $w_0\geq-1$.
- All constraints are satisfied at $(w_1^*, w_2^*, w_0^*)=(1, -1, -1)$.
- Separating hyperplane: $\mathcal{H}=\{x : (w^*)^\top x + w_0^* = 0\}$ with $w^*=(1,-1)$ and $w_0^*=-1$.
- Margin: $\gamma = 1/\|w^*\|_2 = 1/\sqrt{2}$.
- Boxed data points: those that meet constraints (1), (2), and (3) with equality.
- These are the support vectors.
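A short numpy check of the hand-crafted example, using the four points and labels as read off from constraints (1)-(4): the claimed solution $(w_1^*, w_2^*, w_0^*)=(1,-1,-1)$ is feasible, constraints (1)-(3) are active (the support vectors), and the margin is $1/\sqrt{2}$.

```python
import numpy as np

# Data points and labels as reconstructed from constraints (1)-(4) above
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

w = np.array([1.0, -1.0])   # claimed minimizer (w1*, w2*)
w0 = -1.0                   # claimed offset w0*

margins = y * (X @ w + w0)
print(margins)                      # [1. 1. 1. 2.]: all constraints hold
print(np.isclose(margins, 1.0))     # (1), (2), (3) active -> support vectors
print(1.0 / np.linalg.norm(w))      # margin = 1/sqrt(2) ~ 0.707
```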