(Week 5) SVM

 

Machine Learning Course, Week 5


Support Vector Machine

Not Linearly Separable

  • The points can be linearly separated, but only with a very narrow margin.

  • But the large-margin solution is possibly better, even though one constraint is violated.

Soft Margin

  • We want to allow data points to stay inside the margin.

  • How about relaxing the constraint

\[y_j(w^\top x_j + w_0) \geq 1, j=1,...,N\]

to

\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, j=1,...,N\]

with $\xi_j \geq 0$?

  • If $\xi_j > 1$, then $x_j$ will be misclassified.

  • We can consider this problem

\[\text{minimize}_{w, w_0, \xi} {1\over 2}\|w\|^2_2\]

subject to

\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, j=1,...,N\] \[\xi_j \geq 0\]
  • But we need to control $\xi$; otherwise the solution will simply take $\xi_j \to \infty$.

  • How about this:

\[\text{minimize}_{w, w_0, \xi} {1\over2}\|w\|^2_2 + C \|\xi\|^2_2\]

subject to

\[y_j(w^\top x_j + w_0) \geq 1-\xi_j\] \[\xi_j \geq 0\]
  • This controls the energy of $\xi$ (a CVXPY sketch of this problem follows below).
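
A minimal CVXPY sketch of this $l_2$-penalized soft-margin problem, assuming synthetic placeholder data `X`, `y` and an arbitrary value of `C` (none of these come from the lecture):

```python
# Soft-margin SVM with a squared (l2) penalty on the slacks, solved with CVXPY.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, d = 40, 2
X = rng.standard_normal((N, d))                                       # placeholder features
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(N) > 0, 1.0, -1.0)   # labels in {-1, +1}
C = 1.0                                                               # placeholder trade-off

w = cp.Variable(d)
w0 = cp.Variable()
xi = cp.Variable(N)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum_squares(xi))
constraints = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value)
```

Re-solving with different values of $C$ illustrates the trade-off discussed next.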

Role of $C$

  • If $C$ is large, we force $\xi$ to be small (margin violations are heavily penalized).

  • If $C$ is small, then $\xi$ can be large (violations are cheap).

No Misclassification?

  • You can still have misclassifications in soft-margin SVM.

  • $\xi_j$ can be large for a few outliers:

\[\text{minimize}_{w, w_0, \xi} {1\over2}\|w\|^2_2 + C \|\xi\|^2_2\] \[y_j(w^\top x_j + w_0) \geq 1-\xi_j\] \[\xi_j \geq 0\]

L1-Regularization

  • Instead of the squared $l_2$ penalty $\|\xi\|^2_2$, you can also use the $l_1$ norm:
\[\text{minimize}_{w, w_0, \xi} {1\over2}\|w\|^2_2 + C \|\xi\|_1\] \[y_j(w^\top x_j + w_0) \geq 1-\xi_j\] \[\xi_j \geq 0\]
  • This encourages $\xi$ to be sparse.

  • Only a few samples are allowed to live inside the margin.

  • The problem remains convex.

  • So you can still use CVX to solve the problem (a CVXPY sketch follows below).
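
A matching CVXPY sketch for the $l_1$-penalized version; compared with the earlier sketch, only the slack term in the objective changes (data and `C` are again placeholders):

```python
# l1-penalized soft-margin SVM in CVXPY: the slack penalty is now sum_j xi_j.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, d = 40, 2
X = rng.standard_normal((N, d))
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(N) > 0, 1.0, -1.0)
C = 1.0

w, w0, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
obj = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.norm1(xi))
cons = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
cp.Problem(obj, cons).solve()

print("nonzero slacks:", int(np.sum(xi.value > 1e-6)))   # xi should be sparse
```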

Connection with Perceptron Algorithm

  • In soft-margin SVM, $\xi_j \geq 0$ and $y_j(w^\top x_j + w_0) \geq 1-\xi_j$ imply that
\[\xi_j \geq 0\] \[\xi_j \geq 1-y_j(w^\top x_j + w_0)\]
  • We can combine them to get
\[\xi_j \geq \max\{0, 1-y_j(w^\top x_j + w_0)\}\] \[=[1-y_j(w^\top x_j + w_0)]_{+}\]
  • So if we use SVM with $l_1$ penalty, then
\[J(w, w_0, \xi)={1\over2}\|w\|^2_2+C\sum^N_{j=1}\xi_j\] \[={1\over2}\|w\|^2_2+C\sum^N_{j=1}[1-y_j(w^\top x_j + w_0)]_{+}\]
  • This means that the training loss is
\[J(w, w_0)=\sum^N_{j=1}[1-y_j(w^\top x_j + w_0)]_{+}+{\lambda\over2}\|w\|^2_2\]

if we define $\lambda=1/C$

  • Now, you can make $\lambda \rightarrow 0$. This means $C\rightarrow \infty$

  • Then,

\[J(w, w_0)=\sum^N_{j=1}[1-y_j(w^\top x_j + w_0)]_{+}\] \[=\sum^N_{j=1}\max\{0, 1-y_j(w^\top x_j + w_0)\}\] \[=\sum^N_{j=1}\max\{0, 1-y_jg(x_j)\}\]

where $g(x)=w^\top x + w_0$.
  • SVM loss:
\[J(w, w_0)=\sum^N_{j=1}\max\{0, 1-y_jg(x_j)\}\]
  • Perceptron loss:
\[J(w, w_0)=\sum^N_{j=1}\max\{0, -y_jg(x_j)\}\]
  • The SVM loss asks for a margin of 1, and the $\|w\|^2_2$ term in the full objective regularizes the solution. The snippet below evaluates this loss on placeholder data.
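
As a small illustration, the training loss above can be evaluated directly; the data and parameters below are arbitrary placeholders:

```python
# Evaluate the hinge-based training loss J(w, w0) for given parameters.
import numpy as np

def svm_loss(w, w0, X, y, lam):
    """J(w, w0) = sum_j max(0, 1 - y_j (w^T x_j + w0)) + (lam / 2) * ||w||_2^2."""
    margins = y * (X @ w + w0)              # y_j * g(x_j)
    hinge = np.maximum(0.0, 1.0 - margins)  # [1 - y_j g(x_j)]_+
    return hinge.sum() + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
print(svm_loss(np.array([1.0, 0.0]), 0.0, X, y, lam=0.1))
```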

The Kernel Trick

  • A trick to turn a linear classifier into a nonlinear classifier

  • Dual SVM

\[\text{maximize}_{\lambda \geq 0}-{1\over2}\sum^n_{i=1}\sum^n_{j=1}\lambda_i\lambda_jy_i y_j x^\top_i x_j + \sum^n_{j=1}\lambda_j\]

subject to $\sum^n_{j=1}\lambda_jy_j=0$

  • Kernel Trick
\[\text{maximize}_{\lambda \geq 0}-{1\over2}\sum^n_{i=1}\sum^n_{j=1}\lambda_i\lambda_jy_i y_j \phi(x_i)^\top \phi(x_j) + \sum^n_{j=1}\lambda_j\]

subject to $\sum^n_{j=1}\lambda_jy_j=0$

  • You have to do this in the dual; the primal is hard when $\phi$ is very high-dimensional.

  • Define

\[K(x_i, x_j)=\phi(x_i)^\top \phi(x_j)\]
  • The matrix $Q$ is
\[Q=\begin{bmatrix}y_1y_1x^\top_1 x_1 & \cdots & y_1y_Nx^\top_1x_N \\ y_2y_1x^\top_2 x_1 & \cdots & y_2y_Nx^\top_2x_N \\ \vdots & \ddots & \vdots \\ y_Ny_1x^\top_N x_1 & \cdots & y_Ny_Nx^\top_Nx_N \end{bmatrix}\]
  • By Kernel Trick
\[Q=\begin{bmatrix}y_1y_1K(x_1, x_1) & \cdots & y_1y_NK(x_1, x_N) \\ y_2y_1K(x_2, x_1) & \cdots & y_2y_NK(x_2, x_N) \\ \vdots & \ddots & \vdots \\ y_Ny_1K(x_N, x_1) & \cdots & y_Ny_NK(x_N, x_N) \end{bmatrix}\]
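
Putting the pieces together, here is a sketch that builds $Q$ from a kernel and solves the dual with CVXPY; the data and the RBF width are placeholders, and a tiny ridge is added to $Q$ only to keep it numerically positive semidefinite:

```python
# Kernelized dual SVM: build Q from a kernel and solve the dual QP with CVXPY.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N = 30
X = rng.standard_normal((N, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)     # not linearly separable

def rbf_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

K = rbf_gram(X, X)
Q = (y[:, None] * y[None, :]) * K                  # Q_ij = y_i y_j K(x_i, x_j)
Q = Q + 1e-8 * np.eye(N)                           # small ridge for numerical stability

lam = cp.Variable(N)
obj = cp.Maximize(-0.5 * cp.quad_form(lam, Q) + cp.sum(lam))
cons = [lam >= 0, y @ lam == 0]                    # add lam <= C for the soft-margin dual
cp.Problem(obj, cons).solve()
print(lam.value)
```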

Kernel

  • The inner product $\phi(x_i)^\top \phi(x_j)$ is called a kernel:
\[K(x_i, x_j)=\phi(x_i)^\top \phi(x_j)\]
  • Second-Order Polynomial kernel
\[K(u, v)=(u^\top v)^2\]
  • Gaussian Radial Basis Function (RBF) kernel
\[K(u,v)=\exp\{-{\|u-v\|^2\over2\sigma^2}\}\]
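
A quick check of the definition for the second-order polynomial kernel: in 2-D the explicit feature map is $\phi(u)=(u_1^2, \sqrt{2}\,u_1u_2, u_2^2)$, and its inner product reproduces $(u^\top v)^2$ (the test vectors below are arbitrary):

```python
# Check that K(u, v) = (u^T v)^2 equals phi(u)^T phi(v) for the 2-D feature map
# phi(u) = (u1^2, sqrt(2) u1 u2, u2^2).
import numpy as np

def poly2_kernel(u, v):
    return float(np.dot(u, v)) ** 2

def phi(u):
    return np.array([u[0]**2, np.sqrt(2.0) * u[0] * u[1], u[1]**2])

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly2_kernel(u, v))               # kernel value computed directly
print(float(np.dot(phi(u), phi(v))))    # same value via the explicit feature map
```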

Radial Basis Function

The radial basis function (RBF) kernel is often written as

\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]

which matches the previous form with $\gamma = 1/(2\sigma^2)$.

  • Typically, $\gamma \in [0,1]$.
  • If $\gamma$ is too big, the classifier over-fits.

Non-Linear Transform for RBF?

  • Let us consider scalars $u, v \in \mathbb{R}$ (and take $\gamma = 1$)

\[K(u,v)=\exp\{-(u-v)^2\}\] \[=\exp\{-u^2\}\exp\{2uv\}\exp\{-v^2\}\] \[=\exp\{-u^2\}\left(\sum^{\infty}_{k=0}{2^k u^k v^k \over k!}\right)\exp\{-v^2\}\] \[=\exp\{-u^2\}\left(1, \sqrt{2^1\over1!}u, \sqrt{2^2\over2!}u^2, \sqrt{2^3\over3!}u^3, \ldots\right)^\top \left(1, \sqrt{2^1\over1!}v, \sqrt{2^2\over2!}v^2, \sqrt{2^3\over3!}v^3, \ldots\right)\exp\{-v^2\}\]

  • So $\phi$ is
\[\phi(x)=\exp\{-x^2\}\left(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, \ldots\right)\]
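
A numerical sanity check of this expansion: truncating $\phi$ after a few dozen terms should reproduce $K(u,v)=\exp\{-(u-v)^2\}$ for scalar inputs (the truncation order below is an arbitrary choice):

```python
# Compare a truncated phi(u)^T phi(v) against the exact RBF kernel value.
import numpy as np
from math import factorial

def phi(x, order=25):
    """Truncated feature map phi(x) = exp(-x^2) * (sqrt(2^k / k!) x^k), k = 0..order."""
    return np.exp(-x**2) * np.array([np.sqrt(2.0**k / factorial(k)) * x**k
                                     for k in range(order + 1)])

u, v = 0.7, -0.4
print(np.dot(phi(u), phi(v)))     # inner product of truncated features
print(np.exp(-(u - v) ** 2))      # exact kernel value; should closely match
```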

So You Need

Example. Radial Basis Function

\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]

The non-linear transform is

\[\phi(x)=\exp\{-x^2\}\left(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, \ldots\right)\]
  • You need an infinite-dimensional non-linear transform!

  • But to compute the kernel $K(u,v)$, you do not need $\phi$.

  • Another good thing about the dual SVM: you can use an infinite-dimensional non-linear transform.

  • The cost of computing $K(u,v)$ is bottlenecked by evaluating $\|u-v\|^2$, not by the infinite-dimensional $\phi$.

Is RBF Always Better than Linear?

  • On a noisy dataset, a linear kernel works well.
  • RBF can over-fit (a comparison sketch follows below).
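
A sketch of this comparison with scikit-learn on synthetic noisy data; the dataset, $C$, and $\gamma$ values are placeholders chosen only for illustration:

```python
# Compare a linear SVM and an RBF SVM on noisy, (roughly) linearly separable data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + 0.8 * rng.standard_normal(200) > 0, 1, -1)   # noisy linear labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", C=1.0, gamma=10.0).fit(X_tr, y_tr)   # large gamma -> risk of over-fit

print("linear: train %.2f  test %.2f" % (linear.score(X_tr, y_tr), linear.score(X_te, y_te)))
print("rbf:    train %.2f  test %.2f" % (rbf.score(X_tr, y_tr), rbf.score(X_te, y_te)))
```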

Testing with Kernels

  • Recall:
\[w^{*}=\sum^N_{n=1}\lambda^{*}_ny_nx_n\]
  • The hypothesis function is
\[h(x)=\text{sign}(w^{* \top} x + w^{*}_0)\] \[=\text{sign}((\sum^N_{n=1}\lambda^{*}_n y_n x_n)^\top x + w^{*}_0)\] \[=\text{sign}(\sum^N_{n=1}\lambda^{*}_ny_nx_n^\top x + w^{*}_0)\]
  • Now you can replace $x^\top_n x$ by $K(x_n, x)$:
\[h(x)=\text{sign}\left(\sum^N_{n=1}\lambda^{*}_n y_n K(x_n, x) + w^{*}_0\right)\]
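
A sketch of kernelized prediction, assuming the dual solution $\lambda^{*}$ and the offset $w^{*}_0$ are already available (e.g., from the dual problem above); the kernel choice is a placeholder:

```python
# Kernelized prediction: h(x) = sign( sum_n lam_n y_n K(x_n, x) + w0 ).
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma**2))

def predict(x, X_train, y_train, lam_star, w0_star, kernel=rbf):
    """Assumes lam_star and w0_star come from a solved (kernelized) dual SVM."""
    s = sum(l * yn * kernel(xn, x) for l, yn, xn in zip(lam_star, y_train, X_train))
    return np.sign(s + w0_star)

# w0_star can be recovered from any support vector x_s with lam_s > 0
# (and lam_s < C in the soft-margin case):
#     w0_star = y_s - sum_n lam_star_n * y_n * K(x_n, x_s)
```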