(Week 5) SVM

 

Machine Learning Course, Week 5


Support Vector Machine

Not Linearly Separable

  • The points can be linearly separated, but only with a very narrow margin.

  • But the large-margin solution is possibly better, even though one constraint is violated.

Soft Margin

  • We want to allow data points to stay inside the margin.

  • How about relaxing the constraint

\[y_j(w^\top x_j + w_0) \geq 1, j=1,...,N\]

to

\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, j=1,...,N\]

with $\xi_j \geq 0$?

  • If $\xi_j > 1$, then $x_j$ will be misclassified.

  • We can consider this problem

\[\text{minimize}_{w, w_0, \xi} {1\over 2}\|w\|^2_2\]

subject to

\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, j=1,...,N\] \[\xi_j \geq 0\]
  • But we need to control $\xi$; otherwise the solution will simply take $\xi_j \to \infty$.

  • How about this:

\[\text{minimize}_{w, w_0, \xi} {1\over2}\|w\|^2_2 + C \|\xi\|^2_2\]

subject to

\[y_j(w^\top x_j + w_0) \geq 1-\xi_j\] \[\xi_j \geq 0\]
  • This controls the energy of $\xi$ (a CVXPY sketch of this problem follows below).
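
A minimal CVXPY sketch of this $l_2$-penalized soft-margin problem, assuming synthetic placeholder data `X`, `y` and an arbitrary value of `C` (none of these come from the lecture):

```python
# Soft-margin SVM with a squared (l2) penalty on the slacks, solved with CVXPY.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, d = 40, 2
X = rng.standard_normal((N, d))                                       # placeholder features
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(N) > 0, 1.0, -1.0)   # labels in {-1, +1}
C = 1.0                                                               # placeholder trade-off

w = cp.Variable(d)
w0 = cp.Variable()
xi = cp.Variable(N)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum_squares(xi))
constraints = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value)
```

Re-solving with different values of $C$ illustrates the trade-off discussed next.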

Role of $C$

  • If $C$ is large, we force $\xi$ to be small (margin violations are heavily penalized).

  • If $C$ is small, then $\xi$ can be large (violations are cheap).

No Misclassification?

  • You can still have misclassifications in soft-margin SVM.

  • $\xi_j$ can be large for a few outliers:

\[\text{minimize}_{w, w_0, \xi} {1\over2}\|w\|^2_2 + C \|\xi\|^2_2\] \[y_j(w^\top x_j + w_0) \geq 1-\xi_j\] \[\xi_j \geq 0\]

L1-Regularization

  • Instead of the squared $l_2$ penalty $\|\xi\|^2_2$, you can also use the $l_1$ norm:
\[\text{minimize}_{w, w_0, \xi} {1\over2}\|w\|^2_2 + C \|\xi\|_1\] \[y_j(w^\top x_j + w_0) \geq 1-\xi_j\] \[\xi_j \geq 0\]
  • This encourages $\xi$ to be sparse.

  • Only a few samples are allowed to live inside the margin.

  • The problem remains convex.

  • So you can still use CVX to solve the problem (a CVXPY sketch follows below).
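
A matching CVXPY sketch for the $l_1$-penalized version; compared with the earlier sketch, only the slack term in the objective changes (data and `C` are again placeholders):

```python
# l1-penalized soft-margin SVM in CVXPY: the slack penalty is now sum_j xi_j.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, d = 40, 2
X = rng.standard_normal((N, d))
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(N) > 0, 1.0, -1.0)
C = 1.0

w, w0, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
obj = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.norm1(xi))
cons = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
cp.Problem(obj, cons).solve()

print("nonzero slacks:", int(np.sum(xi.value > 1e-6)))   # xi should be sparse
```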

Connection with Perceptron Algorithm

  • In soft-margin SVM, $\xi_j \geq 0$ and $y_j(w^\top x_j + w_0) \geq 1-\xi_j$ imply that
\[\xi_j \geq 0\] \[\xi_j \geq 1-y_j(w^\top x_j + w_0)\]
  • We can combine them to get
\[\xi_j \geq \max\{0, 1-y_j(w^\top x_j + w_0)\}\] \[=[1-y_j(w^\top x_j + w_0)]_{+}\]
  • So if we use SVM with $l_1$ penalty, then
\[J(w, w_0, \xi)={1\over2}\|w\|^2_2+C\sum^N_{j=1}\xi_j\] \[={1\over2}\|w\|^2_2+C\sum^N_{j=1}[1-y_j(w^\top x_j + w_0)]_{+}\]
  • This means that the training loss is
\[J(w, w_0)=\sum^N_{j=1}[1-y_j(w^\top x_j + w_0)]_{+}+{\lambda\over2}\|w\|^2_2\]

if we define $\lambda=1/C$

  • Now, you can make $\lambda \rightarrow 0$. This means $C\rightarrow \infty$

  • Then,

\[J(w, w_0)=\sum^N_{j=1}[1-y_j(w^\top x_j + w_0)]_{+}\] \[=\sum^N_{j=1}\max\{0, 1-y_j(w^\top x_j + w_0)\}\] \[=\sum^N_{j=1}\max\{0, 1-y_jg(x_j)\}\]

where $g(x)=w^\top x + w_0$.
  • SVM loss:
\[J(w, w_0)=\sum^N_{j=1}\max\{0, 1-y_jg(x_j)\}\]
  • Perceptron loss:
\[J(w, w_0)=\sum^N_{j=1}\max\{0, -y_jg(x_j)\}\]
  • The SVM loss asks for a margin of 1, and the $\|w\|^2_2$ term in the full objective regularizes the solution. The snippet below evaluates this loss on placeholder data.
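
As a small illustration, the training loss above can be evaluated directly; the data and parameters below are arbitrary placeholders:

```python
# Evaluate the hinge-based training loss J(w, w0) for given parameters.
import numpy as np

def svm_loss(w, w0, X, y, lam):
    """J(w, w0) = sum_j max(0, 1 - y_j (w^T x_j + w0)) + (lam / 2) * ||w||_2^2."""
    margins = y * (X @ w + w0)              # y_j * g(x_j)
    hinge = np.maximum(0.0, 1.0 - margins)  # [1 - y_j g(x_j)]_+
    return hinge.sum() + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
print(svm_loss(np.array([1.0, 0.0]), 0.0, X, y, lam=0.1))
```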

The Kernel Trick

  • A trick to turn a linear classifier into a nonlinear classifier

  • Dual SVM

\[\text{maximize}_{\lambda \geq 0}-{1\over2}\sum^n_{i=1}\sum^n_{j=1}\lambda_i\lambda_jy_i y_j x^\top_i x_j + \sum^n_{j=1}\lambda_j\]

subject to $\sum^n_{j=1}\lambda_jy_j=0$

  • Kernel Trick
\[\text{maximize}_{\lambda \geq 0}-{1\over2}\sum^n_{i=1}\sum^n_{j=1}\lambda_i\lambda_jy_i y_j \phi(x_i)^\top \phi(x_j) + \sum^n_{j=1}\lambda_j\]

subject to $\sum^n_{j=1}\lambda_jy_j=0$

  • You have to do this in the dual; the primal is hard when $\phi$ is very high-dimensional.

  • Define

\[K(x_i, x_j)=\phi(x_i)^\top \phi(x_j)\]
  • The matrix $Q$ is
\[Q=\begin{bmatrix}y_1y_1x^\top_1 x_1 & \cdots & y_1y_Nx^\top_1x_N \\ y_2y_1x^\top_2 x_1 & \cdots & y_2y_Nx^\top_2x_N \\ \vdots & \ddots & \vdots \\ y_Ny_1x^\top_N x_1 & \cdots & y_Ny_Nx^\top_Nx_N \end{bmatrix}\]
  • By Kernel Trick
\[Q=\begin{bmatrix}y_1y_1K(x_1, x_1) & \cdots & y_1y_NK(x_1, x_N) \\ y_2y_1K(x_2, x_1) & \cdots & y_2y_NK(x_2, x_N) \\ \vdots & \ddots & \vdots \\ y_Ny_1K(x_N, x_1) & \cdots & y_Ny_NK(x_N, x_N) \end{bmatrix}\]
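
Putting the pieces together, here is a sketch that builds $Q$ from a kernel and solves the dual with CVXPY; the data and the RBF width are placeholders, and a tiny ridge is added to $Q$ only to keep it numerically positive semidefinite:

```python
# Kernelized dual SVM: build Q from a kernel and solve the dual QP with CVXPY.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N = 30
X = rng.standard_normal((N, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)     # not linearly separable

def rbf_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

K = rbf_gram(X, X)
Q = (y[:, None] * y[None, :]) * K                  # Q_ij = y_i y_j K(x_i, x_j)
Q = Q + 1e-8 * np.eye(N)                           # small ridge for numerical stability

lam = cp.Variable(N)
obj = cp.Maximize(-0.5 * cp.quad_form(lam, Q) + cp.sum(lam))
cons = [lam >= 0, y @ lam == 0]                    # add lam <= C for the soft-margin dual
cp.Problem(obj, cons).solve()
print(lam.value)
```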

Kernel

  • The inner product $\phi(x_i)^\top \phi(x_j)$ is called a kernel:
\[K(x_i, x_j)=\phi(x_i)^\top \phi(x_j)\]
  • Second-Order Polynomial kernel
\[K(u, v)=(u^\top v)^2\]
  • Gaussian Radial Basis Function (RBF) kernel
\[K(u,v)=\exp\{-{\|u-v\|^2\over2\sigma^2}\}\]
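
A quick check of the definition for the second-order polynomial kernel: in 2-D the explicit feature map is $\phi(u)=(u_1^2, \sqrt{2}\,u_1u_2, u_2^2)$, and its inner product reproduces $(u^\top v)^2$ (the test vectors below are arbitrary):

```python
# Check that K(u, v) = (u^T v)^2 equals phi(u)^T phi(v) for the 2-D feature map
# phi(u) = (u1^2, sqrt(2) u1 u2, u2^2).
import numpy as np

def poly2_kernel(u, v):
    return float(np.dot(u, v)) ** 2

def phi(u):
    return np.array([u[0]**2, np.sqrt(2.0) * u[0] * u[1], u[1]**2])

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly2_kernel(u, v))               # kernel value computed directly
print(float(np.dot(phi(u), phi(v))))    # same value via the explicit feature map
```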

Radial Basis Function

The radial basis function (RBF) kernel is often written as

\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]

which matches the previous form with $\gamma = 1/(2\sigma^2)$.

  • Typically, $\gamma \in [0,1]$.
  • If $\gamma$ is too big, the classifier over-fits.

Non-Linear Transform for RBF?

  • Let us consider scalars $u, v \in \mathbb{R}$ (and take $\gamma = 1$)

\[K(u,v)=\exp\{-(u-v)^2\}\] \[=\exp\{-u^2\}\exp\{2uv\}\exp\{-v^2\}\] \[=\exp\{-u^2\}\left(\sum^{\infty}_{k=0}{2^k u^k v^k \over k!}\right)\exp\{-v^2\}\] \[=\exp\{-u^2\}\left(1, \sqrt{2^1\over1!}u, \sqrt{2^2\over2!}u^2, \sqrt{2^3\over3!}u^3, \ldots\right)^\top \left(1, \sqrt{2^1\over1!}v, \sqrt{2^2\over2!}v^2, \sqrt{2^3\over3!}v^3, \ldots\right)\exp\{-v^2\}\]

  • So $\phi$ is
\[\phi(x)=\exp\{-x^2\}\left(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, \ldots\right)\]
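
A numerical sanity check of this expansion: truncating $\phi$ after a few dozen terms should reproduce $K(u,v)=\exp\{-(u-v)^2\}$ for scalar inputs (the truncation order below is an arbitrary choice):

```python
# Compare a truncated phi(u)^T phi(v) against the exact RBF kernel value.
import numpy as np
from math import factorial

def phi(x, order=25):
    """Truncated feature map phi(x) = exp(-x^2) * (sqrt(2^k / k!) x^k), k = 0..order."""
    return np.exp(-x**2) * np.array([np.sqrt(2.0**k / factorial(k)) * x**k
                                     for k in range(order + 1)])

u, v = 0.7, -0.4
print(np.dot(phi(u), phi(v)))     # inner product of truncated features
print(np.exp(-(u - v) ** 2))      # exact kernel value; should closely match
```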

So You Need

Example. Radial Basis Function

\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]

The non-linear transform is

\[\phi(x)=\exp\{-x^2\}\left(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, \ldots\right)\]
  • You need an infinite-dimensional non-linear transform!

  • But to compute the kernel $K(u,v)$, you do not need $\phi$.

  • Another good thing about the dual SVM: you can use an infinite-dimensional non-linear transform.

  • The cost of computing $K(u,v)$ is bottlenecked by evaluating $\|u-v\|^2$, not by the infinite-dimensional $\phi$.

Is RBF Always Better than Linear?

  • On a noisy dataset, a linear kernel works well.
  • RBF can over-fit (a comparison sketch follows below).
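
A sketch of this comparison with scikit-learn on synthetic noisy data; the dataset, $C$, and $\gamma$ values are placeholders chosen only for illustration:

```python
# Compare a linear SVM and an RBF SVM on noisy, (roughly) linearly separable data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + 0.8 * rng.standard_normal(200) > 0, 1, -1)   # noisy linear labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", C=1.0, gamma=10.0).fit(X_tr, y_tr)   # large gamma -> risk of over-fit

print("linear: train %.2f  test %.2f" % (linear.score(X_tr, y_tr), linear.score(X_te, y_te)))
print("rbf:    train %.2f  test %.2f" % (rbf.score(X_tr, y_tr), rbf.score(X_te, y_te)))
```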

Testing with Kernels

  • Recall:
\[w^{*}=\sum^N_{n=1}\lambda^{*}_ny_nx_n\]
  • The hypothesis function is
\[h(x)=\text{sign}(w^{* \top} x + w^{*}_0)\] \[=\text{sign}((\sum^N_{n=1}\lambda^{*}_n y_n x_n)^\top x + w^{*}_0)\] \[=\text{sign}(\sum^N_{n=1}\lambda^{*}_ny_nx_n^\top x + w^{*}_0)\]
  • Now you can replace $x^\top_n x$ by $K(x_n, x)$:
\[h(x)=\text{sign}\left(\sum^N_{n=1}\lambda^{*}_n y_n K(x_n, x) + w^{*}_0\right)\]
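
A sketch of kernelized prediction, assuming the dual solution $\lambda^{*}$ and the offset $w^{*}_0$ are already available (e.g., from the dual problem above); the kernel choice is a placeholder:

```python
# Kernelized prediction: h(x) = sign( sum_n lam_n y_n K(x_n, x) + w0 ).
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma**2))

def predict(x, X_train, y_train, lam_star, w0_star, kernel=rbf):
    """Assumes lam_star and w0_star come from a solved (kernelized) dual SVM."""
    s = sum(l * yn * kernel(xn, x) for l, yn, xn in zip(lam_star, y_train, X_train))
    return np.sign(s + w0_star)

# w0_star can be recovered from any support vector x_s with lam_s > 0
# (and lam_s < C in the soft-margin case):
#     w0_star = y_s - sum_n lam_star_n * y_n * K(x_n, x_s)
```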