Machine Learning Class, Week 5
Support Vector Machine
Linearly Not Separable
- The points can be linearly separated, but only with a very narrow margin.
- But possibly the large-margin solution is better, even though one constraint is violated.
 
Soft Margin
- We want to allow data points to stay inside the margin.
- How about changing the hard-margin constraint
\[y_j(w^\top x_j + w_0) \geq 1, j=1,...,N\]
to this one:
\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, j=1,...,N\]
with $\xi_j \geq 0$?
- If $\xi_j > 1$, then $x_j$ will be misclassified.
- We can consider this problem:
\[\min_{w, w_0, \xi} \|w\|^2_2\]
subject to
\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, j=1,...,N\]
\[\xi_j \geq 0\]
- But we need to control $\xi$, for otherwise the solution will be $\xi = \infty$.
- How about this:
\[\min_{w, w_0, \xi} \|w\|^2_2 + C\sum^N_{j=1}\xi_j^2\]
subject to
\[y_j(w^\top x_j + w_0) \geq 1-\xi_j\]
\[\xi_j \geq 0\]
- Control the energy of $\xi$.
 
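Below is a minimal cvxpy sketch of the problem above (cvxpy being a Python analogue of the CVX toolbox mentioned later in these notes). The function name `soft_margin_svm` and the inputs `X` (an $N \times d$ data matrix) and `y` (labels in $\{-1,+1\}$) are assumptions made for the example.

```python
import cvxpy as cp
import numpy as np

def soft_margin_svm(X, y, C=1.0):
    """Soft-margin SVM with the squared (energy) slack penalty above."""
    N, d = X.shape
    w = cp.Variable(d)      # normal vector of the separating hyperplane
    w0 = cp.Variable()      # offset
    xi = cp.Variable(N)     # one slack variable per data point
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum_squares(xi))
    constraints = [cp.multiply(y, X @ w + w0) >= 1 - xi,  # margin constraints
                   xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value, xi.value
```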
Role of $C$
- If $C$ is big, then we enforce $\xi$ to be small.
- If $C$ is small, then $\xi$ can be big.
 
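A small experiment, reusing the hypothetical `soft_margin_svm` sketch above on synthetic data, illustrates this behaviour: as $C$ grows, the total slack shrinks.

```python
import numpy as np

# Two overlapping Gaussian blobs labelled -1 and +1 (synthetic data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(20, 2)),
               rng.normal(+1.0, 1.0, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

for C in (0.01, 1.0, 100.0):
    w, w0, xi = soft_margin_svm(X, y, C=C)
    print(f"C = {C:>6}: total slack = {np.sum(xi):.3f}")  # decreases as C grows
```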
No Misclassification?
- You can still have misclassifications in soft-margin SVM.
- $\xi_j$ can be big for a few outliers.
 
L1-Regularization
- Instead of the $l_2$ penalty, you can also do
\[\min_{w, w_0, \xi} \|w\|^2_2 + C\sum^N_{j=1}\xi_j\]
- This enforces $\xi$ to be sparse.
- Only a few samples are allowed to live inside the margin.
- The problem remains convex.
- So you can still use CVX to solve the problem.
 
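A sketch of this $l_1$-penalized variant in the same cvxpy style as before; only the slack term in the objective changes, and the problem stays convex.

```python
import cvxpy as cp
import numpy as np

def soft_margin_svm_l1(X, y, C=1.0):
    """Soft-margin SVM with the l1 slack penalty C * sum(xi)."""
    N, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    xi = cp.Variable(N)
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))  # l1 penalty on xi
    constraints = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value, xi.value
```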
Connection with Perceptron Algorithm
- In soft-margin SVM, $\xi_j \geq 0$ and $y_j(w^\top x_j + w_0) \geq 1-\xi_j$ imply that
\[\xi_j \geq 0 \quad \text{and} \quad \xi_j \geq 1 - y_j(w^\top x_j + w_0)\]
- We can combine them to get (at the optimum)
\[\xi_j = \max\left(0, 1 - y_j(w^\top x_j + w_0)\right)\]
- So if we use SVM with $l_1$ penalty, then
\[\min_{w, w_0} \|w\|^2_2 + C\sum^N_{j=1}\max\left(0, 1 - y_j(w^\top x_j + w_0)\right)\]
- This means that the training loss is
\[J(w, w_0) = \sum^N_{j=1}\max\left(0, 1 - y_j(w^\top x_j + w_0)\right) + \lambda\|w\|^2_2\]
if we define $\lambda=1/C$.
- Now, you can make $\lambda \rightarrow 0$. This means $C\rightarrow \infty$.
- Then, only the hinge term remains:
\[J(w, w_0) = \sum^N_{j=1}\max\left(0, 1 - y_j(w^\top x_j + w_0)\right)\]
- SVM Loss: $\max\left(0, 1 - y_j(w^\top x_j + w_0)\right)$
- Perceptron Loss: $\max\left(0, -y_j(w^\top x_j + w_0)\right)$
- $\|w\|^2_2$ regularizes the solution.
 
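The two losses at the end of this section, written as plain functions of the margin $y_j(w^\top x_j + w_0)$ (a small illustrative sketch).

```python
import numpy as np

def svm_loss(margin):
    """Hinge loss: penalizes any point with margin less than 1."""
    return np.maximum(0.0, 1.0 - margin)

def perceptron_loss(margin):
    """Perceptron loss: penalizes only misclassified points (margin < 0)."""
    return np.maximum(0.0, -margin)

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(svm_loss(margins))         # [2.  1.  0.5 0.  0. ]
print(perceptron_loss(margins))  # [1.  0.  0.  0.  0. ]
```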
The Kernel trick
- A trick to turn a linear classifier into a nonlinear classifier.
- Dual SVM:
\[\max_{\lambda \geq 0} \sum^N_{j=1}\lambda_j - {1\over2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_j y_i y_j x_i^\top x_j\]
subject to $\sum^N_{j=1}\lambda_jy_j=0$
- Kernel Trick: replace $x_i^\top x_j$ by $\phi(x_i)^\top \phi(x_j)$:
\[\max_{\lambda \geq 0} \sum^N_{j=1}\lambda_j - {1\over2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_j y_i y_j \phi(x_i)^\top \phi(x_j)\]
subject to $\sum^N_{j=1}\lambda_jy_j=0$
- You have to do this in the dual. The primal is hard.
- Define $\lambda = [\lambda_1, ..., \lambda_N]^\top$, so the dual objective becomes
\[\max_{\lambda \geq 0} \mathbf{1}^\top\lambda - {1\over2}\lambda^\top Q \lambda\]
- The matrix $Q$ is
\[Q_{ij} = y_i y_j \phi(x_i)^\top \phi(x_j)\]
- By the Kernel Trick,
\[Q_{ij} = y_i y_j K(x_i, x_j)\]
 
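A rough cvxpy sketch of the kernelized dual above. The helper name `dual_svm` and the generic `kernel` argument are assumptions, and the quadratic term is rewritten through a (jittered) Cholesky factor of $Q$ purely for numerical convenience; a soft-margin dual would additionally cap $\lambda \leq C$.

```python
import cvxpy as cp
import numpy as np

def dual_svm(X, y, kernel):
    """Solve the (hard-margin) dual with Q_ij = y_i y_j K(x_i, x_j)."""
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    Q = np.outer(y, y) * K                        # Q_ij = y_i y_j K(x_i, x_j)
    L = np.linalg.cholesky(Q + 1e-8 * np.eye(N))  # Q ~ L L^T, jittered for stability
    lam = cp.Variable(N)
    objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(L.T @ lam))
    constraints = [lam >= 0, lam @ y == 0]        # dual feasibility
    cp.Problem(objective, constraints).solve()
    return lam.value
```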
Kernel
- The inner product $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ is called a kernel.
- Second-Order Polynomial kernel:
\[K(u,v)=(1 + u^\top v)^2\]
- Gaussian Radial Basis Function (RBF) Kernel:
\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]
 
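The two kernels above written as plain functions (a sketch; the additive constant 1 inside the polynomial kernel and the default $\gamma$ are conventions chosen for the example). Either function can be passed as the `kernel` argument of the `dual_svm` sketch earlier.

```python
import numpy as np

def poly2_kernel(u, v):
    """Second-order polynomial kernel (1 + u^T v)^2."""
    return (1.0 + u @ v) ** 2

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))
```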
Radial Basis Function
The Radial Basis Function takes the form
\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]
- Typical $\gamma \in [0,1]$
- $\gamma$ too big: over-fit
 
Non-Linear Transform for RBF?
- Let us consider scalar $u \in \mathbb{R}$ (and take $\gamma=1$):
\[K(u,v)=\exp\{-(u-v)^2\}\]
\[=\exp\{-u^2\}\exp\{2uv\}\exp\{-v^2\}\]
\[=\exp\{-u^2\}\left(\sum^{\infty}_{k=0}{2^k u^k v^k \over k!}\right)\exp\{-v^2\}\]
\[=\exp\{-u^2\}\left(1, \sqrt{2^1\over1!}u, \sqrt{2^2\over2!}u^2, \sqrt{2^3\over3!}u^3, ... \right)^\top \times \left(1, \sqrt{2^1\over1!}v, \sqrt{2^2\over2!}v^2, \sqrt{2^3\over3!}v^3, ... \right)\exp\{-v^2\}\]
- So $\phi$ is
\[\phi(x)=\exp\{-x^2\}\left(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, ... \right)\]
 
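A quick numerical check of the expansion above (with $\gamma = 1$): a truncated version of $\phi$ already reproduces $K(u,v)$ to high accuracy. The truncation length of 25 terms is an arbitrary choice for the example.

```python
import numpy as np
from math import factorial

def phi(x, terms=25):
    """Truncated feature map: phi(x)_k = exp(-x^2) * sqrt(2^k / k!) * x^k."""
    k = np.arange(terms)
    coeff = np.sqrt(2.0 ** k / np.array([factorial(int(i)) for i in k]))
    return np.exp(-x ** 2) * coeff * x ** k

u, v = 0.7, -0.3
print(np.exp(-(u - v) ** 2))  # exact kernel value K(u, v)
print(phi(u) @ phi(v))        # inner product of truncated feature maps, nearly equal
```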
So You Need
Example. Radial Basis Function
\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]
The non-linear transform is
\[\phi(x)=\exp\{-x^2\}\left(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, ... \right)\]
    
You need infinite dimensional non-linear transform!
 - 
    
But to compute the kernel $K(u,v)$ you do not need $\phi$.
 - 
    
Another Good thing about Dual SVM: You can do infinite dimensional non-linear transform.
 - 
    
Cost of computing $K(u,v)$ is bottleneck by $|u-v|^2$
 
Is RBF Always Better than Linear?
- Noisy dataset: linear works well.
- RBF: over-fits.
 
Testing with Kernels
- Recall that the dual solution gives
\[w = \sum^N_{n=1}\lambda_n y_n x_n\]
- The hypothesis function is
\[g(x) = \mathrm{sign}\left(w^\top x + w_0\right) = \mathrm{sign}\left(\sum^N_{n=1}\lambda_n y_n x^\top_n x + w_0\right)\]
- Now you can replace $x^\top_n x$ by $K(x_n, x)$:
\[g(x) = \mathrm{sign}\left(\sum^N_{n=1}\lambda_n y_n K(x_n, x) + w_0\right)\]
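Putting the last bullet into code: a kernelized prediction sketch. It assumes dual variables `lam` (e.g. from the `dual_svm` sketch above), the training data, a kernel function, and an offset `w0` are already available; the helper name `predict` is an assumption.

```python
import numpy as np

def predict(x, X_train, y_train, lam, w0, kernel):
    """Kernelized hypothesis: sign(sum_n lam_n * y_n * K(x_n, x) + w0)."""
    score = sum(lam[n] * y_train[n] * kernel(X_train[n], x)
                for n in range(len(y_train)))
    return np.sign(score + w0)
```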