Machine Learning Class, Week 5
Support Vector Machine
Linearly Not Separable
- The points can be linearly separated, but the margin is very narrow.
- But possibly the large-margin solution is better, even though one constraint is violated.
Soft Margin
- We want to allow data points to stay inside the margin.
- How about changing the hard-margin constraint
\[y_j(w^\top x_j + w_0) \geq 1, \quad j=1,...,N\]
to this one:
\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, \quad j=1,...,N,\]
with $\xi_j \geq 0$?
- If $\xi_j > 1$, then $x_j$ will be misclassified.
- We can consider this problem:
\[\min_{w, w_0, \xi} \; \frac{1}{2}\|w\|^2_2\]
subject to
\[y_j(w^\top x_j + w_0) \geq 1-\xi_j, \quad j=1,...,N\]
\[\xi_j \geq 0\]
- But we need to control $\xi$, for otherwise the solution will be $\xi=\infty$.
- How about this:
\[\min_{w, w_0, \xi} \; \frac{1}{2}\|w\|^2_2 + C\sum^N_{j=1}\xi_j\]
subject to
\[y_j(w^\top x_j + w_0) \geq 1-\xi_j\]
\[\xi_j \geq 0\]
- The term $C\sum^N_{j=1}\xi_j$ controls the energy of $\xi$.
Role of $C$
- If $C$ is big, then we enforce $\xi$ to be small.
- If $C$ is small, then $\xi$ can be big.
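To see the role of $C$ concretely, here is a small illustrative sketch (not from the lecture) using scikit-learn's LinearSVC on hypothetical toy data; a larger $C$ should shrink the total slack $\sum_j \max\{0,\, 1-y_j(w^\top x_j + w_0)\}$.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two slightly overlapping Gaussian blobs (hypothetical toy data).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1}

for C in [0.01, 1.0, 100.0]:
    clf = LinearSVC(C=C, loss="hinge", max_iter=100000).fit(X, y)
    slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    print(f"C={C:>6}: total slack = {slack.sum():.2f}")
# Larger C penalizes slack more heavily, so the total slack shrinks.
```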
No Misclassification?
- You can have misclassification in soft-margin SVM.
- $\xi_j$ can be big for a few outliers.
L1-Regularization
- The penalty $C\sum^N_{j=1}\xi_j$ is an $\ell_1$-norm penalty on $\xi$ (since $\xi_j \geq 0$); instead of the $\ell_1$ norm, you can also use other penalties on $\xi$.
- The $\ell_1$ penalty enforces $\xi$ to be sparse.
- Only a few samples are allowed to live inside the margin.
- The problem remains convex.
- So you can still use CVX to solve the problem.
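For example, here is a minimal sketch of the soft-margin primal in CVXPY (the Python counterpart of CVX); the function name and arguments are illustrative, not from the lecture.

```python
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Solve: min 0.5*||w||^2 + C*sum(xi)  s.t.  y_j(w^T x_j + w0) >= 1 - xi_j, xi >= 0.
    X is an (N, d) NumPy array, y is an (N,) array of +/-1 labels."""
    N, d = X.shape
    w, w0, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value, xi.value
```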
Connection with Perceptron Algorithm
- In soft-margin SVM, $\xi_j \geq 0$ and $y_j(w^\top x_j + w_0) \geq 1-\xi_j$ imply that
\[\xi_j \geq 0 \quad \text{and} \quad \xi_j \geq 1 - y_j(w^\top x_j + w_0).\]
- We can combine them to get
\[\xi_j \geq \max\{0,\; 1 - y_j(w^\top x_j + w_0)\},\]
with equality at the optimum.
- So if we use SVM with the $\ell_1$ penalty, then the problem becomes
\[\min_{w, w_0} \; \frac{1}{2}\|w\|^2_2 + C\sum^N_{j=1}\max\{0,\; 1 - y_j(w^\top x_j + w_0)\}.\]
- This means that the training loss is
\[J(w, w_0) = \sum^N_{j=1}\max\{0,\; 1 - y_j(w^\top x_j + w_0)\} + \frac{\lambda}{2}\|w\|^2_2,\]
if we define $\lambda = 1/C$.
- Now, you can make $\lambda \rightarrow 0$. This means $C \rightarrow \infty$.
- Then, only the loss term remains:
\[J(w, w_0) = \sum^N_{j=1}\max\{0,\; 1 - y_j(w^\top x_j + w_0)\}.\]
- SVM loss: $\max\{0,\; 1 - y_j(w^\top x_j + w_0)\}$
- Perceptron loss: $\max\{0,\; -y_j(w^\top x_j + w_0)\}$
- $\|w\|^2_2$ regularizes the solution.
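As an illustration (a sketch, not part of the lecture), the two losses differ only in the constant inside the max:

```python
import numpy as np

def svm_loss(w, w0, X, y):
    # Hinge (SVM) loss: sum_j max(0, 1 - y_j (w^T x_j + w0))
    return np.maximum(0.0, 1.0 - y * (X @ w + w0)).sum()

def perceptron_loss(w, w0, X, y):
    # Perceptron loss: sum_j max(0, -y_j (w^T x_j + w0))
    return np.maximum(0.0, -y * (X @ w + w0)).sum()
```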
The Kernel trick
- A trick to turn a linear classifier into a nonlinear classifier.
- Dual SVM:
\[\max_{\lambda \geq 0} \; \sum^n_{j=1}\lambda_j - \frac{1}{2}\sum^n_{i=1}\sum^n_{j=1}\lambda_i\lambda_j y_i y_j\, x_i^\top x_j\]
subject to $\sum^n_{j=1}\lambda_jy_j=0$
- Kernel Trick: replace $x_i^\top x_j$ by $\phi(x_i)^\top \phi(x_j)$:
\[\max_{\lambda \geq 0} \; \sum^n_{j=1}\lambda_j - \frac{1}{2}\sum^n_{i=1}\sum^n_{j=1}\lambda_i\lambda_j y_i y_j\, \phi(x_i)^\top \phi(x_j)\]
subject to $\sum^n_{j=1}\lambda_jy_j=0$
- You have to do this in the dual; the primal is hard.
- Define the dual in matrix-vector form:
\[\max_{\lambda \geq 0} \; \mathbf{1}^\top\lambda - \frac{1}{2}\lambda^\top Q\lambda \quad \text{subject to} \quad \sum^n_{j=1}\lambda_jy_j=0\]
- The matrix $Q$ has entries
\[[Q]_{ij} = y_i y_j\, x_i^\top x_j\]
- By the Kernel Trick,
\[[Q]_{ij} = y_i y_j\, \phi(x_i)^\top \phi(x_j)\]
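A small numerical sketch of this construction (function names are mine, not the lecture's): build the Gram matrix with a kernel and multiply entrywise by $y_i y_j$.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), computed for all pairs at once
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def dual_Q(X, y, gamma=1.0):
    # [Q]_{ij} = y_i y_j K(x_i, x_j); with a linear kernel this is y_i y_j x_i^T x_j
    return np.outer(y, y) * rbf_gram(X, gamma)
```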
Kernel
- The inner product $\phi(x_i)^\top \phi(x_j)$ is called a kernel.
- Second-Order Polynomial kernel, e.g., $K(u,v)=(1+u^\top v)^2$
- Gaussian Radial Basis Function (RBF) Kernel: $K(u,v)=\exp\{-\gamma\|u-v\|^2\}$
Radial Basis Function
The Radial Basis Function takes the form
\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]
- Typical $\gamma \in [0,1]$
- $\gamma$ too big: Over-fit
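A quick illustrative experiment (not from the lecture) with scikit-learn's RBF SVC on noisy two-moons data shows this over-fitting behavior when $\gamma$ is too large:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical noisy toy data to illustrate the effect of gamma.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for gamma in [0.1, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(Xtr, ytr)
    print(f"gamma={gamma:>5}: train acc={clf.score(Xtr, ytr):.2f}, "
          f"test acc={clf.score(Xte, yte):.2f}")
# A very large gamma drives the training accuracy toward 1 while the
# test accuracy drops: over-fitting.
```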
Non-Linear Transform for RBF?
Let us consider scalars $u, v \in \mathbb{R}$ (with $\gamma = 1$):
\[K(u,v)=\exp\{-(u-v)^2\}\]
\[=\exp\{-u^2\}\exp\{2uv\}\exp\{-v^2\}\]
\[=\exp\{-u^2\}\Big(\sum^{\infty}_{k=0}{2^k u^k v^k \over k!}\Big)\exp\{-v^2\}\]
\[=\exp\{-u^2\}\Big(1, \sqrt{2^1\over1!}u, \sqrt{2^2\over2!}u^2, \sqrt{2^3\over3!}u^3, \ldots\Big)^\top \Big(1, \sqrt{2^1\over1!}v, \sqrt{2^2\over2!}v^2, \sqrt{2^3\over3!}v^3, \ldots\Big)\exp\{-v^2\}\]
- So $\phi$ is
\[\phi(x)=\exp\{-x^2\}\Big(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, \ldots\Big)\]
So You Need an Infinite-Dimensional Transform
Example. Radial Basis Function
\[K(u,v)=\exp\{-\gamma\|u-v\|^2\}\]
The non-linear transform is
\[\phi(x)=\exp\{-x^2\}\Big(1, \sqrt{2^1\over1!}x, \sqrt{2^2\over2!}x^2, \sqrt{2^3\over3!}x^3, \ldots\Big)\]
- You need an infinite-dimensional non-linear transform!
- But to compute the kernel $K(u,v)$, you do not need $\phi$.
- Another good thing about the dual SVM: you can use an infinite-dimensional non-linear transform.
- The cost of computing $K(u,v)$ is bottlenecked by computing $\|u-v\|^2$.
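A quick numerical sanity check (a sketch, with arbitrarily chosen $u$, $v$ and $\gamma = 1$): a truncated version of the infinite-dimensional $\phi$ already reproduces $K(u,v)$ to high precision.

```python
import numpy as np
from math import factorial

def phi(x, terms=25):
    # Truncated feature map: phi_k(x) = exp(-x^2) * sqrt(2^k / k!) * x^k, k = 0, 1, ...
    k = np.arange(terms)
    coeffs = np.sqrt(np.array([2.0**i / factorial(i) for i in k]))
    return np.exp(-x**2) * coeffs * x**k

u, v = 0.7, -0.3
print(phi(u) @ phi(v))        # truncated phi(u)^T phi(v)
print(np.exp(-(u - v)**2))    # exact K(u, v) with gamma = 1; the two should match
```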
Is RBF Always Better than Linear?
- Noisy dataset: Linear works well.
- RBF: over-fits.
Testing with Kernels
- Recall that the dual solution gives
\[w = \sum_n \lambda_n y_n x_n.\]
- The hypothesis function is
\[g(x) = \mathrm{sign}\Big(\sum_n \lambda_n y_n\, x^\top_n x + w_0\Big).\]
- Now you can replace $x^\top_n x$ by $K(x_n, x)$:
\[g(x) = \mathrm{sign}\Big(\sum_n \lambda_n y_n\, K(x_n, x) + w_0\Big).\]
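A minimal prediction sketch, assuming the dual variables lam, the offset w0, and the training set come from a previously trained kernel SVM (all names are illustrative):

```python
import numpy as np

def predict(x, X_train, y_train, lam, w0, kernel):
    # g(x) = sign( sum_n lam_n y_n K(x_n, x) + w0 )
    k = np.array([kernel(x_n, x) for x_n in X_train])
    return np.sign(np.sum(lam * y_train * k) + w0)

# Example kernel: Gaussian RBF with gamma = 1
rbf = lambda u, v: np.exp(-np.sum((u - v)**2))
```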