(Week 4) SVM

 

Machine Learning Course, Week 4


Support Vector Machine

The SVM is the solution of the following optimization problem:

\[\text{minimize}_{w, w_0}{1\over2}\|w\|^2_2\] \[\text{subject to } y_j(w^\top x_j + w_0) \geq 1, j=1,...,N\]
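As a quick illustration (my own sketch, not from the lecture), the constraint $y_j(w^\top x_j + w_0) \geq 1$ is easy to check numerically; the data and the candidate $(w, w_0)$ below are made-up examples.

```python
import numpy as np

# Hypothetical toy data: rows of X are the x_j, labels y_j are +/-1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# A candidate hyperplane (a guess, not necessarily the SVM solution).
w = np.array([0.5, 0.5])
w0 = -1.0

# Primal feasibility: y_j * (w^T x_j + w0) >= 1 for every j.
margins = y * (X @ w + w0)
print(margins)               # per-sample margins
print(np.all(margins >= 1))  # True iff (w, w0) is feasible
```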

Lagrange Function

  • Goal : Construct the dual problem of
\[\text{minimize}_{w, w_0}{1\over2}\|w\|^2_2\] \[\text{subject to } y_j(w^\top x_j + w_0) \geq 1, j=1,...,N\]
  • Approach: Consider the Lagrangian function
\[\mathcal{L}(w, w_0, \lambda)={1\over2}\|w\|^2_2+\sum^N_{j=1}\lambda_j[1-y_j(w^\top x_j + w_0)]\]
  • The solution occurs at a saddle point of $\mathcal{L}$:
\[\nabla_{w, w_0}\mathcal{L}(w, w_0, \lambda)=0\] \[\nabla_\lambda \mathcal{L}(w, w_0, \lambda)=0\]

Inequality Constrained Optimization

Inequality constrained optimization:

\[\text{minimize}_{x\in \mathbb{R}^n} f(x)\] \[\text{subject to } g_i(x) \geq 0, i=1,...,m\] \[h_j(x) = 0, j=1,...,k\]

This requires the Lagrangian function:

\[\mathcal{L}(x,\mu, v)=f(x)-\sum^{m}_{i=1}\mu_ig_i(x)-\sum^k_{j=1}v_jh_j(x)\]

$\mu \in \mathbb{R}^m$ and $v \in \mathbb{R}^k$ are called the Lagrange multipliers or the dual variables.
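As a side note (my own sketch, not part of the notes), the Lagrangian translates directly into code; `f`, `g`, and `h` below are hypothetical callables standing in for the objective and constraints.

```python
import numpy as np

def lagrangian(x, mu, v, f, g, h):
    """L(x, mu, v) = f(x) - sum_i mu_i * g_i(x) - sum_j v_j * h_j(x).

    f : objective, returns a scalar
    g : inequality constraints, returns an array of shape (m,)  (g(x) >= 0)
    h : equality constraints, returns an array of shape (k,)    (h(x) = 0)
    """
    return f(x) - np.dot(mu, g(x)) - np.dot(v, h(x))
```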

Karush-Kuhn-Tucker Conditions

If $(x^{*}, \mu^{*}, v^{*})$ is the solution to the constrained optimization, then all of the following conditions must hold:

$\nabla_x \mathcal{L}(x^{*}, \mu^{*}, v^{*})=0$

  • Stationarity.
  • The primal variables should be stationary.

$g_i(x^{*}) \geq 0$ and $h_j(x^{*})=0$ for all $i$ and $j$.

  • Primal Feasibility.
  • Ensures that constraints are satisfied.

$\mu^{*}_i \geq 0$ for all $i$.

  • Dual Feasibility.
  • Requires $\mu^{*}_i \geq 0$ but places no constraint on $v^{*}_j$.

$\mu^{*}_i g_i(x^{*})=0$ for all $i$.

  • Complementary Slackness
  • Either $\mu^{*}_i=0$ or $g_i(x^{*})=0$ (or both).

The KKT conditions are first-order necessary conditions.
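To make these four conditions concrete, here is a rough numerical KKT checker (my own sketch; the finite-difference gradient and the tolerance are implementation choices, not from the lecture).

```python
import numpy as np

def grad(fn, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function fn at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return g

def check_kkt(x, mu, v, f, g_ineq, h_eq, tol=1e-5):
    """Check stationarity, primal/dual feasibility, and complementary slackness."""
    L = lambda z: f(z) - mu @ g_ineq(z) - v @ h_eq(z)
    return {
        "stationarity":    np.linalg.norm(grad(L, x)) < tol,
        "primal_feasible": np.all(g_ineq(x) >= -tol) and np.all(np.abs(h_eq(x)) < tol),
        "dual_feasible":   np.all(mu >= -tol),
        "complementary":   np.all(np.abs(mu * g_ineq(x)) < tol),
    }
```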

Lagrangian Function

The Lagrangian function is

\[\mathcal{L}(w, w_0, \lambda)={1\over2}\|w\|^2_2+\sum^N_{j=1}\lambda_j[1-y_j(w^\top x_j + w_0)]\]

where feasibility requires $1-y_j(w^\top x_j + w_0) \leq 0$ for all $j$.

The complementary slackness condition says that, for each $j$, one of two cases holds:

  • $\lambda_j > 0$ and $[1-y_j(w^\top x_j + w_0)]=0$
  • $\lambda_j = 0$ and $[1-y_j(w^\top x_j + w_0)]<0$

So, if we want $\nabla_\lambda \mathcal{L}(w,w_0, \lambda)=0$, one of these two cases must hold for each $j$. Now consider the term

\[\sum^N_{j=1}\lambda_j [1-y_j(w^\top x_j + w_0)]\]

It has no saddle point because it is linear in $\lambda$.

Moreover, since $1-y_j(w^\top x_j + w_0) \leq 0$, minimizing over $\lambda \geq 0$ is unbounded below. So we must go with the maximum instead.

Primal Problem

Let $\lambda^*$ be the maximizer

\[\lambda^*=\text{argmax}_{\lambda \geq 0} \{ \sum^N_{j=1} \lambda_j[1-y_j(w^\top x_j + w_0)] \}\]

Then the primal problem is

\[\text{minimize}_{w,w_0}\mathcal{L}(w, w_0, \lambda^*)\] \[=\text{minimize}_{w,w_0} \{ {1\over2}\|w\|^2_2+\text{max}_{\lambda \geq 0} \{\sum^N_{j=1} \lambda_j[1-y_j(w^\top x_j +w_0)] \} \}\] \[=\text{minimize}_{w, w_0} \{ \text{max}_{\lambda \geq 0} \mathcal{L}(w, w_0, \lambda) \}\]

This is a min-max problem.

\[\min_{w, w_0} \max_{\lambda \geq 0} \mathcal{L}(w, w_0, \lambda )\]

Strong Duality

  • Recall that our problem is a quadratic program (QP)
  • Strong duality holds for QP, so the primal and dual optimal values coincide:

Primal

\[\min_{w,w_0}\max_{\lambda \geq 0} \mathcal{L}(w, w_0, \lambda)\]

Dual

\[\max_{\lambda \geq 0}\min_{w, w_0} \mathcal{L}(w, w_0, \lambda)\]

Toy Example

  • The SVM problem
\[\text{minimize}_{w, w_0} {1\over 2}\|w\|^2_2\] \[\text{subject to } y_j(w^\top x_j + w_0) \geq 1, j=1,...,N\]

is in the form of

$\text{minimize}_u \|u\|^2_2$, subject to $a^\top_j u \geq b_j$, $j=1,2,…,N$

  • Example:
\[\text{minimize}_{u_1, u_2} \; u^2_1 + u^2_2\] \[\begin{bmatrix} 1 & 2 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \geq \begin{bmatrix} 2 \\ 0 \\ 0\end{bmatrix}\]
  • Can we write down its dual problem?

  • The Lagrangian function is

\[\mathcal{L}(u, \lambda)=u^2_1+u^2_2 + \lambda_1(2-u_1-2u_2)-\lambda_2 u_1-\lambda_3 u_2\]
  • Minimize over $u$: setting $\nabla_u \mathcal{L}=0$ gives $u_1={\lambda_1+\lambda_2 \over 2}$ and $u_2={2\lambda_1+\lambda_3 \over 2}$.

  • Plugging into the Lagrangian function yields

\[\text{maximize}_\lambda \mathcal{L}(\lambda)=-{5\over4}\lambda^2_1-{1\over4}\lambda^2_2-{1\over4}\lambda^2_3-{1\over2}\lambda_1\lambda_2-\lambda_1\lambda_3+2\lambda_1\]
  • Primal is QP. Dual is also QP.

Have We Gained Anything?

  • Here is the dual problem:
\[\text{maximize}_\lambda \mathcal{L}(\lambda) = -{5\over4}\lambda_1^2-{1\over4}\lambda_2^2-{1\over4}\lambda_3^2-{1\over2}\lambda_1\lambda_2-\lambda_1\lambda_3+2\lambda_1\]

subject to $\lambda \geq 0$

  • The following terms are all non-positive! So, at the maximum, we must have $\lambda_2=\lambda_3=0$.
\[-{1\over4}\lambda_2^2-{1\over4}\lambda_3^2-{1\over2}\lambda_1\lambda_2-\lambda_1\lambda_3\]
  • This gives
\[\text{maximize}_{\lambda_1 \geq 0}-{5\over4}\lambda^2_1 + 2\lambda_1\]

which is maximized at $\lambda_1={4\over5}$

  • Plugging into the primal yields
\[u_1={\lambda_1+\lambda_2\over2}={2\over5}\] \[u_2={2\lambda_1+\lambda_3\over2}={4\over5}\]
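As a sanity check (my own, not in the notes), both the primal and the dual of this toy problem can be solved numerically to confirm $\lambda^{*}_1 = 4/5$ and $(u_1, u_2) = (2/5, 4/5)$; the sketch below assumes SciPy's generic SLSQP solver.

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 2.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 0.0, 0.0])

# Primal: minimize u1^2 + u2^2 subject to A u >= b.
primal = minimize(lambda u: u @ u, x0=np.ones(2), method="SLSQP",
                  constraints={"type": "ineq", "fun": lambda u: A @ u - b})
print(primal.x)  # expected: [0.4, 0.8]

# Dual: maximize the concave quadratic in lambda over lambda >= 0
# (equivalently, minimize its negative).
def neg_dual(lam):
    l1, l2, l3 = lam
    return -(-1.25 * l1**2 - 0.25 * l2**2 - 0.25 * l3**2
             - 0.5 * l1 * l2 - l1 * l3 + 2 * l1)

dual = minimize(neg_dual, x0=np.ones(3), method="SLSQP",
                bounds=[(0, None)] * 3)
print(dual.x)    # expected: [0.8, 0.0, 0.0]
```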

Dual of SVM

  • We want to find the dual problem of
\[\text{minimize}_{w, w_0}{1\over2}\|w\|^2_2\] \[\text{subject to } y_j(w^\top x_j + w_0) \geq 1, j=1,...,N\]
  • We start with the Lagrangian function
\[\mathcal{L}(w, w_0, \lambda)={1\over2}\|w\|^2_2+\sum^N_{j=1}\lambda_j[1-y_j(w^\top x_j + w_0)]\]
  • Let us minimize over $(w, w_0)$ : Primal stationarity
\[\nabla_w \mathcal{L}(w, w_0, \lambda)=w-\sum^N_{j=1}\lambda_jy_jx_j=0\] \[\nabla_{w_0}\mathcal{L}(w, w_0, \lambda)=\sum^N_{j=1}\lambda_jy_j=0\]

Interpreting $\nabla_w \mathcal{L}(w,w_0,\lambda)=0$

  • The first result suggests that
\[w=\sum^N_{j=1}\lambda_jy_jx_j\]
  • This is where support vectors come from: each $\lambda_j$ is either $\lambda_j=0$ or $\lambda_j > 0$

  • The complementarity condition states that

\[\lambda_j^{*}[1-y_j(w^{*\top} x_j + w_0^{*})]=0\]
  • If $1-y_j(w^{*\top}x_j + w_0^{*}) < 0$, then $\lambda^{*}_j=0$

  • If $\lambda^{*}_j > 0$, then $1-y_j(w^{*\top}x_j + w_0^{*})=0$

  • So you can define the support vector set:

\[\nu=\{ j \mid \lambda^{*}_j > 0\}\]
  • So the optimal weight is
\[w^{*}=\sum_{j \in \nu} \lambda^{*}_j y_j x_j\]

Back to Duality

Substituting the stationarity condition $w=\sum^N_{j=1}\lambda_jy_jx_j$ into the Lagrangian function gives

\[\mathcal{L}(w^{*}, w_0, \lambda)={1\over2}\|w^{*}\|^2_2+\sum^N_{j=1}\lambda_j[1-y_j(w^{*\top}x_j+w_0)]\] \[={1\over2}\Big\|\sum^{N}_{j=1}\lambda_jy_jx_j\Big\|^2_2+\sum^N_{j=1}\lambda_j\Big[1-y_j\Big(\Big(\sum^N_{i=1}\lambda_iy_ix_i\Big)^\top x_j + w_0\Big)\Big]\]
  • Writing the first term as $A$ and the second as $B$: using $\nabla_{w_0}\mathcal{L}(w, w_0, \lambda)=\sum_{j=1}^N{\lambda_jy_j}=0$, we can show that
\[A={1\over2}\sum^{N}_{i=1}\sum^{N}_{j=1}\lambda_i\lambda_jy_iy_jx_i^\top x_j\] \[B=\sum^N_{j=1}\lambda_j-\sum^N_{i=1} \sum^N_{j=1} \lambda_i \lambda_j y_i y_j x^\top_i x_j - \Big(\sum^N_{j=1}\lambda_jy_j\Big)w_0=\sum^N_{j=1}\lambda_j-\sum^N_{i=1} \sum^N_{j=1} \lambda_i \lambda_j y_i y_j x^\top_i x_j\]
  • and we can show that
\[A+B=\sum^N_{j=1}\lambda_j-{1\over2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_j y_iy_jx^\top_ix_j\]
  • Therefore, the dual problem is
\[\text{maximize}_{\lambda \geq 0} -{1\over2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_j y_iy_jx^\top_ix_j + \sum^N_{j=1}\lambda_j\]

subject to

\[\sum^N_{j=1}\lambda_jy_j=0\]

If you prefer matrix-vector:

\[\text{maximize}_{\lambda \geq 0} -{1\over2}\lambda^\top Q \lambda + 1^\top \lambda\]

subject to $y^\top \lambda =0$

We can combine the constraints $\lambda \geq 0$ and $y^\top \lambda =0$ as $A\lambda \geq 0$:

  • $y^\top \lambda =0$ means
\[y^\top \lambda \geq 0 \text{ and } y^\top \lambda \leq 0\]
  • Thus, we can write $y^\top \lambda=0$ as
\[\begin{bmatrix}y^\top \\ -y^\top \end{bmatrix} \lambda \geq \begin{bmatrix}0 \\ 0 \end{bmatrix}\]
  • Therefore, the matrices $Q$ and $A$ are
\[Q=\begin{bmatrix}y_1y_1x^\top_1 x_1 & \cdots & y_1y_Nx^\top_1x_N \\ y_2y_1x^\top_2 x_1 & \cdots & y_2y_Nx^\top_2x_N \\ \vdots & \ddots & \vdots \\ y_Ny_1x^\top_N x_1 & \cdots & y_Ny_Nx^\top_Nx_N \end{bmatrix}\]

and

\[A=\begin{bmatrix}y^\top \\ -y^\top \\ I\end{bmatrix}\]
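In code, $Q$ and $A$ are straightforward to assemble; a minimal NumPy sketch (my own, assuming the rows of `X` are the $x_j$ and `y` holds the $\pm 1$ labels):

```python
import numpy as np

def build_Q_A(X, y):
    """Q[i, j] = y_i y_j x_i^T x_j ;  A stacks y^T, -y^T, and the identity."""
    Q = np.outer(y, y) * (X @ X.T)           # label outer product times Gram matrix
    A = np.vstack([y, -y, np.eye(len(y))])   # A @ lam >= 0 encodes y^T lam = 0 and lam >= 0
    return Q, A
```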

So How to Solve the SVM Problem?

  • You look at the dual problem
\[\text{maximize}_\lambda {-{1\over2}\lambda^\top Q \lambda + 1^\top \lambda}\]

subject to

\[A\lambda \geq 0\]
  • You get the solution $\lambda^{*}$

  • Then compute $w^{*}$

\[w^{*}=\sum_{j\in \nu}\lambda^{*}_jy_jx_j\]
  • $\nu$ is the set of support vectors, i.e., the indices $j$ with $\lambda^{*}_j > 0$

Are We Done Yet?

  • Not quite

  • We still need to find out $w_0^{*}$

  • Pick any support vector $x^{+} \in \mathcal{C}_{+}$ and $x^{-} \in \mathcal{C}_{-}$.

  • Then we have

\[w^\top x^+ + w_0 = +1\] \[w^\top x^{-} + w_0 = -1\]
  • Summing them, we have $w^\top(x^{+} + x^{-}) + 2w_0=0$, which means
\[w^{*}_0=-{(x^{+}+x^{-})^\top w^{*} \over2}\]
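Putting the pieces together, here is an end-to-end sketch of the whole recipe (my own; it reuses SciPy's generic SLSQP solver rather than a dedicated QP solver, and the data at the bottom is hypothetical).

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y, tol=1e-6):
    """Solve the dual SVM, then recover w* and w0* from the support vectors."""
    N = len(y)
    Q = np.outer(y, y) * (X @ X.T)

    # Dual: maximize -1/2 lam^T Q lam + 1^T lam, i.e. minimize the negative,
    # subject to lam >= 0 and y^T lam = 0.
    res = minimize(lambda lam: 0.5 * lam @ Q @ lam - lam.sum(),
                   x0=np.zeros(N), method="SLSQP",
                   bounds=[(0, None)] * N,
                   constraints={"type": "eq", "fun": lambda lam: lam @ y})
    lam = res.x

    sv = lam > tol                   # support vectors: lambda_j > 0
    w = (lam[sv] * y[sv]) @ X[sv]    # w* = sum_j lambda_j y_j x_j

    # w0* from one support vector per class (assumes each class has one).
    x_pos = X[sv & (y > 0)][0]
    x_neg = X[sv & (y < 0)][0]
    w0 = -0.5 * (x_pos + x_neg) @ w
    return w, w0, lam

# Hypothetical linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
w, w0, lam = hard_margin_svm(X, y)
print(w, w0, lam)
```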

Summary of Dual SVM

  • Primal
\[\text{minimize}_{w,w_0}{1\over2}\|w\|^2_2\]

subject to $y_j(w^\top x_j + w_0) \geq 1$

  • Strong Duality
\[\min_{w, w_0}\max_{\lambda \geq 0} \mathcal{L}(w, w_0, \lambda)=\max_{\lambda \geq 0}\min_{w, w_0} \mathcal{L}(w, w_0, \lambda)\]
  • Dual
\[\text{maximize}_{\lambda \geq 0} -{1\over2}\sum^N_{i=1}\sum^N_{j=1}\lambda_i\lambda_j y_i y_j x^\top_i x_j + \sum^N_{j=1} \lambda_j\]

subject to $\sum^N_{j=1} \lambda_jy_j=0$

  • The weights are computed as
\[w^{*}=\sum^N_{j=1}\lambda^{*}_jy_jx_j\]
  • This is where support vectors come from: each $\lambda^{*}_j$ is either $\lambda^{*}_j=0$ or $\lambda^{*}_j > 0$

  • Pick any support vector $x^{+} \in \mathcal{C}_{+}$ and $x^{-} \in \mathcal{C}_{-}$

  • Then we have

\[w^\top x^{+} + w_0 = +1\] \[w^\top x^{-} + w_0 = -1\]
  • Summing them, we have $w^\top(x^{+}+x^{-}) + 2w_0=0$, which means
\[w^{*}_0=-{(x^{+}+x^{-})^\top w^{*} \over2}\]
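As a final cross-check (not part of the lecture), the dual recipe above can be compared against an off-the-shelf solver; the sketch below assumes scikit-learn is available and approximates the hard margin with a large $C$.

```python
import numpy as np
from sklearn.svm import SVC

# Same hypothetical separable data as above: rows are x_j, labels +/-1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# A very large C approximates the hard-margin SVM of these notes.
clf = SVC(kernel="linear", C=1e10).fit(X, y)
print(clf.coef_, clf.intercept_)  # compare with w* and w0* from the dual
print(clf.support_)               # indices of the support vectors (lambda_j > 0)
```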