Machine Learning Class, Week 2
Regularization
Ridge Regularization
- Applies to both over- and under-determined systems.
- The loss function of ridge regression is \[\|A\theta-{\bf{y}}\|^2+\lambda\|\theta\|^2\]
- $\|\theta\|^2$ : Regularization function
- $\lambda$ : Regularization parameter
- The solution of ridge regression is obtained by setting the gradient of the loss to zero, which gives us $\hat{\theta}=(A^{\top}A+\lambda I)^{-1}A^{\top}{\bf{y}}$
Change in Eigen-values
Ridge regression improves the eigenvalues of the system, i.e., the conditioning of $A^{\top}A$.
- Eigen-decomposition of $A^{\top}A$: $A^{\top}A=USU^{\top}$, where $U=$ eigenvector matrix, $S=$ eigenvalue matrix.
- $S$ is a diagonal matrix with non-negative entries: $S=\text{diag}(s_1,\dots,s_d)$ with $s_j \geq 0$.
- Therefore, every diagonal entry of $S+\lambda I$ is strictly positive for any $\lambda > 0$, implying that $A^{\top}A+\lambda I$ is always invertible.
- If $\lambda \rightarrow 0$, then $\hat{\theta}=(A^{\top}A)^{-1}A^{\top}{\bf{y}}$ (the vanilla least-squares solution).
- If $\lambda \rightarrow \infty$, then $\hat{\theta} \rightarrow 0$.
- There is a trade-off curve between the two terms by varying $\lambda$.
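As a quick illustration, here is a minimal NumPy sketch of the closed-form ridge solution above; the matrix `A`, the vector `y`, and the value of `lam` are made-up placeholders.

```python
import numpy as np

def ridge_solution(A, y, lam):
    """Closed-form ridge estimate: (A^T A + lambda * I)^{-1} A^T y."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

# Toy example with made-up data.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))                      # 50 samples, 5 features
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = A @ theta_true + 0.1 * rng.standard_normal(50)

print(ridge_solution(A, y, lam=0.1))                  # close to theta_true for small lambda
```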
Comparing Vanilla and Ridge
Suppose ${\bf{y}}=A\theta^* + {\bf{e}}$ for some ground truth $\theta^*$ and noise vector ${\bf{e}}$. Then, the vanilla linear regression will give us
\[\hat{\theta}=(A^\top A)^{-1}A^\top {\bf{y}}\] \[=(A^\top A)^{-1}A^\top (A\theta^*+{\bf{e}})\] \[=\theta^*+(A^\top A)^{-1}A^{\top}{\bf{e}}\]If ${\bf{e}}$ has zero mean and variance $\sigma^2$, we can show that
\[\mathbb{E}[\hat{\theta}]=\theta^*\] \[\text{Cov}[\hat{\theta}]=\sigma^2(A^\top A)^{-1}\]Therefore, the regression coefficients are unbiased but have large variance. We can further show that the mean-squared error (MSE) is
\[\text{MSE}(\hat{\theta})=\sigma^2\text{Tr}\{ (A^\top A)^{-1} \}.\]On the other hand, if we use ridge regression, then
\[\hat{\theta}(\lambda)=(A^\top A + \lambda I)^{-1}A^\top (A\theta^*+{\bf{e}})\] \[=(A^\top A + \lambda I)^{-1}A^{\top}A\theta^*+(A^\top A + \lambda I)^{-1}A^\top{\bf{e}}\]Again, if ${\bf{e}}$ is zero mean and has a variance $\sigma^2$, then
\[\mathbb{E}[\hat{\theta}(\lambda)]=(A^\top A+\lambda I )^{-1}A^\top A\theta^*\] \[\text{Cov}[\hat{\theta}(\lambda)]=\sigma^2(A^\top A + \lambda I)^{-1}A^\top A(A^\top A+\lambda I)^{-\top}\] \[\text{MSE}[\hat{\theta}(\lambda)]=\sigma^2\text{Tr}\{ W_\lambda(A^\top A)^{-1}W^\top_\lambda \}+\theta^{*\top}(W_\lambda - I)^\top(W_\lambda -I)\theta^*\]where $W_\lambda=(A^\top A + \lambda I)^{-1}A^\top A$. In particular, one can show the following:
Theorem (Theobald 1974)
For $\lambda < 2\sigma^2\|\theta^*\|^{-2}$, it holds that $\text{MSE}(\hat{\theta}(\lambda)) < \text{MSE}(\hat{\theta})$.
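To see the theorem in action, the sketch below evaluates the two closed-form MSE expressions above for made-up $A$, $\theta^*$, $\sigma$, and a $\lambda$ chosen to satisfy the theorem's condition.

```python
import numpy as np

# Made-up problem; lam is picked so that lam < 2*sigma^2 / ||theta*||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
theta_star = np.array([1.0, -0.5, 0.3, 0.8, -0.2])
sigma, lam = 1.0, 0.5
assert lam < 2 * sigma**2 / np.sum(theta_star**2)

AtA = A.T @ A
AtA_inv = np.linalg.inv(AtA)
W = np.linalg.solve(AtA + lam * np.eye(5), AtA)       # W_lambda

mse_ols = sigma**2 * np.trace(AtA_inv)                # vanilla regression
bias = (W - np.eye(5)) @ theta_star
mse_ridge = sigma**2 * np.trace(W @ AtA_inv @ W.T) + bias @ bias

print(mse_ols, mse_ridge)                             # ridge MSE is smaller
```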
Choosing $\lambda$
Because the following three problems are equivalent
\[\theta^{*}_\lambda=\text{argmin}_\theta\|A\theta-{\bf{y}}\|^2+\lambda\|\theta\|^2\] \[\theta^{*}_\alpha=\text{argmin}_\theta\|A\theta-{\bf{y}}\|^2 \quad \text{subject to } \|\theta\|^2\leq\alpha\] \[\theta^{*}_\epsilon=\text{argmin}_\theta\|\theta\|^2 \quad \text{subject to } \|A\theta-{\bf{y}}\|^2 \leq \epsilon\]
We can seek $\lambda$ that satisfies $\|\theta\|^2 \leq \alpha$ :
- You know how much $\|\theta\|^2$ would be appropriate.
We can seek $\lambda$ that satisfies $\|A\theta -{\bf{y}}\|^2 \leq \epsilon$ :
- You know how much $\|A\theta-{\bf{y}}\|^2$ would be tolerable.
Other approaches:
- Akaike’s information criterion: Balance model fit with complexity
- Cross validation: Leave one out
- Generalized cross-validation: Cross-validation + weight
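As one way to illustrate the cross-validation route, here is a minimal leave-one-out sketch for picking $\lambda$; the data and the candidate grid `lambdas` are made up.

```python
import numpy as np

def ridge_solution(A, y, lam):
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

def loocv_error(A, y, lam):
    """Leave-one-out cross-validation error of ridge regression for a given lambda."""
    N = A.shape[0]
    errs = []
    for i in range(N):
        mask = np.arange(N) != i                      # hold out sample i
        theta = ridge_solution(A[mask], y[mask], lam)
        errs.append((y[i] - A[i] @ theta) ** 2)
    return np.mean(errs)

# Made-up data and candidate grid.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 6))
y = A @ rng.standard_normal(6) + 0.5 * rng.standard_normal(40)
lambdas = [0.01, 0.1, 1.0, 10.0]
print(min(lambdas, key=lambda lam: loocv_error(A, y, lam)))
```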
LASSO Regression
- An alternative to ridge regression is the Least Absolute Shrinkage and Selection Operator (LASSO).
- The loss function is \[\|A\theta-{\bf{y}}\|^2+\lambda\|\theta\|_1\]
- Intuition behind LASSO: Many features are not active.
Interpreting the LASSO Solution
\[\hat{\theta}=\text{argmin}_\theta\|A\theta-{\bf{y}}\|^2+\lambda\|\theta\|_1\]- $\|\theta\|_1$ promotes sparsity of $\theta$. It is the nearest convex approximation to $\|\theta\|_0$, which counts the number of non-zero entries.
Why are Sparse Models Useful?
- Images are sparse in transform domains, e.g., Fourier and wavelet.
- Intuition: There are more low-frequency components and fewer high-frequency components.
- Examples above: $A$ is the wavelet basis matrix. $\theta$ are the wavelet coefficients.
- We can truncate the wavelet coefficients and still retain a good image.
- Many image compression schemes are based on this, e.g., JPEG and JPEG2000.
LASSO for Image Reconstruction
Image inpainting via KSVD dictionary-learning
- ${\bf{y}}=$ image with missing pixels. $A=$ a matrix storing a set of trained feature vectors (called dictionary atoms). $\theta=$ coefficients.
- minimize $\|{\bf{y}}-A\theta\|^2+\lambda\|\theta\|_1$
- KSVD = k-means + Singular Value Decomposition (SVD): a method to train feature vectors that admit sparse representations.
Shrinkage Operator
The LASSO problem can be solved using a shrinkage operator. Consider a simplified problem (with $A=I$)
\[J(\theta)={1 \over 2} \|{\bf{y}}-\theta\|^2+\lambda\|\theta\|_1\] \[=\sum^d_{j=1} \left({1 \over 2}(y_j-\theta_j)^2+\lambda|\theta_j| \right)\]Since the loss is separable, the optimization is solved when each individual term is minimized. The individual problem is
\[\hat{\theta}=\text{argmin}_\theta \left\{ {1 \over 2}(y-\theta)^2+\lambda|\theta| \right\}\] \[=\text{max}(|y|-\lambda, 0)\,\text{sign}(y)\] \[=S_\lambda(y)\]
Shrinkage vs Hard Threshold
- The shrinkage operator looks as follows.
- Any number in $[-\lambda, \lambda]$ is "shrunk" to zero.
- Compare with the hard-threshold operator $H_\lambda(y)=y\cdot \mathbb{1}\{ |y|\geq \lambda \}$.
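A minimal sketch of the two operators, assuming NumPy array inputs:

```python
import numpy as np

def soft_threshold(y, lam):
    """Shrinkage (soft-threshold) operator S_lambda(y)."""
    return np.maximum(np.abs(y) - lam, 0.0) * np.sign(y)

def hard_threshold(y, lam):
    """Hard-threshold operator H_lambda(y) = y * 1{|y| >= lambda}."""
    return y * (np.abs(y) >= lam)

y = np.array([-2.0, -0.5, 0.3, 1.5])
print(soft_threshold(y, 1.0))   # [-1.  -0.   0.   0.5]
print(hard_threshold(y, 1.0))   # [-2.  -0.   0.   1.5]
```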
Algorithms to Solve LASSO Regression
In general, the LASSO problem requires iterative algorithms:
ISTA Algorithm (Daubechies et al. 2004)
- For $k=1,2,…,$
- $v^k=\theta^k-2\gamma A^\top(A\theta^k-{\bf{y}})$
- $\theta^{k+1}=\text{max}(|v^k|-\lambda, 0)\text{sign}(v^k)$
FISTA Algorithm (Beck-Teboulle 2008)
- For $k=1,2,…,$
- $v^k=\theta^k-2\gamma A^\top(A\theta^k-{\bf{y}})$
- $z^k=\text{max}(|v^k|-\lambda, 0)\text{sign}(v^k)$
- $\theta^{k+1}=\alpha_k\theta^k+(1-\alpha_k)z^k$
ADMM Algorithm (Eckstein-Bertsekas 1992, Boyd et al. 2011)
- For $k=1,2,…,$
- $\theta^{k+1}=(A^\top A+\rho I)^{-1}(A^{\top}y+\rho z^k-u^k)$
- $z^{k+1}=\text{max}(|\theta^{k+1}+u^k/\rho|-\lambda / \rho, 0) \text{sign}(\theta^{k+1}+u^k/\rho)$
- $u^{k+1}=u^k+\rho(\theta^{k+1}-z^{k+1})$
And many others.
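For concreteness, here is a minimal ISTA sketch on a made-up sparse-recovery problem. The shrinkage threshold is scaled by the step size $\gamma$, the usual choice when the step size is written out explicitly.

```python
import numpy as np

def soft_threshold(v, t):
    return np.maximum(np.abs(v) - t, 0.0) * np.sign(v)

def ista(A, y, lam, n_iter=500):
    """Minimize ||A theta - y||^2 + lam * ||theta||_1 with ISTA."""
    gamma = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)     # 1 / Lipschitz constant of the gradient
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        v = theta - 2 * gamma * A.T @ (A @ theta - y) # gradient step
        theta = soft_threshold(v, gamma * lam)        # shrinkage step
    return theta

# Made-up sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((80, 20))
theta_true = np.zeros(20)
theta_true[[2, 7, 15]] = [3.0, -2.0, 1.5]
y = A @ theta_true + 0.05 * rng.standard_normal(80)
print(np.round(ista(A, y, lam=0.5), 2))               # mostly zeros; non-zeros at indices 2, 7, 15
```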
Example: Crime Rate Data
city | funding | not-hs | college | college4 | crime rate |
---|---|---|---|---|---|
1 | 40 | 74 | 11 | 20 | 478 |
2 | 32 | 72 | 11 | 18 | 494 |
3 | 57 | 70 | 18 | 16 | 643 |
4 | 31 | 71 | 11 | 19 | 341 |
5 | 67 | 72 | 9 | 24 | 773 |
… | … | … | … | … | … |
50 | 66 | 67 | 26 | 16 | 940 |
Consider the following two optimizations
\[\hat{\theta}_1=\text{argmin}_\theta J_1(\theta)=\|A\theta-{\bf{y}}\|^2+\lambda \|\theta\|_1\] \[\hat{\theta}_2=\text{argmin}_\theta J_2(\theta)=\|A\theta-{\bf{y}}\|^2+\lambda \|\theta\|_2\]
Comparison between L-1 and L-2 norm
- Plot $\hat{\theta}_1(\lambda)$ and $\hat{\theta}_2(\lambda)$ vs. $\lambda$.
- LASSO tells us which factor appears first.
- If we are allowed to use only one feature, then % high is the one.
- Two features, then % high + funding.
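A small sketch of this comparison, using scikit-learn's `Lasso` and `Ridge` as stand-ins and made-up data in place of the crime-rate table:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: 4 features, only the first two truly matter.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
y = A @ np.array([2.0, 5.0, 0.0, 0.0]) + rng.standard_normal(50)

for lam in [0.01, 0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=lam).fit(A, y)
    ridge = Ridge(alpha=lam).fit(A, y)
    print(lam, np.round(lasso.coef_, 2), np.round(ridge.coef_, 2))
# LASSO coefficients hit exactly zero one by one as lambda grows,
# while ridge coefficients shrink smoothly but stay dense.
```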
Pros and Cons
Ridge regression
- (+) Analytic solution, because the loss function is differentiable.
- (+) As such, a lot of well-established theoretical guarantees.
- (+) Algorithm is simple, just one equation.
- (-) Limited interpretability, since the solution is usually a dense vector.
- (-) Does not reflect the nature of certain problems, e.g., sparsity
LASSO
- (+) Proven applications in many domains, e.g., images and speech.
- (+) Echoes particularly well in modern deep learning where parameter space is huge.
- (+) Increasing number of theoretical guarantees for special matrices.
- (+) Algorithms are available.
- (-) No closed-form solution. Algorithms are iterative.
Regularization
Weight Decay
- L2 Regularization: the gradient step combines the learning rate, the gradient of the loss, and the gradient of the L2 penalty, i.e., $\theta \leftarrow \theta - \eta(\nabla_\theta L(\theta) + \lambda\theta)$ for the penalty ${\lambda \over 2}\|\theta\|^2$.
- Penalizes large weights
- Improves generalization
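A minimal sketch of one gradient step with weight decay, assuming the penalty ${\lambda \over 2}\|\theta\|^2$; all values are made up.

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad_loss, lr=0.1, weight_decay=1e-2):
    """theta <- theta - lr * (gradient of loss + gradient of L2 penalty)."""
    return theta - lr * (grad_loss + weight_decay * theta)

theta = np.array([1.0, -2.0, 0.5])
grad_loss = np.array([0.3, -0.1, 0.2])    # gradient of the data loss at theta
print(sgd_step_with_weight_decay(theta, grad_loss))
```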
Early stopping
- A simple form of regularization: monitor the validation error during training and stop when it starts to increase.
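A minimal sketch of the stopping rule, applied to a made-up validation-loss curve:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: `patience` epochs after the best one."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Made-up curve: improves, then starts to overfit after epoch 3.
val_losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.58, 0.61, 0.65, 0.70, 0.76]
print(early_stopping_epoch(val_losses))   # stops at epoch 6, three epochs after the best
```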
Bagging and Ensemble Methods
- Train multiple models and average their results.
- E.g., use a different algorithm for optimization or change the objective function / loss function.
- If errors are uncorrelated, the expected combined error will decrease linearly with the ensemble size.
- Bagging: uses $k$ different datasets, each obtained by sampling the training set with replacement (bootstrap).
Dropout
- Disable a random set of neurons (typically 50%).
- Using half the network = half capacity
- Redundant representations
- Base your scores on more features
- Consider it as a model ensemble.
- Two models in one.
- Training a large ensemble of models, each on a different set of data (mini-batch) and with SHARED parameters $\rightarrow$ reduces co-adaptation between neurons.
Dropout: Test Time
- All neurons are "turned on" - no dropout $\rightarrow$ conditions at train and test time are not the same.
- Test : $z=(\theta_1x_1+\theta_2 x_2) \cdot p$, where $p$ is the probability of keeping a neuron ($p=0.5$ for 50% dropout).
- Train : each input is kept with probability $p$, so in expectation $\mathbb{E}[z]=p\,(\theta_1x_1+\theta_2 x_2)$; scaling by $p$ at test time matches this expectation (see the sketch below).
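A minimal sketch of the commonly used "inverted dropout" variant, which rescales by $p$ at train time so that nothing changes at test time; `p` is the keep probability.

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Randomly zero activations at train time and rescale by 1/p; identity at test time."""
    if not train:
        return x
    mask = (np.random.rand(*x.shape) < p) / p   # keep with probability p, then rescale
    return x * mask

x = np.ones((2, 4))
print(dropout_forward(x, p=0.5, train=True))    # entries are either 0 or 2; expectation is 1
print(dropout_forward(x, train=False))          # unchanged
```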
Dropout: Verdict
- Efficient bagging method with parameter sharing.
- Try it!
- Dropout reduces the effective capacity of a model $\rightarrow$ larger models, more training time.
Batch Normalization
- Wish: Unit Gaussian activations (in our example)
- Solution: normalize each feature dimension to zero mean and unit variance.
- All we want is that our activations do not die out.
- In each dimension of the features, you have a unit Gaussian (in our example).
- For NN in general $\rightarrow$ BN normalizes the mean and variance of the inputs to your activation functions.
- A layer to be applied after Fully Connected layers and before non-linear activation functions.
- Normalize: \[\hat{x}^{(k)}={x^{(k)}-E[x^{(k)}] \over \sqrt{Var[x^{(k)}]}}\]
The network can learn to undo the normalization with a scale and shift, $y^{(k)}=\gamma^{(k)}\hat{x}^{(k)}+\beta^{(k)}$. If it learns
\[\gamma^{(k)}=\sqrt{Var[x^{(k)}]}\] \[\beta^{(k)}=E[x^{(k)}]\]it recovers the original activations.
- Allow the network to change the range.
- OK to treat dimensions separately? Shown empirically that even if features are correlated, convergence is still faster with this method.
- You can set all biases of the layers before BN to zero, because they will be cancelled out by BN anyway.
BN: Train vs Test
Training : Compute mean and variance from mini-batch 1,2,3…
Testing : Compute mean and variance by running an exponentially weighted average across training mini-batches. Use them as $\sigma^2_{test}$ and $\mu_{test}$
\[Var_{running}=\beta_{m}*Var_{running}+(1-\beta_m)*Var_{minibatch}\] \[\mu_{running}=\beta_{m}*\mu_{running} + (1-\beta_m)*\mu_{minibatch}\]$\beta_m$ : momentum (hyperparameter)
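A minimal sketch of a batch-norm layer over a `(batch, features)` array, keeping running statistics as above; $\gamma$, $\beta$, and the momentum value are placeholders.

```python
import numpy as np

class BatchNorm:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)   # learnable scale and shift
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, train=True):
        if train:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # exponentially weighted running averages, used at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)           # normalize per dimension
        return self.gamma * x_hat + self.beta                  # scale and shift

bn = BatchNorm(dim=3)
x = np.random.randn(8, 3) * 5 + 2                              # mini-batch with non-unit statistics
out = bn.forward(x, train=True)
print(out.mean(axis=0).round(3), out.var(axis=0).round(3))     # approximately 0 and 1 per dimension
```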
BN: What do you get?
- Very deep nets are much easier to train $\rightarrow$ more stable gradients.
- A much larger range of hyperparameters works similarly when using BN.
Other Normalizations
Data Augmentation
- A classifier has to be invariant to a wide variety of transformations.
- Helping the classifier: synthesize data simulating plausible transformations.
Data Augmentation: Brightness
- Random brightness and contrast changes
Data Augmentation: Random Crops
- Training: random crops
- Pick a random L in [256, 480]
- Resize training image, short side L
- Randomly sample crops of 224x224
- Testing: fixed set of crops
- Resize image at N scales
- 10 fixed crops of 224x224: (4 corners + 1 center) x 2 flips
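A minimal sketch of the training-time recipe above, with the resize step stubbed out and a blank array standing in for the image:

```python
import numpy as np

def random_crop(img, crop=224):
    """Sample a random crop x crop patch from an (H, W, C) image array."""
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop]

L = np.random.randint(256, 481)               # random short side in [256, 480]
img = np.zeros((L, L, 3), dtype=np.uint8)     # stand-in for a resized training image
print(random_crop(img).shape)                 # (224, 224, 3)
```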
Data Augmentation
- When comparing two networks, make sure to use the same data augmentation!
- Consider data augmentation a part of your network design.
Kernel Method
Why Another Method?
- Linear regression: Pick a global model, best fit globally.
- Kernel method: Pick a local model, best fit locally.
- In kernel method, instead of picking a line / a quadratic equation, we pick a kernel.
- A kernel is a measure of distance between training samples.
- Kernel method buys us the ability to handle nonlinearity.
- Ordinary regression is based on the columns (features) of A.
- Kernel method is based on the rows (samples) of A.
Pictorial Illustration
- Goal : learn the surface.
- Prediction : when a new sample comes, interpolate on the surface.
Overview of the Method
Model Parameter
- We want the model parameter $\hat{\theta}$ to look like \[\hat{\theta}=\sum^N_{n=1}\alpha_nx^n\]
- This model expresses $\hat{\theta}$ as a combination of the samples.
- The trainable parameters are $\alpha_n$, where $n=1,…,N$.
- If we can make $\alpha_n$ local, i.e., non-zero for only a few of them, then we can achieve our goal: localized, sample-dependent.
Predicted Value
- The predicted value of a new sample $x$ is \[\hat{y}=\hat{\theta}^{\top}x=\sum^N_{n=1}\alpha_n\langle x, x^n\rangle\]
Dual Form of Linear Regression
Goal : Address Question 1: Express $\hat{\theta}$ as
\[\hat{\theta}=\sum^N_{n=1}\alpha_nx^n\]We start by listing out a technical lemma:
Lemma
For any $A \in \mathbb{R}^{N\times d}$, ${\bf{y}}\in \mathbb{R}^N$, and $\lambda > 0$,
\[(A^{\top}A+\lambda I)^{-1}A^{\top}y=A^\top(AA^{\top}+\lambda I)^{-1}y\]Remark:
- The dimensions of $I$ on the left is $d \times d$, on the right is $N \times N$.
- If $\lambda=0$, then the above is true only when $A$ is invertible.
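A quick numerical sanity check of the Lemma with made-up $A$, ${\bf{y}}$, and $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 8, 3, 0.7
A = rng.standard_normal((N, d))
y = rng.standard_normal(N)

lhs = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)    # (A^T A + lam I)^{-1} A^T y
rhs = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(N), y)    # A^T (A A^T + lam I)^{-1} y
print(np.allclose(lhs, rhs))                                 # True
```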
Dual Form of Linear Regression
- Using the Lemma, we can show that
Primal Form
\[\hat{\theta}=(A^\top A+\lambda I)^{-1}A^{\top}y\]Dual Form
\[\hat{\theta}=A^{\top}(AA^{\top}+\lambda I)^{-1}y\]For this problem, the primal and dual forms give the same $\hat{\theta}$ (by the Lemma); in general, the optimal value of a dual problem provides a lower bound on that of the primal.
\[\hat{\theta}=A^{\top}\alpha=\begin{bmatrix} -(x^1)^{\top}- \\ -(x^2)^{\top}- \\ ... \\ -(x^N)^{\top}-\end{bmatrix}^{\top} \begin{bmatrix} \alpha_1 \\ ... \\ \alpha_N \end{bmatrix}=\sum^N_{n=1}\alpha_nx^n\]where \[\alpha_n=[(AA^\top+\lambda I)^{-1}{\bf{y}}]_n\]
The Kernel Trick
Goal: Addresses Question 2: Introduce nonlinearity to
\[\hat{y}=\hat{\theta}^\top x = \sum^N_{n=1}\alpha_n\langle x, x^n\rangle\]The Idea:
- Replace the inner product $\langle x, x^n \rangle$ by $k(x, x^n)$:
- $k(\cdot, \cdot)$ is called a kernel.
- A kernel is a measure of the distance between two samples $x^i$ and $x^j$.
- $\langle x^i, x^j \rangle$ measures distance in the ambient space; $k(x^i, x^j)$ measures distance in a transformed space.
- In particular, a valid kernel takes the form $k(x^i, x^j)=\langle \phi(x^i), \phi(x^j) \rangle$ for some nonlinear transform $\phi$.
Kernels Illustrated
- A kernel typically lifts the ambient dimension to a higher one.
- For example, mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$
Relationship between Kernel and Transform
Consider the following kernel $k(u,v)=(u^\top v)^2$. What is the transform?
- Suppose $u$ and $v$ are in $\mathbb{R}^2$. Then \[(u^\top v)^2 = (u_1v_1+u_2v_2)^2=u_1^2v_1^2+2u_1v_1u_2v_2+u_2^2v_2^2\]
- So if we define $\phi$ as \[\phi(u)=[u_1^2,\ \sqrt{2}u_1u_2,\ u_2^2]^\top\]
then $(u^\top v)^2=\langle \phi(u),\phi(v) \rangle$.
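A quick numerical check that this kernel equals the inner product of the lifted features:

```python
import numpy as np

def phi(u):
    """Lifting map for the degree-2 kernel: [u1^2, sqrt(2)*u1*u2, u2^2]."""
    return np.array([u[0]**2, np.sqrt(2) * u[0] * u[1], u[1]**2])

u, v = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print((u @ v) ** 2, phi(u) @ phi(v))   # both equal 30.25
```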
Radial Basis Function
A useful kernel is the radial basis function (RBF) kernel:
\[k(u,v)=\text{exp}\{-{\|u-v\|^2\over 2\sigma^2}\}\]
- The corresponding nonlinear transform of the RBF is infinite dimensional. See Appendix.
- $\|u-v\|^2$ measures the distance between two data points $u$ and $v$.
- $\sigma$ is the standard deviation, defining what counts as "far" and "close".
- The RBF enforces local structure; only a few samples are used.
Kernel Method
Given the choice of the kernel function, we can write down the algorithm as follows.
- Pick a kernel function $k(\cdot, \cdot)$.
- Construct a kernel matrix $K \in \mathbb{R}^{N \times N}$, where $[K]_{ij}=k(x^i, x^j)$ for $i=1,…,N$, $j=1,…,N$.
- Compute the coefficients $\alpha \in \mathbb{R}^N$, with \[\alpha_n=[(K+\lambda I)^{-1}{\bf{y}}]_n\]
- Estimate the predicted value for a new sample $x$: \[\hat{y}=\sum^N_{n=1}\alpha_n k(x, x^n)\]
Therefore, the choice of the regression is shifted to the choice of the kernel.
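A minimal end-to-end sketch of this procedure with an RBF kernel on a 1D toy problem; `sigma`, `lam`, and the data are made up.

```python
import numpy as np

def rbf(u, v, sigma=0.5):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def fit_kernel_regression(X, y, lam=0.1, sigma=0.5):
    """Compute alpha = (K + lam I)^{-1} y for the RBF kernel matrix K."""
    N = len(X)
    K = np.array([[rbf(X[i], X[j], sigma) for j in range(N)] for i in range(N)])
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict(x_new, X, alpha, sigma=0.5):
    """Predicted value: sum_n alpha_n * k(x_new, x^n)."""
    return sum(a * rbf(x_new, x, sigma) for a, x in zip(alpha, X))

# Toy data: noisy sine curve, with time stamps as the input feature.
t = np.linspace(0, 2 * np.pi, 30)
y = np.sin(t) + 0.1 * np.random.randn(30)
X = t.reshape(-1, 1)
alpha = fit_kernel_regression(X, y)
print(predict(np.array([np.pi / 2]), X, alpha))   # roughly sin(pi/2) = 1
```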
Example
Goal: Use the kernel method to fit the data points shown as follows.
- What is the input feature vector $x^n$? $x^n=t_n$: The time stamps
- What is the output $y_n$? $y_n$ is the height.
- Which kernel to choose? Let us consider RBF.
Example (using RBF)
- Define the fitted function as $g_\theta(t)$. [Here, $\theta$ refers to $\alpha$.]
- The RBF is defined as $k(t_i, t_j)=\text{exp}\{-(t_i-t_j)^2/2\sigma^2\}$, for some $\sigma$.
- The matrix $K$ looks like the one shown below.
- $K$ is a banded diagonal matrix.
- The coefficient vector is $\alpha_n=[(K+\lambda I)^{-1}y]_n$
- Using the RBF, the predicted value is given by \[\hat{y}(t)=\sum^N_{n=1}\alpha_n\text{exp}\{-(t-t_n)^2/2\sigma^2\}\]
- Pictorially, the predicted function $g_\theta$ can be viewed as the linear combination of the Gaussian kernels.
Effect of $\sigma$
- Large $\sigma$: Flat kernel. Over-smoothing
- Small $\sigma$: Narrow kernel. Under-smoothing
- Below shows an example of the fitting and the kernel matrix $K$.
Any Improvement?
- We can improve the above kernel by considering $x^n=[y_n, t_n]^\top$
- Define the kernel as
- This new kernel is adaptive (edge-aware).
- This idea is known as bilateral filter in the computer vision literature.
- Can be further extended to 2D image where $x^n=[y_n, s_n]$, for some spatial coordinate $s_n$.
Kernel Methods in Classification
- The concept of lifting the data to higher dimension is useful for classification.
Kernels in Support Vector Machines
Example. RBF for SVM
- Radial Basis Function is often used in support vector machine.
- A poor choice of parameters can lead to low training loss, but with the risk of overfitting.
- A slightly under-fitted model can sometimes give better generalization.