(Week1) Logistic and Linear Regression

 

Machine Learning Class, Week 1


Logistic and Linear Regression

Basic Notation

  • Scalar: $a, b, c \in \mathbb{R}$
  • Vector: $\bf{a}, \bf{b}, \bf{c} \in \mathbb{R}^d$
  • Rows and Columns
  • ${\bf{a}}_j$: the $j$-th feature. ${\bf{x}}^n$: the $n$-th sample
  • Identity Matrix: $I$
  • All-one vector $1$ and all-zero vector $0$
  • Standard basis ${\bf {e}}_i$

Linear Regression

The problem of regression can be summarized as:

  • Given measurements: $y^n$, where $n=1,\dots,N$
  • Given inputs: $x^n$
  • Given a model: $g_\theta(x^n)$ parameterized by $\theta$

Linear regression is one type of regression:

  • Restrict $g_\theta(\cdot)$ to a line: \(g_\theta(x) = x^\top \theta\)

  • The inputs $x$ and the parameters $\theta$ are ${\bf{x}}=[x_1,…,x_d]^{\top}$ and $\theta=[\theta_1, …, \theta_d]^{\top}$

  • This is equivalent to $g_\theta({\bf{x}})={\bf{x}}^\top\theta=\sum^d_{j=1}{\bf{x}}_j\theta_j$

Solving the Regression Problem

The (general) regression can be solved via the following logic:

  • Define a (Squared-Error) Loss Function
\[J(\theta)=\sum^N_{n=1}\mathcal{L}(g_\theta({\bf{x}}^n), y^n)=\sum^N_{n=1}(g_\theta({\bf{x}}^n)-y^n)^2\]
  • Other loss functions can be used.
  • The goal is to solve an optimization
\[\hat{\theta}=\text{argmin}_\theta J(\theta)\]
  • The prediction for a new input ${\bf{x}}^\text{new}$ is $y^\text{new}=\hat{\theta}^{\top}{\bf{x}}^\text{new}$ (a small numerical sketch of this recipe follows the list)
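
As a concrete illustration of the recipe above, here is a minimal NumPy sketch that minimizes the squared-error loss by plain gradient descent. The synthetic data, step size, and iteration count are arbitrary choices for illustration, not part of the lecture.

```python
import numpy as np

# Synthetic data (illustrative): y^n = (x^n)^T theta_true + noise
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))                 # rows are the inputs x^n
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

def grad(theta):
    """Gradient of J(theta) = sum_n (x^n . theta - y^n)^2."""
    return 2 * X.T @ (X @ theta - y)

# Plain gradient descent on J(theta)
theta = np.zeros(d)
step = 1e-3
for _ in range(2000):
    theta -= step * grad(theta)

print(theta)           # close to theta_true
x_new = rng.normal(size=d)
print(x_new @ theta)   # prediction y^new = theta_hat^T x^new
```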

Linear Regression Solution

The linear regression problem is a special case which we can solve analytically.

  • Restrict $g_\theta(\cdot)$ to a line: \(g_\theta({\bf{x}}^n)=\theta^{\top}{\bf{x}}^n\)

  • Then the loss function becomes

\[J(\theta)=\sum^N_{n=1}(\theta^{\top}{\bf{x}}^n-y^n)^2=||A\theta-{\bf{y}}||^2\]
  • The matrix and vectors are defined as follows:
\[A = \begin{bmatrix} -(x^1)^\top- \\ -(x^2)^\top- \\ \vdots \\ -(x^N)^\top- \\ \end{bmatrix}, \quad \theta=\begin{bmatrix}\theta_1\\ \vdots\\ \theta_d \end{bmatrix}, \quad {\bf{y}}=\begin{bmatrix}y^1\\ \vdots\\ y^N \end{bmatrix}\]

Theorem

For a linear regression problem

\[\hat{\theta}=\text{argmin}_\theta J(\theta), \qquad J(\theta)=||A\theta-y||^2,\]

the minimizer is

\[\hat{\theta}=(A^{\top}A)^{-1}A^{\top}y\]
  • Taking the derivative and setting it to zero:
\[\nabla_\theta J(\theta)=\nabla_\theta{||A\theta-y||^2}=2A^{\top}(A\theta-y)=0\]
  • So the solution is $\hat{\theta}=(A^\top A)^{-1}A^{\top}y$, assuming $A^{\top}A$ is invertible (a NumPy sketch of this closed-form solution follows).
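
The closed-form solution can be sanity-checked numerically. Below is a minimal NumPy sketch with made-up data; in practice `np.linalg.solve` or `np.linalg.lstsq` is preferable to forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 4
A = rng.normal(size=(N, d))                   # design matrix, rows are (x^n)^T
theta_true = rng.normal(size=d)
y = A @ theta_true + 0.05 * rng.normal(size=N)

# Closed-form least-squares solution: theta_hat = (A^T A)^{-1} A^T y
theta_hat = np.linalg.inv(A.T @ A) @ A.T @ y

# Numerically safer equivalents
theta_solve = np.linalg.solve(A.T @ A, A.T @ y)
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(theta_hat, theta_solve))    # True
print(np.allclose(theta_hat, theta_lstsq))    # True
print(theta_hat)                              # close to theta_true
```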

Example on Cases

Example 1. Second-order polynomial fitting

\[y_n=ax^2_n+bx_n+c\] \[A=\begin{bmatrix} x^2_1 & x_1 & 1 \\ \vdots & \vdots & \vdots \\ x^2_N & x_N & 1\end{bmatrix}, \quad y=\begin{bmatrix}y_1\\ \vdots\\y_N\end{bmatrix}, \quad \theta=\begin{bmatrix}a\\b\\c\end{bmatrix}\]
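
A small NumPy sketch of Example 1 with made-up data (true coefficients $a=2$, $b=-1$, $c=0.5$) might look like this; the design matrix is built column by column exactly as above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=40)
y = 2.0 * x**2 - 1.0 * x + 0.5 + 0.2 * rng.normal(size=x.size)

# Design matrix A = [x^2, x, 1]; parameter vector theta = [a, b, c]
A = np.column_stack([x**2, x, np.ones_like(x)])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta_hat)   # approximately [2.0, -1.0, 0.5]
```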

Example 2. Auto-Regression

\[y_n=ay_{n-1}+by_{n-2}\] \[A=\begin{bmatrix}y_2 & y_1 \\ y_3 & y_2 \\ ... & ... \\ y_{N-1} & y_{N-2} \end{bmatrix}, y=\begin{bmatrix}y_3 \\ y_4 \\ ... \\ y_N \end{bmatrix}, \theta=\begin{bmatrix} a \\ b\end{bmatrix}\]
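
Example 2 can be set up the same way. The sketch below generates a made-up AR(2) sequence (coefficients $a=0.6$, $b=0.3$, chosen only for illustration) and recovers them by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
y = np.zeros(N)
y[0], y[1] = 1.0, 0.5
for n in range(2, N):                         # simulate a stable AR(2) process
    y[n] = 0.6 * y[n - 1] + 0.3 * y[n - 2] + 0.05 * rng.normal()

# Rows of A are [y_{n-1}, y_{n-2}], targets are y_n for n = 3, ..., N (1-based)
A = np.column_stack([y[1:N - 1], y[0:N - 2]])
target = y[2:N]
theta_hat, *_ = np.linalg.lstsq(A, target, rcond=None)
print(theta_hat)   # approximately [0.6, 0.3]
```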

Linear Spans

Given a set of vectors $\{ {\bf{a}}_1, \dots, {\bf{a}}_d \}$, the span is the set of all possible linear combinations of these vectors.

\[\text{span} \{ {\bf{a}}_1, \dots, {\bf{a}}_d \} = \Big\{ z \,\Big|\, z=\sum^d_{j=1}\alpha_j{\bf{a}}_j,\ \alpha_j \in \mathbb{R} \Big\}\]

Q. Which of the following sets of vectors can span $\mathbb{R}^3$?
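
One way to answer such a question numerically: a set of vectors spans $\mathbb{R}^3$ exactly when the matrix formed from them has rank 3. The vectors below are illustrative examples only, not the original answer choices (which are not reproduced here).

```python
import numpy as np

# Illustrative candidate sets (rows are vectors); not the original answer choices
set_a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])   # standard basis
set_b = np.array([[1, 2, 3], [2, 4, 6], [0, 0, 1]])   # first two rows are parallel

for name, vecs in [("set_a", set_a), ("set_b", set_b)]:
    rank = np.linalg.matrix_rank(vecs)
    print(name, "rank =", rank, "spans R^3:", rank == 3)   # set_a: True, set_b: False
```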

Geometry of Linear Regression

Given $\theta$, the product $A\theta$ can be viewed as

\[A\theta=\begin{bmatrix} {\bf{a}_1} & {\bf{a}_2} & ... & {\bf{a}_d}\end{bmatrix} \begin{bmatrix} \theta_1 \\ ...\\ \theta_d \end{bmatrix} = \sum^d_{j=1} \theta_j{\bf{a}}_j\]

So the set of all possible $A\theta$ is exactly $\text{span}\{ {\bf{a}}_1, \dots, {\bf{a}}_d \}$.

Define the range of $A$ as $\mathcal{R}(A)=\{ \hat{y} \mid \hat{y} = A\theta \text{ for some } \theta \}$.

Orthogonality Principle

  • Consider the error ${\bf{e}}={\bf{y}}-A\theta$
  • For the error to be minimized, it must be orthogonal to $\mathcal{R}(A)$, which is the span of the columns of $A$.
  • This orthogonality principle means that ${\bf{a}}^{\top}_j{\bf{e}}=0$ for all $j=1,…,d$, which implies $A^\top{\bf{e}}=0$.

Normal Equation

  • The orthogonality principle, which states that $A^\top {\bf{e}}=0$, implies that $A^\top({\bf{y}}-A\theta)=0$ by substituting ${\bf{e}}={\bf{y}}-A\theta$.

  • This is called the normal equation:

\[A^\top A\theta=A^\top y.\]
  • The predicted value is
\[\hat{\bf{y}}=A\hat{\theta}=A(A^\top A)^{-1}A^{\top}{\bf{y}}\]
  • The matrix $P=A(A^\top A)^{-1}A^{\top}$ is a projection onto the span of $\{ {\bf{a}}_1, {\bf{a}}_2, \dots, {\bf{a}}_d \}$, i.e., the range of $A$.

  • $P$ is called the projection matrix. It holds that $PP=P$ (verified numerically in the sketch after this list).

  • The error ${\bf{e}}={\bf{y}}-\hat{\bf{y}}$ is

\[{\bf{e}}={\bf{y}}-A(A^\top A)^{-1}A^{\top}{\bf{y}}=(I-P){\bf{y}}\]
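
Here is a minimal NumPy check, with made-up data, of the properties stated in this list: $PP=P$, the prediction $\hat{\bf{y}}=P{\bf{y}}$ agrees with least squares, and the error is orthogonal to the columns of $A$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 4
A = rng.normal(size=(N, d))                 # tall matrix with full column rank
y = rng.normal(size=N)

P = A @ np.linalg.inv(A.T @ A) @ A.T        # projection onto the range of A
y_hat = P @ y
e = y - y_hat                               # e = (I - P) y

print(np.allclose(P @ P, P))                # True: P is idempotent
print(np.allclose(A.T @ e, 0))              # True: orthogonality principle A^T e = 0
print(np.allclose(y_hat, A @ np.linalg.lstsq(A, y, rcond=None)[0]))  # True
```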

Over-determined and Under-determined Systems

  • Assume $A$ has full column rank.
  • Over-determined $A$: Tall and skinny. $\hat\theta=(A^\top A)^{-1}A^{\top}{\bf{y}}$

  • Under-determined $A$: Fat and short. $\hat\theta=A^{\top}(AA^{\top})^{-1}{\bf{y}}$

  • If $A$ is under-determined, then there exists a non-trivial nullspace
\[\mathcal{N}(A)=\{ \theta|A\theta=0 \}\]
  • This implies that if $\hat{\theta}$ is a solution, then $\hat{\theta}+\theta_0$ is also a solution as long as $\theta_0 \in \mathcal{N}(A)$; the sketch below illustrates this.
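
The non-uniqueness in the last bullet can be demonstrated numerically. The sketch below uses a made-up fat matrix, takes a nullspace vector from the SVD, and checks that adding it to a solution still satisfies $A\theta=y$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 3, 6                                   # fat A: under-determined system
A = rng.normal(size=(N, d))
y = rng.normal(size=N)

theta_hat = A.T @ np.linalg.inv(A @ A.T) @ y  # one particular (minimum-norm) solution

# Right singular vectors beyond the rank span the nullspace of A
U, s, Vt = np.linalg.svd(A)
theta_0 = Vt[-1]                              # satisfies A @ theta_0 ~ 0

print(np.allclose(A @ theta_hat, y))              # True
print(np.allclose(A @ theta_0, 0))                # True: theta_0 is in N(A)
print(np.allclose(A @ (theta_hat + theta_0), y))  # True: still a solution
```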

What is Lagrange Multiplier

We now take a look at another way of optimizing a function subject to given constraint(s). The constraint(s) may be the equation(s) that describe the boundary of a region, although we won't concentrate on those types of problems here, since the method only requires a general constraint and doesn't really care where the constraint came from.

  • So, let's get things set up. We want to optimize (i.e. find the minimum and maximum values of) a function $f(x,y,z)$ subject to the constraint $g(x, y, z)=k$. Again, the constraint may be the equation that describes the boundary of a region, or it may not be. The process is actually fairly simple, although the work can still be a little overwhelming at times.

Method of Lagrange Multipliers

  1. Solve the following system of equations:
\[\nabla f(x, y, z)=\lambda \nabla g(x,y,z)\] \[g(x,y,z)=k\]
  2. Plug all solutions $(x,y,z)$ from the first step into $f(x,y,z)$ and identify the minimum and maximum values, provided they exist and $\nabla g \neq \vec{0}$ at the point.

The constant $\lambda$ is called the Lagrange Multiplier.

Example 1. Find the maximum and minimum of $f(x,y)=5x-3y$ subject to the constraint $x^2+y^2=136$.

\[\mathcal{L}=f(x,y)+\lambda(k-g(x,y))\] \[\mathcal{L}=5x-3y+\lambda(136-x^2-y^2)\] \[\mathcal{L}_x=5+\lambda(-2x)=0\] \[\mathcal{L}_y=-3+\lambda(-2y)=0\] \[x={5 \over 2\lambda}, \quad y=-{3 \over 2\lambda}\] \[x^2+y^2={34 \over 4\lambda^2}=136\] \[\lambda=\pm{1\over4}\] \[f(x,y)=5x-3y={34\over 2\lambda}\] \[\lambda={1\over4}: \ \text{max } f(x,y)=68 \text{ at } (10,-6)\] \[\lambda=-{1\over4}: \ \text{min } f(x,y)=-68 \text{ at } (-10,6)\]
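
The same system can be solved symbolically. A small SymPy sketch using the $f$ and $g$ of this example follows.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = 5*x - 3*y
g = x**2 + y**2 - 136      # constraint written as g(x, y) = 0

# Stationarity grad f = lam * grad g, together with the constraint
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
for sol in sp.solve(eqs, [x, y, lam], dict=True):
    print(sol, '->  f =', f.subs(sol))   # f = 68 at (10, -6), f = -68 at (-10, 6)
```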

Example 2. Find the maximum and minimum values of $f(x,y)=4x^2+10y^2$ on the disk $x^2+y^2 \leq 4$.

\[\mathcal{L}=f(x,y)+\lambda(k-g(x,y))\] \[\mathcal{L}=4x^2+10y^2+\lambda(4-x^2-y^2)\] \[\mathcal{L}_x=8x+\lambda(-2x)=2x(4-\lambda)=0\] \[\mathcal{L}_y=20y+\lambda(-2y)=2y(10-\lambda)=0\] \[\mathcal{L}_{\lambda}=4-x^2-y^2=0\] \[x=0 \text{ or } \lambda=4\] \[y=0 \text{ or } \lambda=10\]
  • $x=0$, $y=\pm2$
\[\mathcal{L}=4x^2+10y^2+\lambda(4-x^2-y^2)=40\]
  • $x=\pm 2$, $y=0$
\[\mathcal{L}=4x^2+10y^2+\lambda(4-x^2-y^2)=16\]
  • $\lambda=4$: then $\mathcal{L}_y=2y(10-4)=0$ forces $y=0$, so again $x=\pm2$ and $f(x,y)=16$
  • $\lambda=10$: then $\mathcal{L}_x=2x(4-10)=0$ forces $x=0$, so again $y=\pm2$ and $f(x,y)=40$

On the boundary circle the candidate values are therefore $16$ and $40$. Since the region is the full disk $x^2+y^2\leq4$, the interior critical point must also be checked: $\nabla f=0$ only at $(0,0)$, where $f(0,0)=0$.

Answer: minimum $=0$ (at the origin), maximum $=40$ (at $(0,\pm2)$)

Example 3. Find the dimensions of the box with the largest volume if the total surface area is $64\ \text{cm}^2$.

\[\mathcal{L}=f(x,y,z)+\lambda(k-g(x,y,z))=xyz+\lambda(64-2xy-2yz-2xz)\] \[\mathcal{L}_x=yz+\lambda(-2y-2z)=0\] \[\mathcal{L}_y=xz+\lambda(-2x-2z)=0\] \[\mathcal{L}_z=xy+\lambda(-2x-2y)=0\] \[\mathcal{L}_{\lambda}=64-2xy-2yz-2xz=0\]

Since the box dimensions must be positive, $x>0$, $y>0$, $z>0$.

\[\lambda={yz\over 2y+2z}={xz\over 2x+2z}={xy\over 2x+2y}\] \[\mathcal{L}_x+\mathcal{L}_y+\mathcal{L}_z=xy+yz+xz-4\lambda(x+y+z)=32-4\lambda(x+y+z)=0\] \[8-\lambda(x+y+z)=0\] \[{y\over 2y+2z}={x\over 2x+2z}\] \[2xy+2xz=2xy+2yz\] \[\therefore x=y\]

Comparing the other pairs in the same way gives $x=y=z$, and then $\lambda=x/4$. So,

\[\mathcal{L}=xyz+\lambda(64-2xy-2yz-2xz)=x^3+{x\over4}(64-6x^2)\]

The constraint requires $64-6x^2=0$, so the $\lambda$ term vanishes and $\mathcal{L}=x^3$.

$x^2=32/3$, $x=4\sqrt2/\sqrt3$

\[\mathcal{L}=x^3={128\sqrt2\over3\sqrt3}\]
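
A quick numerical check of this result, using only the numbers above: the cube with edge $x=4\sqrt{2}/\sqrt{3}$ meets the surface-area constraint, and randomly sampled boxes with the same surface area do not exceed its volume.

```python
import numpy as np

x = 4 * np.sqrt(2.0 / 3.0)          # claimed optimal edge length
print(6 * x**2)                     # surface area: 64 (up to rounding)
print(x**3)                         # volume: about 34.84 = 128*sqrt(2)/(3*sqrt(3))

# Random boxes with the same surface area never beat the cube's volume
rng = np.random.default_rng(5)
best = 0.0
for _ in range(100_000):
    a, b = rng.uniform(0.1, 5.0, size=2)
    c = (32 - a * b) / (a + b)      # solve xy + yz + xz = 32 for the third side
    if c > 0:
        best = max(best, a * b * c)
print(best, "<=", x**3)             # best stays below the cube's volume
```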

Minimum-Norm Solutions

  • Assume $A$ is fat and has full row rank.
  • Since $A$ is fat, there exist infinitely many $\hat \theta$ such that $A \hat \theta = y$
  • So we need to pick one of them to make the solution unique.
  • It turns out that $\hat\theta=A^\top(AA^\top)^{-1}y$ is the solution to
\[\hat \theta = \text{argmin}_\theta ||\theta||^2\]

subject to $A\theta=y$. You can solve this problem using a Lagrange multiplier, as done in the next section.

  • This is called the minimum-norm solution.

Solving the Minimum Norm problem

Consider the problem $\hat\theta=\text{argmin}_\theta||\theta||^2$ subject to $A \theta=y$.

The Lagrangian is defined as

\[\mathcal{L}(\theta, \lambda)=||\theta||^2+\lambda^\top(A\theta-y)\]

Taking derivatives with respect to $\theta$ and $\lambda$ yields

\[\nabla_\theta \mathcal{L}=2\theta+A^\top\lambda=0\] \[\nabla_\lambda \mathcal{L}=A\theta-y=0\]
  • The first equation gives $\theta=-A^\top\lambda /2$
  • Substituting into the second equation yields $-AA^\top\lambda/2-y=0$, so $\lambda=-2(AA^\top)^{-1}y$.
  • Therefore, $\hat\theta=A^\top(AA^\top)^{-1}y$ (verified numerically below).
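
A minimal NumPy check of this derivation with a made-up fat $A$: the formula satisfies the constraint, matches `np.linalg.pinv`, and has a smaller norm than any other solution obtained by adding a nullspace vector.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 4, 8                                    # fat A with full row rank
A = rng.normal(size=(N, d))
y = rng.normal(size=N)

theta_mn = A.T @ np.linalg.inv(A @ A.T) @ y    # minimum-norm solution
print(np.allclose(A @ theta_mn, y))            # True: constraint holds
print(np.allclose(theta_mn, np.linalg.pinv(A) @ y))  # True: matches the pseudo-inverse

# Shifting by a nullspace vector keeps A theta = y but increases the norm
_, _, Vt = np.linalg.svd(A)
theta_other = theta_mn + Vt[-1]
print(np.allclose(A @ theta_other, y))                          # True
print(np.linalg.norm(theta_mn) < np.linalg.norm(theta_other))   # True
```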

What if Rank-Deficient?

Rank deficiency in this context means that the data does not contain enough information to estimate the model you desire.

A matrix is said to have full rank if its rank is either equal to its number of columns or to its number of rows. A matrix that does not have full rank is said to be rank deficient.

For example, the matrices

\[\begin{pmatrix} 2 & 3 & 4 \\ 1 & 1 & 1 \\ \end{pmatrix}\]

and

\[\begin{pmatrix} 3 & 5 \\ 1 & -1 \\ 1 & 0 \\ \end{pmatrix}\]

both have full rank.

However, the matrices

\[\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ \end{pmatrix}\]

and

\[\begin{pmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \\ \end{pmatrix}\]

are rank deficient.

Why is Rank Deficiency a Problem?

  • If $A$ is rank-deficient, then $A^\top A$ is not invertible.
  • Approach 1: Regularization.
  • Approach 2: Pseudo-inverse. Decompose $A=USV^\top$ (the SVD).
  • $U\in \mathbb{R}^{N \times N}$ with $U^{\top}U=I$, and $V\in \mathbb{R}^{d \times d}$ with $V^\top V=I$
  • The diagonal block of $S\in \mathbb{R}^{N \times d}$ is $\text{diag}\{ s_1,\dots,s_r,0,\dots,0 \}$, where $r$ is the rank of $A$
  • The solution is given by the pseudo-inverse $A^{+}=VS^{+}U^{\top}$:
\[\hat{\theta}=VS^{+}U^{\top} {\bf y}\]

where $S^{+}=\text{diag}\{1/s_1,\dots,1/s_r,0,\dots,0\}$.
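
A sketch of this pseudo-inverse recipe on a deliberately rank-deficient $A$ (a duplicated column), with made-up data: inverting only the nonzero singular values reproduces `np.linalg.pinv` and `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 20
col = rng.normal(size=N)
A = np.column_stack([col, col, rng.normal(size=N)])   # rank 2 < 3 columns
y = rng.normal(size=N)

# SVD A = U S V^T; invert only the singular values above a small tolerance
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_plus = np.zeros_like(s)
s_plus[s > tol] = 1.0 / s[s > tol]
theta_hat = Vt.T @ (s_plus * (U.T @ y))               # theta_hat = V S^+ U^T y

print(np.linalg.matrix_rank(A))                        # 2
print(np.allclose(theta_hat, np.linalg.pinv(A) @ y))   # True
print(np.allclose(theta_hat, np.linalg.lstsq(A, y, rcond=None)[0]))  # True
```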