Machine Learning Class, Week 1
Logistic and Linear Regression
Basic Notation
- Scalar: $a, b, c \in \mathbb{R}$
- Vector: $\bf{a}, \bf{b}, \bf{c} \in \mathbb{R}^d$
- Rows and Columns
- ${\bf{a}}_j$: The $j$-th feature. ${\bf{x}}^n$: The $n$-th sample
- Identity Matrix: $I$
- All-ones vector ${\bf{1}}$ and all-zeros vector ${\bf{0}}$
- Standard basis ${\bf {e}}_i$
Linear Regression
The problem of regression can be summarized as:
- Given measurements: $y^n$, where $n=1,\dots,N$
- Given inputs: ${\bf{x}}^n$
- Given a model: $g_\theta({\bf{x}}^n)$, parameterized by $\theta$
Linear regression is one type of regression:
- Restrict $g_\theta(\cdot)$ to a line: \(g_\theta({\bf{x}}) = {\bf{x}}^\top \theta\)
- The inputs ${\bf{x}}$ and the parameters $\theta$ are ${\bf{x}}=[x_1,\dots,x_d]^{\top}$ and $\theta=[\theta_1, \dots, \theta_d]^{\top}$
- This is equivalent to $g_\theta({\bf{x}})={\bf{x}}^\top\theta=\sum^d_{j=1}x_j\theta_j$ (a small numerical sketch follows this list)
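As a quick illustration, here is a minimal numpy sketch of the model $g_\theta({\bf{x}})={\bf{x}}^\top\theta$ with made-up values for ${\bf{x}}$ and $\theta$:

```python
import numpy as np

# Minimal sketch of g_theta(x) = x^T theta with illustrative values.
x = np.array([1.0, 2.0, 3.0])        # x = [x_1, ..., x_d]
theta = np.array([0.5, -1.0, 2.0])   # theta = [theta_1, ..., theta_d]

g = x @ theta                        # equals sum_j x_j * theta_j
print(g)                             # 4.5
```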
Solving the Regression Problem
The (general) regression can be solved via the following logic:
- Define a (squared-error) loss function, e.g. \(J(\theta)=\sum^N_{n=1}\left(y^n-g_\theta({\bf{x}}^n)\right)^2\)
- Other loss functions can be used.
- The goal is to solve the optimization \(\hat\theta=\text{argmin}_\theta J(\theta)\) (a gradient-descent sketch follows this list)
- The prediction for a new input ${\bf{x}}^\text{new}$ is $y^\text{new}=\hat{\theta}^{\top}{\bf{x}}^\text{new}$
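As a sketch of this generic recipe, the snippet below minimizes the squared-error loss of a linear model by plain gradient descent; the data, learning rate, and iteration count are arbitrary choices for illustration, and the closed-form solution in the next section is what the lecture actually uses:

```python
import numpy as np

# Sketch: minimize the mean squared error of a linear model by gradient descent.
rng = np.random.default_rng(7)
N, d = 100, 3
X = rng.normal(size=(N, d))                 # rows are the inputs x^n
y = X @ np.array([1.0, -2.0, 0.5])          # noiseless targets for simplicity

theta = np.zeros(d)
lr = 0.01
for _ in range(2000):
    grad = 2 * X.T @ (X @ theta - y) / N    # gradient of the mean squared error
    theta -= lr * grad

print(theta)                                # close to [1.0, -2.0, 0.5]
```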
Linear Regression Solution
The linear regression problem is a special case which we can solve analytically.
- Restrict $g_\theta(\cdot)$ to a line: \(g_\theta({\bf{x}}^n)=\theta^{\top}{\bf{x}}^n\)
- Then the loss function becomes \(J(\theta)=\sum^N_{n=1}\left(y^n-\theta^{\top}{\bf{x}}^n\right)^2=||A\theta-y||^2\)
- The matrix and vectors are defined as follows:
\[A=\begin{bmatrix}({\bf{x}}^1)^\top \\ \vdots \\ ({\bf{x}}^N)^\top \end{bmatrix}, \quad y=\begin{bmatrix} y^1 \\ \vdots \\ y^N\end{bmatrix}\]
Theorem
For a linear regression problem
\[\hat{\theta}=\text{argmin}_\theta J(\theta), \quad J(\theta)=||A\theta-y||^2,\]
the minimizer is
\[\hat{\theta}=(A^{\top}A)^{-1}A^{\top}y\]
- Taking the derivative and setting it to zero gives \(\nabla_\theta J(\theta)=2A^\top(A\theta-y)=0\)
- So the solution is $\hat{\theta}=(A^\top A)^{-1}A^{\top}y$, assuming $A^{\top}A$ is invertible (see the numerical sketch below).
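A numpy sketch of this closed-form solution on synthetic data with made-up true parameters; `np.linalg.lstsq` is shown alongside as the numerically preferred way to compute the same minimizer:

```python
import numpy as np

# Sketch: solving argmin_theta ||A theta - y||^2 on synthetic data.
rng = np.random.default_rng(0)
N, d = 50, 3
A = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = A @ theta_true + 0.01 * rng.normal(size=N)

# Closed-form solution from the theorem (requires A^T A to be invertible).
theta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Numerically preferred equivalent provided by numpy.
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta_hat)                              # close to theta_true
print(np.allclose(theta_hat, theta_lstsq))    # True
```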
Example Cases
Example 1. Second-order polynomial fitting
\[y_n=ax^2_n+bx_n+c\]
\[A=\begin{bmatrix} x^2_1 & x_1 & 1 \\ \vdots & \vdots & \vdots \\ x^2_N & x_N & 1\end{bmatrix}, \quad y=\begin{bmatrix}y_1\\ \vdots \\y_N\end{bmatrix}, \quad \theta=\begin{bmatrix}a\\b\\c\end{bmatrix}\]
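A possible numpy sketch of Example 1 on synthetic data, with hypothetical true coefficients $(a,b,c)=(2,-1,0.5)$:

```python
import numpy as np

# Sketch of Example 1: fit y_n = a x_n^2 + b x_n + c by stacking the
# columns [x^2, x, 1] into the design matrix A.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = 2.0 * x**2 - 1.0 * x + 0.5 + 0.1 * rng.normal(size=x.size)

A = np.column_stack([x**2, x, np.ones_like(x)])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)   # theta = [a, b, c]
print(theta)                                    # approximately [2.0, -1.0, 0.5]
```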
Example 2. Auto-Regression
\[y_n=ay_{n-1}+by_{n-2}\]
\[A=\begin{bmatrix}y_2 & y_1 \\ y_3 & y_2 \\ \vdots & \vdots \\ y_{N-1} & y_{N-2} \end{bmatrix}, \quad y=\begin{bmatrix}y_3 \\ y_4 \\ \vdots \\ y_N \end{bmatrix}, \quad \theta=\begin{bmatrix} a \\ b\end{bmatrix}\]
Linear Spans
Given a set of vectors $\{{\bf{a}}_1, \dots, {\bf{a}}_d\}$, the span is the set of all possible linear combinations of these vectors:
\[\text{span} \{ {\bf{a}}_1, \dots, {\bf{a}}_d \} = \left\{ z \,\middle|\, z=\sum^d_{j=1}\alpha_j{\bf{a}}_j \right\}\]
Q. Which sets of vectors can span $\mathbb{R}^3$?
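One way to answer such questions numerically: a set of vectors spans $\mathbb{R}^3$ exactly when the matrix having them as columns has rank 3. The two candidate sets below are illustrative, not taken from the slides:

```python
import numpy as np

# Sketch: rank check for whether a set of vectors spans R^3.
set1 = np.column_stack([[1, 0, 0], [0, 1, 0], [0, 0, 1]])   # standard basis
set2 = np.column_stack([[1, 0, 0], [0, 1, 0], [1, 1, 0]])   # third vector lies in the plane of the first two

print(np.linalg.matrix_rank(set1))   # 3 -> spans R^3
print(np.linalg.matrix_rank(set2))   # 2 -> does not span R^3
```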
Geometry of Linear Regression
Given $\theta$, the product $A\theta$ can be viewed as
\[A\theta=\begin{bmatrix} {\bf{a}}_1 & {\bf{a}}_2 & \dots & {\bf{a}}_d\end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} = \sum^d_{j=1} \theta_j{\bf{a}}_j\]
So the set of all possible $A\theta$ is equivalent to $\text{span}\{{\bf{a}}_1, \dots, {\bf{a}}_d\}$.
Define the range of $A$ as $\mathcal{R}(A)=\{ \hat{y}\mid\hat{y} = A\theta \text{ for some } \theta \}$.
Orthogonality Principle
- Consider the error ${\bf{e}}={\bf{y}}-A\theta$
- For the error to be minimized, it must be orthogonal to $\mathcal{R}(A)$, which is the span of the columns of $A$.
- This orthogonality principle means that ${\bf{a}}^{\top}_j{\bf{e}}=0$ for all $j=1,\dots,d$, which implies $A^\top{\bf{e}}=0$ (see the numerical check below).
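A small numpy check of the orthogonality principle on arbitrary synthetic data:

```python
import numpy as np

# Sketch: verify A^T e = 0 at the least-squares solution.
rng = np.random.default_rng(2)
A = rng.normal(size=(30, 4))
y = rng.normal(size=30)

theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
e = y - A @ theta_hat                        # error at the minimizer

print(np.allclose(A.T @ e, 0, atol=1e-10))   # True: e is orthogonal to R(A)
```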
Normal Equation
- The orthogonality principle, which states that $A^\top {\bf{e}}=0$, implies that $A^\top({\bf{y}}-A\theta)=0$ by substituting ${\bf{e}}={\bf{y}}-A\theta$.
- This is called the normal equation: \(A^\top A\theta=A^\top{\bf{y}}\)
- The predicted value is \(\hat{\bf{y}}=A\hat\theta=A(A^\top A)^{-1}A^{\top}{\bf{y}}\)
- The matrix $P=A(A^\top A)^{-1}A^{\top}$ is a projection onto the span of $\{{\bf{a}}_1, {\bf{a}}_2, \dots, {\bf{a}}_d\}$, i.e., the range of $A$.
- $P$ is called the projection matrix. It holds that $PP=P$.
- The error ${\bf{e}}={\bf{y}}-\hat{\bf{y}}$ is \({\bf{e}}=(I-P){\bf{y}}\) (these properties are checked numerically below).
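A numpy sketch that builds $P$ for a synthetic tall $A$ and verifies $PP=P$ and ${\bf{e}}=(I-P){\bf{y}}$:

```python
import numpy as np

# Sketch: projection matrix P = A (A^T A)^{-1} A^T and its properties.
rng = np.random.default_rng(3)
A = rng.normal(size=(20, 3))
y = rng.normal(size=20)

P = A @ np.linalg.inv(A.T @ A) @ A.T
y_hat = P @ y                      # predicted value
e = y - y_hat                      # error

print(np.allclose(P @ P, P))                    # True: P is idempotent
print(np.allclose(e, (np.eye(20) - P) @ y))     # True: e = (I - P) y
```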
Over-determined and Under-determined Systems
- Assume $A$ has full rank: full column rank in the over-determined case, full row rank in the under-determined case.
- Over-determined $A$ (tall and skinny): $\hat\theta=(A^\top A)^{-1}A^{\top}{\bf{y}}$
- Under-determined $A$ (fat and short): $\hat\theta=A^{\top}(AA^{\top})^{-1}{\bf{y}}$
- If $A$ is under-determined, then there exists a non-trivial null space $\mathcal{N}(A)=\{\theta_0 \mid A\theta_0=0\}$.
- This implies that if $\hat{\theta}$ is a solution, then $\hat{\theta}+\theta_0$ is also a solution as long as $\theta_0 \in \mathcal{N}(A)$ (see the numerical sketch below).
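A numpy sketch of the under-determined case on synthetic data, showing that solutions are not unique:

```python
import numpy as np

# Sketch: for a fat, full-row-rank A, adding a null-space vector to a
# solution of A theta = y gives another solution.
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 6))                       # fat: 3 equations, 6 unknowns
y = rng.normal(size=3)

theta_hat = A.T @ np.linalg.inv(A @ A.T) @ y      # a particular solution

_, _, Vt = np.linalg.svd(A)                       # rows of Vt beyond the rank span N(A)
theta_0 = Vt[3]                                   # one null-space direction, A theta_0 = 0

print(np.allclose(A @ theta_hat, y))                       # True
print(np.allclose(A @ (theta_hat + 2.5 * theta_0), y))     # True: still a solution
```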
What is a Lagrange Multiplier?
We are going to take a look at another way of optimizing a function subject to given constraint(s). The constraint(s) may be the equation(s) that describe the boundary of a region, although in this section we won't concentrate on those types of problems, since this method just requires a general constraint and doesn't really care where the constraint came from.
- So, let's get things set up. We want to optimize (i.e. find the minimum and maximum values of) a function, $f(x,y,z)$, subject to the constraint $g(x, y, z)=k$. Again, the constraint may be the equation that describes the boundary of a region or it may not be. The process is actually fairly simple, although the work can still be a little overwhelming at times.
Method of Lagrange Multipliers
- Solve the following system of equations: \(\nabla f(x,y,z)=\lambda\,\nabla g(x,y,z), \quad g(x,y,z)=k\)
- Plug in all solutions, $(x,y,z)$, from the first step into $f(x,y,z)$ and identify the minimum and maximum values, provided they exist and $\nabla g \neq \vec{0}$ at the point.
The constant $\lambda$ is called the Lagrange Multiplier.
Example 1. Find the maximum and minimum of $f(x,y)=5x-3y$ subject to the constraint $x^2+y^2=136$.
\[\mathcal{L}=f(x,y)+\lambda(k-g(x,y))\]
\[\mathcal{L}=5x-3y+\lambda(136-x^2-y^2)\]
\[\mathcal{L}_x=5+\lambda(-2x)=0\]
\[\mathcal{L}_y=-3+\lambda(-2y)=0\]
\[x={5 \over 2\lambda},\quad y=-{3 \over 2\lambda}\]
\[{34 \over 4\lambda^2}=136\]
\[\lambda=\pm{1\over4}\]
\[f(x,y)=5x-3y={34\over 2\lambda}={17\over\lambda}\]
\[\lambda={1\over4}:\ (x,y)=(10,-6),\quad \max f(x,y)=68\]
\[\lambda=-{1\over4}:\ (x,y)=(-10,6),\quad \min f(x,y)=-68\]
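A small SymPy sketch that re-derives this result from the system $\nabla f=\lambda\nabla g$ plus the constraint (the variable names are arbitrary):

```python
import sympy as sp

# Sketch: solve grad f = lambda * grad g together with g(x, y) = 136.
x, y, lam = sp.symbols('x y lam', real=True)
f = 5 * x - 3 * y
g = x**2 + y**2

sols = sp.solve([sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),
                 sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),
                 sp.Eq(g, 136)],
                [x, y, lam], dict=True)

for s in sols:
    print(s, f.subs(s))   # f = 68 at (10, -6) and f = -68 at (-10, 6)
```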
Example 2. Find the maximum and minimum values of $f(x,y)=4x^2+10y^2$ on the disk $x^2+y^2 \leq 4$.
\[\mathcal{L}=f(x,y)+\lambda(k-g(x,y))\]
\[\mathcal{L}=4x^2+10y^2+\lambda(4-x^2-y^2)\]
\[\mathcal{L}_x=8x+\lambda(-2x)=2x(4-\lambda)=0\]
\[\mathcal{L}_y=20y+\lambda(-2y)=2y(10-\lambda)=0\]
\[\mathcal{L}_{\lambda}=4-x^2-y^2=0\]
- From $\mathcal{L}_x=0$: $x=0$ or $\lambda=4$. From $\mathcal{L}_y=0$: $y=0$ or $\lambda=10$.
- If $x=0$, then $\lambda=10$ and the constraint gives $y=\pm2$, so $f(0,\pm2)=40$.
- If $y=0$, then $\lambda=4$ and the constraint gives $x=\pm2$, so $f(\pm2,0)=16$.
- Since the region is the full disk $x^2+y^2\leq4$, we must also check interior critical points: $\nabla f=0$ at $(0,0)$, where $f(0,0)=0$.
Answer: minimum $=0$ at $(0,0)$, maximum $=40$ at $(0,\pm2)$.
Example 3. Find the dimensions of the box with largest volume if the total surface area is $64\ \text{cm}^2$.
\[\mathcal{L}=f(x,y,z)+\lambda(k-g(x,y,z))=xyz+\lambda(64-2xy-2yz-2xz)\]
\[\mathcal{L}_x=yz+\lambda(-2y-2z)=0\]
\[\mathcal{L}_y=xz+\lambda(-2x-2z)=0\]
\[\mathcal{L}_z=xy+\lambda(-2x-2y)=0\]
\[\mathcal{L}_{\lambda}=64-2xy-2yz-2xz=0\]
Assume $x>0$, $y>0$, $z>0$.
\[\lambda={yz\over 2y+2z}={xz\over 2x+2z}={xy\over 2x+2y}\]
\[\mathcal{L}_x+\mathcal{L}_y+\mathcal{L}_z=xy+yz+xz-4\lambda(x+y+z)=32-4\lambda(x+y+z)=0\]
\[8-\lambda(x+y+z)=0\]
\[{y\over 2y+2z}={x\over 2x+2z}\]
\[2xy+2yz=2xy+2xz\]
\[\therefore x=y\]
By symmetry $x=y=z$, and $\lambda=x/4$. Substituting back,
\[\mathcal{L}=x^3+{x\over4}(64-6x^2)\]
and since the constraint gives $64-6x^2=0$, we have $x^2=32/3$, i.e., $x=4\sqrt2/\sqrt3$, so the maximum volume is
\[V=x^3={128\sqrt2\over3\sqrt3}\ \text{cm}^3\]
Minimum-Norm Solutions
- Assume $A$ is fat and has full row rank.
- Since $A$ is fat, there exist infinitely many $\hat \theta$ such that $A \hat \theta = y$.
- So we need an additional criterion to pick a unique one.
- It turns out that $\hat\theta=A^\top(AA^\top)^{-1}y$ is the solution to $\text{argmin}_\theta||\theta||^2$ subject to $A\theta=y$. You can solve this problem using a Lagrange multiplier.
- This is called the minimum-norm solution.
Solving the Minimum-Norm Problem
Consider the problem $\hat\theta=\text{argmin}_\theta||\theta||^2$ subject to $A \theta=y$.
The Lagrangian is defined as
\[\mathcal{L}(\theta, \lambda)=||\theta||^2+\lambda^\top(A\theta-y)\]
Taking derivatives with respect to $\theta$ and $\lambda$ yields
\[\nabla_\theta \mathcal{L}=2\theta+A^\top\lambda=0\]
\[\nabla_\lambda \mathcal{L}=A\theta-y=0\]
- The first equation gives us $\theta=-A^\top\lambda /2$.
- Substituting into the second equation yields $-AA^\top\lambda/2-y=0$, so $\lambda=-2(AA^\top)^{-1}y$.
- Therefore, $\theta=A^\top(AA^\top)^{-1}y$ (checked numerically below).
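A numpy check, on synthetic data, that this minimum-norm formula agrees with the pseudo-inverse solution:

```python
import numpy as np

# Sketch: minimum-norm solution A^T (A A^T)^{-1} y vs. np.linalg.pinv
# for a fat, full-row-rank A.
rng = np.random.default_rng(5)
A = rng.normal(size=(4, 8))
y = rng.normal(size=4)

theta_min_norm = A.T @ np.linalg.inv(A @ A.T) @ y
theta_pinv = np.linalg.pinv(A) @ y

print(np.allclose(theta_min_norm, theta_pinv))   # True
print(np.allclose(A @ theta_min_norm, y))        # True: satisfies A theta = y
```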
What if $A$ is Rank-Deficient?
Rank deficiency in this context means there is insufficient information contained in your data to estimate the model you desire.
A matrix is said to have full rank if its rank is either equal to its number of columns or to its number of rows. A matrix that does not have full rank is said to be rank deficient.
For example, the matrices \(\begin{pmatrix} 2 & 3 & 4 \\ 1 & 1 & 1 \\ \end{pmatrix}\)
and
\[\begin{pmatrix} 3 & 5 \\ 1 & -1 \\ 1 & 0 \\ \end{pmatrix}\]both have full rank.
However, the matrices
\[\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ \end{pmatrix}\]and
\[\begin{pmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \\ \end{pmatrix}\]are rank deficient.
Why is Rank Deficiency a Problem?
- If $A$ is rank-deficient, then $A^\top A$ is not invertible.
- Approach 1: Regularization (e.g. ridge regression, which replaces $A^\top A$ with $A^\top A+\lambda I$).
- Approach 2: Pseudo-inverse. Decompose $A=USV^\top$.
- $U\in \mathbb{R}^{N \times N}$ with $U^{\top}U=I$, and $V\in \mathbb{R}^{d \times d}$ with $V^\top V=I$.
- The diagonal block of $S\in \mathbb{R}^{N \times d}$ is $\text{diag}\{s_1,\dots,s_r,0,\dots,0\}$, where $r$ is the rank of $A$.
- The solution is $\hat\theta=VS^{+}U^{\top}y$, where $VS^{+}U^{\top}$ is called the pseudo-inverse and $S^{+}=\text{diag}\{1/s_1,\dots,1/s_r,0,\dots,0\}$ (a numerical sketch follows below).
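A numpy sketch of the pseudo-inverse solution for a deliberately rank-deficient $A$ (the data and the rank-deficient construction are made up for illustration):

```python
import numpy as np

# Sketch: least squares for a rank-deficient A via the SVD pseudo-inverse,
# inverting only the nonzero singular values.
rng = np.random.default_rng(6)
B = rng.normal(size=(10, 2))
A = np.column_stack([B, B[:, 0] + B[:, 1]])     # third column is a sum -> rank 2
y = rng.normal(size=10)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = 1e-10
s_plus = np.array([1.0 / si if si > tol else 0.0 for si in s])
theta_hat = Vt.T @ (s_plus * (U.T @ y))

print(np.linalg.matrix_rank(A))                          # 2: A^T A is singular
print(np.allclose(theta_hat, np.linalg.pinv(A) @ y))     # True: matches np.linalg.pinv
```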