L2 regularization (Ridge)
L2 regularization adds a squared L2-norm penalty to the optimization problem $\min_{\mathbf{x}}f(\mathbf{x})$: \begin{equation}\min_{\mathbf{x}}\,l_{\lambda}(\mathbf{x})\equiv f(\mathbf{x})+\frac{\lambda}{2} \left|\left| \mathbf{x}\right|\right|^2_2\,.\tag{1}\end{equation}
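Differentiating (1) gives \begin{equation}\nabla l_{\lambda}(\mathbf{x})=\nabla f(\mathbf{x})+\lambda\,\mathbf{x}\,,\end{equation} so the penalty simply adds a pull of strength $\lambda$ toward the origin; this is the gradient that appears in the update rules in the remarks below.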
Remarks:
- In linear regression, L2 regularization solves \begin{equation}\hat{\boldsymbol{\beta}}^{\text{ridge}}=\text{argmin}_{\boldsymbol{\beta}}\left\{\sum_{i=1}^n\left(y^{(i)}-\beta_0-\sum_{j=1}^p \beta_jx^{(i)}_{j}\right)^2+\lambda\sum_{j=1}^p \beta_j^2\right\}\,.\end{equation} Note that the intercept $\beta_0$ is not regularized, since it is only a global shift of all $y$. We can remove $\beta_0$ by centering $\mathbf{y}$ and $\mathbf{X}$. With centered inputs, \begin{eqnarray}\hat{\boldsymbol{\beta}}^{\text{ridge}}&=&\text{argmin}_{\boldsymbol{\beta}}\left\{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda\boldsymbol{\beta}^T\boldsymbol{\beta}\right\}\\&=&\left(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y}\,.\tag{2}\end{eqnarray}
- In deep learning, L2 regularization is also referred to as weight decay. This is because, in vanilla SGD, \begin{eqnarray}\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}-\gamma\left[\nabla f(\mathbf{x}^{(t)})+\lambda\mathbf{x}^{(t)}\right]=(1-\gamma\, \lambda)\,\mathbf{x}^{(t)}-\gamma\,\nabla f(\mathbf{x}^{(t)})\,,\end{eqnarray} i.e., L2 regularization decays the weights $\mathbf{x}^{(t)}$ by a factor of $1-\gamma \lambda$ at every step.
- In more sophisticated optimizers beyond vanilla SGD, L2 regularization differs from weight decay (a numerical sketch contrasting the two follows this list):
- L2 regularization: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f(\mathbf{x})+\lambda\mathbf{x}$;
- Weight decay: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f\left((1-\gamma\lambda)\mathbf{x}\right)$.
- For Adam, the PyTorch class 'Adam' implements L2 regularization (via its 'weight_decay' argument), while decoupled weight decay is implemented in the separate class 'AdamW'. The AdamW paper (Loshchilov & Hutter, Decoupled Weight Decay Regularization) claims that AdamW performs better.
- If $f(\mathbf{x})$ is scale invariant under $\mathbf{x}\rightarrow s\,\mathbf{x}$ (for example, when the network contains batch normalization layers), $\lambda$ has NO regularizing effect, because $l_{\lambda}(\mathbf{x}) = l_{\lambda/s^2}(s\,\mathbf{x})$. Instead, $\lambda$ acts on the effective learning rate. Check this paper for more details.
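To make the distinction concrete, here is a minimal NumPy sketch (a toy quadratic loss and a simplified Adam-style update without bias correction; the data and hyperparameters are illustrative assumptions, not taken from the derivations above) showing that the two rules produce different iterates once the gradient is adaptively rescaled:

```python
import numpy as np

# Toy quadratic loss f(x) = 0.5 * ||A x - b||^2 on random data (illustrative only).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
grad_f = lambda x: A.T @ (A @ x - b)

gamma, lam = 1e-2, 1e-1               # learning rate and regularization strength
beta1, beta2, eps = 0.9, 0.999, 1e-8  # Adam-style hyperparameters

def adam_step(x, m, v, g):
    """One simplified Adam-style step (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    return x - gamma * m / (np.sqrt(v) + eps), m, v

x_l2, m_l2, v_l2 = np.zeros(5), np.zeros(5), np.zeros(5)
x_wd, m_wd, v_wd = np.zeros(5), np.zeros(5), np.zeros(5)
for _ in range(500):
    # L2 regularization: the penalty gradient lam * x goes through the adaptive rescaling.
    x_l2, m_l2, v_l2 = adam_step(x_l2, m_l2, v_l2, grad_f(x_l2) + lam * x_l2)
    # Decoupled weight decay (AdamW-style): shrink the weights outside the adaptive step.
    x_wd, m_wd, v_wd = adam_step((1 - gamma * lam) * x_wd, m_wd, v_wd, grad_f(x_wd))

print(np.linalg.norm(x_l2 - x_wd))   # nonzero: the two rules are no longer equivalent
```

In PyTorch, these two choices correspond to 'torch.optim.Adam(..., weight_decay=lam)', which folds the L2 penalty into the gradient, and 'torch.optim.AdamW(..., weight_decay=lam)', which applies the decoupled decay directly to the weights.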
Remarks:
- With L2 regularization, $\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}$ is always invertible, since its eigenvalues are bounded below by $\lambda>0$ even when $\mathbf{X}^T\mathbf{X}$ is singular.
- For orthonormal inputs ($\mathbf{X}^T\mathbf{X}=\mathbf{I}$), \begin{equation}\hat{\boldsymbol{\beta}}^{\text{ridge}}=\frac{1}{1+\lambda}\hat{\boldsymbol{\beta}}^{\text{ols}}\,,\end{equation} which is a uniform scaling of all the OLS coefficients.
- In the general case, writing the SVD $\mathbf{X}=\mathbf{U}\,\text{diag}\left[\sigma_1,\cdots,\sigma_p\right]\,\mathbf{V}^T$ with $\mathbf{U}^T\mathbf{U}=\mathbf{I}$ and $\mathbf{V}^{T}\mathbf{V}=\mathbf{I}$, we can derive \begin{equation}\mathbf{X}\hat{\boldsymbol{\beta}}^{\text{ridge}}=\mathbf{U}\,\text{diag}\left[\frac{\sigma^2_1}{\sigma^2_1+\lambda},\cdots,\frac{\sigma^2_p}{\sigma^2_p+\lambda}\right]\mathbf{U}^T\mathbf{y}\,.\end{equation} Ridge shrinks all principal-component directions, but shrinks the low-variance components more than the high-variance ones.
- Principal component regression (PCR) is ordinary least squares on the PCA-reduced inputs $\mathbf{X}_{k}=\mathbf{U}\,\text{diag}\left[\sigma_1,\cdots,\sigma_k, 0, \cdots, 0\right]\mathbf{V}^T$, for which \begin{equation}\mathbf{X}\hat{\boldsymbol{\beta}}^{\text{pcr}}=\mathbf{U}\,\text{diag}\left[1,\cdots, 1, 0, \cdots, 0\right]\mathbf{U}^T\mathbf{y}\,.\end{equation} Compared to ridge above, PCR leaves the top $k$ high-variance components untouched and discards the low-variance components entirely.
- The ridge solution (2) is also ordinary least squares on the augmented inputs $\left[\begin{array}{c} \mathbf{X} \\ \sqrt{\lambda}\mathbf{I}_{p\times p} \end{array}\right]$ and responses $\left[\begin{array}{c} \mathbf{y} \\ \mathbf{0} \end{array}\right]$. In this view, the ridge shrinkage comes from artificial data points with zero response. (This identity and the SVD formula above are verified numerically in the sketch after this list.)
- Since linear regression assumes $\mathbf{y}|\mathbf{X},\boldsymbol{\beta}\sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \tau^2\mathbf{I})$, ridge can be viewed from a Bayesian perspective as placing a Gaussian prior $\boldsymbol{\beta}\sim\mathcal{N}\left(\mathbf{0}, (\tau^2/\lambda)\,\mathbf{I}\right)$ on the coefficients; the maximum a posteriori (MAP) estimate then recovers the ridge solution (2).
- Finally, the ridge regression (2) is equivalent to the constrained problem \begin{eqnarray} \min_{\boldsymbol{\beta}} \quad \left|\left|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right|\right|_2^2 \quad \quad \text{s.t.}&&\quad \left|\left|\boldsymbol{\beta}\right|\right|_2^2\leq t \,,\end{eqnarray} with a one-to-one correspondence between $\lambda$ and $t$, as can be shown via the KKT conditions.
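As a sanity check of the remarks above, here is a small NumPy sketch (random toy data; the sizes and variable names are illustrative assumptions) verifying the closed form (2), its SVD expression for the fitted values, and the augmented-data formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
X = X - X.mean(axis=0)   # center the inputs so no intercept is needed
y = y - y.mean()

# Closed-form ridge solution (2): (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# SVD form of the fitted values: U diag(s_i^2 / (s_i^2 + lam)) U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
fitted_svd = U @ np.diag(s**2 / (s**2 + lam)) @ U.T @ y
assert np.allclose(X @ beta_ridge, fitted_svd)

# OLS on the augmented data [X; sqrt(lam) I], [y; 0] reproduces the ridge solution
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
assert np.allclose(beta_ridge, beta_aug)
```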
Reference: Section 3.4 of The Elements of Statistical Learning (ESL).