L2 Regularization (Ridge Regression)

L2 regularization adds a squared L2-norm penalty to the optimization problem $\min_{\mathbf{x}}f(\mathbf{x})$: \begin{equation}\min_{\mathbf{x}}\,l_{\lambda}(\mathbf{x})\equiv f(\mathbf{x})+\frac{\lambda}{2} \left|\left| \mathbf{x}\right|\right|^2_2\,.\tag{1}\end{equation}
Remarks:
  1. In deep learning, L2 regularization is also referred to as weight decay. This is because in vanilla SGD, \begin{eqnarray}\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}-\gamma\left[\nabla f(\mathbf{x}^{(t)})+\lambda\mathbf{x}^{(t)}\right]=(1-\gamma\, \lambda)\,\mathbf{x}^{(t)}-\gamma\,\nabla f(\mathbf{x}^{(t)})\,,\end{eqnarray} so L2 regularization decays the weights $\mathbf{x}^{(t)}$ by a factor of $1-\gamma \lambda$ at every step.
  2. In more sophisticated optimizers beyond vanilla SGD, L2 regularization and weight decay are no longer equivalent (a sketch contrasting the two follows after this list):
    • L2 regularization: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f(\mathbf{x})+\lambda\mathbf{x}$;
    • Weight decay (decoupled): keep $\nabla f(\mathbf{x})$ unchanged in the optimizer (so the penalty never enters the momentum or second-moment statistics) and instead shrink the weights directly, $\mathbf{x}\rightarrow(1-\gamma\lambda)\,\mathbf{x}$, as a separate step of the update.
  3. For Adam, the PyTorch class `Adam` implements L2 regularization (its `weight_decay` argument is added to the gradient), while decoupled weight decay is implemented in the separate class `AdamW`. The AdamW paper (Loshchilov & Hutter, "Decoupled Weight Decay Regularization") argues that the decoupled version generalizes better.
  4. If $f(\mathbf{x})$ is scale invariant under $\mathbf{x}\rightarrow s\mathbf{x}$ (for example, when the weights are followed by batch normalization layers), $\lambda$ has NO regularizing effect, since $l_{\lambda}(\mathbf{x}) = l_{\lambda/s^2}(s\mathbf{x})$: any penalty strength can be absorbed into a rescaling of the weights without changing $f$. Instead, $\lambda$ affects the effective learning rate; see the literature on L2 regularization with normalization layers for details.
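To make remarks 2 and 3 concrete, below is a minimal Python sketch of one SGD-with-momentum step under the two conventions. The function `sgd_momentum_step`, the toy quadratic loss, and the hyperparameter values are illustrative choices for this post, not any library's API; the PyTorch optimizers from remark 3 are noted in the trailing comment.

```python
import numpy as np

def sgd_momentum_step(x, v, grad, lr=0.1, mu=0.9, lam=1e-2, mode="l2"):
    """One SGD-with-momentum step on parameters x with velocity buffer v.

    mode="l2":    L2 regularization -- add lam*x to the gradient the optimizer
                  sees, so the penalty also flows into the momentum buffer.
    mode="decay": decoupled weight decay -- leave the gradient untouched and
                  shrink the parameters directly by (1 - lr*lam); the exact
                  placement of this shrink varies by implementation.
    """
    g = grad(x)
    if mode == "l2":
        g = g + lam * x              # penalty folded into the gradient
    v = mu * v + g                   # momentum buffer update
    x = x - lr * v                   # parameter update
    if mode == "decay":
        x = (1.0 - lr * lam) * x     # decay applied outside the optimizer
    return x, v

# Toy quadratic loss f(x) = 0.5*||x||^2, so grad f(x) = x.
grad_f = lambda x: x
x_l2, v_l2 = np.ones(3), np.zeros(3)
x_wd, v_wd = np.ones(3), np.zeros(3)
for _ in range(5):
    x_l2, v_l2 = sgd_momentum_step(x_l2, v_l2, grad_f, mode="l2")
    x_wd, v_wd = sgd_momentum_step(x_wd, v_wd, grad_f, mode="decay")
print(x_l2, x_wd)  # the two trajectories differ once momentum is involved

# In PyTorch, the analogous choice is between
#   torch.optim.Adam(params, lr=1e-3, weight_decay=1e-2)   # L2 regularization
#   torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2)  # decoupled weight decay
```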

In linear regression, L2 regularization amounts to solving \begin{equation}\hat{\boldsymbol{\beta}}^{\text{ridge}}=\text{argmin}_{\boldsymbol{\beta}}\left\{\sum_{i=1}^n\left(y^{(i)}-\beta_0-\sum_{j=1}^p \beta_jx^{(i)}_{j}\right)^2+\lambda\sum_{j=1}^p \beta_j^2\right\}\,.\end{equation} Note that the intercept term $\beta_0$ is not penalized, since $\beta_0$ is a global shift of all the $y$'s. We can remove $\beta_0$ from the problem by centering $\mathbf{y}$ and the columns of $\mathbf{X}$. With centered inputs, we have \begin{eqnarray}\hat{\boldsymbol{\beta}}^{\text{ridge}}&=&\text{argmin}_{\boldsymbol{\beta}}\left\{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda\boldsymbol{\beta}^T\boldsymbol{\beta}\right\}\\&=&\left(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y}\,,\tag{2}\end{eqnarray} as sketched numerically right below.
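As a quick illustration of (2), here is a minimal NumPy sketch of the closed-form ridge fit, including the centering step that keeps the intercept out of the penalty. The function name `ridge_fit` and the toy data are made up for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (2): center X and y so the intercept is not
    penalized, solve (X^T X + lam*I) beta = X^T y, then recover beta_0."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                   # centered inputs/response
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta                    # intercept from the centering
    return beta0, beta

# Toy data; lam = 0 recovers ordinary least squares, larger lam shrinks beta.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
print(ridge_fit(X, y, lam=0.0))
print(ridge_fit(X, y, lam=10.0))  # coefficients shrink toward zero
```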
Remarks:
  1. With L2 regularization ($\lambda>0$), $\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}$ is always invertible, even when $\mathbf{X}^T\mathbf{X}$ is singular (e.g., collinear columns or $p>n$).
  2. For orthonormal inputs, $\mathbf{X}^T\mathbf{X}=\mathbf{I}$, \begin{equation}\hat{\boldsymbol{\beta}}^{\text{ridge}}=\frac{1}{1+\lambda}\hat{\boldsymbol{\beta}}^{\text{ols}}\,,\end{equation} i.e., ridge applies a uniform scaling to all of the OLS coefficients $\hat{\boldsymbol{\beta}}^{\text{ols}}$.
  3. In the general case, with the SVD $\mathbf{X}=\mathbf{U}\,\text{diag}\left[\sigma_1,\cdots,\sigma_p\right]\,\mathbf{V}^T$ where $\mathbf{U}^T\mathbf{U}=\mathbf{I}$ and $\mathbf{V}^{T}\mathbf{V}=\mathbf{I}$, we can derive \begin{equation}\mathbf{X}\hat{\boldsymbol{\beta}}^{\text{ridge}}=\mathbf{U}\,\text{diag}\left[\frac{\sigma^2_1}{\sigma^2_1+\lambda},\cdots,\frac{\sigma^2_p}{\sigma^2_p+\lambda}\right]\mathbf{U}^T\mathbf{y}\,.\end{equation} Ridge shrinks the fit along every principal-component direction, but shrinks the low-variance directions more than the high-variance ones (verified numerically in the sketch after this list).
  4. Principal component regression (PCR) is ordinary least squares on the reduced inputs $\mathbf{X}_{k}=\mathbf{U}\text{diag}\left[\sigma_1,\cdots,\sigma_k, 0, \cdots, 0\right]\mathbf{V}^T$ obtained by keeping the top $k$ principal components, in which case \begin{equation}\mathbf{X}\hat{\boldsymbol{\beta}}^{\text{pca}}=\mathbf{U}\,\text{diag}\left[1,\cdots, 1, 0, \cdots, 0\right]\mathbf{U}^T\mathbf{y}\,,\end{equation} with $k$ ones on the diagonal. Compared to ridge, PCR leaves the high-variance components alone and discards the low-variance components entirely.
  5. The ridge solution (2) is also ordinary least squares on the augmented inputs $\left[\begin{array}{c} \mathbf{X} \\ \sqrt{\lambda}\mathbf{I}_{p\times p} \end{array}\right]$ and $\left[\begin{array}{c} \mathbf{y} \\ \mathbf{0} \end{array}\right]$. In this view, the ridge shrinkage comes from $p$ artificial data points with zero responses.
  6. Since linear regression assumes $\mathbf{y}|\mathbf{X},\boldsymbol{\beta}\sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \tau^2\mathbf{I})$, ridge can be viewed from a Bayesian perspective as placing a Gaussian prior $\boldsymbol{\beta}\sim\mathcal{N}\left(0, (\tau^2/\lambda)\,\mathbf{I}\right)$ on the coefficients; the maximum a posteriori (MAP) estimate of $\boldsymbol{\beta}$ is exactly the ridge solution (2).
  7. The ridge regression (2) is equivalent to the constrained problem \begin{eqnarray} \min_{\boldsymbol{\beta}} \quad \left|\left|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right|\right|_2^2 \quad \quad \text{s.t.}&&\quad \left|\left|\boldsymbol{\beta}\right|\right|_2^2\leq t \,,\end{eqnarray} with a one-to-one correspondence between $\lambda$ and $t$, as can be shown via the KKT conditions.
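Remarks 3 and 5 are easy to verify numerically. The sketch below, on random centered data with illustrative values of $n$, $p$, and $\lambda$, checks that the SVD shrinkage factors and the augmented-data least squares both reproduce the closed-form ridge solution (2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 5, 3.0
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)   # centered inputs
y = rng.normal(size=n); y -= y.mean()              # centered response

# Direct ridge coefficients from (2).
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Fitted values via the SVD shrinkage factors sigma_j^2 / (sigma_j^2 + lam).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
fit_svd = U @ np.diag(s**2 / (s**2 + lam)) @ U.T @ y

# Same coefficients from OLS on the augmented data of remark 5.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(X @ beta_ridge, fit_svd))        # True
print(np.allclose(beta_ridge, beta_aug))           # True
```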
Reference: Chapter 3.4 of The Elements of Statistical Learning (ESL).
