L1 regression (Lasso)

L1 regularization adds an L1-norm penalty to the least-squares objective: \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{\sum_{i=1}^n\left(y^{(i)}-\beta_0-\sum_{j=1}^p \beta_jx^{(i)}_{j}\right)^2+\lambda\sum_{j=1}^p \left|\beta_j\right|\right\}\,.\end{equation}
Note that the intercept term $\beta_0$ is not regularized, since it only represents a global shift of $y$. We can eliminate $\beta_0$ by centering $\mathbf{y}$ and the columns of $\mathbf{X}$. With centered inputs, we have \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda \left|\left|\boldsymbol{\beta}\right|\right|_1\right\}\,.\tag{1}\end{equation}
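To see why centering removes the intercept, set the derivative of the objective with respect to $\beta_0$ to zero (the penalty does not involve $\beta_0$): \begin{equation*}\hat{\beta}_0=\frac{1}{n}\sum_{i=1}^n\left(y^{(i)}-\sum_{j=1}^p \hat{\beta}_j x^{(i)}_{j}\right)=\bar{y}-\sum_{j=1}^p \hat{\beta}_j\bar{x}_j\,,\end{equation*} which vanishes once $\bar{y}=0$ and $\bar{x}_j=0$ for all $j$.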
Compared with ridge, the lasso not only regularizes but also performs a continuous form of subset (feature) selection, since it produces sparse coefficients. The argument follows from an equivalent constrained formulation of (1): \begin{equation}\min_{\boldsymbol{\beta}}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\,,\quad \text{s.t.} \quad \left|\left|\boldsymbol{\beta}\right|\right|_1\leq t\,.\end{equation}
(Figure: the diamond-shaped lasso constraint region in the two-variable case; taken from Chapter 3.4 of the ESL book.)
As the two-variable case in the figure shows, the lasso optimum tends to sit at a corner of the diamond-shaped constraint region, which forces one of the coefficients to be exactly zero.
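As a quick illustration (not from the ESL book), here is a minimal sketch using scikit-learn's Lasso on synthetic data; note that scikit-learn minimizes $\frac{1}{2n}\left|\left|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right|\right|^2+\alpha\left|\left|\boldsymbol{\beta}\right|\right|_1$, so its $\alpha$ plays the role of $\lambda/(2n)$ in (1). With a moderate penalty, the irrelevant coefficients are driven exactly to zero, while ordinary least squares keeps them all nonzero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only 3 of the 10 features actually matter.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS: typically all 10 coefficients come out nonzero.
ols = LinearRegression().fit(X, y)
print("OLS nonzeros  :", np.sum(np.abs(ols.coef_) > 1e-8))

# Lasso: scikit-learn's alpha corresponds to lambda / (2n) in equation (1).
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso nonzeros:", np.sum(np.abs(lasso.coef_) > 1e-8))
print("Lasso coef    :", np.round(lasso.coef_, 3))
```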

We can solve (1) explicitly for orthonormal inputs, $\mathbf{X}^T\mathbf{X}=\mathbf{I}$. In this case, $\hat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^T\mathbf{y}$, and (1) reduces, up to a constant independent of $\boldsymbol{\beta}$, to $\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})^T(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})+\lambda \left|\left|\boldsymbol{\beta}\right|\right|_1\right\}\,,$ so the coefficients decouple from one another: \begin{equation}\hat{\beta}_j^{\text{lasso}} =\text{argmin}_{\beta}\left\{\left(\beta -\hat{\beta}_j^{\text{ols}}\right)^2+\lambda\left|\beta\right| \right\}\,.\tag{2}\end{equation}
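The reduction follows by expanding the residual sum of squares and using $\mathbf{X}^T\mathbf{X}=\mathbf{I}$ and $\hat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^T\mathbf{y}$: \begin{equation*}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})=\mathbf{y}^T\mathbf{y}-2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}+\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}=(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})^T(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})+\mathbf{y}^T\mathbf{y}-(\hat{\boldsymbol{\beta}}^{\text{ols}})^T\hat{\boldsymbol{\beta}}^{\text{ols}}\,,\end{equation*} where the last two terms do not depend on $\boldsymbol{\beta}$ and can be dropped from the argmin.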
Treating the cases $\beta\geq 0$ and $\beta\leq 0$ separately (and dropping the subscript $j$ and the constant $(\hat{\beta}^{\text{ols}})^2$), a bit of algebra shows that \begin{equation}\min_{\beta\geq 0}\,\beta^2-2\left(\hat{\beta}^{\text{ols}}-\frac{\lambda}{2}\right)\beta=\left\{\begin{array}{lll}-\left(\hat{\beta}^{\text{ols}}-\frac{\lambda}{2}\right)^2 &\text{at } \beta=\hat{\beta}^{\text{ols}}-\frac{\lambda}{2}\,, &\text{when } \hat{\beta}^{\text{ols}}\geq\frac{\lambda}{2}\,,\\ 0 &\text{at } \beta=0\,, &\text{when } \hat{\beta}^{\text{ols}}<\frac{\lambda}{2}\,,\end{array}\right.\end{equation} and \begin{equation}\min_{\beta\leq 0}\,\beta^2-2\left(\hat{\beta}^{\text{ols}}+\frac{\lambda}{2}\right)\beta=\left\{\begin{array}{lll}-\left(\hat{\beta}^{\text{ols}}+\frac{\lambda}{2}\right)^2 &\text{at } \beta=\hat{\beta}^{\text{ols}}+\frac{\lambda}{2}\,, &\text{when } \hat{\beta}^{\text{ols}}\leq-\frac{\lambda}{2}\,,\\ 0 &\text{at } \beta=0\,, &\text{when } \hat{\beta}^{\text{ols}}>-\frac{\lambda}{2}\,.\end{array}\right.\end{equation} Comparing the two branches, the optimum of (2) is the soft-thresholded OLS coefficient \begin{equation}\hat{\beta}_j^{\text{lasso}}=\text{sign}\left(\hat{\beta}_j^{\text{ols}}\right)\text{ReLU}\left(\left|\hat{\beta}_j^{\text{ols}}\right|-\frac{\lambda}{2}\right)\,,\tag{3}\end{equation} where $\text{ReLU}(x)=\max(x,0)$.
Indeed, $\hat{\boldsymbol{\beta}}^{\text{lasso}}$ is sparse and its coefficient $\hat{\beta}_j^{\text{lasso}}$ is nonzero only when $\left|\hat{\beta}_j^{\text{ols}}\right|>\frac{\lambda}{2}$. 
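To sanity-check the closed form, here is a small numerical sketch that applies (3) to a design with orthonormal columns and compares against a generic lasso solver (assuming scikit-learn's parameterization, in which $\alpha=\lambda/(2n)$).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 5

# Build a design matrix with orthonormal columns (X^T X = I) via QR.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

lam = 1.0                      # lambda in equation (1)
beta_ols = X.T @ y             # OLS solution when X^T X = I

# Soft-thresholding, equation (3): sign(b) * ReLU(|b| - lambda/2).
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# Cross-check with scikit-learn, whose objective is
# ||y - Xb||^2 / (2n) + alpha * ||b||_1, i.e. alpha = lambda / (2n).
sk = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, y)

print("closed form :", np.round(beta_lasso, 4))
print("scikit-learn:", np.round(sk.coef_, 4))
```

The two solutions agree up to the solver's tolerance, and coefficients whose OLS magnitude falls below $\lambda/2$ are set exactly to zero.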


