L1 regression (Lasso)

L1 regularization adds an L1-norm penalty to the least-squares objective: \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{\sum_{i=1}^n\left(y^{(i)}-\beta_0-\sum_{j=1}^p \beta_jx^{(i)}_{j}\right)^2+\lambda\sum_{j=1}^p \left|\beta_j\right|\right\}\,.\end{equation}
Note that the intercept term $\beta_0$ is not regularized, since it only represents a global shift of $y$. We can eliminate $\beta_0$ by centering $\mathbf{y}$ and the columns of $\mathbf{X}$. With centered inputs, we have \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda \left|\left|\boldsymbol{\beta}\right|\right|_1\right\}\,.\tag{1}\end{equation}
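To see why centering removes the intercept, set the derivative of the objective with respect to $\beta_0$ to zero (the penalty does not involve $\beta_0$): \begin{equation*}\hat{\beta}_0=\frac{1}{n}\sum_{i=1}^n\left(y^{(i)}-\sum_{j=1}^p \hat{\beta}_j x^{(i)}_{j}\right)=\bar{y}-\sum_{j=1}^p \hat{\beta}_j\bar{x}_j\,,\end{equation*} which vanishes once $\bar{y}=0$ and $\bar{x}_j=0$ for all $j$.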
Compared with ridge, the lasso not only regularizes but also performs a continuous form of subset (feature) selection, since it produces sparse coefficients. The argument follows from an equivalent constrained formulation of (1): \begin{equation}\min_{\boldsymbol{\beta}}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\,,\quad \text{s.t.} \quad \left|\left|\boldsymbol{\beta}\right|\right|_1\leq t\,.\end{equation}
(Figure: the diamond-shaped lasso constraint region in the two-variable case; taken from Chapter 3.4 of the ESL book.)
As the two-variable case in the figure shows, the lasso optimum tends to sit at a corner of the diamond-shaped constraint region, which forces one of the coefficients to be exactly zero.
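As a quick illustration (not from the ESL book), here is a minimal sketch using scikit-learn's Lasso on synthetic data; note that scikit-learn minimizes $\frac{1}{2n}\left|\left|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right|\right|^2+\alpha\left|\left|\boldsymbol{\beta}\right|\right|_1$, so its $\alpha$ plays the role of $\lambda/(2n)$ in (1). With a moderate penalty, the irrelevant coefficients are driven exactly to zero, while ordinary least squares keeps them all nonzero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only 3 of the 10 features actually matter.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS: typically all 10 coefficients come out nonzero.
ols = LinearRegression().fit(X, y)
print("OLS nonzeros  :", np.sum(np.abs(ols.coef_) > 1e-8))

# Lasso: scikit-learn's alpha corresponds to lambda / (2n) in equation (1).
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso nonzeros:", np.sum(np.abs(lasso.coef_) > 1e-8))
print("Lasso coef    :", np.round(lasso.coef_, 3))
```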

We can solve (1) explicitly for orthonormal inputs, $\mathbf{X}^T\mathbf{X}=\mathbf{I}$. In this case, $\hat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^T\mathbf{y}$, and (1) reduces, up to a constant independent of $\boldsymbol{\beta}$, to $\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})^T(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})+\lambda \left|\left|\boldsymbol{\beta}\right|\right|_1\right\}\,,$ so the coefficients decouple from one another: \begin{equation}\hat{\beta}_j^{\text{lasso}} =\text{argmin}_{\beta}\left\{\left(\beta -\hat{\beta}_j^{\text{ols}}\right)^2+\lambda\left|\beta\right| \right\}\,.\tag{2}\end{equation}
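The reduction follows by expanding the residual sum of squares and using $\mathbf{X}^T\mathbf{X}=\mathbf{I}$ and $\hat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^T\mathbf{y}$: \begin{equation*}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})=\mathbf{y}^T\mathbf{y}-2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}+\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}=(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})^T(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}^{\text{ols}})+\mathbf{y}^T\mathbf{y}-(\hat{\boldsymbol{\beta}}^{\text{ols}})^T\hat{\boldsymbol{\beta}}^{\text{ols}}\,,\end{equation*} where the last two terms do not depend on $\boldsymbol{\beta}$ and can be dropped from the argmin.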
Treating the cases $\beta\geq 0$ and $\beta\leq 0$ separately (and dropping the subscript $j$ and the constant $(\hat{\beta}^{\text{ols}})^2$), a bit of algebra shows that \begin{equation}\min_{\beta\geq 0}\,\beta^2-2\left(\hat{\beta}^{\text{ols}}-\frac{\lambda}{2}\right)\beta=\left\{\begin{array}{lll}-\left(\hat{\beta}^{\text{ols}}-\frac{\lambda}{2}\right)^2 &\text{at } \beta=\hat{\beta}^{\text{ols}}-\frac{\lambda}{2}\,, &\text{when } \hat{\beta}^{\text{ols}}\geq\frac{\lambda}{2}\,,\\ 0 &\text{at } \beta=0\,, &\text{when } \hat{\beta}^{\text{ols}}<\frac{\lambda}{2}\,,\end{array}\right.\end{equation} and \begin{equation}\min_{\beta\leq 0}\,\beta^2-2\left(\hat{\beta}^{\text{ols}}+\frac{\lambda}{2}\right)\beta=\left\{\begin{array}{lll}-\left(\hat{\beta}^{\text{ols}}+\frac{\lambda}{2}\right)^2 &\text{at } \beta=\hat{\beta}^{\text{ols}}+\frac{\lambda}{2}\,, &\text{when } \hat{\beta}^{\text{ols}}\leq-\frac{\lambda}{2}\,,\\ 0 &\text{at } \beta=0\,, &\text{when } \hat{\beta}^{\text{ols}}>-\frac{\lambda}{2}\,.\end{array}\right.\end{equation} Comparing the two branches, the optimum of (2) is the soft-thresholded OLS coefficient \begin{equation}\hat{\beta}_j^{\text{lasso}}=\text{sign}\left(\hat{\beta}_j^{\text{ols}}\right)\text{ReLU}\left(\left|\hat{\beta}_j^{\text{ols}}\right|-\frac{\lambda}{2}\right)\,,\tag{3}\end{equation} where $\text{ReLU}(x)=\max(x,0)$.
Indeed, $\hat{\boldsymbol{\beta}}^{\text{lasso}}$ is sparse and its coefficient $\hat{\beta}_j^{\text{lasso}}$ is nonzero only when $\left|\hat{\beta}_j^{\text{ols}}\right|>\frac{\lambda}{2}$. 
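To sanity-check the closed form, here is a small numerical sketch that applies (3) to a design with orthonormal columns and compares against a generic lasso solver (assuming scikit-learn's parameterization, in which $\alpha=\lambda/(2n)$).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 5

# Build a design matrix with orthonormal columns (X^T X = I) via QR.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

lam = 1.0                      # lambda in equation (1)
beta_ols = X.T @ y             # OLS solution when X^T X = I

# Soft-thresholding, equation (3): sign(b) * ReLU(|b| - lambda/2).
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# Cross-check with scikit-learn, whose objective is
# ||y - Xb||^2 / (2n) + alpha * ||b||_1, i.e. alpha = lambda / (2n).
sk = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, y)

print("closed form :", np.round(beta_lasso, 4))
print("scikit-learn:", np.round(sk.coef_, 4))
```

The two solutions agree up to the solver's tolerance, and coefficients whose OLS magnitude falls below $\lambda/2$ are set exactly to zero.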


