Posts

Showing posts from June, 2020

L1 regression (Lasso)

Image
L1 regularization is to add a L1-norm term: \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{\sum_{i=1}^n\left(y^{(i)}-\beta_0-\sum_{j=1}^p \beta_jx^{(i)}_{j}\right)^2+\lambda\sum_{j=1}^p \left|\beta_j\right|\right\}\,.\end{equation} Note that there is no regularization for the interception term $\beta_0$ since $\beta_0$ is a global shift to all $y$. We can remove $\beta_0$ by making $\mathbf{y}$ and $\mathbf{X}$ centered.  With centered inputs , we have \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda \left|\left|\boldsymbol{\beta}\right|\right|_1\right\}\,.\tag{1}\end{equation} Compared with ridge , lasso can be used for continuous subset (feature) selection besides regularization since it leads to sparse coefficients. The argument is from an equivalent optimization problem of (1): \begin{...

L2 regression (Ridge)

L2 regularization is to add a L2-norm term to the optimization problem $\min_{\mathbf{x}}f(\mathbf{x})$ as \begin{equation}\min_{\mathbf{x}}\,l_{\lambda}(\mathbf{x})\equiv f(\mathbf{x})+\frac{\lambda}{2} \left|\left| \mathbf{x}\right|\right|^2_2\,.\tag{1}\end{equation} Remarks: In deep learning, L2 regularization is also referred as weight decay . This is because in the vallina SGD \begin{eqnarray}\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}-\gamma\left[\nabla f(\mathbf{x}^{(t)})+\lambda\mathbf{x}^{(t)}\right]=(1-\gamma\, \lambda)\,\mathbf{x}^{(t)}-\gamma\,\nabla f(\mathbf{x}^{(t)})\,,\end{eqnarray} L_2 regularization is to decay the weight $\mathbf{x}^{(t)}$ by a factor $1-\gamma \lambda$. In more sophisticated optimizers beyond vanilla SGD, L2 regularization is different from weight decay: L2 regularization: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f(\mathbf{x})+\lambda\mathbf{x}$; Weight decay: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla ...

R-squared

Image
Given $n$ samples $y^{(k)}$ and the corresponding fitting values $\hat{y}^{(k)}$, one can evaluate the goodness-of-fit by \begin{equation} R^2 = 1 - \frac{\sum_{k=1}^n \left(y^{(k)}-\hat{y}^{(k)}\right)^2}{\sum_{k=1}^n\left(y^{(k)}-\bar{y}\right)^2}\,,\tag{1}\end{equation} where $\bar{y}\equiv \frac{1}{n}\sum_{i=1}^n y^{(k)}$.  Let $Y=\hat{Y}+e$, we can write $R^2$ in short as \begin{equation}R^2=1-\frac{\mathbb{E}(e^2)}{\text{Var}(Y)}\,.\end{equation} In the following, we will discuss $R^2$ in the scope of linear regression. With interception term, as shown in this post , we have $\mathbb{E}(e)=0$ and $\text{Var}(Y)=\text{Var}(\hat{Y})+\text{Var}(e)$. As a result, \begin{equation}R^2=1-\frac{\text{Var}(e)}{\text{Var}(Y)}=\frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}\geq 0\,.\tag{2}\end{equation} Without interception term, as shown in this  post , we can only have $\mathbb{E}(Y^2)=\mathbb{E}(\hat{Y}^2)+\mathbb{E}(e^2)$. As a result, \begin{equation} R^2=\frac{\mathbb{E}\left(...

Geometric interpretation of linear regression

Image
Hilbert space Given a linear space $V$ over the field $\mathbb{C}$, we define an inner product , i.e, a map $\langle \cdot, \cdot\rangle : \, V\times V \rightarrow \mathbb{C}$ that satisfies the following axioms for all vectors $\mathbf{x}, \mathbf{y}, \mathbf{z}\in V$ and all scalars $\alpha \in \mathbb{C}$: $\langle \mathbf{x}, \mathbf{y}\rangle =\langle \mathbf{y}, \mathbf{x}\rangle^*$ $\langle \mathbf{x}+\mathbf{y}, \mathbf{z}\rangle = \langle \mathbf{x}, \mathbf{z}\rangle + \langle \mathbf{y}, \mathbf{z}\rangle$ $\langle \alpha\,\mathbf{x}, \mathbf{y}\rangle=\alpha \langle \mathbf{x}, \mathbf{y}\rangle$ $\langle \mathbf{x}, \mathbf{x}\rangle \geq 0$. Equality holds iff $\mathbf{x}=\mathbf{0}\,.$ We can then define the norm $\left|\left|\mathbf{x}\right|\right|^2\equiv \langle \mathbf{x}, \mathbf{x}\rangle$. The norm satisfies the following two properties: Cauchy-Schwarz inequality: $\left|\langle \mathbf{x}, \mathbf{y}\rangle\right|^2\leq \left|\left|\mathbf{x}\right|\right|^2\,\...

Interview Problems: Correlation

Problems: (1) Why correlation matrix is positive semi-definite?  (2) For n random variables, if all their pairwise correlations are $\rho$, what is the range of $\rho$? (3) For three random variables X, Y and Z, the correlation between X and Y is $a$ and the correlation between X and Z is $b$. What is the range of correlation between Y and Z? Solve it in three ways. (4) For two covariance matrices A and B, is AB a covariance matrix when AB=BA.  Solution: (1) Since $\text{Var}(\sum_{i} a_iX_i)=\sum_{i,j}a_ia_j\text{Cov}(X_i, X_j) \geq 0$, covariance matrix is positive semi-definite. So does correlation matrix since it is a covariance matrix when all random variables have unit variance.   (2) The correlation matrix is of the form $\boldsymbol{\Sigma}=(1-\rho)\mathbf{I}+\rho\, \mathbf{e}\mathbf{e}^T$ where $\mathbf{e}\equiv[1, \cdots, 1]^T$. $\mathbf{e}\mathbf{e}^T$ has one eigenvector $\mathbf{e}$ with eigenvalue $n$ and $n-1$ eigenvectors th...

Principal components analysis

PCA is a common technique to reduce data in high dimensions to a lower dimensional subspace, either for visualization or data compression.  Suppose we have a dataset consisting of $n$ samples and each sample is of $p$ features. We represent the dataset by a $n\times p$ matrix $\mathbf{X}$. We also assume that $\mathbf{X}$ is centered so that the sample covariance matrix $\boldsymbol{\Sigma}$ in the feature space is: \begin{eqnarray} \boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X}\,.\tag{1}\end{eqnarray} PCA is to find a set of $k\leq p$ feature vectors $\mathbf{v}_1, \cdots, \mathbf{v}_k$ (called principal component directions) such that these new feature vectors have the largest $k$ sample variances $\mathbf{v}^T_1 \boldsymbol{\Sigma} \mathbf{v}_1,\cdots, \mathbf{v}^T_k \boldsymbol{\Sigma} \mathbf{v}_k$. That is, $\mathbf{v}_1, \cdots, \mathbf{v}_k$ are the eigenvectors of $\boldsymbol{\Sigma}$ corresponding to the $k$ largest eigenvalues $\lambda_1\geq \cdots \geq \la...

Generalized linear model

A generalized linear model (GLM) is said to have three components: A random component: $Y\sim \mathbb{P}(Y|\theta)$ in exponential family A linear predictor: $\eta=\mathbf{x}\cdot\boldsymbol{\beta}$ A link function $f$: $f(\mathbb{E}[Y|\theta])=\eta$ Remarks: Given $\mathbf{X}$, GLM predicts the conditional expectation of $Y$ instead of $Y$ itself. In GLM, the link function directly relates the conditional expectation of $Y$ to the linear predictor $\mathbf{x}\cdot\boldsymbol{\beta}$ as $f(\mathbb{E}[Y|\theta]) = \mathbf{x}\cdot\boldsymbol{\beta}$. In contrast, one may transform $Y$ and then do ordinary linear regression, which implies $\mathbb{E}[f(Y)|\theta] = \mathbf{x}\cdot\boldsymbol{\beta}$.  From $\mathbb{P}(Y|\theta)$, we can compute $\mathbb{E}[Y|\theta]=\mu(\theta)$. The linear predictor $\eta$ relates to $\theta$ by $f\left(\mu(\theta)\right)=\eta$. If we set $\eta=\theta$, ${\mu}^{-1}$ is a link function, called canonical link function. For example, the canonical l...

Exponential family

An exponential family is a large class of distributions with the form \begin{equation} \mathbb{P}\left(Y|\theta\right)= \frac{1}{Z(\theta)} g(Y)\,e^{\theta \,H(Y)}\,.\end{equation} Such form is the Boltzmann distribution in statistical physics, where $Y$ is a microstate of the ensemble; $H(Y)$ and $g(Y)$ are the energy and degeneracy; $\theta$ relates to the temperature; Finally, $Z(\theta)$ is the partition function : \begin{equation}Z(\theta)=\int g(y)\,e^{\theta \,H(y)}\,dy\,. \end{equation} As well known in statistical physics, from partition function we can compute \begin{equation}\mu(\theta) \equiv\mathbb{E}\left[H(Y) |\theta\right]=\frac{d}{d\theta} \log Z(\theta)\,,\tag{1}\end{equation}\begin{equation}\sigma^2(\theta) \equiv \text{Var}\left[H(Y)|\theta\right] = \frac{d^2}{d\theta^2} \log Z(\theta)\,. \end{equation}  Given observations $Y_1, \cdots, Y_n$, the log-likelihood of the exponential family is \begin{equation} l_n(\theta)\equiv \frac{1}{n}\log \sum_{i=1}^n\mathbb{P}...

Interview problems: Gaussian

Image
Problems: (1) X, Y are i.i.d. standard normal $\mathcal{N}(0, 1)$. What is $\mathbb{P}(X >3Y| X > 0)$? (2) X, Y, Z are standard normal $\mathcal{N}(0, 1)$ with $\rho_{X, Y}=\rho_{X, Z}=\rho_{Y, Z}=0.5$. What is $\mathbb{P}(X+Y>Z+2)$? (3) X, Y are i.i.d. standard normal $\mathcal{N}(0, 1)$. What is $\mathbb{E}[X|X+Y=1]$? Solution: (1) (X, Y) joint pdf is rotation invariant. Therefore, as shown in the figure, $$\mathbb{P}(X >3Y| X > 0)=\frac{\theta}{\pi} = \frac{1}{2} + \frac{\arctan 1/3}{\pi}.$$ Similarly, we can compute, for example, $\mathbb{P}(X >5Y)=1/2$ or $\mathbb{P}(X>0| X+Y>0)=3/4$. (2) Let $W\equiv X+Y-Z$, $\text{Var}(W)=\text{Cov}(X+Y-Z, X+Y-Z)=2$. As a result, $W\sim \mathcal{N}(0, 2)$ and $\mathbb{P}(W>2)=1-\Phi(\sqrt{2})$. (3) $\mathbb{E}[X|X+Y=1]=\mathbb{E}[X+Y-Y|X+Y=1]=1-\mathbb{E}[Y|X+Y=1]=1-\mathbb{E}[X|X+Y=1]$, and thus $\mathbb{E}[X|X+Y=1]=1/2$. For a more general case, such conditional expectation can be solved by two properties of Gaus...

Limit matters in improper integral

Recently, one of my best friends, Joking, asked me to solve an integral \begin{equation}\int_0^1 \frac{x-1}{\log x}dx\,.\end{equation} His intention is to show me a trick as in this video : Let $I(b)\equiv \int_0^1 \frac{x^b-1}{\log x}dx$. By taking derivative to $b$, we obtain an simple differential equation $I'(b)=\frac{1}{b+1}$. With $I(0)=0$, we solve $I(b)$ and thus have $I(1)=\log 2$. However, my first attempt gave the answer of zero: Let $y=x^2$, $\int_0^1 \frac{x}{\log x}dx=\int_0^1 \frac{dx^2}{\log x^2}=\int_0^1 \frac{dy}{\log y}$. As a result, $\int_0^1 \frac{x-1}{\log x}dx=\int_0^1 \frac{x}{\log x}dx-\int_0^1 \frac{dx}{\log x}=\int_0^1 \frac{dy}{\log y}-\int_0^1 \frac{dx}{\log x}=0$. Clearly, the integral cannot be zero. So what's wrong here? The reason is that this is an improper integral. Strictly speaking, we should solve \begin{equation} \lim_{\epsilon\rightarrow 0} \int_0^{1-\epsilon} \frac{x-1}{\log x}dx\,.\end{equation} The limit here matters. With this limit,...

Interview problem: Chow-Robbins game

Problem: (1) Let's play a game. You toss a fair coin repeatedly. You can stop whenever you want before reaching 100 times. The payoff is the proportion of heads once stopped or reaching 100 times. What is the expected payoff according to the best strategy? (2) There are a random number of pencils and we two take turns to take the pencils. Each time you or me can take 1 or 2 pencils and whoever get the last one loses. Would you like to play first?  (3) Dicing problem:  (a) Toss a dice. You can get face value in dollars or you can choose to try again twice. What is the expected payoff according to the best strategy? (b) If you can get the maximum face values of two tosses, compare this payoff with that in (a) without calculation. (c) What if you can try again at most 50 times? Estimate the number to the 3 decimal. (d) What if you can try again for infinite times but each toss will cost you 1 dollar. The first toss is free.     Solution: (1) This Chow-Robbins game....

Interview problem: linear regression (1)

Problem: A collection of 2D points $(x_i, y_i)$ are i.i.d. randomly sampled from a uniform distribution in a triangular region $0\leq X \leq 1, 0 \leq Y \leq 1, X+Y\geq 3/2$. If we fit a linear regression, say $y=kx+b$, on the sampled points, what are the expectation values of $k$ and $b$? Brute force solution: Linear regression is to solve the optimization problem $\min_{k, b} \mathbb{E}\left(Y-kX-b\right)^2$. As a result, we can derive $k=\frac{\text{Cov}(X, Y)}{Var(X)}$ and $b=\mathbb{E}(Y)-k\,\mathbb{E}(X)$. For the given distribution, we can compute $\mathbb{E}(X)=\mathbb{E}(Y)=5/6$, $\text{Cov}(X, Y)=-1/144$ and $\text{Var}(X)=1/72$. The final answer is $k=-1/2$ and $b=5/4$. Quick solution: As proved in chapter 2.4 of the ESL book , the solution to the optimization problem $\min_f \mathbb{E}(Y-f(X))^2$ is $f(X)=\mathbb{E}(Y|X)$. For the given distribution, $Y|X\sim \text{Uniform}[3/2-X, 1]$. As a result, $\mathbb{E}(Y|X)=\frac{1}{2}(3/2-X+1)=-\frac{1}{2}X + 5/4$, which is linear ...

Deep learning optimizers

There are a lot of optimizers for training deep learning models, among which SGD (stochastic gradient descent) and Adam (Adaptive Momentum Estimation) are mostly used at least in object detection. For example, FAIR loves SGD with momentum 0.9 in their seminal papers including Faster R-CNN , RetinaNet and Mask R-CNN , while Adam are the common optimizer in 3D LiDAR object detections . SGD SGD with momentum in Pytorch in of the form \begin{equation} { \begin{array}{c}\mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\\mathbf{x}^{(t+1)} &=&\mathbf{x}^{(t)} - \gamma\,\mathbf{m}^{(t+1)} \,,\end{array}}\tag{1}\end{equation}  where $\mu$ is the momentum coefficient and $\gamma$ is the learning rate. Remarks: Vanilla SGD updates $\mathbf{x}^{(t)}$ only by its gradient $\nabla f\left(\mathbf{x}^{(t)}\right)$, i.e. $\mu=0$. Here, $\mathbf{x}^{(t)}$ is updated by the exponentially weighted moving average (EWMA) of all the past...

Bilinear upsampling

Image
We perform bilinear upsampling of a source image $\mathcal{S}$  with height h and width w to a destination image $\mathcal{D}$  with height H   and width W : The coordinates of the destination image pixels (orange dots) pulled back in the source image frame (with pixels in blue circles) when h=w=3 and H=4, W=5.   For each pixel $(i, j)$ in $\mathcal{D}$ with $0\leq i < H$ and $0\leq j < W$, pull back to the coordinates $(x, y)$ in $\mathcal{S}$ linearly. As shown in the figure, if aligned with corners: \begin{eqnarray} y = \frac{h-1}{H-1} \,i\,,\quad x = \frac{w-1}{W-1}\, j\,.\nonumber\end{eqnarray} Otherwise, aligned with boundaries: \begin{eqnarray} y = \frac{h}{H}(i+0.5)-0.5\,,\quad x = \frac{w}{W}(j+0.5)-0.5\,,\nonumber\end{eqnarray} in which pixel values are from pixel center points. For each computed $(x, y)$, do bilinear interpolation as in wiki : \begin{eqnarray}\mathcal{D}(i, j)=\left[\begin{array}{cc} \lceil{y}\rceil-y & y-\lfloor{y}\...