
Showing posts with the label deep learning

Note on Denoising Diffusion Probabilistic Models

I've recently discovered a fantastic online course titled "TinyML and Efficient Deep Learning Computing" taught by Prof. Song Han at MIT. This course delves into the latest advancements in large language models and generative AI. While Lecture 16 provides a comprehensive overview of diffusion models and their recent generalizations, it skips some mathematical details regarding Denoising Diffusion Probabilistic Models (DDPM). This post serves as my notes on these skipped mathematical details from the lecture. In particular: we provide a simplified and much more transparent derivation of the training loss than the one presented in the DDPM paper; we show that the dropped $L_T$ term in the DDPM paper should not appear at all if we start with the correct loss; and no special treatment is needed for the $L_0$ term in the DDPM paper, i.e. $L_{t-1}$ is applicable for $t=1$ as well. Forwa...

Balanced L1 loss

Smooth L1 loss. For regression tasks, we usually start with the L2 loss function ${\cal{l}}_2(x)=x^2 / 2$. Since its gradient is linear, i.e., ${\cal{l}}'_2(x)=x \propto\sqrt{{\cal{l}}_2(x)}$, the batch gradient is dominated by data examples with large L2 losses (outliers). Starting from Fast R-CNN, one usually applies the so-called smooth L1 loss for object detection. The motivation is very simple: truncate the gradient from linear to some constant $\gamma$ when $|x|$ is too large, leading to the truncated gradient function \begin{equation} \frac{d}{dx}\text{Smooth-}{\cal{l}}_1(x) = \begin{cases} x & |x| < 1 \\ \pm 1 & |x| \geq 1 \end{cases}\,,\tag{1}\end{equation} in which $\gamma=1$ for gradient continuity at $|x|=1$. In addition, one can introduce a scaling factor $\beta$ by replacing $x$ in Eq. (1) with $x/\beta$ and obtain a more general gradient function as \begin{equation} \frac{d}{dx}\text{Smooth-}{\cal{l}}_1(x) = \begin{cases} x/\bet...
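Since the excerpt cuts off before the generalized form, here is a minimal NumPy sketch of the smooth L1 loss and its truncated gradient with the scaling factor $\beta$; the function names and the NumPy implementation are my own assumptions, not code from the post.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss: quadratic for |x| < beta, linear beyond (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    quadratic = 0.5 * x**2 / beta       # gradient x / beta for |x| < beta
    linear = np.abs(x) - 0.5 * beta     # gradient sign(x) for |x| >= beta
    return np.where(np.abs(x) < beta, quadratic, linear)

def smooth_l1_grad(x, beta=1.0):
    """Gradient of the smooth L1 loss, truncated to +/-1 outside |x| < beta."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < beta, x / beta, np.sign(x))

print(smooth_l1([0.3, 2.0]), smooth_l1_grad([0.3, 2.0]))
```

The loss is continuous at $|x|=\beta$ and its gradient reproduces Eq. (1) when $\beta=1$.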

Note on convolutions

For simplicity, we consider a 1d convolution on a 1d input array as an example. The most important parameters of a convolution are the kernel size, stride and padding, denoted by $K, S, P$, respectively. Let $I$ be the size of the 1d input array; the indices of the padded array then range from $-P$ to $I-1+P$ (inclusive). The convolution can be interpreted in a sliding-window picture: at the start (step 0), the first element in the window is aligned at index $-P$ of the input array; in each step, the window slides $S$ elements along the input array. That is, in the $i$-th step, the first element in the window is aligned at index $-P+iS$ and the last element is thus aligned at index $-P+iS+K-1$, leading to the constraint $-P+iS+K-1\leq I-1+P$. As a result, we have $0 \leq i \leq (I + 2P-K) / S$ and the size of the output array is \begin{equation}\left\lfloor\frac{I + 2P - K}{S}\right\rfloor + 1\,.\end{equation} The above process can be summarized in the python code: ...
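The post's python code is truncated in the excerpt above; as a reconstruction under my own assumptions (plain Python lists, zero padding, and cross-correlation without kernel flipping), the sliding-window picture might look like:

```python
def conv1d_output_size(I, K, S=1, P=0):
    """Number of valid sliding-window positions: floor((I + 2P - K) / S) + 1."""
    return (I + 2 * P - K) // S + 1

def conv1d(x, kernel, stride=1, padding=0):
    """Naive 1d convolution following the sliding-window picture above."""
    I, K = len(x), len(kernel)
    padded = [0.0] * padding + list(x) + [0.0] * padding  # indices -P .. I-1+P
    out = []
    for i in range(conv1d_output_size(I, K, stride, padding)):
        start = i * stride                                # window starts at index -P + i*S
        window = padded[start:start + K]
        out.append(sum(w * k for w, k in zip(window, kernel)))
    return out

print(conv1d([1, 2, 3, 4, 5], [1, 0, -1], stride=1, padding=1))  # 5 outputs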

L2 regression (Ridge)

L2 regularization adds an L2-norm term to the optimization problem $\min_{\mathbf{x}}f(\mathbf{x})$ as \begin{equation}\min_{\mathbf{x}}\,l_{\lambda}(\mathbf{x})\equiv f(\mathbf{x})+\frac{\lambda}{2} \left|\left| \mathbf{x}\right|\right|^2_2\,.\tag{1}\end{equation} Remarks: In deep learning, L2 regularization is also referred to as weight decay . This is because in vanilla SGD \begin{eqnarray}\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}-\gamma\left[\nabla f(\mathbf{x}^{(t)})+\lambda\mathbf{x}^{(t)}\right]=(1-\gamma\, \lambda)\,\mathbf{x}^{(t)}-\gamma\,\nabla f(\mathbf{x}^{(t)})\,,\end{eqnarray} L2 regularization decays the weight $\mathbf{x}^{(t)}$ by a factor $1-\gamma \lambda$. In more sophisticated optimizers beyond vanilla SGD, L2 regularization is different from weight decay: L2 regularization: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f(\mathbf{x})+\lambda\mathbf{x}$; Weight decay: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla ...
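To make the weight-decay remark concrete, here is a toy sketch (my own example, not from the post) showing that for vanilla SGD the two formulations produce the same update:

```python
import numpy as np

def sgd_step_l2(x, grad_f, lr, lam):
    """Vanilla SGD step with L2 regularization: add lam * x to the gradient."""
    return x - lr * (grad_f(x) + lam * x)

def sgd_step_weight_decay(x, grad_f, lr, lam):
    """Vanilla SGD step with weight decay: shrink x by (1 - lr*lam), then step."""
    return (1.0 - lr * lam) * x - lr * grad_f(x)

grad_f = lambda x: 2.0 * (x - 1.0)      # gradient of the toy objective ||x - 1||^2
x0 = np.array([3.0, -2.0])
assert np.allclose(sgd_step_l2(x0, grad_f, lr=0.1, lam=0.01),
                   sgd_step_weight_decay(x0, grad_f, lr=0.1, lam=0.01))
```

With momentum or adaptive scaling the two branches no longer coincide, which is exactly the distinction the remark draws.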

Deep learning optimizers

There are a lot of optimizers for training deep learning models, among which SGD (stochastic gradient descent) and Adam (adaptive moment estimation) are the most used, at least in object detection. For example, FAIR loves SGD with momentum 0.9 in their seminal papers including Faster R-CNN , RetinaNet and Mask R-CNN , while Adam is the common optimizer in 3D LiDAR object detection . SGD. SGD with momentum in PyTorch is of the form \begin{equation}\begin{array}{rcl}\mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}\right)\,,\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\mathbf{m}^{(t+1)}\,,\end{array}\tag{1}\end{equation} where $\mu$ is the momentum coefficient and $\gamma$ is the learning rate. Remarks: Vanilla SGD updates $\mathbf{x}^{(t)}$ only by its gradient $\nabla f\left(\mathbf{x}^{(t)}\right)$, i.e. $\mu=0$. Here, $\mathbf{x}^{(t)}$ is updated by the exponentially weighted moving average (EWMA) of all the past...
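A minimal sketch of the Eq. (1) update rule on a toy quadratic objective; the helper name and default parameters below are my own choices for illustration, not PyTorch's internal implementation.

```python
import numpy as np

def sgd_momentum(grad_f, x0, lr=0.01, mu=0.9, steps=200):
    """SGD with momentum in the convention of Eq. (1):
    m <- mu * m + grad(x),  x <- x - lr * m."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        m = mu * m + grad_f(x)   # accumulate an EWMA-like sum of past gradients
        x = x - lr * m
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is simply x.
print(sgd_momentum(lambda x: x, x0=[5.0, -3.0]))   # approaches the minimum at 0
```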

Bilinear upsampling

We perform bilinear upsampling of a source image $\mathcal{S}$ with height $h$ and width $w$ to a destination image $\mathcal{D}$ with height $H$ and width $W$. The figure shows the coordinates of the destination image pixels (orange dots) pulled back into the source image frame (with pixels in blue circles) when $h=w=3$ and $H=4$, $W=5$. For each pixel $(i, j)$ in $\mathcal{D}$ with $0\leq i < H$ and $0\leq j < W$, pull back to the coordinates $(x, y)$ in $\mathcal{S}$ linearly. As shown in the figure, if aligned with corners: \begin{eqnarray} y = \frac{h-1}{H-1} \,i\,,\quad x = \frac{w-1}{W-1}\, j\,.\nonumber\end{eqnarray} Otherwise, if aligned with boundaries: \begin{eqnarray} y = \frac{h}{H}(i+0.5)-0.5\,,\quad x = \frac{w}{W}(j+0.5)-0.5\,,\nonumber\end{eqnarray} in which pixel values are taken at pixel center points. For each computed $(x, y)$, do bilinear interpolation as in wiki : \begin{eqnarray}\mathcal{D}(i, j)=\left[\begin{array}{cc} \lceil{y}\rceil-y & y-\lfloor{y}\...
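A self-contained sketch of the procedure above with both coordinate mappings; the function name, the border clamping, and the plain NumPy loops are my own assumptions rather than the post's code.

```python
import numpy as np

def bilinear_upsample(src, H, W, align_corners=False):
    """Bilinear upsampling of a 2d array src (h x w) to size (H x W).
    Assumes H, W > 1 when align_corners=True."""
    h, w = src.shape
    dst = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            if align_corners:                      # corners map to corners
                y = (h - 1) / (H - 1) * i
                x = (w - 1) / (W - 1) * j
            else:                                  # pixel centers, boundaries aligned
                y = h / H * (i + 0.5) - 0.5
                x = w / W * (j + 0.5) - 0.5
            y = min(max(y, 0.0), h - 1)            # clamp so neighbours stay in bounds
            x = min(max(x, 0.0), w - 1)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            top = (1 - dx) * src[y0, x0] + dx * src[y0, x1]
            bottom = (1 - dx) * src[y1, x0] + dx * src[y1, x1]
            dst[i, j] = (1 - dy) * top + dy * bottom
    return dst

# The figure's setting: h = w = 3 upsampled to H = 4, W = 5.
src = np.arange(9, dtype=float).reshape(3, 3)
print(bilinear_upsample(src, 4, 5, align_corners=True))
```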