
Showing posts with the label deep learning

Note on Denoising Diffusion Probabilistic Models

I've recently discovered a fantastic online course titled "TinyML and Efficient Deep Learning Computing" taught by Prof. Song Han at MIT. This course delves into the latest advancements in large language models and generative AI. While Lecture 16 provides a comprehensive overview of diffusion models and their recent generalizations, it skips some mathematical details regarding Denoising Diffusion Probabilistic Models (DDPM). This post serves as my notes on these skipped mathematical details from the lecture. In particular: we provide a simplified and much more transparent derivation of the training loss than the one presented in the DDPM paper; we show that the dropped $L_T$ term in the DDPM paper should not appear at all if we start with the correct loss; and no special treatment is needed for the $L_0$ term in the DDPM paper, i.e. $L_{t-1}$ is applicable for $t=1$ as well. Forwa...

Balanced L1 loss

Smooth L1 loss. For regression tasks, we usually start with the L2 loss function ${\cal{l}}_2(x)=x^2 / 2$. Since its gradient is linear, i.e., ${\cal{l}}'_2(x)=x \propto\sqrt{{\cal{l}}_2(x)}$, the batch gradient is dominated by data examples with large L2 losses (outliers). Starting from Fast R-CNN, one usually applies the so-called smooth L1 loss for object detection. The motivation is very simple: truncate the gradient from linear to some constant $\gamma$ when $|x|$ is too large, leading to the truncated gradient function \begin{equation} \frac{d}{dx}\text{Smooth-}{\cal{l}}_1(x) = \begin{cases} x & |x| < 1 \\ \pm 1 & |x| \geq 1 \end{cases}\,,\tag{1}\end{equation} in which $\gamma=1$ for gradient continuity at $|x|=1$. In addition, one can introduce a scaling factor $\beta$ by replacing $x$ in Eq. (1) with $x/\beta$ and obtain a more general gradient function as \begin{equation} \frac{d}{dx}\text{Smooth-}{\cal{l}}_1(x) = \begin{cases} x/\bet...
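Since the excerpt cuts off before the generalized form, here is a minimal NumPy sketch of the smooth L1 loss and its truncated gradient with the scaling factor $\beta$; the function names and the NumPy implementation are my own assumptions, not code from the post.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss: quadratic for |x| < beta, linear beyond (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    quadratic = 0.5 * x**2 / beta       # gradient x / beta for |x| < beta
    linear = np.abs(x) - 0.5 * beta     # gradient sign(x) for |x| >= beta
    return np.where(np.abs(x) < beta, quadratic, linear)

def smooth_l1_grad(x, beta=1.0):
    """Gradient of the smooth L1 loss, truncated to +/-1 outside |x| < beta."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < beta, x / beta, np.sign(x))

print(smooth_l1([0.3, 2.0]), smooth_l1_grad([0.3, 2.0]))
```

The loss is continuous at $|x|=\beta$ and its gradient reproduces Eq. (1) when $\beta=1$.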

Note on convolutions

For simplicity, we consider a 1d convolution on a 1d input array as an example. The most important parameters of a convolution are the kernel size, stride and padding, denoted by $K, S, P$, respectively. Let $I$ be the size of the 1d input array; the indices of the padded array then range from $-P$ to $I-1+P$ (inclusive). The convolution can be interpreted in a sliding-window picture: at the start (step 0), the first element in the window is aligned at index $-P$ of the input array; in each step, the window slides $S$ elements along the input array. That is, in the $i$-th step, the first element in the window is aligned at index $-P+iS$ and the last element is thus aligned at index $-P+iS+K-1$, leading to the constraint $-P+iS+K-1\leq I-1+P$. As a result, we have $0 \leq i \leq (I + 2P-K) / S$ and the size of the output array is \begin{equation}\left\lfloor\frac{I + 2P - K}{S}\right\rfloor + 1\,.\end{equation} The above process can be summarized in the python code: ...
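The post's python code is truncated in the excerpt above; as a reconstruction under my own assumptions (plain Python lists, zero padding, and cross-correlation without kernel flipping), the sliding-window picture might look like:

```python
def conv1d_output_size(I, K, S=1, P=0):
    """Number of valid sliding-window positions: floor((I + 2P - K) / S) + 1."""
    return (I + 2 * P - K) // S + 1

def conv1d(x, kernel, stride=1, padding=0):
    """Naive 1d convolution following the sliding-window picture above."""
    I, K = len(x), len(kernel)
    padded = [0.0] * padding + list(x) + [0.0] * padding  # indices -P .. I-1+P
    out = []
    for i in range(conv1d_output_size(I, K, stride, padding)):
        start = i * stride                                # window starts at index -P + i*S
        window = padded[start:start + K]
        out.append(sum(w * k for w, k in zip(window, kernel)))
    return out

print(conv1d([1, 2, 3, 4, 5], [1, 0, -1], stride=1, padding=1))  # 5 outputs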

L2 regression (Ridge)

L2 regularization adds an L2-norm term to the optimization problem $\min_{\mathbf{x}}f(\mathbf{x})$ as \begin{equation}\min_{\mathbf{x}}\,l_{\lambda}(\mathbf{x})\equiv f(\mathbf{x})+\frac{\lambda}{2} \left|\left| \mathbf{x}\right|\right|^2_2\,.\tag{1}\end{equation} Remarks: In deep learning, L2 regularization is also referred to as weight decay . This is because in vanilla SGD \begin{eqnarray}\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}-\gamma\left[\nabla f(\mathbf{x}^{(t)})+\lambda\mathbf{x}^{(t)}\right]=(1-\gamma\, \lambda)\,\mathbf{x}^{(t)}-\gamma\,\nabla f(\mathbf{x}^{(t)})\,,\end{eqnarray} L2 regularization decays the weight $\mathbf{x}^{(t)}$ by a factor $1-\gamma \lambda$. In more sophisticated optimizers beyond vanilla SGD, L2 regularization is different from weight decay: L2 regularization: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f(\mathbf{x})+\lambda\mathbf{x}$; Weight decay: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla ...
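To make the weight-decay remark concrete, here is a toy sketch (my own example, not from the post) showing that for vanilla SGD the two formulations produce the same update:

```python
import numpy as np

def sgd_step_l2(x, grad_f, lr, lam):
    """Vanilla SGD step with L2 regularization: add lam * x to the gradient."""
    return x - lr * (grad_f(x) + lam * x)

def sgd_step_weight_decay(x, grad_f, lr, lam):
    """Vanilla SGD step with weight decay: shrink x by (1 - lr*lam), then step."""
    return (1.0 - lr * lam) * x - lr * grad_f(x)

grad_f = lambda x: 2.0 * (x - 1.0)      # gradient of the toy objective ||x - 1||^2
x0 = np.array([3.0, -2.0])
assert np.allclose(sgd_step_l2(x0, grad_f, lr=0.1, lam=0.01),
                   sgd_step_weight_decay(x0, grad_f, lr=0.1, lam=0.01))
```

With momentum or adaptive scaling the two branches no longer coincide, which is exactly the distinction the remark draws.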

Deep learning optimizers

There are a lot of optimizers for training deep learning models, among which SGD (stochastic gradient descent) and Adam (adaptive moment estimation) are the most used, at least in object detection. For example, FAIR loves SGD with momentum 0.9 in their seminal papers including Faster R-CNN , RetinaNet and Mask R-CNN , while Adam is the common optimizer in 3D LiDAR object detection . SGD. SGD with momentum in PyTorch is of the form \begin{equation}\begin{array}{rcl}\mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}\right)\,,\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\mathbf{m}^{(t+1)}\,,\end{array}\tag{1}\end{equation} where $\mu$ is the momentum coefficient and $\gamma$ is the learning rate. Remarks: Vanilla SGD updates $\mathbf{x}^{(t)}$ only by its gradient $\nabla f\left(\mathbf{x}^{(t)}\right)$, i.e. $\mu=0$. Here, $\mathbf{x}^{(t)}$ is updated by the exponentially weighted moving average (EWMA) of all the past...
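A minimal sketch of the Eq. (1) update rule on a toy quadratic objective; the helper name and default parameters below are my own choices for illustration, not PyTorch's internal implementation.

```python
import numpy as np

def sgd_momentum(grad_f, x0, lr=0.01, mu=0.9, steps=200):
    """SGD with momentum in the convention of Eq. (1):
    m <- mu * m + grad(x),  x <- x - lr * m."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        m = mu * m + grad_f(x)   # accumulate an EWMA-like sum of past gradients
        x = x - lr * m
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is simply x.
print(sgd_momentum(lambda x: x, x0=[5.0, -3.0]))   # approaches the minimum at 0
```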

Bilinear upsampling

We perform bilinear upsampling of a source image $\mathcal{S}$ with height $h$ and width $w$ to a destination image $\mathcal{D}$ with height $H$ and width $W$. The figure shows the coordinates of the destination image pixels (orange dots) pulled back into the source image frame (with pixels in blue circles) when $h=w=3$ and $H=4$, $W=5$. For each pixel $(i, j)$ in $\mathcal{D}$ with $0\leq i < H$ and $0\leq j < W$, pull back to the coordinates $(x, y)$ in $\mathcal{S}$ linearly. As shown in the figure, if aligned with corners: \begin{eqnarray} y = \frac{h-1}{H-1} \,i\,,\quad x = \frac{w-1}{W-1}\, j\,.\nonumber\end{eqnarray} Otherwise, if aligned with boundaries: \begin{eqnarray} y = \frac{h}{H}(i+0.5)-0.5\,,\quad x = \frac{w}{W}(j+0.5)-0.5\,,\nonumber\end{eqnarray} in which pixel values are taken at pixel center points. For each computed $(x, y)$, do bilinear interpolation as in wiki : \begin{eqnarray}\mathcal{D}(i, j)=\left[\begin{array}{cc} \lceil{y}\rceil-y & y-\lfloor{y}\...
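A self-contained sketch of the procedure above with both coordinate mappings; the function name, the border clamping, and the plain NumPy loops are my own assumptions rather than the post's code.

```python
import numpy as np

def bilinear_upsample(src, H, W, align_corners=False):
    """Bilinear upsampling of a 2d array src (h x w) to size (H x W).
    Assumes H, W > 1 when align_corners=True."""
    h, w = src.shape
    dst = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            if align_corners:                      # corners map to corners
                y = (h - 1) / (H - 1) * i
                x = (w - 1) / (W - 1) * j
            else:                                  # pixel centers, boundaries aligned
                y = h / H * (i + 0.5) - 0.5
                x = w / W * (j + 0.5) - 0.5
            y = min(max(y, 0.0), h - 1)            # clamp so neighbours stay in bounds
            x = min(max(x, 0.0), w - 1)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            top = (1 - dx) * src[y0, x0] + dx * src[y0, x1]
            bottom = (1 - dx) * src[y1, x0] + dx * src[y1, x1]
            dst[i, j] = (1 - dy) * top + dy * bottom
    return dst

# The figure's setting: h = w = 3 upsampled to H = 4, W = 5.
src = np.arange(9, dtype=float).reshape(3, 3)
print(bilinear_upsample(src, 4, 5, align_corners=True))
```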