Posts

Showing posts with the label pytorch

Balanced L1 loss

Smooth L1 loss   For regression tasks, we usually start with the L2 loss function ${\cal{l}}_2(x)=x^2 / 2$. Since its gradient is linear, i.e., ${\cal{l}}'_2(x)=x \propto\sqrt{{\cal{l}}_2(x)}$, the batch gradient is dominated by data examples with large L2 losses (outliers). Starting from Fast R-CNN, one usually applies the so-called smooth L1 loss for object detection. The motivation is very simple: truncate the gradient from linear to some constant $\gamma$ when $|x|$ is too large, leading to the truncated gradient function \begin{equation} \frac{d}{dx}\text{Smooth-}{\cal{l}}_1(x) = \begin{cases} x & |x| < 1 \\ \pm 1 & |x| \geq 1 \end{cases}\,,\tag{1}\end{equation} in which $\gamma=1$ for gradient continuity at $|x|=1$. In addition, one can introduce a scaling factor $\beta$ by replacing $x$ in Eq. (1) with $x/\beta$ and obtain a more general gradient function as \begin{equation} \frac{d}{dx}\text{Smooth-}{\cal{l}}_1(x) =  \begin{cases} x/\bet...
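A minimal PyTorch sketch of a smooth L1 loss with this scaling factor $\beta$, whose gradient matches Eq. (1) with $x$ replaced by $x/\beta$, might look as follows; the function name smooth_l1 is just for illustration (PyTorch's built-in F.smooth_l1_loss also takes a beta argument):

```python
import torch

def smooth_l1(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1 loss on the residual x: gradient is x/beta for |x| < beta and ±1 otherwise."""
    abs_x = x.abs()
    return torch.where(abs_x < beta,
                       0.5 * x ** 2 / beta,   # quadratic region, gradient x / beta
                       abs_x - 0.5 * beta)    # linear region, gradient ±1

# Quick check of the gradient against Eq. (1) with beta = 1
x = torch.tensor([0.3, 2.0, -3.0], requires_grad=True)
smooth_l1(x, beta=1.0).sum().backward()
print(x.grad)  # tensor([ 0.3000,  1.0000, -1.0000])
```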

L2 regression (Ridge)

L2 regularization adds an L2-norm term to the optimization problem $\min_{\mathbf{x}}f(\mathbf{x})$ as \begin{equation}\min_{\mathbf{x}}\,l_{\lambda}(\mathbf{x})\equiv f(\mathbf{x})+\frac{\lambda}{2} \left|\left| \mathbf{x}\right|\right|^2_2\,.\tag{1}\end{equation} Remarks: In deep learning, L2 regularization is also referred to as weight decay. This is because in vanilla SGD \begin{eqnarray}\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}-\gamma\left[\nabla f(\mathbf{x}^{(t)})+\lambda\mathbf{x}^{(t)}\right]=(1-\gamma\, \lambda)\,\mathbf{x}^{(t)}-\gamma\,\nabla f(\mathbf{x}^{(t)})\,,\end{eqnarray} L2 regularization decays the weight $\mathbf{x}^{(t)}$ by a factor of $1-\gamma \lambda$. In more sophisticated optimizers beyond vanilla SGD, L2 regularization is different from weight decay: L2 regularization: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f(\mathbf{x})+\lambda\mathbf{x}$; Weight decay: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla ...
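As a quick sanity check of the identity above, here is a minimal PyTorch sketch (hypothetical helpers sgd_step_l2 and sgd_step_decay) showing that, for one vanilla SGD step, adding $\lambda\mathbf{x}$ to the gradient and decaying the weight by $1-\gamma\lambda$ give the same update:

```python
import torch

def sgd_step_l2(x, grad_f, lr=0.1, lam=1e-2):
    # L2 regularization: replace grad f(x) by grad f(x) + lambda * x
    return x - lr * (grad_f + lam * x)

def sgd_step_decay(x, grad_f, lr=0.1, lam=1e-2):
    # Weight-decay view: shrink x by (1 - lr * lambda), then take the plain gradient step
    return (1 - lr * lam) * x - lr * grad_f

x = torch.tensor([1.0, -2.0, 3.0])
g = torch.tensor([0.5, 0.5, 0.5])
print(torch.allclose(sgd_step_l2(x, g), sgd_step_decay(x, g)))  # True
```

In vanilla SGD the two views coincide exactly; the distinction only matters once the gradient is further transformed, e.g. by momentum or adaptive scaling.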

Deep learning optimizers

There are many optimizers for training deep learning models, among which SGD (stochastic gradient descent) and Adam (adaptive moment estimation) are the most widely used, at least in object detection. For example, FAIR loves SGD with momentum 0.9 in their seminal papers including Faster R-CNN, RetinaNet and Mask R-CNN, while Adam is the common optimizer in 3D LiDAR object detection. SGD   SGD with momentum in PyTorch is of the form \begin{equation}\begin{array}{rcl}\mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}\right)\,,\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\mathbf{m}^{(t+1)}\,,\end{array}\tag{1}\end{equation} where $\mu$ is the momentum coefficient and $\gamma$ is the learning rate. Remarks: Vanilla SGD updates $\mathbf{x}^{(t)}$ only by its gradient $\nabla f\left(\mathbf{x}^{(t)}\right)$, i.e. $\mu=0$. Here, $\mathbf{x}^{(t)}$ is updated by the exponentially weighted moving average (EWMA) of all the past...
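The update in Eq. (1) is simple enough to write out directly. Below is a minimal sketch (no dampening, Nesterov, or weight decay) using a hypothetical helper sgd_momentum_step rather than PyTorch's actual torch.optim.SGD implementation:

```python
import torch

def sgd_momentum_step(x, grad, m, lr=0.01, mu=0.9):
    """One step of Eq. (1): m <- mu * m + grad, then x <- x - lr * m."""
    m = mu * m + grad          # accumulate past gradients in the momentum buffer
    x = x - lr * m             # update the parameters with the buffer
    return x, m

# Toy run on f(x) = 0.5 * ||x||^2, whose gradient at x is simply x
x = torch.tensor([1.0, -2.0])
m = torch.zeros_like(x)
for _ in range(5):
    x, m = sgd_momentum_step(x, grad=x, lr=0.1, mu=0.9)
print(x)
```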

Bilinear upsampling

We perform bilinear upsampling of a source image $\mathcal{S}$ with height $h$ and width $w$ to a destination image $\mathcal{D}$ with height $H$ and width $W$. (Figure: the coordinates of the destination image pixels (orange dots) pulled back into the source image frame (pixels as blue circles) when $h=w=3$ and $H=4$, $W=5$.) For each pixel $(i, j)$ in $\mathcal{D}$ with $0\leq i < H$ and $0\leq j < W$, pull back to the coordinates $(x, y)$ in $\mathcal{S}$ linearly. As shown in the figure, if aligned with corners: \begin{eqnarray} y = \frac{h-1}{H-1} \,i\,,\quad x = \frac{w-1}{W-1}\, j\,.\nonumber\end{eqnarray} Otherwise, if aligned with boundaries: \begin{eqnarray} y = \frac{h}{H}(i+0.5)-0.5\,,\quad x = \frac{w}{W}(j+0.5)-0.5\,,\nonumber\end{eqnarray} in which pixel values are taken at pixel center points. For each computed $(x, y)$, do bilinear interpolation as in the Wikipedia article: \begin{eqnarray}\mathcal{D}(i, j)=\left[\begin{array}{cc} \lceil{y}\rceil-y & y-\lfloor{y}\...
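A minimal PyTorch sketch of this pull-back plus bilinear interpolation (hypothetical helper bilinear_upsample, single-channel input) might look as follows; the align_corners=True branch can be checked against torch.nn.functional.interpolate:

```python
import torch
import torch.nn.functional as F

def bilinear_upsample(src, H, W, align_corners=True):
    """Upsample an (h, w) tensor to (H, W) via the pull-back described above."""
    h, w = src.shape
    i = torch.arange(H, dtype=src.dtype)
    j = torch.arange(W, dtype=src.dtype)
    if align_corners:                       # align with corners
        y = (h - 1) / (H - 1) * i
        x = (w - 1) / (W - 1) * j
    else:                                   # align with boundaries, clamped to the image
        y = (h / H * (i + 0.5) - 0.5).clamp(0, h - 1)
        x = (w / W * (j + 0.5) - 0.5).clamp(0, w - 1)
    y0, x0 = y.floor().long(), x.floor().long()
    y1, x1 = (y0 + 1).clamp(max=h - 1), (x0 + 1).clamp(max=w - 1)
    wy, wx = (y - y.floor()).view(-1, 1), (x - x.floor()).view(1, -1)
    top = src[y0][:, x0] * (1 - wx) + src[y0][:, x1] * wx   # interpolate along x at row y0
    bot = src[y1][:, x0] * (1 - wx) + src[y1][:, x1] * wx   # interpolate along x at row y1 (= y0 + 1, clamped)
    return top * (1 - wy) + bot * wy                        # then interpolate along y

# Check against PyTorch's bilinear interpolation for the h=w=3, H=4, W=5 example
src = torch.arange(9, dtype=torch.float32).reshape(3, 3)
ours = bilinear_upsample(src, 4, 5, align_corners=True)
ref = F.interpolate(src[None, None], size=(4, 5), mode="bilinear", align_corners=True)[0, 0]
print(torch.allclose(ours, ref))  # True
```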