Deep learning optimizers

There are many optimizers for training deep learning models, among which SGD (stochastic gradient descent) and Adam (Adaptive Moment Estimation) are the most widely used, at least in object detection. For example, FAIR favors SGD with momentum 0.9 in seminal papers such as Faster R-CNN, RetinaNet and Mask R-CNN, while Adam is the common choice in 3D LiDAR object detection.

SGD

SGD with momentum in PyTorch is of the form \begin{eqnarray} \mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\mathbf{m}^{(t+1)}\,, \tag{1}\end{eqnarray} where $\mu$ is the momentum coefficient and $\gamma$ is the learning rate.
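As a concrete reference, here is a minimal NumPy sketch of update (1) on a toy quadratic objective (the objective and all names are my own illustration, not PyTorch's actual implementation):

```python
import numpy as np

def sgd_momentum_step(x, m, grad, mu=0.9, gamma=0.01):
    """One step of update (1): PyTorch-style SGD with momentum."""
    m = mu * m + grad           # m^(t+1) = mu * m^(t) + grad f(x^(t))
    x = x - gamma * m           # x^(t+1) = x^(t) - gamma * m^(t+1)
    return x, m

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    x, m = sgd_momentum_step(x, m, grad=x)
print(x)  # close to the minimizer [0, 0]
```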
Remarks:
  • Vanilla SGD updates $\mathbf{x}^{(t)}$ only by its gradient $\nabla f\left(\mathbf{x}^{(t)}\right)$, i.e. $\mu=0$.

  • Here, $\mathbf{x}^{(t)}$ is updated by an exponentially weighted sum of all the past gradients: \begin{equation} \mathbf{m}^{(t+1)} = \sum_{\tau=0} ^{t} \mu^{t-\tau} \,\nabla f\left(\mathbf{x}^{(\tau)}\right)\,.\nonumber \end{equation} To me, this exponentially weighted moving average (EWMA) view is more transparent than the physics analogy of momentum and acceleration.

  • At first glance, the EWMA update should read \begin{eqnarray} \mathbf{m}^{(t+1)}&=&\mu\,\mathbf{m}^{(t)} + (1-\mu) \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \alpha\,\mathbf{m}^{(t+1)} \,,\nonumber \end{eqnarray} which is the same as the PyTorch implementation (1) with $\gamma = \alpha (1-\mu)$.

  • The PyTorch implementation (1) relates to the version in Sutskever et al.'s paper \begin{eqnarray} \mathbf{q}^{(t+1)} &=& \nu\,\mathbf{q}^{(t)} + \gamma\,\nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \mathbf{q}^{(t+1)} \,,\nonumber \end{eqnarray} by $\mathbf{m}^{(t)} =\mathbf{q}^{(t)} / \gamma$ and $\mu = \nu$. Both reparameterizations are checked numerically in the sketch after these remarks.
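A small numerical check of the two reparameterizations above (the toy quadratic and constants are my own choices):

```python
import numpy as np

mu, gamma = 0.9, 0.01
alpha = gamma / (1 - mu)              # so that gamma = alpha * (1 - mu)

x1 = x2 = x3 = np.array([1.0, -2.0])  # same start for all three recursions
m = m_ewma = q = np.zeros(2)

for _ in range(50):
    g1, g2, g3 = x1, x2, x3           # gradient of f(x) = 0.5 * ||x||^2
    # PyTorch form (1)
    m = mu * m + g1
    x1 = x1 - gamma * m
    # (1 - mu)-weighted EWMA form with step size alpha
    m_ewma = mu * m_ewma + (1 - mu) * g2
    x2 = x2 - alpha * m_ewma
    # Sutskever-style form with nu = mu
    q = mu * q + gamma * g3
    x3 = x3 - q

print(np.allclose(x1, x2), np.allclose(x1, x3))  # True True
```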

Furthermore, NAG (Nesterov accelerated gradient) computes the gradient at an approximate future position of the parameters rather than at the current position: \begin{eqnarray} \mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}-\gamma\,\mu\,\mathbf{m}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\mathbf{m}^{(t+1)} \,.\nonumber \end{eqnarray} By the change of variables $\mathbf{x}^{(t)}\leftarrow \mathbf{x}^{(t)}-\gamma\,\mu\,\mathbf{m}^{(t)}$, the PyTorch implementation of NAG takes the form \begin{eqnarray}\label{Nesterov} \mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\left[\nabla f\left(\mathbf{x}^{(t)}\right)+\mu\,\mathbf{m}^{(t+1)}\right]\,. \tag{2}\end{eqnarray} In sum, SGD with momentum (1) updates $\mathbf{x}^{(t)}$ by the EWMA $\mathbf{m}^{(t+1)}$, while NAG (2) updates $\mathbf{x}^{(t)}$ by $\nabla f\left(\mathbf{x}^{(t)}\right)+\mu\,\mathbf{m}^{(t+1)}$.
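A minimal sketch of the PyTorch-style Nesterov update (2), in the same toy setting as above (names are illustrative):

```python
import numpy as np

def nesterov_step(x, m, grad, mu=0.9, gamma=0.01):
    """One step of update (2): Nesterov momentum in the PyTorch form."""
    m = mu * m + grad                   # m^(t+1)
    x = x - gamma * (grad + mu * m)     # step along grad + mu * m^(t+1)
    return x, m

x, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    x, m = nesterov_step(x, m, grad=x)  # f(x) = 0.5 * ||x||^2
print(x)  # close to the minimizer [0, 0]
```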

Adam

Besides the EWMA of gradients as in SGD, Adam also adopts adaptive learning rates controlled by the EWMA of squared gradients: \begin{eqnarray} \mathbf{m}^{(t+1)} &= & \beta_1\,\mathbf{m}^{(t)} + \left(1-\beta_1\right) \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{v}^{(t+1)} &=& \beta_2\,\mathbf{v}^{(t)} + \left(1-\beta_2\right) \nabla f\left(\mathbf{x}^{(t)}\right) \odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \hat{\mathbf{m}}^{(t+1)} &=& \frac{\mathbf{m}^{(t+1)}}{1-\beta_1^{t+1}}\,,\nonumber\\ \hat{\mathbf{v}}^{(t+1)} &=& \frac{\mathbf{v}^{(t+1)}}{1-\beta_2^{t+1}}\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\hat{\mathbf{m}}^{(t+1)} \oslash \,\left({\sqrt{\hat{\mathbf{v}}^{(t+1)} }+\epsilon}\right)\,. \tag{3}\end{eqnarray}
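A minimal NumPy sketch of update (3) on the same toy quadratic (hyperparameters are the common defaults; the code mirrors the equations above rather than any particular library):

```python
import numpy as np

def adam_step(x, m, v, grad, t, beta1=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """One step of update (3); t counts from 0 as in the equations."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad        # element-wise square
    m_hat = m / (1 - beta1 ** (t + 1))               # bias corrections
    v_hat = v / (1 - beta2 ** (t + 1))
    x = x - gamma * m_hat / (np.sqrt(v_hat) + eps)   # element-wise division
    return x, m, v

x = np.array([1.0, -2.0])
m = v = np.zeros(2)
for t in range(3000):
    x, m, v = adam_step(x, m, v, grad=x, t=t)        # f(x) = 0.5 * ||x||^2
print(x)  # approaches the minimizer [0, 0]
```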
Remarks:
  • The operations of product, division and square root above are all element-wise.
  • Each component $\mathbf{x}^{(t)}_i$ has its own adaptive learning rate $\gamma / \left(\sqrt{\hat{\mathbf{v}}^{(t+1)}_i}+\epsilon\right)$.
  • The bias corrections in the third and fourth lines can be argued as follows: note that $\mathbf{m}^{(t+1)} = \left(1-\beta_1\right)\sum_{\tau=0}^t \beta_1^{t-\tau}\, \nabla f\left(\mathbf{x}^{(\tau)}\right)$. If $\nabla f\left(\mathbf{x}^{(\tau)}\right)$ forms a stationary time series with $\mathbb{E}\left[\nabla f\left(\mathbf{x}^{(\tau)}\right)\right]=\mathbf{g}$, then $\mathbb{E}\left[\mathbf{m}^{(t+1)}\right] = \mathbf{g}\left(1-\beta_1\right)\sum_{\tau=0}^t \beta_1^{t-\tau}=\mathbf{g}\left(1-\beta_1^{t+1}\right)$, so dividing by $1-\beta_1^{t+1}$ makes $\hat{\mathbf{m}}^{(t+1)}$ an unbiased estimate of $\mathbf{g}$; the constant-gradient case is checked numerically in the sketch after these remarks.
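To make the bias-correction argument concrete, the constant-gradient case can be checked directly (a toy sketch; the gradient values are arbitrary):

```python
import numpy as np

beta1 = 0.9
g = np.array([0.5, -1.5])     # pretend the gradient is constant over time
m = np.zeros(2)

for t in range(10):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** (t + 1))
    # Without correction, m underestimates g by the factor (1 - beta1^(t+1));
    # the corrected m_hat recovers g exactly in this constant-gradient case.
    assert np.allclose(m, (1 - beta1 ** (t + 1)) * g)
    assert np.allclose(m_hat, g)
print("bias correction verified")
```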
Some other optimizers, including Adagrad, Adadelta and RMSprop, can be viewed as simplified versions of Adam; minimal sketches of their updates follow the list:
  • RMSprop: no momentum ($\beta_1=0$) and no bias corrections. \begin{eqnarray} \mathbf{v}^{(t+1)} &=& \beta_2\,\mathbf{v}^{(t)} + \left(1-\beta_2\right) \nabla f\left(\mathbf{x}^{(t)}\right) \odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\, \nabla f\left(\mathbf{x}^{(t)}\right) \oslash \,\left({\sqrt{{\mathbf{v}}^{(t+1)}}+\epsilon}\right)\,.\nonumber \end{eqnarray}
  • Adadelta: no momentum ($\beta_1=0$) and no bias corrections. A new variable $\boldsymbol{\delta}$ and its EWMA $\boldsymbol{\Delta}$ are introduced: \begin{eqnarray} \mathbf{v}^{(t+1)} &=& \beta_2\,\mathbf{v}^{(t)} + \left(1-\beta_2\right) \nabla f\left(\mathbf{x}^{(t)}\right) \odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \boldsymbol{\delta}^{(t+1)} &=& \sqrt{\boldsymbol{\Delta}^{(t)}+\epsilon}\oslash \,\left({\sqrt{{\mathbf{v}}^{(t+1)}}+\epsilon}\right)\odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\boldsymbol{\delta}^{(t+1)}\,,\nonumber\\ \boldsymbol{\Delta}^{(t+1)} &=& \beta_2\,\boldsymbol{\Delta}^{(t)} + \left(1-\beta_2\right) \boldsymbol{\delta}^{(t+1)}\odot \boldsymbol{\delta}^{(t+1)}\,.\nonumber \end{eqnarray} Roughly speaking, the intuition is inspired by Newton's method: $\boldsymbol{\delta}\propto\mathbf{H}^{-1}\mathbf{g}$ and $\mathbf{H}^{-1}\propto\boldsymbol{\delta} \oslash \mathbf{g}$, applied iteratively.
  • Adagrad: no momentum ($\beta_1=0$) and no bias corrections. Squared gradients are simply summed rather than exponentially averaged, which eventually makes the adaptive learning rate vanish. \begin{eqnarray} \mathbf{v}^{(t+1)} &=& \sum_{\tau=0}^t \nabla f\left(\mathbf{x}^{(\tau)}\right)\odot \nabla f\left(\mathbf{x}^{(\tau)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\, \nabla f\left(\mathbf{x}^{(t)}\right) \oslash \,\left({\sqrt{{\mathbf{v}}^{(t+1)}}+\epsilon}\right)\,.\nonumber \end{eqnarray}
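Minimal sketches of these three simplified updates, mirroring the equations above (variable names and hyperparameter values are mine, not any library's exact implementation):

```python
import numpy as np

gamma, beta2, eps = 0.01, 0.99, 1e-8

def rmsprop_step(x, v, grad):
    """RMSprop: EWMA of squared gradients, no momentum, no bias correction."""
    v = beta2 * v + (1 - beta2) * grad * grad
    x = x - gamma * grad / (np.sqrt(v) + eps)
    return x, v

def adagrad_step(x, v, grad):
    """Adagrad: squared gradients are summed, so the step size keeps shrinking."""
    v = v + grad * grad
    x = x - gamma * grad / (np.sqrt(v) + eps)
    return x, v

def adadelta_step(x, v, Delta, grad):
    """Adadelta: sqrt(Delta)/sqrt(v) plays the role of a diagonal H^-1."""
    v = beta2 * v + (1 - beta2) * grad * grad
    delta = np.sqrt(Delta + eps) / (np.sqrt(v) + eps) * grad
    x = x - gamma * delta
    Delta = beta2 * Delta + (1 - beta2) * delta * delta
    return x, v, Delta

# Example usage with f(x) = 0.5 * ||x||^2:
x, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    x, v = rmsprop_step(x, v, grad=x)
print(x)  # hovers near the minimizer [0, 0]
```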


