Deep learning optimizers
SGD
PyTorch implements SGD with momentum $\mu$ and learning rate $\gamma$ as (1): \begin{eqnarray} \mathbf{m}^{(t+1)} &=& \mu\,\mathbf{m}^{(t)} + \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\mathbf{m}^{(t+1)}\,.\nonumber \end{eqnarray} Vanilla SGD updates $\mathbf{x}^{(t)}$ only by its gradient $\nabla f\left(\mathbf{x}^{(t)}\right)$, i.e. $\mu=0$.
Unrolling (1), $\mathbf{m}^{(t+1)}$ is an exponentially weighted sum of all the past gradients: \begin{equation} \mathbf{m}^{(t+1)} = \sum_{\tau=0} ^{t} \mu^{t-\tau} \,\nabla f\left(\mathbf{x}^{(\tau)}\right)\,,\nonumber \end{equation} so $\mathbf{x}^{(t)}$ is in effect updated by an exponentially weighted moving average (EWMA) of the past gradients. To me, the EWMA view is more transparent than the physics analogy of momentum and acceleration.
At first look, the EWMA should be \begin{eqnarray} \mathbf{m}^{(t+1)}&=&\mu\,\mathbf{m}^{(t)} + (1-\mu) \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \alpha\,\mathbf{m}^{(t+1)} \,,\nonumber \end{eqnarray} which is the same as the PyTorch implementation (1) with $\gamma = \alpha\,(1-\mu)$.
The PyTorch implementation (1) relates to the version in the Sutskever et al. paper, \begin{eqnarray} \mathbf{q}^{(t+1)} &=& \nu\,\mathbf{q}^{(t)} + \gamma\,\nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \mathbf{q}^{(t+1)} \,,\nonumber \end{eqnarray} by $\mathbf{m}^{(t)} =\mathbf{q}^{(t)} / \gamma$ and $\mu = \nu$.
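To make the $\gamma = \alpha\,(1-\mu)$ claim concrete, here is a minimal NumPy sketch (my own toy setup, not PyTorch code) that runs the recursion (1) and the normalized EWMA form side by side on a made-up quadratic; the gradient function and all constants are illustrative assumptions.

```python
import numpy as np

def f_grad(x):
    # Gradient of a toy quadratic f(x) = 0.5 * x @ diag(1, 10) @ x (illustrative choice).
    return np.array([1.0, 10.0]) * x

mu, alpha = 0.9, 0.01            # momentum and EWMA-form learning rate (made-up values)
gamma = alpha * (1.0 - mu)       # PyTorch-form learning rate, gamma = alpha * (1 - mu)

x_pt, m_pt = np.array([1.0, 1.0]), np.zeros(2)   # PyTorch-form iterate and buffer
x_ew, m_ew = np.array([1.0, 1.0]), np.zeros(2)   # EWMA-form iterate and buffer

for _ in range(100):
    # PyTorch-style recursion (1): unnormalized gradient sum, step size gamma.
    m_pt = mu * m_pt + f_grad(x_pt)
    x_pt = x_pt - gamma * m_pt
    # EWMA form: normalized average, step size alpha.
    m_ew = mu * m_ew + (1.0 - mu) * f_grad(x_ew)
    x_ew = x_ew - alpha * m_ew

print(np.allclose(x_pt, x_ew))   # True: identical trajectories when gamma = alpha * (1 - mu)
```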
Adam
Adam keeps an EWMA $\mathbf{m}$ of the gradients and an EWMA $\mathbf{v}$ of their element-wise squares, corrects both for their initialization bias, and uses the latter to set per-component step sizes: \begin{eqnarray} \mathbf{m}^{(t+1)} &=& \beta_1\,\mathbf{m}^{(t)} + \left(1-\beta_1\right) \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{v}^{(t+1)} &=& \beta_2\,\mathbf{v}^{(t)} + \left(1-\beta_2\right) \nabla f\left(\mathbf{x}^{(t)}\right) \odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \hat{\mathbf{m}}^{(t+1)} &=& \mathbf{m}^{(t+1)} / \left(1-\beta_1^{t+1}\right)\,,\nonumber\\ \hat{\mathbf{v}}^{(t+1)} &=& \mathbf{v}^{(t+1)} / \left(1-\beta_2^{t+1}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\hat{\mathbf{m}}^{(t+1)} \oslash \left(\sqrt{\hat{\mathbf{v}}^{(t+1)}}+\epsilon\right)\,.\nonumber \end{eqnarray}
- The operations of product, division and square root above are all element-wise.
- Each component $\mathbf{x}^{(t)}_i$ has its own adaptive learning rate $\gamma / \sqrt{\hat{\mathbf{v}}^{(t)}_i}$.
- The bias corrections in the third and fourth lines are argued as follows: note that $\mathbf{m}^{(t+1)} = \left(1-\beta_1\right)\sum_{\tau=0}^t \beta_1^{t-\tau}\, \nabla f\left(\mathbf{x}^{(\tau)}\right)$. If $\nabla f\left(\mathbf{x}^{(\tau)}\right)$ forms a stationary time series with $\mathbb{E}\left[\nabla f\left(\mathbf{x}^{(\tau)}\right)\right]=\mathbf{g}$, then $\mathbb{E}\left[\mathbf{m}^{(t+1)}\right] = \mathbf{g}\left(1-\beta_1\right)\sum_{\tau=0}^t \beta_1^{t-\tau}=\mathbf{g}\left(1-\beta_1^{t+1}\right)$, so dividing by $1-\beta_1^{t+1}$ makes $\hat{\mathbf{m}}^{(t+1)}$ an unbiased estimate of $\mathbf{g}$; the same argument applies to $\hat{\mathbf{v}}^{(t+1)}$ with $\beta_2$. A NumPy sketch of the full Adam update, and of the RMSprop/Adadelta/Adagrad variants below, is given after this list.
- RMSprop: no momentum ($\beta_1=0$) and no bias corrections. \begin{eqnarray} \mathbf{v}^{(t+1)} &=& \beta_2\,\mathbf{v}^{(t)} + \left(1-\beta_2\right) \nabla f\left(\mathbf{x}^{(t)}\right) \odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\, \nabla f\left(\mathbf{x}^{(t)}\right) \oslash \,\left({\sqrt{{\mathbf{v}}^{(t+1)}}+\epsilon}\right)\,.\nonumber \end{eqnarray}
- Adadelta: no momentum ($\beta_1=0$) and no bias corrections. A new variable $\boldsymbol{\delta}$ and its EWMA $\boldsymbol{\Delta}$ are introduced: \begin{eqnarray} \mathbf{v}^{(t+1)} &=& \beta_2\,\mathbf{v}^{(t)} + \left(1-\beta_2\right) \nabla f\left(\mathbf{x}^{(t)}\right) \odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \boldsymbol{\delta}^{(t+1)} &=& \sqrt{\boldsymbol{\Delta}^{(t)}+\epsilon}\oslash \,\left({\sqrt{{\mathbf{v}}^{(t+1)}}+\epsilon}\right)\odot \nabla f\left(\mathbf{x}^{(t)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\,\boldsymbol{\delta}^{(t+1)}\,,\nonumber\\ \boldsymbol{\Delta}^{(t+1)} &=& \beta_2\,\boldsymbol{\Delta}^{(t)} + \left(1-\beta_2\right) \boldsymbol{\delta}^{(t+1)}\odot \boldsymbol{\delta}^{(t+1)}\,.\nonumber \end{eqnarray} Roughly speaking, the intuition is inspired by Newton's method: the step $\boldsymbol{\delta}\propto\mathbf{H}^{-1}\mathbf{g}$, hence $\mathbf{H}^{-1}\propto\boldsymbol{\delta} \oslash \mathbf{g}$, and the two estimates are refined iteratively.
- Adagrad: no momentum ($\beta_1=0$) and no bias corrections. Squared gradients are simply summed rather than averaged by an EWMA, which eventually makes the adaptive learning rate vanish. \begin{eqnarray} \mathbf{v}^{(t+1)} &=& \sum_{\tau=0}^t \nabla f\left(\mathbf{x}^{(\tau)}\right)\odot \nabla f\left(\mathbf{x}^{(\tau)}\right)\,,\nonumber\\ \mathbf{x}^{(t+1)} &=& \mathbf{x}^{(t)} - \gamma\, \nabla f\left(\mathbf{x}^{(t)}\right) \oslash \,\left({\sqrt{{\mathbf{v}}^{(t+1)}}+\epsilon}\right)\,.\nonumber \end{eqnarray}
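As referenced above, here is a minimal NumPy sketch of one Adam step following the update written under the Adam heading; `adam_step`, its argument names, the toy gradient, and the hyperparameter defaults are my own illustrative choices, not any library's API.

```python
import numpy as np

def adam_step(x, grad, m, v, t, gamma=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; all products, divisions, and square roots are element-wise."""
    m = beta1 * m + (1.0 - beta1) * grad                # EWMA of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad         # EWMA of squared gradients
    m_hat = m / (1.0 - beta1 ** (t + 1))                # bias correction (third line)
    v_hat = v / (1.0 - beta2 ** (t + 1))                # bias correction (fourth line)
    x = x - gamma * m_hat / (np.sqrt(v_hat) + eps)      # per-component adaptive step
    return x, m, v

# Usage: t counts steps from 0, and the buffers m, v start at zero.
x, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(3):
    grad = 2.0 * x                                      # toy gradient of f(x) = ||x||^2
    x, m, v = adam_step(x, grad, m, v, t)
```

At $t=0$ the corrections divide by $1-\beta_1$ and $1-\beta_2$, exactly cancelling the $(1-\beta_1)$ and $(1-\beta_2)$ factors in front of the first gradient, which is the content of the bias-correction argument above.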
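Under the same assumptions (illustrative names and defaults only), the three variants above differ from Adam only in how $\mathbf{v}$ is accumulated and in dropping momentum and bias corrections:

```python
import numpy as np

def rmsprop_step(x, grad, v, gamma=1e-2, beta2=0.99, eps=1e-8):
    # EWMA of squared gradients; no momentum, no bias correction.
    v = beta2 * v + (1.0 - beta2) * grad * grad
    x = x - gamma * grad / (np.sqrt(v) + eps)
    return x, v

def adadelta_step(x, grad, v, Delta, gamma=1.0, beta2=0.9, eps=1e-6):
    # EWMA of squared gradients plus an EWMA of squared steps (Delta).
    v = beta2 * v + (1.0 - beta2) * grad * grad
    delta = np.sqrt(Delta + eps) / (np.sqrt(v) + eps) * grad   # rescaled step
    x = x - gamma * delta
    Delta = beta2 * Delta + (1.0 - beta2) * delta * delta
    return x, v, Delta

def adagrad_step(x, grad, v, gamma=1e-2, eps=1e-8):
    # Plain sum of squared gradients: v only grows, so the effective step shrinks.
    v = v + grad * grad
    x = x - gamma * grad / (np.sqrt(v) + eps)
    return x, v
```

All state arrays ($\mathbf{v}$, $\boldsymbol{\Delta}$) start at zero, mirroring the recursions above; in Adagrad $\mathbf{v}$ never decreases, which is why the adaptive learning rate eventually vanishes.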