\[
\theta_{t + 1} = \theta_{t} - \alpha \nabla E(\theta_{t})
\]
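This is the vanilla gradient descent update. A minimal NumPy sketch of one step, assuming a hypothetical `grad_E(theta)` that returns the gradient of the error at `theta`:

```python
import numpy as np

def sgd_step(theta, grad_E, alpha=0.01):
    # theta_{t+1} = theta_t - alpha * grad E(theta_t)
    return theta - alpha * grad_E(theta)

# Toy usage: E(theta) = ||theta||^2, so grad E(theta) = 2 * theta.
theta = np.array([1.0, -2.0])
theta = sgd_step(theta, lambda t: 2 * t, alpha=0.1)
```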
\[
\begin{split}
u_{t + 1} &= \mu u_{t} + \alpha \nabla E(\theta_{t}) \\
\theta_{t + 1} &= \theta_{t} - u_{t + 1} \\
\end{split}
\]
Momentum helps the optimizer avoid some local minima and speeds up convergence in some cases.
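A sketch of one momentum step under the same assumptions; the velocity `u` has to be carried between iterations:

```python
import numpy as np

def momentum_step(theta, u, grad_E, alpha=0.01, mu=0.9):
    u = mu * u + alpha * grad_E(theta)  # u_{t+1} = mu * u_t + alpha * grad E(theta_t)
    theta = theta - u                   # theta_{t+1} = theta_t - u_{t+1}
    return theta, u

theta, u = np.array([1.0, -2.0]), np.zeros(2)
theta, u = momentum_step(theta, u, lambda t: 2 * t)
```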
\[
\begin{split}
u_{t + 1} &= \mu u_{t} + \alpha \nabla E(\theta_{t} - \mu u_{t}) \\
\theta_{t + 1} &= \theta_{t} - u_{t + 1} \\
\end{split}
\]
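These equations are the Nesterov variant of momentum: the gradient is evaluated at the look-ahead position, i.e. the current parameters already shifted by the momentum term, rather than at the current parameters. A sketch under the same assumptions:

```python
import numpy as np

def nesterov_step(theta, u, grad_E, alpha=0.01, mu=0.9):
    # The gradient is taken at the look-ahead point theta - mu * u.
    u = mu * u + alpha * grad_E(theta - mu * u)
    theta = theta - u
    return theta, u

theta, u = np.array([1.0, -2.0]), np.zeros(2)
theta, u = nesterov_step(theta, u, lambda t: 2 * t)
```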
Adagrad adaptively tunes the learning rate for each parameter by scaling it with the accumulated squared gradients.
\[
\begin{split}
cache_{t + 1} &= cache_{t} + (\nabla E(\theta_{t}))^2 \\
\theta_{t + 1} &= \theta_{t} - \alpha \frac{\nabla E(\theta_{t})}{\sqrt{cache_{t + 1}} + \epsilon} \\
\end{split}
\]
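A sketch of one Adagrad step, again with a hypothetical `grad_E`; note that the accumulator `cache` only grows, which eventually shrinks the effective step size:

```python
import numpy as np

def adagrad_step(theta, cache, grad_E, alpha=0.01, eps=1e-8):
    g = grad_E(theta)
    cache = cache + g ** 2                              # cache_{t+1} = cache_t + g^2
    theta = theta - alpha * g / (np.sqrt(cache) + eps)
    return theta, cache

theta, cache = np.array([1.0, -2.0]), np.zeros(2)
theta, cache = adagrad_step(theta, cache, lambda t: 2 * t)
```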
RMSProp is a slight variation of Adagrad: instead of accumulating squared gradients indefinitely, it lets the squared-gradient estimate decay, so the effective learning rate does not shrink toward zero.
\[
\begin{split}
cache_{t + 1} &= \beta cache_{t} + (1 - \beta) (\nabla E(\theta_{t}))^2 \\
\theta_{t + 1} &= \theta_{t} - \alpha \frac{\nabla E(\theta_{t})}{\sqrt{cache_{t + 1}} + \epsilon} \\
\end{split}
\]
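A sketch of one RMSProp step; `beta` is the decay rate of the squared-gradient average (0.9 here is an illustrative choice):

```python
import numpy as np

def rmsprop_step(theta, cache, grad_E, alpha=0.01, beta=0.9, eps=1e-8):
    g = grad_E(theta)
    cache = beta * cache + (1 - beta) * g ** 2          # decaying average of g^2
    theta = theta - alpha * g / (np.sqrt(cache) + eps)
    return theta, cache

theta, cache = np.array([1.0, -2.0]), np.zeros(2)
theta, cache = rmsprop_step(theta, cache, lambda t: 2 * t)
```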
Combining a decaying average of the gradient with a decaying average of its square gives an Adam-style update:
\[
\begin{split}
u_{t + 1} &= \gamma u_{t} + (1 - \gamma) \nabla E(\theta_{t}) \\
cache_{t + 1} &= \beta cache_{t} + (1 - \beta) (\nabla E(\theta_{t}))^2 \\
\theta_{t + 1} &= \theta_{t} - \alpha \frac{u_{t + 1}}{\sqrt{cache_{t + 1}} + \epsilon} \\
\end{split}
\]
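A sketch of one step of this combined update (essentially Adam without its bias-correction terms); `gamma` and `beta` are the two decay rates:

```python
import numpy as np

def adam_like_step(theta, u, cache, grad_E,
                   alpha=0.001, gamma=0.9, beta=0.999, eps=1e-8):
    g = grad_E(theta)
    u = gamma * u + (1 - gamma) * g              # decaying average of the gradient
    cache = beta * cache + (1 - beta) * g ** 2   # decaying average of the squared gradient
    theta = theta - alpha * u / (np.sqrt(cache) + eps)
    return theta, u, cache

theta, u, cache = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
theta, u, cache = adam_like_step(theta, u, cache, lambda t: 2 * t)
```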