Gradient Descent with Momentum, RMSProp, and Adam
Momentum, RMSProp, and Adam are considered adaptive optimization algorithms because they adapt the size of each update during training in a way that is more responsive to the topology of the cost function's landscape, compared to the more rigid update rule of basic gradient descent. They accelerate gradient descent, which is crucial for training neural networks efficiently. Fundamentally, they do this by using exponentially weighted averages of the gradients rather than just the current gradient, which smooths out the noise in the steps taken toward the minimum of the cost function.
Exponentially Weighted Moving Averages
We should first go over exponentially weighted averages. Suppose we have time-series data such as the daily temperatures throughout the year: $\theta_1, \theta_2, \ldots, \theta_{365}$.
If we plot data like this, we will find it quite noisy. Now let's compute the exponentially weighted averages using:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, \theta_t$$
For example, say $\beta = 0.9$ and we want to compute $v_t$:

$$v_0 = 0$$
$$v_1 = 0.9\, v_0 + 0.1\, \theta_1$$
$$v_2 = 0.9\, v_1 + 0.1\, \theta_2$$
$$\vdots$$
If we plot this, the values will represent averages over, heuristically, the last $\frac{1}{1-\beta}$ entries. So, the plot will be much smoother than the original plot.
- $\beta = 0.9$ will average roughly the last 10 entries
- $\beta = 0.98$ will average roughly the last 50 entries
- $\beta = 0.5$ will average roughly the last 2 entries
This heuristic comes from the fact that when $\beta = 0.9$:

$$0.9^{10} \approx 0.35 \approx \frac{1}{e}$$

More generally, with $\beta = 0.9$ it takes about 10 days for the weight on an observation to decay to about 1/3 of its peak, so we are averaging over roughly 10 entries. This is a good rule of thumb to keep in mind. Similarly, you can approximate the number of days being averaged using:

$$\frac{1}{1 - \beta}$$
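To see where this rule of thumb comes from, here is a quick sketch: writing $\epsilon = 1 - \beta$, the weight on an observation from $1/\epsilon$ steps ago is $\beta^{1/\epsilon} = (1-\epsilon)^{1/\epsilon} \approx 1/e$, i.e. after roughly $\frac{1}{1-\beta}$ steps a contribution has decayed to about a third.

```latex
% Numeric check of the 1/(1 - beta) window-size heuristic.
% The weight on an observation from 1/(1 - beta) steps ago is roughly 1/e:
\begin{align*}
  \beta = 0.9:\quad  & 0.9^{10}  \approx 0.349 \approx 1/e \\
  \beta = 0.98:\quad & 0.98^{50} \approx 0.364 \approx 1/e
\end{align*}
```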
With a high value of $\beta$ (e.g. $\beta = 0.98$), the plot you get is much smoother because you're now averaging over more days of temperature. So the curve is less wavy, but on the flip side it now shifts further to the right because you're averaging over a much larger window of temperatures. By averaging over a larger window, the exponentially weighted average formula adapts more slowly when the temperature changes, i.e. we have more latency. The reason is that when $\beta = 0.98$, the formula gives a lot of weight to the previous value and a much smaller weight, just 0.02, to whatever you're seeing right now.
By averaging over only about two days of temperature ($\beta = 0.5$), you're averaging over a much shorter window. The resulting curve is much noisier and more susceptible to outliers, but it adapts much more quickly when the temperature changes.
Understanding exponentially weighted averages
Intuition. Say we were calculating $v_t$ up until $t = 100$ with $\beta = 0.9$:

$$v_{100} = 0.9\, v_{99} + 0.1\, \theta_{100}$$
$$v_{99} = 0.9\, v_{98} + 0.1\, \theta_{99}$$
$$v_{98} = 0.9\, v_{97} + 0.1\, \theta_{98}$$
$$\vdots$$
We can expand and rewrite the equation as a sum:

$$v_{100} = 0.1\, \theta_{100} + 0.1 \cdot 0.9\, \theta_{99} + 0.1 \cdot 0.9^2\, \theta_{98} + 0.1 \cdot 0.9^3\, \theta_{97} + \cdots$$
We can see clearly now that this is just a weighted sum with different weights on different $\theta$'s, where the weights decay exponentially and add up to approximately 1. So in general:

$$v_t = (1 - \beta) \sum_{i=1}^{t} \beta^{\,t-i}\, \theta_i$$
This calculation takes very little memory: it has $O(1)$ storage, just a single real number per quantity being averaged, updated in place. We could get more accurate results using an explicit moving window of recent values, but the exponentially weighted averages algorithm is more memory-efficient and faster.
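As a rough sketch, the update can be implemented in a few lines of Python/NumPy (the names `exp_weighted_average`, `temperatures`, and `beta` here are just illustrative):

```python
import numpy as np

def exp_weighted_average(data, beta=0.9):
    """Exponentially weighted moving average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    smoothed = []
    for theta in data:
        v = beta * v + (1 - beta) * theta   # only one number (v) is carried between steps
        smoothed.append(v)
    return np.array(smoothed)

# Example: noisy synthetic daily "temperatures"
temperatures = 25 + 5 * np.sin(np.linspace(0, 6, 365)) + np.random.randn(365)
smooth = exp_weighted_average(temperatures, beta=0.9)   # averages roughly the last 10 days
```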
Bias Correction in Exponentially Weighted Averages
Initially, because $v_0 = 0$, the exponentially weighted average is biased toward zero, which can affect early iterations. To correct this, you can apply bias correction:

$$v_t^{\text{corrected}} = \frac{v_t}{1 - \beta^t}$$
As $t$ becomes larger, $\beta^t$ approaches 0, so the denominator $1 - \beta^t$ becomes close to 1 and the correction has almost no effect; for small $t$, the denominator is noticeably smaller than 1, which scales $v_t$ up to compensate for the bias. So, to compare, the old equation:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, \theta_t$$
New equation with bias correction:

$$v_t^{\text{corrected}} = \frac{\beta\, v_{t-1} + (1 - \beta)\, \theta_t}{1 - \beta^t}$$
When $t$ is small, $v_t$ is inflated by the correction; when $t$ is large, the denominator $1 - \beta^t$ becomes almost 1 and the correction has no effect.
This only affects the initial stage of learning, though. I have found that I'm better off just setting $v_1$ equal to $\theta_1$ (initializing the average with the first observation) and not bothering with bias correction.
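A small sketch of what bias correction does to the first few values (the function name and sample data below are just illustrative):

```python
import numpy as np

def ewa_with_bias_correction(data, beta=0.98):
    """EWMA with the v_t / (1 - beta^t) correction applied at every step."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(data, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # divisor is ~1 once t is large
    return np.array(corrected)

data = np.array([40.0, 41.0, 39.0, 42.0, 40.5])
print(ewa_with_bias_correction(data))
# Without correction the first value would be close to 0 (v_1 = 0.02 * 40 = 0.8);
# with correction it is 0.8 / (1 - 0.98) = 40, much closer to the actual data.
```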
Gradient descent with momentum
When we apply momentum to gradient descent, we're aiming to reduce the erratic movements and focus on the consistent direction of progress. The momentum algorithm almost always works faster than standard gradient descent. The simple idea is to calculate an exponentially weighted average of your gradients and then use that average to update your weights. This averages out the oscillations in the gradients, especially in the vertical direction of the classic contour-plot picture, i.e. the direction we don't want to move in. By damping these oscillations there is a more direct path to the minimum.
Generally, you will set $v_{dW}$ and $v_{db}$ to 0, and then on each iteration $t$, once you have computed your gradients $dW$ and $db$ on the current mini-batch, you will compute a moving average of the derivatives as:

$$v_{dW} = \beta\, v_{dW} + (1 - \beta)\, dW$$
$$v_{db} = \beta\, v_{db} + (1 - \beta)\, db$$
And then update the parameters as follows:

$$W = W - \alpha\, v_{dW}$$
$$b = b - \alpha\, v_{db}$$
The derivatives can be thought of as acceleration, the moving averages $v_{dW}, v_{db}$ as velocity, and $\beta$ as friction. Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.
Using momentum can often lead to:
- Faster convergence.
- Smoother descent paths with fewer oscillations.
- Better navigation through tricky areas like saddle points.
It's a valuable tool in the machine learning optimization toolbox, often resulting in better performance than standard gradient descent. However, choosing the right hyperparameter value $\beta$ is critical and typically requires some experimentation. A common default is $\beta = 0.9$. In practice as well, many opt to skip bias correction, instead setting the initial value of $v_{dW}$ equal to the first gradient.
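As a rough sketch in Python/NumPy, one momentum update might look like this (`momentum_step` and `compute_gradients` are illustrative names, not from any particular library):

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update for parameters W and b."""
    # Exponentially weighted average of the gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Update the parameters using the smoothed gradients
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db

# Usage: initialize the velocity terms to zero, then call once per mini-batch.
W, b = np.random.randn(3, 3), np.zeros(3)
v_dW, v_db = np.zeros_like(W), np.zeros_like(b)
# dW, db = compute_gradients(W, b, minibatch)   # hypothetical backprop step
# W, b, v_dW, v_db = momentum_step(W, b, dW, db, v_dW, v_db)
```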
RMSprop
Stands for Root Mean Square prop. Developed by Geoffrey Hinton, this algorithm, like momentum, speeds up gradient descent, but it does so by dividing each update by the square root of an exponentially decaying average of squared gradients.
Generally, you will set $S_{dW}$ and $S_{db}$ to 0, and then on each iteration $t$, once you have computed your gradients $dW$ and $db$ on the current mini-batch, you will compute:

$$S_{dW} = \beta_2\, S_{dW} + (1 - \beta_2)\, dW^2$$
$$S_{db} = \beta_2\, S_{db} + (1 - \beta_2)\, db^2$$
For clarity, the superscript 2 denotes element-wise squaring. This keeps an exponentially weighted average of the squares of the derivatives. RMSprop then updates the weights as follows:

$$W = W - \alpha\, \frac{dW}{\sqrt{S_{dW}} + \varepsilon}$$
$$b = b - \alpha\, \frac{db}{\sqrt{S_{db}} + \varepsilon}$$
In the classic contour-plot example, where the horizontal axis is $W$ and the vertical axis is $b$, the $dW$'s are small, so their squares are even smaller; the $db$'s, however, are normally larger, so their squares are larger. Then, when we update the parameters, $dW$ is divided by the small value $\sqrt{S_{dW}}$, so the $W$ updates are larger, while $db$ is divided by the large term $\sqrt{S_{db}}$, so $b$ is updated less. RMSprop will essentially make the cost function move more slowly in the vertical ($b$) direction and faster in the horizontal ($W$) direction.
This damps out the oscillations more in the vertical direction. Therefore, you can use a bigger learning rate without diverging. We also ensure numerical stability by making sure the denominator $\sqrt{S_{dW}}$ is never zero, adding a small value $\varepsilon$ (e.g. $10^{-8}$) to it.
Also, just to note: the vertical ($b$) and horizontal ($W$) directions are, in reality, directions in a very high-dimensional parameter space, but the intuition is the same.
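A similar rough sketch of one RMSprop update, with the same caveat that the names are illustrative:

```python
import numpy as np

def rmsprop_step(W, b, dW, db, S_dW, S_db, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSprop update: divide each gradient by the root of its running squared average."""
    # Exponentially weighted average of the squared gradients (element-wise squares)
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    S_db = beta2 * S_db + (1 - beta2) * db ** 2
    # Large accumulated squares shrink the update; small ones enlarge it
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)
    b = b - alpha * db / (np.sqrt(S_db) + eps)
    return W, b, S_dW, S_db
```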
Adam optimization algorithm
Stands for Adaptive Moment Estimation. Adam and RMSprop are among the optimization algorithms that have been shown to work very well across a wide range of neural network architectures. Adam optimization simply puts RMSprop and momentum together.
Similarly to before, you will start off by setting $v_{dW}$, $v_{db}$, $S_{dW}$, and $S_{db}$ to 0. Then, on each iteration $t$, once you have computed your gradients $dW$ and $db$ on the current mini-batch, you will compute the momentum and RMSprop terms as follows:

$$v_{dW} = \beta_1\, v_{dW} + (1 - \beta_1)\, dW, \qquad v_{db} = \beta_1\, v_{db} + (1 - \beta_1)\, db$$
$$S_{dW} = \beta_2\, S_{dW} + (1 - \beta_2)\, dW^2, \qquad S_{db} = \beta_2\, S_{db} + (1 - \beta_2)\, db^2$$
In the typical implementation of Adam, we do use bias correction:

$$v_{dW}^{\text{corrected}} = \frac{v_{dW}}{1 - \beta_1^t}, \qquad v_{db}^{\text{corrected}} = \frac{v_{db}}{1 - \beta_1^t}$$
$$S_{dW}^{\text{corrected}} = \frac{S_{dW}}{1 - \beta_2^t}, \qquad S_{db}^{\text{corrected}} = \frac{S_{db}}{1 - \beta_2^t}$$
Then we update the parameters as follows:

$$W = W - \alpha\, \frac{v_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}$$
$$b = b - \alpha\, \frac{v_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}$$
This algorithm combines the effects of gradient descent with momentum and RMSprop. Adam is currently recommended over plain RMSprop because it works well in practice across many problems.
Hyperparameters for Adam
- $\alpha$: needs to be tuned
- $\beta_1$: parameter of momentum - 0.9 is recommended by default
- $\beta_2$: parameter of RMSprop - 0.999 is recommended by default
- $\varepsilon$: $10^{-8}$ is recommended by default
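Putting the pieces together, here is a rough sketch of one Adam step in Python/NumPy (again with illustrative names, and a toy gradient just to make the example runnable):

```python
import numpy as np

def adam_step(W, dW, v_dW, S_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor W (t is the 1-based iteration count)."""
    # Momentum term: exponentially weighted average of the gradients
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    # RMSprop term: exponentially weighted average of the squared gradients
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    # Bias correction for both moving averages
    v_corr = v_dW / (1 - beta1 ** t)
    S_corr = S_dW / (1 - beta2 ** t)
    # Combined update
    W = W - alpha * v_corr / (np.sqrt(S_corr) + eps)
    return W, v_dW, S_dW

# Usage: initialize v_dW and S_dW to zeros, keep a running iteration counter t,
# and apply the same step to every parameter tensor (W's and b's alike).
W = np.random.randn(4, 4)
v_dW, S_dW = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 101):
    dW = 2 * W                      # gradient of a toy cost ||W||^2, just for illustration
    W, v_dW, S_dW = adam_step(W, dW, v_dW, S_dW, t)
```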