Gradient Descent with Momentum, RMSProp, and Adam

Momentum, RMSProp, and Adam are considered adaptive optimization algorithms because they adjust their updates during training in a way that is more responsive to the topology of the cost function's landscape, compared to the more rigid update rules of basic gradient descent variations. They accelerate the gradient descent algorithm, which is crucial for training neural networks efficiently. Fundamentally, they do this by using exponentially weighted averages of the gradients rather than just the current gradient, which smooths out the noise in the steps taken towards the minimum of the cost function.

Exponentially Weighted Moving Averages

We should first go over exponentially weighted averages. Suppose we have time-series data such as daily temperatures throughout the year:

$$\begin{align*} \theta_{1} &= 40 \\ \theta_{2} &= 49 \\ \theta_{3} &= 45 \\ \theta_{4} &= 55 \\ & \vdots \\ \theta_{t} &= 50 \end{align*}$$

If we plot data like this, we will find it quite noisy. Now let's compute the exponentially weighted averages using:

$$v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot \theta_t$$

For example, say $\beta = 0.9$ and we want to compute $v_4$:

$$\begin{align*} v_0 &= 0 \\ v_1 &= 0.9 \cdot v_0 + 0.1 \cdot \theta_1 = 4 \\ v_2 &= 0.9 \cdot v_1 + 0.1 \cdot \theta_2 = 8.5 \\ v_3 &= 0.9 \cdot v_2 + 0.1 \cdot \theta_3 = 12.15 \\ v_4 &= 0.9 \cdot v_3 + 0.1 \cdot \theta_4 = 16.435 \end{align*}$$
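As a minimal sketch, the recursion only needs a single running value; the snippet below reproduces the worked example above (the temperature list is just the four example values):

```python
# Exponentially weighted average of the example temperatures, beta = 0.9.
thetas = [40, 49, 45, 55]
beta = 0.9

v = 0.0  # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    print(f"v_{t} = {v:.3f}")
# v_1 = 4.000, v_2 = 8.500, v_3 = 12.150, v_4 = 16.435
```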

If we plot this, the values will represent averages over, heuristically, the last $1 / (1 - \beta)$ entries. So, the plot will be much smoother than the original plot.

  • $\beta = 0.9$ will average over roughly the last 10 entries
  • $\beta = 0.98$ will average over roughly the last 50 entries
  • $\beta = 0.5$ will average over roughly the last 2 entries

This heuristic comes from the fact that when $\beta = 0.9$:

$$0.9^{10} \approx 0.35 \approx \frac{1}{e}$$

In other words, it takes about 10 days for the weight on a given day's temperature to decay to about $1/3$ of its initial value, so we are averaging over roughly 10 entries. This is a good rule of thumb to keep in mind. More generally, you can approximate the effective number of days using:

$$\log_{\beta} \frac{1}{e}$$
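To see where the $1/(1-\beta)$ rule of thumb comes from, write $\beta = 1 - \varepsilon$; then

$$\beta^{1/(1-\beta)} = (1 - \varepsilon)^{1/\varepsilon} \approx e^{-1} \quad \text{for small } \varepsilon,$$

so after about $1/(1-\beta)$ days the weight has decayed by a factor of roughly $1/e$; for example $0.9^{10} \approx 0.35$ and $0.98^{50} \approx 0.36$.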

With a high value of $\beta$, the plot you get is much smoother because you're now averaging over more days of temperature. The curve is less wavy, but on the flip side it shifts further to the right, because you're now averaging over a much larger window of temperatures. By averaging over a larger window, the exponentially weighted average formula adapts more slowly when the temperature changes, i.e. we have more latency. The reason is that when $\beta = 0.98$, you're giving a lot of weight to the previous value and a much smaller weight, just $0.02$, to whatever you're seeing right now.

[Figure: Exponential Beta Comparison, exponentially weighted averages for different values of $\beta$]

By averaging over only two days of temperature, you're averaging over a much shorter window. The resulting curve is much noisier and much more susceptible to outliers, but it adapts much more quickly when the temperature changes.

Understanding exponentially weighted averages

Intuition: say we were calculating up to $v_{100}$:

$$\begin{align*} v_{100} &= 0.9 \cdot v_{99} + 0.1 \cdot \theta_{100} \\ v_{99} &= 0.9 \cdot v_{98} + 0.1 \cdot \theta_{99} \\ v_{98} &= 0.9 \cdot v_{97} + 0.1 \cdot \theta_{98} \\ & \vdots \\ v_{1} &= 0.9 \cdot v_{0} + 0.1 \cdot \theta_{1} \\ v_{0} &= 0 \end{align*}$$

We can expand and rewrite the equation as a sum:

$$v_{100} = 0.1 \cdot \theta_{100} + 0.9 \cdot 0.1 \cdot \theta_{99} + 0.9^2 \cdot 0.1 \cdot \theta_{98} + \cdots + 0.9^{99} \cdot 0.1 \cdot \theta_{1}$$

We can see clearly now that this is just a weighted sum with different weights on different $\theta$'s, where the weights decay exponentially and add up to (very nearly) 1. So in general:

$$v_t = (1 - \beta) \sum_{i=1}^{t} \beta^{t-i} \theta_i$$

This calculation takes very little memory: you only need to store a single real number per quantity being averaged. A plain moving average over an explicit window would give slightly more accurate results, but it requires storing the whole window, so the exponentially weighted average is more efficient and faster in code.
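A rough sketch of the memory difference (the window size of 10 below is just an illustrative choice matching the $1/(1-\beta)$ heuristic for $\beta = 0.9$):

```python
from collections import deque

def ewma(values, beta=0.9):
    """Exponentially weighted average: keeps only one running value."""
    v, out = 0.0, []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

def moving_average(values, window=10):
    """Plain moving average: must keep the last `window` values in memory."""
    buf, out = deque(maxlen=window), []
    for theta in values:
        buf.append(theta)
        out.append(sum(buf) / len(buf))
    return out
```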

Bias Correction in Exponentially Weighted Averages

Initially, because $v_0 = 0$, the exponentially weighted average is biased toward zero, which can affect early iterations. To correct this, you can apply bias correction:

$$v_t^{\text{corrected}} = \frac{v_t}{1 - \beta^t} = \frac{\beta \cdot v_{t-1} + (1 - \beta) \cdot \theta_t}{1 - \beta^t}$$

As $t$ becomes larger, $1 - \beta^t$ becomes close to 1. Bias correction works by making the denominator noticeably smaller than 1 near $t = 0$, which scales the early estimates up, while leaving the estimate essentially untouched when $t$ is large. So, to compare, the old equation:

$$v_0 = 0, \qquad v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot \theta_t$$

New Equation with Bias Correction:

$$v_0 = 0, \qquad v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot \theta_t, \qquad v_t^{\text{corrected}} = \frac{v_t}{1 - \beta^t}$$

When $t$ is small, the corrected $v_t$ is scaled up to compensate for the zero initialization; when $t$ is large, the denominator becomes almost 1 and the correction has essentially no effect.
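A minimal sketch of the corrected estimate, reusing the temperature example from earlier:

```python
# Bias-corrected exponentially weighted average of the example temperatures.
thetas = [40, 49, 45, 55]
beta = 0.9

v = 0.0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta   # uncorrected running average
    v_corrected = v / (1 - beta ** t)   # bias correction
    print(f"t={t}: v={v:.3f}, corrected={v_corrected:.3f}")
# At t=1 the raw value is 4.0, but the corrected value is 40.0,
# much closer to the first actual temperature.
```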

That said, this only affects the initial stage of learning. In practice I have found I'm better off just setting $v_0$ equal to $\theta_1$ and not bothering with bias correction.

Gradient descent with momentum

When we apply momentum to gradient descent, we aim to reduce the erratic movements and focus on the consistent direction of progress. The momentum algorithm almost always works faster than standard gradient descent. The simple idea is to compute exponentially weighted averages of your gradients and then update your weights with those averaged values. This averages out the oscillations in the gradients, especially in the vertical direction in the diagram below, i.e. the direction we don't want to move in. By damping these oscillations, the path to the minimum becomes more direct.

[Figure: contour plot of the cost function, comparing the oscillating gradient descent path (blue) with the smoother path taken with momentum (red)]

Generally, you will set $v_{d\mathbf{w}}$ and $v_{d\mathbf{b}}$ to 0, and then on each iteration $t$, once you have computed your gradients $d\mathbf{w}$ and $d\mathbf{b}$ on the current mini-batch, you will compute a moving average of the derivatives as:

$$\begin{align*} v_{d\mathbf{w}} &= \beta \cdot v_{d\mathbf{w}} + (1 - \beta) \cdot d\mathbf{w} \\ v_{d\mathbf{b}} &= \beta \cdot v_{d\mathbf{b}} + (1 - \beta) \cdot d\mathbf{b} \end{align*}$$

And then update the parameters as follows:

$$\begin{align*} \mathbf{w} &:= \mathbf{w} - \alpha \cdot v_{d\mathbf{w}} \\ \mathbf{b} &:= \mathbf{b} - \alpha \cdot v_{d\mathbf{b}} \end{align*}$$

The derivatives can be thought of as acceleration, the moving averages as velocity, and $\beta$ as analogous to friction. Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.
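A minimal NumPy sketch of one momentum update for a single $(\mathbf{w}, \mathbf{b})$ pair; the shapes and hyperparameter values are just illustrative assumptions:

```python
import numpy as np

def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update."""
    # Exponentially weighted averages of the gradients.
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    # Step in the direction of the smoothed gradients.
    w = w - alpha * v_dw
    b = b - alpha * v_db
    return w, b, v_dw, v_db

# The velocities start at zero and are carried across mini-batches.
w, b = np.random.randn(3, 2), np.zeros(2)
v_dw, v_db = np.zeros_like(w), np.zeros_like(b)
```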

Using momentum can often lead to:

  • Faster convergence.
  • Smoother descent paths with fewer oscillations.
  • Better navigation through tricky areas like saddle points.

It's a valuable tool in the machine learning optimization toolbox, often resulting in better performance than standard gradient descent. However, choosing the right $\beta$ hyperparameter value is critical and typically requires some experimentation. A common default is $\beta = 0.9$. In practice as well, many opt to skip bias correction and instead set the initial value of $v_0$ equal to the first gradient.

RMSprop

RMSprop stands for Root Mean Square prop. Developed by Geoffrey Hinton, this algorithm, like momentum, speeds up gradient descent; it does so by dividing each gradient by the root of an exponentially decaying average of squared gradients.

Generally, you will set $s_{d\mathbf{w}}$ and $s_{d\mathbf{b}}$ to 0, and then on each iteration $t$, once you have computed your gradients $d\mathbf{w}$ and $d\mathbf{b}$ on the current mini-batch, you will compute:

$$\begin{align*} s_{d\mathbf{w}} &= \beta_2 \cdot s_{d\mathbf{w}} + (1 - \beta_2) \cdot d\mathbf{w}^{2} \\ s_{d\mathbf{b}} &= \beta_2 \cdot s_{d\mathbf{b}} + (1 - \beta_2) \cdot d\mathbf{b}^{2} \end{align*}$$

For clarity, the superscript 2 denotes element-wise squaring. This keeps an exponentially weighted average of the squares of the derivatives. RMSprop then updates the weights as follows:

$$\begin{align*} \mathbf{w} &:= \mathbf{w} - \alpha \cdot \frac{d\mathbf{w}}{\sqrt{s_{d\mathbf{w}}} + \epsilon} \\ \mathbf{b} &:= \mathbf{b} - \alpha \cdot \frac{d\mathbf{b}}{\sqrt{s_{d\mathbf{b}}} + \epsilon} \end{align*}$$

The derivatives in the $\mathbf{w}$ direction are small, so their squares (and hence $s_{d\mathbf{w}}$) are even smaller; the derivatives in the $\mathbf{b}$ direction are normally larger, so $s_{d\mathbf{b}}$ is larger. When we update the parameters, $d\mathbf{w}$ is divided by a small value, so the updates in the $\mathbf{w}$ direction are larger, while $d\mathbf{b}$ is divided by a large term and so $\mathbf{b}$ is updated less. RMSprop essentially makes the descent move more slowly in the $\mathbf{b}$ direction and faster in the $\mathbf{w}$ direction in the following diagram:

[Figure: RMSProp damping the oscillations in the $\mathbf{b}$ direction while keeping progress in the $\mathbf{w}$ direction]

This damps out the movement in the vertical direction more, so you can use a bigger learning rate without diverging. For numerical stability, we also add a small value $\epsilon$ (e.g. $\epsilon = 10^{-8}$) to the denominator, ensuring that we never divide by a $\sqrt{s_{d\mathbf{w}}}$ that is zero or very close to zero.

Also, just to note: in reality the vertical and horizontal $\mathbf{w}$ and $\mathbf{b}$ directions live in a very high-dimensional space of parameters, but the intuition is the same.
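A minimal NumPy sketch of one RMSprop step, in the same style as the momentum sketch above (the hyperparameter defaults are common choices, assumed here for illustration):

```python
import numpy as np

def rmsprop_step(w, b, dw, db, s_dw, s_db, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSprop update for a single (w, b) pair."""
    # Exponentially weighted averages of the element-wise squared gradients.
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2
    s_db = beta2 * s_db + (1 - beta2) * db ** 2
    # Scale each gradient component by the root of its running squared average.
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return w, b, s_dw, s_db
```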

Adam optimization algorithm

Adam stands for Adaptive Moment Estimation. Adam and RMSprop are among the optimization algorithms that have proven effective across a wide range of neural network architectures. Adam simply puts RMSprop and momentum together.

Similarly to before, you start off by setting $v_{d\mathbf{w}}$, $v_{d\mathbf{b}}$, $s_{d\mathbf{w}}$ and $s_{d\mathbf{b}}$ to 0. Then, once you have computed your gradients $d\mathbf{w}$ and $d\mathbf{b}$ on the current mini-batch, you compute the momentum and RMSprop terms as follows:

$$\begin{align*} v_{d\mathbf{w}} &= \beta_1 \cdot v_{d\mathbf{w}} + (1 - \beta_1) \cdot d\mathbf{w} \\ v_{d\mathbf{b}} &= \beta_1 \cdot v_{d\mathbf{b}} + (1 - \beta_1) \cdot d\mathbf{b} \\ s_{d\mathbf{w}} &= \beta_2 \cdot s_{d\mathbf{w}} + (1 - \beta_2) \cdot d\mathbf{w}^{2} \\ s_{d\mathbf{b}} &= \beta_2 \cdot s_{d\mathbf{b}} + (1 - \beta_2) \cdot d\mathbf{b}^{2} \end{align*}$$

In the typical implementation of Adam, we do use bias correction:

$$\begin{align*} v_{d\mathbf{w}}^{\text{corrected}} &= \frac{v_{d\mathbf{w}}}{1-\beta_1^{t}} \\ v_{d\mathbf{b}}^{\text{corrected}} &= \frac{v_{d\mathbf{b}}}{1-\beta_1^{t}} \\ s_{d\mathbf{w}}^{\text{corrected}} &= \frac{s_{d\mathbf{w}}}{1-\beta_2^{t}} \\ s_{d\mathbf{b}}^{\text{corrected}} &= \frac{s_{d\mathbf{b}}}{1-\beta_2^{t}} \end{align*}$$

Note that the $v$ terms are corrected with $\beta_1$ and the $s$ terms with $\beta_2$.

Then we update the parameters as follows:

$$\begin{align*} \mathbf{w} &:= \mathbf{w} - \alpha \cdot \frac{v_{d\mathbf{w}}^{\text{corrected}}}{\sqrt{s_{d\mathbf{w}}^{\text{corrected}}} + \epsilon} \\ \mathbf{b} &:= \mathbf{b} - \alpha \cdot \frac{v_{d\mathbf{b}}^{\text{corrected}}}{\sqrt{s_{d\mathbf{b}}^{\text{corrected}}} + \epsilon} \end{align*}$$

This algorithm combines the effect of gradient descent with momentum with the effect of RMSprop. Adam is currently recommended over plain RMSprop because it works well in practice across a wide range of problems.
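Putting the pieces together, a minimal NumPy sketch of one Adam step ($t$ is the 1-indexed iteration count; the default for $\alpha$ below is only a placeholder, since it needs to be tuned):

```python
import numpy as np

def adam_step(w, b, dw, db, v_dw, v_db, s_dw, s_db, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single (w, b) pair; t is the 1-indexed step."""
    # Momentum-style moving averages of the gradients.
    v_dw = beta1 * v_dw + (1 - beta1) * dw
    v_db = beta1 * v_db + (1 - beta1) * db
    # RMSprop-style moving averages of the squared gradients.
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2
    s_db = beta2 * s_db + (1 - beta2) * db ** 2
    # Bias correction for both sets of moving averages.
    v_dw_c, v_db_c = v_dw / (1 - beta1 ** t), v_db / (1 - beta1 ** t)
    s_dw_c, s_db_c = s_dw / (1 - beta2 ** t), s_db / (1 - beta2 ** t)
    # Combined update.
    w = w - alpha * v_dw_c / (np.sqrt(s_dw_c) + eps)
    b = b - alpha * v_db_c / (np.sqrt(s_db_c) + eps)
    return w, b, v_dw, v_db, s_dw, s_db
```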

Hyperparameters for Adam

  • $\alpha$: needs to be tuned
  • $\beta_1$: parameter of the momentum term; 0.9 is recommended by default
  • $\beta_2$: parameter of the RMSprop term; 0.999 is recommended by default
  • $\epsilon$: $10^{-8}$ is recommended by default.