Artificial Intelligence 🤖
Optimization Algorithms
Learning Rate Decay & Local Optima

Learning Rate Decay

Learning rate decay is a technique that gradually reduces the learning rate as training progresses. Taking smaller steps late in training helps the model settle close to the minimum instead of overshooting it with steps that are too large. A common formula for adjusting the learning rate over time is:

\alpha = \frac{1}{1 + \text{decayRate} \times \text{epochNumber}} \cdot \alpha_0

Here, \alpha is the learning rate, \alpha_0 is the initial learning rate, and the decay rate is a hyperparameter that controls how quickly the learning rate shrinks. Note that the decay is applied per epoch (a complete pass over the dataset), not per mini-batch.
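
A minimal sketch of this schedule in Python; the function and argument names (initial_lr, decay_rate, epoch) are illustrative, not from the notes:

```python
def decayed_learning_rate(initial_lr: float, decay_rate: float, epoch: int) -> float:
    """Return alpha = 1 / (1 + decay_rate * epoch) * alpha_0."""
    return initial_lr / (1 + decay_rate * epoch)

# Example: with alpha_0 = 0.2 and decay_rate = 1.0, the learning rate
# shrinks each epoch: 0.1, ~0.067, 0.05, ...
for epoch in range(1, 4):
    print(epoch, decayed_learning_rate(0.2, 1.0, epoch))
```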

With this approach, even though mini-batch gradient descent is inherently noisy, reducing the learning rate makes the oscillations around the cost function's minimum smaller, so the parameters stay confined to a tighter region around it.

Oscillations in a tighter region around the minimum

Variations of Learning Rate Decay

Besides the gradual decay over epochs, there are other strategies, like exponential decay:

\alpha = (0.95)^{\text{epochNumber}} \cdot \alpha_0

Or adjusting based on the square root of the epoch number:

\alpha = \frac{k}{\sqrt{\text{epochNumber}}} \cdot \alpha_0

Where k is another hyperparameter. Each method has its own benefits, and the best choice depends on the specifics of the problem at hand. Some people also apply learning rate decay discretely, for example halving the learning rate after a fixed number of epochs (a "staircase" schedule). The sketch below illustrates these variants.
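
A hedged sketch of the variants above; the function names, default values, and the epochs_per_drop parameter are illustrative assumptions:

```python
import math

def exponential_decay(initial_lr: float, epoch: int, base: float = 0.95) -> float:
    # alpha = base^epochNumber * alpha_0
    return (base ** epoch) * initial_lr

def sqrt_decay(initial_lr: float, epoch: int, k: float = 1.0) -> float:
    # alpha = k / sqrt(epochNumber) * alpha_0, for epoch >= 1
    return (k / math.sqrt(epoch)) * initial_lr

def staircase_decay(initial_lr: float, epoch: int,
                    drop: float = 0.5, epochs_per_drop: int = 10) -> float:
    # Discrete decay: multiply the rate by `drop` every `epochs_per_drop` epochs.
    return initial_lr * (drop ** (epoch // epochs_per_drop))
```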

Local Optima & High-Dimensional Spaces

In deep learning, the fear of getting stuck in local optima is often overstated. These models operate in very high-dimensional parameter spaces, and our intuition from low-dimensional spaces doesn't translate well. For a point with zero gradient to be a local optimum, the cost function has to curve the same way along every one of those dimensions, which is highly unlikely.

Informally, at a point where the gradient is zero, the cost function can curve upward (convex-like) or downward (concave-like) along each direction. In, say, a 20,000-dimensional space, for that point to be a local minimum, all 20,000 directions would need to curve upward, which is extremely unlikely. You are far more likely to get a mix, with some directions curving up and others curving down.
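
Under the informal assumption that each direction is equally likely to curve up or down at a zero-gradient point, the chance that all 20,000 directions curve upward is roughly:

\left(\frac{1}{2}\right)^{20000} = 2^{-20000}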

Instead, saddle points, points where the gradient is zero but the surface curves up in some directions and down in others, are far more common. Plateaus, long stretches where the gradient stays very close to zero, pose a more significant challenge because they cause learning to slow to a crawl. This is where algorithms such as gradient descent with momentum, RMSprop, or Adam can help the parameters move off the plateau faster.
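
As a rough illustration, here is a minimal sketch of a single Adam update, which combines a momentum term with an RMSprop-style rescaling; it shows why even tiny plateau gradients can still yield a usable step size. The function name and the default hyperparameters are the commonly cited values, assumed here rather than taken from the notes:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum: exponentially weighted average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # RMSprop: exponentially weighted average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the early iterations (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is normalized by the running gradient magnitude, so small
    # plateau gradients still produce a reasonably sized update.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```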