Artificial Intelligence 🤖
Hyperparameter Tuning & Batch Normalization
Hyperparameter Tuning Process

Hyperparameter Tuning

Hyperparameter tuning involves tweaking the settings that govern how our model learns. The right adjustments can mean the difference between an average model and a highly accurate one, as well as the difference between a model that trains quickly and one that takes forever.

Key Hyperparameters

Here's Andrew Ng's simplified priority list for hyperparameters:

  • Learning rate: α
  • Momentum term: β
  • Mini-batch size
  • Number of hidden units: n^[l]
  • Number of layers: L
  • Learning rate decay: α_decay
  • Regularization parameter: λ
  • Activation function: g(z)
  • Adam parameters: β1, β2, and ε
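
To make the list concrete, here's one way these priorities could be organized as a search space in Python; the ranges below are illustrative assumptions, not recommendations:

# Illustrative hyperparameter search space, ordered roughly by priority.
# All ranges are assumptions for demonstration purposes.
search_space = {
    "learning_rate": (1e-4, 1e-1),        # sample on a log scale
    "momentum_beta": (0.9, 0.999),        # sample 1 - beta on a log scale
    "mini_batch_size": [32, 64, 128, 256],
    "hidden_units_per_layer": (50, 200),  # n^[l]
    "num_layers": (2, 5),                 # L
}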

Tuning Practices

There's no one-size-fits-all ranking of hyperparameter importance; it depends on the particular problem, and it is hard to know in advance which hyperparameters are going to matter most. Still, there are some effective strategies for finding good settings:

  • Grid Search: Start by picking a few values for each hyperparameter and train models with all possible combinations. It's systematic but can be inefficient.
    • This is common practice in machine learning, but when dealing with a large number of hyperparameters it is better to choose the points at random rather than on a grid
  • Random Sampling: Instead of a grid, try random combinations. This method often finds a good mix faster than a grid search.
  • Coarse to Fine Sampling: Narrow down the search after identifying promising hyperparameter ranges, then sample more densely within this refined space (sketched in code below).

Hyperparameter tuning visualization

After zooming into a promising region, try more combinations (shown in red), again sampled at random, to find the best hyperparameters.
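
As a rough sketch of random sampling followed by a coarse-to-fine pass, the snippet below uses a toy evaluate function as a stand-in for a real train-and-validate loop; the ranges and scoring function are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)

def sample_config(lr_range, batch_sizes):
    # Draw one random combination; the learning rate is sampled on a log scale
    lr = 10 ** rng.uniform(np.log10(lr_range[0]), np.log10(lr_range[1]))
    return lr, int(rng.choice(batch_sizes))

def evaluate(lr, batch_size):
    # Toy stand-in for training + validation; replace with your own pipeline
    return -(np.log10(lr) + 2.5) ** 2 - 0.001 * batch_size

# Coarse pass: 20 random combinations over the full range
coarse = [sample_config((1e-4, 1e-1), [32, 64, 128, 256]) for _ in range(20)]
best_lr, best_bs = max(coarse, key=lambda c: evaluate(*c))

# Fine pass: sample more densely around the best coarse point
fine = [sample_config((best_lr / 3, best_lr * 3), [best_bs]) for _ in range(20)]
best_lr, best_bs = max(fine, key=lambda c: evaluate(*c))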

Choosing the Right Scale

Sampling at random over the range of hyperparameters allows us to search the space more efficiently than a grid. However, sampling at random doesn't mean sampling uniformly over the range of valid values: it is important to pick an appropriate scale on which to explore each hyperparameter. It is often better to search on a logarithmic scale rather than a linear one. Say we were searching between a and b:

import numpy as np

a, b = 0.0001, 1  # the search range from the example
a_log = np.log10(a)  # a = 0.0001 gives a_log = -4
b_log = np.log10(b)  # b = 1 gives b_log = 0
# Sample r uniformly between a_log and b_log:
# np.random.rand() is uniform on [0, 1), so here r falls in [-4, 0]
r = (a_log - b_log) * np.random.rand() + b_log
result = 10 ** r  # uniform in log scale over [a, b]

This samples values uniformly on a log scale over [a, b]. Say we want to use this method for exploring the momentum β. β's best range is [0.9, 0.999], so you should search for 1 − β in the range [0.001, 0.1], using a = 0.001 and b = 0.1.

Then:

a_log = -3  # log10(0.001)
b_log = -1  # log10(0.1)
r = (a_log - b_log) * np.random.rand() + b_log  # uniform in [-3, -1]
beta = 1 - 10 ** r  # because 1 - beta = 10**r

Why is it such a bad idea to sample on a linear scale? Because when beta is close to 1, the results become extremely sensitive to even tiny changes in beta. Moving beta from 0.900 to 0.9005 has little impact (both average over roughly 10 samples), but moving from 0.999 to 0.9995 doubles the effective averaging window from 1000 to 2000 samples. In other words, 1/(1 − β) is very sensitive to small changes in beta when beta is close to 1. Sampling on a log scale lets you sample more densely where beta is close to 1, distributing the samples more efficiently across the space of possible outcomes.
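
To make the arithmetic concrete, these few lines reproduce the numbers above:

# Effective averaging window 1/(1 - beta) for the values discussed above
for beta in [0.900, 0.9005, 0.999, 0.9995]:
    print(f"beta = {beta}: averages over ~{1 / (1 - beta):.1f} samples")
# beta = 0.9:    ~10.0 samples
# beta = 0.9005: ~10.1 samples
# beta = 0.999:  ~1000.0 samples
# beta = 0.9995: ~2000.0 samples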

Hyperparameter Tuning Strategies: Pandas vs. Caviar

Intuitions about hyperparameter settings from one application area may or may not transfer to another. Hyperparameters can also go stale, so it is worth re-evaluating them every few months. Finally, your approach to tuning may also depend on your computational resources:

  • The Panda Approach: If resources are limited, you might need to monitor and adjust a single model's learning process manually, babysitting it as it trains.
  • The Caviar Approach: With more resources, you can afford to train many models in parallel with different hyperparameters and pick the best performing one, as sketched below.
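
A minimal sketch of the Caviar approach, assuming a hypothetical train_and_score function in place of a real training run:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def train_and_score(lr):
    # Hypothetical stand-in: train a model with this learning rate and
    # return its dev-set score
    return -(np.log10(lr) + 2.5) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    lrs = 10 ** rng.uniform(-4, -1, size=8)  # 8 random learning rates, log scale
    # Train all candidates in parallel and keep the best performer
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(train_and_score, lrs))
    best_lr = lrs[int(np.argmax(scores))]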

In summary, hyperparameter tuning is part art, part science, and an essential part of developing effective machine learning models. The key is to use systematic exploration and leverage computational resources wisely to refine the learning process.