Batch, Stochastic, and Mini-Batch Gradient Descent
When training machine learning models, especially neural networks, the choice of gradient descent technique significantly impacts the speed and effectiveness of the learning process. Here's a comparison of three primary methods: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.
Batch Gradient Descent
Batch gradient descent uses the entire training set to compute the gradient of the cost function for each iteration. This "full sweep" has its pros and cons (a minimal sketch follows the list below):
- Pros: It produces a stable, low-noise error gradient and steady convergence, which allows relatively large steps.
- Cons: It can be very slow per epoch and computationally expensive, especially with large datasets.
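As a rough illustration, here is what a batch gradient descent loop might look like for a simple linear model with a squared-error cost. The model, data shapes, and learning rate are illustrative assumptions, not anything specified above.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    """Batch GD for a toy linear model with squared-error cost.

    X: (m, n) feature matrix, y: (m,) targets. Illustrative sketch only.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        # One parameter update per epoch: the gradient uses ALL m examples.
        error = X @ w + b - y
        w -= lr * (X.T @ error) / m
        b -= lr * error.mean()
    return w, b
```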
Stochastic Gradient Descent
Stochastic gradient descent (SGD), on the other hand, updates the model's parameters using only one training example at a time (sketched after the list below).
- Pros: It's much faster and can help to escape local minima more effectively than batch gradient descent.
- Cons: The frequent updates result in significant variance during training, leading to a fluctuating cost function that never really settles on a minimum. We also lose the speedup from vectorization. The noisiness can be reduced by using a smaller learning rate, but that also slows down the learning process.
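For contrast, here is a minimal SGD sketch for the same toy linear model. The parameters are updated after every single example, so the inner loop cannot be vectorized across the training set; the learning rate and shuffling scheme are illustrative assumptions.

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, epochs=10, seed=0):
    """SGD for the toy linear model: one example per parameter update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):      # visit examples in a new random order each epoch
            error = X[i] @ w + b - y[i]
            w -= lr * error * X[i]        # gradient estimated from a single example
            b -= lr * error
    return w, b
```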
Mini-Batch Gradient Descent
Mini-batch gradient descent strikes a balance between the stability of batch gradient descent and the speed of SGD by using small subsets of the training data. On average, each update moves in a good direction (a sketch follows this list).
- Pros: It offers a compromise between the efficiency of batch gradient descent and the agility of SGD, leading to faster convergence in practice.
- Cons: The cost function tends not to decrease monotonically but noisily. It also never actually reaches a minimum, but rather fluctuates slightly around it. Compared to full-batch processing, we also give up some of the vectorization benefit, since each update is vectorized only over a mini-batch.
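A sketch of mini-batch gradient descent for the same toy model follows; the batch size, learning rate, and reshuffling scheme are illustrative choices, not prescriptions from this section.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, lr=0.1, epochs=10, batch_size=64, seed=0):
    """Mini-batch GD: many vectorized parameter updates per epoch, one per mini-batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                  # reshuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            error = Xb @ w + b - yb
            w -= lr * (Xb.T @ error) / len(idx)     # gradient from this mini-batch only
            b -= lr * error.mean()
    return w, b
```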
Mini-Batch Gradient Descent Intuition
Training neural networks can be time-consuming, especially with a massive dataset. For instance, if you have 50 million data points, processing them all at once isn't just slow; it may not even fit in memory. A solution is to use mini-batch gradient descent, which begins optimizing and taking steps before processing the entire dataset. Take m = 50 million: if we split it into mini-batches of size 1,000, we get 50,000 mini-batches.
Each mini-batch is then used in one iteration of training. So, rather than taking one huge step after going through all 50 million data points, you take many small steps—one for each mini-batch.
Each X^{t} represents a different mini-batch used in one iteration of training (we similarly split Y into Y^{t}). The training set is divided into batches of a certain size (in this case, 1,000 examples per batch), and each batch is used sequentially during the training process.
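The split itself might look something like the sketch below, which assumes the columns-as-examples layout implied by the X^{t}, Y^{t} notation; the function name and shapes are illustrative.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Shuffle and partition (X, Y) into mini-batches (X^{t}, Y^{t}).

    Assumes X has shape (n_features, m) and Y has shape (1, m), i.e. one column per example.
    With m = 50,000,000 and batch_size = 1,000 this would yield 50,000 mini-batches.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    order = rng.permutation(m)                      # shuffle X and Y with the same permutation
    X_shuf, Y_shuf = X[:, order], Y[:, order]
    batches = []
    for start in range(0, m, batch_size):
        X_t = X_shuf[:, start:start + batch_size]
        Y_t = Y_shuf[:, start:start + batch_size]
        batches.append((X_t, Y_t))
    return batches
```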
Batch vs Mini-Batch
With batch gradient descent, the entire dataset is used for each step, which means only one gradient descent update per epoch. With mini-batch gradient descent, a single pass through the training set, that is one epoch, allows you to take 50,000 gradient descent steps (one per mini-batch), which can significantly speed up the learning process.
Overall, in batch gradient descent we run gradient descent on the whole dataset at once, while in mini-batch gradient descent we run it on the mini-batches one after another.
What to Expect with Mini-Batch Gradient Descent
Unlike batch gradient descent, where the cost decreases smoothly, mini-batch gradient descent's cost graph is noisier because each mini-batch provides only an approximation of the gradient over the full dataset. These fluctuations are normal due to the variability in each mini-batch.
The key takeaway is that while mini-batch gradient descent introduces noise into the optimization process, it generally moves towards lower costs, resulting in faster training compared to using the whole dataset at once. Mini-batch gradient descent overall strikes a balance between the efficiency of batch gradient descent and the speed of stochastic gradient descent, making it a practical choice for large datasets.
Comparison
Overall, you can see how each of these methods differs:
- Batch gradient descent:
  - Too long per iteration (epoch)
- Stochastic gradient descent:
  - Too noisy regarding cost minimization (can be reduced by using a smaller learning rate)
  - Won't ever converge (reach the minimum cost)
  - Loses the speedup from vectorization
- Mini-batch gradient descent:
  - Faster learning
  - Keeps the vectorization advantage
  - Makes progress without waiting to process the entire training set
  - Doesn't always exactly converge (oscillates in a very small region, but you can reduce the learning rate)
How to Choose the Right Mini-Batch Size
Choosing the correct mini-batch size is crucial:
- For small datasets (less than 2000 examples), batch gradient descent is usually preferable.
- For larger datasets, mini-batch sizes typically are a power of two (64, 128, 256, ...). This takes advantage of the hardware's memory architecture and can result in faster computation.
- Ensure that the mini-batch fits in your CPU/GPU memory. This can vary depending on the specific application and model size.
- Remember, the size of your mini-batch is a hyperparameter. You may need to experiment to find the best size for your particular problem (a toy sweep is sketched after this list).
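As an illustration of treating the mini-batch size as a hyperparameter, the toy sweep below tries a few power-of-two sizes on synthetic data and keeps the one with the lowest validation error. The data, model, and settings are made up purely for demonstration.

```python
import numpy as np

# Synthetic regression data for demonstration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)
X_train, y_train = X[:8_000], y[:8_000]
X_val, y_val = X[8_000:], y[8_000:]

def train(Xt, yt, batch_size, lr=0.05, epochs=5):
    """Mini-batch GD on a toy linear model, returning the learned parameters."""
    m, n = Xt.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            err = Xt[idx] @ w + b - yt[idx]
            w -= lr * (Xt[idx].T @ err) / len(idx)
            b -= lr * err.mean()
    return w, b

val_mse = {}
for batch_size in (64, 128, 256, 512):             # power-of-two candidates
    w, b = train(X_train, y_train, batch_size)
    val_mse[batch_size] = np.mean((X_val @ w + b - y_val) ** 2)

best = min(val_mse, key=val_mse.get)               # lowest validation error wins
print(val_mse, "best mini-batch size:", best)
```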
Somewhat counter-intuitively, small batch sizes tend not to get stuck in local minima, while large batch sizes tend to get stuck, at random, in "local minima" instead of the correct solution. Large learning rates can overshoot the correct solution, and small learning rates increase training time.
In summary, while batch gradient descent is consistent and stable, stochastic gradient descent is noisy but fast. Mini-batch gradient descent tries to harness the benefits of both. It's typically the method of choice due to its computational efficiency and convergence speed, especially when dealing with larger datasets. These methods, however, rely solely on the gradient of the cost function computed from the current batch of data; they don't take the history of gradients into account. That is where momentum, RMSprop, and Adam optimization come in.