Artificial Intelligence 🤖
Hyperparameter Tuning & Batch Normalization
Batch Normalization

In machine learning, especially when dealing with deep neural networks, we often normalize the input features to aid in faster learning. Batch Normalization extends this concept to internal layers of the network, where it normalizes the activations of each layer before passing them to the next one. This can make a significant difference in how easy it is to train deep networks.

Normalizing Activations

Batch Normalization, developed by Sergey Ioffe and Christian Szegedy, is a technique to standardize the inputs to a layer for each mini-batch. This standardization helps to deal with the issue of internal covariate shift, where the distribution of each layer's inputs changes as the parameters of the previous layers change, slowing down training.

Batch normalization is another technique that can make your neural network much more robust to the choice of hyperparameters. Before, we normalized the input by subtracting the mean and dividing by the variance. This helped a lot with the shape of the cost function and with reaching the minimum point faster. The question is then:

💡

For any hidden layer $l$, can we normalize $\mathbf{a}^{[l]}$ to train $\mathbf{w}^{[l]}$, $\mathbf{b}^{[l]}$ faster?

This is what batch normalization is about - we normalize the activations the same way we did the inputs! There is some debate in the deep learning literature about whether you should normalize the values before the activation function, i.e. $\mathbf{z}^{[l]}$, or after applying the activation function, i.e. $\mathbf{a}^{[l]}$. In practice, normalizing $\mathbf{z}^{[l]}$ is done much more often, and that is what Andrew Ng presents.

In general, given $\mathbf{Z}^{[l]} = [\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(m)}]$, for each input/training example $i \in [1, m]$:

  1. Calculate the mini-batch mean:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} \mathbf{z}^{(i)}$$
  2. Compute the mini-batch variance:
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (\mathbf{z}^{(i)} - \mu)^2$$
  3. Normalize the activations for each mini-batch:
$$\mathbf{z}^{(i)}_{norm} = \frac{\mathbf{z}^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

We add $\epsilon$ for numerical stability in case $\sigma^2 = 0$. This forces the inputs to a distribution with zero mean and variance of 1.

  4. Scale and shift the normalized activations:
$$\tilde{\mathbf{z}}^{(i)} = \gamma \cdot \mathbf{z}^{(i)}_{norm} + \beta$$

This is to make the inputs belong to another distribution (with another mean and variance). $\gamma$ and $\beta$ are learnable parameters of the model; we are essentially letting the NN learn the distribution of the outputs. Note, though, that if $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$, then $\tilde{\mathbf{z}}^{(i)} = \mathbf{z}^{(i)}$.

Overall, we use $\tilde{\mathbf{z}}^{(i)}$ instead of $\mathbf{z}^{(i)}$ to normalize the mean and variance.
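To make the four steps concrete, here is a minimal NumPy sketch (the function name, the one-column-per-example layout, and the value of $\epsilon$ are illustrative assumptions, not something prescribed by the course):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """One pass of the four steps above for a single layer.
    Z has shape (n_units, m): one column per example in the mini-batch."""
    mu = np.mean(Z, axis=1, keepdims=True)         # 1. mini-batch mean
    sigma2 = np.var(Z, axis=1, keepdims=True)      # 2. mini-batch variance
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)      # 3. normalize (eps avoids division by zero)
    return gamma * Z_norm + beta                   # 4. scale and shift with learnable gamma, beta

# Toy usage: 3 hidden units, mini-batch of 4 examples
Z = np.random.randn(3, 4) * 5 + 2
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))
print(batch_norm_forward(Z, gamma, beta))          # roughly zero mean, unit variance per unit
```

With $\gamma = 1$ and $\beta = 0$ this reproduces the plain normalization; learned values of $\gamma$ and $\beta$ move the result to whatever mean and variance the network prefers.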

Why Gamma and Beta?

The parameters $\gamma$ and $\beta$ allow the model to undo the normalization if that is what the optimal model requires. For instance, with activation functions like the sigmoid, you may not want all your values centered around $x = 0$. You might want them to have a larger variance, or a mean different from $0$, in order to better take advantage of the nonlinearity of the sigmoid function rather than have all your values sit in its linear regime.

[Figure: the sigmoid function and its derivative]

Batch Normalization in Practice

Batch normalization is usually applied with mini-batches. Batch Norm on each mini-batch uses the mean and variance of $\mathbf{z}^{[l]}$ from just that one mini-batch. Using Batch Norm iteratively looks something like this:

  1. Take the input $\mathbf{x}^{(i)}$ and feed it to the first hidden layer.
  2. Use $\mathbf{w}^{[1]}$ and $\mathbf{b}^{[1]}$ to compute $\mathbf{z}^{[1]}$.
  3. Use Batch Norm with $\gamma$ and $\beta$ to compute $\tilde{\mathbf{z}}^{[1]}$.
  4. Then feed it to the activation function to get $\mathbf{a}^{[1]}$.
  5. Then feed it to the next hidden layer, and so on.
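As a rough sketch, one such layer could look like this in NumPy (assuming a ReLU activation; the function names and shapes are illustrative, and the bias term is already omitted for reasons explained below):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer_forward_with_bn(a_prev, W, gamma, beta, eps=1e-8):
    """Steps 2-4 for one hidden layer: linear part, Batch Norm, then activation."""
    z = W @ a_prev                                            # linear part (no bias term)
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_tilde = gamma * (z - mu) / np.sqrt(var + eps) + beta    # Batch Norm with gamma, beta
    return relu(z_tilde)                                      # feed z-tilde to the activation

# Toy usage: 2 input features -> 3 hidden units, mini-batch of 4 examples
X = np.random.randn(2, 4)
W1 = np.random.randn(3, 2) * 0.1
gamma1, beta1 = np.ones((3, 1)), np.zeros((3, 1))
a1 = layer_forward_with_bn(X, W1, gamma1, beta1)              # a1 then feeds the next layer
```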

Our NN parameters will be:

  • $\mathbf{w}^{[l]}$, $\gamma^{[l]}$, $\beta^{[l]}$ for $l \in [1, L]$
  • $\beta^{[l]}$ and $\gamma^{[l]}$ are updated using any optimization algorithm (like GD, RMSprop, Adam)
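For example, a single plain gradient-descent step for one layer's parameters might look like this (names and shapes are illustrative; RMSprop or Adam would simply replace the update rule):

```python
import numpy as np

learning_rate = 0.01
W1, gamma1, beta1 = np.random.randn(3, 2) * 0.1, np.ones((3, 1)), np.zeros((3, 1))
dW1, dgamma1, dbeta1 = np.zeros_like(W1), np.zeros_like(gamma1), np.zeros_like(beta1)  # from backprop

W1     -= learning_rate * dW1          # update the weights
gamma1 -= learning_rate * dgamma1      # update the Batch Norm scale
beta1  -= learning_rate * dbeta1       # update the Batch Norm shift (replaces the old bias)
```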

If you are using a deep learning framework, you won't have to implement batch norm yourself, e.g. in TensorFlow you just use tf.nn.batch_normalization(). Also, note that we didn't include $\mathbf{b}^{[l]}$ in the batch normalization parameters. It will no longer do anything, as it is eliminated by the mean subtraction step.
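As an illustration, a Keras-style sketch (assuming TensorFlow 2.x; the layer sizes are arbitrary) that puts Batch Norm between the linear part and the activation, with the bias switched off:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, use_bias=False),     # no b: Batch Norm's beta takes its place
    tf.keras.layers.BatchNormalization(),          # learns gamma (scale) and beta (offset)
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```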

Batch Norm looks at the mini-batch and normalizes $\mathbf{z}^{[l]}$ to (initially) a mean of 0 and a variance of 1, and then rescales it by $\gamma^{[l]}$ and $\beta^{[l]}$. This means the value of $\mathbf{b}^{[l]}$ is simply subtracted out, as Batch Normalization subtracts the mean: adding any constant to all of the examples in the mini-batch changes nothing, because that constant is cancelled by the mean subtraction step. So, if you're using Batch Norm, you can eliminate that parameter, or, if you want, think of it as set permanently to 0. In forward prop, you can just drop $\mathbf{b}^{[l]}$ and compute $\mathbf{z}^{[l]} = \mathbf{w}^{[l]} \mathbf{a}^{[l-1]}$. In backprop, you can eliminate $d\mathbf{b}^{[l]}$ and just compute $d\mathbf{w}^{[l]}$, alongside $d\gamma^{[l]}$ and $d\beta^{[l]}$.
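A quick numeric check of this cancellation (NumPy, with illustrative shapes):

```python
import numpy as np

Z = np.random.randn(3, 4)      # pre-activations for a mini-batch of 4 examples
b = 3.7                        # any constant added to every example in the mini-batch

def normalize(Z, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

# The mean subtraction removes the constant, so the normalized values are identical
print(np.allclose(normalize(Z), normalize(Z + b)))   # True
```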

Why Does Batch Normalization Work?

  1. The first reason is the same as why we normalize the inputs $\mathbf{X}$: it improves the shape of the cost function and speeds up learning.

  2. The second reason is that normalizing the inputs to layers within the network reduces the amount by which those inputs shift around (internal covariate shift):

    • It makes the weights later or deeper in your network, say the weights of layer 10, more robust to changes in the weights of earlier layers, say layer 1.
    • It limits the amount by which updating the parameters in the earlier layers affects the distribution of values that the next layer sees and therefore has to learn on.
    • From the perspective of the nodes in a given layer, the distribution of their inputs keeps changing (covariate shift) as the weights and biases upstream are continually updated.
    • With Batch Norm the inputs are more stable and change less, so later layers have less to re-learn.
    • Batch Norm reduces the amount by which the distribution of these hidden-unit values shifts around.
    • It allows for higher learning rates: normalization has a stabilizing effect, which can let you use a higher learning rate and potentially speed up learning.
  3. Batch normalization also provides some regularization:

    • Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
    • This adds some noise to the values of $\mathbf{z}^{[l]}$ within that mini-batch. So, similarly to dropout, it adds some noise to each hidden layer's activations.
    • This gives a slight regularization effect.
    • Using a bigger mini-batch size reduces the noise and therefore the regularization effect.
    • Don't rely on batch normalization as a regularizer. It is intended to normalize hidden-unit activations and thereby speed up learning; for regularization, we should still use dedicated techniques (L2 or dropout).

Batch Normalization During Testing

In a testing scenario, you might not be processing mini-batches, so you can't compute the mean and variance per batch. Instead, you use the entire population's estimated mean and variance, often calculated during training using an exponentially weighted average.
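A minimal sketch of how this might look (assuming NumPy, a momentum of 0.9 for the exponentially weighted average, and one column per example; all of these are illustrative choices):

```python
import numpy as np

class BatchNorm:
    """Mini-batch statistics during training; running estimates at test time."""

    def __init__(self, n_units, momentum=0.9, eps=1e-8):
        self.gamma = np.ones((n_units, 1))          # learnable scale
        self.beta = np.zeros((n_units, 1))          # learnable shift
        self.running_mu = np.zeros((n_units, 1))    # estimate of the population mean
        self.running_var = np.ones((n_units, 1))    # estimate of the population variance
        self.momentum, self.eps = momentum, eps

    def forward(self, Z, training=True):
        if training:
            mu = Z.mean(axis=1, keepdims=True)
            var = Z.var(axis=1, keepdims=True)
            # Exponentially weighted averages of the mini-batch statistics
            self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Test time: no mini-batch, so reuse the running estimates
            mu, var = self.running_mu, self.running_var
        return self.gamma * (Z - mu) / np.sqrt(var + self.eps) + self.beta
```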

In modern deep learning frameworks, functions like tf.nn.batch_normalization() in TensorFlow automate this process, handling the necessary calculations for both training and testing phases.