Artificial Intelligence 🤖
Practical Aspects of Deep Learning
The Vanishing/Exploding Gradient Problem

Vanishing / Exploding Gradients

The vanishing/exploding gradients problem occurs when your derivatives become very small or very large. To understand the problem, suppose we have a deep neural network with $L$ layers and, for the sake of simplicity, say all the activation functions are linear, $g(z) = z$, and each $b^{[l]} = 0$.

[Figure: demo deep neural network]

Then:

$$\hat{y} = \mathbf{w}^{[L]}\mathbf{w}^{[L-1]}\cdots\mathbf{w}^{[2]}\mathbf{w}^{[1]}\mathbf{X}$$

Say each $\mathbf{w}^{[l]}$ is defined as a little larger than the identity:

$$\mathbf{w}^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$$

Also, technically, since the output layer is a single unit, $\mathbf{w}^{[L]}$ will be of shape $(1, 2)$. So, leaving that aside, we can write:

$$\hat{y} = \mathbf{w}^{[L]} \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}^{L-1} \mathbf{X}$$

If $L$ is large (a very deep neural network), then $\mathbf{w}^{[L]}$ will be multiplied by an exponentially large number, since the matrix power $1.5^{L-1}$ grows exponentially with $L$. This is the problem of vanishing/exploding gradients: the example shows that the activations (and similarly the derivatives) decrease or increase exponentially as a function of the number of layers.

So if:

  • $\mathbf{w}^{[l]} > I$ (the identity matrix), the activations and gradients will explode.
  • $\mathbf{w}^{[l]} < I$ (the identity matrix), the activations and gradients will vanish, as the numeric sketch below illustrates.
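
To make the exponential effect concrete, here is a minimal numeric sketch (the 2-unit layer width and $L = 50$ are assumed for illustration, not taken from the notes):

```python
import numpy as np

# Toy forward pass through L - 1 identical 2x2 linear layers, with weights
# slightly above or below the identity.
L = 50
x = np.array([1.0, 1.0])

for scale, label in [(1.5, "explodes"), (0.5, "vanishes")]:
    W = np.array([[scale, 0.0], [0.0, scale]])  # w^[l]
    a = x
    for _ in range(L - 1):
        a = W @ a                               # linear activation: a = g(z) = z
    print(f"{label}: ||a|| after {L - 1} layers = {np.linalg.norm(a):.3e}")
    # 1.5^49 grows to ~1e8 scale; 0.5^49 shrinks to ~1e-15 scale
```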

So use a careful choice of random weight initialization to significantly reduce this problem.

Microsoft trained a 152-layer network (ResNet), which is a really large number of layers. With such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of $L$, these values can get really big or really small, and this makes training difficult. In particular, if your gradients are exponentially small in $L$, gradient descent will take tiny little steps and it will take a long time to learn anything.

There is one partial solution that doesn't completely solve this problem but helps a lot: a careful choice of how you initialize the weights. Skip connections will be covered later.

Weight Initialization for Deep Networks

A partial solution to the vanishing/exploding gradients problem in neural networks is a better, more careful choice of the random initialization of the weights. For a larger $n^{[l-1]}$ (the number of inputs into layer $l$), we need smaller individual $w$'s to prevent $z$ from blowing up.

Take this single neuron (Perceptron model):

[Figure: single neuron with four inputs]

The output of this neuron is:

$$z = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4$$

So if $n^{[l-1]}$ (the number of inputs) is large, we want the $w_i$'s to be smaller so that $z$ does not blow up.
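
To see why, a quick variance calculation helps (assuming the $w_i$ and $x_i$ are independent with zero mean and the $x_i$ have unit variance):

$$\mathrm{var}(z) = \sum_{i=1}^{n^{[l-1]}} \mathrm{var}(w_i x_i) = n^{[l-1]}\,\mathrm{var}(w_i)\,\mathrm{var}(x_i) = n^{[l-1]}\,\mathrm{var}(w_i)$$

So keeping $\mathrm{var}(z) \approx 1$ requires $\mathrm{var}(w_i) \approx 1/n^{[l-1]}$, which is exactly the scaling used below.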

He/Xavier Initialization

One reasonable thing to do, especially when using a tanh activation, is to set the variance to $1/n^{[l-1]}$:

$$\mathrm{var}(\mathbf{w}^{[l]}) = \frac{1}{n^{[l-1]}}$$

In practice, you can set each layer like this:

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(1 / n[l-1])

When using the ReLU activation function, as proposed by He et al., you can set the variance to $2/n^{[l-1]}$:

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(2 / n[l-1])

If the input features or activations are roughly mean 0 and variance 1, then this causes $z$ to also take on a similar scale, which helps with vanishing/exploding gradients.
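
A quick empirical check of that claim (the fan-in of 1000 and the standard-normal inputs are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev = 1000                                    # n^[l-1], the fan-in
a_prev = rng.standard_normal((n_prev, 10000))    # inputs with mean 0, variance 1

W_naive = rng.standard_normal((1, n_prev))                      # no scaling
W_he = rng.standard_normal((1, n_prev)) * np.sqrt(2 / n_prev)   # He scaling

print(np.std(W_naive @ a_prev))  # ~sqrt(1000) ≈ 31.6: z blows up with large fan-in
print(np.std(W_he @ a_prev))     # ~sqrt(2) ≈ 1.41: z stays on a similar scale
```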

Or a variation of this (Bengio et al.):

$$\mathrm{var}(\mathbf{w}^{[l]}) = \frac{2}{n^{[l-1]} + n^{[l]}}$$

In practice, all of these give you a starting point. The numerator (the 1 or 2) can also be treated as a hyperparameter to tune (though not one of the first to start with), and you can also tune a multiplier on the whole formula. This combination (ReLU + weight initialization with the right variance) is one of the best ways of partially addressing the vanishing/exploding gradients problem, helping the gradients not to vanish or explode too quickly.
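
To tie the variants together, here is a minimal sketch of a layer-by-layer initializer (the function name `initialize_parameters`, the `method` names, and the `numerator` knob are illustrative choices, not from the notes):

```python
import numpy as np

def initialize_parameters(layer_dims, method="he", numerator=None, seed=1):
    """Sketch of weight initialization; layer_dims = [n^[0], n^[1], ..., n^[L]].

    `numerator` optionally overrides the 1 or 2 in the variance formula,
    treating it as a tunable hyperparameter.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        if method == "he":          # ReLU layers: var(w) = 2 / n^[l-1]
            var = (numerator or 2) / n_prev
        elif method == "xavier":    # tanh layers: var(w) = 1 / n^[l-1]
            var = (numerator or 1) / n_prev
        else:                       # "bengio": var(w) = 2 / (n^[l-1] + n^[l])
            var = (numerator or 2) / (n_prev + n_curr)
        parameters[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(var)
        parameters[f"b{l}"] = np.zeros((n_curr, 1))  # b^[l] starts at zero
    return parameters

params = initialize_parameters([5, 4, 3, 1], method="he")
print(params["W1"].std())  # roughly sqrt(2/5) ≈ 0.63
```

Exposing `numerator` is just one way to make the 1 or 2 above a tunable multiplier, as mentioned in the paragraph before this sketch.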