Artificial Intelligence 🤖
Practical Aspects of Deep Learning
The Vanishing/Exploding Gradient Problem

Vanishing / Exploding Gradients

The vanishing/exploding gradients problem occurs when your derivatives become very small or very large. To understand the problem, suppose we have a deep neural network with $L$ layers and, for the sake of simplicity, say all the activation functions are linear, $g(z) = z$, and each $b^{[l]} = 0$.

[Figure: demo deep neural network]

Then:

$$\hat{y} = \mathbf{w}^{[L]}\mathbf{w}^{[L-1]}\cdots\mathbf{w}^{[2]}\mathbf{w}^{[1]}\mathbf{X}$$

Say each $\mathbf{w}^{[l]}$ is defined as a little larger than the identity:

$$\mathbf{w}^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$$

Also, technically, since the output layer is a single unit, $\mathbf{w}^{[L]}$ will be of shape $(1, 2)$. So, leaving that aside, we can write:

$$\hat{y} = \mathbf{w}^{[L]} \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}^{L-1} \mathbf{X}$$

If $L$ is large (a very deep neural network), then $\mathbf{w}^{[L]}$ will be multiplied by an exponentially large number, since the matrix power $1.5^{L-1}$ grows exponentially with $L$. This is the problem of vanishing/exploding gradients: the example shows that the activations (and similarly the derivatives) decrease or increase exponentially as a function of the number of layers.

So if:

  • $\mathbf{w}^{[l]} > I$ (the identity matrix), the activations and gradients will explode.
  • $\mathbf{w}^{[l]} < I$ (the identity matrix), the activations and gradients will vanish, as the numeric sketch below illustrates.
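
To make the exponential effect concrete, here is a minimal numeric sketch (the 2-unit layer width and $L = 50$ are assumed for illustration, not taken from the notes):

```python
import numpy as np

# Toy forward pass through L - 1 identical 2x2 linear layers, with weights
# slightly above or below the identity.
L = 50
x = np.array([1.0, 1.0])

for scale, label in [(1.5, "explodes"), (0.5, "vanishes")]:
    W = np.array([[scale, 0.0], [0.0, scale]])  # w^[l]
    a = x
    for _ in range(L - 1):
        a = W @ a                               # linear activation: a = g(z) = z
    print(f"{label}: ||a|| after {L - 1} layers = {np.linalg.norm(a):.3e}")
    # 1.5^49 grows to ~1e8 scale; 0.5^49 shrinks to ~1e-15 scale
```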

So use a careful choice of random weight initialization to significantly reduce this problem.

Microsoft trained a 152-layer network (ResNet), which is a really large number of layers. With such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of $L$, these values can get really big or really small, and this makes training difficult. In particular, if your gradients are exponentially small in $L$, gradient descent will take tiny little steps and it will take a long time to learn anything.

There is one partial solution that doesn't completely solve this problem but helps a lot: a careful choice of how you initialize the weights. Skip connections will be covered later.

Weight Initialization for Deep Networks

A partial solution to the vanishing/exploding gradients problem in neural networks is a better, more careful choice of the random initialization of the weights. For a larger $n^{[l-1]}$ (the number of inputs into layer $l$), we need smaller individual $w$'s to prevent $z$ from blowing up.

Take this single neuron (Perceptron model):

[Figure: single neuron with four inputs]

The output of this neuron is:

$$z = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4$$

So if $n^{[l-1]}$ (the number of inputs) is large, we want the $w_i$'s to be smaller so that $z$ does not blow up.
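
To see why, a quick variance calculation helps (assuming the $w_i$ and $x_i$ are independent with zero mean and the $x_i$ have unit variance):

$$\mathrm{var}(z) = \sum_{i=1}^{n^{[l-1]}} \mathrm{var}(w_i x_i) = n^{[l-1]}\,\mathrm{var}(w_i)\,\mathrm{var}(x_i) = n^{[l-1]}\,\mathrm{var}(w_i)$$

So keeping $\mathrm{var}(z) \approx 1$ requires $\mathrm{var}(w_i) \approx 1/n^{[l-1]}$, which is exactly the scaling used below.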

He/Xavier Initialization

One reasonable thing to do, especially when using a tanh activation, is to set the variance to $1/n^{[l-1]}$:

$$\mathrm{var}(\mathbf{w}^{[l]}) = \frac{1}{n^{[l-1]}}$$

In practice, you can set each layer like this:

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(1 / n[l-1])

When using the ReLU activation function, as proposed by He et al., you can set the variance to $2/n^{[l-1]}$:

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(2 / n[l-1])

If the input features or activations are roughly mean 0 and variance 1, then this causes $z$ to also take on a similar scale, which helps with vanishing/exploding gradients.
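
A quick empirical check of that claim (the fan-in of 1000 and the standard-normal inputs are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev = 1000                                    # n^[l-1], the fan-in
a_prev = rng.standard_normal((n_prev, 10000))    # inputs with mean 0, variance 1

W_naive = rng.standard_normal((1, n_prev))                      # no scaling
W_he = rng.standard_normal((1, n_prev)) * np.sqrt(2 / n_prev)   # He scaling

print(np.std(W_naive @ a_prev))  # ~sqrt(1000) ≈ 31.6: z blows up with large fan-in
print(np.std(W_he @ a_prev))     # ~sqrt(2) ≈ 1.41: z stays on a similar scale
```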

Or a variation of this (Bengio et al.):

$$\mathrm{var}(\mathbf{w}^{[l]}) = \frac{2}{n^{[l-1]} + n^{[l]}}$$

In practice, all of these give you a starting point. The numerator (the 1 or 2) can also be treated as a hyperparameter to tune (though not one of the first to start with), and you can also tune a multiplier on the whole formula. This combination (ReLU + weight initialization with the right variance) is one of the best ways of partially addressing the vanishing/exploding gradients problem, helping the gradients not to vanish or explode too quickly.
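
To tie the variants together, here is a minimal sketch of a layer-by-layer initializer (the function name `initialize_parameters`, the `method` names, and the `numerator` knob are illustrative choices, not from the notes):

```python
import numpy as np

def initialize_parameters(layer_dims, method="he", numerator=None, seed=1):
    """Sketch of weight initialization; layer_dims = [n^[0], n^[1], ..., n^[L]].

    `numerator` optionally overrides the 1 or 2 in the variance formula,
    treating it as a tunable hyperparameter.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        if method == "he":          # ReLU layers: var(w) = 2 / n^[l-1]
            var = (numerator or 2) / n_prev
        elif method == "xavier":    # tanh layers: var(w) = 1 / n^[l-1]
            var = (numerator or 1) / n_prev
        else:                       # "bengio": var(w) = 2 / (n^[l-1] + n^[l])
            var = (numerator or 2) / (n_prev + n_curr)
        parameters[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(var)
        parameters[f"b{l}"] = np.zeros((n_curr, 1))  # b^[l] starts at zero
    return parameters

params = initialize_parameters([5, 4, 3, 1], method="he")
print(params["W1"].std())  # roughly sqrt(2/5) ≈ 0.63
```

Exposing `numerator` is just one way to make the 1 or 2 above a tunable multiplier, as mentioned in the paragraph before this sketch.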