Vanishing / Exploding Gradients
The vanishing / exploding gradients problem occurs when your derivatives become very small or very large. To understand the problem, suppose we have a deep neural network with $L$ layers, and for the sake of simplicity, say all the activation functions are linear ($g(z) = z$) and each $b^{[l]} = 0$.
Then:

$$\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x$$
Say each $W^{[l]}$ is defined to be a little larger than the identity:

$$W^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$$
Also, technically, since the output layer is a single unit, $W^{[L]}$ will be of shape $(1, n^{[L-1]})$. So leaving that last matrix aside, we can write:

$$\hat{y} = W^{[L]} \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}^{L-1} x$$
If $L$ is large for a very deep neural network, then $\hat{y}$ will be multiplied by $1.5^{L-1}$, an exponentially large number. This is the problem of vanishing / exploding gradients: the example shows that the activations (and similarly the derivatives) decrease or increase exponentially as a function of the number of layers $L$ (a short numerical sketch follows the list below).
So if:
- $W^{[l]} > I$ (a little larger than the identity matrix), e.g. $W^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$, the activations and gradients will explode.
- $W^{[l]} < I$ (a little smaller than the identity matrix), e.g. $W^{[l]} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$, the activations and gradients will vanish.
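As a quick illustration, here is a minimal sketch (the depth and toy input are made up for illustration) of repeatedly multiplying by the 2x2 matrices above: a matrix slightly larger than the identity makes the activation scale explode, and a slightly smaller one makes it vanish.

```python
import numpy as np

L = 100                      # number of layers (illustrative)
x = np.array([1.0, 1.0])     # toy input

for scale in (1.5, 0.5):
    W = scale * np.eye(2)    # W[l] slightly larger / smaller than the identity
    a = x
    for _ in range(L - 1):
        a = W @ a            # linear activations: a[l] = W[l] a[l-1]
    print(f"scale={scale}: |a| after {L - 1} layers ~ {np.linalg.norm(a):.3e}")
```

With scale 1.5 the magnitude grows like $1.5^{L-1}$, and with scale 0.5 it shrinks like $0.5^{L-1}$.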
So use a careful choice of the random weight initialization to significantly reduce this problem.
Microsoft trained a 152-layer network (ResNet), which is a really large number of layers. With such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of $L$, then these values can get really big or really small, and this makes training difficult. In particular, if your gradients become exponentially small, gradient descent will take tiny little steps and it will take a long time to learn anything.
There is one partial solution that doesn't completely solve this problem but helps a lot: a careful choice of how you initialize the weights. Skip connections, another remedy, will be covered later.
Weight Initialization for Deep Networks
A partial solution to the vanishing / exploding gradients problem in a neural network is a better, more careful choice of the random initialization of the weights. For a larger $n$ (the number of inputs to a neuron), we need smaller individual $w_i$'s to prevent $z$ from blowing up.
Take this single neuron (perceptron model) with inputs $x_1, \dots, x_n$:
The output of this neuron is:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n \quad (\text{with } b = 0)$$
So if $n$ is large, we want each $w_i$ to be smaller so that $z$ (and hence the cost) does not explode.
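A small sketch of this effect (the choice $n = 1000$ is just illustrative): with weights of variance 1 the pre-activation $z$ grows like $\sqrt{n}$, while weights of variance $1/n$ keep $z$ on the order of 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                    # number of inputs to the neuron (illustrative)
x = rng.standard_normal(n)                  # inputs with mean 0, variance 1

w_naive = rng.standard_normal(n)                       # Var(w_i) = 1
w_scaled = rng.standard_normal(n) * np.sqrt(1 / n)     # Var(w_i) = 1/n

print("z with Var(w) = 1   :", w_naive @ x)    # typically on the order of sqrt(n) ~ 30
print("z with Var(w) = 1/n :", w_scaled @ x)   # typically on the order of 1
```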
He/Xavier Initialization
One reasonable thing to do, especially when using a tanh activation, is to set the variance of $w_i$ to:

$$\mathrm{Var}(w_i) = \frac{1}{n^{[l-1]}}$$
In practice, you can set each layer like this:
```python
W[l] = np.random.randn(shape) * np.sqrt(1 / n[l-1])
```
When using a ReLU activation function, as proposed by He et al., you can set the variance to $\frac{2}{n^{[l-1]}}$:
```python
W[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1])
```
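Putting it together, here is a minimal runnable sketch of He initialization for all layers of a network (the function name and the layer sizes in `layer_dims` are made up for illustration):

```python
import numpy as np

def initialize_parameters_he(layer_dims, seed=0):
    """He initialization: W[l] ~ N(0, 2 / n[l-1]), b[l] = 0."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (
            rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
            * np.sqrt(2 / layer_dims[l - 1])
        )
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

# Example: a 3-layer network with 5 input features (sizes are illustrative)
params = initialize_parameters_he([5, 10, 10, 1])
print(params["W1"].std())   # roughly sqrt(2/5) ~ 0.63
```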
If the input features or activations are roughly mean 0 and variance 1, then this causes $z$ to also take on a similar scale, which helps with vanishing / exploding gradients.
Or a variation of this (Bengio et al.): set the variance to $\frac{2}{n^{[l-1]} + n^{[l]}}$.
In practice, all of these formulas just give you a starting point. The 1 or 2 in the numerator can also be treated as a hyperparameter to tune (though not one of the first to start with); you can also tune a multiplier on the whole formula. This combination (ReLU + weight initialization with the right variance) is one of the best ways of partially solving vanishing / exploding gradients, since it helps gradients not to vanish or explode too quickly.
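For intuition, here is a sketch (the width and depth are made up for illustration) comparing how the activation scale evolves through a deep ReLU network under standard-normal initialization versus He initialization:

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 256, 50                          # layer width and depth (illustrative)
x = rng.standard_normal((n, 1))             # toy input, mean 0 / variance 1

for name, std in [("N(0,1) init", 1.0), ("He init", np.sqrt(2 / n))]:
    a = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std
        a = np.maximum(0, W @ a)            # ReLU forward pass
    print(f"{name}: std of activations after {depth} layers = {a.std():.3e}")
```

With standard-normal weights the activation scale blows up exponentially with depth, while He initialization keeps it roughly constant across layers.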