Gradient Descent in Neural Networks

Gradient descent is a fundamental algorithm for training neural networks: the model's parameters are iteratively adjusted to minimize the cost function. This section gives a simplified overview of how gradient descent works in the context of a shallow neural network with one hidden layer.

Neural Network Architecture

A neural network's architecture is defined by its layers and neurons. The input layer has one neuron per feature in the dataset ($n^{[0]} = n_x$). Consider a simple network with one hidden layer. The hidden layer has $n^{[1]}$ = noOfHiddenNeurons neurons. The output layer produces the final prediction; here, $n^{[2]}$ = NoOfOutputNeurons = 1 for binary classification.

The dimensions of the weight and bias matrices are:

  • $\mathbf{W}^{[1]}: (n^{[1]}, n^{[0]})$
  • $\mathbf{b}^{[1]}: (n^{[1]}, 1)$
  • $\mathbf{W}^{[2]}: (n^{[2]}, n^{[1]})$
  • $\mathbf{b}^{[2]}: (n^{[2]}, 1)$
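
As a concrete check of these dimensions, here is a minimal NumPy sketch (the sizes n_x = 3 and n_h = 4 are illustrative):

import numpy as np

n_x, n_h, n_y = 3, 4, 1                  # n[0], n[1], n[2]
W1 = np.random.randn(n_h, n_x) * 0.01    # (n[1], n[0]); see Random Initialization below
b1 = np.zeros((n_h, 1))                  # (n[1], 1)
W2 = np.random.randn(n_y, n_h) * 0.01    # (n[2], n[1])
b2 = np.zeros((n_y, 1))                  # (n[2], 1)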

Cost Function

The cost function, $I$, is the average of the loss function $L$ over all $m$ training examples:

$$I = I(\mathbf{W}^{[1]}, \mathbf{b}^{[1]}, \mathbf{W}^{[2]}, \mathbf{b}^{[2]}) = \frac{1}{m} \sum L(\mathbf{Y}, \mathbf{A}^{[2]})$$
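
For binary classification, $L$ is typically the logistic (cross-entropy) loss; a minimal sketch under that assumption:

import numpy as np

def compute_cost(A2, Y):
    # A2: predictions of shape (1, m); Y: labels of shape (1, m)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m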

Gradient Descent Steps

The gradient descent algorithm involves the following steps:

  1. Compute predictions: For each training example, compute the predicted value $\hat{y}$.

  2. Calculate derivatives: Determine how much the cost function would change if you changed each weight and bias, giving the derivatives $d\mathbf{W}^{[1]}$, $d\mathbf{b}^{[1]}$, $d\mathbf{W}^{[2]}$, $d\mathbf{b}^{[2]}$.

  3. Update parameters: Adjust the weights and biases by a fraction of the derivatives, scaled by the learning rate.

$$\mathbf{W}^{[1]} := \mathbf{W}^{[1]} - \alpha \, d\mathbf{W}^{[1]}$$
$$\mathbf{b}^{[1]} := \mathbf{b}^{[1]} - \alpha \, d\mathbf{b}^{[1]}$$
$$\mathbf{W}^{[2]} := \mathbf{W}^{[2]} - \alpha \, d\mathbf{W}^{[2]}$$
$$\mathbf{b}^{[2]} := \mathbf{b}^{[2]} - \alpha \, d\mathbf{b}^{[2]}$$
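
Putting the three steps together, here is a compact, runnable sketch (the toy data, layer sizes, and learning rate are illustrative assumptions; the sections below walk through each step):

import numpy as np

np.random.seed(0)
n_x, n_h, m, alpha = 2, 4, 50, 0.5           # illustrative sizes and learning rate
X = np.random.randn(n_x, m)
Y = (X[0:1] + X[1:2] > 0).astype(float)      # toy binary labels, shape (1, m)
W1 = np.random.randn(n_h, n_x) * 0.01; b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h) * 0.01;   b2 = np.zeros((1, 1))

for i in range(1000):
    # step 1: predictions (forward propagation)
    A1 = np.tanh(np.dot(W1, X) + b1)
    A2 = 1 / (1 + np.exp(-(np.dot(W2, A1) + b2)))
    # step 2: derivatives (backpropagation)
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)  # tanh derivative
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # step 3: update, scaled by the learning rate alpha
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2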

Forward Propagation

During forward propagation, the data flows through the network from the input to the output layer:

Z1 = np.dot(W1, A0) + b1  # A0 is the input matrix X
A1 = g1(Z1)               # g1 could be a ReLU or tanh function for the hidden layer
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)          # sigmoid is used for the output layer since the output is between 0 and 1
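
The helpers g1 and sigmoid are not defined in the snippet above; a minimal sketch, assuming tanh for the hidden layer:

import numpy as np

def g1(z):
    # hidden-layer activation; tanh is one common choice (ReLU is another)
    return np.tanh(z)

def sigmoid(z):
    # output-layer activation; squashes values into (0, 1)
    return 1 / (1 + np.exp(-z))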

Backpropagation

Backpropagation calculates gradients for updating the weights and biases:

dZ2 = A2 - Y
dW2 = np.dot(dZ2, A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = np.dot(W2.T, dZ2) * gprime1(Z1)  # element-wise product (*)
dW1 = np.dot(dZ1, A0.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

Here, gprime1(Z1) represents the derivative of the hidden layer's activation function, evaluated at Z1.
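
If the hidden layer uses tanh, as in the forward-propagation sketch above, that derivative is:

import numpy as np

def gprime1(z):
    # derivative of tanh: 1 - tanh(z)**2; for ReLU it would be (z > 0).astype(float)
    return 1 - np.tanh(z) ** 2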

Gradient Descent Update

Parameters are updated using the calculated gradients and a learning rate:

W1 = W1 - alpha * dW1
b1 = b1 - alpha * db1
W2 = W2 - alpha * dW2
b2 = b2 - alpha * db2

Random Initialization

Proper initialization of weights is crucial for breaking symmetry and ensuring each neuron learns different features. We typically initialize weights with small random numbers and biases to zero:

# small random values break the symmetry between hidden units
W1 = np.random.randn(n[1], n[0]) * 0.01
# initializing b to zero is fine; it does not cause a symmetry problem
b1 = np.zeros((n[1], 1))

Small weights prevent the activation functions from saturating at the start, which can slow down learning, especially with sigmoid or tanh functions.

For logistic regression, which has no hidden layer, initializing the weights to zeros worked fine: even though the first iteration outputs zeros, the gradients depend on the input features $x$, which are not zero. So by the second iteration the weight values follow the distribution of $x$ and differ from each other, as long as $x$ is not a constant vector.

However, for neural networks with hidden layers, random initialization of the weights is crucial (initializing the biases to zero is fine, however); otherwise all hidden units are completely identical (symmetric) and end up computing exactly the same function, because on each gradient descent iteration they all update in the same way. To break this symmetry, we initialize each $\mathbf{W}^{[l]}$ with small random numbers.
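
A quick way to see the symmetry problem (the sizes and data here are illustrative):

import numpy as np

# With zero-initialized weights every hidden unit computes the same value for
# every input, so all units receive identical gradients and stay identical.
X = np.random.randn(3, 5)                    # 3 features, 5 examples
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))  # 4 hidden units, all symmetric
A1 = np.tanh(np.dot(W1, X) + b1)
print(np.allclose(A1, A1[0]))                # True: every row (unit) is identical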

We need small values because with sigmoid (or tanh) activations, weights that are too large make it likely that $Z$ takes very large values even at the very start of training. This saturates the tanh or sigmoid activation function, slowing down learning. If your network has no sigmoid or tanh activations, this is less of an issue. The constant 0.01 works well for networks with one hidden layer; for deeper networks a different constant can be chosen, but in general it should always be a small number.
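
A quick numeric check of the saturation effect (the pre-activation magnitude 5 is an illustrative assumption):

import numpy as np

z = 5.0                          # illustrative pre-activation input
for scale in (0.01, 10.0):       # small vs. large weight scale
    a = np.tanh(scale * z)
    print(scale, a, 1 - a ** 2)  # tanh gradient: ~1.0 when small, ~0.0 when saturated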

Decisions in Neural Network Design

Building a neural network involves several decisions about its architecture and learning process. While there are no hard rules, experimenting with the number of layers, number of neurons, learning rate, and activation functions is essential.