Deep Neural Networks

Deep Neural Networks (DNNs) are crucial in understanding and applying deep learning, especially in the field of computer vision.

Deep L-layer Neural Networks

Some general rules and notation:

  • Layer Counting: In a neural network we count only the hidden and output layers, not the input layer, because the input layer has no parameters (weights and biases). For example, logistic regression, which has no parameters in its input layer, is considered a 1-layer neural network.
  • Network Types:
    • Shallow Neural Networks have 1 or 2 layers.
    • Deep Neural Networks have 3 or more layers.
  • Key Terms:
    • $L$: Total number of layers.
    • $m$: Total number of training examples.
    • $n^{[l]}$: Number of neurons in layer $l$.
    • $n^{[0]}$: Number of neurons in the input layer, equal to the number of input features $n_x$.
    • $n^{[L]}$: Number of neurons in the output layer, typically 1 in binary classification; equal to $n_y$, the size of the output vector.
    • $g^{[l]}$: Activation function in layer $l$.
    • Activation: $\mathbf{a}^{[l]} = g^{[l]}(\mathbf{z}^{[l]})$, the activations in layer $l$.
    • Weights: $\mathbf{w}^{[l]}$, the weight matrix of layer $l$, used to compute $\mathbf{z}^{[l]}$.
  • Data Representation:
    • Input data $\mathbf{X}$ and output data $\mathbf{Y}$ are matrices whose columns are the input and output vectors, respectively.
    • $\mathbf{x}^{(1)}$ and $\mathbf{y}^{(1)}$: the first input and output vectors in the dataset.
  • Vector and Matrix Dimensions:
    • Understanding and maintaining the correct dimensions for the layer-size vector $\mathbf{n}$, the activation list $\mathbf{g}$, the weights $\mathbf{w}$, and the biases $\mathbf{b}$ is essential (see the initialization sketch after this list).
    • $\mathbf{n}$ has shape $(1, L + 1)$ (one entry per layer, including the input layer).
    • $\mathbf{g}$ has shape $(1, L)$ (one activation function per layer).
    • $\mathbf{w}$ is a list of matrices whose shapes depend on the number of neurons in the previous and the current layer: $\mathbf{w}^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$.
    • $\mathbf{b}$ is a list of vectors whose shapes depend on the number of neurons in the current layer: $\mathbf{b}^{[l]}$ has shape $(n^{[l]}, 1)$.
    • $\mathbf{X} = \mathbf{a}^{[0]}$
    • $\mathbf{a}^{[L]} = \mathbf{\hat{Y}}$
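
As a concrete sketch of these shape rules, here is a minimal parameter-initialization helper (the name `initialize_parameters` and the `layer_dims` list, which plays the role of $\mathbf{n}$, are my own, not from the notes):

import numpy as np

def initialize_parameters(layer_dims):
    """Create W^[l] with shape (n[l], n[l-1]) and b^[l] with shape (n[l], 1)."""
    parameters = {}
    L = len(layer_dims) - 1                        # the input layer is not counted
    for l in range(1, L + 1):
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_parameters([3, 4, 1])          # n_x = 3, one hidden layer of 4, n_y = 1
print(params["W1"].shape, params["b1"].shape)      # (4, 3) (4, 1)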

Forward Propagation in Deep Networks

For a single training input, the general forward propagation rule for layer $l$ is:

$$\mathbf{z}^{[l]} = \mathbf{w}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$
$$\mathbf{a}^{[l]} = g^{[l]}(\mathbf{z}^{[l]})$$

For $m$ training inputs, the equations are vectorized to process all examples at once. Note that we still cannot compute forward propagation across all layers without a for loop over the layers, so that loop is unavoidable. Getting the dimensions of the matrices right, however, is the important part (a short numpy sketch follows the dimension list below):

$$\mathbf{Z}^{[l]} = \mathbf{w}^{[l]}\mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}$$
$$\mathbf{A}^{[l]} = g^{[l]}(\mathbf{Z}^{[l]})$$

In terms of dimensions, we have:

  • $\mathbf{Z}^{[1]}$ has shape $(n^{[1]}, m)$
  • $\mathbf{w}^{[1]}$ has shape $(n^{[1]}, n^{[0]})$
  • $\mathbf{X}$ has shape $(n^{[0]}, m)$
  • $\mathbf{b}^{[1]}$ has shape $(n^{[1]}, 1)$
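
A short numpy sketch of this, reusing the `initialize_parameters` helper sketched earlier and an illustrative ReLU activation, shows the layer loop and how the shapes (including the broadcasting of $\mathbf{b}^{[l]}$ over the $m$ columns) work out:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

np.random.seed(0)
m = 5
X = np.random.randn(3, m)                    # X = A^[0], shape (n^[0], m) = (3, 5)
params = initialize_parameters([3, 4, 1])

A = X
for l in range(1, 3):                        # the unavoidable loop over layers l = 1..L
    W, b = params["W" + str(l)], params["b" + str(l)]
    Z = np.dot(W, A) + b                     # b is (n^[l], 1) and broadcasts over the m columns
    A = relu(Z)                              # illustrative g^[l]; the output layer would normally use sigmoid
    print("layer", l, "Z shape:", Z.shape)   # (4, 5) for layer 1, (1, 5) for layer 2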

Getting Matrix Dimensions Right

Take the shallow NN from Shallow Neural Networks:

[Figure: Neural Network]

Using a pencil-and-paper approach can be the best way to ensure the dimensions of matrices like $\mathbf{w}^{[l]}$, $\mathbf{b}^{[l]}$, and their derivatives align correctly. For example, represent the first layer as:

$$\mathbf{z}^{[1]} = \overbrace{\left[\begin{array}{c} \mathbf{w}_{1}^{[1]T} \\ \mathbf{w}_{2}^{[1]T} \\ \mathbf{w}_{3}^{[1]T} \\ \mathbf{w}_{4}^{[1]T} \end{array}\right]}^{(4,3)} \overbrace{\left[\begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \end{array}\right]}^{(3,1)} + \overbrace{\left[\begin{array}{c} b_{1}^{[1]} \\ b_{2}^{[1]} \\ b_{3}^{[1]} \\ b_{4}^{[1]} \end{array}\right]}^{(4,1)}$$
  1. The dimension of $\mathbf{w}^{[l]}$ is $(n^{[l]}, n^{[l-1]})$, i.e. every neuron has a weight for each input feature / neuron in the previous layer.
  2. The dimension of $\mathbf{b}^{[l]}$ is $(n^{[l]}, 1)$, i.e. each neuron in that layer has its own bias.
  3. Make sure the derivatives have the same dimensions as well:
    • $d\mathbf{w}^{[l]}$ should have the same shape as $\mathbf{w}^{[l]}$, i.e. $(n^{[l]}, n^{[l-1]})$, while $d\mathbf{b}^{[l]}$ has the same shape as $\mathbf{b}^{[l]}$, i.e. $(n^{[l]}, 1)$.

The dimensions of $\mathbf{Z}^{[l]}$, $\mathbf{A}^{[l]}$, $d\mathbf{Z}^{[l]}$, and $d\mathbf{A}^{[l]}$ are all $(n^{[l]}, m)$.
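
A minimal debugging sketch (my own helper, not from the notes) that turns these pencil-and-paper checks into assertions, where `n` is the list of layer sizes $[n^{[0]}, \dots, n^{[L]}]$ and `m` the number of examples:

def check_layer_shapes(W, b, dW, db, Z, A, n, l, m):
    """Assert that the parameters, gradients, and activations of layer l have the expected shapes."""
    assert W.shape == (n[l], n[l - 1]), "W^[l] should be (n[l], n[l-1])"
    assert b.shape == (n[l], 1), "b^[l] should be (n[l], 1)"
    assert dW.shape == W.shape and db.shape == b.shape, "gradients match parameter shapes"
    assert Z.shape == (n[l], m) and A.shape == (n[l], m), "Z^[l] and A^[l] should be (n[l], m)"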

Why Deep Representations?

A deep NN builds up relations from the input data to the output that range from simple to complex. Each layer builds on the previous one, with deeper layers computing increasingly complex features of the input:

  • Face recognition application:
    • Image → Edges → Face parts → Faces → Desired face
  • Audio recognition application:
    • Audio → Low-level sound features (e.g. "sss", "bb") → Phonemes → Words → Sentences

This progression is similar to the way human brains process information, moving from simple to complex interpretations.

Circuit Theory and Deep Learning

Deep learning and circuit theory share an interesting connection. Informally, there are functions that a small $L$-layer deep neural network can compute that would require exponentially more hidden units if attempted with a shallower network.

Deep vs. Shallow Networks

Certain functions, such as the XOR of a set of input features, can be computed far more efficiently by a deep neural network than by a shallow one. A deep network can do this with a small number of units arranged in a depth that grows only logarithmically with the number of inputs. A shallow network, in contrast, requires an exponentially large number of hidden units to compute the same function.

Consider a function $y$ that is the result of XOR operations on a set of input features $x_1, x_2, \dots, x_n$:

$$y = x_1 \text{ XOR } x_2 \text{ XOR } \dots \text{ XOR } x_n$$

A deep neural network can compute such a function efficiently with a tree of pairwise XORs, using $O(n)$ units in total and a depth that grows only logarithmically with $n$:

$$y = \text{DeepNN}(x_1, x_2, \dots, x_n), \qquad \text{depth} = O(\log n)$$

[Figure: XOR function]

In contrast, a shallow network with only one or two layers would require an exponentially large number of hidden units to compute the same function, on the order of $2^n$.

In a shallow network, the hidden layer would need a unit for every possible combination of inputs. For 8 inputs, this results in $2^7 = 128$ unique XOR gates if each gate takes a unique combination of inputs.

$$y = \text{ShallowNN}(x_1, x_2, \dots, x_n), \qquad \text{hidden units} = O(2^n)$$

[Figure: XOR function]

The implication is that deeper architectures can represent complex functions more compactly than shallow ones. For certain types of computations, particularly those that can be decomposed into hierarchical patterns or features, deep neural networks have a significant advantage. This difference is a critical reason why deep learning excels in tasks that involve complex, hierarchical data structures, such as image and speech recognition.
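
To make the counting concrete, here is a rough sketch (my own illustration of the enumeration argument above; these are unit/depth tallies, not a trained network) comparing the two approaches for $n$ inputs:

import math

def deep_xor_cost(n):
    # Pairwise XOR tree: n - 1 two-input XOR units arranged in ceil(log2(n)) layers.
    return {"units": n - 1, "depth": math.ceil(math.log2(n))}

def shallow_xor_cost(n):
    # One hidden layer that enumerates input patterns: on the order of 2^(n-1) units.
    return {"units": 2 ** (n - 1), "depth": 1}

print(deep_xor_cost(8))     # {'units': 7, 'depth': 3}
print(shallow_xor_cost(8))  # {'units': 128, 'depth': 1}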

Building blocks of deep neural networks

Here is a schematic for what happens during forward and backward propagation. The forward propagation is used to calculate the cost function. The backward propagation is used to calculate the gradients of the cost function.

[Figure: forward and backward propagation]

With the derivatives calculated, we can update w and b:

$$\mathbf{w}^{[l]} := \mathbf{w}^{[l]} - \alpha \, d\mathbf{w}^{[l]}$$
$$\mathbf{b}^{[l]} := \mathbf{b}^{[l]} - \alpha \, d\mathbf{b}^{[l]}$$
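
A minimal sketch of this update, assuming the parameters and gradients are kept in dictionaries keyed "W1", "b1", ... and "dW1", "db1", ... (my own convention, not prescribed by the notes):

def update_parameters(parameters, grads, alpha):
    """Apply one gradient-descent step: w := w - alpha * dw, b := b - alpha * db."""
    L = len(parameters) // 2                  # two entries (W and b) per layer
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= alpha * grads["dW" + str(l)]
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters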

Pseudo code for forward propagation for layer $l$:

import numpy as np

def forward_propagation(A_prev, W, b, g):
    """
    Perform forward propagation for one layer of a neural network.
 
    Parameters:
    A_prev (numpy.ndarray): Activations from the previous layer
    W (numpy.ndarray): Weight matrix for the current layer
    b (numpy.ndarray): Bias vector for the current layer
    g (function): Activation function for the current layer
 
    Returns:
    A (numpy.ndarray): Output activations of the current layer
    cache (tuple): Tuple containing (Z, A_prev, W, b) for use in backpropagation
    """
    Z = np.dot(W, A_prev) + b
    A = g(Z)
    cache = (Z, A_prev, W, b)
 
    return A, cache

Pseudo code for backward propagation for layer $l$ follows. Each activation function has a different derivative, so during backpropagation you need to know which activation was used in the forward pass in order to compute the correct derivative.

def backward_propagation(dA, cache, g_prime, m):
    """
    Perform backward propagation for one layer of a neural network.
 
    Parameters:
    dA (numpy.ndarray): Gradient of the cost with respect to the current layer's activations (propagated back from the next layer)
    cache (tuple): Cached data from forward propagation (Z, A_prev, W, b)
    g_prime (function): Derivative of the activation function for the current layer
    m (int): Number of training examples
 
    Returns:
    dA_prev (numpy.ndarray): Gradient of the cost with respect to the previous layer's activations
    dW (numpy.ndarray): Gradient of the cost with respect to the weight matrix of the current layer
    db (numpy.ndarray): Gradient of the cost with respect to the bias vector of the current layer
    """
    Z, A_prev, W, _ = cache
    dZ = dA * g_prime(Z)
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
 
    return dA_prev, dW, db
 

If we use the logistic (binary cross-entropy) loss function, then the gradient of the loss with respect to the output activation is:

def compute_gradient_loss(y, a):
    """
    Compute the gradient of the loss function.
 
    Parameters:
    y (numpy.ndarray): True labels
    a (numpy.ndarray): Predicted output from the last layer of the network
 
    Returns:
    dA (numpy.ndarray): Gradient of the loss function
    """
    dA = -(np.divide(y, a) - np.divide(1 - y, 1 - a))
    return dA
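
To see how these pieces fit together, here is a sketch of one gradient-descent step for a 2-layer network, reusing `forward_propagation`, `backward_propagation`, and `compute_gradient_loss` from above together with the `initialize_parameters` and `update_parameters` sketches; the ReLU/sigmoid helpers, the layer sizes, and the random data are illustrative assumptions:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def relu_prime(Z):
    return (Z > 0).astype(float)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_prime(Z):
    s = sigmoid(Z)
    return s * (1 - s)

np.random.seed(1)
m = 5
X = np.random.randn(3, m)                  # 3 input features, 5 examples
Y = np.random.randint(0, 2, (1, m))        # binary labels
params = initialize_parameters([3, 4, 1])  # n^[0] = 3, n^[1] = 4, n^[2] = 1

# Forward pass through both layers, keeping the caches for backpropagation.
A1, cache1 = forward_propagation(X, params["W1"], params["b1"], relu)
A2, cache2 = forward_propagation(A1, params["W2"], params["b2"], sigmoid)

# Backward pass: start from the loss gradient, then walk the layers in reverse.
dA2 = compute_gradient_loss(Y, A2)
dA1, dW2, db2 = backward_propagation(dA2, cache2, sigmoid_prime, m)
_, dW1, db1 = backward_propagation(dA1, cache1, relu_prime, m)

grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
params = update_parameters(params, grads, alpha=0.01)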

Parameters vs Hyperparameters

Being able to organize your hyperparameters well will help you develop your networks more efficiently. The main parameters of a neural network are the weights $\mathbf{w}$ and biases $\mathbf{b}$. Hyperparameters (parameters that control the algorithm) are:

  • Learning rate ($\alpha$)
  • Number of iterations
  • Number of hidden layers $L$
  • Number of hidden units $n^{[l]}$
  • Choice of activation functions
  • $\lambda$, a hyperparameter you can tune using a dev set if you have a regularization term in your cost function
  • Other hyperparameters explored later include the momentum term, mini-batch size, various forms of regularization parameters, etc.

The workflow usually follows an empirical approach: experimenting with hyperparameters and seeing what works. However, there are also systematic approaches available.
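
As a small illustration of that empirical workflow, one option is to keep the hyperparameters in a single dictionary and sweep one of them on a dev set; `train_model` and `dev_accuracy` below are hypothetical placeholders, not functions from these notes:

hyperparameters = {
    "learning_rate": 0.01,
    "num_iterations": 1000,
    "layer_dims": [3, 4, 1],             # L = 2, n^[1] = 4
    "activations": ["relu", "sigmoid"],
}

results = {}
for alpha in [0.001, 0.01, 0.1]:
    hyperparameters["learning_rate"] = alpha
    # model = train_model(X_train, Y_train, **hyperparameters)   # hypothetical training routine
    # results[alpha] = dev_accuracy(model, X_dev, Y_dev)         # hypothetical dev-set evaluation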