
Shallow Neural Networks

Neural networks are powerful tools that can model complex patterns in data. A neural network with one hidden layer is referred to as a shallow neural network. Understanding forward and backward propagation in such a network is key to building neural network models.

Overview of Neural Networks

In logistic regression, we take input features and, through a series of mathematical operations involving weights and biases, predict an output:

[Figure: computation graph for logistic regression]

In a neural network with one hidden layer, we extend this idea:

[Figure: computation graph for a neural network with one hidden layer]

Imagine we have an input vector $\mathbf{x}^{(n)}$ consisting of features $x_1, x_2, x_3$, and we're trying to predict an output $y$. In a neural network, we do this by passing the input through layers of logistic regression.

Neural Network Representation

[Figure: a neural network with an input layer, one hidden layer, and an output layer]

A neural network consists of an input layer, one or more hidden layers, and an output layer. Unlike the input layer, which is exposed to our training data, the hidden layers aren't directly exposed to our dataset. Here’s what happens in a two-layer neural network (where the input layer isn't counted):

  1. The input layer $\mathbf{a}^{[0]}$ is our initial data $\mathbf{x}^{(n)}$.
  2. The activations $\mathbf{a}^{[1]}$ represent the outputs of the hidden layer's neurons after applying the weights and biases to the input data.
  3. The final layer $\mathbf{a}^{[2]}$, or the output layer, is responsible for producing the predicted value $\hat{y}$.

The operations performed in the first hidden layer can be mathematically represented as:

$$\mathbf{z}^{[1]} = \mathbf{w}^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]}$$

Or, written out in full:

$$\mathbf{z}^{[1]} = \begin{bmatrix} \mathbf{w}_{1}^{[1]} \\ \mathbf{w}_{2}^{[1]} \\ \mathbf{w}_{3}^{[1]} \\ \mathbf{w}_{4}^{[1]} \end{bmatrix} \mathbf{x}^{(n)} + \begin{bmatrix} b_{1}^{[1]} \\ b_{2}^{[1]} \\ b_{3}^{[1]} \\ b_{4}^{[1]} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_{1}^{[1]} \mathbf{x}^{(n)} + b_{1}^{[1]} \\ \mathbf{w}_{2}^{[1]} \mathbf{x}^{(n)} + b_{2}^{[1]} \\ \mathbf{w}_{3}^{[1]} \mathbf{x}^{(n)} + b_{3}^{[1]} \\ \mathbf{w}_{4}^{[1]} \mathbf{x}^{(n)} + b_{4}^{[1]} \end{bmatrix} = \begin{bmatrix} z_{1}^{[1]} \\ z_{2}^{[1]} \\ z_{3}^{[1]} \\ z_{4}^{[1]} \end{bmatrix}$$

Here, $\mathbf{z}^{[1]}$ is the weighted sum of inputs plus the bias term for the first hidden layer. $\mathbf{w}^{[1]}$ is the weight matrix connecting the input layer to the first hidden layer, and $\mathbf{b}^{[1]}$ is the bias vector for the first hidden layer. For our case:

  • $\mathbf{w}^{[1]}$ has a shape of $(4, 3)$, since we have 4 neurons in the hidden layer and 3 input features.
  • $\mathbf{b}^{[1]}$ has a shape of $(4, 1)$, which corresponds to the 4 neurons in the hidden layer.

The output of each neuron in the first hidden layer is then passed through an activation function, such as the sigmoid function:

$$\mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]})$$

Where $\mathbf{a}^{[1]}$ is the activation of the first hidden layer, and it retains the shape of $(4, 1)$, the same as $\mathbf{z}^{[1]}$.
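
To make the shapes concrete, here is a minimal NumPy sketch of the layer-1 computation for a single example. The variable names, the random initialization, and the use of NumPy are illustrative assumptions, not anything prescribed by the text above.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative shapes from the text: 3 input features, 4 hidden neurons.
x = rng.standard_normal((3, 1))           # input a^[0], shape (3, 1)
W1 = rng.standard_normal((4, 3)) * 0.01   # w^[1], shape (4, 3)
b1 = np.zeros((4, 1))                     # b^[1], shape (4, 1)

z1 = W1 @ x + b1    # shape (4, 1): one weighted sum per hidden neuron
a1 = sigmoid(z1)    # shape (4, 1): hidden-layer activations
```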

The activations from the first hidden layer are then used as inputs to the second layer. If the network has only one hidden layer, this second layer would be the output layer. The process is similar:

$$\mathbf{z}^{[2]} = \mathbf{w}^{[2]} \mathbf{a}^{[1]} + \mathbf{b}^{[2]}$$

Here, $\mathbf{z}^{[2]}$ is the weighted sum of the activations from the first hidden layer plus the bias term for the second layer, which is ultimately the output layer in a network with a single hidden layer.

  • $\mathbf{w}^{[2]}$ has a shape of $(1, 4)$, which reflects a single output neuron connected to the 4 neurons of the hidden layer.
  • $\mathbf{b}^{[2]}$ has a shape of $(1, 1)$, corresponding to the single output neuron.

Finally, the output neuron also applies the sigmoid function to produce the final output $\hat{y}$:

$$\hat{y} = \mathbf{a}^{[2]} = \sigma(\mathbf{z}^{[2]})$$

Where $\mathbf{a}^{[2]}$ is the predicted output, with a shape of $(1, 1)$.
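
Putting both layers together, the full forward pass for one example can be sketched as below. Again, this is only an assumed NumPy formulation with made-up names, not code from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_single(x, W1, b1, W2, b2):
    """Forward pass for one example x of shape (3, 1)."""
    z1 = W1 @ x + b1     # (4, 1)
    a1 = sigmoid(z1)     # (4, 1)
    z2 = W2 @ a1 + b2    # (1, 1)
    a2 = sigmoid(z2)     # (1, 1): the prediction y_hat
    return a2

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))
y_hat = forward_single(x, W1, b1, W2, b2)   # shape (1, 1)
```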

Each layer's output serves as the subsequent layer's input in a neural network. The activation functions introduce non-linearity, allowing the network to learn complex patterns. By adjusting the weights and biases through training, using algorithms like gradient descent, the network learns to make predictions.

Vectorizing Across Multiple Examples

When handling multiple training examples, we want to avoid slow for-loop computations on each example. Instead, we use vectorization to process all examples simultaneously. If our input data $\mathbf{X}$ has dimensions $[n_x, m]$, where $n_x$ is the number of features and $m$ is the number of examples, we can perform all our computations in a vectorized form:

  • $\mathbf{z}^{[1]} = \mathbf{w}^{[1]}\mathbf{X} + \mathbf{b}^{[1]}$ has dimensions $[n_\text{hidden neurons}, m]$.
  • $\mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]})$ has the same dimensions as $\mathbf{z}^{[1]}$.
  • $\mathbf{z}^{[2]} = \mathbf{w}^{[2]}\mathbf{a}^{[1]} + \mathbf{b}^{[2]}$ has dimensions $[1, m]$.
  • $\mathbf{a}^{[2]} = \sigma(\mathbf{z}^{[2]})$ also has dimensions $[1, m]$.
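
Below is a sketch of the same forward pass vectorized over all $m$ examples, under the same illustrative assumptions as before (NumPy, 3 features, 4 hidden neurons). Note that adding $\mathbf{b}^{[1]}$ of shape $(4, 1)$ to a $(4, m)$ matrix relies on NumPy broadcasting across the columns.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_vectorized(X, W1, b1, W2, b2):
    """Forward pass for all m examples at once; X has shape (n_x, m)."""
    Z1 = W1 @ X + b1    # (4, m): b1 of shape (4, 1) broadcasts across the m columns
    A1 = sigmoid(Z1)    # (4, m)
    Z2 = W2 @ A1 + b2   # (1, m)
    A2 = sigmoid(Z2)    # (1, m): one prediction per example
    return A2

rng = np.random.default_rng(0)
n_x, m = 3, 5                                # 3 features, 5 examples (illustrative)
X = rng.standard_normal((n_x, m))
W1, b1 = rng.standard_normal((4, n_x)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))
print(forward_vectorized(X, W1, b1, W2, b2).shape)   # (1, 5)
```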

This approach dramatically increases efficiency, particularly with large datasets.