
Residual Networks (ResNets)

Residual Networks, or ResNets, address the challenges of training very deep neural networks. By incorporating skip connections, ResNets make it possible to train networks with more than 100 layers, overcoming issues like vanishing and exploding gradients. A skip connection lets you take the activation from one layer and feed it directly to another layer, even one much deeper in the NN.

The Concept of Residual Blocks

Structure of a Residual Block

ResNets are built out of Residual blocks:

  • Basic Idea: A residual block introduces a shortcut that allows the activation from one layer to be fed directly into a layer deeper in the network.
  • Operation: Instead of learning an underlying mapping directly, these blocks learn the residual mapping, which is often easier for deeper networks.

Residual Block

How Residual Blocks Work

  • Skip Connection: Fast-forward (copy) the activation $a^{[l]}$ directly to a deeper layer, before applying the ReLU non-linearity.
    • Rather than needing to follow the 'main path', the information from $a^{[l]}$ can now follow a shortcut to go much deeper into the neural network.
  • Modified Activation: The activation $a^{[l+2]}$ becomes $g(z^{[l+2]} + a^{[l]})$, simplifying the learning process for deeper layers.

The authors found that stacking these residual blocks lets you train much deeper NNs.
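As a minimal sketch of that computation in plain NumPy (the layer sizes and random weights below are purely illustrative, not part of any specific architecture), the only change from a plain two-layer stack is adding $a^{[l]}$ to $z^{[l+2]}$ before the final ReLU:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hypothetical small dimensions, for illustration only.
n = 4
a_l = relu(np.random.randn(n, 1))            # a^[l], output of some earlier layer
W1, b1 = np.random.randn(n, n), np.zeros((n, 1))
W2, b2 = np.random.randn(n, n), np.zeros((n, 1))

# Main path: two linear layers with a ReLU in between.
a_l1 = relu(W1 @ a_l + b1)                   # a^[l+1]
z_l2 = W2 @ a_l1 + b2                        # z^[l+2]

# Plain network:   a^[l+2] = g(z^[l+2])
# Residual block:  a^[l+2] = g(z^[l+2] + a^[l])   <- skip connection added before the ReLU
a_l2_plain = relu(z_l2)
a_l2_residual = relu(z_l2 + a_l)
```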

Residual Networks Architecture

  • A Residual Network is composed of multiple stacked residual blocks.
  • These networks can increase in depth without degrading performance, unlike plain networks.

Residual Network

When using plain networks, deeper networks tend to suffer from increased training error due to vanishing/exploding gradients. ResNets, on the other hand, show improved performance with increased depth.

Training Error Comparison

On the left is the plain NN with no skip connections, and on the right are the ResNets. In theory, having a deeper network should help, but in practice a very deep plain network is much harder for the optimization algorithm to train. So, in reality, your training error gets worse if you pick a network that's too deep.

The performance of a ResNet, on the other hand, keeps improving as the network goes deeper. Again, the skip connections help with the vanishing and exploding gradient problems and allow training of deep NNs without a loss of performance.

In some cases, going deeper won't improve performance; it depends on the problem at hand. Some researchers have even trained networks with around 1,000 layers, though such depths aren't used in practice.

He et al., 2015. Deep Residual Learning for Image Recognition

Understanding Residual Networks

Consider a neural network with the following structure:

$$X \rightarrow \text{Big NN} \rightarrow a^{[l]}$$

Adding two layers within a residual block modifies this structure as follows:

$$X \rightarrow \text{Big NN} \rightarrow a^{[l]} \rightarrow \text{Layer1} \rightarrow \text{Layer2} \rightarrow a^{[l+2]}$$

Assuming the use of ReLU activations (implying $a \geq 0$), the output of the added block is:

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

If we are using L2 regularization (weight decay), for example, it shrinks $W^{[l+2]}$ toward zero; if regularization is applied to the biases as well, $b^{[l+2]}$ shrinks toward zero too. In the limit where both are zero:

$$a^{[l+2]} = g(a^{[l]}) = a^{[l]}$$

This equation shows that the identity function is relatively easy for a residual block to learn, enabling the training of deeper networks without degradation in performance. In other words, the two layers we added don't hurt the performance of the big NN they were appended to.
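A quick NumPy check of this argument (the sizes are hypothetical; the point is only that zeroed-out parameters leave the block computing the identity):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

n = 4
a_l = relu(np.random.randn(n, 1))    # a^[l] >= 0 because of the earlier ReLU

# Suppose weight decay has driven the added layer's parameters to zero.
W_l2 = np.zeros((n, n))              # W^[l+2] = 0
b_l2 = np.zeros((n, 1))              # b^[l+2] = 0
a_l1 = relu(np.random.randn(n, 1))   # a^[l+1]; its value no longer matters

z_l2 = W_l2 @ a_l1 + b_l2            # = 0
a_l2 = relu(z_l2 + a_l)              # = relu(a^[l]) = a^[l] since a^[l] >= 0

assert np.allclose(a_l2, a_l)        # the block has learned the identity
```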

Dimensional Consistency in ResNets

For ResNets, the dimensions of $z^{[l+2]}$ and $a^{[l]}$ must match. If they differ, a transformation matrix $W_s$ is introduced:

$$a^{[l+2]} = g(z^{[l+2]} + W_s \cdot a^{[l]})$$

$W_s$ can be either a fixed zero-padding matrix or a matrix of learned weights; either way, it maps $a^{[l]}$ to the dimensions of $z^{[l+2]}$ so the two can be added.

Because of this, it's common to see a lot of 'same' convolutions in ResNets, so that the vectors can be added directly; otherwise another matrix $W_s$ would be needed. Using a skip connection helps the gradient backpropagate, and thus helps you train deeper networks.
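A small NumPy sketch of the two common choices for $W_s$ (the unit counts are made up for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Suppose the skip connection crosses a change in width: a^[l] has 4 units,
# but z^[l+2] has 6, so the two cannot be added directly.
a_l = relu(np.random.randn(4, 1))
z_l2 = np.random.randn(6, 1)

# Option 1: learned projection (W_s is trained like any other weight matrix).
W_s = np.random.randn(6, 4)
a_l2_projected = relu(z_l2 + W_s @ a_l)

# Option 2: fixed zero-padding (equivalent to a W_s made of an identity block
# stacked on top of zeros).
a_l_padded = np.vstack([a_l, np.zeros((2, 1))])
a_l2_padded = relu(z_l2 + a_l_padded)
```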

If the hidden units actually learn something useful, you will do better than the identity function. The issue with very deep plain nets is that as you make the network deeper and deeper, it becomes very difficult to choose parameters that even learn the identity function, which is why a lot of layers end up making your result worse rather than better. With residual blocks it is easy for the extra layers to learn the identity function, so you're essentially guaranteed that they don't hurt performance, and a lot of the time you may get lucky and they even help performance.

ResNet-34 Architecture

  • Uses 3x3 convolutions with the same padding.
  • Adjusts the spatial size and number of filters: Spatial size /2 ⇒ # filters x2.
  • Omits dropout and has no fully connected layers other than the final classification layer.
  • Employs two types of blocks, depending mainly on whether the input/output dimensions are the same or different.
  • Incorporates downsampling and zero-padding for dimension mismatch.
    • The dotted lines mark the cases where the dimensions differ.
    • To handle these, the input is down-sampled by 2 and then padded with zeros so the two dimensions match. There's another trick, called the bottleneck, which we will explore later.
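The resulting stage layout of ResNet-34 can be summarized compactly; the sketch below just lists the block counts and filter sizes from He et al., 2015 (variable and stage names are mine, chosen to match the paper's table):

```python
# ResNet-34 stage layout (He et al., 2015): every time the spatial size is
# halved, the number of 3x3 filters is doubled.
# (stage, residual blocks, filters, feature-map size for a 224x224 input)
resnet34_stages = [
    ("conv2_x", 3, 64, 56),
    ("conv3_x", 4, 128, 28),
    ("conv4_x", 6, 256, 14),
    ("conv5_x", 3, 512, 7),
]

for name, blocks, filters, size in resnet34_stages:
    print(f"{name}: {blocks} blocks of two 3x3 convs, {filters} filters, {size}x{size} output")
```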

VGG and ResNet Comparison

Spectrum of Depth

We need to employ different strategies as the network gets deeper:

Spectrum of Depth

  1. Up to 5 layers: Building neural networks with up to 5 layers is generally straightforward without the need for specialized techniques to aid in training or convergence.

  2. More than 10 layers: When constructing networks with more than 10 layers, it becomes important to use good initialization methods and techniques like Batch Normalization to help with the training process. Batch Normalization also helps in reducing internal covariate shift which can occur in deeper networks.

  3. More than 30 layers: Networks with depth beyond 30 layers benefit from incorporating skip connections, which are connections that skip one or more layers. Skip connections help mitigate the vanishing gradient problem by allowing gradients to flow through the network more effectively during backpropagation.

  4. More than 100 layers: At this depth, using identity skip connections becomes essential. These are a special case of skip connections where the output from a previous layer is added unchanged to a later layer. This can create paths through which the gradient can bypass multiple layers without attenuation, simplifying the learning of the identity function and enabling the training of very deep networks.

Residual Block Types

Both the Identity Block and the Convolutional Block are fundamental components of ResNets. They enable the network to learn identity mappings where necessary.

Identity Block

This block is characterized by maintaining the dimensions of the input throughout the block. It is typically employed in scenarios where the input and output dimensions are identical.

Identity Block

  1. Input: The block takes an input, denoted as $x$.
  2. First Layer: The input passes through a 2D convolutional layer (CONV2D), followed by batch normalization (Batch Norm), and then a ReLU activation function.
  3. Second Layer: The output from the first ReLU activation goes through another 2D convolutional layer and batch normalization.
  4. Shortcut/Skip Connection: Simultaneously, the original input $x$ is carried over via a shortcut, bypassing these layers.
  5. Final Activation: The output of the second batch normalization and the shortcut $x$ are added together before being passed through another ReLU activation function.

The Identity Block allows the gradients to flow through the network directly, bypassing two layers and thus mitigating the vanishing gradient problem.
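A minimal tf.keras sketch of the Identity Block described above (the filter count and kernel size are placeholder arguments, not values taken from any particular ResNet variant):

```python
from tensorflow.keras import layers

def identity_block(x, filters, kernel_size=3):
    """Residual block whose shortcut is the unchanged input (dimensions must already match)."""
    shortcut = x                                              # skip connection carries x forward

    # First layer: CONV2D -> Batch Norm -> ReLU
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    # Second layer: CONV2D -> Batch Norm (no activation yet)
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)

    # Add the shortcut, then apply the final ReLU
    x = layers.Add()([x, shortcut])
    return layers.Activation("relu")(x)
```

For the final `Add` to work, the input to `identity_block` must already have `filters` channels; when it doesn't, the Convolutional Block below is used instead.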

Convolutional Block

The Convolutional Block is used when there is a need to adjust the dimensions, for example when downsampling the input or changing the number of channels.

Convolutional Block

  1. Input: The block starts with an input, again denoted as $x$.
  2. Layers: The input goes through a series of transformations that typically include a 2D convolutional layer, batch normalization, and ReLU activation. This sequence may be repeated a number of times.
  3. Dimension Adjustment: When the dimensions need to change, a 1x1 convolution (a projection shortcut) is applied so that the shortcut matches the dimensions of the main path.
  4. Shortcut/Skip Connection: The input $x$ is fast-forwarded to a later point in the network, undergoing this projection when the dimensions are being changed.
  5. Final Activation: Similar to the Identity Block, the outputs from the last batch normalization and the transformed shortcut are added together and then activated with a ReLU function.

The Convolutional Block is essential for modifying the input dimensions while still providing the benefits of the skip connection for deep network training.
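And a matching tf.keras sketch of the Convolutional Block (again, the filter, kernel-size, and stride arguments are illustrative placeholders):

```python
from tensorflow.keras import layers

def convolutional_block(x, filters, kernel_size=3, stride=2):
    """Residual block that changes spatial size and/or channel count,
    so the shortcut needs a 1x1 convolution to match dimensions."""
    # Shortcut path: 1x1 convolution (projection) + Batch Norm
    shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Main path, first layer: CONV2D (strided) -> Batch Norm -> ReLU
    x = layers.Conv2D(filters, kernel_size, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    # Main path, second layer: CONV2D -> Batch Norm
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)

    # Add the projected shortcut, then the final ReLU
    x = layers.Add()([x, shortcut])
    return layers.Activation("relu")(x)
```

With `stride=2`, both the main path and the shortcut halve the spatial size, so the addition still lines up.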

Conclusion

ResNets represent a significant advancement in the field of deep learning, particularly for image recognition tasks. The introduction of residual blocks allows for the training of much deeper networks, addressing previous limitations and opening new possibilities in neural network design.