Convolutions

Convolutions are fundamental in Convolutional Neural Networks (CNNs), with applications such as image edge detection.

Edge Detection

A CNN might detect edges in early layers, parts of objects in middle layers, and assemble those parts into whole objects in later layers. In an image we can detect vertical edges, horizontal edges, or build a full edge detector.

Vertical Edge Detection

A convolution operation for detecting vertical edges involves a (6 \times 6) matrix (possibly a grayscale image) convolved with a (3 \times 3) filter/kernel, resulting in a (4 \times 4) output matrix. In TensorFlow this operation is performed with tf.nn.conv2d; in Keras, with the Conv2D layer.

Vertical Edge Detection

The filter identifies (3 \times 3) regions where a bright area is followed by a dark area. Applying this filter over a white region followed by a dark region yields positive values; conversely, a dark region followed by a white region yields negative values. Taking the absolute value of the output keeps just the edge intensity, regardless of which side is brighter.
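To make this concrete, here is a minimal NumPy sketch (the image values are illustrative, and the vertical-edge filter is the transpose of the horizontal one shown below): convolving a bright-left/dark-right image produces large output values exactly where the vertical edge sits.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2D cross-correlation (what deep learning calls 'convolution')."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# 6x6 grayscale image: bright (10) left half, dark (0) right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# 3x3 vertical edge detection filter
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

print(conv2d_valid(image, kernel))  # 4x4 output; every row is [0, 30, 30, 0]
```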

Some other examples are below. The output is essentially the edge intensities:

Edge Detection Examples

Similarly, a horizontal edge detection filter would be:

\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}

This looks for a part that is bright on top and dark on the bottom.

Sobel and Scharr Filters

Instead of handcrafting filter values, deep learning treats them as weights and learns them via backpropagation, enabling automatic detection of various edge types: horizontal, vertical, angled, or anything else, rather than hand-designing them. Notable handcrafted examples include the Sobel and Scharr filters, which place more weight on the central pixels for robustness.

Sobel Filter (Vertical Edge Detection):

\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}

Scharr Filter (Vertical Edge Detection). The idea is to place even more weight on the middle row:

\begin{bmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{bmatrix}

Padding in Convolutional Networks

Padding is crucial in deep convolutional networks: it addresses the reduction in matrix size after convolution and preserves edge information. Without padding, the convolution operation shrinks the matrix whenever f > 1:

(n \times n) * (f \times f) \rightarrow (n - f + 1, n - f + 1)

where * denotes 2D convolution, n is the original matrix size, and f is the filter size. We want to apply the convolution operation many times, but if the image shrinks at each step we lose a lot of data in the process. Edge pixels, especially, are used less than the other pixels in the image; a corner pixel is used only once, so we are essentially throwing away edge information. In short:

  1. We are shrinking the output.
  2. We are throwing away information at the edges.

To solve these problems we can pad the input image before convolution by adding some rows and columns to it. We will call the padding amount p; this is the number of rows/columns inserted on the top, bottom, left, and right of the image. In almost all cases, the padding values are zeros. The general rule for convolution with padding is:

(n \times n) * (f \times f) \rightarrow (n + 2p - f + 1, n + 2p - f + 1)

It is common to want a same convolution, i.e. a convolution padded so that the output size equals the input size. The required padding is given by:

p = \frac{f - 1}{2}

Another type is the valid convolution, where there is no padding (p = 0). Also note that in computer vision f is usually odd, for two reasons:

  1. If f were even, you'd need asymmetric padding.
  2. An odd-dimension filter/kernel has a central position. In computer vision it's nice to have such a 'distinguisher': a central pixel lets you talk about where the filter is.
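As a quick sanity check of the same-convolution rule above, a tiny sketch (the sizes are illustrative):

```python
# 'Same' convolution: p = (f - 1) / 2 keeps the output size equal to the input size
n = 6
for f in (3, 5, 7):              # odd filter sizes, as is conventional
    p = (f - 1) // 2
    out = n + 2 * p - f + 1      # general output-size rule with padding
    print(f"f={f}: p={p}, output is {out}x{out}")  # output stays 6x6
```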

Strided Convolution

Strided convolution, another essential technique in CNNs, moves the filter by a specified stride s instead of one pixel at a time. The general formula is:

(n \times n) * (f \times f) \rightarrow \left( \frac{n + 2p - f}{s} + 1, \frac{n + 2p - f}{s} + 1 \right)

This approach helps in reducing the output dimensions and computational load.

In the case that:

\frac{n + 2p - f}{s} \notin \mathbb{Z}

i.e. it is fractional, we take the floor, rounding the output dimension down to the nearest integer. Computationally, we simply skip any position where the filter box hangs outside the original image (or image plus padding).
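The output-size rule, floor included, fits in one small helper (the function name is just for illustration):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output dimension of an (n x n) * (f x f) convolution with padding p and stride s."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(7, f=3, p=0, s=2))  # 3
print(conv_output_size(6, f=3, p=0, s=2))  # 2: the floor handles the fractional case
```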

In math textbooks, the convolution operation technically flips the filter before applying it. What we have been doing is actually called cross-correlation, but by convention state-of-the-art deep learning work calls it convolution.
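A small sketch of the distinction, assuming SciPy is available: textbook convolution is just cross-correlation with a flipped kernel.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

img = np.random.randn(6, 6)
kernel = np.random.randn(3, 3)

# Textbook convolution flips the kernel; cross-correlation does not, so
# correlating with the flipped kernel reproduces true convolution.
conv = convolve2d(img, kernel, mode="valid")
xcorr = correlate2d(img, np.flip(kernel), mode="valid")
print(np.allclose(conv, xcorr))  # True
```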

A same convolution with stride is a convolution padded so that the output size equals the input size. The required padding is given by:

p = \frac{n \cdot s - n + f - s}{2}

When s = 1, then p = \frac{f - 1}{2}, which is the same as a same convolution without stride.
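This formula follows from requiring the strided output size to equal the input size n and solving for p:

n = \frac{n + 2p - f}{s} + 1 \;\Rightarrow\; (n - 1)s = n + 2p - f \;\Rightarrow\; p = \frac{n \cdot s - n + f - s}{2}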

Convolutions Over Volumes

For 3D images (e.g., RGB), convolution is applied across all channels at once using what we call a stacked filter: the filter has one slice per input channel. Multiple filters can be used to detect various features, resulting in a multi-channel output volume.

For example, a convolution can take a matrix of shape (6,6,3) and a (3,3,3) filter/kernel, giving a (4,4,1) output matrix (image). The output here is only 2D. In this result, p = 0 and s = 1.

We can use multiple filters to detect multiple features or edges. For example, a convolution can take a matrix of shape (6,6,3) and 10 separate (3,3,3) filters/kernels, giving a (4,4,10) output. Here we perform multiple convolutions and stack the outputs, and the result is a 3D output volume.

A lot is possible with these filters. For example, one filter can detect edges only in the red channel, another across all channels, and so on; with different choices of these parameters you can achieve different outputs.
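A minimal TensorFlow sketch of this example, using the tf.nn.conv2d call mentioned earlier (random values stand in for a real image and learned filters):

```python
import tensorflow as tf

x = tf.random.normal([1, 6, 6, 3])    # one (6, 6, 3) image, with a leading batch dimension
w = tf.random.normal([3, 3, 3, 10])   # ten (3, 3, 3) filters: (f, f, in_channels, out_channels)

# p = 0 ("VALID") and s = 1 give a (4, 4, 10) output volume
y = tf.nn.conv2d(x, w, strides=1, padding="VALID")
print(y.shape)  # (1, 4, 4, 10)
```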

One Layer of a Convolutional Network

A convolutional layer consists of several steps:

  1. Convolve filters with the input.
  2. Apply a bias.
  3. Apply a non-linearity (e.g., ReLU) to each convolution output.

The layer's parameters depend on the filter size, not the input size, which makes the layer less prone to overfitting and keeps the parameter count low.

Take the input image:

(6,6,3) = a^{[0]}

Take 10 Filters of size:

(3,3,3) = w^{[1]}

The result image:

(4,4,10) = a^{[0]} * w^{[1]}

Add b (bias) of shape (10,1), which gives us:

(4,4,10) = (a^{[0]} * w^{[1]}) + b = z^{[1]}

Applying ReLU gives us:

(4,4,10) = ReLU(z^{[1]}) = a^{[1]}

In this result p = 0 and s = 1, and the number of parameters is (3 \cdot 3 \cdot 3 \cdot 10) + 10 = 280.

Similar to before:

z^{[1]} = w^{[1]} a^{[0]} + b^{[1]}, \quad a^{[1]} = g(z^{[1]})

This forms a single layer in a CNN. No matter the size of the input, the number of parameters stays the same as long as the filter size does. This makes the layer less prone to overfitting and keeps the parameter count small.
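Putting the three steps together, a minimal NumPy sketch of this layer (the function and variable names are illustrative):

```python
import numpy as np

def conv_layer(a_prev, W, b):
    """One conv layer with p = 0, s = 1: convolve, add bias, apply ReLU.
    a_prev: (n, n, n_c), W: (f, f, n_c, n_filters), b: (n_filters,)"""
    n, f, n_filters = a_prev.shape[0], W.shape[0], W.shape[3]
    z = np.zeros((n - f + 1, n - f + 1, n_filters))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            patch = a_prev[i:i + f, j:j + f, :]              # (f, f, n_c) slice
            for k in range(n_filters):
                z[i, j, k] = np.sum(patch * W[..., k]) + b[k]
    return np.maximum(0, z)                                  # ReLU

a0 = np.random.randn(6, 6, 3)
W1 = np.random.randn(3, 3, 3, 10)
b1 = np.random.randn(10)
print(conv_layer(a0, W1, b1).shape)   # (4, 4, 10)
print(W1.size + b1.size)              # 280 parameters, as computed above
```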

Pooling Layers

Pooling layers, typically following convolutional layers, reduce input size and computational load while maintaining robustness in feature detection. Max pooling and average pooling are common types, with max pooling being more prevalent. Take this simple max pooling example:

(Figure: max pooling example)

In this example:

\begin{align*} f &= 2 \\ s &= 2 \\ p &= 0 \end{align*}

The main reason people use pooling is that it works well in practice and reduces unnecessary computation. Max pooling has no parameters to learn; it just has fixed hyperparameters f and s. Example of max pooling on a 3D input:

  • Input: (4,4,10)
  • Max pooling size f = 2 and stride s = 2
  • Output: (2,2,10)

For multi-channel inputs, max pooling is computed on each channel independently.
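A minimal NumPy sketch of max pooling on the 3D example above (names are illustrative):

```python
import numpy as np

def max_pool(a, f=2, s=2):
    """Max pooling, applied to each channel independently."""
    n_h, n_w, n_c = a.shape
    out = np.zeros(((n_h - f) // s + 1, (n_w - f) // s + 1, n_c))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # max over the f x f window, separately per channel
            out[i, j, :] = a[i * s:i * s + f, j * s:j * s + f, :].max(axis=(0, 1))
    return out

a = np.random.randn(4, 4, 10)
print(max_pool(a).shape)  # (2, 2, 10)
```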

Average pooling takes the average of the values instead of the max. Max pooling is used more often than average pooling in practice, and padding is rarely used in pooling layers.

Everything that influences the loss should appear in backpropagation, because we are computing derivatives. Pooling layers modify their input by choosing one value out of several in each window, so even though they have no parameters, we still need to backpropagate the gradient through them in order to compute derivatives for the layers that do have parameters (convolutional, fully connected).

Notation

Here are some notations we will use. If layer l is our conv layer:

  • f^{[l]}: filter size
  • p^{[l]}: padding
    • The default is 0
  • s^{[l]}: stride
  • n_c^{[l]}: number of filters/channels
  • n_H^{[l]}: height of output volume
  • n_W^{[l]}: width of output volume
  • n^{[l]}: dimensions of the output in the simplified case where n_H^{[l]} = n_W^{[l]}
  • a^{[l]}: output of layer l

Input:

n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}

Output:

n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}

where n_c^{[l]} is the number of filters and n^{[l]} is given by:

n^{[l]} = \frac{n^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1

Each filter is:

f^{[l]} \times f^{[l]} \times n_c^{[l-1]}

i.e. the number of channels in a filter must match the number of channels in the input. Activations:

a^{[l]} = n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}, \quad A^{[l]} = m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}

where A^{[l]} is the activation in batch or minibatch training, and m is the number of examples in the batch; it is a 4D tensor. Weights:

f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}

Bias:

(1, 1, 1, n_c^{[l]})

We have one real number for each filter; the bias is stored as a 4D tensor.
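A small sketch that turns this notation into code (the function name is illustrative): given one conv layer's hyperparameters, it returns the output volume and parameter count.

```python
def conv_layer_dims(n_prev, n_c_prev, f, n_c, p=0, s=1):
    """Output volume and parameter count for one conv layer, per the notation above."""
    n_out = (n_prev + 2 * p - f) // s + 1         # n^[l]
    params = f * f * n_c_prev * n_c + n_c         # weights plus one bias per filter
    return (n_out, n_out, n_c), params

print(conv_layer_dims(39, 3, f=3, n_c=10))  # ((37, 37, 10), 280) -- the example below
```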

Simple CNN Example

Here, the input is a (39,39,3) image, so n^{[0]} = 39 and n_c^{[0]} = 3. We have a (3,3,3) filter, and we are using 10 filters. In the first conv layer:

\begin{align*} f^{[1]} &= 3 \\ p^{[1]} &= 0 \\ s^{[1]} &= 1 \\ n_c^{[1]} &= 10 \end{align*}

The output is a (37,37,10) volume. The number of parameters is:

(3 \cdot 3 \cdot 3 \cdot 10) + 10 = 280

The rest of the layers follow a similar pattern. Notice that when the stride is larger, the dimensions of the output volume decrease faster.

The last layer is a fully connected softmax layer. We can take the output volume and unroll it into a vector of size a^{[3]} = 7 \times 7 \times 40 = 1960, then feed this to a logistic regression or softmax unit.

| Layer | Activation Shape | Activation Size | # Parameters |
| --- | --- | --- | --- |
| Input | (32,32,3) | 3072 | 0 |
| CONV1 (f=5, s=1) | (28,28,8) | 6272 | (5 × 5 × 3) = 75 parameters per kernel; 8 filters + 8 biases: (75 × 8) + 8 = 608 |
| POOL1 | (14,14,8) | 1568 | 0 |
| CONV2 (f=5, s=1) | (10,10,16) | 1600 | (5 × 5 × 8) = 200 parameters per kernel; 16 filters + 16 biases: (200 × 16) + 16 = 3216 |
| POOL2 | (5,5,16) | 400 | 0 |
| FC3 | (120,1) | 120 | 48120 |
| FC4 | (10,1) | 10 | 850 |
| Total | | | 52794 |

There are a lot of hyperparameters. For choosing the value of each, you should follow the guidelines we will discuss later, or check the literature and take ideas and numbers from it. Usually, the input size decreases over layers while the number of filters increases. A CNN usually consists of one or more convolutions (not just one, as in the examples shown) followed by a pooling layer. You can also see that fully connected layers have the most parameters in the network, while conv layers have comparatively few.

Before combining these blocks yourself, look at other working examples first to build intuition.

Why Convolutions?

Convolutions offer two main advantages:

  1. Parameter Sharing: A feature detector (such as a vertical edge detector) useful in one part of the image is likely useful elsewhere, reducing the number of parameters and the risk of overfitting.
    • This greatly reduces the number of parameters to train: fully connecting a (32 × 32 × 3) input to a (28 × 28 × 6) output would require about 14 million connections, as opposed to (5 \cdot 5 \cdot 3 + 1) \cdot 6 = 456 parameters with f = 5 and 6 filters (see the quick check after this list).
  2. Sparsity of Connections: Each output value depends on a small number of inputs, providing translation invariance and reducing computational complexity.
    • A cat shifted a bit to the right is still a cat. This is because of the sparsity of connections.
    • One cell in the convolution output is connected to only (say) 9 values of the entire source image; each activation in the next layer depends on only a small number of activations from the previous layer.
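A quick check of the parameter-sharing arithmetic from point 1 (layer sizes from the example above):

```python
# Fully connected: every input unit connects to every output unit
fc_connections = (32 * 32 * 3) * (28 * 28 * 6)   # 14,450,688 -- about 14 million
# Convolutional: 6 filters of 5x5x3, plus one bias each
conv_params = (5 * 5 * 3 + 1) * 6                # 456
print(fc_connections, conv_params)
```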

1 X 1 Convolutions & Network in Network

Role of 1 x 1 Convolutions

1 x 1 convolutions, often termed as "Network in Network" (NiN), have a pivotal role in CNN architecture:

  • Shrinking Channels: Reduces the number of channels, aiding in computational efficiency.
  • Feature Transformation: Acts as a fully connected layer applied at each pixel location.

Fundamentals of 1 x 1 Convolutions

What does a 1 X 1 convolution do though? At first glance, it may seem like it is just multiplying by a scalar.

  • Example 1:

    • Input: 6 \times 6 \times 1 (single-channel)
    • Operation: 1 \times 1 \times 1 convolution with one filter.
    • Output: 6 \times 6 \times 1; if the filter's single value is, say, 2, the output is just the input with every element doubled.

    1 x 1 Convolution Example

  • Example 2:

    • Input: 6 \times 6 \times 32 (multi-channel)
    • Operation: 1 \times 1 \times 32 convolution with 5 filters, each performing an element-wise product followed by a ReLU non-linearity.
      • The convolution looks at each of the 36 positions, takes the element-wise product of the 32 numbers in the input with the filter, and applies a ReLU non-linearity.
    • Output: 6 \times 6 \times 5, transforming the 32-channel input into 5 channels.

    1 x 1 Convolution with Multiple Filters

In this second example, the operation is analogous to having 5 neurons, each taking the 32-dimensional input at a position, multiplying it by 32 weights, and applying a ReLU non-linearity. This emphasizes that a 1 x 1 convolution is not trivial; it amounts to a significant transformation.

If you have multiple filters, it is like having a fully connected network applied at each of the 36 positions. It is a non-trivial computation.
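A minimal NumPy sketch of the second example (random values stand in for real activations and learned weights):

```python
import numpy as np

x = np.random.randn(6, 6, 32)   # 32-channel input
W = np.random.randn(32, 5)      # five 1x1x32 filters, one column per filter

# At each of the 36 positions, dot the 32 channel values with every filter,
# then apply ReLU -- a tiny fully connected layer applied per position.
a = np.maximum(0, np.tensordot(x, W, axes=([2], [0])))
print(a.shape)  # (6, 6, 5)
```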

Application in Modern CNNs

These are used extensively in sophisticated models like ResNet and Inception, and have become a key component for learning complex functions within networks. They are especially valuable for feature transformation (in the second example above we shrank the input from 32 channels to 5), not just for reducing the height and width as max pooling does; shrinking the number of channels is itself a common use.

Computation Savings and Feature Transformation

By shrinking the number of channels, 1 x 1 convolutions can significantly decrease computational requirements without compromising the network's capability.

Consistency and Non-Linearity

If the number of 1 x 1 convolution filters equals the number of input channels, the output retains the same number of channels. In this case the convolution layer serves as a complex non-linear function, contributing to the network's ability to learn richer representations.

Lin et al., 2013. Network in network

Yann LeCun's Perspective

Yann LeCun, a pioneering figure in convolutional networks, asserts that 1 x 1 convolutions effectively serve the same purpose as fully connected layers within CNNs, thereby challenging the traditional distinction between convolutional and fully connected layers.

"In Convolutional Nets, there is no such thing as fully-connected layers. There are only convolution layers with 1x1 convolution kernels and a full connection table." ~ Yann LeCun