Object Detection with Sliding Windows

Object detection involves identifying and locating objects within images. It's a step beyond image classification by not only recognizing what objects are present but also pinpointing their locations.

Training Convolutional Neural Networks for Detection

The process begins by training a convolutional network (ConvNet) on cropped images of the object of interest (e.g., cars) and various non-object images.

Trained ConvNet on Cropped Images

Once the ConvNet is adept at distinguishing the object, it can be employed in a technique known as sliding window detection.

Sliding Window Detection Technique

Sliding windows detection is an algorithmic approach where a fixed-size window "slides" across the image, and at each position, the ConvNet classifies whether the window contains the object.

Steps of Sliding Windows Detection

Select a window size.
Segment the image into rectangles of the chosen size, ensuring coverage of the entire image.
Apply the ConvNet to each rectangle to classify if it contains the object.
Repeat with rectangles of varying sizes for comprehensive detection.
Record rectangles that confidently contain the object.
In cases of overlapping rectangles, prioritize the one with the highest classification confidence.

Sliding Windows Process

While effective, a significant drawback of this method is its computational intensity, particularly for complex deep learning models.

In the era of machine learning before deep learning, people used hand-crafted linear classifiers that classifies the object and then use the sliding window technique. The linear classier make it a cheap computation. But in the deep learning era that is too computationally expensive due to the complexity of the DL models.

To solve this problem, we can implement sliding windows with a Convolutional approach. One other idea is to compress your deep learning model.

Convolutional Implementation of Sliding Windows

To mitigate the computational burden, sliding windows can be integrated into a convolutional approach.

Transforming Fully Connected to Convolutional Layers

Transform fully connected (FC) layers into convolutional layers, enabling the ConvNet to process spatial information more efficiently.

FC to Convolutional Transformation

As you can see in the above image, we turned the FC layer into a Conv layer using a convolution with the width and height of the filter as the same as the width and height of the input. Each of these 400 nodes has a filter of dimension $5 \times 5 \times 16$ . So, each of those 400 values is some arbitrary linear function of these $5 \times 5 \times 16$ activations from the previous layer.

Convolutional Sliding Windows Approach

First let's consider that the Conv net you trained is like this (No FC, all you have is conv layers):

Say now we have a $16 \times 16 \times 3$ image that we need to apply the sliding windows in. By the normal implementation that have been mentioned in the section before this, we would run this Conv net four times each rectangle size will be $16 \times 16$ .

Notice a lot of this computation done by these four convnets is highly duplicative. The convolution implementation will be as follows:

Simply we have feed the image into the same Conv net we have trained. Each one of these is what we would have gotten above. The convolutional implementation of sliding windows allows these windows in the convnet to share a lot of computation. Instead of forcing you to run forward propagation on four subsets of the input image independently, Instead, it combines all four into one forward computation and shares a lot of the computation in the regions of image that are common

The left cell of the result "The blue one" will represent the first sliding window of the normal implementation. The other cells will represent the others. Its more efficient because it now shares the computations of the four times needed for the sliding windows technique. Another example would be:

This example has a total of 16 sliding windows that shares the computation together.

Sermanet et al., 2014, "OverFeat: Integrated Recognition, Localization, and Detection Using Convolutional Networks"

Advantages and Limitations

The convolutional implementation accelerates the object detection process by consolidating the computations required for each window. However, it may result in less precise localization due to the fixed nature of the window sizes.

In red, the rectangle we want and in blue is our sliding window. This algorithm clearly still has one weakness, which is the position of the bounding boxes; it is not going to be too accurate. The perfect bounding box isn't even quite square -- it has a horizontal aspect ratio.

Enhancing Precision

Advanced techniques, such as anchor boxes and region proposal networks, can be introduced to improve the precision of object localization.

Landmark Detection Bounding Box Predictions (YOLO)