Object Detection with Sliding Windows
Object detection involves identifying and locating objects within images. It's a step beyond image classification by not only recognizing what objects are present but also pinpointing their locations.
Training Convolutional Neural Networks for Detection
The process begins by training a convolutional network (ConvNet) on cropped images of the object of interest (e.g., cars) and various non-object images.
Once the ConvNet is adept at distinguishing the object, it can be employed in a technique known as sliding window detection.
Sliding Window Detection Technique
Sliding windows detection is an algorithmic approach where a fixed-size window "slides" across the image, and at each position, the ConvNet classifies whether the window contains the object.
Steps of Sliding Windows Detection
- Select a window size.
- Segment the image into rectangles of the chosen size, ensuring coverage of the entire image.
- Apply the ConvNet to each rectangle to classify if it contains the object.
- Repeat with rectangles of varying sizes for comprehensive detection.
- Record rectangles that confidently contain the object.
- In cases of overlapping rectangles, prioritize the one with the highest classification confidence.
While effective, a significant drawback of this method is its computational intensity, particularly for complex deep learning models.
In the era of machine learning before deep learning, people used hand-crafted linear classifiers that classifies the object and then use the sliding window technique. The linear classier make it a cheap computation. But in the deep learning era that is too computationally expensive due to the complexity of the DL models.
To solve this problem, we can implement sliding windows with a Convolutional approach. One other idea is to compress your deep learning model.
Convolutional Implementation of Sliding Windows
To mitigate the computational burden, sliding windows can be integrated into a convolutional approach.
Transforming Fully Connected to Convolutional Layers
Transform fully connected (FC) layers into convolutional layers, enabling the ConvNet to process spatial information more efficiently.
As you can see in the above image, we turned the FC layer into a Conv layer using a convolution with the width and height of the filter as the same as the width and height of the input. Each of these 400 nodes has a filter of dimension . So, each of those 400 values is some arbitrary linear function of these activations from the previous layer.
Convolutional Sliding Windows Approach
First let's consider that the Conv net you trained is like this (No FC, all you have is conv layers):
Say now we have a image that we need to apply the sliding windows in. By the normal implementation that have been mentioned in the section before this, we would run this Conv net four times each rectangle size will be .
Notice a lot of this computation done by these four convnets is highly duplicative. The convolution implementation will be as follows:
Simply we have feed the image into the same Conv net we have trained. Each one of these is what we would have gotten above. The convolutional implementation of sliding windows allows these windows in the convnet to share a lot of computation. Instead of forcing you to run forward propagation on four subsets of the input image independently, Instead, it combines all four into one forward computation and shares a lot of the computation in the regions of image that are common
The left cell of the result "The blue one" will represent the first sliding window of the normal implementation. The other cells will represent the others. Its more efficient because it now shares the computations of the four times needed for the sliding windows technique. Another example would be:
This example has a total of 16 sliding windows that shares the computation together.
Sermanet et al., 2014, "OverFeat: Integrated Recognition, Localization, and Detection Using Convolutional Networks"
Advantages and Limitations
The convolutional implementation accelerates the object detection process by consolidating the computations required for each window. However, it may result in less precise localization due to the fixed nature of the window sizes.
In red, the rectangle we want and in blue is our sliding window. This algorithm clearly still has one weakness, which is the position of the bounding boxes; it is not going to be too accurate. The perfect bounding box isn't even quite square -- it has a horizontal aspect ratio.
Enhancing Precision
Advanced techniques, such as anchor boxes and region proposal networks, can be introduced to improve the precision of object localization.