
Bounding Box Predictions with YOLO Algorithm

YOLO (You Only Look Once), developed in 2015, is a highly efficient algorithm for object detection that predicts bounding boxes and class probabilities directly from full images in one evaluation.

Advantages of YOLO

  1. Direct Bounding Box Predictions: YOLO outputs precise bounding box coordinates with arbitrary aspect ratios, overcoming the fixed window shapes and stride sizes of traditional sliding-window approaches.
  2. Convolutional Implementation: The algorithm is fully convolutional, enhancing computational efficiency.

YOLO Architecture

How YOLO Works

Constructing the Target Vector

  • YOLO divides an image into a grid (e.g., $3 \times 3$ for simplicity, but higher resolutions like $19 \times 19$ are common for finer detection).
  • Each grid cell predicts bounding boxes and class probabilities.
  • The output target vector for each grid cell is 8-dimensional in our simplified example, thus the overall output dimension is $3 \times 3 \times 8$.

Steps of YOLO

Let's say we have an image of $100 \times 100$:

  1. Image Division: Divide the $100 \times 100$ image into a $3 \times 3$ grid.
  2. Apply Classification and Localization: For each grid cell, apply the classification and localization algorithm discussed in a previous section. Predict bounding box coordinates ($b_x, b_y, b_h, b_w$) and class probabilities. The coordinates are relative to the grid cell, allowing predictions of varying sizes.
  3. Use Convolutional Sliding Window: Implement the entire process in a single convolutional pass to output a $3 \times 3 \times 8$ volume for the $100 \times 100$ image.
  4. Merge Results: Combine overlapping predictions based on localization mid-points.
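
To make steps 1 and 2 concrete, here is a minimal NumPy sketch (not the original YOLO implementation) of how the $3 \times 3 \times 8$ ground-truth volume for one image could be filled in. The label format, the helper name `encode_target`, and the convention of expressing coordinates relative to the grid cell are illustrative assumptions.

```python
import numpy as np

GRID = 3          # the 3x3 grid from the example above
NUM_CLASSES = 3   # c1, c2, c3
VEC = 1 + 4 + NUM_CLASSES  # [P_c, b_x, b_y, b_h, b_w, c1, c2, c3] = 8 values per cell

def encode_target(objects, img_size=100):
    """Build a (GRID, GRID, 8) ground-truth volume for one image.

    `objects` is a list of (x_center, y_center, height, width, class_id)
    in pixels -- an assumed label format for this sketch.
    """
    y = np.zeros((GRID, GRID, VEC))                   # P_c = 0 everywhere by default
    cell = img_size / GRID                            # each cell covers ~33.3 px
    for (xc, yc, h, w, cls) in objects:
        col, row = int(xc // cell), int(yc // cell)   # cell holding the midpoint
        y[row, col, 0] = 1.0                          # P_c: an object is present
        y[row, col, 1] = (xc % cell) / cell           # b_x, relative to the cell
        y[row, col, 2] = (yc % cell) / cell           # b_y, relative to the cell
        y[row, col, 3] = h / cell                     # b_h, can exceed 1 for tall objects
        y[row, col, 4] = w / cell                     # b_w, can exceed 1 for wide objects
        y[row, col, 5 + cls] = 1.0                    # one-hot class label
    return y

# One object (class 1) whose midpoint falls in the centre cell of a 100x100 image:
print(encode_target([(50, 55, 40, 60, 1)]).shape)     # (3, 3, 8)
```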

Advantages & Disadvantages

YOLO's integrated approach leads to precise bounding box predictions and high processing speed.

A problem arises, though, if the midpoints of more than one object fall into the same grid cell. With a finer grid, such as $19 \times 19$, the chance that two objects' midpoints land in the same cell is much smaller.

One of the main advantages that makes YOLO popular is its speed and its fully convolutional implementation. How is YOLO different from other object detectors? YOLO uses a single CNN for both classifying and localizing objects with bounding boxes.

Key Techniques in YOLO

Intersection Over Union (IoU)

  • IoU is used to measure the accuracy of object detection.
  • It's the ratio of the intersection area to the union area of the predicted and ground truth bounding boxes.
  • A higher IoU indicates better accuracy.

Intersection Over Union

The red box is the labeled (ground-truth) output and the purple box is the predicted output. To compute Intersection over Union we first compute the intersection area of the two rectangles, and then the union area, which is the area of the first rectangle plus the area of the second rectangle minus that intersection.

Finally:

$$IoU = \frac{\text{intersection area}}{\text{union area}}$$

An $IoU \geq 0.5$ is usually considered good. The best possible value is 1; the higher the IoU, the better the localization accuracy.
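
As a minimal sketch, IoU can be computed directly from corner coordinates; the box format $(x_1, y_1, x_2, y_2)$ and the helper name `iou` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)       # 0 if the boxes do not overlap

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                     # subtract the double-counted overlap

    return inter / union

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))   # 4 / 28 ≈ 0.14 -> a poor match
```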

Non-Max Suppression

One of the problems with YOLO is that it can detect the same object multiple times. Non-max suppression is a way to make sure that YOLO detects each object just once: it keeps only the most probable predictions and removes non-maximal ones based on IoU thresholds.

For example:

Non-Max Suppression Example

Each car has two or more detections with different probabilities, because several grid cells each think they contain the object's midpoint. The non-max suppression algorithm is as follows:

  1. Let's assume that we are targeting one class as an output class.
  2. The $Y$ shape should be $[P_c, b_x, b_y, b_h, b_w]$, where $P_c$ is the probability that an object is present.
  3. Discard all boxes with $P_c < 0.6$.
  4. While there are any remaining boxes:
    • Pick the box with the largest $P_c$ and output it as a prediction.
    • Discard any remaining box with $IoU > 0.5$ with the box output in the previous step, i.e. any box with high overlap (greater than the overlap threshold of 0.5).

If there are $c$ classes/object types you want to detect, you should run non-max suppression $c$ times, once for every output class.
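
A minimal sketch of the greedy loop above, for a single class; it assumes each box is a $(P_c, x_1, y_1, x_2, y_2)$ tuple and reuses the `iou` helper from the previous sketch. For multiple classes you would call it once per class.

```python
def non_max_suppression(boxes, p_threshold=0.6, iou_threshold=0.5):
    """Greedy non-max suppression for one class.

    `boxes` is a list of (P_c, x1, y1, x2, y2) tuples -- an assumed format.
    """
    # Step 3: discard low-confidence boxes
    boxes = [b for b in boxes if b[0] >= p_threshold]
    kept = []
    # Step 4: repeatedly keep the most confident box and drop overlapping ones
    while boxes:
        best = max(boxes, key=lambda b: b[0])
        kept.append(best)
        # The best box overlaps itself with IoU = 1, so it is removed here as well
        boxes = [b for b in boxes if iou(b[1:], best[1:]) <= iou_threshold]
    return kept
```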

Anchor Boxes

Anchor boxes are used when multiple objects fall within the same grid cell. They allow multiple objects to be detected in a single cell by giving each one a predefined shape and size.

Anchor Box Concept

Here a car and a person have their center points in the same grid cell. In practice this happens rarely, but the idea of anchor boxes helps us solve this issue.

If $Y = [P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$, then to use two anchor boxes we simply repeat that vector once per anchor: $Y = [P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3, P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$. The anchor boxes you choose should have known shapes (in practice you will have multiple, e.g. 5 anchor boxes):

Anchor Boxes

Previously, each object in a training image was assigned to the grid cell that contains that object's midpoint. With two anchor boxes, each object is assigned to the grid cell that contains its midpoint and to the anchor box for that grid cell with the highest IoU. You check which anchor box the object's rectangle is closest to; whichever it is, the object then gets assigned not just to a grid cell but to a (grid cell, anchor box) pair, and that is how it is encoded in the target label:

Example of data:

Here the car's shape was closer to anchor 2 than to anchor 1, so it is encoded in anchor 2's slot. What doesn't this solution handle well? Two objects of the same shape in the same cell, or more than two objects in the same grid cell. In practice this happens rarely, especially with a $19 \times 19$ grid. Anchor boxes also let the detector specialize better.
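
One way to pick the best anchor for an object is to compare shapes only, as if both rectangles were centred on the same point; the helper name `best_anchor` and the example anchor shapes below are illustrative assumptions.

```python
def best_anchor(box_hw, anchors):
    """Index of the anchor whose shape has the highest IoU with the object's shape.

    Both the box and the anchors are (height, width) pairs; centring them on
    the same point reduces IoU to a comparison of shapes only.
    """
    h, w = box_hw
    best_i, best_val = 0, 0.0
    for i, (ah, aw) in enumerate(anchors):
        inter = min(h, ah) * min(w, aw)          # overlap of co-centred rectangles
        union = h * w + ah * aw - inter
        if inter / union > best_val:
            best_i, best_val = i, inter / union
    return best_i

anchors = [(3.0, 1.0),    # anchor 1: tall and narrow (person-like)
           (1.0, 3.0)]    # anchor 2: short and wide (car-like)
print(best_anchor((0.8, 2.5), anchors))   # 1 -> a car-shaped box matches anchor 2
```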

You may have two or more anchor boxes but you should know their shapes.

  • How do you choose the anchor boxes? People used to choose them by hand: maybe five or ten anchor box shapes that span a variety of shapes covering the types of objects you detect most frequently.
  • Alternatively, use k-means clustering on the bounding boxes in your dataset to select a set of anchor boxes that are most representative of the (possibly dozens of) object classes you are trying to detect.

Anchor boxes allow your algorithm to specialize, which in our case means it can more easily detect wider objects versus taller ones.
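
Below is a rough sketch of the k-means idea for choosing anchors, clustering the dataset's $(\text{height}, \text{width})$ box shapes. For simplicity it uses plain Euclidean distance, whereas the YOLOv2 paper actually clusters with an IoU-based distance ($1 - IoU$); the function name is illustrative.

```python
import numpy as np

def kmeans_anchors(box_shapes, k=5, iters=50, seed=0):
    """Cluster (height, width) box shapes into k representative anchor shapes."""
    rng = np.random.default_rng(seed)
    shapes = np.asarray(box_shapes, dtype=float)
    # Start from k randomly chosen box shapes
    centers = shapes[rng.choice(len(shapes), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box shape to its nearest anchor
        labels = np.argmin(((shapes[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each anchor to the mean of the shapes assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = shapes[labels == j].mean(axis=0)
    return centers   # k (height, width) anchor shapes
```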

YOLO Algorithm in Practice

YOLO is a state-of-the-art object detection model that is fast and accurate. Let's sum up and walk through the whole YOLO algorithm with an example. Suppose we need to do object detection for our autonomous driving system. It needs to identify three classes:

  1. Pedestrian (Walks on ground).
  2. Car.
  3. Motorcycle.
  4. You will also need to account for a fourth class, background, for grid cells that contain no object.

We decided to choose two anchor boxes, a taller one and a wider one. As mentioned, in practice people use five or more anchor boxes, chosen by hand or generated using k-means. Our labeled $Y$ shape will be $[N_y, \text{HeightOfGrid}, \text{WidthOfGrid}, 16]$, where $N_y$ is the number of training examples.

Counting the size: each anchor box needs

$$P_c + \text{4 bounding box parameters} = 5 \text{ parameters}$$

plus one probability per class, so in total:

$$\text{no. of anchors} \times (5 + \text{no. of classes}) = 2 \times (5 + 3) = 16$$

These 16 look like:

$$[P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3, P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$$

Your dataset consists of images, each with multiple labels and a rectangle for each label; you should go through the dataset and build the shape and values of $Y$ as we agreed, using 2 anchor boxes:

We first initialize all of the entries to zeros (and "don't care" values, ?), then for each label and rectangle we choose the grid cell containing its midpoint and the best anchor box based on IoU, and fill in the corresponding values. The shape of $Y$ for one image is therefore $[\text{HeightOfGrid}, \text{WidthOfGrid}, 16]$.
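
Extending the earlier single-anchor sketch, one possible way to fill in the per-image $3 \times 3 \times 16$ target with two anchors is shown below; it reuses the illustrative `best_anchor` helper and label format from above.

```python
import numpy as np

def encode_target_anchors(objects, anchors, grid=3, num_classes=3, img_size=100):
    """Ground-truth volume of shape (grid, grid, num_anchors * (5 + num_classes)).

    `objects` holds (x_center, y_center, height, width, class_id) in pixels,
    the same assumed label format as in the earlier sketch.
    """
    slot = 5 + num_classes                               # 8 values per anchor
    y = np.zeros((grid, grid, len(anchors) * slot))      # every P_c starts at 0
    cell = img_size / grid
    for (xc, yc, h, w, cls) in objects:
        col, row = int(xc // cell), int(yc // cell)      # grid cell of the midpoint
        a = best_anchor((h / cell, w / cell), anchors)   # closest-shaped anchor
        o = a * slot                                     # offset of that anchor's slot
        y[row, col, o] = 1.0                                             # P_c
        y[row, col, o + 1:o + 5] = [(xc % cell) / cell, (yc % cell) / cell,
                                    h / cell, w / cell]                  # b_x, b_y, b_h, b_w
        y[row, col, o + 5 + cls] = 1.0                                   # one-hot class
    return y

two_anchors = [(3.0, 1.0),   # taller anchor
               (1.0, 3.0)]   # wider anchor
print(encode_target_anchors([(50, 55, 20, 60, 1)], two_anchors).shape)  # (3, 3, 16)
```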

Train the labeled images on a ConvNet; you should receive an output of shape $[\text{HeightOfGrid}, \text{WidthOfGrid}, 16]$ in our case. To make predictions, run the ConvNet on an image and then run the non-max suppression algorithm for each class you have; in our case there are 3 classes.

You could get something like this:

For each grid cell, we will get 2 predicted bounding boxes, one for each anchor box we have specified. The total number of generated boxes is $\text{grid height} \times \text{grid width} \times \text{no. of anchors}$; in our case, $3 \times 3 \times 2 = 18$ boxes. After removing the low-probability predictions you should have:

Now, for each class (pedestrian, car, motorcycle), you independently run non-max suppression to generate the final predictions: pick the highest-probability box, then apply the IoU filtering:
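
A sketch of this final filtering stage, assuming the grid output has already been decoded into a flat list of boxes with a class id; it reuses the illustrative `non_max_suppression` helper from earlier.

```python
def final_predictions(pred_boxes, num_classes=3, p_threshold=0.6, iou_threshold=0.5):
    """Run non-max suppression independently for every class.

    `pred_boxes` is a list of (P_c, x1, y1, x2, y2, class_id) tuples, an assumed
    format produced by decoding the grid output.
    """
    results = []
    for cls in range(num_classes):                # pedestrian, car, motorcycle
        # Collect the boxes predicted for this class...
        cls_boxes = [(p, x1, y1, x2, y2)
                     for (p, x1, y1, x2, y2, c) in pred_boxes if c == cls]
        # ...and suppress duplicate detections for this class only.
        for box in non_max_suppression(cls_boxes, p_threshold, iou_threshold):
            results.append((cls, *box))
    return results
```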

It should be noted that the YOLO algorithm is not good at detecting smaller objects. YOLO9000, however, is "better, faster, stronger".

