Object Localization

Object localization is a critical step in computer vision that bridges the gap between image classification and more complex tasks like object detection and segmentation.

Image Classification

In a basic image classification task, the goal is to categorize an entire image as belonging to one of the predefined classes. Typically, there's a single, central object of interest in the image.

Image Classification Example

Classification with Localization

Classification with localization not only categorizes the image but also identifies the location of the object within the image using a bounding box. This is generally applied when a single object's position within the image is also of interest.

Classification with Localization Example

Object Detection

Object detection extends localization to multiple objects. The task involves detecting all objects of certain classes and their locations within an image. This is crucial for complex scenes with multiple objects, such as in autonomous driving systems.

Object Detection Example

In an autonomous driving application, for example, you might need to detect not only other cars but also pedestrians, motorcycles, and other objects.

Semantic Segmentation

Semantic segmentation performs classification at the pixel level, labeling each pixel of the image with a category. Unlike object detection, it does not differentiate between distinct objects of the same class.

Semantic Segmentation Example

Instance Segmentation

Instance segmentation combines the fine-grained pixel-level classification of semantic segmentation with object differentiation. It not only labels each pixel but also distinguishes between different instances of the same class.

Instance Segmentation Example
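
To make the contrast between semantic and instance segmentation concrete, here is a minimal Python sketch on a toy one-row image. The class ids, the instance ids, and the helper `count_instances` are hypothetical and chosen only for illustration.

```python
# Toy 1x6 "image" row: two adjacent cars followed by background.
# Semantic mask: one class id per pixel (assumed ids: 0 = background, 1 = car).
semantic_mask = [1, 1, 1, 1, 0, 0]

# Instance mask: one instance id per pixel (0 = no instance).
# Ids 1 and 2 mark two different cars that share the same class.
instance_mask = [1, 1, 2, 2, 0, 0]


def count_instances(class_mask, inst_mask, class_id):
    """Count distinct instance ids among pixels of the given class."""
    return len({i for c, i in zip(class_mask, inst_mask)
                if c == class_id and i != 0})


num_cars = count_instances(semantic_mask, instance_mask, class_id=1)
```

The semantic mask alone cannot tell the two cars apart because all four car pixels carry the same label. The instance ids are what separate them.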

Mechanism of Localization in ConvNets

For classification with localization, a ConvNet uses a softmax layer for class prediction, plus additional outputs that specify the bounding box ($b_x$, $b_y$, $b_h$, $b_w$).

Each training example should therefore include these four numbers along with the class label. By convention, the upper-left corner of the image is the coordinate (0,0) and the lower-right corner is (1,1).
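
As an illustration of this coordinate convention, the following sketch converts a box given in pixel corners into normalized ($b_x$, $b_y$, $b_h$, $b_w$) values. The function name `normalize_box` and the corner-based input format are assumptions made for this example. Annotation formats vary between datasets.

```python
def normalize_box(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to (bx, by, bh, bw) in [0, 1].

    Assumes the image origin (0, 0) is the upper-left corner and that
    (img_w, img_h) maps to the lower-right corner (1, 1).
    """
    bx = (x_min + x_max) / 2 / img_w  # box center, horizontal
    by = (y_min + y_max) / 2 / img_h  # box center, vertical
    bh = (y_max - y_min) / img_h      # box height as a fraction of image height
    bw = (x_max - x_min) / img_w      # box width as a fraction of image width
    return bx, by, bh, bw


# Hypothetical 400x1000 image with a box from (100, 200) to (300, 500).
box = normalize_box(100, 200, 300, 500, img_w=400, img_h=1000)
```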

Target Label in Localization

The target label vector in a classification with localization problem typically contains:

$P_c$ : Probability that an object is present.
$b_x, b_y$ : Bounding box center coordinates.
$b_h, b_w$ : Bounding box height and width.
Class probabilities: $c_1, c_2, \ldots$

The target label vector $Y$ for classification with localization is laid out as follows:

Y = [
    Pc # Probability that an object is present (is there an object?)
    bx # Bounding box center, x coordinate
    by # Bounding box center, y coordinate
    bh # Bounding box height
    bw # Bounding box width
    c1 # Class indicators
    c2
    ...
]

Example (when an object of the second of three classes is present):

Y = [
    1 # Object is present
    0.5
    0.7
    0.3
    0.4
    0
    1
    0
]

Example (when no object is present):

Y = [
    0 # No object is present
    ? # ? means we don't care about the remaining values
    ?
    ?
    ?
    ?
    ?
    ?
]
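
The two cases above can be sketched as a small helper that builds the target vector. The name `make_target`, the default of three classes, and the use of `None` to stand in for the "don't care" entries are assumptions made for this illustration.

```python
def make_target(box=None, class_index=None, num_classes=3):
    """Build [Pc, bx, by, bh, bw, c1, ..., cn] for one training image.

    box is (bx, by, bh, bw) in normalized coordinates, or None when the
    image contains no object. None entries mark "don't care" values.
    """
    if box is None:
        return [0] + [None] * (4 + num_classes)
    # One-hot encoding of the object's class.
    one_hot = [1 if i == class_index else 0 for i in range(num_classes)]
    return [1, *box, *one_hot]


y_present = make_target(box=(0.5, 0.7, 0.3, 0.4), class_index=1)
y_absent = make_target()
```

Here `y_present` reproduces the "object is present" example, and `y_absent` has $P_c = 0$ followed by seven placeholder entries.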

Loss Function for Localization

The loss function for the target $Y$ defined above, using squared error as an example (here $y'$ is the network's prediction and $y$ is the label):

$$
L(y', y) = \begin{cases} (y_1' - y_1)^2 + (y_2' - y_2)^2 + \ldots & \text{if } y_1 = 1 \\ (y_1' - y_1)^2 & \text{if } y_1 = 0 \end{cases}
$$
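
A minimal Python sketch of this piecewise squared-error loss follows. The function name `localization_loss` and the list-based representation of the vectors are assumptions for illustration. When $y_1 = 0$, only the first component is read, so the "don't care" entries never enter the computation.

```python
def localization_loss(y_pred, y_true):
    """Squared-error loss that ignores box and class terms when no object exists."""
    if y_true[0] == 1:
        # Object present: penalize every component of the output vector.
        return sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    # No object: penalize only the objectness prediction.
    return (y_pred[0] - y_true[0]) ** 2


# Hypothetical predictions, used only to exercise both branches.
loss_present = localization_loss(
    [0.5, 0.5, 0.7, 0.3, 0.4, 0, 1, 0],
    [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0],
)
loss_absent = localization_loss(
    [0.5, 0.9, 0.9, 0.9, 0.9, 1, 1, 1],
    [0, None, None, None, None, None, None, None],
)
```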

In practice, different loss components are typically used for different parts of the output: a logistic regression loss for $P_c$, a log-likelihood loss for the class probabilities, and squared error for the bounding box coordinates.
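
One way such a mixed loss could look is sketched below. It assumes the prediction vector already holds probabilities (an objectness value in (0, 1) and softmax class probabilities), that the components are simply summed with equal weight, and that a small `eps` guards the logarithms. The name `mixed_loss`, the equal weighting, and `eps` are illustrative choices, not a prescribed recipe.

```python
import math


def mixed_loss(y_pred, y_true, eps=1e-12):
    """Logistic loss on Pc, squared error on the box, log-likelihood on classes."""
    pc_pred, pc_true = y_pred[0], y_true[0]
    # Binary cross-entropy (logistic regression loss) on the objectness output.
    loss = -(pc_true * math.log(pc_pred + eps)
             + (1 - pc_true) * math.log(1 - pc_pred + eps))
    if pc_true == 1:
        # Squared error on (bx, by, bh, bw).
        loss += sum((p - t) ** 2 for p, t in zip(y_pred[1:5], y_true[1:5]))
        # Negative log-likelihood of the true class under the predicted distribution.
        loss -= sum(t * math.log(p + eps) for p, t in zip(y_pred[5:], y_true[5:]))
    return loss


# No object in the label: only the objectness term contributes.
loss_no_object = mixed_loss([0.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.4],
                            [0, None, None, None, None, None, None, None])
```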
