
Face Recognition

Face recognition systems identify a person by analyzing their facial features, and they work with both still images and video. A critical component, liveness detection, is used in video face recognition to differentiate between live persons and static images, ensuring authenticity. This is typically achieved through supervised deep learning, with datasets labeled to distinguish live from non-live subjects.

  • Face Verification: Involves a one-to-one matching process. Given an image and a name or ID, the system determines whether the image corresponds to the claimed identity. Essentially, it answers the question, "Is this the claimed person?"
  • Face Recognition: This is a more complex process, involving a one-to-many comparison. The system, holding a database of $K$ individuals, receives an input image and identifies whether it matches any of the $K$ persons in the database, answering "Who is this person?"

Face verification and face recognition differ primarily in scale: verification involves comparing an image to a single individual's face, while recognition involves comparison with multiple faces.

For a face recognition system to be effective, the underlying face verification system must be highly accurate (roughly 99.9% or better), because the recognition system's accuracy is inherently lower: its errors compound over the larger number of comparisons.
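To see why, here is a rough back-of-the-envelope argument (assuming, as a simplification, independent errors per comparison): if the verifier errs with probability $\varepsilon$ on each one-to-one comparison, a recognition query against a database of $K$ people performs $K$ comparisons, so the chance of at least one error is about

$$1 - (1 - \varepsilon)^K \approx K\varepsilon \quad \text{for small } \varepsilon$$

With $\varepsilon = 0.1\%$ and $K = 100$, roughly $10\%$ of recognition queries would be affected.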

One Shot Learning

Traditional approaches that feed images into a Convolutional Neural Network (CNN) and use a softmax layer for identification are not effective in situations with limited data or frequent new entries. If someone new joins the team, it is infeasible to retrain the model just to add a node to the softmax layer. This leads to the challenge of 'one-shot learning'.

One Shot Learning refers to a system's ability to recognize a person from a single image. This is crucial in scenarios where only one image per individual is available. Traditional deep learning struggles with limited data, but one-shot learning overcomes this by using a similarity function:

$$d(\text{img1}, \text{img2}) = \text{degree of difference between the two images}$$

The goal is for $d$ to be low for images of the same face. We use a threshold $\tau$ to determine whether two faces are the same:

$$\text{If } d(\text{img1}, \text{img2}) \leq \tau \text{, the faces are considered identical.}$$

The similarity function is what solves the one-shot learning issue; most face recognition systems need it because you might have only one picture of each of your employees or team members in your database. It is also much more robust to new inputs.

As long as you can learn this function $d$, which takes a pair of images and tells you whether they show the same or different persons, then when someone new joins your team you can simply add them to your database, and it just works.
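Below is a minimal NumPy sketch of verification (one-to-one) and recognition (one-to-many) built on $d$ and the threshold $\tau$. The 128-dimensional random vectors, the names, and the value $\tau = 0.7$ are illustrative assumptions; in practice the encodings would come from a trained network.

```python
import numpy as np

def d(enc1, enc2):
    """Degree of difference: squared L2 distance between two face encodings."""
    return np.sum((enc1 - enc2) ** 2)

def verify(enc1, enc2, tau=0.7):
    """Face verification: same person iff d(img1, img2) <= tau."""
    return d(enc1, enc2) <= tau

# Hypothetical database with one encoding per person (the one-shot setting).
database = {
    "alice": np.random.randn(128),
    "bob": np.random.randn(128),
}

def who_is_it(query_enc, database, tau=0.7):
    """Face recognition: compare the query against every stored encoding."""
    best_name, best_dist = None, float("inf")
    for name, enc in database.items():
        dist = d(query_enc, enc)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= tau else None  # None: not in the database

# Adding a new team member is just one more dictionary entry; no retraining.
database["carol"] = np.random.randn(128)
```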

Siamese Network

A Siamese Network is used to implement the similarity function. In this architecture, two inputs pass through networks with identical architecture and shared parameters.


The networks encode the input images into vectors. Typically, they consist of convolutional layers, pooling layers, and fully connected layers, producing encodings $f(x^{(1)})$ and $f(x^{(2)})$.

If you believe these encodings are a good representation of the two images, you can define the distance $d$ on which the loss function is built:

$$d(x^{(1)}, x^{(2)}) = \| f(x^{(1)}) - f(x^{(2)}) \|^2$$

This is the squared norm of the difference of the two encodings. If $x^{(1)}$ and $x^{(2)}$ are the same person, we want $d(x^{(1)}, x^{(2)})$ to be low; if they are different people, we want it to be high.
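A minimal PyTorch sketch of the shared encoder follows; the layer sizes, the 96×96 input resolution, and the 128-dimensional embedding are illustrative assumptions, not the DeepFace architecture. The key point is that the same weights encode both images.

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """A small convolutional encoder f(x); both inputs share these weights."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 24 * 24, embedding_dim)  # assumes 96x96 inputs

    def forward(self, x):
        return self.fc(self.features(x).flatten(start_dim=1))

encoder = SiameseEncoder()
x1 = torch.randn(1, 3, 96, 96)  # image 1
x2 = torch.randn(1, 3, 96, 96)  # image 2
# d(x1, x2) = ||f(x1) - f(x2)||^2, with one encoder applied to both inputs
dist = torch.sum((encoder(x1) - encoder(x2)) ** 2, dim=1)
```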

Taigman et al., 2014. DeepFace: Closing the Gap to Human-Level Performance in Face Verification.

Triplet Loss

$$L(A, P, N) = \max(\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha, 0)$$

Triplet loss is one of the loss functions we can use to learn the similarity distance in a Siamese network. The learning objective is to make the distance between an anchor image and a positive image (same person) small, while making the distance to a negative image (different person) large.

The name 'triplet loss' comes from the fact that we compare an anchor $A$ with a positive $P$ and a negative $N$ image. Formally, we want the positive distance to be less than the negative distance:

$$d(A, P) \leq d(A, N)$$

$$\|f(A) - f(P)\|^2 \leq \|f(A) - f(N)\|^2$$

$$\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 \leq 0$$

One trivial way to satisfy this is for the network to learn an encoding that outputs the same vector (for example, all zeros) for every image, making every distance zero. To prevent the network from settling on this trivial solution, we require:

$$\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 \leq -\alpha$$

where $\alpha$ is a small positive number, sometimes called the margin. Then:

$$\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha \leq 0$$

If the term inside the max is at most 0, the objective is met and the loss is 0; otherwise, taking the max gives a positive loss. The final loss function, given three images $(A, P, N)$, is:

$$L(A, P, N) = \max(\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha, 0)$$

This is for a single training example. For all training examples:

$$J = \sum_{i=1}^{m} L(A^{(i)}, P^{(i)}, N^{(i)})$$
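A direct NumPy sketch of this cost; the margin value $\alpha = 0.2$ is an assumption (a small constant of the kind used in practice, for example in FaceNet), and the inputs are assumed to be already-computed encodings:

```python
import numpy as np

def triplet_cost(f_a, f_p, f_n, alpha=0.2):
    """J = sum_i max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0).

    f_a, f_p, f_n: (m, 128) arrays of anchor/positive/negative encodings.
    """
    pos_dist = np.sum((f_a - f_p) ** 2, axis=1)  # d(A, P) per triplet
    neg_dist = np.sum((f_a - f_n) ** 2, axis=1)  # d(A, N) per triplet
    losses = np.maximum(pos_dist - neg_dist + alpha, 0.0)
    return losses.sum()
```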

To train the system, you need a dataset with multiple pictures of the same person, from which you select triplets. For this to work properly, the dataset should be large enough.

Choosing the triplets $A$, $P$, $N$

During training, if $A$, $P$, $N$ are chosen randomly (subject to $A$ and $P$ being the same person while $A$ and $N$ are not), a problem immediately occurs: the constraint is easily satisfied.

$$d(A, P) + \alpha \leq d(A, N)$$

Given two randomly chosen pictures of people, chances are $A$ and $N$ are much more different than $A$ and $P$, and in this case the network won't learn much.

What we want is to choose triplets that are hard to train on, constructing a hard training set: examples where $d(A, P)$ and $d(A, N)$ are actually quite close. This makes training more efficient, because on easy random triplets the network would satisfy the constraint much of the time anyway and learn little. One way to obtain hard triplets is to use images with similar poses. Find more in the paper: Schroff et al., 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering.
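A NumPy sketch of one such mining strategy, picking 'semi-hard' negatives whose distance sits within the margin of the positive distance (in the spirit of the FaceNet paper; the batch, labels, and margin below are illustrative):

```python
import numpy as np

def pairwise_sq_dists(emb):
    """All-pairs squared L2 distances between row embeddings."""
    sq = np.sum(emb ** 2, axis=1)
    return np.maximum(sq[:, None] - 2.0 * emb @ emb.T + sq[None, :], 0.0)

def mine_semi_hard_triplets(emb, labels, alpha=0.2):
    """Keep triplets where d(A, N) is close to d(A, P): hard enough to be useful."""
    d2 = pairwise_sq_dists(emb)
    triplets = []
    for a in range(len(labels)):
        for p in np.where(labels == labels[a])[0]:
            if p == a:
                continue
            # Negatives farther than the positive, but still inside the margin.
            mask = (labels != labels[a]) & (d2[a] > d2[a, p]) & (d2[a] < d2[a, p] + alpha)
            triplets.extend((a, p, n) for n in np.where(mask)[0])
    return triplets

# Example: 8 embeddings covering 4 identities, 2 images each.
emb = np.random.randn(8, 128)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(len(mine_semi_hard_triplets(emb, labels)))
```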

Commercial recognition systems are trained on very large datasets, on the order of 10 to 100 million images. However, many pre-trained models and parameters for face recognition are available online, and given the sheer data volume required, transfer learning is a good approach here.

Face Verification and Binary Classification

An alternative to triplet loss for learning a convolutional network's parameters is to frame the problem as binary classification.


In this approach, the embeddings of two images are fed into a logistic regression unit ending in a sigmoid. For 128-dimensional encodings, for example:

$$\hat{y} = \sigma \left( \sum_{k=1}^{128} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right| + b \right)$$

Here, $|f(x^{(i)})_k - f(x^{(j)})_k|$ is an elementwise distance, akin to the Manhattan distance, between the embeddings of the two images. Other distances that can be used here are the Euclidean distance and the chi-squared distance:

$$\chi^2(x^{(i)}, x^{(j)}) = \sum_{k=1}^{128} \frac{(f(x^{(i)})_k - f(x^{(j)})_k)^2}{f(x^{(i)})_k + f(x^{(j)})_k}$$
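A NumPy sketch of this classification head; the random encodings and the untrained parameters $w$, $b$ are placeholders for values learned by logistic regression on labeled same/different pairs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_i, f_j, w, b):
    """y_hat = sigmoid(sum_k w_k * |f(x_i)_k - f(x_j)_k| + b)."""
    features = np.abs(f_i - f_j)  # elementwise (Manhattan-style) distances
    return sigmoid(w @ features + b)

def chi_squared_features(f_i, f_j, eps=1e-8):
    """Chi-squared variant; assumes (near-)non-negative encodings."""
    return (f_i - f_j) ** 2 / (f_i + f_j + eps)

# Illustrative 128-dimensional non-negative encodings, untrained parameters.
f_i, f_j = np.random.rand(128), np.random.rand(128)
w, b = np.zeros(128), 0.0
print(same_person_probability(f_i, f_j, w, b))  # 0.5 with untrained parameters
```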

Taigman et al., 2014. DeepFace: Closing the Gap to Human-Level Performance in Face Verification.

For efficient deployment, pre-compute the embeddings of the comparison images: when a new image is presented, only its embedding needs to be computed and compared against the stored ones. This also means the raw images need not be stored.
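A small sketch of that deployment pattern; `encode` below is a deterministic stand-in for the trained encoder $f(x)$, and the employee images are placeholders:

```python
import numpy as np

def encode(image):
    """Stand-in for the trained encoder f(x); returns a 128-d embedding."""
    seed = abs(hash(image.tobytes())) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(128)

# Offline: compute and store one embedding per person; discard raw images.
employee_images = {"alice": np.zeros((96, 96, 3)), "bob": np.ones((96, 96, 3))}
precomputed = {name: encode(img) for name, img in employee_images.items()}

# Online: encode only the new image and compare against stored vectors.
def verify(new_image, claimed_name, tau=0.7):
    dist = np.sum((encode(new_image) - precomputed[claimed_name]) ** 2)
    return dist <= tau

print(verify(np.zeros((96, 96, 3)), "alice"))  # True: identical image, d = 0
```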

Both triplet loss and binary classification are effective approaches to face recognition with deep learning. Available implementations include:

  1. OpenFace
  2. FaceNet
  3. DeepFace