Artificial Intelligence 🤖
The Confusion Matrix


A test for a rare disease can be 99.9% accurate just by guessing "no" every time, i.e., by always predicting that you don't have it. A model that does that would look great on paper, with very high accuracy, but in reality it's worse than useless.
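To see this in action, here's a minimal sketch in plain Python with made-up numbers (1 sick patient out of 1,000 cases) and a model that always predicts "no":

```python
# Hypothetical rare-disease data: 1 sick patient out of 1,000 cases.
y_true = [1] + [0] * 999      # actual labels (1 = has disease)
y_pred = [0] * 1000           # model that always predicts "no disease"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)               # 0.999 -- looks great on paper
print(sum(y_pred))            # 0 -- it never catches a single sick patient
```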

To judge a model, we need to understand its true positives and true negatives, as well as its false positives and false negatives, in the context of what you're trying to accomplish. A confusion matrix shows all four. There's no real convention for how it's ordered; sometimes you'll see predictions on the top, sometimes on the side.

|               | Actual YES      | Actual NO       |
|---------------|-----------------|-----------------|
| Predicted YES | TRUE POSITIVES  | FALSE POSITIVES |
| Predicted NO  | FALSE NEGATIVES | TRUE NEGATIVES  |

The diagonal of your confusion matrix is where most of your results should land; that's where accuracy lives.

|               | Actual YES | Actual NO |
|---------------|------------|-----------|
| Predicted YES | 50         | 5         |
| Predicted NO  | 10         | 100       |
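Here's a minimal sketch of building that kind of matrix with scikit-learn (an assumed library choice; these notes don't name one). Watch out: scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the tables above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # actual labels (1 = YES, 0 = NO)
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]   # model predictions

# labels=[1, 0] lists the positive class first; rows = actual, columns = predicted
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[TP  FN]
#  [FP  TN]]
```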

Sometimes you'll see confusion matrices in a format that also totals up each row and column:

|               | Actual YES | Actual NO | Total |
|---------------|------------|-----------|-------|
| Predicted YES | 50         | 5         | 55    |
| Predicted NO  | 10         | 100       | 110   |
| Total         | 60         | 105       | 165   |
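From this version, accuracy falls right out: the diagonal divided by the grand total. A worked step with the numbers above:

\text{Accuracy} = \frac{50 + 100}{165} \approx 0.91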

Metrics Derived from Confusion Matrices

Think about the difference between precision and recall (using cat classification as an example). Suppose we run the classifier on 10 images: 5 cats and 5 non-cats. The classifier predicts that 4 of the images are cats, but one of those predictions is wrong (it's actually a non-cat). The confusion matrix:

|                | Predicted Cat | Predicted Non-Cat |
|----------------|---------------|-------------------|
| Actual Cat     | 3             | 2                 |
| Actual Non-Cat | 1             | 4                 |

Recall

Out of all the actual cats, how many did we manage to predict? In other words, the percentage of real cats that were correctly recognized:

R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

So for this case:

R = \frac{3}{3 + 2}

How many of all the cats did I manage to catch? This is also called "sensitivity", "true positive rate", or "completeness". It is the percentage of actual positives correctly predicted. It's a good choice of metric when you care a lot about false negatives, e.g., fraud detection.

Precision

Out of everything we predicted to be a cat, how many actually were cats? In other words, the percentage of true cats among the recognized results:

P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

So for this case,

P = \frac{3}{3 + 1}

Of the things I predicted to be cats, how many actually were cats? This is the rate of "correct positives", the percentage of relevant results. It's a good choice of metric when you care a lot about false positives, e.g., medical screening or drug testing: you don't want to tell somebody they tested positive for cocaine when they didn't.
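Here's a minimal sketch (again assuming scikit-learn) that rebuilds the cat example as label arrays and checks both metrics:

```python
from sklearn.metrics import precision_score, recall_score

# The cat example: 5 actual cats, 5 actual non-cats (1 = cat, 0 = non-cat)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Predictions: 3 cats found, 2 cats missed, 1 non-cat wrongly called a cat
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

print(recall_score(y_true, y_pred))     # 3 / (3 + 2) = 0.6
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```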

Specificity (True Negative Rate)

\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}
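For the cat example (4 true negatives, 1 false positive), a worked step:

\text{Specificity} = \frac{4}{4 + 1} = 0.8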

F1 score

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

This is a common single-number evaluation metric in ML. Mathematically, it is the harmonic mean of precision and sensitivity (recall), and it's used when you care about precision AND recall.
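Plugging the cat example's precision (0.75) and recall (0.6) into the formula as a worked step:

F_1 = \frac{2 \times 0.75 \times 0.6}{0.75 + 0.6} \approx 0.67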

RMSE

  • Root mean squared error, exactly what it sounds like; a sketch of the calculation follows this list
  • Accuracy measurement
  • Only cares about right & wrong answers
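A minimal sketch of the calculation, using plain NumPy and made-up regression values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (hypothetical)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions

# Square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)   # ~0.94
```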

ROC Curve

  • Receiver Operating Characteristic Curve
  • Plot of true positive rate (recall) vs. false positive rate at various threshold settings. Here, the false positive rate (FPR) is:
\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}
  • Points above the diagonal represent good classification (better than random)
  • Ideal curve would just be a point in the upper-left corner
  • The more it's "bent" toward the upper-left, the better

(Figure: ROC curve)
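To see where the points on the curve come from, here's a minimal sketch (assuming scikit-learn and made-up prediction scores) that sweeps the thresholds and prints each (FPR, TPR) pair that would be plotted:

```python
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # model's predicted probabilities

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```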

AUC

  • The area under the ROC curve is called the Area Under the Curve (AUC); a sketch follows this list
  • Equal to probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
  • ROC AUC of 0.5 is a useless classifier, 1.0 is perfect
  • Commonly used metric for comparing classifiers
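A minimal sketch of computing AUC directly (assuming scikit-learn, with the same made-up scores as the ROC sketch above):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # predicted probabilities

# 0.5 would mean a useless classifier, 1.0 a perfect one
print(roc_auc_score(y_true, y_score))                 # 0.875 for this toy data
```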

Confusion Matrix Formats

You can also see confusion matrices presented in other formats, such as heat maps.

Multi-Class

For a multi-class confusion matrix shown as a heat map:
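Here's a minimal sketch of one way to build that kind of heat map (assuming scikit-learn and matplotlib, with hypothetical classes):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = ["cat", "dog", "bird"]                       # hypothetical classes
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat", "cat", "bird"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows = actual, cols = predicted

plt.imshow(cm, cmap="Blues")                          # heat map: darker = more predictions
plt.xticks(range(len(labels)), labels)
plt.yticks(range(len(labels)), labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.colorbar()
plt.show()
```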

AWS Confusion Matrix

AWS has its own particular style of confusion matrix.

(Figure: AWS predicted-genre confusion matrix example)

It's again in the heat-map style we just talked about. This example comes from the AWS Machine Learning service documentation. It shows:

  • No. of correct and incorrect predictions per class (can be inferred from colors of each cell)
  • F1 scores per class (computed in the sketch after this list)
  • True class frequencies: the "total" column
  • Predicted class frequencies: the "total" row
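Per-class F1 scores like the ones in the AWS report can be computed with scikit-learn (an assumed tool choice, with hypothetical genres) by passing average=None:

```python
from sklearn.metrics import f1_score

labels = ["comedy", "drama", "horror"]   # hypothetical genres
y_true = ["comedy", "comedy", "drama", "drama", "horror", "horror", "comedy", "drama"]
y_pred = ["comedy", "drama", "drama", "drama", "horror", "comedy", "comedy", "horror"]

# average=None returns one F1 score per class, in the order given by `labels`
print(f1_score(y_true, y_pred, labels=labels, average=None))
```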