Artificial Intelligence 🤖
The Confusion Matrix


A test for a rare disease can be 99.9% accurate just by guessing "no" every time, i.e., by always predicting that you don't have it. A model that does that would look great on paper, with very high accuracy, but in reality it's worse than useless.
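To see this in action, here's a minimal sketch in plain Python with made-up numbers (1 sick patient out of 1,000 cases) and a model that always predicts "no":

```python
# Hypothetical rare-disease data: 1 sick patient out of 1,000 cases.
y_true = [1] + [0] * 999      # actual labels (1 = has disease)
y_pred = [0] * 1000           # model that always predicts "no disease"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)               # 0.999 -- looks great on paper
print(sum(y_pred))            # 0 -- it never catches a single sick patient
```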

To judge a model, we need to understand its true positives and true negatives, as well as its false positives and false negatives, in the context of what you're trying to accomplish. A confusion matrix shows all four. There's no real convention for how it's ordered; sometimes you'll see predictions on the top, sometimes on the side.

|               | Actual YES      | Actual NO       |
|---------------|-----------------|-----------------|
| Predicted YES | TRUE POSITIVES  | FALSE POSITIVES |
| Predicted NO  | FALSE NEGATIVES | TRUE NEGATIVES  |

The diagonal of your confusion matrix is where most of your results should land; that's where accuracy lives.

|               | Actual YES | Actual NO |
|---------------|------------|-----------|
| Predicted YES | 50         | 5         |
| Predicted NO  | 10         | 100       |
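Here's a minimal sketch of building that kind of matrix with scikit-learn (an assumed library choice; these notes don't name one). Watch out: scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the tables above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # actual labels (1 = YES, 0 = NO)
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]   # model predictions

# labels=[1, 0] lists the positive class first; rows = actual, columns = predicted
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[TP  FN]
#  [FP  TN]]
```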

Sometimes you'll see confusion matrices in a format that also totals up each row and column:

|               | Actual YES | Actual NO | Total |
|---------------|------------|-----------|-------|
| Predicted YES | 50         | 5         | 55    |
| Predicted NO  | 10         | 100       | 110   |
| Total         | 60         | 105       | 165   |
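From this version, accuracy falls right out: the diagonal divided by the grand total. A worked step with the numbers above:

\text{Accuracy} = \frac{50 + 100}{165} \approx 0.91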

Metrics Derived from Confusion Matrices

Think about the difference between precision and recall (using cat classification as an example). Suppose we run the classifier on 10 images: 5 cats and 5 non-cats. The classifier predicts that 4 of the images are cats, but one of those predictions is wrong (it's actually a non-cat). The confusion matrix:

|                | Predicted Cat | Predicted Non-Cat |
|----------------|---------------|-------------------|
| Actual Cat     | 3             | 2                 |
| Actual Non-Cat | 1             | 4                 |

Recall

Out of all the actual cats, how many did we manage to predict? In other words, the percentage of real cats that were correctly recognized:

R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

So for this case:

R = \frac{3}{3 + 2}

How many of all the cats did I manage to catch? This is also called "sensitivity", "true positive rate", or "completeness". It is the percentage of actual positives correctly predicted. It's a good choice of metric when you care a lot about false negatives, e.g., fraud detection.

Precision

Out of everything we predicted to be a cat, how many actually were cats? In other words, the percentage of true cats among the recognized results:

P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

So for this case,

P = \frac{3}{3 + 1}

Of the things I predicted to be cats, how many actually were cats? This is the rate of "correct positives", the percentage of relevant results. It's a good choice of metric when you care a lot about false positives, e.g., medical screening or drug testing: you don't want to tell somebody they tested positive for cocaine when they didn't.
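Here's a minimal sketch (again assuming scikit-learn) that rebuilds the cat example as label arrays and checks both metrics:

```python
from sklearn.metrics import precision_score, recall_score

# The cat example: 5 actual cats, 5 actual non-cats (1 = cat, 0 = non-cat)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Predictions: 3 cats found, 2 cats missed, 1 non-cat wrongly called a cat
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

print(recall_score(y_true, y_pred))     # 3 / (3 + 2) = 0.6
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```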

Specificity (True Negative Rate)

\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}
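For the cat example (4 true negatives, 1 false positive), a worked step:

\text{Specificity} = \frac{4}{4 + 1} = 0.8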

F1 score

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

This is a common single-number evaluation metric in ML. Mathematically, it is the harmonic mean of precision and sensitivity (recall), and it's used when you care about precision AND recall.
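Plugging the cat example's precision (0.75) and recall (0.6) into the formula as a worked step:

F_1 = \frac{2 \times 0.75 \times 0.6}{0.75 + 0.6} \approx 0.67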

RMSE

  • Root mean squared error, exactly what it sounds like; a sketch of the calculation follows this list
  • Accuracy measurement
  • Only cares about right & wrong answers
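A minimal sketch of the calculation, using plain NumPy and made-up regression values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (hypothetical)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions

# Square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)   # ~0.94
```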

ROC Curve

  • Receiver Operating Characteristic Curve
  • Plot of true positive rate (recall) vs. false positive rate at various threshold settings. Here, the false positive rate (FPR) is:
\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}
  • Points above the diagonal represent good classification (better than random)
  • Ideal curve would just be a point in the upper-left corner
  • The more it's "bent" toward the upper-left, the better

(Figure: ROC curve)
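To see where the points on the curve come from, here's a minimal sketch (assuming scikit-learn and made-up prediction scores) that sweeps the thresholds and prints each (FPR, TPR) pair that would be plotted:

```python
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # model's predicted probabilities

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```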

AUC

  • The area under the ROC curve is called the Area Under the Curve (AUC); a sketch follows this list
  • Equal to probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
  • ROC AUC of 0.5 is a useless classifier, 1.0 is perfect
  • Commonly used metric for comparing classifiers
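A minimal sketch of computing AUC directly (assuming scikit-learn, with the same made-up scores as the ROC sketch above):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # predicted probabilities

# 0.5 would mean a useless classifier, 1.0 a perfect one
print(roc_auc_score(y_true, y_score))                 # 0.875 for this toy data
```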

Confusion Matrix Formats

You can also see confusion matrices presented in other formats, such as heat maps.

Multi-Class

For a multi-class confusion matrix shown as a heat map:
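Here's a minimal sketch of one way to build that kind of heat map (assuming scikit-learn and matplotlib, with hypothetical classes):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = ["cat", "dog", "bird"]                       # hypothetical classes
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat", "cat", "bird"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows = actual, cols = predicted

plt.imshow(cm, cmap="Blues")                          # heat map: darker = more predictions
plt.xticks(range(len(labels)), labels)
plt.yticks(range(len(labels)), labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.colorbar()
plt.show()
```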

AWS Confusion Matrix

AWS has its own particular style of confusion matrix.

(Figure: AWS predicted-genre confusion matrix example)

It's again in the heat-map style we just talked about. This example comes from the AWS Machine Learning service documentation. It shows:

  • No. of correct and incorrect predictions per class (can be inferred from colors of each cell)
  • F1 scores per class (computed in the sketch after this list)
  • True class frequencies: the "total" column
  • Predicted class frequencies: the "total" row
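Per-class F1 scores like the ones in the AWS report can be computed with scikit-learn (an assumed tool choice, with hypothetical genres) by passing average=None:

```python
from sklearn.metrics import f1_score

labels = ["comedy", "drama", "horror"]   # hypothetical genres
y_true = ["comedy", "comedy", "drama", "drama", "horror", "horror", "comedy", "drama"]
y_pred = ["comedy", "drama", "drama", "drama", "horror", "comedy", "comedy", "horror"]

# average=None returns one F1 score per class, in the order given by `labels`
print(f1_score(y_true, y_pred, labels=labels, average=None))
```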