The Confusion Matrix
A test for a rare disease can be 99.9% accurate just by guessing "no" every time. A model that does that would look great on paper, with very high accuracy, but in reality it's worse than useless.
We need to understand true positives and true negatives, as well as false positives and false negatives, in the context of what we're trying to accomplish. A confusion matrix shows exactly this. There's no universal convention for how it's laid out; sometimes you'll see predictions along the top, sometimes along the side.
| | Actual YES | Actual NO |
|---|---|---|
| Predicted YES | TRUE POSITIVES | FALSE POSITIVES |
| Predicted NO | FALSE NEGATIVES | TRUE NEGATIVES |
The diagonal of the confusion matrix, true positives and true negatives, is where most of your results should land. This is where accuracy lives.
| | Actual YES | Actual NO |
|---|---|---|
| Predicted YES | 50 | 5 |
| Predicted NO | 10 | 100 |
Sometimes you'll see confusion matrices in a format that also adds up the totals for each row and column:
| | Actual YES | Actual NO | Total |
|---|---|---|---|
| Predicted YES | 50 | 5 | 55 |
| Predicted NO | 10 | 100 | 110 |
| Total | 60 | 105 | 165 |
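To tie this back to the rare-disease example at the top, here's a minimal sketch, assuming scikit-learn and NumPy are available and using made-up data, of what an "always predict no" model looks like in a confusion matrix. Note that scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the tables above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical rare-disease data: 1 sick patient out of 1,000.
y_actual = np.array([1] + [0] * 999)       # 1 = has the disease, 0 = healthy
y_predicted = np.zeros(1000, dtype=int)    # a "model" that always guesses "no"

print(accuracy_score(y_actual, y_predicted))  # 0.999 -- looks great on paper

# Rows = actual, columns = predicted; labels=[1, 0] lists the positive class
# first so the diagonal holds TP and TN, like the tables above.
print(confusion_matrix(y_actual, y_predicted, labels=[1, 0]))
# [[  0   1]   <- 0 true positives, 1 false negative: it never catches the disease
#  [  0 999]]  <- 0 false positives, 999 true negatives
```

Despite the 99.9% accuracy, the confusion matrix makes it obvious the model never finds a single positive case.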
Metrics Derived from Confusion Matrices
Think about the difference between precision and recall, using cat classification as an example. Suppose we run the classifier on 10 images: 5 cats and 5 non-cats. The classifier predicts that 4 of the images are cats, but 1 of those 4 predictions is wrong (it's actually a non-cat). The confusion matrix:
| | Predicted Cat | Predicted Non-Cat |
|---|---|---|
| Actual Cat | 3 | 2 |
| Actual Non-Cat | 1 | 4 |
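Here's a small sketch showing how this matrix could be produced with scikit-learn; the image labels are just made up to match the numbers above, and scikit-learn's actual-on-rows layout happens to match this table.

```python
from sklearn.metrics import confusion_matrix

# 10 images: 5 actual cats and 5 actual non-cats (hypothetical example data).
actual    = ["cat"] * 5 + ["non-cat"] * 5
# The classifier calls 4 images "cat": 3 real cats plus 1 non-cat it got wrong.
predicted = ["cat", "cat", "cat", "non-cat", "non-cat",
             "cat", "non-cat", "non-cat", "non-cat", "non-cat"]

# Rows = actual, columns = predicted; labels fixes the row/column ordering.
print(confusion_matrix(actual, predicted, labels=["cat", "non-cat"]))
# [[3 2]   <- actual cats:     3 predicted cat, 2 predicted non-cat
#  [1 4]]  <- actual non-cats: 1 predicted cat, 4 predicted non-cat
```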
Recall
Out of all the actual cats, how many did we manage to catch? Recall is the percentage of actual cats that were correctly recognized:

Recall = TP / (TP + FN)

So for this case:

Recall = 3 / (3 + 2) = 60%

Recall is also known as "sensitivity", "true positive rate", or "completeness": the percent of actual positives rightly predicted. It is a good choice of metric when you care a lot about false negatives, e.g., fraud detection.
Precision
Out of everything we predicted was a cat, how many were actually cats? Precision is the percentage of true cats among the predicted cats:

Precision = TP / (TP + FP)

So for this case:

Precision = 3 / (3 + 1) = 75%

Precision is sometimes described as "correct positives": the percent of relevant results. It is a good choice of metric when you care a lot about false positives, e.g., medical screening or drug testing; you don't want to tell somebody they tested positive for a drug when they didn't.
Specificity (True Negative Rate)
Out of all the actual negatives, how many did we correctly identify as negative?

Specificity = TN / (TN + FP)
F1 score
This is a common single-number evaluation metric in ML. Mathematically, it is the harmonic mean of precision and recall (sensitivity), used when you care about precision AND recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)
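Continuing the same hypothetical cat example, a quick sketch of computing these metrics with scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Same hypothetical cat example as above: TP = 3, FP = 1, FN = 2, TN = 4.
actual    = ["cat"] * 5 + ["non-cat"] * 5
predicted = ["cat", "cat", "cat", "non-cat", "non-cat",
             "cat", "non-cat", "non-cat", "non-cat", "non-cat"]

recall    = recall_score(actual, predicted, pos_label="cat")     # TP / (TP + FN) = 3/5 = 0.60
precision = precision_score(actual, predicted, pos_label="cat")  # TP / (TP + FP) = 3/4 = 0.75
f1        = f1_score(actual, predicted, pos_label="cat")         # harmonic mean ~= 0.67

print(f"recall={recall:.2f}  precision={precision:.2f}  F1={f1:.2f}")
```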
RMSE
- Root mean squared error, exactly what it sounds like: sqrt(mean((predicted - actual)^2)); see the sketch after this list
- Accuracy measurement
- Only cares about right & wrong answers
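RMSE is simple enough to compute directly with NumPy; a tiny sketch with made-up numbers:

```python
import numpy as np

# Hypothetical regression output: actual values vs. model predictions.
y_actual    = np.array([3.0, 5.0, 2.5, 7.0])
y_predicted = np.array([2.5, 5.0, 3.0, 8.0])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((y_predicted - y_actual) ** 2))
print(rmse)  # ~0.61
```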
ROC Curve
- Receiver Operating Characteristic Curve
- Plot of true positive rate (recall) vs. false positive rate at various threshold settings (see the sketch after this list). Here, the false positive rate is FPR = FP / (FP + TN)
- Points above the diagonal represent good classification (better than random)
- Ideal curve would just be a point in the upper-left corner
- The more it's "bent" toward the upper-left, the better
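A minimal sketch, with made-up labels and scores, of plotting a ROC curve using scikit-learn and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Hypothetical true labels and the classifier's scores for the positive class.
y_actual = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7])

# roc_curve sweeps the decision threshold and returns the FPR and TPR at each one.
fpr, tpr, thresholds = roc_curve(y_actual, y_scores)

plt.plot(fpr, tpr, label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")  # the diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()
```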
AUC
- The area under the ROC curve is called the AUC (Area Under the Curve); see the one-liner after this list
- Equal to probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
- ROC AUC of 0.5 is a useless classifier, 1.0 is perfect
- Commonly used metric for comparing classifiers
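And AUC itself is a single call, reusing the made-up data from the ROC sketch above:

```python
from sklearn.metrics import roc_auc_score

# Same hypothetical labels and scores as the ROC sketch above.
auc = roc_auc_score([0, 0, 1, 1, 0, 1, 0, 1],
                    [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7])
print(auc)  # 0.5 would be useless, 1.0 would be perfect
```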
Confusion Matrix Formats
You can also see confusion matrices presented in other formats, such as heat maps, where the color of each cell reflects the count it contains.
Multi-Class
Confusion matrices generalize to multi-class problems as well, and are often shown as heat maps:
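A multi-class confusion matrix heat map can be drawn in a few lines; here's a minimal sketch with hypothetical 3-class data, where darker cells mean more predictions landed there:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Hypothetical 3-class labels and predictions.
actual    = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
predicted = [0, 0, 1, 1, 1, 2, 1, 2, 2, 0]

cm = confusion_matrix(actual, predicted)  # rows = actual class, columns = predicted class

plt.imshow(cm, cmap="Blues")  # darker cells = higher counts
plt.colorbar()
plt.xlabel("Predicted class")
plt.ylabel("Actual class")
plt.title("Multi-class confusion matrix")
plt.show()
```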
AWS Confusion Matrix
This is a particular style of confusion matrix, again in the heat map format just described. This example comes from the AWS Machine Learning service documentation. It shows:
- No. of correct and incorrect predictions per class (can be inferred from colors of each cell)
- F1 scores per class
- True class frequencies: the "total" column
- Predicted class frequencies: the "total" row