
Error Analysis and Data Mismatch

ℹ️

Error analysis: The process of manually examining mistakes that your algorithm is making. It can give you insights into what to do next.

In the cat classification example, let's say you have 10% error on your dev set and want to decrease it. You discover that some of the misclassified examples are dog pictures that look like cats. Should you try to make your cat classifier do better on dogs (this could take weeks)?

Error analysis approach (calculate the ceiling on performance improvement):

  1. Get about 100 misclassified dev set examples at random and count up how many are dogs.
  2. If 5 of the 100 are dogs, then doing better on dogs would reduce your error by at most 5/100 × 10% = 0.5% (the "ceiling"), which may be too little to justify weeks of work.
  3. If 50 of the 100 are dogs, then you could reduce your error by up to 50/100 × 10% = 5%, which is significant, so working on the dog problem is reasonable.

Based on this example, error analysis lets you estimate the payoff before committing to work that could take a lot of time for little gain. You can also evaluate multiple error categories in parallel and choose the most promising one: create a spreadsheet, tag each misclassified example with the categories that apply, and total them up, e.g.:

| Image    | Dog | Great Cats | Blurry | IG Filters | Comments         |
| -------- | --- | ---------- | ------ | ---------- | ---------------- |
| 1        | ✅  |            |        | ✅         | Pitbull          |
| 2        |     | ✅         | ✅     | ✅         |                  |
| 3        |     |            |        |            | Rainy Day at Zoo |
| 4        |     | ✅         |        |            |                  |
| ...      |     |            |        |            |                  |
| % totals | 8%  | 43%        | 61%    | 12%        |                  |

Given these totals, you would decide to work on great cats or blurry images to improve performance. This error analysis gives you a sense of what to pursue: a quick counting procedure that usually takes a few hours at most can really improve your prioritization decisions and show how promising the different approaches are.
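As a minimal sketch of this counting procedure (the tags and entries below are hypothetical, standing in for a hand-labeled sample of misclassified dev examples), the per-category percentages and improvement ceilings can be computed like this:

```python
# Minimal sketch: tally hand-assigned error categories for a random sample
# of misclassified dev examples and compute each category's improvement ceiling.
from collections import Counter

overall_dev_error = 0.10      # 10% dev set error
sampled_errors = [            # hypothetical tags for the examined examples
    {"dog", "ig_filter"},     # image 1: a Pitbull with an Instagram filter
    {"great_cat", "blurry"},  # image 2
    {"blurry"},               # image 3: rainy day at the zoo
    # ... one entry per examined example (ideally ~100)
]

counts = Counter(tag for tags in sampled_errors for tag in tags)
n = len(sampled_errors)
for tag, count in counts.most_common():
    fraction = count / n
    ceiling = fraction * overall_dev_error  # best-case absolute error reduction
    print(f"{tag:10s} {fraction:6.1%} of errors, ceiling = {ceiling:.2%}")
```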

Cleaning up incorrectly labeled data

DL algorithms are quite robust to random errors in the training set, but much less robust to systematic errors (e.g., labelers consistently marking white dogs as cats). It's still fine to go and fix random label errors if it isn't too costly. If you want to check for incorrectly labeled data in the dev/test set, run the same error analysis with an extra "Mislabeled" column:

| Image    | Dog | Great Cats | Blurry | Mislabeled | Comments         |
| -------- | --- | ---------- | ------ | ---------- | ---------------- |
| 1        | ✅  |            |        | Y          | Pitbull          |
| 2        |     | ✅         | ✅     |            |                  |
| 3        |     |            |        | Y          | Rainy Day at Zoo |
| 4        |     | ✅         |        |            |                  |
| ...      |     |            |        |            |                  |
| % totals | 8%  | 43%        | 61%    | 6%         |                  |

Then, carry out some simple calculations:

  • If the overall dev set error is 10% and 6% of the examined errors were due to incorrect labels:
    • Errors due to incorrectly labeled data: 6% × 10% = 0.6%
    • Errors due to other causes: 10% − 0.6% = 9.4%
  • In that case you should focus on the 9.4% of other errors rather than on the incorrect labels.

The advice is, if it makes a significant difference to your ability to evaluate algorithms on your dev set, then go ahead and spend the time to fix incorrect labels. But if it doesn't make a significant difference to your ability to use the dev set to evaluate classifiers, then it might not be the best use of your time.

Consider these guidelines while correcting the dev/test mislabeled examples:

  • Apply the same correction process to your dev and test sets so that they continue to come from the same distribution; otherwise you introduce a mismatch between them.
    • It's super important that your dev and test sets come from the same distribution. If your training set ends up coming from a slightly different distribution, that's often acceptable.
    • After correcting only the dev/test labels, the training data and the dev/test data may come from slightly different distributions; learning algorithms are quite robust to this.
  • Consider examining examples your algorithm got right as well as ones it got wrong (this is harder and is often skipped once accuracy is good).

Build your first system quickly, then iterate

The steps to take when starting a deep learning project:

  • Set up the dev/test sets and the evaluation metric
  • Build an initial system quickly
  • Use bias/variance analysis and error analysis to prioritize the next steps

The Data Mismatch Problem

In deep learning, we often find that the data we use to train models (training sets) doesn't quite match the data we use to test them (dev/test sets). This mismatch can cause problems, but there are ways to deal with it.

Different Data Distributions

Sometimes, the vast amount of data needed for deep learning means our training set comes from a different distribution than our dev/test sets. There are a few strategies to follow when that happens:

  1. Option one (not recommended): shuffle all the data. Mix everything together and then split it into new training and dev/test sets.
    • Pros: all splits come from the same distribution.
    • Cons: the new dev/test sets no longer represent the real-world scenario well. Your target is now wrong - the dev/test sets optimize for a different distribution than the one you actually care about.
  2. Option two: keep dev/test on the target distribution and move the rest into training. Use part of the target-distribution data for the dev/test sets, and put the remainder (together with all the other data) into the training set, as in the sketch after this list.
    • Pros: the dev/test sets have the same distribution as the algorithm's real use case, so you are aiming at the right target, and performance on that distribution tends to improve over time.
    • Cons: the training set no longer matches the dev/test sets, but this is usually worth it in the long run.
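A minimal sketch of option two, with hypothetical dataset sizes (plenty of easy-to-collect web images, a smaller set of mobile-app images matching the distribution you actually care about):

```python
# Minimal sketch of option two. The sizes are illustrative: 200,000 web-crawled
# examples and 10,000 examples from the target distribution (e.g. a mobile app).
import numpy as np

rng = np.random.default_rng(0)

web_data = np.arange(200_000)        # stand-ins for web-crawled examples
app_data = rng.permutation(10_000)   # stand-ins for target-distribution examples, shuffled

# Dev and test sets come ONLY from the target distribution.
dev_set  = app_data[:2_500]
test_set = app_data[2_500:5_000]

# Training set = all the web data + the remaining half of the app data.
train_set = np.concatenate([web_data, app_data[5_000:]])

print(len(train_set), len(dev_set), len(test_set))  # 205000 2500 2500
```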

When Datasets Don't Match Up

If your training and dev/test sets are from different distributions, it complicates things. Take a cat classifier:

  • Human error: 0%
  • Training error: 1%
  • Dev error: 10%

You might think this is a variance issue, but the different data distributions make it unclear: it could be that the training set was easy to fit while the dev set is simply harder. This is where a "train-dev" set comes in: a random subset of the training data that you hold out and do not train on (so it has the same distribution as the training set):

  • Human error: 0%
  • Training error: 1%
  • Train-dev error: 9%
  • Dev error: 10%

Now, it's clear there's a variance problem. But, if you find:

  • Human error: 0%
  • Training error: 1%
  • Train-dev error: 1.5%
  • Dev error: 10%

This indicates a data mismatch problem, meaning the model isn't generalizing well to the real-world data it needs to handle. Here, your algorithm has learned to do well on a different distribution than what you really care about.
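A minimal sketch of carving out such a train-dev set (the arrays and the error measurements are placeholders for whatever data and model you actually use):

```python
# Minimal sketch: hold out a random slice of the training data as a "train-dev"
# set. It shares the training distribution but is NOT used for training.
import numpy as np

def split_out_train_dev(X, y, frac=0.05, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_hold = int(frac * len(X))
    hold, keep = idx[:n_hold], idx[n_hold:]
    return (X[keep], y[keep]), (X[hold], y[hold])

# Hypothetical usage, with error() standing in for your evaluation code:
#   (X_tr, y_tr), (X_td, y_td) = split_out_train_dev(X_train, y_train)
#   train the model on (X_tr, y_tr) only, then compare:
#     error(train)     vs error(train_dev) -> variance
#     error(train_dev) vs error(dev)       -> data mismatch
```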

It is important that your dev and test sets have the closest possible distribution to the "real" data. It also helps for the training set to contain enough "real" data to limit the data-mismatch problem. Finally, weigh how accessible such additional data is against the potential improvement from training on it.

Addressing Data Mismatch

Dealing with data mismatch isn't straightforward, but here are a couple of strategies:

  1. Manual error analysis: examine examples from the training set and the dev/test sets to understand how the two distributions differ.
    • Do error analysis, or simply look at the training and dev data, to gain insight into how the distributions differ, then look for ways to get more training data that resembles your dev set.
  2. Make the training data more similar: modify or augment the training data so it resembles the dev/test data more closely.
    • E.g., if in-car background noise is a big difference, add car noise to the training audio; if the system struggles with street numbers, add more examples with street numbers to the training data.

If your goal is to make the training data more similar to your dev set, one technique you can use is artificial data synthesis: creating new training data that bridges the gap (see the sketch after the list below). But be careful: if your synthetic data covers too narrow a slice of the real distribution, your model might not generalize well to real-world data. Always check whether the synthesized data represents the variety you need.

  • Combine some of your training data with something that can convert it to the dev/test set distribution. For example:
    • Combine normal audio with car noise to get audio with car noise example.
    • Generate cars using 3D graphics in a car classification example.
  • Be cautious: you might accidentally be simulating data from only a tiny subset of the space of all possible examples (e.g., one particular car-noise recording, or a handful of 3D car designs), and your NN might overfit to that generated data.
    • You can end up creating a very impoverished synthesized dataset drawn from a much smaller subset of the space without realizing it.
    • To human ears/eyes, the repetition might not even be noticeable.
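A minimal sketch of the car-noise example (the sample rate, signal-to-noise ratio, and arrays below are assumptions standing in for real recordings). The key point is drawing from many different noise clips at random offsets instead of reusing a single clip:

```python
# Minimal sketch: synthesize "in-car" audio by mixing clean speech with
# randomly chosen, randomly offset car-noise clips.
import numpy as np

def synthesize(clean, noise_clips, snr_db=10.0, rng=None):
    """Mix one clean utterance with a random noise clip at a given SNR.
    Assumes each noise clip is at least as long as the clean utterance."""
    rng = rng or np.random.default_rng()
    noise = noise_clips[rng.integers(len(noise_clips))]   # many clips, not one
    start = rng.integers(0, len(noise) - len(clean) + 1)  # random offset into the clip
    noise = noise[start:start + len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Illustrative usage with synthetic arrays standing in for real recordings:
rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)                              # ~1 s of "speech" at 16 kHz
noise_clips = [rng.standard_normal(160_000) for _ in range(50)]  # 50 different 10 s noise clips
noisy = synthesize(clean, noise_clips, snr_db=10.0, rng=rng)
```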

Conclusions

  1. Human-level error: Helps estimate Bayes error.
    • Avoidable bias: Training error - Human (Bayes) error
    • If the avoidable bias is bigger than the variance, it's a bias problem, so use a bias-reduction strategy.
  2. Training error: How well does the model learn the training data?
    • Variance: Train-dev error - training error.
    • If the variance is the bigger gap, use a variance-reduction strategy.
  3. Dev error: How well does the model perform on unseen data?
    • Data mismatch: Dev error - train-dev error.
    • If the dev error is much bigger than the train-dev error, you have a data mismatch problem.
  4. Test error: Indicates overfitting to the dev set.
    • Calculate degree of overfitting to dev set: test error - dev error
    • Remember to estimate human/Bayes error on the distribution of the dev/test sets (the real-world data) when judging how good or bad these errors are.
    • If this difference is large (and positive), you may need a bigger dev set: since the dev and test sets come from the same distribution, the only way to get a big gap - doing much better on dev than on test - is by overfitting the dev set.
    • Sometimes, if the dev/test distribution happens to be easier for your application, the dev and test errors can actually come out lower than the train-dev error.
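A minimal sketch of this decomposition, computed from the five error numbers (the 10.5% test error is an assumed value added just to complete the example):

```python
# Minimal sketch: turn the five error measurements into the gaps discussed above.
def decompose_errors(human, train, train_dev, dev, test):
    return {
        "avoidable_bias":     train - human,      # training error vs human/Bayes estimate
        "variance":           train_dev - train,  # same distribution, unseen data
        "data_mismatch":      dev - train_dev,    # different distribution
        "overfitting_to_dev": test - dev,         # dev vs test (same distribution)
    }

# Numbers from the data-mismatch example above; the test error is illustrative.
gaps = decompose_errors(human=0.00, train=0.01, train_dev=0.015, dev=0.10, test=0.105)
for name, gap in gaps.items():
    print(f"{name:20s} {gap:.1%}")   # avoidable bias 1.0%, variance 0.5%, mismatch 8.5%, ...
```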

Unfortunately, there aren't many systematic ways to deal with data mismatch.