Benchmarking to Human-Level Performance

When we measure the performance of ML algorithms on certain tasks, we often compare it to human-level performance. This is useful for two main reasons:

  1. ML has improved a lot thanks to deep learning. It's now realistic to expect ML to match or even surpass human performance.
  2. The workflow of building ML systems is more straightforward when the task is something humans can also do. It's easier to tell how well the ML is doing by comparing it to people.

Once an ML system reaches human-level performance, improving it further becomes much harder. It can't get better than a limit called the "Bayes optimal error": the theoretical lowest possible error rate for the task. Since humans are already quite good at many tasks, human-level error is a good proxy for Bayes error. After an algorithm reaches human-level performance, progress in accuracy slows down.


Here's what you can do when ML is still not as good as humans:

  • Use human judgments to create training data.
  • Look at why humans succeed where ML fails.
  • Carry out a better analysis of bias/variance.

Avoidable bias

Suppose that a cat classification algorithm gives these results:

                 Example A   Example B
Humans           1%          7.5%
Training error   8%          8%
Dev error        10%         10%

Let's say an ML system that identifies cats in photos makes mistakes 8% of the time on its training data. If humans almost never make a mistake (1% error, Example A), we focus on what's causing the model's errors on the training set, i.e., bias. But if humans make mistakes almost as often (7.5% error, Example B), then the gap between training and dev error is the bigger problem, i.e., variance.

So, with the same training error and dev error in these two cases, we decide to focus either on bias-reduction tactics or on variance-reduction tactics, depending on what we think is achievable.

We use the human error rate as a stand-in for the best an ML system can achieve, the Bayes optimal error. If the model's training error is well above the human (or Bayes) error, it has a bias problem. If the error on new, unseen data (dev error) is well above the training error, it has a variance problem.

\text{Avoidable bias} = \text{Training error} - \text{Human (Bayes) error}

\text{Variance} = \text{Dev error} - \text{Training error}
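As a minimal sketch (the `diagnose` helper below is my own naming, not from any library), here is how the two gaps from the table above translate into a diagnosis:

```python
def diagnose(human_error, train_error, dev_error):
    """Estimate avoidable bias and variance from error rates (in %)."""
    avoidable_bias = train_error - human_error  # gap to the Bayes-error proxy
    variance = dev_error - train_error          # gap from training set to dev set
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# Example A: humans 1%, training 8%, dev 10% -> avoidable bias (7%) dominates
print(diagnose(1.0, 8.0, 10.0))   # (7.0, 2.0, 'bias')

# Example B: humans 7.5%, training 8%, dev 10% -> variance (2%) dominates
print(diagnose(7.5, 8.0, 10.0))   # (0.5, 2.0, 'variance')
```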

Choosing the Right Human Benchmark

Deciding on a human performance benchmark depends on what's suitable for the system you're trying to build. If you only need to beat an individual, like one doctor's diagnosis rate, you might set a lower benchmark. But if you're aiming for the best possible performance, like that of a group of expert doctors, your benchmark will be higher (i.e., a lower error rate).
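As a hypothetical illustration (all numbers here are invented for the example), the benchmark you pick directly changes how much avoidable bias you think is left:

```python
# Hypothetical medical-diagnosis example: the same model, two possible benchmarks.
train_error = 0.9      # model's training error (%)

single_doctor = 1.0    # one doctor's error rate (%)
doctor_team = 0.5      # error rate of a team of experts (%), closer to Bayes error

# Against a single doctor, the model already looks better than the benchmark...
print(train_error - single_doctor)  # ~ -0.1 -> no apparent avoidable bias
# ...but against the expert team, there is still ~0.4% of avoidable bias to reduce.
print(train_error - doctor_team)    # ~ 0.4
```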

In summary:

  1. Calculate the avoidable bias using human-level error (a proxy for Bayes error).
  2. Calculate the variance using training error.
  3. If the avoidable bias is larger than the variance, it's a bias problem and you should use a bias-reduction strategy (see the sketch after this list):
    • Train a bigger model.
    • Train longer, or use a better optimization algorithm (like Momentum, RMSprop, Adam).
    • Search for a better NN architecture / better hyperparameters.
  4. If the variance is larger, you should use a variance-reduction strategy:
    • Get more training data.
    • Use regularization (L2, dropout, data augmentation).
    • Search for a better NN architecture / better hyperparameters.
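Putting the recipe together, here is a sketch of the decision rule (the tactic lists and the `suggest_tactics` helper are my own, mirroring the bullets above):

```python
BIAS_TACTICS = [
    "train a bigger model",
    "train longer / use a better optimizer (Momentum, RMSprop, Adam)",
    "search for a better NN architecture / hyperparameters",
]
VARIANCE_TACTICS = [
    "get more training data",
    "use regularization (L2, dropout, data augmentation)",
    "search for a better NN architecture / hyperparameters",
]

def suggest_tactics(human_error, train_error, dev_error):
    """Pick the tactic list matching the larger of the two error gaps."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    return BIAS_TACTICS if avoidable_bias > variance else VARIANCE_TACTICS

# Example A from the table: bias dominates, so bias-reduction tactics come back.
for tactic in suggest_tactics(human_error=1.0, train_error=8.0, dev_error=10.0):
    print(tactic)
```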

So having an estimate of human-level performance gives you an estimate of Bayes error. And this allows you to more quickly make decisions as to whether you should focus on trying to reduce a bias or trying to reduce the variance of your algorithm.

These techniques will tend to work well until you surpass human-level performance, whereupon you might no longer have a good estimate of Bayes error that still helps you make this decision really clearly.

After Beating Human Performance

In some problems, deep learning has surpassed human-level performance, like:

  • Online advertising
  • Product recommendation
  • Loan approval
  • Logistics (predicting transit time)

These examples are not natural perception tasks; rather, they involve learning from structured data. Humans are far better at natural perception tasks like computer vision and speech recognition (speech recognition, image recognition, ECG reading, skin cancer classification, etc.). It's harder for machines to surpass human-level performance in natural perception tasks, but some systems have already achieved it.

Once you've surpassed this Bayes-error proxy threshold, your options for making progress on the machine learning problem are less clear.