Orthogonalization & Single Number Evaluation Metrics

Orthogonalization

Orthogonalization is an approach in machine learning that involves adjusting distinct aspects of a model with separate, specific controls. Each control does one specific task and doesn't affect the other controls. Experienced deep learning developers know exactly which hyperparameter to tune to achieve one particular effect; this is the process we call orthogonalization, and it simplifies the complex process of machine learning by breaking it down into more manageable parts. The chain of assumptions in machine learning:

  1. Training Set Performance: Your model should fit the training set well on cost function (near human level performance if possible). If it doesn't, consider a larger network or a more sophisticated optimization algorithm.
  2. Development Set Performance: Your model should also fit the development set well on cost function. Issues here could be addressed with techniques like regularization or by gathering more data to generalize better to the dev set.
  3. Test Set Performance: A good fit on the test set on cost function is crucial. If your model performs well on the development set but not on the test set, you might need a larger development set to detect overfitting.
  4. Real-World Performance: Ultimately, your model needs to perform well in real-world applications. If there's a gap, it might be necessary to adjust your development set or cost function to better reflect the real-world scenarios. Either the dev/test set distribution is not correct, or the cost function is not measuring the correct thing.

Orthogonal Controls in Practice:

  • Bigger network or Adam optimizer: To improve training set performance.
  • Regularization (like L2 or dropout): To help with the development set performance.
  • Bigger development set: To ensure the test set performance is robust.
  • Change development set or cost function: To align with real-world performance.
💡 Andrew Ng advises against methods like early stopping because they simultaneously influence both the fit of the training set and the development set performance, which goes against the principle of orthogonalization.

A Single Number Evaluation Metric

Setting a single, real number evaluation metric before starting your project streamlines the entire optimization process. It cuts through the noise of conflicting indicators and allows for faster iteration.

Example: Combining Precision and Recall

Think about the difference between precision and recall (in cat classification, for example). Suppose we run the classifier on 10 images, of which 5 are cats and 5 are non-cats. The classifier predicts that 4 images are cats, but 1 of those is actually a non-cat. The confusion matrix:

|                | Predicted Cat | Predicted Non-Cat |
| -------------- | ------------- | ----------------- |
| Actual Cat     | 3             | 2                 |
| Actual Non-Cat | 1             | 4                 |

  1. Precision: Out of everything we predicted was a cat, how many were actually cats? The percentage of true cats among the recognized results: $P = \frac{3}{3 + 1} = 75\%$
  2. Recall: Out of all the actual cats, how many did we manage to predict? The percentage of actual cats that were correctly recognized: $R = \frac{3}{3 + 2} = 60\%$
  3. Accuracy: How many images did we classify correctly, including non-cats? $\frac{3 + 4}{10} = 70\%$
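
To make the arithmetic concrete, here is a minimal Python sketch (not from the original notes) that computes all three numbers from the confusion-matrix counts above:

```python
# Confusion-matrix counts from the cat-classification example above
tp = 3  # actual cats predicted as cats (true positives)
fn = 2  # actual cats predicted as non-cats (false negatives)
fp = 1  # non-cats predicted as cats (false positives)
tn = 4  # non-cats predicted as non-cats (true negatives)

precision = tp / (tp + fp)                  # 3 / 4 = 0.75
recall = tp / (tp + fn)                     # 3 / 5 = 0.60
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 7 / 10 = 0.70

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")
```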

Using precision and recall for evaluation is good in a lot of cases, but on their own they don't tell you which algorithm is better. For example:

| Classifier | Precision | Recall |
| ---------- | --------- | ------ |
| A          | 95%       | 90%    |
| B          | 98%       | 85%    |

Instead of juggling precision and recall separately, combine them into one single (real) number evaluation metric like the $F_1$ score:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

This score harmonizes the two aspects, giving you a clear indication of overall performance. Mathematically, it is the harmonic mean of precision and recall (sensitivity), used when you care about both precision AND recall.
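
As a quick sketch (the `f1_score` helper below is illustrative, not from the notes), plugging in the numbers from the classifier table resolves the earlier ambiguity:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall values from the classifier table above
print(f"A: F1 = {f1_score(0.95, 0.90):.4f}")  # 0.9243
print(f"B: F1 = {f1_score(0.98, 0.85):.4f}")  # 0.9104 -> A has the better F1
```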

Satisficing and Optimizing Metrics

When it's challenging to distill performance down to a single number, identify one optimizing metric and one or more 'satisficing' metrics that only need to be "good enough". Take this example, where it's hard to use just a single-number evaluation metric:

| Classifier | $F_1$ | Running Time |
| ---------- | ----- | ------------ |
| A          | 90%   | 80 ms        |
| B          | 92%   | 95 ms        |
| C          | 92%   | 1,500 ms     |

You may want to select the classifier that maximizes $F_1$, subject to the constraint that the running time, the time it takes to classify an image, is at most 100 milliseconds. In this case, $F_1$ is the optimizing metric, because you want the classifier to detect cat images as accurately as possible, so you do as well on it as you can. Running time, capped at 100 ms in this example, is the satisficing metric: it just has to meet the expectation that has been set. This is a fairly reasonable way to trade off accuracy and running time together.

So as a general rule:

  1. Optimize one key metric: This is your focus for maximization or minimization.
  2. Maintain $N-1$ satisficing metrics: These are your constraints, which just need to meet a predefined standard.

So we can solve our example by choosing a single optimizing metric and deciding that the other metrics are satisficing:

  1. Maximize $F_1$. This is our optimizing metric.
  2. Subject to running time $< 100$ ms. This is our satisficing metric.
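
Here's a minimal Python sketch of this selection rule, using the classifier table above (the variable names are illustrative):

```python
# (name, F1, running time in ms) for the classifiers in the table above
classifiers = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.92, 1500)]

MAX_RUNTIME_MS = 100  # satisficing threshold: must simply be met

# Keep only the classifiers that satisfy the runtime constraint...
feasible = [c for c in classifiers if c[2] <= MAX_RUNTIME_MS]

# ...then maximize the optimizing metric (F1) among them
best = max(feasible, key=lambda c: c[1])
print(best)  # ('B', 0.92, 95)
```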

By separating concerns and identifying clear metrics, you can fine-tune your machine learning system more efficiently, saving time and resources while improving performance. The same pattern applies elsewhere, for example maximizing accuracy subject to at most 1 false positive every 24 hours.

Refining Your Evaluation Metric

In machine learning, it's crucial that our evaluation metrics and datasets align with our goals for the application. If they don't, our model may perform well during testing but fail in the real world. Let's take the cat classification example. Say we have these metric results:

| Algorithm | Classification Error |
| --------- | -------------------- |
| A         | 3% error (but a lot of porn images are classified as cats and shown to users) |
| B         | 5% error             |

If we choose the best algorithm purely by this metric, it would be A; but if the users decide it should actually be B, this discrepancy signals that our evaluation metric isn't capturing what's truly important.

When our evaluation metric no longer correctly rank-orders algorithms by their real-world effectiveness, it's time to refine it. Let's modify the existing error calculation, which is just the fraction of misclassified examples on the dev set:

$$\text{Error}_{\text{Old}} = \frac{1}{m_{\text{dev}}} \sum_{i=1}^{m_{\text{dev}}} \mathbb{1}\{y_{pred}^{(i)} \neq y^{(i)}\}$$

Here $\mathbb{1}\{\cdot\}$ is the indicator function, which counts the number of examples for which the condition inside is true. To address our specific concern, we introduce a weighting factor $w^{(i)}$ that penalizes certain misclassifications more heavily:

$$\text{Error}_{\text{New}} = \frac{1}{\sum_{i=1}^{m_{\text{dev}}} w^{(i)}} \sum_{i=1}^{m_{\text{dev}}} w^{(i)} \, \mathbb{1}\{y_{pred}^{(i)} \neq y^{(i)}\}$$

Where:

$$w^{(i)} = \begin{cases} 1 & \text{if } x^{(i)} = \text{non-porn} \\ 10 & \text{if } x^{(i)} = \text{porn} \end{cases}$$

This adjustment means we're assigning ten times the penalty for misclassifying pornographic images as cats. It's a direct approach to ensure that our error metric reflects the severity of this mistake; it makes the error term go up a lot more if we see a porn image and classify it as a cat.

This is an example of how we can change our evaluation metric to better reflect what we want our algorithm to do. To normalise the error metric so it stays between 0 and 1, we multiply through by $\frac{1}{\sum_{i=1}^{m_{\text{dev}}} w^{(i)}}$. Note that we now also need to label the images in our dev/test sets as porn or non-porn for this weighting to be possible.
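
Below is a minimal NumPy sketch of both error metrics; the arrays are made-up dev-set labels, with `is_porn` standing in for the extra labels mentioned above:

```python
import numpy as np

# Hypothetical dev-set arrays: predictions, true labels, and a porn flag
y_pred  = np.array([1, 0, 1, 1, 0, 1])   # 1 = predicted cat
y_true  = np.array([1, 0, 0, 1, 0, 0])   # 1 = actual cat
is_porn = np.array([0, 0, 1, 0, 0, 0])   # 1 = pornographic image

mistakes = (y_pred != y_true).astype(float)  # the indicator function

# Old metric: plain fraction of misclassified examples
error_old = mistakes.mean()

# New metric: weight porn images 10x, then normalise by the total weight
w = np.where(is_porn == 1, 10.0, 1.0)
error_new = (w * mistakes).sum() / w.sum()

print(error_old, error_new)  # 0.33 vs 0.73: the porn misclassification dominates error_new
```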

Orthogonalization in Refining Evaluation Metrics

This is actually an example of orthogonalization, where you break a problem down into clear, distinct steps:

  1. Define the Metric: Set up a metric that reflects your project's objectives clearly.
  2. Optimize the Metric: Once the target is clear, focus on strategies to hit it accurately.

The high-level takeaway is this: if you find that your evaluation metric is not giving the correct rank-order preference for what is actually the better algorithm, it's time to think about defining a new evaluation metric. Don't keep coasting with an old error metric that you're unhappy with.

Adjusting Dev and Test Sets

If you have trained on nice, high-quality images, but your users' images are blurry and poorly framed, this is another example of the dev/test set, and hence the metric, failing to reflect reality. You might need to change your dev/test set to reflect the data you expect to get in the future in the real world, and consider what it is actually important to do well on.

Overall, if doing well on your metric and the dev/test set you are validating on does not correspond to doing well in your real-life application, change your metric and/or your dev/test set.