Detecting Outliers

Outliers are data points that differ significantly from most of the data. They can be either a headache or a goldmine, affecting your machine learning model for better or worse. Sometimes they represent unique, interesting patterns; other times they're just noise, malicious traffic, or fake data. The key is knowing when to keep them and when to filter them out.

The Outlier's Impact

Take movie recommendations, for instance. A few users who've watched and rated every movie could heavily influence the recommendations for everyone else. In this case, you might want to identify and remove these 'power users' as outliers. Similarly, outliers in web log data could signal something problematic like malicious traffic or bots, which you'd want to discard.

Ethical Considerations

Don't just remove outliers to make your data look good. For example, if you're calculating the mean income in the U.S., excluding a billionaire like Donald Trump would misrepresent the true mean, because that income is a legitimate part of the population you're measuring. You should only remove outliers if they don't align with what you're trying to understand or model.

Spotting the Outliers

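  • Standard Deviation: A classic tool to find outliers. Data points that lie more than a chosen number of standard deviations from the mean (commonly two or three) are flagged as outliers. A cutoff of one standard deviation is usually too aggressive: for normally distributed data it would flag roughly a third of the points.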

    $$\text{Standard Deviation} = \sqrt{\frac{\sum (x - \text{mean})^2}{N}}$$
  • Box and Whisker Diagrams: These define outliers as data points that lie more than 1.5 times the interquartile range (Q3 − Q1) below the first quartile or above the third quartile. This is a more visual approach and helps in quick identification.

    $$\text{Outlier if } x > Q3 + 1.5 \times (Q3 - Q1) \text{ or } x < Q1 - 1.5 \times (Q3 - Q1)$$
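Both rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the function names `zscore_outliers` and `iqr_outliers`, the default thresholds, and the tiny `sample` array are all assumptions made for the example.

```python
import numpy as np

def zscore_outliers(data, threshold=2.0):
    """Return a boolean mask: True where a point lies more than
    `threshold` standard deviations from the mean."""
    data = np.asarray(data, dtype=float)
    std = data.std()  # population standard deviation (divides by N)
    if std == 0:
        # All values identical: nothing can be an outlier.
        return np.zeros(data.shape, dtype=bool)
    return np.abs(data - data.mean()) > threshold * std

def iqr_outliers(data, k=1.5):
    """Return a boolean mask: True where a point lies more than
    k * IQR below Q1 or above Q3 (the box-and-whisker rule)."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return (data < q1 - k * iqr) | (data > q3 + k * iqr)

# Hypothetical sample: seven similar values and one obvious outlier.
sample = np.array([10, 12, 11, 13, 12, 11, 10, 95])
print(sample[zscore_outliers(sample)])  # points flagged by the std-dev rule
print(sample[iqr_outliers(sample)])     # points flagged by the IQR rule
```

On this sample both rules flag only the value 95. On real data they can disagree, because the mean and standard deviation are themselves pulled around by extreme values while the quartiles are not.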

Decision Making

There's no universal rule for what exactly is an outlier. Common sense plays a big role. Look at histograms, distributions, and understand the nature of your data before taking action. Always remember, outliers are not inherently bad; it's the context that determines their fate.
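Before plotting anything, a quick numeric summary can hint at whether extreme values are present. The sketch below is one possible way to do that; the helper name `describe` and the sample values are made up for illustration. A mean that sits far from the median is a common sign that a few extreme points are dominating.

```python
import numpy as np

def describe(data):
    """Return a small dictionary of summary statistics for a 1-D data set."""
    data = np.asarray(data, dtype=float)
    p25, p50, p75 = np.percentile(data, [25, 50, 75])
    return {
        "min": data.min(),
        "p25": p25,
        "median": p50,
        "p75": p75,
        "max": data.max(),
        "mean": data.mean(),
    }

# Hypothetical values: five similar points and one extreme one.
summary = describe([20, 22, 25, 27, 30, 1000])
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Here the median stays near the bulk of the data while the mean is dragged far above it, which is a cue to look closer rather than an automatic reason to delete anything.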

Dealing with outliers

So, let's see how you might handle outliers in practice. Start with a normal distribution of 10,000 incomes with a mean of $27,000 per year and a standard deviation of $15,000, then append a billionaire:

%matplotlib inline
import numpy as np
 
incomes = np.random.normal(27000, 15000, 10000)
incomes = np.append(incomes, [1000000000])
incomes.mean() # Output (varies from run to run): 126713.54327205669
 
import matplotlib.pyplot as plt
plt.hist(incomes, 50)
plt.show()

[Figure: histogram of the incomes, including the billionaire]

Clearly, that's not very helpful. The entire normal distribution of everyone else in the country is squeezed into one bucket of the histogram, while Donald Trump sits alone at the far right, at a billion dollars, screwing up the whole thing.
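The size of the distortion in the mean can be checked with simple arithmetic. The sketch below uses the expected values (10,000 incomes averaging 27,000, plus a single value of one billion) rather than the random sample, so the result is close to, but not identical to, the sample mean computed earlier. The variable names are chosen for this example only.

```python
# Expected mean of the combined data set, using the distribution's
# parameters instead of a random draw.
n_typical = 10_000
typical_mean = 27_000
billionaire = 1_000_000_000

expected_mean = (n_typical * typical_mean + billionaire) / (n_typical + 1)
shift = expected_mean - typical_mean

print(f"expected mean with billionaire: {expected_mean:,.2f}")
print(f"shift caused by one data point: {shift:,.2f}")
```

A single point out of 10,001 moves the mean by roughly 100,000, which is why the mean alone says very little about a "typical" income here.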

It's important to dig into what is causing your outliers, and understand where they are coming from. You also need to think about whether removing them is a valid thing to do, given the spirit of what it is you're trying to analyze. If I know I want to understand more about the incomes of "typical Americans", filtering out billionaires seems like a legitimate thing to do.

Here's something a little more robust than filtering out billionaires. It rejects anything more than two standard deviations away from the median of the data set:

def reject_outliers(data):
    u = np.median(data)
    s = np.std(data)
    filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return filtered
 
filtered = reject_outliers(incomes)
 
plt.hist(filtered, 50)
plt.show()

[Figure: histogram of the filtered incomes]

That looks better. And our mean is more meaningful now as well:

np.mean(filtered) # Output (varies from run to run): 26726.214626383888
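The same filter can also be written with a NumPy boolean mask instead of a list comprehension. The version below is a sketch under a few assumptions: the name `reject_outliers_vectorized`, the `n_std` parameter, and the fixed random seed (used only to make the run repeatable) are not part of the original example.

```python
import numpy as np

def reject_outliers_vectorized(data, n_std=2.0):
    """Keep only values strictly within n_std standard deviations of the median."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    std = np.std(data)
    mask = np.abs(data - median) < n_std * std
    return data[mask]

# Rebuild the income data with a seeded generator (assumed seed: 0).
rng = np.random.default_rng(0)
incomes = np.append(rng.normal(27000, 15000, 10000), [1_000_000_000])

filtered = reject_outliers_vectorized(incomes)
print(len(incomes), len(filtered))
print(filtered.mean())
```

Note that the standard deviation here is computed with the billionaire still in the data, so it is on the order of ten million and the accepted band is very wide. In effect, only the billionaire is removed; a stricter cut would need a spread estimate that is not inflated by the outlier itself, such as the interquartile range.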

So, that's one example of identifying and dealing with outliers. Remember, though, to always do this in a principled manner. Don't just throw out outliers because they're inconvenient; understand where they're coming from and how they actually affect the objective you're trying to measure.