Percentiles & Moments

Next, we'll talk about percentiles and moments. You hear about percentiles in the news all the time. People who are in the top $1\%$ of income: that's an example of a percentile. We'll explain that and look at some examples. Then, we'll talk about moments, which sound like a very fancy mathematical concept but turn out to be very simple to understand conceptually.

Percentiles

If you were to sort all of the data in a dataset, a given percentile is the value below which that percentage of the data falls.

A common example you see talked about a lot is income distribution. When we talk about the 99th percentile, or the one-percenters, imagine that you were to take the incomes of everybody in the country, in this case the UK, and sort them by income. The 99th percentile is the income amount below which $99 \%$ of the country falls; only the remaining $1 \%$ makes more than that amount. It's a very easy way to comprehend it.

💡

In a dataset, a percentile is the point at which $x \%$ of the values are less than the value at that point.
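As a rough sketch of that definition, the snippet below uses a small set of made-up numbers and checks what share of the values falls below the 90th percentile that NumPy reports. The `scores` array and the `fraction_below` helper are illustrative assumptions, not part of any dataset discussed here.

```python
import numpy as np

# Made-up values, purely for illustration
scores = np.array([12, 15, 18, 21, 25, 30, 34, 41, 55, 90])

def fraction_below(data, point):
    """Return the fraction of values in data that are strictly less than point."""
    return float(np.mean(np.asarray(data) < point))

p90 = np.percentile(scores, 90)      # value at the 90th percentile
share = fraction_below(scores, p90)  # share of values below that point

print(p90, share)
```

NumPy interpolates between neighboring values by default, so the reported percentile need not be one of the original data points.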

The following graph is an example for income distribution:

Income Percentiles

The preceding image shows an example of income distribution data. For example, at the 97th percentile we can say that $97 \%$ of the data points, which represent people in the UK, make less than $£ 1,678$ a week, and $3 \%$ make more than that.

The 50th percentile defines the point at which half of the people are making less than you and half are making more, which is the definition of the median. The 50th percentile is the same thing as the median, and that would be at $£ 547$ given this dataset. So, if you're making $£ 547$ a week in the UK, you are making exactly the median amount of income for the country.

You can see the problem of income distribution in the graph above. Income tends to be very concentrated toward the high end of the graph, which is a very big political problem right now in the country.
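To see that the median and the 50th percentile are the same thing, here is a minimal check on made-up numbers. The `weekly_incomes` array is a hypothetical example, not the UK data shown in the graph.

```python
import numpy as np

# Made-up weekly incomes, purely for illustration
weekly_incomes = np.array([250, 310, 400, 480, 520, 600, 750, 900, 1300, 4000])

median = np.median(weekly_incomes)
p50 = np.percentile(weekly_incomes, 50)

# Both calls should agree, since the median is the 50th percentile
print(median, p50, np.isclose(median, p50))
```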

Quartiles

Percentiles are also used when talking about the quartiles of a distribution. Let's look at a normal distribution to understand this better. Here's an example illustrating percentiles in a normal distribution:

Quartiles

Looking at the normal distribution in the image, we can talk about quartiles. Quartile 1 (Q1) is the 25th percentile and quartile 3 (Q3) is the 75th percentile. Together they enclose the middle $50 \%$ of the data: $25 \%$ lies between Q1 and the median, and $25 \%$ lies between the median and Q3.

The median in this example happens to be near the mean. The interquartile range (IQR) is the area in the middle of the distribution, between Q1 and Q3, that contains $50 \%$ of the values. The topmost part of the image is an example of what we call a box-and-whisker diagram.
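The quartiles and the IQR can be sketched with `np.percentile`, which accepts a list of percentiles. The sample below is randomly generated normal data with an arbitrary fixed seed, so the variable names and exact numbers are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an arbitrary choice) for repeatability
sample = rng.normal(0, 0.5, 10000)

# Q1, median, and Q3 are the 25th, 50th, and 75th percentiles
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Share of the sample that lies between Q1 and Q3 (should be about one half)
inside = np.mean((sample >= q1) & (sample <= q3))

print(q1, median, q3, iqr, inside)
```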

Computing percentiles in Python

Let's start off by generating some normally distributed random data, as in the following code block:

import numpy as np
import matplotlib.pyplot as plt
 
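# Draw 10,000 samples from a normal distribution with mean 0 and standard deviation 0.5
values = np.random.normal(0, 0.5, 10000)

# Plot a histogram of the samples using 50 bins
plt.hist(values, 50)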
plt.show()

NumPy provides a very handy percentile function that will compute the percentile values of this distribution for you. Call the np.percentile function on the values array we just generated to figure out the 50th percentile value:

np.percentile(values, 50)

The following is the output of the preceding code:

0.0053397035195310248

The output turns out to be about 0.005. The 50th percentile is just another name for the median, and it turns out the median is very close to zero in this data. You can see in the graph that we're tipped a little bit to the right, so that's not too surprising. Next, let's compute the 90th percentile, which gives the point at which $90 \%$ of the data is less than that value. We can easily do that with the following code:

np.percentile(values, 90)

Here is the output of that code:

0.64099069837340827
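A value of roughly 0.64 is what we would expect here. As a hedged cross-check, the sketch below compares the empirical 90th percentile of a freshly generated sample against the theoretical one from SciPy's `norm.ppf` (the inverse cumulative distribution function). The seed and variable names are arbitrary choices, so the empirical number will differ slightly from the output above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)  # fixed seed (an arbitrary choice) for repeatability
sample = rng.normal(0, 0.5, 10000)

empirical_p90 = np.percentile(sample, 90)

# Theoretical value below which 90% of a normal(mean=0, std=0.5) distribution lies
theoretical_p90 = norm.ppf(0.90, loc=0, scale=0.5)

print(empirical_p90, theoretical_p90)
```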

Moments

Moments are ways to measure the shape of a data distribution, of a probability density function, or of anything, really. Mathematically:

$$\mu_{n}=\int_{-\infty}^{\infty}(x-c)^{n} f(x)\, dx \quad \text{(for moment } n \text{ around value } c\text{)}$$

We take the difference between each value $x$ and some reference value $c$, raise it to the $n^{th}$ power, where $n$ is the moment number, weight it by the density $f(x)$, and integrate across the entire function from negative infinity to infinity. Intuitively:

💡

Moments can be defined as quantitative measures of the shape of a probability density function.
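For a finite sample, the integral has a simple discrete analogue: average $(x - c)^n$ over the data points. The sketch below is one minimal way to write that. The `sample_moment` helper and the tiny `data` array are hypothetical names introduced for illustration, not library functions.

```python
import numpy as np

def sample_moment(data, n, c=0.0):
    """Estimate the n-th moment of data around c as the mean of (x - c)**n."""
    data = np.asarray(data, dtype=float)
    return float(np.mean((data - c) ** n))

# Made-up values, purely for illustration
data = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

first_raw = sample_moment(data, 1)                      # moment 1 around 0
second_central = sample_moment(data, 2, c=data.mean())  # moment 2 around the mean

# These should match NumPy's mean and (population) variance
print(first_raw, np.mean(data))
print(second_central, np.var(data))
```

With $c = 0$ and $n = 1$ this reduces to the ordinary mean, and with $c$ set to the mean and $n = 2$ it reduces to the variance that `np.var` computes by default.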

1. Mean

The first moment, taken around zero ($c = 0$), works out to just be the mean of the data that you're looking at.

2. Variance

The second moment, taken around the mean, is the variance of the dataset.

3. Skew

The third moment gives us skew, which is basically a measure of how lopsided a distribution is.

Skew

You can see in the two examples above that a longer tail on the left means a negative skew, and a longer tail on the right means a positive skew. The dotted lines show what the shape of a normal distribution would look like without skew. Stretch the tail out on the left side and you end up with a negative skew; stretch it out on the other side and you get a positive skew.
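The sign convention can be checked on synthetic data. In this sketch, exponentially distributed samples stand in for a distribution with a long right tail, and negating them mirrors the tail to the left. The choice of distribution, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)  # fixed seed (an arbitrary choice) for repeatability

right_tailed = rng.exponential(scale=1.0, size=10000)  # long tail on the right
left_tailed = -right_tailed                            # mirror image: long tail on the left

skew_right = skew(right_tailed)  # expected to be positive
skew_left = skew(left_tailed)    # expected to be negative, same magnitude

print(skew_right, skew_left)
```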

4. Kurtosis

Kurtosis, the fourth moment, describes how thick the tail is and how sharp the peak is. So again, it's a measure of the shape of the data distribution. Here's an example:

Kurtosis

Distributions with a higher, sharper peak have a higher kurtosis value. The topmost curve has a higher kurtosis than the bottommost curve. It's a very subtle difference, but a difference nonetheless. It basically measures how peaked your data is.

To do this in Python, take this demo data:
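As a rough illustration, the sketch below compares normally distributed samples with samples from a Laplace distribution, which has a sharper peak and thicker tails. The Laplace distribution, the seed, and the variable names are choices made for this example. Note that `scipy.stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores about 0.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)  # fixed seed (an arbitrary choice) for repeatability

normal_sample = rng.normal(0, 1, 100000)
laplace_sample = rng.laplace(0, 1, 100000)  # sharper peak, thicker tails

k_normal = kurtosis(normal_sample)    # excess kurtosis, expected near 0
k_laplace = kurtosis(laplace_sample)  # expected to be clearly higher

print(k_normal, k_laplace)
```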

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sp
 
values = np.random.normal(0,0.5,10000)
 
plt.hist(values, 50)
plt.show()

The moments are calculated as follows:

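# 1st moment: the mean
mean = np.mean(values)
# 2nd moment (around the mean): the variance
var = np.var(values)
# 3rd moment (standardized): the skew
skew = sp.skew(values)
# 4th moment (standardized): the kurtosis; SciPy reports excess kurtosis by default, which is 0 for a normal distribution
kurt = sp.kurtosis(values)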
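As a sanity check on those four numbers: for data drawn from a normal distribution with mean 0 and standard deviation 0.5, we would expect a mean near 0, a variance near 0.25 (0.5 squared), a skew near 0, and an excess kurtosis near 0. The sketch below regenerates comparable data with an arbitrary fixed seed and prints the results. The `check_` variable names are illustrative.

```python
import numpy as np
import scipy.stats as sp

rng = np.random.default_rng(3)  # fixed seed (an arbitrary choice) for repeatability
check_values = rng.normal(0, 0.5, 10000)

check_mean = np.mean(check_values)      # expected near 0
check_var = np.var(check_values)        # expected near 0.25
check_skew = sp.skew(check_values)      # expected near 0 (symmetric)
check_kurt = sp.kurtosis(check_values)  # excess kurtosis, expected near 0

print(check_mean, check_var, check_skew, check_kurt)
```

The values will not be exactly 0 or 0.25, because a random sample only approximates the distribution it was drawn from.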
