Covariance and correlation
Covariance is covered in more detail later but in general, with reference to this fake data:
covariance and correlation give us a means of measuring just how tightly these things are related. I would expect a very low correlation or covariance for the data in the left scatter plot, but a very high covariance and correlation for the data in the right scatter plot. So that's the concept of covariance and correlation: they measure how much the two attributes I'm measuring seem to depend on each other.
Measuring Covariance
Measuring covariance mathematically is explained later; for now, the steps are:
- Think of the data sets for the two variables as high-dimensional vectors
- Convert these to vectors of deviations from the mean
- Take the dot product of the two vectors (which is related to the cosine of the angle between them)
- Divide by the sample size (or by n - 1 when the data is a sample rather than the whole population)
Think of the attributes of the data as high-dimensional vectors. What we're going to do on each attribute for each data point is compute the deviation from the mean at that point. So now I have these high-dimensional vectors where each data point, each person, if you will, corresponds to a different dimension.
I have one vector in this high-dimensional space that represents all the deviations from the mean for one attribute, let's say age. Then I have another vector that represents all the deviations from the mean for some other attribute, like income. What I do then is take these two vectors of deviations and compute the dot product between them. Mathematically, the dot product reflects the angle between these high-dimensional vectors (along with their lengths). So if they end up pointing in nearly the same direction, that tells me that these deviations are pretty much moving in lockstep with each other across these different attributes. If I take that final dot product and divide it by the sample size, that's how I end up with the covariance amount.
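Written as a formula, the steps above look roughly like the following. The symbols are notation chosen here for illustration and are not from the original text: x_i and y_i are the two attributes for data point i, x̄ and ȳ are their means, and n is the number of data points.

```latex
\operatorname{cov}(x, y)
  = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  = \frac{(x - \bar{x}) \cdot (y - \bar{y})}{n}
```

When the data is a sample rather than the whole population, the divisor n is usually replaced by n - 1.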
Now the problem with covariance is that it can be hard to interpret. If I have a covariance that's close to zero, I know that's telling me there's not much of a linear relationship between these variables at all, whereas a covariance that is large in magnitude, positive or negative, implies there is a relationship. But how large is large? The value depends on the units I'm using, so the same data can produce very different numbers. That's a problem that correlation solves.
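To make the units problem concrete, here is a small sketch using made-up numbers (the ages and incomes below are hypothetical, not data from this chapter). It computes the covariance of the same data twice, once with income in dollars and once in thousands of dollars.

```python
import numpy as np

# Hypothetical data: ages in years and incomes in dollars for five people.
age = np.array([25.0, 32.0, 41.0, 50.0, 63.0])
income_dollars = np.array([30000.0, 42000.0, 55000.0, 61000.0, 80000.0])
income_thousands = income_dollars / 1000.0

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance
# between the two inputs (computed with n - 1 in the denominator by default).
cov_dollars = np.cov(age, income_dollars)[0, 1]
cov_thousands = np.cov(age, income_thousands)[0, 1]

# Same relationship, but the two numbers differ by a factor of 1000.
print(cov_dollars, cov_thousands)
```

The relationship between age and income is identical in both cases; only the unit changed, which is why the raw covariance value is hard to judge on its own.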
Correlation
Correlation normalizes everything by the standard deviation of each attribute: just divide the covariance by the standard deviations of both variables. By doing so, I can say very clearly that a correlation of -1 means there's a perfect inverse correlation, so as one value increases, the other decreases, and vice versa. A correlation of 0 means there's no linear correlation at all between these two sets of attributes. A correlation of 1 implies perfect correlation, where these two attributes move in exactly the same way as you look at different data points.
Remember, correlation does not imply causation. Just because you find a very high correlation value does not mean that one of these attributes causes the other. It just means there's a relationship between the two, and that relationship could be caused by something completely different. The only way to really determine causation is through a controlled experiment.
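As a quick illustration of those endpoints, the sketch below uses a tiny made-up vector and NumPy's np.corrcoef. The variable names are chosen here for illustration only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two inputs.
perfect_positive = np.corrcoef(x, 2.0 * x + 1.0)[0, 1]   # y rises exactly with x
perfect_negative = np.corrcoef(x, -3.0 * x)[0, 1]        # y falls exactly as x rises

# Changing the units of x (multiplying by 1000) leaves the correlation unchanged.
rescaled = np.corrcoef(1000.0 * x, 2.0 * x + 1.0)[0, 1]

print(perfect_positive, perfect_negative, rescaled)
```

Up to floating-point rounding, the three values come out as 1, -1, and 1, regardless of the scale of the inputs.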
Covariance and Correlation in Python
From first principles, to demonstrate the math, we can do this:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sp
def de_mean(x):
    xmean = np.mean(x)
    return [xi - xmean for xi in x]

def covariance(x, y):
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)
Covariance, again, is defined as the dot product of the vector of deviations from the mean for one set of data and the vector of deviations from the mean for another set of data measured on the same data points. We then divide that by n - 1 in this case, because we're actually dealing with a sample.
de_mean(), our deviation from the mean function, is just a function that takes a list of values and returns a list of values where each value is the difference between the original value and the mean of the original list of values.
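For a feel of what this returns, here is a tiny sketch that repeats the de_mean() definition so it runs on its own. The input list is made up for illustration.

```python
import numpy as np

def de_mean(x):
    xmean = np.mean(x)
    return [xi - xmean for xi in x]

values = [1, 2, 3, 4, 5]  # the mean of these values is 3

# Convert NumPy scalars to plain floats so the printed list is easy to read.
deviations = [float(d) for d in de_mean(values)]

print(deviations)       # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(sum(deviations))  # 0.0
```

The deviations sum to zero here, as deviations from the mean always do.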
To compute correlations, we can do something like:
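def correlation(x, y):
    # Use the sample standard deviation (ddof=1) so it matches the n - 1
    # divisor in covariance(); otherwise results can slightly exceed 1
    stddevx = np.std(x, ddof=1)
    stddevy = np.std(y, ddof=1)
    return covariance(x, y) / stddevx / stddevy  # We should really check for 0 here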
Instead of using first principles, we can also use NumPy:
np.corrcoef(x, y)
array([[ 1.00500501, 0.99499499],
[ 0.99499499, 1.00500501]])
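As a sanity check, the sketch below compares the first-principles functions with NumPy's built-ins on synthetic data. The data, the seed, and the variable names are assumptions made for this example, and correlation() here uses the sample standard deviation (ddof=1) so that it is consistent with the n - 1 divisor in covariance().

```python
import numpy as np

def de_mean(x):
    xmean = np.mean(x)
    return [xi - xmean for xi in x]

def covariance(x, y):
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)

def correlation(x, y):
    # Assumption of this sketch: sample standard deviations (ddof=1),
    # matching the n - 1 divisor used in covariance().
    return covariance(x, y) / np.std(x, ddof=1) / np.std(y, ddof=1)

# Synthetic data: y decreases with x, plus some random noise.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, 1000)
y = 100.0 - 3.0 * x + rng.normal(0.0, 0.5, 1000)

# np.cov and np.corrcoef return 2x2 matrices; the off-diagonal entry
# is the value for the pair (x, y).
print(covariance(x, y), np.cov(x, y)[0, 1])
print(correlation(x, y), np.corrcoef(x, y)[0, 1])
```

Each printed pair should agree up to floating-point rounding, and the correlation should be strongly negative because y was constructed to fall as x rises.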