Standard Deviation & Variance
Standard deviation and variance are two fundamental measures of how spread out a data distribution is.
Variance
Variance measures how 'spread out' the data is. Variance ($\sigma^2$) is simply the average of the squared differences from the mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

This is the formula for variance, where $\bar{x}$ is the sample mean and $N$ is the number of data points.
Example: What is the variance of the data set (1, 4, 5, 4, 8)?
- First find the mean: (1+4+5+4+8)/5 = 4.4
- Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
- Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
- Find the average of the squared differences:
- $\sigma^2$ = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
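To make the steps concrete, here is a minimal Python sketch that reproduces this calculation; the data set and the expected intermediate values come from the example above.

```python
# Reproduce the worked example: variance of (1, 4, 5, 4, 8).
data = [1, 4, 5, 4, 8]

# Step 1: find the mean.
mean = sum(data) / len(data)              # 4.4

# Step 2: differences from the mean.
diffs = [x - mean for x in data]          # [-3.4, -0.4, 0.6, -0.4, 3.6]

# Step 3: squared differences.
squared = [d ** 2 for d in diffs]         # [11.56, 0.16, 0.36, 0.16, 12.96]

# Step 4: average of the squared differences = variance.
variance = sum(squared) / len(data)
print(variance)                           # 5.04 (up to floating-point rounding)
```

Note that dividing by N, as this example does, gives the population variance; dividing by N − 1 instead gives the sample variance, which you will also see in practice.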
Standard Deviation
The standard deviation ($\sigma$) is just the square root of the variance. For the example above, $\sigma = \sqrt{5.04} \approx 2.24$.
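As a quick check, Python's built-in `statistics` module computes this directly (reusing the example data):

```python
import statistics

data = [1, 4, 5, 4, 8]

# pstdev = population standard deviation, the square root of the
# population variance computed above.
print(statistics.pstdev(data))  # ~2.245, i.e. sqrt(5.04)
```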
The standard deviation is commonly used to identify outliers. You can describe how extreme a data point is by how many standard deviations ('sigmas') it lies from the mean: in a normal distribution, a value within one standard deviation of the mean is considered fairly typical, while data points that lie more than one standard deviation from the mean can be considered unusual.
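Here is a sketch of how this rule of thumb might look in code; the `flag_outliers` helper and its threshold parameter `k` are illustrative, not from the text.

```python
import statistics

def flag_outliers(data, k=1.0):
    """Return values lying more than k standard deviations from the mean.

    Illustrative helper: k = 1 matches the rule of thumb above, though
    larger thresholds (2 or 3) are common in practice.
    """
    mean = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mean) > k * sigma]

print(flag_outliers([1, 4, 5, 4, 8]))  # [1, 8]
```

Running this on the example data flags 1 and 8, since both lie more than one standard deviation (about 2.24) from the mean of 4.4.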