Statistics and Sampling Distributions

Before addressing how to estimate model parameters and their uncertainty from data, we need to define some statistical terminology. In this section we introduce the concepts of a 'population' and a 'sample', explain how they differ, and describe how we can use the statistics of a sample to make inferences about a population.

In statistics, a population consists of all members of a defined group from which we may collect data. Populations are often very large, or even infinite, so it is usually infeasible to conduct a census, that is, to collect data for every member of the population. Therefore, we can study only a subset of the population, which we refer to as a sample. Statistical inference is the process of deducing properties of the whole population (properties we cannot know directly) from a sample for which we can measure these properties directly.

In statistical theory, the distinction between sample and population is fundamental. The sample properties (e.g., mean, mode, median, range, standard deviation, interquartile range, etc.) are called statistics, whereas the true (i.e. population) mean, mode, etc. are called parameters.

We compute the mean, mode, median and other sample statistics directly from the data in the sample. For each sample statistic, we assume the existence of a corresponding true parameter for the population, which we cannot measure. If we could somehow obtain a data point for every member of the population, or in other words sample the whole population (which we cannot), these sample statistics would become equal to the true population parameters. Since this is generally not possible, we need to establish a set of tools that relate sample statistics to true population parameters, which is the main topic of this chapter.

Parameters: $\theta$

Population parameters are generally denoted $\theta$ and represent some characteristic of the population (e.g. mean, variance, proportion). In this chapter, we will be specifically referring to the parameters in the model $f_X(x \mid \theta)$. Below are some examples of model parameters for distributions already considered and summarised in Table 1:

$$
\begin{aligned}
\text{Binomial: } \theta &= (n, p) \\
\text{Poisson: } \theta &= \lambda \\
\text{Normal: } \theta &= \left(\mu, \sigma^2\right)
\end{aligned}
$$
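To make this concrete, here is a minimal sketch (using Python's scipy.stats, purely as one illustrative choice) showing how each parameter vector $\theta$ fixes a specific member of its distribution family; the numerical values of $\theta$ below are arbitrary examples:

```python
from scipy import stats

# Each parameter vector theta selects one member of a distribution family
# (example values chosen arbitrarily for illustration).
binomial = stats.binom(n=10, p=0.5)        # theta = (n, p)
poisson  = stats.poisson(mu=3.0)           # theta = lambda
normal   = stats.norm(loc=0.0, scale=2.0)  # theta = (mu, sigma^2); scale is sigma, so sigma^2 = 4

# Evaluating the model f_X(x | theta) at a point x:
print(binomial.pmf(4))   # P(X = 4) for Binomial(n=10, p=0.5)
print(poisson.pmf(2))    # P(X = 2) for Poisson(lambda = 3)
print(normal.pdf(1.0))   # density of N(mu = 0, sigma^2 = 4) at x = 1
```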

Thus, in this chapter we will examine methods for computing $\theta$ given some sample data. In particular, Point Estimation involves obtaining a particular value for $\theta$ (e.g. $\theta = 0.5$), and Interval Estimation involves defining an interval in which we are 'confident' $\theta$ lies (e.g. $\theta \in [0.4, 0.6]$).
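As a quick illustration of the difference (a hypothetical sketch; the data below are made up), suppose we observe 52 successes in $n = 100$ Bernoulli trials and want to estimate $\theta = p$:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 52 successes in 100 Bernoulli(p) trials.
n, successes = 100, 52

# Point estimate: a single value for theta = p.
p_hat = successes / n                      # 0.52

# Interval estimate: a 95% normal-approximation confidence interval for p.
se = np.sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
z = stats.norm.ppf(0.975)                  # ~1.96
print(p_hat)                               # point estimate
print((p_hat - z * se, p_hat + z * se))    # interval estimate, ~(0.42, 0.62)
```

Both ideas are developed in detail later in the chapter.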

Sample Statistics:

In a general sense, a statistic is any quantity (e.g. mean, variance, etc.) calculated from sample data. At this point we will make a slight notation change to distinguish between a random variable given by some population distribution and sample data from that distribution:

$X$ = random variable associated with the underlying population distribution

$D$ = random sample of $n$ data points from the distribution for $X$.

Note that $D$ is also a random variable, as it represents the sample data points before they are measured.

So we will say that we take random sample data $D$ from some underlying population distribution for $X$. The sample data consists of $D = \{D_1, D_2, \ldots, D_n\}$ for a sample size $n$. Note again that $D$ is a random variable and represents the sample measurements before they are recorded from the population distribution. From a sample $D$ we can compute a statistic, such as the sample mean:

$$
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} D_i
$$

As an example, imagine going out and asking the first $n = 3$ people you meet on campus what their height is. Today your sample might look like $D = \{D_1 = 185\text{ cm}, D_2 = 170\text{ cm}, D_3 = 162\text{ cm}\}$. If you were to repeat this process tomorrow, it is highly unlikely that you would randomly sample the same people from the campus population, and therefore your $D$ will be different (e.g. $D = \{D_1 = 170\text{ cm}, D_2 = 164\text{ cm}, D_3 = 201\text{ cm}\}$), even though the population (and its properties) remains the same (i.e. students on campus and their heights). Therefore $D$ is a random variable, and it is expected that the values of $D$ will differ between sampling trials.
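In code, computing this statistic for the two hypothetical trials above is straightforward (a minimal sketch using numpy, with the made-up heights from the example):

```python
import numpy as np

# Two hypothetical sampling trials of size n = 3 (heights in cm).
D_today    = np.array([185.0, 170.0, 162.0])
D_tomorrow = np.array([170.0, 164.0, 201.0])

# The same statistic (the sample mean) takes a different value on each
# trial, because the sample D is itself a random variable.
print(D_today.mean())     # ~172.3 cm
print(D_tomorrow.mean())  # ~178.3 cm
```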

NOTE: Here we have made the important assumption that the sampling process is independent and identically distributed ('i.i.d.'). This property will be assumed throughout the rest of the chapter and will also prove useful in estimating model parameters.

Sampling Distributions:

As just mentioned, we can repeatedly sample $D = \{D_1, D_2, \ldots, D_n\}$ from some underlying population distribution to compute a sample statistic (e.g. mean, variance), and we refer to this process as conducting a number of sampling trials. As we conduct a sufficient number of sampling trials, we end up generating a probability distribution for our sample statistic, which we call a sampling distribution:

The probability distribution of a statistic (such as the mean or standard deviation) is known as its Sampling Distribution. The standard deviation of a sampling distribution (e.g. of the sample means) is called the Standard Error (SE).
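The following is a minimal sketch of this process (the choice of a uniform population here is an arbitrary assumption for illustration): each sampling trial yields one value of the statistic, and the collection of these values approximates the sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 30, 5_000  # sample size and number of sampling trials

# One sample mean per sampling trial; together these values approximate
# the sampling distribution of the sample mean.
sample_means = np.array([rng.uniform(0.0, 1.0, size=n).mean()
                         for _ in range(n_trials)])

# The standard deviation of the sampling distribution is the standard error.
print(sample_means.std(ddof=1))       # empirical SE
print(np.sqrt(1 / 12) / np.sqrt(n))   # theoretical SE = sigma / sqrt(n) for Uniform(0, 1)
```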

At this point, we have been introducing a lot of terminology, and it is easy to get confused between the distribution of a population vs. the distribution of a statistic (i.e. a sampling distribution). So let us consider an illustrative example for a sampling distribution of a sample mean, using data we have previously analysed in lectures.

Example: Sampling Distribution of the Sample Mean

As an illustration, let's take our population distribution from which we are sampling to be the exponential distribution $f_W(w \mid \hat{\lambda} = 0.1)$ shown in Figure 16. In lecture we showed that when taking random samples $D$ of size $n$ from this distribution, the sampling distribution of the sample mean $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} D_i$ was well modelled by the normal distribution for large enough sample sizes $n$. Indeed, this was the main result of the Central Limit Theorem (end of Ch. 3), which states that the mean $\bar{W}$ of a random sample $D = (D_1, D_2, \ldots, D_n)$ is distributed as:

$$
\lim_{n \rightarrow \infty} P\left[a \leq \frac{\frac{1}{n}\left(D_1 + D_2 + \ldots + D_n\right) - E\left[\frac{1}{n}\sum_{i=1}^{n} D_i\right]}{\sqrt{\operatorname{Var}\left[\frac{1}{n}\sum_{i=1}^{n} D_i\right]}} \leq b\right] = \lim_{n \rightarrow \infty} P\left[a \leq \frac{\bar{W} - E[W]}{\sqrt{\operatorname{Var}[W]/n}} \leq b\right] = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\, dz
$$

This can equivalently be written as $\bar{W} \sim N\left(\mu = E[W],\ \sigma = \sqrt{\operatorname{Var}[W]/n}\right)$. In this example, since we know $W$ is exponentially distributed as $W \sim f_W(w \mid \hat{\lambda} = 0.1)$, we can compute the CLT normal approximation using $\mu = E[W] = \frac{1}{\hat{\lambda}} = 10$ and $\sigma = \sqrt{\operatorname{Var}[W]/n} = \sqrt{\frac{1}{n\hat{\lambda}^2}} = \frac{10}{\sqrt{n}}$:

$$
\lim_{n \rightarrow \infty} P\left[a \leq \frac{\bar{W} - 10}{10/\sqrt{n}} \leq b\right] = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\, dz
$$
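We can check this statement numerically (a sketch assuming numpy and scipy are available): draw many sample means from the exponential population, standardise them with $\mu = 10$ and $\mathrm{SE} = 10/\sqrt{n}$, and compare the Monte Carlo probability with the standard normal integral.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam, n, n_trials = 0.1, 100, 100_000
a, b = -1.0, 1.0

# Sample means of n draws from Exponential(lambda = 0.1); note that
# numpy parameterises the exponential by its scale = 1 / lambda = 10.
W_bar = rng.exponential(scale=1 / lam, size=(n_trials, n)).mean(axis=1)

# Standardise using mu = 10 and SE = 10 / sqrt(n), as in the CLT statement.
Z = (W_bar - 10) / (10 / np.sqrt(n))

print(np.mean((a <= Z) & (Z <= b)))           # Monte Carlo estimate, ~0.68
print(stats.norm.cdf(b) - stats.norm.cdf(a))  # standard normal integral, 0.6827
```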

The end result is that the CLT states that the sampling distribution of the sample mean $\bar{W}$ is normally distributed for large sample sizes $n$, and its standard deviation, $10/\sqrt{n}$, decreases with increasing sample size $n$. This implies that as the sample size $n$ becomes larger, the dispersion of the sampling distribution of the sample means becomes smaller. To avoid confusion, we refer to the standard deviation of the sampling distribution of the sample mean as the standard error:

$$
\text{Standard Error (SE)} = \frac{\sigma}{\sqrt{n}}
$$
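For the exponential example above, where $\sigma = 10$, the standard error shrinks with $\sqrt{n}$; a one-line check:

```python
import numpy as np

sigma = 10.0  # population standard deviation for Exponential(lambda = 0.1)

# SE = sigma / sqrt(n) decreases as the sample size n grows.
for n in (5, 20, 50, 100):
    print(n, sigma / np.sqrt(n))  # 4.47, 2.24, 1.41, 1.0
```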

Computing a Sampling Distribution of Sample Means:

We have just asserted that the sampling distribution of the sample mean is normally distributed, with a standard error that decreases with increasing sample size $n$. Here we compute several sampling distributions of the sample mean for different values of $n$ to examine whether this is indeed true.

By conducting 1,000 trials of sampling $D = (D_1, \ldots, D_n)$ randomly drawn from a population given by the exponential distribution $W \sim f_W(w \mid \hat{\lambda} = 0.1)$, we can compute the sampling distribution of the sample mean $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} D_i$. The sampling distribution of the sample mean is shown in Figure 29 (as blue histograms) for sample sizes $n = 5, 20, 50$ and $100$, where we do see that it is roughly bell-shaped and contains the true population mean of $E[W] = 10$. For reference, we also present the CLT normal approximation given by $\bar{W} \sim N\left(\mu = E[W],\ \sigma = \sqrt{\operatorname{Var}[W]/n}\right) = N\left(\mu = 10,\ \sigma = \frac{10}{\sqrt{n}}\right)$ (red curves in Figure 29), since in this hypothetical experiment we actually know the underlying population distribution for $W$ and can compute $E[W] = 10$ and $\operatorname{Var}[W] = 100$.

Figure 29: Sampling distribution of sample means $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} D_i$ for 1,000 sampling trials of $D = (D_1, D_2, \ldots, D_n)$ from the exponential distribution $W \sim f_W(w \mid \hat{\lambda} = 0.1)$. As the sample size $n$ increases, the distribution of the sample means (given by the blue histogram) becomes more concentrated around the true value of the expected mean $E[W] = \frac{1}{\hat{\lambda}} = 10$. The red curve shows the normal approximation given by the CLT, which states that $\bar{W} \sim N\left(\mu = E[W] = \frac{1}{\hat{\lambda}} = 10,\ \sigma = \sqrt{\operatorname{Var}[W]/n} = \sqrt{\frac{1}{n\hat{\lambda}^2}} = \frac{10}{\sqrt{n}}\right)$.
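This kind of experiment is easy to reproduce in simulation. The sketch below (assuming numpy, scipy and matplotlib; it is not the exact code used for Figure 29) draws 1,000 sampling trials for each $n$ and overlays the CLT normal approximation:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
lam, n_trials = 0.1, 1_000

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, n in zip(axes.flat, (5, 20, 50, 100)):
    # 1,000 sampling trials of size n from Exponential(lambda = 0.1).
    W_bar = rng.exponential(scale=1 / lam, size=(n_trials, n)).mean(axis=1)

    # Sampling distribution of the sample mean (blue histogram).
    ax.hist(W_bar, bins=30, density=True, color="tab:blue", alpha=0.6)

    # CLT normal approximation N(mu = 10, sigma = 10 / sqrt(n)) (red curve).
    w = np.linspace(W_bar.min(), W_bar.max(), 200)
    ax.plot(w, stats.norm.pdf(w, loc=10, scale=10 / np.sqrt(n)), "r-")
    ax.set_title(f"n = {n}")
    ax.set_xlabel(r"$\bar{W}$")

plt.tight_layout()
plt.show()
```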

Visually we can see in Figure 29 that the CLT normal approximation $N\left(\mu = 10,\ \sigma = \frac{10}{\sqrt{n}}\right)$ is indeed a very good representation of our sampling distribution of the sample mean $\bar{W}$, particularly for large $n$ (as the CLT asserts). Likewise, we see that as the sample size $n$ increases, the sampling distribution of the sample mean $\bar{W}$ is more tightly distributed around $E[W]$, meaning that our estimates of the expected value $E[W]$ based on random samples become more accurate. This latter point is connected to the fact that the standard error of our sampling distribution is given by $\sigma/\sqrt{n} = 10/\sqrt{n}$.