Statistics and Sampling Distributions

Before addressing how to estimate model parameters and their uncertainty from data, we need to define some statistical terminology. In this section we introduce the concepts of a 'population' and a 'sample', explain how they differ, and describe how we can use the statistics of a sample to make inferences about a population.

In statistics, a population consists of all members of a defined group from which we may collect data. Populations are often very large, or even infinite, so it is usually infeasible to conduct a census, that is, to collect data for every member of the population. Therefore, we can study only a subset of the population, which we refer to as a sample. Statistical inference is the process of deducing properties of the whole population (properties we cannot know directly) from a sample for which we can measure these properties directly.

In statistical theory, the distinction between sample and population is fundamental. The sample properties (e.g., mean, mode, median, range, standard deviation, interquartile range, etc.) are called statistics, whereas the true (i.e. population) mean, mode, etc. are called parameters.

We compute the mean, mode, median and other sample statistics directly from the data in the sample. For each sample statistic, we assume the existence of a corresponding true parameter for the population, which we cannot measure. If we could somehow obtain a data point for every member of the population, or in other words sample the whole population (which we cannot), these sample statistics would become equal to the true population parameters. Since this is generally not possible, we need to establish a set of tools that relate sample statistics to true population parameters, which is the main topic of this chapter.

Parameters: $\theta$

Population parameters are generally denoted $\theta$ and represent some characteristic of the population (e.g. mean, variance, proportion). In this chapter, we will be specifically referring to the parameters in the model $f_X(x \mid \theta)$. Below are some examples of model parameters for distributions already considered and summarised in Table 1:

$$
\begin{aligned}
\text{Binomial: } \theta &= (n, p) \\
\text{Poisson: } \theta &= \lambda \\
\text{Normal: } \theta &= \left(\mu, \sigma^2\right)
\end{aligned}
$$
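To make this concrete, here is a minimal sketch (using Python's scipy.stats, purely as one illustrative choice) showing how each parameter vector $\theta$ fixes a specific member of its distribution family; the numerical values of $\theta$ below are arbitrary examples:

```python
from scipy import stats

# Each parameter vector theta selects one member of a distribution family
# (example values chosen arbitrarily for illustration).
binomial = stats.binom(n=10, p=0.5)        # theta = (n, p)
poisson  = stats.poisson(mu=3.0)           # theta = lambda
normal   = stats.norm(loc=0.0, scale=2.0)  # theta = (mu, sigma^2); scale is sigma, so sigma^2 = 4

# Evaluating the model f_X(x | theta) at a point x:
print(binomial.pmf(4))   # P(X = 4) for Binomial(n=10, p=0.5)
print(poisson.pmf(2))    # P(X = 2) for Poisson(lambda = 3)
print(normal.pdf(1.0))   # density of N(mu = 0, sigma^2 = 4) at x = 1
```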

Thus, in this chapter we will examine methods for computing $\theta$ given some sample data. In particular, Point Estimation involves obtaining a particular value for $\theta$ (e.g. $\theta = 0.5$), and Interval Estimation involves defining an interval in which we are 'confident' $\theta$ lies (e.g. $\theta \in [0.4, 0.6]$).
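As a quick illustration of the difference (a hypothetical sketch; the data below are made up), suppose we observe 52 successes in $n = 100$ Bernoulli trials and want to estimate $\theta = p$:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 52 successes in 100 Bernoulli(p) trials.
n, successes = 100, 52

# Point estimate: a single value for theta = p.
p_hat = successes / n                      # 0.52

# Interval estimate: a 95% normal-approximation confidence interval for p.
se = np.sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
z = stats.norm.ppf(0.975)                  # ~1.96
print(p_hat)                               # point estimate
print((p_hat - z * se, p_hat + z * se))    # interval estimate, ~(0.42, 0.62)
```

Both ideas are developed in detail later in the chapter.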

Sample Statistics:

In a general sense, a statistic is any quantity (e.g. mean, variance, etc.) calculated from sample data. At this point we will make a slight notation change to distinguish between a random variable given by some population distribution and sample data from that distribution:

$X$ = random variable associated with the underlying population distribution

$D$ = random sample of $n$ data points from the distribution for $X$.

Note that $D$ is also a random variable, as it represents the sample data points before they are measured.

So we will say that we take random sample data $D$ from some underlying population distribution for $X$. The sample data consists of $D = \{D_1, D_2, \ldots, D_n\}$ for a sample size $n$. Note again that $D$ is a random variable and represents the sample measurements before they are recorded from the population distribution. From a sample $D$ we can compute a statistic, such as the sample mean:

$$
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} D_i
$$

As an example, imagine going out and asking the first $n = 3$ people you meet on campus what their height is. Today your sample might look like $D = \{D_1 = 185\text{ cm}, D_2 = 170\text{ cm}, D_3 = 162\text{ cm}\}$. If you were to repeat this process tomorrow, it is highly unlikely that you would randomly sample the same people from the campus population, and therefore your $D$ will be different (e.g. $D = \{D_1 = 170\text{ cm}, D_2 = 164\text{ cm}, D_3 = 201\text{ cm}\}$), even though the population (and its properties) remains the same (i.e. students on campus and their heights). Therefore $D$ is a random variable, and it is expected that the values of $D$ will differ between sampling trials.
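In code, computing this statistic for the two hypothetical trials above is straightforward (a minimal sketch using numpy, with the made-up heights from the example):

```python
import numpy as np

# Two hypothetical sampling trials of size n = 3 (heights in cm).
D_today    = np.array([185.0, 170.0, 162.0])
D_tomorrow = np.array([170.0, 164.0, 201.0])

# The same statistic (the sample mean) takes a different value on each
# trial, because the sample D is itself a random variable.
print(D_today.mean())     # ~172.3 cm
print(D_tomorrow.mean())  # ~178.3 cm
```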

NOTE: Here we have made the important assumption that the sampling process is independent and identically distributed ('i.i.d.'). This property will be assumed throughout the rest of the chapter and will also prove useful in estimating model parameters.

Sampling Distributions:

As just mentioned, we can repeatedly sample $D = \{D_1, D_2, \ldots, D_n\}$ from some underlying population distribution to compute a sample statistic (e.g. mean, variance), and we refer to this process as conducting a number of sampling trials. As we conduct a sufficient number of sampling trials, we end up generating a probability distribution for our sample statistic, which we call a sampling distribution:

The probability distribution of a statistic (such as the mean or standard deviation) is known as its Sampling Distribution. The standard deviation of a sampling distribution (e.g. of the sample means) is called the Standard Error (SE).
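The following is a minimal sketch of this process (the choice of a uniform population here is an arbitrary assumption for illustration): each sampling trial yields one value of the statistic, and the collection of these values approximates the sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 30, 5_000  # sample size and number of sampling trials

# One sample mean per sampling trial; together these values approximate
# the sampling distribution of the sample mean.
sample_means = np.array([rng.uniform(0.0, 1.0, size=n).mean()
                         for _ in range(n_trials)])

# The standard deviation of the sampling distribution is the standard error.
print(sample_means.std(ddof=1))       # empirical SE
print(np.sqrt(1 / 12) / np.sqrt(n))   # theoretical SE = sigma / sqrt(n) for Uniform(0, 1)
```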

At this point, we have been introducing a lot of terminology, and it is easy to get confused between the distribution of a population vs. the distribution of a statistic (i.e. a sampling distribution). So let us consider an illustrative example for a sampling distribution of a sample mean, using data we have previously analysed in lectures.

Example: Sampling Distribution of the Sample Mean

As an illustration, let's take our population distribution from which we are sampling to be the exponential distribution $f_W(w \mid \hat{\lambda} = 0.1)$ shown in Figure 16. In lecture we showed that when taking random samples $D$ of size $n$ from this distribution, the sampling distribution of the sample mean $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} D_i$ was well modelled by the normal distribution for large enough sample sizes $n$. Indeed, this was the main result of the Central Limit Theorem (end of Ch. 3), which states that the mean $\bar{W}$ of a random sample $D = (D_1, D_2, \ldots, D_n)$ is distributed as:

$$
\lim_{n \rightarrow \infty} P\left[a \leq \frac{\frac{1}{n}\left(D_1 + D_2 + \ldots + D_n\right) - E\left[\frac{1}{n}\sum_{i=1}^{n} D_i\right]}{\sqrt{\operatorname{Var}\left[\frac{1}{n}\sum_{i=1}^{n} D_i\right]}} \leq b\right] = \lim_{n \rightarrow \infty} P\left[a \leq \frac{\bar{W} - E[W]}{\sqrt{\operatorname{Var}[W]/n}} \leq b\right] = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\, dz
$$

This can equivalently be written as $\bar{W} \sim N\left(\mu = E[W],\ \sigma = \sqrt{\operatorname{Var}[W]/n}\right)$. In this example, since we know $W$ is exponentially distributed as $W \sim f_W(w \mid \hat{\lambda} = 0.1)$, we can compute the CLT normal approximation using $\mu = E[W] = \frac{1}{\hat{\lambda}} = 10$ and $\sigma = \sqrt{\operatorname{Var}[W]/n} = \sqrt{\frac{1}{n\hat{\lambda}^2}} = \frac{10}{\sqrt{n}}$:

$$
\lim_{n \rightarrow \infty} P\left[a \leq \frac{\bar{W} - 10}{10/\sqrt{n}} \leq b\right] = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\, dz
$$
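We can check this statement numerically (a sketch assuming numpy and scipy are available): draw many sample means from the exponential population, standardise them with $\mu = 10$ and $\mathrm{SE} = 10/\sqrt{n}$, and compare the Monte Carlo probability with the standard normal integral.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam, n, n_trials = 0.1, 100, 100_000
a, b = -1.0, 1.0

# Sample means of n draws from Exponential(lambda = 0.1); note that
# numpy parameterises the exponential by its scale = 1 / lambda = 10.
W_bar = rng.exponential(scale=1 / lam, size=(n_trials, n)).mean(axis=1)

# Standardise using mu = 10 and SE = 10 / sqrt(n), as in the CLT statement.
Z = (W_bar - 10) / (10 / np.sqrt(n))

print(np.mean((a <= Z) & (Z <= b)))           # Monte Carlo estimate, ~0.68
print(stats.norm.cdf(b) - stats.norm.cdf(a))  # standard normal integral, 0.6827
```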

The end result is that the CLT states that the sampling distribution of the sample mean $\bar{W}$ is normally distributed for large sample sizes $n$, and its standard deviation, $10/\sqrt{n}$, decreases with increasing sample size $n$. This implies that as the sample size $n$ becomes larger, the dispersion of the sampling distribution of the sample means becomes smaller. To avoid confusion, we refer to the standard deviation of the sampling distribution of the sample mean as the standard error:

$$
\text{Standard Error (SE)} = \frac{\sigma}{\sqrt{n}}
$$
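For the exponential example above, where $\sigma = 10$, the standard error shrinks with $\sqrt{n}$; a one-line check:

```python
import numpy as np

sigma = 10.0  # population standard deviation for Exponential(lambda = 0.1)

# SE = sigma / sqrt(n) decreases as the sample size n grows.
for n in (5, 20, 50, 100):
    print(n, sigma / np.sqrt(n))  # 4.47, 2.24, 1.41, 1.0
```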

Computing a Sampling Distribution of Sample Means:

We have just asserted that the sampling distribution of the sample mean is normally distributed, with a standard error that decreases with increasing sample size $n$. Here we compute several sampling distributions of the sample mean for different values of $n$ to examine whether this is indeed true.

By conducting 1,000 trials of sampling $D = (D_1, \ldots, D_n)$ randomly drawn from a population given by the exponential distribution $W \sim f_W(w \mid \hat{\lambda} = 0.1)$, we can compute the sampling distribution of the sample mean $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} D_i$. The sampling distribution of the sample mean is shown in Figure 29 (as blue histograms) for sample sizes $n = 5, 20, 50$ and $100$, where we do see that it is roughly bell-shaped and contains the true population mean of $E[W] = 10$. For reference, we also present the CLT normal approximation given by $\bar{W} \sim N\left(\mu = E[W],\ \sigma = \sqrt{\operatorname{Var}[W]/n}\right) = N\left(\mu = 10,\ \sigma = \frac{10}{\sqrt{n}}\right)$ (red curves in Figure 29), since in this hypothetical experiment we actually know the underlying population distribution for $W$ and can compute $E[W] = 10$ and $\operatorname{Var}[W] = 100$.

Figure 29: Sampling distribution of sample means $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} D_i$ for 1,000 sampling trials of $D = (D_1, D_2, \ldots, D_n)$ from the exponential distribution $W \sim f_W(w \mid \hat{\lambda} = 0.1)$. As the sample size $n$ increases, the distribution of the sample means (given by the blue histogram) becomes more concentrated around the true value of the expected mean $E[W] = \frac{1}{\hat{\lambda}} = 10$. The red curve shows the normal approximation given by the CLT, which states that $\bar{W} \sim N\left(\mu = E[W] = \frac{1}{\hat{\lambda}} = 10,\ \sigma = \sqrt{\operatorname{Var}[W]/n} = \sqrt{\frac{1}{n\hat{\lambda}^2}} = \frac{10}{\sqrt{n}}\right)$.
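This kind of experiment is easy to reproduce in simulation. The sketch below (assuming numpy, scipy and matplotlib; it is not the exact code used for Figure 29) draws 1,000 sampling trials for each $n$ and overlays the CLT normal approximation:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
lam, n_trials = 0.1, 1_000

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, n in zip(axes.flat, (5, 20, 50, 100)):
    # 1,000 sampling trials of size n from Exponential(lambda = 0.1).
    W_bar = rng.exponential(scale=1 / lam, size=(n_trials, n)).mean(axis=1)

    # Sampling distribution of the sample mean (blue histogram).
    ax.hist(W_bar, bins=30, density=True, color="tab:blue", alpha=0.6)

    # CLT normal approximation N(mu = 10, sigma = 10 / sqrt(n)) (red curve).
    w = np.linspace(W_bar.min(), W_bar.max(), 200)
    ax.plot(w, stats.norm.pdf(w, loc=10, scale=10 / np.sqrt(n)), "r-")
    ax.set_title(f"n = {n}")
    ax.set_xlabel(r"$\bar{W}$")

plt.tight_layout()
plt.show()
```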

Visually we can see in Figure 29 that the CLT normal approximation $N\left(\mu = 10,\ \sigma = \frac{10}{\sqrt{n}}\right)$ is indeed a very good representation of our sampling distribution of the sample mean $\bar{W}$, particularly for large $n$ (as the CLT asserts). Likewise, we see that as the sample size $n$ increases, the sampling distribution of the sample mean $\bar{W}$ is more tightly distributed around $E[W]$, meaning that our estimates of the expected value $E[W]$ based on random samples become more accurate. This latter point is connected to the fact that the standard error of our sampling distribution is given by $\sigma/\sqrt{n} = 10/\sqrt{n}$.