Interval Estimation
Up until now we have been using Maximum Likelihood Estimation to generate point estimates for our model parameters $\theta$. In particular, we derived expressions for the estimators $\hat{\theta}_{MLE}$ as a function of the random sample $X_1, X_2, \dots, X_n$. You might have noticed by now that most of our maximum likelihood estimators for the distributions considered thus far have taken on a similar functional form:

$$\hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$
That is, the estimators above are equal to the sample mean $\bar{X}$. This is somewhat intuitive, as we had previously shown the true value of these model parameters to be equal to the mean value of the population distribution (see Table 1). Thus, the best approximation to $\theta$ would be to compute the mean from a random sample drawn from the underlying distribution for $X$.
However, we had discussed earlier that $\bar{X}$ is itself a random variable that will change with each sampling trial; therefore the value of the estimate will also change for each random sample. Fortunately, for the estimates listed above we already know how the sample mean $\bar{X}$ is distributed over many sampling trials (e.g. see the sampling distribution of sample means in Figure 29), and we can use these sampling distributions to quantify our uncertainty for any estimate we compute.
Sample Mean for Normal Random Variables
We begin by considering the exact solution where our random samples are drawn from a Normal distribution. Recall from our discussion on the Normal distribution in Chapter 3 that a Z-transformation of a Normal random variable follows a standard Normal distribution:

$$Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
We had also shown that the sum of Normal random variables is itself a Normal random variable (note that this is exact and not a CLT approximation). Thus, if a maximum likelihood estimator corresponds to the mean of a sample from a Normal distribution (as is the case for $\hat{\mu}_{MLE} = \bar{X}$), then we can precisely state that our sampling distribution for $\bar{X}$ is:

$$\bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \quad\Longleftrightarrow\quad Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1)$$
Where $\mu$ and $\sigma^2$ are the mean and variance of the underlying Normal population distribution, respectively. Thus, we would like to use this sampling distribution to understand how good an estimate $\bar{X}$ is of the true population mean $\mu$. For simplicity's sake, let's say that we are given the population variance $\sigma^2$. Then we can rearrange the inequalities in $P(z_1 \le Z \le z_2)$ to deduce bounds around the unknown parameter $\mu$ based on an estimate $\bar{X}$:

$$P\!\left(\bar{X} - z_2\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} - z_1\frac{\sigma}{\sqrt{n}}\right) = P\!\left(z_1 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_2\right)$$
Thus, by choosing bounds $z_1$ and $z_2$, we can integrate the standard normal on the righthand side to compute the probability that our unknown model parameter $\mu$ is contained within the interval $\left[\bar{X} - z_2\frac{\sigma}{\sqrt{n}},\ \bar{X} - z_1\frac{\sigma}{\sqrt{n}}\right]$; we refer to this interval as the confidence interval.
In order to use our sampling distribution to compute the confidence interval, we need to first specify the probability $1-\alpha$ that the interval we construct using an estimate $\bar{X}$ will contain $\mu$; we define this as our confidence level:
Confidence Level $1-\alpha$: Probability that a confidence interval computed from $\bar{X}$ will contain the true parameter value $\mu$.
It is immediately apparent from the above expression that once we have selected a value for the confidence level (i.e. by selecting an $\alpha$), then the corresponding values for $z_1$ and $z_2$ are also fixed, since the standard normal curve is symmetric about $0$. To see this, consider the confidence intervals around $\bar{X}$ for confidence levels of $0.95$ and $0.99$, as shown in Figure 30. Due to the symmetry of the standard normal curve, $z_1 = -z_2$, and we can see in Figure 30 that $z_2 = 1.96$ for $1-\alpha = 0.95$ and $z_2 = 2.575$ for $1-\alpha = 0.99$.
Computing Confidence Interval for Standard Normal using Lookup Tables
A convenient way to compute the confidence interval for a given confidence level is to use the standard Normal CDF lookup tables (see right side of Figure 30). Once we have specified our confidence level $1-\alpha$, we just search the standard Normal CDF lookup table for the corresponding $z_{\alpha/2}$ such that:

$$\Phi(z_{\alpha/2}) = P(Z \le z_{\alpha/2}) = 1 - \frac{\alpha}{2}$$
So for the following scenarios we have:
Confidence Level $1-\alpha$ | $\alpha$ | $\alpha/2$ | $1-\alpha/2$ | $z_{\alpha/2}$ such that $\Phi(z_{\alpha/2}) = 1-\alpha/2$
---|---|---|---|---
0.95 | 0.05 | 0.025 | 0.975 | 1.96
0.99 | 0.01 | 0.005 | 0.995 | 2.575
Table 2: Using the standard Normal CDF lookup table to compute the confidence interval bound $z_{\alpha/2}$ for a given confidence level $1-\alpha$.
We can therefore equivalently express our confidence interval for $\mu$ as:

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
NOTE: It is often helpful to think of $z_{\alpha/2}$ as how many units of standard error wide we allow the confidence interval to be. Thus, the confidence interval is $\bar{X} \pm z_{\alpha/2} \times$ Standard Error, where the standard error of the sample mean is $\sigma/\sqrt{n}$.
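The lookup-table values can be sanity-checked directly with Python's standard-library `statistics.NormalDist`; a minimal sketch (the helper name `z_bound` is ours):

```python
from statistics import NormalDist

def z_bound(confidence_level):
    # z_{alpha/2} such that Phi(z_{alpha/2}) = 1 - alpha/2
    alpha = 1.0 - confidence_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

# Reproduce Table 2 (tables quote 2.575 for the exact value 2.5758...)
print(round(z_bound(0.95), 2))  # 1.96
print(round(z_bound(0.99), 3))  # 2.576
```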
Figure 30: Computing confidence intervals for parameter estimates that correspond to the mean of a sample of Normal random variables. First we specify the confidence level $1-\alpha$, which is represented here as the shaded area under the curve centred around our estimate. For a) we have specified the confidence level to be $0.95$, corresponding to $\alpha = 0.05$; likewise for c) we have specified the confidence level to be $0.99$, corresponding to $\alpha = 0.01$. To find the corresponding confidence interval for a given confidence level, we then use the standard Normal CDF lookup table (right side). For $\alpha/2 = 0.025$ in b), we find the bound to be $z_{\alpha/2} = 1.96$; similarly for $\alpha/2 = 0.005$ in d), we find the bound to be $z_{\alpha/2} = 2.575$.
Sample Mean for Normal Random Variables, Unknown Population Variance
Thus far we have considered the case where the population variance $\sigma^2$ was a known quantity. However, this parameter is seldom known, so it is not precisely correct to say that $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ follows a standard normal distribution. Indeed, in practice we can only ascertain information about $\sigma^2$ by looking at the sample available to us. Recall that we derived an expression for the sample variance $S^2$ by correcting for the bias in our MLE estimator for $\sigma^2$ as:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
Thus we can use $S^2$ as an approximation for $\sigma^2$ in computing the confidence intervals. But how does our uncertainty in $S^2$ manifest itself in the calculation of these confidence intervals? Intuitively we can appreciate that for large enough sample sizes, $S^2$ will be a fairly accurate approximation of $\sigma^2$. But how do we deal with small sample sizes (i.e. $n < 30$)?
The credit for recognising that $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ and $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ do not follow the same distribution goes to William Sealy Gossett (who graduated from Oxford in 1899 with First Class degrees in Chemistry and Maths). As an employee at the Guinness brewery, Gossett worked on the task of making the art of brewing more scientific. However, for many of his experiments the sample sizes were only on the order of 4 or 5, and he knew that there was no possible way of knowing the exact value of the true population variance $\sigma^2$ in his statistical calculations.
In 1908 Gossett published a new PDF that describes the distribution of $\frac{\bar{X}-\mu}{S/\sqrt{n}}$. At that time, Guinness forbade its employees to publish papers for confidentiality reasons, so Gossett published these findings under the pen name of "Student". Thus, the resulting PDF was referred to as the Student $t$-distribution (the '$t$' in this name refers to its use in test statistics, which we will analyse in the next chapter).
Formally speaking, for a random sample $X_1, X_2, \dots, X_n$ from a Normal population distribution with mean $\mu$ and variance $\sigma^2$, the sampling distribution of the sample mean $\bar{X}$ follows a Student $t$-distribution with $\nu = n-1$ degrees of freedom:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
Note here that we define the transform as $T$ instead of $Z$ to distinguish it from the Z-transformation. The Student $t$-distribution, $t_\nu$, has a few interesting and important properties, which are illustrated graphically in Figure 31:
1. It is bell-shaped and symmetric about zero, like the Normal distribution, but has "heavier tails" that approach zero probability at a slower rate.
2. Its dispersion varies according to the size of the sample, $n$; the smaller the sample size, the greater the dispersion (to reflect the uncertainty in estimating the true variance).
3. As $n$ approaches infinity, $t_{n-1}$ converges to the standard Normal distribution (approximately $n \ge 30$ is close enough to use $\mathcal{N}(0,1)$).
Figure 31: Difference between the Student $t$-distribution, $t_\nu$, and the standard Normal distribution, $\mathcal{N}(0,1)$. As the sample size $n$ increases, we can see that $t_{n-1}$ becomes closer to the standard Normal distribution.
We can analogously define our confidence interval in terms of the Student $t$-distribution for a Normal sample with unknown population variance and sample size $n$:

$$\bar{X} \pm t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
Where we note that the Normal sample $X_1, \dots, X_n$ is used to compute both the sample mean and the sample standard deviation:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
The corresponding confidence interval bound $t_{\alpha/2,\nu}$ can be found using a lookup table for the upper percentiles of Student $t$-distributions, given corresponding values for $\alpha/2$ and $\nu = n-1$, in a very similar manner to the standard Normal tables. Indeed, since $t_\nu$ is a symmetric distribution, we only need to find the value of the bound for the upper tail area $\alpha/2$ for a given confidence level (in contrast, for the standard Normal we used the CDF value $1-\alpha/2$).
We illustrate this calculation graphically in Figure 32 for a confidence level of $0.95$ and two different sample sizes. Table 3 below also summarises the confidence interval bounds for additional $\alpha$ values and sample sizes (compare to Table 2):
$\alpha$ | $\alpha/2$ | $n$ | $\nu = n-1$ | $t_{\alpha/2,\nu}$ such that $P(T \ge t_{\alpha/2,\nu}) = \alpha/2$
---|---|---|---|---
0.05 | 0.025 | 6 | 5 | 2.5706
0.05 | 0.025 | 11 | 10 | 2.2281
0.05 | 0.025 | 21 | 20 | 2.0860
0.01 | 0.005 | 6 | 5 | 4.0321
0.01 | 0.005 | 11 | 10 | 3.1693
0.01 | 0.005 | 21 | 20 | 2.8453
Table 3: Using the Student $t$-distribution lookup table to compute the confidence interval bound $t_{\alpha/2,\nu}$ for a given $\alpha$ and sample size $n$. Note that the 'degrees of freedom' $\nu$ is equal to $n-1$.
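The tabulated bounds can also be reproduced programmatically; a sketch using `scipy.stats.t` (assuming SciPy is available):

```python
from scipy.stats import t

# t_{alpha/2, nu} is the value with CDF 1 - alpha/2 (upper-tail area alpha/2)
alpha = 0.05
for n in (6, 11, 21):
    nu = n - 1  # degrees of freedom
    bound = t.ppf(1 - alpha / 2, df=nu)
    print(nu, round(float(bound), 4))  # matches Table 3: 2.5706, 2.2281, 2.086
```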
Again, it is often helpful to think of $t_{\alpha/2,\nu}$ as how many units of standard error wide we allow the confidence interval to be. Thus, we can write the confidence interval as $\bar{X} \pm t_{\alpha/2,\nu} \times$ Standard Error, just as we did in the case of the Normal population with known variance. However, we will generally expect a larger confidence interval (i.e. $t_{\alpha/2,\nu} > z_{\alpha/2}$) for a given $\alpha$ due to the additional uncertainty associated with estimating $\sigma$ from the sample data.
For instance, we see in Table 3 for $\alpha = 0.05$ that as $n$ increases and our estimation of $\sigma$ from the sample data becomes more accurate (i.e. $S \to \sigma$), $t_{\alpha/2,\nu}$ approaches the standard Normal bound of $z_{0.025} = 1.96$. Likewise as $n$ increases for $\alpha = 0.01$, $t_{\alpha/2,\nu}$ approaches the standard Normal bound of $z_{0.005} = 2.575$. Using shorthand notation we can write the confidence interval for the estimate $\bar{X}$ as:

$$\bar{X} \pm t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
Figure 32: Computing confidence intervals for parameter estimates that correspond to the mean of a sample of Normal random variables with unknown population variance. First we specify the confidence level $1-\alpha$, which corresponds to the shaded area under the curve centred around our estimate (panels a and c). To find the corresponding bounds around our estimate for a given confidence level and sample size, we then use the Student $t$-distribution lookup table to obtain $t_{\alpha/2,\nu}$ (panels b and d).
Sample Mean for Non-Normal Distributions
So far we have considered how to compute confidence intervals for our sample means for a random sample $X_1, \dots, X_n$ from a Normal distribution. In the case where the population variance $\sigma^2$ was known:

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
And in the case where the population variance is unknown (and must be estimated using the sample variance $S^2$):

$$\bar{X} \pm t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
But we noted that for large enough $n$ (e.g. $n \ge 30$) the Student $t$-distribution converges to the standard Normal $\mathcal{N}(0,1)$, and as a result $t_{\alpha/2,\,n-1} \approx z_{\alpha/2}$ (i.e. the intervals become the same).
However, we have also seen that for large sample sizes the sample mean of a non-Normal population distribution follows a sampling distribution that is Normally distributed. For instance, Figure 29 shows that the sampling distribution of the sample means $\bar{X}$, as computed from random samples of the exponential population distribution (with population mean $\mu$ and variance $\sigma^2$), is itself well approximated by a Normal distribution given by $\mathcal{N}(\mu, \sigma^2/n)$.
Indeed this was the main result of the Central Limit Theorem at the end of Chapter 3: the sample mean $\bar{X}$, computed from random samples of any population distribution with mean $\mu$ and variance $\sigma^2$, is Normally distributed for large sample sizes $n$:

$$\bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$$
And since the CLT only applies to large $n$, we can use the sample standard deviation $S$ to approximate $\sigma$ without introducing too much further variability into our confidence intervals:

$$\bar{X} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}$$
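Putting the CLT-based interval together: a minimal sketch that draws a hypothetical large sample from an exponential population (the rate $2.0$, sample size, and seed are our choices) and builds the approximate $95\%$ interval for $\mu$:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
n = 100
sample = [random.expovariate(2.0) for _ in range(n)]  # hypothetical data, true mu = 0.5

x_bar = mean(sample)             # sample mean
s = stdev(sample)                # sample standard deviation S
z = NormalDist().inv_cdf(0.975)  # z_{alpha/2} for 95% confidence

half_width = z * s / n ** 0.5    # z_{alpha/2} * standard error
print(f"95% CI for mu: ({x_bar - half_width:.3f}, {x_bar + half_width:.3f})")
```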
Confidence Intervals for General Sample Statistics
We can generally define the procedure for constructing confidence intervals for any sample statistic as follows:
Let $X_1, X_2, \dots, X_n$ be a random sample from a population with an unknown parameter $\theta$. Given a confidence level $1-\alpha$, if bounds $\hat{\theta}_L$ and $\hat{\theta}_U$ are computed from sample statistics with the property that:

$$P(\hat{\theta}_L \le \theta \le \hat{\theta}_U) = 1 - \alpha$$
then we say that $[\hat{\theta}_L, \hat{\theta}_U]$ is a $100(1-\alpha)\%$ confidence interval for $\theta$. Thus, the confidence interval contains the true value of the parameter with some known probability $1-\alpha$.
To illustrate, consider an example of computing the confidence interval for $\mu$ for a random sample from a Normal distribution and a confidence level of $0.95$. With the help of Figure 30 and Table 2 we find that:

$$P\!\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
Thus the $95\%$ confidence interval for $\mu$ is given as:

$$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}}$$
Again, this means that the true population parameter $\mu$ will be within this interval $95\%$ of the time. In other words, if we were to conduct 100 sampling trials for $\bar{X}$, then the computed confidence interval would contain $\mu$ 95 times on average.
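This frequency interpretation can be checked by simulation; a sketch with a hypothetical Normal population (the mean $170$, standard deviation $10$, and sample size $25$ are arbitrary choices) and known $\sigma$:

```python
import random
from statistics import mean

random.seed(42)
mu, sigma, n = 170.0, 10.0, 25   # hypothetical population and sample size
z = 1.96                          # z_{alpha/2} for a 95% confidence level

trials, hits = 1000, 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    half_width = z * sigma / n ** 0.5        # z * standard error
    if abs(mean(sample) - mu) <= half_width:  # interval covers mu?
        hits += 1

print(f"Interval contained mu in {hits} of {trials} trials")  # about 950 expected
```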
Example: Sampling Distribution of Sample Variance for a Normal Population
Thus far our sample statistic of interest has been the sample mean $\bar{X}$, since it has been equal to most of our Maximum Likelihood Estimators for the probability models considered in the course.
However, we saw that the MLE of the variance of a Normal population distribution was closely related to the sample variance $S^2$:

$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{n-1}{n}S^2$$
This sample variance is a statistic we have not analysed in much detail, but remember that for all statistics we can compute their corresponding sampling distributions.
Let us consider the example covered in lecture, where we sampled 10 random UG student heights per day for a year (365 days, or 365 sampling trials). We had used this data to compute the sampling distribution for the sample means (i.e. the distribution of mean heights computed each day), as shown in Figure 33a. We had also shown that this sampling distribution was very accurately approximated by a Normal distribution $\mathcal{N}(\mu, \sigma^2/n)$, using the fact that $E[\bar{X}] = \mu$ and $\text{Var}(\bar{X}) = \sigma^2/n$ (i.e. $\sigma/\sqrt{n}$ is the standard error), where $\mu$ is the population mean and $\sigma^2$ is the population variance. Generally we do not know $\mu$ and $\sigma^2$, but temporarily assuming we had these values allowed us to perform a Z-transformation on $\bar{X}$ (Figure 33b) and then derive the confidence interval for $\mu$ as a function of $\bar{X}$ and some multiple of the standard error (see previous sections for details).
In an analogous fashion, using this data we can also construct a sampling distribution for the sample variance $S^2$, which is shown in Figure 33c. We can see that the sampling distribution for the sample variance generally skews to the right, and so one could guess that a Normal distribution (or any other symmetric distribution) would not be a good model. However, it turns out that if we transform $S^2$ by scaling it by $(n-1)/\sigma^2$, then this sampling distribution is modelled by the chi-squared distribution:
Let $X_1, X_2, \dots, X_n$ be a random sample from a Normal population with mean $\mu$ and variance $\sigma^2$. The scaled ratio of the sample variance $S^2$ over $\sigma^2$ follows a chi-squared distribution with $\nu = n-1$ degrees of freedom:

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$
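We can check this claim empirically; a sketch that simulates many sampling trials from a hypothetical Normal population and compares the mean of the scaled ratio to the chi-squared mean $\nu = n-1$:

```python
import random
from statistics import variance

random.seed(1)
mu, sigma, n = 0.0, 2.0, 10   # hypothetical Normal population, small sample
trials = 5000

ratios = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = variance(sample)                      # unbiased sample variance S^2
    ratios.append((n - 1) * s2 / sigma ** 2)   # should follow chi^2_{n-1}

# A chi-squared distribution with nu degrees of freedom has mean nu
print(f"Mean of ratios: {sum(ratios) / trials:.2f} (expect about {n - 1})")
```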
As we did for the Normal and Student $t$-distributions, for a given confidence level $1-\alpha$ we can define a confidence interval for $\frac{(n-1)S^2}{\sigma^2}$ in terms of bounds $a$ and $b$ as follows:

$$P\!\left(a \le \frac{(n-1)S^2}{\sigma^2} \le b\right) = 1 - \alpha$$
In this case we are interested in constructing a confidence interval for the unknown population variance $\sigma^2$, so we take the reciprocal of the above expression:

$$P\!\left(\frac{1}{b} \le \frac{\sigma^2}{(n-1)S^2} \le \frac{1}{a}\right) = 1 - \alpha$$
Finally, multiplying each term by $(n-1)S^2$ gives us our confidence interval for the unknown population variance $\sigma^2$:

$$\frac{(n-1)S^2}{b} \le \sigma^2 \le \frac{(n-1)S^2}{a}$$
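As a sketch of the full procedure (assuming SciPy for the chi-squared quantiles; the sample itself is hypothetical, $n = 10$ heights), a $95\%$ confidence interval for $\sigma^2$:

```python
import random
from statistics import variance
from scipy.stats import chi2

random.seed(7)
n, alpha = 10, 0.05
sample = [random.gauss(170.0, 5.0) for _ in range(n)]  # hypothetical data
s2 = variance(sample)                                   # sample variance S^2

nu = n - 1
a = chi2.ppf(alpha / 2, df=nu)       # lower chi-squared bound (area 0.025 below)
b = chi2.ppf(1 - alpha / 2, df=nu)   # upper chi-squared bound (area 0.025 above)

lower, upper = nu * s2 / b, nu * s2 / a
print(f"95% CI for sigma^2: ({lower:.2f}, {upper:.2f})")
```

Note that the interval for $\sigma^2$ is not symmetric about $S^2$, because the chi-squared distribution itself is not symmetric.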
Figure 33: a) Sampling distribution of sample means $\bar{X}$ of 10 student heights for 365 sampling trials. We see that it is normally distributed, and well modelled by $\mathcal{N}(\mu, \sigma^2/n)$. b) Z-transformation of the sampling distribution of sample means to standardise calculation of confidence intervals for the unknown population mean $\mu$. c) Sampling distribution of the sample variance $S^2$ of 10 student heights for 365 sampling trials. We see that it can be modelled by some asymmetric distribution, as it skews to the right. d) The ratio of $(n-1)S^2$ to $\sigma^2$ is modelled by a chi-squared distribution with degrees of freedom $\nu = n-1 = 9$, which will be used to compute confidence intervals for the unknown population variance $\sigma^2$.
Computing Chi-squared Confidence Bounds
Note that the bounds $a$ and $b$ are not equal and opposite, as the chi-squared distribution is not symmetric (as is evident in Figure 33d). To illustrate, let us compute these bounds for the chi-squared distribution $\chi^2_9$ in Figure 33d for a confidence level $1-\alpha = 0.95$. Figure 34a shows where these bounds are for areas of $\alpha/2 = 0.025$ under the curve in the left and right tails of $\chi^2_9$.
As with the Student $t$-distribution, we are often given lookup tables in the form of upper-tail values $\chi^2_{\alpha,\nu}$ for the chi-squared distribution, where $P(X \ge \chi^2_{\alpha,\nu}) = \alpha$. Therefore, to compute the bounds we need to find the following for the lower bound (as shown in Figure 34b):

$$a = \chi^2_{1-\alpha/2,\,\nu} = \chi^2_{0.975,\,9}$$
Thus the lower bound is $a = 2.700$. Analogously we can compute the upper bound (shown in Figure 34c):

$$b = \chi^2_{\alpha/2,\,\nu} = \chi^2_{0.025,\,9}$$
And we find the corresponding upper bound is $b = 19.023$.
Figure 34: a) $95\%$ confidence interval for the chi-squared distribution $\chi^2_9$ with tail areas $\alpha/2 = 0.025$. b) How to compute the lower bound, $a$, using the chi-squared lookup table. c) How to compute the upper bound, $b$, using the chi-squared lookup table.