Interval Estimation
Up until now we have been using Maximum Likelihood Estimation (MLE) to generate point estimates for our model parameters $\theta$. In particular, we derived expressions for the estimators $\hat{\theta}_{MLE}$ as a function of the random sample $X_1, X_2, \ldots, X_n$. You might have noticed by now that most of our maximum likelihood estimators for the distributions considered thus far take on a similar functional form:

$$\hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$
That is, the estimators above are equal to the sample mean $\bar{X}$. This is somewhat intuitive, as we had previously shown the true value of these model parameters to be equal to the mean value of the population distribution (see Table 1). Thus, the best approximation to $\theta$ is to compute the mean of a random sample drawn from the underlying distribution for $X$.
However, we had discussed earlier that $\bar{X}$ is itself a random variable that will change with each sampling trial; therefore the value of the estimate will also change for each random sample. Fortunately, for the estimates listed above we already know how the sample mean $\bar{X}$ is distributed over many sampling trials (e.g. see the sampling distribution of sample means in Figure 29), and we can use these sampling distributions to quantify our uncertainty for any estimate we compute.
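This behaviour is easy to see in simulation. The following sketch is not from the text — the exponential population and its rate are made-up values for illustration — but it shows how the sample mean varies from one sampling trial to the next:

```python
import random
from statistics import mean, stdev

random.seed(0)

# Simulate the sampling distribution of the sample mean: draw many
# random samples of size n from an exponential population with rate
# lam, so the true population mean is 1/lam = 0.5.
lam, n, trials = 2.0, 50, 5000
sample_means = [
    mean(random.expovariate(lam) for _ in range(n)) for _ in range(trials)
]

# The estimates cluster around the population mean, with spread
# (standard error) close to sigma / sqrt(n) = 0.5 / sqrt(50) ~ 0.0707.
print(mean(sample_means), stdev(sample_means))
```

The spread of the recorded sample means is the standard error, which is exactly the quantity that confidence intervals are built from.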
Sample Mean for Normal Random Variables
We begin by considering the exact solution where our random samples are drawn from a Normal distribution. Recall from our discussion of the Normal distribution in Chapter 3 that a Z-transformation of a Normal random variable follows a standard Normal distribution:

$$Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
We had also shown that the sum of Normal random variables is itself a Normal random variable (note that this is exact and not a CLT approximation). Thus, if a maximum likelihood estimator corresponds to the mean of a sample from a Normal distribution (as is the case for $\hat{\mu}_{MLE} = \bar{X}$), then we can precisely state that our sampling distribution for $\bar{X}$ is:

$$\bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right), \qquad Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1)$$
Where $\mu$ and $\sigma^2$ are the mean and variance of the underlying Normal population distribution, respectively. Thus, we would like to use this sampling distribution to understand how good our estimate $\bar{x}$ is of the true population mean $\mu$. For simplicity's sake, let's say that we are given the population variance $\sigma^2$. Then we can rearrange the above inequalities to deduce bounds around the unknown parameter $\mu$ based on an estimate $\bar{x}$:

$$P\!\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = P\!\left(-z_{\alpha/2} \le Z \le z_{\alpha/2}\right)$$
Thus, by choosing bounds $-z_{\alpha/2}$ and $z_{\alpha/2}$, we can integrate the standard normal on the right-hand side to compute the probability that our unknown model parameter $\mu$ is contained within the interval $\left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$; we refer to this interval as the confidence interval.
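Concretely, this "integration" is just a difference of two standard Normal CDF values. A minimal sketch using Python's standard library (`statistics.NormalDist`); the two bound values used here are the ones tabulated later in the chapter:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Normal: mu = 0, sigma = 1

def coverage(z: float) -> float:
    """Area under the standard Normal curve between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

print(coverage(1.96))   # ~ 0.95
print(coverage(2.575))  # ~ 0.99
```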
In order to use our sampling distribution to compute the confidence interval, we need to first specify the probability $1-\alpha$ that the interval we construct using an estimate $\bar{x}$ will contain $\mu$; we define this as our confidence level:
Confidence Level: The probability $1-\alpha$ that a confidence interval computed from $\bar{X}$ will contain the true parameter value $\mu$.
It is immediately apparent from the above expression that once we have selected a value for the confidence level (i.e. by selecting an $\alpha$), then the corresponding values for $-z_{\alpha/2}$ and $z_{\alpha/2}$ are also fixed, since the standard normal curve is symmetric about $0$. To see this, consider the confidence intervals around $\bar{x}$ for confidence levels of $0.95$ and $0.99$, as shown in Figure 30. Due to the symmetry of the standard normal curve, we can see in Figure 30 that $z_{\alpha/2} = 1.96$ for $1-\alpha = 0.95$ and $z_{\alpha/2} = 2.575$ for $1-\alpha = 0.99$.
Computing Confidence Interval for Standard Normal using Lookup Tables
A convenient way to compute the confidence interval for a given confidence level is to use the standard Normal CDF lookup tables (see right side of Figure 30). Once we have specified our confidence level $1-\alpha$, we just search the standard Normal CDF lookup table for the corresponding $z_{\alpha/2}$ such that:

$$\Phi(z_{\alpha/2}) = 1 - \frac{\alpha}{2}$$
So for the following scenarios we have:
Confidence Level $1-\alpha$ | $\alpha$ | $\alpha/2$ | $1-\alpha/2$ | $z_{\alpha/2}$ such that $\Phi(z_{\alpha/2}) = 1-\alpha/2$
---|---|---|---|---
0.95 | 0.05 | 0.025 | 0.975 | 1.96
0.99 | 0.01 | 0.005 | 0.995 | 2.575

Table 2: Using the standard Normal CDF lookup table to compute the confidence interval bound $z_{\alpha/2}$ for a given confidence level $1-\alpha$.
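If a lookup table is not at hand, the same values can be recovered programmatically with the inverse CDF. A small sketch reproducing Table 2 (note that the inverse CDF gives 2.576 to three decimals, which some tables truncate to 2.575):

```python
from statistics import NormalDist

std_normal = NormalDist()

# z_{alpha/2} is the point whose standard Normal CDF value is 1 - alpha/2.
for conf_level in (0.95, 0.99):
    alpha = 1 - conf_level
    z = std_normal.inv_cdf(1 - alpha / 2)
    print(f"1-alpha = {conf_level}: z_alpha/2 = {z:.3f}")
```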
We can therefore equivalently express our confidence interval for $\mu$ as:

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
NOTE: It is often helpful to think of $z_{\alpha/2}$ as how many units of standard error wide we allow the confidence interval to be. Thus, the confidence interval is $\bar{x} \pm z_{\alpha/2} \times$ Standard Error, where the Standard Error is $\sigma/\sqrt{n}$.
Figure 30: Computing confidence intervals for parameter estimates that correspond to the mean of a sample of Normal random variables. First we specify the confidence level $1-\alpha$, which is represented here as the shaded area under the curve centred around our estimate. For a) we have specified the confidence level to be $0.95$, corresponding to $\alpha = 0.05$; likewise for c) we have specified the confidence level to be $0.99$, corresponding to $\alpha = 0.01$. To find the corresponding confidence interval for a given confidence level, we then use the standard Normal CDF lookup table (right side). For $\alpha = 0.05$ in b), we find the bound to be $z_{\alpha/2} = 1.96$; similarly for $\alpha = 0.01$ in d), we find the bound to be $z_{\alpha/2} = 2.575$.
Sample Mean for Normal Random Variables, Unknown Population Variance
Thus far we have considered the case where the population variance $\sigma^2$ was a known quantity. However, this parameter is seldom known, so it is not precisely correct to say that $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ follows a standard normal distribution. Indeed, in practice we can only ascertain information about $\sigma^2$ by looking at the sample available to us. Recall that we derived an expression for the sample variance $S^2$ by correcting for the bias of our MLE estimator for $\sigma^2$:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$
Thus we can use $S^2$ as an approximation for $\sigma^2$ in computing the confidence intervals. But how does our uncertainty in $S^2$ manifest itself in the calculation of these confidence intervals? Intuitively we can appreciate that for large enough sample sizes, $S^2$ will be a fairly accurate approximation of $\sigma^2$. But how do we deal with small sample sizes (i.e. $n < 30$)?
The credit for recognising that $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ and $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ do not follow the same distribution goes to William Sealy Gosset (who graduated from Oxford in 1899 with First Class degrees in Chemistry and Maths). As an employee at the Guinness brewery, Gosset worked on the task of making the art of brewing more scientific. However, for many of his experiments the sample sizes were only on the order of 4 or 5, and he knew that there was no possible way of knowing the exact value of the true population variance in his statistical calculations.
In 1908 Gosset published a new PDF that describes the distribution of $T = \frac{\bar{X}-\mu}{S/\sqrt{n}}$. At that time, Guinness forbade its employees to publish papers for confidentiality reasons, so Gosset published these findings under the pen name of "Student". Thus, the resulting PDF was referred to as the Student $t$-distribution (the '$t$' in this name refers to its use in test statistics, which we will analyse in the next chapter).
Formally speaking, for a random sample of a Normal population distribution with mean $\mu$ and variance $\sigma^2$, the sampling distribution of the sample mean follows a Student $t$-distribution with degrees of freedom $\nu = n-1$:

$$T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{\nu}$$
Note here that we define the transform as $T$ instead of $Z$ to distinguish it from the Z-transformation. The Student $t$-distribution, $t_{\nu}$, has a few interesting and important properties, which are illustrated graphically in Figure 31:
- It is bell-shaped and symmetric about zero, like the Normal distribution, but has "heavier tails" that approach zero probability at a slower rate.
- Its dispersion varies according to the size of the sample, $n$; the smaller the sample size, the greater the dispersion (to reflect the uncertainty in estimating the true variance).
- As $n \to \infty$, $t_{\nu}$ converges to the standard Normal distribution (approximately, $n \ge 30$ is close enough to use $\mathcal{N}(0,1)$).
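Both the heavier tails and the convergence can be checked directly from the $t$ density. This helper is not from the text (it just writes out the standard formula for the $t$ PDF; `scipy.stats.t` would do the same job):

```python
import math

def t_pdf(x: float, nu: int) -> float:
    """PDF of the Student t-distribution with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + x * x / nu) ** (-(nu + 1) / 2)

def normal_pdf(x: float) -> float:
    """PDF of the standard Normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Heavier tails: at x = 3 the t density with nu = 5 degrees of freedom
# is several times larger than the standard Normal density.
print(t_pdf(3.0, 5), normal_pdf(3.0))

# Convergence: with nu = 30 the two densities are already very close.
print(t_pdf(1.0, 30), normal_pdf(1.0))
```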
Figure 31: Difference between the Student $t$-distribution, $t_{\nu}$, and the standard Normal distribution, $\mathcal{N}(0,1)$. As the sample size $n$ increases, we can see that $t_{\nu}$ becomes closer to the standard Normal distribution.
We can analogously define our confidence interval in terms of the Student $t$-distribution for a Normal sample with unknown population variance and sample size $n$:

$$\bar{x} \pm t_{\alpha/2,\, n-1}\frac{s}{\sqrt{n}}$$
Where we note that the Normal sample is used to compute both the sample mean and the sample standard deviation:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$
The corresponding confidence interval bound $t_{\alpha/2,\nu}$ can be found using a lookup table for the upper percentiles of Student $t$-distributions, given corresponding values for $\alpha/2$ and $\nu = n-1$, in a very similar manner to the standard Normal tables. Indeed, since $t_{\nu}$ is a symmetric distribution, we only need to find the value of the bound for the upper tail for a given confidence level (whereas for the standard Normal we used the CDF value $\Phi(z_{\alpha/2}) = 1 - \alpha/2$).
We illustrate this calculation graphically in Figure 32 for a given confidence level $1-\alpha$ and two different sample sizes. Table 3 below also summarises the confidence interval bounds for additional $\alpha$ values and sample sizes (compare to Table 2):
$\alpha$ | $\alpha/2$ | $n$ | $\nu = n-1$ | $t_{\alpha/2,\nu}$ such that $P(T > t_{\alpha/2,\nu}) = \alpha/2$
---|---|---|---|---
0.05 | 0.025 | 6 | 5 | 2.5706
0.05 | 0.025 | 11 | 10 | 2.2281
0.05 | 0.025 | 21 | 20 | 2.0860
0.01 | 0.005 | 6 | 5 | 4.0321
0.01 | 0.005 | 11 | 10 | 3.1693
0.01 | 0.005 | 21 | 20 | 2.8453
Table 3: Using the Student $t$-distribution lookup table to compute the confidence interval bound $t_{\alpha/2,\nu}$ for a given $\alpha$. Note that the 'degrees of freedom' $\nu$ is equal to $n-1$.
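As a worked sketch of the whole small-sample procedure (the six measurements are made up for illustration; the bound 2.5706 is the Table 3 value for $\alpha = 0.05$, $\nu = 5$):

```python
import math
from statistics import mean, stdev

# Hypothetical sample of n = 6 measurements (made-up data).
sample = [4.9, 5.2, 4.7, 5.0, 5.4, 4.8]
n = len(sample)

x_bar = mean(sample)       # sample mean
s = stdev(sample)          # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)      # standard error

# From Table 3: t_{alpha/2, nu} = 2.5706 for alpha = 0.05, nu = n - 1 = 5.
t_bound = 2.5706
lower, upper = x_bar - t_bound * se, x_bar + t_bound * se
print(f"95% CI for mu: ({lower:.3f}, {upper:.3f})")
```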
Again, it is often helpful to think of $t_{\alpha/2,\nu}$ as how many units of standard error wide we allow the confidence interval to be. Thus, we can write the confidence interval as $\bar{x} \pm t_{\alpha/2,\nu} \times$ Standard Error, just as we did in the case of the Normal population with known variance. However, we will generally expect a larger confidence interval (i.e. $t_{\alpha/2,\nu} > z_{\alpha/2}$) for a given $\alpha$ due to the additional uncertainty associated with estimating $\sigma$ from the sample data.
For instance, we see in Table 3 for $\alpha = 0.05$ that as $n$ increases and our estimation of $\sigma$ from the sample data becomes more accurate (i.e. $s \to \sigma$), $t_{\alpha/2,\nu}$ approaches the standard Normal bound of $z_{\alpha/2} = 1.96$. Likewise, as $n$ increases for $\alpha = 0.01$, $t_{\alpha/2,\nu}$ approaches the standard Normal bound of $z_{\alpha/2} = 2.575$. Using shorthand notation we can write the confidence interval for the estimate $\bar{x}$ as:

$$\bar{x} \pm t_{\alpha/2,\, n-1}\frac{s}{\sqrt{n}}$$
Figure 32: Computing confidence intervals for parameter estimates that correspond to the mean of a sample of Normal random variables with unknown population variance. First we specify the confidence level $1-\alpha$, which corresponds to the shaded area under the curve centred around our estimate; a) and c) show this for two different sample sizes. To find the corresponding bounds around our estimate, we then use the Student $t$-distribution lookup table: in b) and d) we find the bound $t_{\alpha/2,\nu}$ for the appropriate degrees of freedom $\nu = n-1$.
Sample Mean for Non-Normal Distributions
So far we have considered how to compute confidence intervals for our sample means for a random sample from a Normal distribution. In the case where the population variance $\sigma^2$ was known:

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
And in the case where the population variance is unknown (and must be estimated using the sample variance $S^2$):

$$\bar{x} \pm t_{\alpha/2,\, n-1}\frac{s}{\sqrt{n}}$$
But we noted that for large enough $n$ (e.g. $n \ge 30$) the Student $t$-distribution converges to the standard Normal $\mathcal{N}(0,1)$, and as a result $t_{\alpha/2,\, n-1} \approx z_{\alpha/2}$ (i.e. the intervals become the same).
However, we have also seen that for large sample sizes the sample mean for a non-Normal population distribution follows a sampling distribution that is Normally distributed. For instance, Figure 29 shows that the sampling distribution of the sample means $\bar{X}$, as computed from random samples of the exponential population distribution with mean $\mu = 1/\lambda$ (and therefore variance $\sigma^2 = 1/\lambda^2$), is itself a Normal distribution given by $\mathcal{N}(\mu, \sigma^2/n)$.
Indeed, this was the main result of the Central Limit Theorem at the end of Chapter 3: the sample mean $\bar{X}$, computed from random samples of any population distribution with mean $\mu$ and variance $\sigma^2$, is Normally distributed for large sample sizes $n$:

$$\bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$$
And since the CLT only applies to large $n$, we can use the sample standard deviation $s$ to approximate $\sigma$ without introducing too much further variability into our confidence intervals:

$$\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$$
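Putting this CLT-based recipe together in code (the exponential population and its rate are assumed values for illustration, so the true mean being estimated is $1/\lambda = 0.5$):

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(1)

# Large sample (n = 200) from a non-Normal (exponential) population;
# rate lam = 2.0 is a made-up value, so the true mean is 1/lam = 0.5.
lam, n = 2.0, 200
sample = [random.expovariate(lam) for _ in range(n)]

x_bar, s = mean(sample), stdev(sample)
z = NormalDist().inv_cdf(0.975)        # z_{alpha/2} for a 0.95 confidence level
half_width = z * s / math.sqrt(n)      # z_{alpha/2} * (s / sqrt(n))

lower, upper = x_bar - half_width, x_bar + half_width
print(f"95% CI for mu: ({lower:.3f}, {upper:.3f})")
```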
Confidence Intervals for General Sample Statistics
We can generally define the procedure for constructing confidence intervals for any sample statistic as follows:
Let $X_1, X_2, \ldots, X_n$ be a random sample from a population with an unknown parameter $\theta$. Given a confidence level $1-\alpha$, if $\hat{\theta}_l$ and $\hat{\theta}_u$ are computed from sample statistics with the property that:

$$P\!\left(\hat{\theta}_l \le \theta \le \hat{\theta}_u\right) = 1 - \alpha$$
then we say that $[\hat{\theta}_l, \hat{\theta}_u]$ is a $100(1-\alpha)\%$ confidence interval for $\theta$. Thus, the confidence interval contains the true value of the parameter with the known probability $1-\alpha$.
To illustrate, consider an example of computing the confidence interval for $\mu$ for a random sample from a Normal distribution and a confidence level of $0.95$. With the help of Figure 30 and Table 2 we find that:

$$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$$
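For concreteness, the same calculation in code (the sample mean, known $\sigma$, and $n$ below are made-up example values):

```python
import math
from statistics import NormalDist

# Made-up example values: a Normal sample with known population sigma.
x_bar, sigma, n = 10.3, 2.0, 25
conf_level = 0.95
alpha = 1 - conf_level

z = NormalDist().inv_cdf(1 - alpha / 2)   # ~ 1.96, matching Table 2
half_width = z * sigma / math.sqrt(n)     # 1.96 * 2 / 5 ~ 0.784

print(f"95% CI for mu: ({x_bar - half_width:.3f}, {x_bar + half_width:.3f})")
```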