Type I and Type II Errors, Power of the Test
Whenever we reject a null hypothesis, we are claiming that, beyond a reasonable doubt, the null hypothesis is false. However, even highly improbable events do occur occasionally. In hypothesis testing we can make two types of mistakes, which are summarised in the table below.
| | Reject $H_0$ | Do not reject $H_0$ |
|---|---|---|
| $H_0$ is True | Type I error | no error |
| $H_1$ is True | no error | Type II error |
Clearly from the table above, the Type I error occurs when the null hypothesis is rejected when it is actually true. We have already made attempts to safeguard against Type I errors by defining a rejection region a priori for the test statistic according to the specified significance level (e.g. $\alpha = 0.05$). Therefore, the probability of such an error is completely established by definition:

$$P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha$$
Computing the probability of the Type II error is often tricky, as in order to do this we need to specify the sampling distribution for the test statistic under the alternative hypothesis $H_1$. We indicate this probability, $P(\text{Type II error})$, with the Greek letter $\beta$. An important concept related to the Type II error probability is the power of the test, which we define as $1 - P(\text{Type II error}) = 1 - \beta$.

Type I error: the null hypothesis is rejected when it is actually true. $P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha$, where $\alpha$ is the significance level of the test.
Type II error: the null hypothesis is not rejected when the alternative hypothesis is true. $P(\text{Type II error}) = P(\text{do not reject } H_0 \mid H_1 \text{ true}) = \beta$.
Power of the test: defined as $1 - P(\text{Type II error}) = 1 - \beta$.
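As a quick sketch of these definitions (not part of the original notes), the short Python simulation below estimates both error rates empirically for a binomial test; the sample size, rejection region, and alternative value of $p$ are borrowed from the coin-flip example discussed further down.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, n_sims = 20, 100_000  # flips per experiment, number of simulated experiments

def reject(heads):
    # A priori rejection region T >= 15 (see the coin-flip example below)
    return heads >= 15

# Type I error rate: rejecting H0 when H0 is true (p = 0.5)
alpha_hat = reject(rng.binomial(n, 0.5, n_sims)).mean()

# Type II error rate: failing to reject H0 when H1 is true (here p = 0.75)
beta_hat = (~reject(rng.binomial(n, 0.75, n_sims))).mean()

print(f"alpha ~ {alpha_hat:.3f}")     # about 0.021
print(f"beta ~ {beta_hat:.3f}")       # about 0.383
print(f"power ~ {1 - beta_hat:.3f}")  # about 0.617
```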
In general, we consider the Type I error to be the more severe of the two.
When designing an experiment, we first fix the significance level of the test (e.g. $\alpha = 0.05$) and then attempt to maximise the power of the test.
Figure 36: Type I and Type II errors for the coin flip example consisting of $N = 20$ trials for a significance level $\alpha = 0.05$. The null hypothesis is that the coin is fair ($H_0$: $P(\text{heads}) = 0.5$), and the alternative hypothesis is that the coin is heads-biased ($H_1$: $P(\text{heads}) > 0.5$); for illustrative purposes, we have plotted the sampling distribution of the test statistic for a specific alternative hypothesis, $P(\text{heads}) = 0.75$. Thus, the Type II error is given by the sum of the sampling distribution of the alternative hypothesis over all outcomes of the test statistic for which we do not reject $H_0$ (i.e. the sum of $P(T = t \mid p = 0.75)$ for $t \le 14$); this results in $\beta = 0.383$.
The easiest way to illustrate these ideas is again by using an example. Let us revisit the composite hypothesis test that a coin is specifically biased towards heads. Recall our null hypothesis is $H_0$: $P(\text{heads}) = 0.5$, and our alternative hypothesis is $H_1$: $P(\text{heads}) > 0.5$. Of course, there are infinitely many sampling distributions for our alternative hypothesis, since the (unknown) true value of the parameter $p$ could in theory be any value such that $p > 0.5$.
Imagine, though, that we were somehow given the true value of the distribution parameter $p$. As an example, assume that the coin lands heads for 3 out of every 4 flips (i.e. $p = 0.75$). Figure 36 shows a plot of the null distribution and the sampling distribution for the alternative hypothesis under consideration ($p = 0.75$). Here we have selected a significance level of $\alpha = 0.05$, which results in a rejection region $T \ge 15$; thus we reject the null hypothesis if our test statistic is found in this region. However, we have also plotted the sampling distribution for the alternative hypothesis, and we see that there are several outcomes for which we accept the null hypothesis (i.e. $T \le 14$) but have committed a Type II error, since the alternative hypothesis is true! Summing our sampling distribution for the alternative hypothesis over all values $t \le 14$ results in $P(\text{Type II error}) = \beta = 0.383$, and a power of the test equal to $1 - \beta = 0.617$.
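The figures quoted above can be reproduced exactly from the binomial distribution; the sketch below (a possible implementation using scipy.stats, not taken from the notes) first recovers the rejection region implied by $\alpha = 0.05$ and then evaluates $\beta$ and the power against the alternative $p = 0.75$.

```python
from scipy.stats import binom

n, alpha = 20, 0.05
p_null, p_alt = 0.5, 0.75

# Smallest cutoff c with P(T >= c | H0) <= alpha; the rejection region is T >= c
c = min(k for k in range(n + 1) if binom.sf(k - 1, n, p_null) <= alpha)

# Type II error: probability of landing outside the rejection region under H1
beta = binom.cdf(c - 1, n, p_alt)  # P(T <= c - 1 | p = 0.75)

print(c)                   # 15
print(round(beta, 3))      # 0.383
print(round(1 - beta, 3))  # 0.617
```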
NOTE: There are a number of factors that influence the power of a test, $1 - \beta$. We can see from Figure 36 that if we decrease the rejection-region cutoff, the power would in turn increase; however, we would never do this in practice, since it also inflates $\alpha$! An experimenter should always set a significance level in the neighborhood of $\alpha = 0.05$ or less to safeguard against Type I errors.
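To make this trade-off concrete, here is a small sketch (same assumed coin-flip setup as above) that tabulates $\alpha$ and power for a few candidate cutoffs; lowering the cutoff does buy power, but only at the cost of an inflated Type I error rate.

```python
from scipy.stats import binom

n, p_null, p_alt = 20, 0.5, 0.75
for cutoff in (15, 14, 13):  # reject H0 when T >= cutoff
    alpha = binom.sf(cutoff - 1, n, p_null)  # P(Type I error)
    power = binom.sf(cutoff - 1, n, p_alt)   # 1 - beta
    print(f"T >= {cutoff}: alpha = {alpha:.3f}, power = {power:.3f}")
# T >= 15: alpha = 0.021, power = 0.617
# T >= 14: alpha = 0.058, power = 0.786
# T >= 13: alpha = 0.132, power = 0.898
```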
Generally speaking, the power of the experiment can be maximised by sampling a sufficient number of data points $N$, as the standard error (SE) of our sample means was found to vary inversely with the square root of the sample size: $\mathrm{SE} = \sigma / \sqrt{N}$. Thus, the effect of decreasing the SE is to increase the power $1 - \beta$, but one should never keep collecting samples until a test comes out significant; in practice, $\alpha$ and $N$ should be parameters set before conducting the experiment.
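Finally, a short sketch of the sample-size effect (again assuming the coin-flip setup): for each $N$ we fix $\alpha = 0.05$, derive the rejection region a priori, and evaluate the resulting power against $p = 0.75$.

```python
from scipy.stats import binom

p_null, p_alt, alpha = 0.5, 0.75, 0.05
for n in (20, 40, 80):
    # Rejection region T >= c, chosen a priori from the significance level
    c = min(k for k in range(n + 1) if binom.sf(k - 1, n, p_null) <= alpha)
    power = binom.sf(c - 1, n, p_alt)
    print(f"N = {n}: reject if T >= {c}, power = {power:.3f}")
# Power grows with N: roughly 0.62 at N = 20, and close to 1 by N = 80.
```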