The P-value

In the previous sections we have established the decision rule and used this to quantify the evidence against $H_0$; that is, we have selected a significance level $\alpha$ and defined a critical region $R$ for rejecting the null hypothesis for a given test statistic before any data are collected. An alternative strategy would be to calculate a P-value:

The P-value associated with an observed test statistic is the probability of getting a value for that test statistic as extreme as, or more extreme than, what was actually observed in the experiment (relative to $H_1$), given that $H_0$ is true.

In other words, it is the smallest significance level $\alpha$ at which the null hypothesis would be rejected:

If P-value $\leq \alpha$, then $H_0$ can be rejected at significance level $\alpha$.
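The rejection rule above can be sketched as a one-line check (the helper name `reject_h0` and the example numbers are illustrative, not from the text):

```python
# Reject H0 at significance level alpha exactly when the P-value
# is at or below alpha.
def reject_h0(p_value: float, alpha: float) -> bool:
    return p_value <= alpha

print(reject_h0(0.03, 0.05))  # True: P-value below the threshold
print(reject_h0(0.50, 0.05))  # False: insufficient evidence against H0
```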

As with the significance level, the smaller the P-value, the stronger the evidence against $H_0$. Unlike the significance level, the P-value reflects both the certainty about $H_0$ and the power of the test: a large P-value can arise either because the null hypothesis is true or because the power of the test is low.

Figure 37: Illustration of the P-value calculation for the coin-flip example where $H_0: \theta = 0.5$, $H_1: \theta \neq 0.5$. Since the alternative hypothesis is two-sided, we are performing a two-tailed test. Thus, the P-value is computed by summing $P(T \leq 8)$ and $P(T \geq 12)$ (labelled in red in the null distribution).

Suppose that we again were testing whether a coin is fair (that is, $H_0: \theta = 0.5$, $H_1: \theta \neq 0.5$). We toss the coin twenty times and observe 8 heads (i.e. the dataset for the MLE of the Bernoulli distribution in the previous chapter). Using the number of heads as a test statistic, we again know that $T \sim \operatorname{Binomial}(20, 0.5)$ under the null hypothesis. From Table 4 and Figure 37, we see that the probability of getting eight or fewer heads, under the null hypothesis, is $P(T \leq 8) \approx 0.25$.
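This lower-tail probability can be checked directly from the binomial probability mass function; a minimal sketch using only the standard library:

```python
from math import comb

# P(T <= 8) for T ~ Binomial(20, 0.5): the probability of observing
# eight or fewer heads in twenty tosses of a fair coin.
n, theta0 = 20, 0.5
p_lower = sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k) for k in range(9))
print(round(p_lower, 4))  # 0.2517, i.e. approximately 0.25
```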

Since we are interested in the two-sided alternative $H_1$, which corresponds to a two-tailed test, we must also include the symmetric upper tail $P(T \geq 12)$ in the rejection region. This means that our P-value, the lowest significance threshold at which we would reject the null hypothesis, is $P = P(T \leq 8) + P(T \geq 12) \approx 0.5$ (summarised in Figure 37). This is quite a large P-value: it is equivalent to saying that if we choose to reject the null hypothesis given this data, we should be willing to accept that, were $H_0$ true, we would be wrong (i.e. commit a Type I error) half of the time!
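The full two-tailed calculation can be sketched in the same way, summing both tails of the null distribution (by symmetry of $\operatorname{Binomial}(20, 0.5)$, the two tails are equal):

```python
from math import comb

n, theta0, t_obs = 20, 0.5, 8  # 8 heads observed in 20 tosses

def binom_pmf(k):
    """P(T = k) under the null distribution Binomial(n, theta0)."""
    return comb(n, k) * theta0**k * (1 - theta0)**(n - k)

p_lower = sum(binom_pmf(k) for k in range(t_obs + 1))         # P(T <= 8)
p_upper = sum(binom_pmf(k) for k in range(n - t_obs, n + 1))  # P(T >= 12)
p_value = p_lower + p_upper

print(round(p_value, 3))  # 0.503, i.e. approximately 0.5
```

A ready-made equivalent is `scipy.stats.binom_test` (or `binomtest` in recent SciPy versions); the hand-rolled version above just makes the two-tail sum explicit.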