Two-Sample Inferences for Comparing Population Means
Now suppose that instead of making inferences about a population mean based on a single sample, we want to compare the means of two different populations as estimated by two samples. The most common situation is where we want to test if the means are equal. For example, we may wish to determine whether manufactured goods from two different factories are of equal quality, whether a medical intervention has an effect on patients, or whether the average income is the same in two areas of a country.
Two Normal Populations, Known Variances: The Two-Sample Z-Test
Suppose that the two populations we want to compare are Normal with known variances. That is, we assume that one of the populations follows a Normal distribution with parameters $(\mu_X, \sigma_X^2)$ and the other with parameters $(\mu_Y, \sigma_Y^2)$, where both $\sigma_X^2$ and $\sigma_Y^2$ are known. Formally, we want to test the null hypothesis that the population means are equal against the alternative that they are different, say:

$$H_0: \mu_X = \mu_Y \qquad \text{vs.} \qquad H_1: \mu_X \neq \mu_Y$$
We collect a sample $X_1, \dots, X_n$ from the first population, and another sample $Y_1, \dots, Y_m$ from the second. Note that the sample sizes $n$ and $m$ need not be equal. We base our inference on the difference between the two mean estimators $\bar{X}$ and $\bar{Y}$.
The main idea behind these tests is rather intuitive: the less the populations overlap, the more likely it is that their means differ. If the difference between the mean estimates of the two samples is large, it is more likely that they have indeed come from two populations with different means, rather than from one underlying population with a single mean. We therefore use this difference between sample means, $\bar{X} - \bar{Y}$, as our test statistic, and explicitly derive its sampling distribution below.
You may be wondering why we are not using the absolute value $|\bar{X} - \bar{Y}|$. We mostly do this because we need to know the direction of the difference for one-tailed tests. We know that $\bar{X} \sim N(\mu_X, \sigma_X^2/n)$ and $\bar{Y} \sim N(\mu_Y, \sigma_Y^2/m)$ (proof omitted, but it follows similar reasoning to that in section 6.6.1). Since the sum of two Normal random variables is also Normal, we know that $\bar{X} - \bar{Y}$ is also Normal, and therefore:

$$\bar{X} - \bar{Y} \sim N\left(\mu_X - \mu_Y,\ \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)$$
Note that although the means are subtracted, the variances are added together!
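To see concretely that the variances add, here is a small simulation sketch (our own illustration, with arbitrarily chosen parameters, not values from the notes) comparing the empirical variance of $\bar{X} - \bar{Y}$ with $\sigma_X^2/n + \sigma_Y^2/m$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary illustrative parameters (not from the notes):
mu_x, sigma_x, n = 5.0, 2.0, 30
mu_y, sigma_y, m = 3.0, 1.5, 40

# Many replications of the difference of sample means.
xbar = rng.normal(mu_x, sigma_x, size=(100_000, n)).mean(axis=1)
ybar = rng.normal(mu_y, sigma_y, size=(100_000, m)).mean(axis=1)
diffs = xbar - ybar

print(diffs.var())                      # empirical variance of Xbar - Ybar
print(sigma_x**2 / n + sigma_y**2 / m)  # theoretical value: the variances add
```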
We then apply the usual Z-transform to this composite random variable to obtain our test statistic $Z$:

$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} \sim N(0, 1)$$
Under the null hypothesis, $\mu_X = \mu_Y$ and therefore $\mu_X - \mu_Y = 0$, which significantly simplifies the above expression for the test statistic $Z$:

$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} \sim N(0, 1)$$
Now that we have the null distribution for $Z$, we can proceed to compute the rejection regions or the p-value, as we did for the one-sample Z-test in section 6.6.1.
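As a minimal sketch of the full procedure (assuming NumPy and SciPy are available; the function name and interface are our own choices, not part of the notes), the two-sample Z-test can be implemented as:

```python
import numpy as np
from scipy import stats

def two_sample_z_test(x, y, var_x, var_y):
    """Two-sample Z-test for equal means when the population variances are known."""
    x, y = np.asarray(x), np.asarray(y)
    n, m = len(x), len(y)
    # Under H0 (equal means), Z follows the standard Normal distribution.
    z = (x.mean() - y.mean()) / np.sqrt(var_x / n + var_y / m)
    p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
    return z, p
```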
Two Normal Populations, Unknown but Equal Variances: The Two-Sample $t$-Test
Previously we assumed that we knew the variances of the two populations exactly. Now suppose we relax this assumption: the variances of the populations are unknown, but are known to be equal (i.e. $\sigma_X^2 = \sigma_Y^2 = \sigma^2$), as is commonly the case. For these problems, we need to consider the uncertainty of the sample variance estimator, just as we did in section 6.6.2, which again hints towards the use of the Student $t$-distribution.
Before we define a t-test for two samples, we first observe that, since both samples have the same variance, we can pool them together to improve the accuracy of the estimator. Therefore we define the pooled variance estimator $S_p^2$ to be:

$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n + m - 2}$$
Note that the degrees of freedom for the pooled estimator, $n + m - 2$, are equal to the sum of the degrees of freedom of the two single-sample estimators, $(n - 1)$ and $(m - 1)$.
The test statistic $T$ is thus derived similarly to the one in section 6.6.2 (replacing $\sigma$ with $S_p$) and is defined as:

$$T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$$
Thus, under the null hypothesis, our test statistic $T$ follows the Student $t$-distribution with $n + m - 2$ degrees of freedom.
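A corresponding sketch for the pooled two-sample $t$-test (again our own illustration, not code from the notes):

```python
import numpy as np
from scipy import stats

def pooled_t_test(x, y):
    """Two-sample t-test for equal means, assuming equal but unknown variances."""
    x, y = np.asarray(x), np.asarray(y)
    n, m = len(x), len(y)
    # Pooled variance estimator with n + m - 2 degrees of freedom.
    sp2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n + 1 / m))
    p = 2 * stats.t.sf(abs(t), df=n + m - 2)  # two-tailed p-value
    return t, p
```

SciPy's built-in `stats.ttest_ind(x, y, equal_var=True)` performs the same pooled test and can be used as a cross-check.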
Example: SOLE Scores
Dislike Dr DiMaggio? Well, now is the time to get back at him with SOLE by giving him all 1's (i.e. 'Definitely Disagree'). Think he did a good job? Help him out by giving him all 5's (i.e. 'Definitely Agree'). No matter how they are filled out, student evaluations do play an important role, and are commonly used as metrics for promotion, course evaluation, etc. However, questions remain as to how effective they actually are at distinguishing good lecturers and courses.
Let's say a veteran lecturer in our Department has decided to perform a systematic study of how a single factor might affect their overall SOLE scores. The Imperial College London EDU course 'Introduction to Teaching for Learning' has stated that 'body language (e.g. enthusiasm, varying tones of voice, smiling, hand gestures) will impact on levels of engagement'. Our decorated lecturer has decided to formally investigate whether this is really true or not.
To test this, the lecturer kept the course material exactly the same between the Spring term ($m$ students) and the previous Autumn term ($n$ students). That is, they used the same lecture notes and lecture slides, and listened to all previous Panopto recordings in order to reproduce what was previously said as closely as possible. The only difference is that during the Spring term, the lecturer smiled a lot more, made more hand gestures during delivery, and even cracked a joke or two.
The SOLE scores for both terms are in, and we are particularly interested in the scores for two questions. For each, students could provide a score ranging from 5 ('Definitely Agree') to 1 ('Definitely Disagree'), summarised in the table below:
| SOLE Question | Autumn Term | Spring Term |
|---|---|---|
| 'The lecturer generated interest and enthusiasm' | | |
| 'The lecturer explained the material well' | | |
Are these changes in SOLE scores significant, assuming a significance level of $\alpha$?
Solution: 'The lecturer generated interest and enthusiasm':
We are assuming that the SOLE scores for the Autumn term are Normally distributed with some mean $\mu_A$ and variance $\sigma_A^2$, which is perfectly reasonable. Similarly, the SOLE scores for the Spring term are believed to be Normally distributed with mean $\mu_S$ and variance $\sigma_S^2$. We assume that the variances are equal, that is $\sigma_A^2 = \sigma_S^2 = \sigma^2$, which again is a fair assumption. We want to test the hypothesis that there was an increase in the SOLE scores when the lecturer is more enthusiastic, which results in the following one-tailed hypothesis test:

$$H_0: \mu_A = \mu_S \qquad \text{vs.} \qquad H_1: \mu_A < \mu_S$$
As described earlier, under the null hypothesis, the standardised difference of the mean estimators for the two samples follows the Student $t$-distribution:

$$T = \frac{\bar{A} - \bar{S}}{S_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} \sim t_{n+m-2}$$
We note that the Student $t$-distribution with $n + m - 2$ degrees of freedom is, for classes of this size, essentially the standard Normal distribution, so in our lookup table we find a critical value of $-z_{\alpha}$, where the negative sign is included because we are testing the one-tailed hypothesis that $\mu_A < \mu_S$. Therefore, we would reject the null hypothesis that the SOLE scores did not change if $T < -z_{\alpha}$. For this particular question, the pooled sample standard deviation is computed as:

$$S_p = \sqrt{\frac{(n-1)S_A^2 + (m-1)S_S^2}{n + m - 2}}$$
Substituting the sample summaries into these formulas gives $S_p$ and a test statistic $T$ far below the critical value, which is a strong rejection of the null hypothesis! The extra enthusiasm significantly boosted this SOLE score, which was what we expected (and what the lecturer hoped for!).
Solution: 'The lecturer explained the material well':
But what about the effect of this additional enthusiasm on other metrics in SOLE? After all, the lectures were delivered exactly the same, word for word and in the same order, with the only difference being the additional hand gestures, smiling, etc.
To assess this change, we apply the same procedure as for the previous SOLE question, but using the data collected for this question. A similar calculation gives the pooled standard deviation $S_p$ and a test statistic that again falls below the critical value, which is still cause to reject the null hypothesis. Thus, what we gather from these hypothesis tests is that the 'Introduction to Teaching for Learning' course is indeed correct in asserting that body language plays an important role in the perception of effective teaching, as it can influence several additional factors. It also provides a cautionary note that evaluations can sometimes be difficult to interpret: a change in delivery style alone shifted the scores for how well the material was explained.
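Since the raw SOLE scores are not reproduced here, the following sketch only illustrates how such a one-tailed pooled test could be run from summary statistics; every number below is a hypothetical placeholder, not the actual survey data:

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder summaries (NOT the real SOLE results):
n, mean_a, var_a = 100, 3.8, 0.9   # Autumn term: size, sample mean, sample variance
m, mean_s, var_s = 120, 4.2, 0.8   # Spring term
alpha = 0.05                        # assumed significance level

# Pooled variance and one-tailed test statistic for H1: mu_A < mu_S.
sp2 = ((n - 1) * var_a + (m - 1) * var_s) / (n + m - 2)
t = (mean_a - mean_s) / np.sqrt(sp2 * (1 / n + 1 / m))

# Reject H0 if T falls below the negative one-tailed critical value.
t_crit = stats.t.ppf(1 - alpha, df=n + m - 2)
print(f"T = {t:.3f}, reject H0: {t < -t_crit}")
```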
Two Non-Normal Populations, Large Sample Sizes
We can generalise the results from sections 6.7.1 and 6.7.2 to the means of any distributions by considering the Central Limit Theorem again, as we did in section 6.6.3.

In general, when the sample sizes are large, the sample means become Normally distributed (according to the Central Limit Theorem), the sample variances become good estimators of the population variances, and we can get away with applying a Z-test. We simply replace $\sigma_X^2$ and $\sigma_Y^2$ by their sample estimates $S_X^2$ and $S_Y^2$:

$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} \approx N(0, 1)$$
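The large-sample version differs from the known-variance sketch above only in that the variances are estimated from the data (again a sketch, with our own function name):

```python
import numpy as np
from scipy import stats

def large_sample_z_test(x, y):
    """Two-sample Z-test with population variances replaced by sample estimates."""
    x, y = np.asarray(x), np.asarray(y)
    n, m = len(x), len(y)
    # By the CLT, Z is approximately standard Normal for large n and m.
    z = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / m)
    return z, 2 * stats.norm.sf(abs(z))  # two-tailed p-value
```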
Paired Data for Two Normal Populations: The Paired $t$-Test
In some situations, particularly in the presence of "confounding variables", we may wish to take additional information into account when comparing two populations, and use this to pair our samples. In general, a paired test will have greater power than an unpaired test for the same data.
For example, suppose that we wish to test a drug which is purported to decrease the blood glucose level. We can select a sample from the pre-treatment population and another from the post-treatment population, but this test would be hampered by the fact that blood glucose levels vary substantially within each population. In this situation, it would be sensible to examine the same patients before and after the treatment.
Again, we are interested in answering hypotheses of this form (or similar):

$$H_0: \mu_X = \mu_Y \qquad \text{vs.} \qquad H_1: \mu_X \neq \mu_Y$$
We will base our inference on the differences between the populations, just as we did in sections 6.6.1 and 6.6.2. However, since our data is necessarily paired, i.e. our dataset is of the form $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$, we can work with the sample differences directly. That is, instead of the paired dataset, we work with a dataset of differences $D_1, \dots, D_n$, where:

$$D_i = X_i - Y_i, \qquad i = 1, \dots, n$$
Our null and alternative hypotheses then become:

$$H_0: \mu_D = 0 \qquad \text{vs.} \qquad H_1: \mu_D \neq 0$$
This allows us to use the methods described in section 6.6 (i.e. the one-sample z- or t-tests) to test the significance directly.
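In code, the paired test is therefore just the one-sample $t$-test applied to the differences (a sketch assuming SciPy; the function name is our own):

```python
import numpy as np
from scipy import stats

def paired_t_test(x, y):
    """Paired t-test: a one-sample t-test on the differences D_i = X_i - Y_i."""
    d = np.asarray(x) - np.asarray(y)
    return stats.ttest_1samp(d, popmean=0.0)  # tests H0: mu_D = 0 (two-tailed)
```

This is equivalent to SciPy's built-in `stats.ttest_rel(x, y)`.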
Example: ESP Paired Test
Recall that in Chapter 3 we discussed the Normal approximation to the Binomial distribution, and in particular its use in modelling the probability of predicting Zener cards correctly (i.e. the ESP research from 1938). More recent studies have explored the effect of hypnosis on ESP ability, where each student's predictive accuracy is assessed when they are awake versus when they are hypnotised (this naturally leads us to conduct a paired test!). In the study considered here, 15 students were asked to guess the identity of 200 Zener cards: 100 while awake, and 100 while under hypnosis. The number of correct predictions in each case is presented in the table below.
| Student | Awake | Hypnotised | $D_i$ |
|---|---|---|---|
| 1 | 18 | 25 | -7 |
| 2 | 19 | 20 | -1 |
| 3 | 16 | 26 | -10 |
| 4 | 21 | 26 | -5 |
| 5 | 16 | 20 | -4 |
| 6 | 20 | 23 | -3 |
| 7 | 20 | 14 | 6 |
| 8 | 14 | 18 | -4 |
| 9 | 11 | 18 | -7 |
| 10 | 22 | 20 | 2 |
| 11 | 19 | 22 | -3 |
| 12 | 29 | 27 | 2 |
| 13 | 16 | 19 | -3 |
| 14 | 27 | 27 | 0 |
| 15 | 15 | 21 | -6 |
Is there a significant change in ESP capability when the student is under hypnosis (assume $\alpha = 0.05$)?
Solution:
We have already computed the difference terms, $D_i$, in the table above for convenience. For the paired test, we use the following hypotheses:

$$H_0: \mu_D = 0 \qquad \text{vs.} \qquad H_1: \mu_D \neq 0$$
Thus, we find the test statistic and the corresponding null distribution to be:

$$T = \frac{\bar{D}}{S_D / \sqrt{n}} \sim t_{n-1}$$
where $n = 15$ is the number of students. We can easily compute $\bar{D} = -43/15 \approx -2.87$ from the above table, and $S_D$ is computed using the formula for the sample standard deviation:

$$S_D = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (D_i - \bar{D})^2} \approx 4.14$$
Since our alternative hypothesis is two-sided (i.e. $\mu_D \neq 0$), we look up the critical value for the two-tailed test (i.e. using $\alpha/2 = 0.025$) from the Student $t$-distribution table with $n - 1 = 14$ degrees of freedom:

$$t_{0.025,\,14} = 2.145$$
Our resulting decision rule is to reject the null hypothesis if $|T| > 2.145$. Calculating the test statistic, we find:

$$T = \frac{\bar{D}}{S_D / \sqrt{n}} = \frac{-2.87}{4.14/\sqrt{15}} \approx -2.68$$
Since $|T| = 2.68 > 2.145$, we reject the null hypothesis and claim that hypnosis does have a significant effect on ESP ability!
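As a quick numerical cross-check (a sketch assuming NumPy and SciPy are available), the whole calculation can be reproduced from the table:

```python
import numpy as np
from scipy import stats

# Correct guesses out of 100 cards, taken from the table above.
awake      = np.array([18, 19, 16, 21, 16, 20, 20, 14, 11, 22, 19, 29, 16, 27, 15])
hypnotised = np.array([25, 20, 26, 26, 20, 23, 14, 18, 18, 20, 22, 27, 19, 27, 21])

d = awake - hypnotised                            # paired differences D_i
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))  # approx -2.68
print(t)

# The built-in paired test gives the same statistic and a two-tailed p-value:
print(stats.ttest_rel(awake, hypnotised))
```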