Two Sample Inferences for Comparing Population Means: $H_{0}: \mu_{X}=\mu_{Y}$

Now suppose that instead of making inferences about a population mean based on a single sample, we want to compare the means of two different populations as estimated by two samples. The most common situation is where we want to test if the means are equal. For example, we may wish to determine whether manufactured goods from two different factories are of equal quality, whether a medical intervention has an effect on patients, or whether the average income is the same in two areas of a country.

Two Normal Populations: $H_{0}: \mu_{X}=\mu_{Y}$, Known Variances

Suppose that the two populations we want to compare are Normal with known variances. That is, we assume that one population follows a Normal distribution with parameters $\left(\mu_{X}, \sigma_{X}^{2}\right)$ and the other with parameters $\left(\mu_{Y}, \sigma_{Y}^{2}\right)$, where both $\sigma_{X}^{2}$ and $\sigma_{Y}^{2}$ are known. Formally, we want to test whether the population means are equal against the alternative that they differ:

$$H_{0}: \mu_{X}=\mu_{Y} \quad H_{1}: \mu_{X} \neq \mu_{Y}$$

We collect a sample $X_{1}, X_{2}, \ldots, X_{n_{X}}$ from the first population, and another sample $Y_{1}, Y_{2}, \ldots, Y_{n_{Y}}$ from the second. Note that the sample sizes need not be equal. We base our inference on the difference between the two mean estimators $\bar{X}$ and $\bar{Y}$.

The main idea behind these tests is rather intuitive: the less the populations overlap, the more likely their means are different. If the difference between the mean estimates of the two samples is large, it is more likely that they have indeed come from two populations with different means, rather than from one underlying population with a common mean. We therefore use the difference between sample means $\bar{X}-\bar{Y}$ as our test statistic$^{1}$, and explicitly derive its sampling distribution below.

$^{1}$ You may be wondering why we are not using the absolute value $|\bar{X}-\bar{Y}|$. We mostly avoid this because we need to know the direction of the difference for one-tailed tests. We know that $\bar{X} \sim N\left(\mu_{X}, \sigma_{X}^{2} / n_{X}\right)$ and $\bar{Y} \sim N\left(\mu_{Y}, \sigma_{Y}^{2} / n_{Y}\right)$ (proof omitted, but it follows similar reasoning to the one in section 6.6.1). Since the sum (or difference) of two independent normal variables is also normal, and the two samples are assumed independent, $\bar{X}-\bar{Y}$ is also normal and therefore:

$$\bar{X}-\bar{Y} \sim N\left(\mu_{X}-\mu_{Y},\; \sigma_{X}^{2} / n_{X}+\sigma_{Y}^{2} / n_{Y}\right)$$

Note that although the means are subtracted, the variances are added together!
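
To see why, recall that for independent random variables, variances add under both sums and differences:

$$\operatorname{Var}(\bar{X}-\bar{Y})=\operatorname{Var}(\bar{X})+(-1)^{2} \operatorname{Var}(\bar{Y})=\frac{\sigma_{X}^{2}}{n_{X}}+\frac{\sigma_{Y}^{2}}{n_{Y}}$$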

We then apply the usual Z-transform to this composite random variable to obtain our test statistic $T$:

$$T=\frac{(\bar{X}-\bar{Y})-\left(\mu_{X}-\mu_{Y}\right)}{\sqrt{\sigma_{X}^{2} / n_{X}+\sigma_{Y}^{2} / n_{Y}}}$$

Under the null hypothesis, $\mu_{X}=\mu_{Y}$ and therefore $\mu_{X}-\mu_{Y}=0$, which significantly simplifies the above expression for the test statistic $T$:

$$T=\frac{\bar{X}-\bar{Y}}{\sqrt{\sigma_{X}^{2} / n_{X}+\sigma_{Y}^{2} / n_{Y}}} \sim N(0,1)$$

Now that we have the null distribution for $T$, we can proceed to compute the rejection regions, or the p-value, as we did for the one-sample Z-test in section 6.6.1.
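
To make the recipe concrete, here is a minimal Python sketch of this Z-test; the function name and the figures in the usage example are our own illustrations, not from the notes:

```python
from math import sqrt
from scipy.stats import norm

def two_sample_z_test(x_bar, y_bar, var_x, var_y, n_x, n_y):
    """Two-sample Z-test for H0: mu_X = mu_Y with known population variances."""
    # Standard error of (X-bar minus Y-bar): the variances add,
    # even though the means are subtracted.
    se = sqrt(var_x / n_x + var_y / n_y)
    t = (x_bar - y_bar) / se            # test statistic, ~ N(0, 1) under H0
    p_two_sided = 2 * norm.sf(abs(t))   # two-tailed p-value
    return t, p_two_sided

# Hypothetical numbers: reject H0 at alpha = 0.05 if the p-value falls below 0.05.
t_stat, p_val = two_sample_z_test(x_bar=5.2, y_bar=4.8,
                                  var_x=1.0, var_y=1.5, n_x=50, n_y=60)
print(f"T = {t_stat:.3f}, p = {p_val:.4f}")
```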

Two Normal Populations: $H_{0}: \mu_{X}=\mu_{Y}$, Unknown Variances: Two-Sample $t$ Test

Previously we assumed that we knew the variances of the two populations exactly. Now suppose we relax this assumption so that the variances of the populations are unknown, but are known to be equal, as is commonly the case. For such problems, we need to account for the uncertainty of the sample variance estimator, just as we did in section 6.6.2, which again hints towards the use of the Student $t$-distribution.

Before we define a $t$-test for two samples, we first observe that, since both samples have the same variance, we can pool them together to improve the accuracy of the estimator. We therefore define the pooled variance estimator to be:

$$S_{\text{pooled}}^{2}=\frac{\sum_{i=1}^{n_{X}}\left(X_{i}-\bar{X}\right)^{2}+\sum_{i=1}^{n_{Y}}\left(Y_{i}-\bar{Y}\right)^{2}}{n_{X}+n_{Y}-2}=\frac{\left(n_{X}-1\right) S_{X}^{2}+\left(n_{Y}-1\right) S_{Y}^{2}}{n_{X}+n_{Y}-2}$$

Note that the degrees of freedom for the pooled estimator, $n_{X}+n_{Y}-2$, equal the sum of the degrees of freedom of the two single-sample estimators, $n_{X}-1$ and $n_{Y}-1$.

The test statistic is then derived similarly to the one in the previous section (replacing $\sigma$ with $S_{\text{pooled}}$) and is defined as:

$$T=\frac{\bar{X}-\bar{Y}}{\sqrt{S_{\text{pooled}}^{2} / n_{X}+S_{\text{pooled}}^{2} / n_{Y}}}=\frac{\bar{X}-\bar{Y}}{S_{\text{pooled}} \sqrt{1 / n_{X}+1 / n_{Y}}} \sim t_{n_{X}+n_{Y}-2}$$

Thus, our test statistic $T$ follows the Student $t$-distribution with $n_{X}+n_{Y}-2$ degrees of freedom.
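
As a sanity check on the algebra, here is a sketch of the pooled test in Python, computed from raw samples (the function name is ours); SciPy's built-in `ttest_ind` with `equal_var=True` implements the same pooled test and should agree:

```python
import numpy as np
from scipy import stats

def pooled_t_test(x, y):
    """Two-sample t-test for H0: mu_X = mu_Y with unknown but equal variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    # Pooled variance: a weighted average of the two sample variances,
    # carrying n_X + n_Y - 2 degrees of freedom.
    s2_pooled = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) \
                / (n_x + n_y - 2)
    t = (x.mean() - y.mean()) / np.sqrt(s2_pooled * (1.0 / n_x + 1.0 / n_y))
    p = 2 * stats.t.sf(abs(t), df=n_x + n_y - 2)  # two-tailed p-value
    return t, p

# SciPy's built-in pooled test should agree: stats.ttest_ind(x, y, equal_var=True)
```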

Example: SOLE Scores

Dislike Dr DiMaggio? Well, now is the time to get back at him with SOLE by giving him all 1's (i.e. 'Definitely Disagree'). Think he did a good job? Help him out by giving him all 5's (i.e. 'Definitely Agree'). No matter how they are filled out, student evaluations do play an important role, and are commonly used as metrics for promotion, course evaluations, etc. However, questions remain as to how effective they actually are at distinguishing good lecturers and courses.

Let's say a veteran lecturer in our Department has decided to perform a systematic study on how a single factor might affect their overall SOLE scores. The Imperial College London EDU course 'Introduction to Teaching for Learning' has stated that "'body language' (e.g. enthusiasm, varying tones of voice, smiling, hand gestures) will impact on levels of engagement". Our decorated lecturer has decided to formally investigate whether this is really true.

To test this, the lecturer keeps the course material exactly the same between the Spring term ($n_{Y}=108$ students) and the previous Autumn term ($n_{X}=130$ students). That is, they use the same lecture notes and lecture slides, and, having listened to all the previous Panopto lectures, they reproduce what was previously said as closely as possible. The only difference is that during the Spring term, the lecturer smiled a lot more, made more hand gestures during delivery, and even cracked a joke or two.

The SOLE scores for both terms are in, and we are particularly interested in the scores under the two categories shown below. For these, students provide a score ranging from 5 ('Definitely Agree') to 1 ('Definitely Disagree'), which are summarised in the table below:

| SOLE Question | Autumn Term ($n_{X}=130$), $x_{i}$ | Spring Term ($n_{Y}=108$), $y_{i}$ |
| --- | --- | --- |
| 'The lecturer generated interest and enthusiasm' | $\bar{x}=4.0$, $s_{X}=1.09$ | $\bar{y}=4.7$, $s_{Y}=0.84$ |
| 'The lecturer explained the material well' | $\bar{x}=3.9$, $s_{X}=1.15$ | $\bar{y}=4.2$, $s_{Y}=1.03$ |

Are these changes in SOLE scores significant, assuming a significance level of $\alpha=0.05$?

Solution: 'The lecturer generated interest and enthusiasm':

We are assuming that the SOLE scores $x_{i}$ for the Autumn term are normally distributed with some mean $\mu_{X}$ and variance $\sigma_{X}^{2}$, which is perfectly reasonable. Similarly, the SOLE scores for the Spring term are believed to be normally distributed with $\mu_{Y}$ and $\sigma_{Y}^{2}$. We assume that the variances are equal, that is $\sigma_{X}^{2}=\sigma_{Y}^{2}=\sigma^{2}$, which again is a fair assumption. We want to test the hypothesis that there was an increase in the SOLE scores when the lecturer is more enthusiastic, which results in the following hypothesis test:

$$H_{0}: \mu_{X}=\mu_{Y} \quad H_{1}: \mu_{X}<\mu_{Y}$$

As described earlier, under the null hypothesis, the test statistic based on the difference of the two sample means follows the Student $t$-distribution:

$$T=\frac{\bar{X}-\bar{Y}}{S_{\text{pooled}} \sqrt{1 / n_{X}+1 / n_{Y}}} \sim t_{n_{X}+n_{Y}-2}$$

We note that the Student $t$-distribution with $n_{X}+n_{Y}-2=236$ degrees of freedom is essentially the standard Normal distribution, so in our lookup table we find a critical value of:

$$-z_{\alpha=0.05}=-1.64$$

Where a negative sign is included because our alternative hypothesis is $\mu_{X}<\mu_{Y}$, so the rejection region lies in the left tail. Therefore we would reject the null hypothesis $H_{0}$ that the SOLE scores did not change if $T<-1.64$. For this particular question, the pooled sample variance is computed as:

$$S_{\text{pooled}}^{2}=\frac{129(1.09)^{2}+107(0.84)^{2}}{130+108-2}$$

Which results in $S_{\text{pooled}}=0.985$. Thus, our resulting test statistic is:

$$T=\frac{4.0-4.7}{0.985 \sqrt{\frac{1}{130}+\frac{1}{108}}}=-5.46$$

Which is a strong rejection of the null hypothesis! The extra enthusiasm significantly boosted this SOLE score, which was what we expected (and what the lecturer hoped!).
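
If you would rather not do the arithmetic by hand, the same one-sided pooled test can be reproduced in SciPy directly from the summary statistics in the table (a sketch; the `alternative` keyword requires SciPy 1.6 or newer):

```python
from scipy.stats import ttest_ind_from_stats

# One-sided pooled t-test straight from the summary statistics.
# alternative='less' tests H1: mu_X < mu_Y.
res = ttest_ind_from_stats(mean1=4.0, std1=1.09, nobs1=130,
                           mean2=4.7, std2=0.84, nobs2=108,
                           equal_var=True, alternative='less')
print(res.statistic, res.pvalue)  # statistic close to -5.46, p-value far below 0.05
```

Swapping in the second row's figures reproduces the next solution in the same way.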

Solution: 'The lecturer explained the material well':

But what about the effect of this additional enthusiasm on other metrics in SOLE? After all, the lectures were delivered exactly the same, word for word and in the same order, with the only difference being additional hand gestures, smiling, etc.

To assess this change, we apply the same procedure as for the previous SOLE question, but using the data collected for this question. A similar calculation shows that $S_{\text{pooled}}=1.097$ and the resulting test statistic is now:

$$T=\frac{3.9-4.2}{1.097 \sqrt{\frac{1}{130}+\frac{1}{108}}}=-2.10$$

Which is still cause to reject the null hypothesis, since $-2.10<-1.64$. Thus, what we gather from these hypothesis tests is that the 'Introduction to Teaching for Learning' course is indeed correct in asserting that body language plays an important role in the perception of effective teaching, as it can influence several additional factors. It also provides a cautionary note that evaluations can sometimes be difficult to interpret.

Two Non-Normal Populations, Large Sample Sizes: $H_{0}: \mu_{X}=\mu_{Y}$

We can generalise the results from sections 6.7.1 and 6.7.2 to means of any distributions by considering the Central Limit Theorem again, as we did in 6.6.3.

In general, when the sample sizes $n_{X}$ and $n_{Y}$ are large, the sample means become approximately normally distributed (according to the Central Limit Theorem), the sample variances become good estimators of the population variances, and we can get away with applying a Z-test. We simply replace $\sigma_{X}^{2}$ and $\sigma_{Y}^{2}$ by their sample estimates:

$$T=\frac{\bar{X}-\bar{Y}}{\sqrt{S_{X}^{2} / n_{X}+S_{Y}^{2} / n_{Y}}} \sim N(0,1)$$
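
In code this is a one-line change from the known-variance sketch above: plug the sample variances into the standard error (again a sketch, with a hypothetical function name):

```python
import numpy as np
from scipy.stats import norm

def large_sample_z_test(x, y):
    """Large-sample Z-test: the known-variance formula with sample variances plugged in."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    t = (x.mean() - y.mean()) / se
    return t, 2 * norm.sf(abs(t))  # T is approximately N(0, 1) under H0
```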

Paired Data for Two Normal Populations: $H_{0}: \mu_{d}=0$, The Paired $t$ Test

In some situations, particularly in the presence of "confounding variables", we may wish to take additional information into account when comparing two populations, and use this to pair our samples. In general, a paired test will have greater power than an unpaired test for the same data.

For example, suppose that we wish to test a drug which is purported to decrease the blood glucose level. We can select a sample from the pre-treatment population and another from the post-treatment population, but this test would be hampered by the fact that blood glucose levels vary substantially within each population. In this situation, it would be sensible to examine the same patients before and after the treatment.

Again, we are interested in testing hypotheses of this form (or similar):

$$H_{0}: \mu_{X}=\mu_{Y} \quad H_{1}: \mu_{X} \neq \mu_{Y}$$

We will base our inference on the differences between the populations, just as we did in 6.6.1 and 6.6.2. However, since our data is necessarily paired, i.e. our dataset is of the form $\left(X_{1}, Y_{1}\right), \ldots,\left(X_{n}, Y_{n}\right)$, we can work with the sample differences directly. That is, instead of working with a paired dataset $(X, Y)$, we work with a dataset of differences $D$:

$$D=\left\{\left(X_{1}-Y_{1}\right),\left(X_{2}-Y_{2}\right), \ldots,\left(X_{n}-Y_{n}\right)\right\}$$

Our null and alternative hypotheses then become:

$$H_{0}: \mu_{d}=0 \quad H_{1}: \mu_{d} \neq 0$$

This allows us to use the methods described in section 6.6 (i.e. $z$- or $t$-tests) to test the hypothesis directly.
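
A sketch of this reduction in Python (the function name is ours; SciPy's `ttest_rel` implements the same paired test):

```python
import numpy as np
from scipy import stats

def paired_t_test(x, y):
    """Paired t-test: a one-sample t-test on the differences D_i = X_i - Y_i."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = len(d)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))  # ~ t_{n-1} under H0: mu_d = 0
    p = 2 * stats.t.sf(abs(t), df=n - 1)         # two-tailed p-value
    return t, p

# Equivalent built-in: stats.ttest_rel(x, y)
```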

Example: ESP Paired Test

Recall that in Chapter 3 we discussed the Normal approximation to the Binomial distribution, and in particular its use in modelling the probability of predicting Zener cards correctly (i.e. the ESP research from 1938). More recent studies have explored the effect of hypnosis on ESP ability, where a student's predictive accuracy is assessed when they are awake versus when they are hypnotised (this naturally leads us to conduct a paired test!). In the study considered here, 15 students were asked to guess the identity of 200 Zener cards: 100 while awake, and 100 while under hypnosis. The number of correct predictions in each case is presented in the table below.

| Student | Correct Predictions Awake ($X_{i}$) | Correct Predictions Under Hypnosis ($Y_{i}$) | $D_{i}=X_{i}-Y_{i}$ |
| --- | --- | --- | --- |
| 1 | 18 | 25 | -7 |
| 2 | 19 | 20 | -1 |
| 3 | 16 | 26 | -10 |
| 4 | 21 | 26 | -5 |
| 5 | 16 | 20 | -4 |
| 6 | 20 | 23 | -3 |
| 7 | 20 | 14 | 6 |
| 8 | 14 | 18 | -4 |
| 9 | 11 | 18 | -7 |
| 10 | 22 | 20 | 2 |
| 11 | 19 | 22 | -3 |
| 12 | 29 | 27 | 2 |
| 13 | 16 | 19 | -3 |
| 14 | 27 | 27 | 0 |
| 15 | 15 | 21 | -6 |

Is there a significant change in ESP capability when the student is under hypnosis (assume $\alpha=0.05$)?

Solution:

We have already computed the difference terms, $D_{i}$, in the table above for convenience. For the paired test, we use the following:

$$H_{0}: \mu_{d}=0 \quad H_{1}: \mu_{d} \neq 0$$

Thus, we find the test statistic and the corresponding null distribution are:

$$T=\frac{\bar{D}-\mu_{d}}{S_{D} / \sqrt{n}}=\frac{\bar{D}}{S_{D} / \sqrt{n}} \sim t_{n-1}$$

Where $n=15$. We can easily compute $\bar{D}=-2.867$ from the above table, and $S_{D}$ is computed using the formula for the sample standard deviation:

$$S_{D}=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(D_{i}-\bar{D}\right)^{2}}=4.138$$

Since our alternative hypothesis is two-sided, we look up the critical value for the two-tailed test (i.e. using $\alpha / 2=0.025$) from the Student $t$-distribution table:

$$P\left(T \geq t_{0.025,14}\right)=0.025 \rightarrow c=t_{0.025,14}=2.1448$$

Our resulting decision rule is to reject the null hypothesis if $|T|>2.145$. Calculating the test statistic, we find:

$$T=\frac{-2.867}{4.138 / \sqrt{15}}=-2.683$$

Since $|T|=2.683>2.145$, we reject the null hypothesis and claim that hypnosis does have a significant effect on ESP ability!
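
As a final sanity check, the whole computation can be reproduced with SciPy's paired test on the raw counts from the table (a sketch):

```python
from scipy.stats import ttest_rel

awake    = [18, 19, 16, 21, 16, 20, 20, 14, 11, 22, 19, 29, 16, 27, 15]
hypnosis = [25, 20, 26, 26, 20, 23, 14, 18, 18, 20, 22, 27, 19, 27, 21]

# Paired t-test on the 15 differences; equivalent to the manual calculation above.
res = ttest_rel(awake, hypnosis)
print(res.statistic, res.pvalue)  # statistic close to -2.683, p-value around 0.018 < 0.05
```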