T-test and P-value in A/B Testing
Wondering if a change measured in an A/B test is real or just random variation? Remember, there is a real cost to running an A/B test when your experiment has a negative result, so you don't want to run it for too long; the longer it runs, the more money you could be losing. T-tests and p-values are designed to help with this. Just remember that statistical significance has no single, universal cutoff; at the end of the day it has to be a judgment call.
What is a T-test?
As explored before, a t-test uses the t-statistic to measure the difference between two groups, which in this context are usually called the control group and the treatment group. The t-statistic tells us how large the change between the two groups is once you adjust for the random variation in the data; its formula is based on the standard error, which captures that variation. Interpreting a t-statistic normally goes like this (a worked calculation is sketched after the list below):
- High t-value (in absolute terms): Likely a real difference.
- Low t-value: Not much difference.
- Positive or negative t-value: Indicates direction of change.
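To make the formula concrete, here's a minimal sketch of computing the pooled-variance (two-sample) t-statistic by hand with NumPy; the simulated groups and their parameters are just illustrative placeholders:

import numpy as np
from scipy import stats

# Illustrative stand-ins for control (A) and treatment (B) measurements
A = np.random.normal(25.0, 5.0, 1000)
B = np.random.normal(26.0, 5.0, 1000)

n_A, n_B = len(A), len(B)
mean_difference = A.mean() - B.mean()

# Pooled variance and standard error (assumes equal variances,
# which is also scipy's default for ttest_ind)
pooled_var = ((n_A - 1) * A.var(ddof=1) + (n_B - 1) * B.var(ddof=1)) / (n_A + n_B - 2)
standard_error = np.sqrt(pooled_var * (1 / n_A + 1 / n_B))

t_statistic = mean_difference / standard_error
print(t_statistic, stats.ttest_ind(A, B).statistic)  # the two values should agree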
Different tests suit different situations. While t-tests work well for roughly normally distributed data like amounts spent, you might use other tests for other types of data (a short example follows this list):
- Fisher's Exact Test: For click-through rates.
- E-test: For counts per user, such as transactions or web page views.
- Chi-squared Test: For order quantities.
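As a quick illustration of one alternative, here's a minimal sketch of comparing click-through rates with scipy's Fisher's exact test; the click and impression counts are made-up numbers:

from scipy.stats import fisher_exact

# Hypothetical click-through data: [clicks, non-clicks] for each variant
contingency_table = [[30, 970],   # variant A: 30 clicks out of 1,000 impressions
                     [45, 955]]   # variant B: 45 clicks out of 1,000 impressions

odds_ratio, p_value = fisher_exact(contingency_table)
print(p_value)  # a low p-value suggests a real difference in click-through rate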
What's a P-value?
While the t-statistic tells us the size and direction of a change, the p-value tells us how likely we would be to see a difference at least this large if there were actually no real difference between the control and the treatment's behavior, i.e., the probability of the observed result under the null hypothesis. It's a probability score. A low p-value, often below 0.05, suggests that the observed change is probably not just due to random chance. P-value Rules of Thumb (a sketch connecting the t-statistic and p-value follows the list):
- Low p-value (< 0.05): Likely a real effect.
- High p-value: Might just be random noise.
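To show how the p-value is tied to the t-statistic, here's a minimal sketch that recovers the two-sided p-value from the t distribution; the degrees-of-freedom formula assumes the pooled (equal-variance) test that scipy uses by default:

import numpy as np
from scipy import stats

A = np.random.normal(25.0, 5.0, 1000)
B = np.random.normal(26.0, 5.0, 1000)

t_statistic, p_value = stats.ttest_ind(A, B)

# Recompute the two-sided p-value directly from the t distribution
degrees_of_freedom = len(A) + len(B) - 2  # pooled (equal-variance) test
p_manual = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)

print(p_value, p_manual)  # the two values should agree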
Neither t-tests nor p-values give an absolute answer. You have to decide what's acceptable for your situation. It's important to share this with everyone involved so they know there's always some risk of random error.
Analyzing T-statistics and P-values with Python
We can simulate experimental data to see how t-statistics and p-values work. Here's an example using Python:
import numpy as np
from scipy import stats
# Simulate a control group (A) and a treatment group (B) with different means
A = np.random.normal(25.0, 5.0, 10000)
B = np.random.normal(26.0, 5.0, 10000)
# Run an independent two-sample t-test
stats.ttest_ind(A, B)
Ttest_indResult(statistic=-13.474352083263295, pvalue=3.3501540587927287e-41)
- T-statistic: Measures the difference between two sets expressed in units of standard error, adjusted for data variance. A high absolute t-value suggests a significant difference.
- P-value: Measures the probability of seeing a t-value at least this extreme if there were no real difference. A low p-value (< 0.05) also indicates significance.
- High t-value and low p-value: Both indicate that the difference between the data sets A and B is likely not due to random chance.
Let's change things up so that A and B are both generated with the same parameters, meaning there's no "real" difference between the two:
# Regenerate B with the same mean and standard deviation as A
B = np.random.normal(25.0, 5.0, 10000)
stats.ttest_ind(A, B)
Ttest_indResult(statistic=0.034527716049370272, pvalue=0.97245668591059853)
Now our t-statistic is much lower and our p-value is really high. This is consistent with the null hypothesis: that there is no real difference in behavior between these two sets.
Does the sample size make a difference?
Does increasing the sample size make a difference? Not really. Even with 100,000 samples for both A and B:
# Draw both groups from the same distribution, with ten times as many samples
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 5.0, 100000)
stats.ttest_ind(A, B)
Ttest_indResult(statistic=-0.30982965514862532, pvalue=0.75669082178465885)
Our p-value actually got a little lower and our t-statistic a little larger in magnitude, but still nowhere near enough to declare a real difference. So you could have reached the right decision with just 10,000 samples instead of 100,000, and even a million samples doesn't help. The take-home message is that when there's no real effect, you could keep running this A/B test for years and never achieve the significant result you're hoping for.
Running simulated experiments like this is also a good way to develop a gut feel for how long you might need to run an experiment: how many samples does it actually take to get a significant result? If you know something about the distribution of your data ahead of time, you can run these sorts of simulations before the real test, as sketched below.
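Here's a minimal sketch of that idea: it assumes a hypothetical true effect (a treatment mean of 26.0 versus a control mean of 25.0) and checks how often the t-test reports p < 0.05 at different sample sizes. The means, standard deviation, and sample sizes are just illustrative assumptions:

import numpy as np
from scipy import stats

np.random.seed(42)

# Assumed "true" behavior of the control and treatment groups
control_mean, treatment_mean, std_dev = 25.0, 26.0, 5.0

for n in [50, 100, 500, 1000, 5000]:
    # Simulate many experiments of size n and count how often p < 0.05
    p_values = [stats.ttest_ind(np.random.normal(control_mean, std_dev, n),
                                np.random.normal(treatment_mean, std_dev, n)).pvalue
                for _ in range(100)]
    hit_rate = np.mean(np.array(p_values) < 0.05)
    print(f"n = {n}: significant in {hit_rate:.0%} of simulated experiments")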
A/A testing
If we were to compare the set to itself, this is called an A/A test:
stats.ttest_ind(A, A)
Ttest_indResult(statistic=0.0, pvalue=1.0)
The t-statistic should be 0 and the p-value 1.0, since the two sets are identical. Any other result indicates an issue with your testing setup.
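A related sanity check is to split one sample randomly in half and run the same test on the two halves; here's a minimal sketch (the data and the 50/50 split are just illustrative):

import numpy as np
from scipy import stats

A = np.random.normal(25.0, 5.0, 10000)

# Randomly split a single sample into two halves that got the same treatment
shuffled = np.random.permutation(A)
first_half, second_half = shuffled[:5000], shuffled[5000:]

# Expect a small t-statistic and a large p-value; a consistently low p-value
# here would point to a problem in the testing setup
print(stats.ttest_ind(first_half, second_half))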
Key Takeaways
Determining significance is not black and white; it's a judgment call. T-statistics and p-values are guides, not guarantees. These metrics can show trends over time, helping you make informed decisions.