
A/B Testing: Tips and Pitfalls

A/B testing helps businesses make data-driven decisions by comparing two versions (A and B) of a web page, app, or product. Yet many factors can influence the outcome and interpretation of these tests. Here are some critical considerations and common pitfalls in A/B testing.

Deciding Test Duration

Knowing when to stop an A/B test is crucial. If you are the author of a change (or have otherwise developed the experiment), there is a natural bias to keep the test running until you see the results you want. Instead, set a predefined significance threshold for your p-value (a statistical measure), usually 1% or 5%, before the test starts; once the p-value falls below that threshold, it's time to end the test and act on the findings. Plot your p-value daily: a decreasing trend suggests the test is promising, while no change or erratic behavior indicates you should stop the test, as it's unlikely to provide any insights.
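A minimal sketch of that daily check, using a two-proportion z-test from statsmodels on cumulative conversion counts (the counts below are hypothetical):

```python
from statsmodels.stats.proportion import proportions_ztest

# Cumulative results so far (hypothetical numbers)
conversions = [420, 465]     # [control (A), treatment (B)]
visitors = [10000, 10050]    # visitors assigned to each group

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)

ALPHA = 0.05  # predefined threshold, chosen before the test started
print(f"p-value = {p_value:.4f}")
if p_value < ALPHA:
    print("Threshold crossed: stop the test and act on the findings.")
else:
    print("Keep collecting data, up to the predefined maximum duration.")
```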

Another thing to keep in mind is that running more than one experiment on the site at once can confound your results. Time spent on experiments is a valuable commodity as well, so finding a balance is important.

ℹ️

Setting a max test duration up front saves time and prevents endless testing.

Don't Trust p-values Blindly

A low p-value indicates a statistically significant result but isn’t a guarantee of success. Other factors can skew experiment results, leading to incorrect decisions. The p-value is a helpful guide but should not be the sole factor in decision-making. Remember also, correlation does not imply causation.

Even with a well-designed experiment, all you can say is that there is some probability that the effect you observed was caused by the change you made.
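One way to keep the p-value in perspective is to report the estimated lift with a confidence interval, so the size of the effect is visible alongside its significance. A minimal sketch, reusing the hypothetical counts from above and a simple Wald interval:

```python
import math

conv_a, n_a = 420, 10000     # control (hypothetical)
conv_b, n_b = 465, 10050     # treatment (hypothetical)

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Wald standard error for the difference of two independent proportions
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"Absolute lift: {diff:.4%} (95% CI: {ci_low:.4%} to {ci_high:.4%})")
```

A wide interval that barely excludes zero is a weaker basis for a decision than the p-value alone might suggest.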

ℹ️

Use p-value as a guide, not gospel.

Novelty and Seasonal Effects

Novelty effects happen when a change, like a new button color, temporarily boosts performance simply because it's new. There may also be long-term effects of your change that you cannot measure within the test window. Similarly, seasonal trends, like holidays or weather conditions, can influence customer behavior, and running the test during such periods might give misleading results.

For these reasons, if you have a change that is somewhat controversial, it's a good idea to rerun the experiment later on and see whether you can replicate its results. Also, if there are seasonal fluctuations you see every year, try to avoid running your experiment during one of those peaks or valleys, as users may behave unusually at those times (e.g. on holiday, over Christmas, or during hot weather). One way to spot such windows is sketched below.
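A rough sketch of checking historical data for recurring peaks or valleys before scheduling a test (the file name and column names here are assumptions about how the data is stored):

```python
import pandas as pd

# Assumed input: one row per day with columns 'date' and 'conversion_rate'
daily = pd.read_csv("daily_conversion_rates.csv", parse_dates=["date"])

# Average conversion rate by calendar month across all years of history
by_month = daily.groupby(daily["date"].dt.month)["conversion_rate"].mean()
print(by_month)  # months that stand out are poor windows for an experiment
```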

ℹ️

Be aware of external factors like novelty and seasonal trends that can affect your test outcome.

Selection Bias and Data Pollution

To get accurate results and avoid selection bias, randomize the assignment of users to control or treatment groups (A or B). Otherwise, you risk selection bias where a particular group dominates and skews the outcome. Also, watch out for data pollution from bots or other automated activities that can corrupt your test data.

For example, let's say that you're hashing your customer IDs to place them into one bucket or the other. If there is some subtle bias in how that hash function treats lower customer IDs versus higher customer IDs, then what you end up measuring is just a difference in behavior between old customers and new customers.

You also need to make sure that assignment is sticky: users must not switch groups between clicks within a session. A salted, deterministic hash, as sketched below, addresses both concerns.
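A minimal sketch of sticky, deterministic bucketing. Hashing the customer ID together with a per-experiment salt keeps assignment stable across sessions and avoids reusing the same split for every experiment, and a cryptographic hash avoids subtle correlation with the raw ID values. The experiment name here is hypothetical:

```python
import hashlib

def assign_group(customer_id: str, experiment: str) -> str:
    """Return 'A' or 'B'; the same inputs always give the same answer (sticky)."""
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Hypothetical usage: the assignment never changes between clicks or sessions
print(assign_group("customer-12345", "checkout-button-color"))
```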

ℹ️

Ensure proper randomization, sticky session assignment, and clean data for unbiased results.

Auditing and Attribution Errors

Audit your A/B testing framework with an A/A test to identify any inherent biases or other problems, such as session leakage. An A/A test involves running an experiment with no changes between the two groups to see whether your setup generates false positives or negatives.
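A minimal sketch of this kind of audit using simulated A/A tests: with no real difference between the groups, roughly 5% of runs should fall below a 0.05 threshold, and a much higher rate hints at a problem such as session leakage or broken randomization (the traffic numbers are hypothetical):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
true_rate, n_per_group, runs = 0.045, 10_000, 1_000

false_positives = 0
for _ in range(runs):
    # Two identical variants drawn from the same underlying conversion rate
    conv = rng.binomial(n_per_group, true_rate, size=2)
    _, p = proportions_ztest(count=conv, nobs=[n_per_group, n_per_group])
    false_positives += p < 0.05

print(f"False positive rate: {false_positives / runs:.1%} (expect about 5%)")
```

In practice you would also run a real A/A test through the production framework itself, since simulation alone cannot catch bucketing or logging bugs.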

Also be clear on how you attribute conversions. Attribution gets into a gray area when you count downstream behavior as the result of a change. You need to understand how you are counting conversions as a function of their distance from the thing you changed, and agree with your business stakeholders upfront on how you are going to measure those effects.

In terms of attribution, also make sure multiple experiments aren't affecting the same metric. You have to apply your judgment as to whether concurrent changes could interfere with each other in some meaningful way.

ℹ️

Conduct A/A tests and have a clear attribution model to trust your A/B test results.

Conclusion

A/B testing is a powerful tool, but it's not free from challenges. Understanding these aspects ensures that you conduct meaningful tests, interpret the data correctly (and with a grain of salt), and make informed decisions. Ideally, retest important changes later on, during a different time period.