
Why 75% of A/B Tests Are Statistically Invalid

Towards Data Science

Most A/B test results are statistical illusions, not real wins. A product manager celebrates a +8.3% conversion lift at 96% significance, but had she waited three more days, the significance would have dropped to 74% and the lift to +1.2%. This scenario plays out daily across tech companies: according to Ronny Kohavi's research, only 10-20% of controlled experiments at Google and Bing produce genuinely positive results.
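To see how an early "win" can evaporate like that, here is a minimal simulation sketch of a day-by-day significance trajectory, assuming a two-sided two-proportion z-test on simulated Bernoulli traffic. The conversion rates, traffic volume, and random seed are hypothetical, so the printed numbers will differ from the article's; the point is the wobble.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # hypothetical seed

# Hypothetical experiment with a tiny true lift: 5.0% -> 5.1% conversion.
p_control, p_variant = 0.050, 0.051
days, visitors_per_day = 14, 2_000  # per arm

control = rng.binomial(1, p_control, size=(days, visitors_per_day))
variant = rng.binomial(1, p_variant, size=(days, visitors_per_day))

for day in range(1, days + 1):
    c = control[:day].ravel()                # all data seen so far
    v = variant[:day].ravel()
    pooled = np.concatenate([c, v]).mean()
    se = np.sqrt(pooled * (1 - pooled) * 2 / c.size)
    z = (v.mean() - c.mean()) / se
    p_value = 2 * stats.norm.sf(abs(z))      # two-sided p-value
    lift = (v.mean() - c.mean()) / c.mean() * 100
    print(f"day {day:2d}: lift {lift:+5.1f}%, 'significance' {100 * (1 - p_value):.0f}%")
```

Run it a few times with different seeds: the apparent lift and "significance" swing widely in the first days before settling near the true, tiny effect.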

The statistical sins that invalidate most A/B tests are surprisingly simple to fix. Peeking at results before reaching the planned sample size inflates the false positive rate from 5% to 26.1%, meaning roughly one in four 'winners' is pure noise. Underpowered tests create the 'winner's curse': with a small sample, only the runs that happen to overestimate the effect cross the significance threshold, so real but small effects appear artificially large. And testing multiple metrics without correction turns the experiment into a noise-finding machine; with 20 independent metrics at the 5% level, the chance of at least one false positive is 1 - 0.95^20, about 64%.
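A quick Monte Carlo makes the peeking and multiple-metrics sins concrete: it runs A/A experiments (no true effect) and compares the false positive rate under daily peeking against a single planned look, then checks the 20-metric arithmetic. This is a sketch with hypothetical traffic figures; the exact inflation depends on the peeking schedule, so don't expect precisely 26.1%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_experiments=2_000, days=14,
                                visitors_per_day=1_000, p=0.05, alpha=0.05):
    """Fraction of A/A tests (no real effect) declared 'significant'
    at ANY daily peek, vs. only at the final planned look."""
    any_peek_hits = final_look_hits = 0
    for _ in range(n_experiments):
        a = rng.binomial(1, p, size=(days, visitors_per_day))
        b = rng.binomial(1, p, size=(days, visitors_per_day))
        significant_somewhere = False
        for day in range(1, days + 1):
            x, y = a[:day].ravel(), b[:day].ravel()
            pooled = np.concatenate([x, y]).mean()
            se = np.sqrt(pooled * (1 - pooled) * 2 / x.size)
            p_val = 2 * stats.norm.sf(abs((y.mean() - x.mean()) / se))
            if p_val < alpha:
                significant_somewhere = True
                if day == days:
                    final_look_hits += 1
        any_peek_hits += significant_somewhere
    return any_peek_hits / n_experiments, final_look_hits / n_experiments

with_peeking, single_look = peeking_false_positive_rate()
print(f"false positive rate, peeking daily:  {with_peeking:.1%}")   # well above 5%
print(f"false positive rate, one final look: {single_look:.1%}")    # ~5%, as designed

# Multiple metrics: chance of >=1 false positive among k independent metrics.
k = 20
print(f"P(at least one false 'win' in {k} metrics) = {1 - 0.95**k:.0%}")  # ~64%
```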

Fixing these issues requires discipline: calculate sample sizes before starting, don't peek at results, run power analyses, and apply multiple comparison corrections. The article provides a five-item pre-test checklist and a decision framework for choosing between frequentist, Bayesian, and sequential testing approaches. If your A/B testing tool lets you stop whenever the confidence bar turns green, it isn't a testing tool; it's a random number generator with a nicer UI.
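As a starting point for that checklist, here is a minimal sketch of the up-front sample size calculation, using the standard normal-approximation formula for a two-sided two-proportion test. The 5% baseline rate, 5% relative minimum detectable effect, and five-metric Bonferroni example are hypothetical illustrations, not figures from the article.

```python
import math
from scipy import stats

def required_sample_size(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """Visitors needed PER ARM for a two-sided two-proportion z-test
    (standard normal-approximation formula)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)   # smallest lift worth detecting
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = stats.norm.ppf(power)            # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(variance * (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2)

# Hypothetical plan: 5% baseline conversion, detect a 5% relative lift.
n = required_sample_size(0.05, 0.05)
print(f"need {n:,} visitors per arm")         # on the order of 120k per arm

# Bonferroni correction: testing k metrics? Divide alpha before sizing.
k = 5
n_corrected = required_sample_size(0.05, 0.05, alpha=0.05 / k)
print(f"with {k} metrics (Bonferroni): {n_corrected:,} per arm")
```

Note how small effects demand large samples: that six-figure per-arm requirement is exactly why underpowered tests that "find" big lifts should be treated as winner's-curse suspects rather than wins.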