Keeping False Positives Out of Your A/B Test Results

If you already know all about correcting for familywise error rate (FWER) in suites of A/B tests, you can skip straight to our new FWER calculator web app by clicking here. Otherwise, read on!

A/B tests have become one of the most fundamental tools in a marketer's toolbox over the past decade: from copy and ad images to landing pages and web forms, if it's worth doing, it's worth testing. Because of this, most marketers have become very familiar with the concept of statistical significance. In simple terms, “statistically significant” means “very unlikely to happen by chance alone.” In most tests, “unlikely” means 5%: in other words, a test result is considered significant if there's a 5% or less chance that it could have happened randomly, assuming there's no real difference between the variants.
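If you want to see what that looks like in practice, here's a minimal sketch of a single significance check in R, using made-up conversion numbers (variant A converts 50 of 1,000 visitors, variant B converts 80 of 1,000):

```r
# Hypothetical conversion counts for a single A/B test.
result <- prop.test(x = c(50, 80), n = c(1000, 1000))

result$p.value         # the chance a gap this large would appear by chance alone
result$p.value < 0.05  # well under 5% here, so we'd call B's lift significant
```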

This might sound boring to some folks, so let's take a second to remember why it's so important. Statistical significance matters because it lets us make better decisions with our marketing spend and creative. If we don't check to make sure our A/B test results are statistically significant, we could spend our time and energy on something that doesn't actually work!

[Image: Greenough's revision of the xkcd comic referenced below]

Because this statistic is such a critical part of the modern marketing world, dozens of popular web tools have sprung up to help marketers make sense of it. KissMetrics' A/B Significance Test and Evan Miller's Sample Size Calculator are just two great examples (though the latter actually helps you figure out a related probability called statistical power, which is another blog post entirely).

Sounds great, right? It's 2015 – all marketers want more data! In reality, though, it's not so simple. Believe it or not, testing more things can sometimes make our decisions worse rather than better. For an example of how basic statistical significance calculations can lead marketers astray, take a look at our revision of this great xkcd comic (on the left).

As you can see, if you consider any test result with a 5% or less chance of occurring randomly as significant, then you'll generate (roughly) one false positive for every 20 tests you run. If this is confusing, just think of a 20-sided die: if you were to roll it 20 times, you'd expect to roll a five about once, on average. Just like there's nothing special about rolling that five, there's nothing truly significant about getting one “significant” result in a suite of 20 tests!
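If you'd like to check that math yourself, here's a quick R sketch. It assumes 20 independent tests, each run at the usual 5% level, with no real effect anywhere:

```r
alpha <- 0.05
tests <- 20

tests * alpha          # expected false positives across the suite: 1
1 - (1 - alpha)^tests  # chance of at least one false positive: ~64%

# Or simulate the 20-sided die directly: many suites of 20 "null" tests.
set.seed(42)
mean(replicate(10000, any(runif(tests) < alpha)))  # also lands near 0.64
```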

For a real-life example, let's assume that a data-driven marketer goes out and runs a suite of 20 A/B tests. Just like with the die-roll example, we can expect roughly one of those results to be a false positive. If only two of the 20 tests come back positive, how confident can we be that a spend/creative decision based on either of them is the right one? Only about 50%, since roughly one of those two “wins” is expected to be noise. In other words, we're about as confident as a coin flip!
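That coin-flip number is just the same back-of-the-envelope arithmetic, applied to this made-up scenario:

```r
tests_run          <- 20
alpha              <- 0.05
expected_false_pos <- tests_run * alpha  # roughly 1 false positive expected by chance

observed_positives <- 2
expected_false_pos / observed_positives  # ~0.5: about half of our "wins" may be noise
```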

Don't panic, though; there's a fix for this.

The chance of getting at least one false positive across a whole suite of tests is called the “familywise error rate” (FWER), and it's actually not that hard to stop it from ruining your results. But despite looking around for a while, no one at Greenough was able to find an online significance calculator that lets users account for FWER!

To make marketers' lives a little bit easier, we decided to solve this problem ourselves by contributing a new app to the community. Unlike most online calculators, which only give you results for one test at a time, our app takes a list of test results and returns only the ones that remain significant after correcting for FWER. It even allows you to choose your method of correction, if you're interested in going that deep. Check it out at this link – we hope it makes your life a little easier!
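And if you'd rather script the correction yourself, base R covers the same ground with p.adjust(). Here's a minimal sketch with made-up p-values (this is not the app's actual code):

```r
# Hypothetical p-values from a suite of six A/B tests.
p_values <- c(test_1 = 0.003, test_2 = 0.012, test_3 = 0.049,
              test_4 = 0.210, test_5 = 0.380, test_6 = 0.520)

# Holm is a common FWER correction; "bonferroni" is the more conservative choice.
adjusted <- p.adjust(p_values, method = "holm")

adjusted[adjusted < 0.05]  # only the results that survive the correction
```

With these made-up numbers, only test_1 survives; the other two “significant-looking” results (0.012 and 0.049) don't hold up once the whole family of tests is taken into account.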

A big thanks to Boston's own RStudio for making the Shiny web framework package, which we used (along with shinydashboard) to build this app. If you're interested in using this app within your own organization, you can find the code for it on GitHub.

Zach Pearson is Manager, Content and Digital at Greenough. Follow him on Twitter: @zach_p_pearson