A/B Test Significance

Two-proportion z-test for A/B test conversions.

Overview

The A/B Test Significance calculator runs a two-proportion z-test on conversion data from a control and a variant. Enter visitor counts and conversion counts for each arm and get back the lift, z-statistic and two-sided p-value, so you can judge whether the difference is likely to be more than noise.

Marketers running landing-page tests, product managers comparing onboarding flows and growth engineers picking between two checkout buttons all need this calculation. It spares you the spreadsheet work and replaces eyeballing raw conversion rates with a standard statistical test.

How it works

Let n1 and c1 be the visitors and conversions in the control, and n2 and c2 those in the variant. The observed conversion rates are p1 = c1 / n1 and p2 = c2 / n2, and the pooled estimate is p = (c1 + c2) / (n1 + n2). Under the null hypothesis of no real difference, the z-statistic is z = (p2 - p1) / sqrt(p * (1 - p) * (1/n1 + 1/n2)).

The two-sided p-value is 2 * (1 - Φ(|z|)) where Φ is the standard normal CDF. A small p-value (commonly below 0.05) is taken as evidence against the null. The tool also reports absolute and relative lift so you can judge practical significance, not just statistical significance.
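
The formulas above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual code: the function name ab_test, the result keys and the relative-lift definition (difference divided by the control rate) are assumptions made for the example.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ab_test(n1: int, c1: int, n2: int, c2: int) -> dict:
    """Two-proportion z-test; arm 1 is the control, arm 2 the variant."""
    p1 = c1 / n1
    p2 = c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    # Standard error of the difference under the null hypothesis.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return {
        "absolute_lift": p2 - p1,
        # Assumed definition: change relative to the control rate.
        "relative_lift": (p2 - p1) / p1,
        "z": z,
        "p_value": 2 * (1 - normal_cdf(abs(z))),
    }


result = ab_test(1000, 100, 1000, 130)
print(f"z = {result['z']:.2f}, p = {result['p_value']:.3f}")
```

The sketch does not guard against degenerate inputs: an arm with zero visitors, a control with zero conversions, or a pooled rate of exactly 0 or 1 would raise a division error.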

Examples

Control: 1000 visitors, 100 conversions (10%)
Variant: 1000 visitors, 130 conversions (13%)
   →  z = 2.10, p = 0.035, significant at 5%

Control: 500 / 5000 (10%)
Variant: 520 / 5000 (10.4%)
   →  z = 0.66, p = 0.51, not significant

Control: 50 / 500 (10%)
Variant: 80 / 500 (16%)
   →  z = 2.82, p = 0.005, significant at 1%

FAQ

Should I use one-sided or two-sided?

This tool uses two-sided by default, which is the conservative choice. Pick one-sided only if you committed to the direction of the effect before collecting any data.
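
To make the difference concrete, the sketch below converts one z-statistic into both kinds of p-value. The function name p_values and its keys are hypothetical, and z is assumed to be (variant rate minus control rate) divided by the standard error.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_values(z: float) -> dict:
    """Two-sided and one-sided p-values for a z-statistic."""
    return {
        "two_sided": 2 * (1 - normal_cdf(abs(z))),
        "variant_better": 1 - normal_cdf(z),  # H1: variant rate > control rate
        "variant_worse": normal_cdf(z),       # H1: variant rate < control rate
    }


pv = p_values(1.80)
for name, value in pv.items():
    print(f"{name}: {value:.4f}")
```

With z = 1.80 the two-sided p-value stays above 0.05 while the one-sided "variant better" p-value falls below it. That gap is why the direction has to be fixed in advance: choosing it after seeing the data halves the p-value for free.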

Does it handle very small samples?

The z-test assumes the normal approximation is reasonable, which needs at least roughly 5 successes and 5 failures in each arm. For tiny experiments, Fisher's exact test is preferable.
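
The rule of thumb and the fallback can be sketched with the standard library alone. Both function names are hypothetical, and the two-sided Fisher p-value uses one common convention (summing the probabilities of all tables no more likely than the observed one). For real work a tested implementation such as scipy.stats.fisher_exact is the safer choice.

```python
from math import comb


def normal_approx_ok(n1, c1, n2, c2, minimum=5):
    """Rule of thumb: every arm has at least `minimum` successes and failures."""
    return min(c1, n1 - c1, c2, n2 - c2) >= minimum


def fisher_exact_two_sided(n1, c1, n2, c2):
    """Two-sided Fisher's exact p-value for a 2x2 conversion table."""
    total_conv = c1 + c2
    denom = comb(n1 + n2, total_conv)

    def prob(k):
        # Hypergeometric probability that arm 1 has k conversions,
        # given the observed row and column totals.
        return comb(n1, k) * comb(n2, total_conv - k) / denom

    observed = prob(c1)
    lo = max(0, total_conv - n2)
    hi = min(n1, total_conv)
    # Small relative tolerance so ties are not lost to floating-point error.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-7))


# Hypothetical tiny experiment: 2/20 conversions versus 7/20.
if not normal_approx_ok(20, 2, 20, 7):
    p_small = fisher_exact_two_sided(20, 2, 20, 7)
    print(f"Fisher's exact p = {p_small:.3f}")
```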

What about peeking and multiple comparisons?

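The classical z-test assumes a single look at the end of the experiment. Stopping early when significance is reached inflates false-positive rates. Likewise, testing several variants or metrics at once raises the chance that at least one of them looks significant by chance alone.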
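
A small simulation can show the effect of peeking. The sketch below runs A/A tests, where both arms share the same true rate and every "significant" result is therefore a false positive. It compares a single final look against stopping at the first significant interim look. All parameter values (10% rate, 5 looks, 200 visitors per arm per look, 1000 runs) are illustrative assumptions.

```python
import math
import random


def z_stat(n1, c1, n2, c2):
    """Pooled two-proportion z-statistic; 0.0 if the standard error is zero."""
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (c2 / n2 - c1 / n1) / se


def false_positive_rates(rate=0.10, looks=5, per_look=200, runs=1000, seed=1):
    """Return (false-positive rate with one final look, rate with peeking)."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% threshold
    final_hits = peek_hits = 0
    for _ in range(runs):
        c1 = c2 = 0
        any_significant = False
        for look in range(1, looks + 1):
            # Add one batch of visitors to each arm, same true rate in both.
            c1 += sum(rng.random() < rate for _ in range(per_look))
            c2 += sum(rng.random() < rate for _ in range(per_look))
            n = look * per_look
            significant = abs(z_stat(n, c1, n, c2)) > z_crit
            any_significant = any_significant or significant
        final_hits += significant      # only the last look counts
        peek_hits += any_significant   # any look would have stopped the test
    return final_hits / runs, peek_hits / runs


single, peeking = false_positive_rates()
print(f"single look: {single:.1%}, with peeking: {peeking:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out noticeably higher; the exact figures depend on the seed and the assumed parameters.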

Is statistical significance the same as a business win?

No. Always combine the p-value with the magnitude of the lift and a confidence interval before shipping the change.
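
One way to get such an interval is the Wald interval for the difference in rates, sketched below. The function name is hypothetical. It uses the unpooled standard error, a common choice for intervals because, unlike the test statistic, the interval does not assume the two rates are equal.

```python
import math


def lift_confidence_interval(n1, c1, n2, c2, z_crit=1.96):
    """Approximate 95% Wald interval for the absolute lift p2 - p1."""
    p1, p2 = c1 / n1, c2 / n2
    # Unpooled standard error: each arm contributes its own variance.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se


low, high = lift_confidence_interval(1000, 100, 1000, 130)
print(f"absolute lift between {low:+.1%} and {high:+.1%}")
```

For 100/1000 versus 130/1000 the interval excludes zero but its lower end is only a fraction of a percentage point, so the result is consistent with anything from a negligible gain to a large one.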

Why is the p-value so high when my lift looks big?

Small samples produce wide confidence intervals. The same percentage lift on 10 visitors versus 10,000 carries very different evidence.
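
The sketch below holds the observed rates fixed at 10% versus 13% and varies only the number of visitors per arm. The sample sizes are hypothetical, and the helper two_sided_p is an illustrative name.

```python
import math


def two_sided_p(n1, c1, n2, c2):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (c2 / n2 - c1 / n1) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate
    # for very small p-values.
    return math.erfc(abs(z) / math.sqrt(2))


sizes = (100, 1000, 10000)
ps = [two_sided_p(n, n // 10, n, 13 * n // 100) for n in sizes]
for n, p in zip(sizes, ps):
    print(f"{n:>6} visitors per arm: p = {p:.2g}")
```

The lift is identical in all three rows, yet the p-value shrinks as the sample grows, because the standard error falls in proportion to the square root of the sample size.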
