Welch's t-Test

Compare two samples' means with Welch's unequal-variance t-test.

Overview

Welch's t-test compares the means of two independent samples without assuming that their variances are equal. It is the more robust cousin of Student's classical t-test and the safer default for most real-world experiments.

It is built for researchers analysing A/B tests with unequal-sized buckets, scientists comparing measurements from instruments with different precision, and students who want a single procedure that works whether or not the variances match.

How it works

For samples with means m1, m2, standard deviations s1, s2 and sizes n1, n2, the t-statistic is

   t = (m1 - m2) / sqrt(s1² / n1 + s2² / n2)

The Welch-Satterthwaite degrees of freedom are

   df ≈ (s1²/n1 + s2²/n2)² / ((s1²/n1)² / (n1 - 1) + (s2²/n2)² / (n2 - 1))
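The two formulas above can be sketched in a few lines of Python. This is a minimal illustration working from summary statistics, not the code behind this page; the function name welch_statistic is made up for the example.

```python
import math

def welch_statistic(m1, s1, n1, m2, s2, n2):
    """Return (t, df) for Welch's t-test from summary statistics.

    m = sample mean, s = sample standard deviation, n = sample size.
    The function name is illustrative only.
    """
    v1 = s1 ** 2 / n1  # squared standard error of the first mean
    v2 = s2 ** 2 / n2  # squared standard error of the second mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; usually not an integer
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t, df = welch_statistic(100, 10, 30, 105, 12, 35)
print(f"t = {t:.2f}, df = {df:.1f}")
```

With raw observations rather than summary statistics, SciPy's scipy.stats.ttest_ind with equal_var=False runs the same test.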

The p-value is the two-sided tail probability of |t| under the t-distribution with df degrees of freedom. A p-value below your chosen significance level (commonly 0.05) rejects the null hypothesis that the two population means are equal.
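As a rough sketch of how a two-sided p-value can be obtained with only the Python standard library, the snippet below integrates the t-density between -|t| and |t| with Simpson's rule and subtracts the result from 1. The function names and the step count are choices made for this illustration. Because of the subtraction, very small p-values lose precision, and a library routine such as scipy.stats.t.sf is the usual choice in practice.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution; df may be fractional."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, 1 - P(-|t| < T < |t|), by Simpson's rule.

    steps must be even; 2000 is an arbitrary choice for this sketch.
    """
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        # Simpson weights: 4 on odd nodes, 2 on even interior nodes
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # mass inside (-|t|, |t|), by symmetry
    return max(0.0, 1.0 - central)

print(two_sided_p(-1.83, 63.0))
```

The density uses lgamma rather than gamma so that large df values do not overflow.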

Examples

Group A: mean 100, sd 10, n=30
Group B: mean 105, sd 12, n=35
   →  t ≈ -1.83, df ≈ 63.0, p ≈ 0.07

Group A: mean 50, sd 5, n=100
Group B: mean 52, sd 8, n=80
   →  t ≈ -1.95, df ≈ 126, p ≈ 0.05

Group A: mean 7, sd 1, n=20
Group B: mean 9, sd 1.5, n=25
   →  t ≈ -5.35, df ≈ 41.8, p < 0.0001

FAQ

Welch's or Student's?

Welch's is the safer default. Student's is slightly more powerful when the variances really are equal, but Welch's loses very little power in that case and is substantially more reliable when they differ, especially when the sample sizes are also unequal.
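To make the comparison concrete, here is a small sketch that computes both statistics from summary data. Student's version uses the standard pooled-variance formula with n1 + n2 - 2 degrees of freedom; the function names and the inputs of the first call are invented for the illustration.

```python
import math

def student_t(m1, s1, n1, m2, s2, n2):
    """Pooled-variance (Student's) t and df; assumes equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t and Welch-Satterthwaite df; no equal-variance assumption."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Equal spreads and equal sizes: the two tests agree.
print(student_t(10, 2, 30, 11, 2, 30), welch_t(10, 2, 30, 11, 2, 30))
# Unequal spreads and sizes: t and df both differ.
print(student_t(7, 1, 20, 9, 1.5, 25), welch_t(7, 1, 20, 9, 1.5, 25))
```

When the standard deviations and sizes match, both functions return the same t and df, which is why little is lost by defaulting to Welch's.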

What's the difference between paired and independent t-tests?

Paired tests compare two measurements on the same subjects (before/after). Independent tests, like Welch's, compare two separate groups.

Why are the degrees of freedom fractional?

The Welch-Satterthwaite formula produces a non-integer adjustment that better matches the true sampling distribution. The t-distribution can be evaluated at fractional df.

Can it handle large samples?

Yes. As the samples grow, df grows with them and the t-distribution converges to the normal, so results approach those of a z-test.
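For reference, the z-test limit can be sketched with the standard library alone: the two-sided normal p-value is erfc(|z| / sqrt(2)). The function name is hypothetical, and the snippet only shows the large-sample approximation, not the t-based calculation.

```python
import math

def z_two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

print(z_two_sided_p(1.96))
```

For large df the t-based p-value for the same statistic is only slightly larger than this normal value.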

What does a non-significant result mean?

You cannot reject the null at your chosen level. That is not the same as confirming the means are equal: small samples can mask real differences.
