Multi-Armed Bandits vs A/B Tests: A PM's Plain-English Guide

Your A/B test is running. Variant B is up 12% on conversions after a week. You can see it. But you can’t call it — significance hasn’t been reached. So you wait three more weeks while half your traffic goes to the worse experience.

This is the core tension between A/B testing and multi-armed bandits. They solve different versions of the same problem, and mixing them up costs you either rigor or revenue.

A/B Testing: What It Is and Where It Breaks

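A fixed-horizon A/B test splits traffic 50/50, fixes a sample size target in advance, and evaluates significance only once that target is reached. The logic is sound: at a 95% confidence level, a test with no real difference between variants will still declare a winner about 5% of the time. Checking results repeatedly and stopping at the first significant reading pushes that false positive rate well above 5%, a problem discussed in Spotify’s experimentation research and Statsig’s published methodology.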
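To make the peeking problem concrete, here is a minimal simulation sketch. It runs A/A tests (both arms share the same true conversion rate, so every "winner" is a false positive) and compares stopping at the first significant interim look against evaluating once at the end. The 10% conversion rate, the arm size, and the look schedule are illustrative assumptions, not figures from any of the tools named above.

```python
import math
import random


def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided two-proportion z-test for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_b / n - conv_a / n) / se > z_crit


def simulate_aa(n_sims=500, n_per_arm=2000, look_every=200, rate=0.10, seed=7):
    """Return (false positive rate with peeking, false positive rate without)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        flagged = False
        for i in range(1, n_per_arm + 1):
            # Both arms convert at the same true rate: there is no real winner.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # Interim look: a peeking team would stop here if it is significant.
            if i % look_every == 0 and z_significant(conv_a, conv_b, i):
                flagged = True
        peeking_hits += flagged
        # Fixed-horizon rule: evaluate exactly once, at the planned sample size.
        fixed_hits += z_significant(conv_a, conv_b, n_per_arm)
    return peeking_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = simulate_aa()
    print(f"false positives with peeking: {peeking:.1%}")
    print(f"false positives, single look: {fixed:.1%}")
```

Because the final look is one of the interim looks, the peeking rate can never come out lower than the single-look rate; in practice it tends to land several times higher.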

But the protocol has costs. For teams under 50,000 MAU, reaching significance on a button copy test can take months. You hold both variants open the whole time, meaning users in the losing experience have a worse product throughout. The insight-to-action gap compounds: you probably know which variant is better early, but you can’t act on it.
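A rough sample-size calculation shows why low-traffic tests drag on. The sketch below uses the standard normal-approximation formula for comparing two proportions. The 5% baseline conversion rate, the 80% power setting, and the weekly traffic figure are assumptions chosen for illustration; the 12% relative lift echoes the example at the top of this article.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_finish(n_per_arm, weekly_eligible_users):
    """Weeks needed to fill both arms at a given weekly traffic level."""
    return math.ceil(2 * n_per_arm / weekly_eligible_users)


if __name__ == "__main__":
    n = sample_size_per_arm(baseline=0.05, relative_lift=0.12)
    # Hypothetical: 5,000 users per week actually reach the tested screen.
    print(n, "users per arm,", weeks_to_finish(n, 5000), "weeks")
```

Under these assumptions the requirement comes out at roughly 22,000 users per arm, which is around two months of traffic if only 5,000 eligible users see the tested screen each week. Smaller lifts push the number up quickly, since the required sample grows with the inverse square of the difference.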

What Bandits Actually Do

The algorithm name comes from a classic probability problem: a row of slot machines with unknown payout rates. You want to maximize total payout while learning which machines are best. Pure exploration wastes money; pure exploitation locks you in early. Bandits balance both.

In product terms: traffic starts out split evenly. As evidence accumulates that variant B converts better, the algorithm continuously shifts more traffic toward B. By the time a fixed-horizon test would have reached significance, a well-tuned bandit can already be routing 70-80% of users to the better experience, though the exact share depends on traffic volume and on how large the gap between variants really is.

Thompson Sampling is the most common production implementation. Statsig, Optimizely, and some PostHog configurations use it. The algorithm maintains a probability distribution over each variant’s conversion rate. For each session, it draws one sample from every variant’s distribution and serves the variant with the highest draw.
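The mechanism described above can be sketched in a few lines for the simplest case, a binary convert/no-convert outcome with a Beta distribution per variant. This is a textbook Beta-Bernoulli version under stated assumptions, not the implementation any of the vendors above ships; the class name, the uniform Beta(1, 1) prior, and the simulated conversion rates are all illustrative.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a fixed set of variants."""

    def __init__(self, n_variants, seed=None):
        self.rng = random.Random(seed)
        self.successes = [0] * n_variants
        self.failures = [0] * n_variants

    def choose(self):
        # Draw one plausible conversion rate per variant from its posterior,
        # Beta(successes + 1, failures + 1), then serve the highest draw.
        draws = [
            self.rng.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, variant, converted):
        if converted:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1


def simulate(true_rates, n_sessions=5000, seed=42):
    """Return how many sessions each variant was served."""
    sampler = ThompsonSampler(len(true_rates), seed=seed)
    outcome_rng = random.Random(seed + 1)
    served = [0] * len(true_rates)
    for _ in range(n_sessions):
        variant = sampler.choose()
        served[variant] += 1
        # Simulated user: converts with the variant's (hidden) true rate.
        sampler.update(variant, outcome_rng.random() < true_rates[variant])
    return served


if __name__ == "__main__":
    # Hypothetical true rates: variant A converts at 5%, variant B at 10%.
    print(simulate([0.05, 0.10]))
```

Early on, both posteriors are wide, so either variant can win a draw and both keep getting traffic. As data accumulates, the better variant's posterior narrows around a higher rate and it wins most draws, which is where the gradual traffic shift comes from.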

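The tradeoff is statistical control. Bandits optimize for cumulative reward during the experiment. They don’t produce a clean p-value, because the standard significance test assumes a fixed allocation and a sample size set in advance, and a bandit changes the allocation as it goes. If you need documented causality, such as “variant B increased signup rate by 8.3% (p=0.02)”, you need the fixed-horizon design.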
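For reference, a statement like the quoted one typically comes out of a calculation along these lines, run once after a fixed-horizon test has hit its planned sample size. This is a minimal sketch of a pooled two-proportion z-test; the function name and the counts in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift of B over A, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_lift = (rate_b - rate_a) / rate_a
    return relative_lift, p_value


if __name__ == "__main__":
    # Hypothetical final counts: conversions and users per arm.
    lift, p = two_proportion_z_test(500, 10_000, 600, 10_000)
    print(f"relative lift: {lift:.1%}, p-value: {p:.4f}")
```

The p-value is only interpretable at face value when the split and the stopping point were fixed before the test started, which is the discipline a bandit gives up.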

When to Use Which

Use a fixed-horizon A/B test when the decision is high-stakes or irreversible: pricing changes, checkout flow redesigns, anything requiring documented justification. Statistical discipline here is what prevents you from shipping a winner that isn’t one.

Use a bandit when optimizing something continuous and low-risk: CTA copy, email subject lines, push notification timing. When users in the losing variant pay a real cost during the test, the bandit’s regret-minimization is what you want. Each session’s outcome updates the allocation right away, instead of sitting in a tally until a sample size target is hit.

The failure mode is using bandits to avoid rigor on decisions that need it. Teams that never run structured experiments sometimes reach for bandits as a shortcut — and end up locally optimizing CTR while the product’s core activation problem goes unexamined.

The Constraint Neither Solves

Both methods share one hard limit: you still need to ship the variants and have instrumentation in Amplitude or PostHog that captures the right outcome at the right granularity. Bad event schemas corrupt both approaches equally.

The deeper constraint is adaptation latency: time from behavioral signal to live variant. Bandits allocate traffic better once variants are running. They don’t close the gap between seeing a signal and having a UI change deployed.

Rayform addresses this differently: instead of discrete variants and split traffic, it uses behavioral telemetry to adapt what each user sees at runtime. No deploy cycle, no significance wait.

One concrete thing to do this week: instrument a hesitation event in PostHog that fires when a user pauses on a CTA for more than 3 seconds without clicking. If that signal appears on more than 8% of sessions, you have a testable hypothesis. A 2-week bandit will likely answer it sooner than a 6-week fixed test, and it’ll send fewer users through the broken experience while you learn.
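Once the event is flowing, checking it against the 8% threshold is a small offline calculation over exported events. The sketch below assumes a hypothetical event schema: an event named cta_hover with duration_ms and clicked properties, plus a session_id on every event. None of these names come from PostHog itself, so adapt them to whatever your instrumentation actually emits.

```python
HESITATION_MS = 3000        # pause length that counts as hesitation
HESITATION_THRESHOLD = 0.08  # share of sessions that makes it worth testing


def hesitation_rate(events):
    """Share of sessions with at least one long CTA hover that had no click.

    `events` is an iterable of dicts using the hypothetical keys
    session_id, event, duration_ms, and clicked.
    """
    sessions = set()
    hesitated = set()
    for e in events:
        sessions.add(e["session_id"])
        if (
            e.get("event") == "cta_hover"
            and e.get("duration_ms", 0) > HESITATION_MS
            and not e.get("clicked", False)
        ):
            hesitated.add(e["session_id"])
    return len(hesitated) / len(sessions) if sessions else 0.0


if __name__ == "__main__":
    # Tiny made-up export, for illustration only.
    sample = [
        {"session_id": "s1", "event": "cta_hover", "duration_ms": 4000, "clicked": False},
        {"session_id": "s2", "event": "cta_hover", "duration_ms": 1000, "clicked": True},
        {"session_id": "s3", "event": "pageview"},
        {"session_id": "s4", "event": "cta_hover", "duration_ms": 5000, "clicked": True},
    ]
    rate = hesitation_rate(sample)
    print(f"{rate:.0%} of sessions hesitated; test it: {rate > HESITATION_THRESHOLD}")
```

Counting sessions rather than raw events keeps one fidgety user from inflating the rate.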