march 20, 2026 · research paper

Swarm Testing

In 2012, researchers found that if you deliberately handicap your random tests by turning off features at random, you find 42% more distinct compiler crashes than if you try to design the best possible setup. This actually makes sense.

by Chaitanya · founder, workers io

This post is an interactive explainer based on the original paper: “Swarm Testing” by Alex Groce, Chaoqiang Zhang, Eric Eide, Yang Chen, and John Regehr (ISSTA 2012).

PART I: The obvious strategy (That Doesn’t Work)

Imagine a stack with two operations: push and pop. There’s a bug lurking in it. If you ever put more than 32 items on the stack, it crashes. You don’t know this yet. Your job is to find the bug.

The first thing you’d probably try is to generate random sequences of pushes and pops, run a lot of tests, and wait for something to break. Using both operations feels like the most thorough way to test. It seems like the best coverage, the highest chance of finding a bug. But is it?

If you pick push and pop at random, half and half, you’ll need about 370,000 tests before you ever hit the overflow. That number isn’t a mistake. Pushes and pops cancel each other out, like a random walk: the stack goes up a bit, then down, then up again. Getting to 33 items means pushes have to outnumber pops by 33 within a single test, like a coin that keeps landing heads far more often than tails. It almost never happens.

Now try something different. Before each test, pick a random non-empty subset of the API. With two operations, you get three cases: both push and pop, just push, or just pop. In a third of your tests, you’ll only use push, so every operation grows the stack. The bug shows up right away.

If you leave out features at random when you write tests, you find the stack bug about a third of the time. Every test that only uses push will overflow. Before, the chance was almost zero. The tests that only use push are the ones that catch the bug, and you get those just by picking random subsets. You didn’t have to guess that pop was the problem. Doing less actually finds more.
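Here is a minimal sketch of that comparison in Python. It rests on assumptions the post does not state: each test is 40 random operations, a pop on an empty stack does nothing, and only the stack depth is modeled. The names `run_test`, `standard_testing`, and `swarm_testing` are made up for illustration.

```python
import random

CAPACITY = 32     # the stack crashes once it holds more than this many items
TEST_LENGTH = 40  # operations per test (an assumption, not from the paper)

def run_test(ops, rng, length=TEST_LENGTH):
    """Run one random test over the enabled ops; return True on overflow."""
    depth = 0
    for _ in range(length):
        op = rng.choice(ops)
        if op == "push":
            depth += 1
            if depth > CAPACITY:
                return True
        elif op == "pop" and depth > 0:
            depth -= 1
    return False

def standard_testing(n_tests, rng):
    """Every test uses both operations, half and half."""
    return sum(run_test(["push", "pop"], rng) for _ in range(n_tests))

def swarm_testing(n_tests, rng):
    """Every test first picks a random non-empty subset of the API."""
    configs = [["push", "pop"], ["push"], ["pop"]]
    return sum(run_test(rng.choice(configs), rng) for _ in range(n_tests))

rng = random.Random(0)
print("standard overflows:", standard_testing(3000, rng))
print("swarm overflows:   ", swarm_testing(3000, rng))
```

Under these assumptions, roughly one swarm test in three overflows, because every push-only test runs past 32 items, while the half-and-half tests practically never get there.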

PART II: See it in practice

Here are 300 test cases for each approach. Each dot is a test: the x-axis is pushes, the y-axis is pops. Try hitting re-run a few times and watch how the pattern changes.

Standard testing gives you a tight cluster in the center. You get pushes and pops in about equal amounts every time. The stack barely notices anything.

Swarm testing scatters dots along the axes. Those are tests with only pushes on the x-axis or only pops on the y-axis. In these, one operation is just turned off. The push-only tests pile up in the bug zone. That’s where all the red shows up.

PART III: The Two ways features hide bugs

Leaving out features helps, but why? The paper gives two reasons. Once you spot them, you’ll start seeing them everywhere.

Try the demo below. It runs a 40-step test on the buggy stack, flipping different features on and off each time.

Active suppression means stopping the bug before it starts. If you use pop, the stack never overflows. Close frees file handles before you run out. Sync clears buffers before they go bad. The feature is the fix.

Passive suppression is when a feature doesn’t stop the bug, but it gets in the way. Say your test runs 50 operations, and 15 are peek. That’s 15 fewer chances for push to grow the stack. The test never goes deep enough.

Think of a buffet and a small plate. If you load up on bread, you leave less room for the food you actually want. Active suppression is someone grabbing food off your plate. Passive suppression is just the bread sitting there, taking up space.

The tricky part is that adding every feature to every test doesn’t just waste time; it actually makes things worse. You can’t predict which features will hide which bugs. The only way to find out is to try lots of combinations.
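Both kinds of suppression show up in a small simulation. This is a sketch under stated assumptions, not the demo's actual code: tests are 40 steps long, the overflow limit is 32, a pop on an empty stack does nothing, and `overflow_rate` is a name invented here.

```python
import random

CAPACITY = 32  # overflow happens once depth exceeds this
STEPS = 40     # steps per test, matching the 40-step test mentioned above

def overflow_rate(features, n_tests=2000, seed=0):
    """Fraction of random tests over `features` that overflow the stack."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        depth = 0
        for _ in range(STEPS):
            op = rng.choice(features)
            if op == "push":
                depth += 1
            elif op == "pop" and depth > 0:
                depth -= 1
            # "peek" changes nothing, but it still uses up a step
            if depth > CAPACITY:
                hits += 1
                break
    return hits / n_tests

print("push only:  ", overflow_rate(["push"]))
print("push + pop: ", overflow_rate(["push", "pop"]))   # active suppression
print("push + peek:", overflow_rate(["push", "peek"]))  # passive suppression
```

Push-only tests always overflow here, since 40 pushes exceed the limit of 32. Adding pop undoes the growth, and adding peek merely dilutes it, yet in this setup either one drives the overflow rate close to zero.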

PART IV: How do we fix this?

Since this works so well, you’d probably think there’s some fancy algorithm behind it. Maybe a genetic optimizer, or a coverage-guided feedback loop, or some machine learning model.

Nope. You flip coins.

For each test setup, just take your list of features and flip a coin for each one. Heads, you include it; tails, you leave it out. That’s it. Run a batch of tests with that setup, then do it again with a new set of coin flips. The paper calls this group of random setups a “swarm.”
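The coin flipping fits in a few lines. This is a sketch, and the feature list and the names `random_config` and `make_swarm` are hypothetical, chosen only for illustration.

```python
import random

def random_config(features, rng):
    """Flip a fair coin per feature: heads keeps it, tails drops it."""
    return [f for f in features if rng.random() < 0.5]

def make_swarm(features, n_configs, seed=None):
    """Build a swarm: a list of independently coin-flipped configurations."""
    rng = random.Random(seed)
    return [random_config(features, rng) for _ in range(n_configs)]

# Hypothetical feature list for a stack API.
FEATURES = ["push", "pop", "peek", "size", "isEmpty"]

for config in make_swarm(FEATURES, n_configs=5, seed=1):
    print(config or "(empty config: skip it or flip again)")
```

Pure coin flips can occasionally switch everything off, so a real harness would probably skip or re-flip empty configs.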

Here are 10 out of 20 Csmith features. The rest include things like comma operators, compound and embedded assignments, increment and decrement, goto, integer multiplication, 64-bit math, packed structs, volatile pointers, and argc/argv.

Every test you can get from a swarm setup, you could also get from the default ‘everything-on’ setup. The set of possible tests doesn’t change. What changes is how likely you are to hit each one. Possibility isn’t the same as probability. That’s the key.

PART V: “But won’t you miss bugs?”

This is always the first thing people ask, and they’re right to worry. Some bugs only show up when you hit just the right mix of features. If a file system only crashes when read, write, open, mkdir, rmdir, unlink, and sync all show up in the same test, a coin-flip config only has a 1 in 128 shot at hitting all seven.

That sounds bad. But you aren’t just running one config; you’re running a whole swarm. Try dragging the sliders below to see how fast the coverage adds up.

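P = 1 − (1 − 0.5^k)^n

Here k is the number of features that all have to be on, n is the number of configs in the swarm, and P is the chance that at least one config includes all k.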

If you run 100 configs, you have about a 96% chance of hitting any given set of 5 features. With 1,000 configs, you cover any given 8 features with about 98% odds, and any given 10 with about 62%.
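The formula is easy to evaluate directly. A small sketch, with function names chosen here for illustration, that computes P for a given k and n and also inverts the formula to ask how many configs a target probability needs (assuming k is at least 1 and fair coin flips):

```python
import math

def coverage_probability(k, n):
    """Chance that at least one of n coin-flip configs enables all k features."""
    return 1 - (1 - 0.5 ** k) ** n

def configs_needed(k, target):
    """Smallest n for which coverage_probability(k, n) reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - 0.5 ** k))

for k, n in [(5, 100), (8, 1000), (10, 1000)]:
    print(f"k={k:2d}, n={n:4d}: P = {coverage_probability(k, n):.3f}")

print("configs for 95% coverage of 7 features:", configs_needed(7, 0.95))
```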

There’s a second effect the paper points out. In the rare configs that have all seven features, the rest are usually off. So those seven interact more; you get more calls to each, and you explore their states more deeply. The test ends up more focused. You don’t just cover the combination; you test it harder when you do.

PART VI: Try It: Break the Stack

It’s easy to talk about theory, but it’s more useful to see what actually happens. Try this: you have a stack with six API operations. Turn them on and off, then run 200 tests and watch how the max-depth changes. See if you can push it to overflow (depth greater than 32) as many times as possible.

The obvious move is to turn off pop. That stops the stack from shrinking. But you can also turn off peek, size, and isEmpty. These don’t remove anything, but if you take them away, you get more pushes. Every step a test spends on an operation that isn’t push is a lost chance to grow the stack. It’s a subtler shift, but it still matters.

PART VII: This Isn’t a Toy Example

The stack demo is tidy, but does this work on real code? The researchers tried swarm testing on 17 versions of 5 real C compilers: GCC, Clang, Intel CC, Open64, and Sun CC. They used Csmith, a random C program generator they’d already spent years tuning to catch compiler bugs.

They set up two identical machines and let them run for a week. One used the default setup, with every C feature turned on. The other used swarm testing.

Swarm found 42% more ways to crash a compiler, even on software that had already been fuzzed a lot, using a tool the researchers had already tuned as much as they could. One week, one machine per approach, no changes to the test generator.

Swarm crashed compilers on fewer programs: 15,851 compared to 22,691. But it tested more programs overall. The test cases were simpler, so it could generate and compile them faster: 66,699 vs 47,477.

But it still found 42% more distinct crash signatures, while producing 30% fewer total crashes. Different error messages usually mean different bugs. The default setup just kept repeating the same crashes.

Pointers turned out to be the most interesting part. In C, a[i] and *(a+i) are basically the same thing. But the researchers found a compiler bug that only happened with pointers, almost never with arrays. Then they found another bug in the same compiler that only happened with arrays, and went away if you used pointers instead.

Pointers caused a third of the compiler bugs, but they also prevented even more: 41%. The same thing was both the main source of trouble and the best fix. You can’t hand-tune your way out of a contradiction like that. The only thing that works is variety.

Even if you play it safe and only count the buggiest version from each compiler family, swarm still found 56 faults instead of 37. That’s a 51% increase.
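The percentages in this part follow from the raw counts quoted above. A quick arithmetic check, where `pct_change` is just a helper defined here:

```python
# Figures quoted in this part of the post.
default_programs, swarm_programs = 47_477, 66_699
default_crashes, swarm_crashes = 22_691, 15_851
default_faults, swarm_faults = 37, 56  # buggiest version per compiler family

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100 * (new - old) / old

print(f"programs tested: {pct_change(default_programs, swarm_programs):+.1f}%")
print(f"total crashes:   {pct_change(default_crashes, swarm_crashes):+.1f}%")
print(f"faults:          {pct_change(default_faults, swarm_faults):+.1f}%")
print(f"crash rate, default: {default_crashes / default_programs:.1%}")
print(f"crash rate, swarm:   {swarm_crashes / swarm_programs:.1%}")
```

The last two lines are derived here rather than quoted from the paper: they show how often a generated program crashed a compiler under each setup.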

PART VIII: The uncomfortable lesson

Most people think you should turn on every feature and run as many tests as possible. Swarm testing works better if you do the opposite.

There’s no perfect setup you can tune your way into. The best trick is to pick random configurations. You don’t have to be clever. Flipping coins works better than expert guesses.

The recipe

1. List the features your test generator can toggle (API calls, language constructs, input properties).
2. For each test batch, flip a coin per feature to decide whether it is in or out.
3. Run your normal test generation with that subset.
4. Repeat with a new random config.
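The recipe can be sketched as a short driver loop. Everything named here is hypothetical: `swarm_test`, the `generate_test` and `run_test` hooks, and the batch sizes are placeholders for whatever your own generator provides, and the toy hooks reuse the stack from Part I with an assumed test length of 40.

```python
import random

def swarm_test(features, generate_test, run_test,
               n_configs=100, batch_size=50, seed=None):
    """Coin-flip a config, run a batch of tests with it, then repeat.

    generate_test(config, rng) and run_test(test) are hypothetical hooks into
    your own generator and system under test; run_test returns True on failure.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(n_configs):
        config = [f for f in features if rng.random() < 0.5]
        if not config:
            continue  # nothing enabled, so move on to the next set of flips
        for _ in range(batch_size):
            test = generate_test(config, rng)
            if run_test(test):
                failures.append((config, test))
    return failures

# Toy hooks for the stack from Part I: a test is a list of 40 operations.
def generate_ops(config, rng):
    return [rng.choice(config) for _ in range(40)]

def overflows(ops):
    depth = 0
    for op in ops:
        if op == "push":
            depth += 1
        elif op == "pop" and depth > 0:
            depth -= 1
        if depth > 32:
            return True
    return False

found = swarm_test(["push", "pop", "peek"], generate_ops, overflows, seed=0)
print(len(found), "failing tests")
```

Keeping the config next to each failing test makes it easy to see afterwards which features were switched off when the bug appeared.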

That’s all there is to it. You don’t need to install anything new or tweak parameters. Most test generators already have feature flags, usually for debugging. Swarm testing just uses them for something more useful.

The real mistake is thinking more is always better. More features, more coverage, more edge cases. But in a single test, features fight for space. Every feature you add pushes another one out. Some bugs never even get a chance to show up.

Sometimes the best thing a test can do is leave something out.