KD
Systems · 2024 · 09 · 26 · 9 min

Tail latency is a moral position.

On why the worst microsecond is the only honest one — and what averages hide about every system that has ever killed anyone.

The first thing most teams do when someone asks them about performance is reach for an average. The mean is the most cooperative number in software. It makes everything look better than it is, it forgives the worst minutes of your day, and it hides behind a comforting word — average. You cannot have a reasoned conversation about a production system at the mean. The interesting events live in the tail. I will spend the rest of this essay arguing something I half-believe and half-fear: that almost every catastrophe humanity has ever lived through was a tail event we told ourselves was unlikely.
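To make the claim about the mean concrete, here is a minimal sketch in Python. The latency figures are invented for illustration, and `percentile` is a hypothetical nearest-rank helper, not anything from a real harness.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data, in microseconds: 980 fast requests and 20 stalls.
latencies_us = [100] * 980 + [50_000] * 20

mean_us = sum(latencies_us) / len(latencies_us)
p50_us = percentile(latencies_us, 50)
p99_us = percentile(latencies_us, 99)

print(f"mean={mean_us:.0f}us p50={p50_us}us p99={p99_us}us")
```

On this made-up data the mean is 1,098 µs, a latency that no request experienced. The typical request took 100 µs and the unlucky ones took 50,000 µs. The mean reports neither.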

Ninety-nine point nine percent of species that have ever existed have gone extinct. Extinction is not rare. It is the default. The only question every species answers is when. Most of them answer it on a Tuesday they didn't plan for, in the worst microsecond of their geological day. The mean was fine. The mean was beautiful. The tail showed up.

A nuclear war was, on average, not going to happen during the Cold War. The mean was peace. The 99th percentile was a Soviet officer correctly guessing his early-warning system was malfunctioning when it told him American missiles were inbound. The 99.9th percentile was an American bomber breaking apart over North Carolina and three of four safety mechanisms on a thermonuclear device failing as it fell. The 99.99th was the world looking at its own machinery and choosing, by the smallest margin, not to use it. We were saved by individuals who refused to authorize what the mean said was acceptable. Anyone who tells you the Cold War was safe is reporting the mean.

The temptation of the warm cache

The easiest way to look fast is to benchmark a warm cache, single-threaded, on a loop that fits in L1. Most published low-latency numbers do exactly that. Technically not a lie. Technically not a benchmark of the system anyone will ever actually use. The market does not arrive in your loop. It arrives at your worst microsecond.

On Canyon Exchange, the rule was that the harness had to model adversity, not throughput. Bursty arrival distributions. Cache pollution between runs. CPU thermal limits engaged. Real network drivers, not loopback. Hugepages on, but TLB pressure simulated. None of these add throughput. All of them lift the tail. That is the work.
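The point about bursty arrivals can be shown with a toy model. What follows is a sketch under loud assumptions: a single FIFO server, a fixed 8 µs service time, and arrivals that are either evenly spaced or delivered in groups of ten at the same average rate. None of the names or numbers come from the Canyon Exchange harness; they are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def arrival_times(n, period_us, burst):
    """n arrivals averaging one per period_us, delivered in groups of `burst`.
    burst=1 is perfectly smooth traffic."""
    return [(i // burst) * burst * period_us for i in range(n)]

def simulate(arrivals, service_us):
    """Single FIFO server with a fixed service time (an assumption).
    Returns per-request latencies and the time the last request completes."""
    free_at = 0
    latencies = []
    for t in arrivals:
        start = max(t, free_at)       # wait if the server is still busy
        free_at = start + service_us
        latencies.append(free_at - t)
    return latencies, free_at

def run(burst, n=1000, period_us=10, service_us=8):
    """Return (mean latency, p99 latency, completion time of the last request)."""
    latencies, done_at = simulate(arrival_times(n, period_us, burst), service_us)
    return sum(latencies) / n, percentile(latencies, 99), done_at

for label, burst in (("smooth", 1), ("bursty", 10)):
    mean_us, p99_us, done_at = run(burst)
    print(f"{label}: mean={mean_us}us p99={p99_us}us done_at={done_at}us")
```

Both runs serve the same 1,000 requests inside the same 10,000 µs window, so throughput is identical. With smooth arrivals every request sees 8 µs. With bursts of ten, p99 is 80 µs and the mean is 44 µs. A harness that only ever offers smooth load would never see this.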

What changed when I started measuring honestly

Three things moved p99 the most. First, cacheline alignment of the level structure, so the data a level walk touches now stays in L2 instead of being evicted on every order. Second, switching ingress from a write-then-notify futex pattern to io_uring with kernel polling, which eliminated wake-up jitter the harness was previously hiding. Third, sentinel-terminating the linked list so the loop predictor never has to guess. The flamegraph stopped showing the system spending its time on anything other than the thing it was supposed to do.
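Of the three changes, sentinel termination is the easiest to show in a few lines. The sketch below assumes the linked list holds price levels in ascending order, which the text does not state. `Level`, `build_levels` and `first_level_at_or_above` are hypothetical names. Python only shows the shape of the technique; any benefit would come from a compiled implementation, where the walk loop loses its separate end-of-list check.

```python
import math

class Level:
    """One price level in a singly linked list (hypothetical layout)."""
    __slots__ = ("price", "qty", "next")

    def __init__(self, price, qty, next=None):
        self.price = price
        self.qty = qty
        self.next = next

def build_levels(levels):
    """Link ascending (price, qty) pairs and end the list with a sentinel
    whose price is +inf, so no real price can ever exceed it."""
    head = Level(math.inf, 0)  # the sentinel node
    for price, qty in reversed(levels):
        head = Level(price, qty, head)
    return head

def first_level_at_or_above(head, price):
    """Walk to the first level priced at or above `price`.
    There is no `node is None` test: the sentinel guarantees the loop exits."""
    node = head
    while node.price < price:
        node = node.next
    return node

book = build_levels([(100, 5), (101, 7), (103, 2)])
hit = first_level_at_or_above(book, 102)
print(hit.price, hit.qty)
```

A lookup beyond the last real level returns the sentinel itself, with price `math.inf` and quantity 0, so the caller tests for "no such level" once, after the walk, instead of on every step.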

The moral part

Calling tail latency a moral position is not theatre. The mean is a politically comfortable number — easy to defend, easy to publish, easy to celebrate. p99 is uncomfortable for everyone. For the engineer who has to own it. For the manager who has to explain it. For the customer who experiences it. Choosing to publish p99 is a small choice about who you are willing to disappoint, and when.

I would rather disappoint a benchmark than disappoint a counterparty. That is the position. It informs the rest of the system.

The unsettling part — the one I do not have a good answer to — is that this principle scales. If extinction is the default and tail events are where we go to die, then every system humans build that is okay on average is a system that will, at the worst microsecond of its lifetime, stop being okay. Some of those systems are matching engines and the worst case is a missed fill. Some of them are nuclear early-warning networks and the worst case is everyone's last moment. Some are the AI labs that, right now, are publishing benchmarks while privately admitting their personal estimates of catastrophic outcomes are in the double digits. They keep shipping. The averages always look fine. The averages have always looked fine.

I know, when I write that, that some readers will think it is theatrical. It isn't. It is the same statistical instinct that tunes a matching engine, applied to a class of system whose worst-microsecond cost is everyone. There is no engineering reason to treat one of these systems with more honesty than the other. There may be commercial reasons. Those are not the same thing.

What I keep returning to

Writers in this corner of the field have been arguing for a few years now that the single most important number in any production system is the one its operators most want to obscure. I think they are right. I think the discipline of looking at p99 long enough to be unhappy with it — long enough to refuse the warm cache, refuse the synthetic load, refuse the brochure — is one of the few intellectual virtues that translates cleanly out of systems engineering and into everything else.

It does not let you off the hook on any of the bigger arguments. It does not tell you whether to trust an AI lab's alignment plan. It does not tell you which scenario for the next century is the most likely. It just says: if the people building the system will only show you the mean, what they are showing you is not the system. The system lives in the worst microsecond, and the worst microsecond is what you should ask to see.

Most days I am tuning a matching engine. Some days I am thinking about all the averages that have ever been published with a straight face about technologies we could not, in fact, control. These are the same job.