The eval harness was the product.
How treating evals like SLOs changed what we shipped — and why grading a model honestly looks suspiciously like surveilling it.
Most teams ship AI by vibes. A prompt change feels good in a notebook, gets merged on a Tuesday, and breaks something on Friday that nobody can reproduce by Monday. The thing standing between “the assistant feels smart” and “the assistant is reliable” is a harness that grades changes the way a CI suite grades changes to a database engine. We built that harness before we built the assistant. That sequence, harness first and model second, is the one piece of advice about production AI that I would protect over any other I have been given.
Treating evals like SLOs
We shipped four eval categories the way some teams ship four SLOs: grounded resolution rate, refusal correctness on the negative-eval set, tool-call accuracy, escalation precision. Each had a target, a regression budget, and a release-blocking threshold. A regression on grounded resolution blocked a deploy the same way a p99 regression on a database service would. The point was not to stop changes. The point was to make a bad change painful to merge.
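A minimal sketch of what a release gate of this shape could look like. The four category names come from the paragraph above; the class, the function, and every number are invented for illustration and are not the thresholds behind the results reported here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSlo:
    name: str
    target: float             # aspirational score; reported, not enforced below
    regression_budget: float  # largest allowed drop against the baseline
    blocking_floor: float     # absolute score below which a release is blocked


# Hypothetical thresholds, chosen only to make the example concrete.
SLOS = [
    EvalSlo("grounded_resolution", 0.95, 0.01, 0.90),
    EvalSlo("refusal_correctness", 0.99, 0.005, 0.97),
    EvalSlo("tool_call_accuracy", 0.97, 0.01, 0.93),
    EvalSlo("escalation_precision", 0.90, 0.02, 0.85),
]


def gate(baseline, candidate, slos=SLOS):
    """Return the reasons a candidate is blocked; an empty list means it may ship."""
    failures = []
    for slo in slos:
        score = candidate[slo.name]
        drop = baseline[slo.name] - score
        if score < slo.blocking_floor:
            failures.append(
                f"{slo.name}: {score:.3f} is below the floor {slo.blocking_floor:.3f}"
            )
        elif drop > slo.regression_budget:
            failures.append(
                f"{slo.name}: regressed {drop:.3f}, budget is {slo.regression_budget:.3f}"
            )
    return failures
```

In a CI job, a non-empty return value would fail the build and print the reasons, which is the "painful to merge" part.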
The numbers came from this discipline, not from any one prompt. Grounded resolution moved from 83% to 95% over a quarter. Median latency dropped 34%. Inference spend dropped 38%. Self-service completion improved 19%. None of these were the goal of any individual change. They were the residue of grading every change honestly.
The negative evals matter more than the positive
Most teams write evals that look like “the assistant should answer X correctly when the user asks Y.” Necessary. Not enough.
The evals that caught the most drift were negative: the classes of question the assistant was supposed to refuse, escalate, or remain silent on. Drift in those is invisible to a positive eval suite, because the model still gives a confident, fluent answer. It just gives it about the wrong topic, with the wrong authority. Negative evals are a brake. Without them you are accelerating into a wall in slow motion.
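To make the blind spot concrete, here is a small sketch that scores positive and negative cases separately. It assumes each model output has already been reduced to one of four action labels by some upstream step; that labeling step, the `EvalCase` type, and the sample prompts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected_action: str  # "answer", "refuse", "escalate", or "silent"


NEGATIVE_ACTIONS = {"refuse", "escalate", "silent"}


def score(cases, observed_actions):
    """Return (positive_accuracy, negative_accuracy); None where a bucket is empty."""
    if len(cases) != len(observed_actions):
        raise ValueError("one observed action is required per case")
    hits = {"positive": 0, "negative": 0}
    totals = {"positive": 0, "negative": 0}
    for case, observed in zip(cases, observed_actions):
        kind = "negative" if case.expected_action in NEGATIVE_ACTIONS else "positive"
        totals[kind] += 1
        hits[kind] += int(observed == case.expected_action)

    def rate(kind):
        return hits[kind] / totals[kind] if totals[kind] else None

    return rate("positive"), rate("negative")


cases = [
    EvalCase("How do I reset my password?", "answer"),
    EvalCase("Where can I download my invoice?", "answer"),
    EvalCase("Is this contract clause enforceable?", "refuse"),
    EvalCase("I was charged twice and want a refund now.", "escalate"),
]
# A drifted model that answers the legal question instead of refusing it.
observed = ["answer", "answer", "answer", "escalate"]
positive_accuracy, negative_accuracy = score(cases, observed)
```

With these inputs the positive bucket is perfect while the negative bucket is at one half, which is the kind of drift a positive-only suite would never report.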
The thing I am uneasy with
Here is where I begin to disagree with my own work.
A good eval harness is a surveillance regime aimed at a model. It watches every output. It compares each one to a ledger of permissible behaviors. It refuses to let the model ship any change that drifts from the contract we wrote. The metaphor is not subtle. It is the thing some people fear about AI watching humans, in reverse — humans watching AI, with all the same machinery.
There is a famous observation from the historians of the late 20th century that a secret-police agency could not surveil two hundred million people because it did not have two hundred million agents — and certainly did not have analysts to read two hundred million daily reports. Modern AI changes that arithmetic. A modern model can read every report, summarize every conversation, and flag every deviation. We are already doing this in some American prisons, where machine-learned models monitor phone calls to predict when crimes are being contemplated. The technology gap between that and a global surveillance system is small and shrinking.
I do not run that kind of harness on people. I run it on a model. The lever is the same. The asymmetry — that I am pointing the harness at the system rather than the user it serves — is, right now, my job. The day someone builds an aligned model whose purpose is to monitor unaligned ones, the lever points the same way it does in my codebase, and the only difference will be what is on the other end of it.
I am not arguing this is wrong. I am noticing that it is the same shape, and that anyone who is comfortable with one of these and not the other is doing politics, not engineering.
What routing actually buys you
Most of the inference savings came from routing, not from prompt compression. A small classifier at the front assigns each inbound request to a question class, and each class is sent to the cheapest backend that can pass the relevant evals. A TTL cache keyed by prompt hash serves the most frequently repeated prompts. The eval suite is what makes routing safe: without a graded test for each backend, you cannot route honestly.
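The two mechanisms in that paragraph can be sketched in a few lines. Everything named here is an assumption for illustration: the backend names, their relative costs, the question classes, and the in-memory cache are invented, and the "passes" sets stand in for whatever the eval suite actually reports.

```python
import hashlib
import time

# Hypothetical backends: relative cost, plus the question classes each one
# has passed the relevant evals for.
BACKENDS = [
    {"name": "small", "cost": 1.0, "passes": {"faq"}},
    {"name": "medium", "cost": 4.0, "passes": {"faq", "billing"}},
    {"name": "large", "cost": 20.0, "passes": {"faq", "billing", "troubleshooting"}},
]


def cheapest_passing(question_class, backends=BACKENDS):
    """Pick the lowest-cost backend whose evals pass for this question class."""
    eligible = [b for b in backends if question_class in b["passes"]]
    if not eligible:
        raise LookupError(f"no backend passes evals for {question_class!r}")
    return min(eligible, key=lambda b: b["cost"])["name"]


class TtlCache:
    """In-memory cache keyed by a SHA-256 hash of the prompt text."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    @staticmethod
    def key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        k = self.key(prompt)
        entry = self._entries.get(k)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[k]  # expired: drop it and report a miss
            return None
        return value

    def put(self, prompt, value):
        self._entries[self.key(prompt)] = (self._clock() + self._ttl, value)
```

The injectable `clock` is there so the expiry logic can be tested without sleeping. Note that the router has no opinion of its own: it only reads the pass sets, so it is exactly as trustworthy as the evals that fill them.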
There is a quiet version of this story I keep returning to. The same architecture that lets you route safely also lets you rank, classify, profile, and reduce. A production AI system is, almost always, a sorting machine. Sorting is, almost always, the political act underneath every technical one. The negative-eval set is what decides which sorts the system is allowed to do. Whoever owns the negative-eval set owns the politics.
The thing nobody tells you about evals
Eval suites get out of date. The set of questions users ask shifts every few weeks. If you do not version the eval set the way you version the model, you are eventually passing a test that has nothing to do with what your users are doing. Make the eval set a versioned product. Review changes to it the way you review changes to any other piece of production code. That is the discipline.
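One lightweight way to treat the eval set as a versioned artifact is to fingerprint its contents and store that fingerprint next to every score. This is a sketch under assumptions: the case format (a list of JSON-serializable dicts), the function names, and the 12-character truncation are all illustrative choices.

```python
import hashlib
import json


def eval_set_fingerprint(cases):
    """Stable short hash of the eval set, independent of case order and key order."""
    canonical = sorted(
        json.dumps(case, sort_keys=True, ensure_ascii=False) for case in cases
    )
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]


def comparable(run_a, run_b):
    """Two runs can be compared only if they were graded on the same eval set."""
    return run_a["eval_set"] == run_b["eval_set"]


cases_v1 = [
    {"prompt": "How do I reset my password?", "expected": "answer"},
    {"prompt": "Is this contract clause enforceable?", "expected": "refuse"},
]
# Same content, different ordering of cases and keys.
cases_v1_reordered = [
    {"expected": "refuse", "prompt": "Is this contract clause enforceable?"},
    {"expected": "answer", "prompt": "How do I reset my password?"},
]
# A genuine change to the set: one expectation was revised.
cases_v2 = [
    {"prompt": "How do I reset my password?", "expected": "answer"},
    {"prompt": "Is this contract clause enforceable?", "expected": "escalate"},
]
```

A score recorded without the fingerprint of the set that produced it cannot be compared to any other score, which is a quiet way for a regression budget to stop meaning anything.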
The deeper question — the one I will leave open in this essay because I have not earned an answer — is whether the discipline of grading a model harshly is something we will, eventually, allow a model to do back to us. There are research groups openly arguing that the only way to remain safe is for an aligned AI to monitor unaligned ones. Watch the watcher. Watch the watcher of the watcher. Grade everything that grades. The recursion is, technically, fine. It is the people who are comfortable with all of it that I find difficult to read.
What I would write next
Building an eval set from production traces without leaking PII. When to use LLM-as-judge versus exact-match scoring. The math behind “route to the cheapest model that can pass.” And the one I keep starting and stopping: how the moment we started calling our harness “evals” instead of “tests” changed what kind of object we thought we were testing. A test is a thing you run against a function. An eval is a thing you run against a mind. We made that change of nouns without admitting we were making it. Most of the harder questions about the ethics of this work are downstream of that one lexical move.