The Kalyanam Archive

At the BELIV Lab, the most expensive person in the room was a research assistant spending ninety-seven minutes the morning after a failed nightly run trying to reproduce one regression. By the time they had it isolated, the engineer whose change caused the failure had already cycled to their next task. The regression got punted. It surfaced again two weeks later as a planner failure on a different scenario, and nobody connected the dots. The cost was not compute. The cost was institutional forgetting.

The replay harness I built collapsed reproduction time from ninety-seven minutes to eleven. That is a real number. It is not the interesting one. The interesting one is what changed about how the team worked once the harness existed.

The pruner that lied once

The bug we caught was that the bag pruner was dropping a topic the planner did not subscribe to but the perception node did. The harness reproduced clean and the planner passed. We caught it manually, comparing two failures that should have been identical and weren't. We added strict-mode that records the full subscription graph of the unit under test and refuses to prune anything reachable. That single change moved the harness from “mostly works” to “trusted by the lab.”

I think about that pruner often, because the lesson it taught me is much wider than connected-vehicle perception.

Trust is a binary

A regression harness lives in one of two states. Either the team trusts it enough to gate merges, or they don't. There is no in-between. Once the team learns the harness can pass while the production system fails, they stop using it as a gate, and you have lost the only thing it was for.

This is why I now think of reproducibility as a contract, not a feature. A contract is a thing you can break exactly once. A regression set that lies once is never trusted again, no matter how fast you fix the bug.

Where the eerie part starts

There are people running AI labs who have spent the last decade telling the rest of us that they have a plan for keeping their systems aligned to human values. Some of them have moved their personal estimates of catastrophic outcomes upward over the last two years and continued to ship. Some of them have argued, in writing, that humanity getting replaced by its silicon descendants is a morally good thing. Some of them have signed open letters warning of extinction-level risk while their companies lobbied to ban regulation of the very systems that pose the risk.

What you should ask yourself, reading any of those statements, is the same question I asked of my pruner the morning after it lied. Has this contract been broken once?If yes — what is the difference between this person's confidence in their alignment plan and a regression set that says everything is fine because it is not watching the right topic?

Trust is a binary. The contract is single-shot. There is no version of “we promise to be careful” that survives one observed lie about how careful they have actually been. There is no version of “we will keep humans in the loop” that survives one quietly-shipped feature where a human was, in fact, not in the loop. There is no version of “we have not crossed a red line” that survives the discovery, six weeks later, that the line was simply moved.

I am not telling you who has lied and who has not. I am telling you that the test you should run is the same test I run on my own code. If a contract has been broken once in a way that cannot be transparently audited, the contract is broken. The harness is no longer a harness. It is theatre.

What “contract” means in practice

Concretely, the BELIV harness has to: (1) refuse to prune anything in the dependency graph of the unit under test; (2) raise loudly when an upstream contract changes and the regression set was not regenerated; (3) version the bag store by content hash so identical scenarios deduplicate cleanly; (4) attach a Foxglove layout to every failure so the next person can see it without setup.

Translate that to AI safety: (1) refuse to ship a model that has not been evaluated against the specific failure modes of the system it will replace; (2) raise loudly when a deployment surface changes without re-grading; (3) version every alignment claim by what specifically was tested, when, and against which version of the model; (4) attach a reproduction trace to every safety claim so the next person — including the auditor, the regulator, the journalist — can see the actual behavior without setup.

Almost none of this is true of any frontier AI system shipping today. That is not a political claim. It is an observation about whether the contract, as I am using the word, exists.

What changed culturally at the lab

The cultural change at the lab was bigger than the technical one. Once the harness was trusted, the lab stopped phrasing failures as “something is broken” and started phrasing them as “here is the bag, here is the layout, here is the offending commit.” Bug reports became reproducible by default. Newer team members could load a failure and see it without learning a single shell command.

That is the other thing reproducibility-as-a-contract gives you: it lowers the on-ramp for everyone who joins after you. It also raises the on-ramp for everyone who would prefer you not be able to check.

What I would write next

Coverage maps over scenario taxonomies — how do you know where the holes are in the regression set? Reproducibility for ML training, not just planner regressions. And — the one I keep avoiding — the case for treating every published claim about AI safety the same way you treat every published claim about p99: assume the harness is lying until you see the trace. I do not think that is paranoia. I think that is the baseline professional courtesy you owe to everyone who will live downstream of the thing you are claiming to control.