The Kalyanam Archive

01 · The scene

The reproducibility gap.

BELIV had the most interesting connected-vehicle data on campus and the worst reproducibility story I'd seen. A planner regression flagged at 11pm took a research assistant 97 minutes the next morning to reproduce. By the time they had it, the person who wrote the offending change was already in their next standup.

02 · The system

What we built.

A two-tier replay harness. Tier one captures every regression as a minimum reproducing bag: only the topics, frames, and V2X messages necessary for that scenario. Tier two re-runs the bag against the planner under test on a per-PR basis, gated in CI.

Bag pruner that decides what to keep by tracing the dependencies of subscribed topics.
Bag store backed by content-addressed storage; identical scenarios dedupe.
Per-PR runner sharded across the lab's GPU box and a CARLA fallback.
Foxglove layout auto-generated for each failed bag — you click the link in the PR comment and you're looking at the failure.
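
To make the dedupe point concrete, here is a minimal sketch of a content-addressed bag store. It is not the lab's implementation: the `BagStore` class, its method names, and the choice of SHA-256 are all assumptions for illustration. It also assumes that two captures of the same scenario are byte-identical once pruned; a real store would likely hash a canonical form of the bag instead.

```python
import hashlib
from pathlib import Path


class BagStore:
    """Toy content-addressed store (hypothetical): key = SHA-256 of the bytes."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, bag_bytes: bytes) -> str:
        """Store a bag and return its digest; identical content is written once."""
        digest = hashlib.sha256(bag_bytes).hexdigest()
        path = self.root / digest
        if not path.exists():
            path.write_bytes(bag_bytes)
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch a bag by the digest that put() returned."""
        return (self.root / digest).read_bytes()

    def __len__(self) -> int:
        """Number of distinct bags currently stored."""
        return sum(1 for _ in self.root.iterdir())


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        store = BagStore(Path(tmp) / "bags")
        first = store.put(b"pruned bag for scenario A")
        second = store.put(b"pruned bag for scenario A")  # same bytes, same key
        print(first == second, len(store))
```

Because the key is derived from the content, a PR comment can link to a digest and be sure the bag behind it never changes.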

03 · The hard part

Pruning without lying.

The hardest bug we hit was in the pruner: it dropped a topic that the planner did not subscribe to but the perception node did. Because the planner depends on perception, the pruned bag replayed clean and the planner passed, so the harness reported a pass on a scenario it had not actually reproduced. We added a strict mode that records the full subscription graph and refuses to prune anything reachable from the unit under test. That single change moved reproducibility from “mostly works” to “trusted by the lab.”
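
The difference between the two pruning modes can be sketched as a graph traversal. Everything below is illustrative: the node names, topic names, and the `SUBSCRIPTIONS` and `PUBLISHERS` tables are hypothetical, and the real harness records its graph from the running system rather than from hard-coded dictionaries. The non-strict branch keeps only the topics the unit under test subscribes to directly; the strict branch walks upstream through every publisher and keeps whatever it reaches.

```python
from collections import deque

# Hypothetical subscription graph: node -> topics it subscribes to.
SUBSCRIPTIONS = {
    "planner": {"/perception/objects", "/v2x/spat"},
    "perception": {"/camera/image", "/lidar/points"},
}
# Hypothetical publisher table: topic -> nodes that publish it during replay.
PUBLISHERS = {
    "/perception/objects": {"perception"},
}


def reachable_topics(unit, subscriptions, publishers):
    """Return every topic reachable upstream of `unit` via subscription edges."""
    keep = set()
    seen_nodes = {unit}
    queue = deque([unit])
    while queue:
        node = queue.popleft()
        for topic in subscriptions.get(node, ()):
            keep.add(topic)
            # Follow the topic back to whoever publishes it, once per node.
            for publisher in publishers.get(topic, ()):
                if publisher not in seen_nodes:
                    seen_nodes.add(publisher)
                    queue.append(publisher)
    return keep


def prune(bag_topics, unit, subscriptions, publishers, strict=True):
    """Return the subset of `bag_topics` to keep for a replay of `unit`."""
    if strict:
        keep = reachable_topics(unit, subscriptions, publishers)
    else:
        keep = set(subscriptions.get(unit, ()))  # direct subscriptions only
    return set(bag_topics) & keep


if __name__ == "__main__":
    bag = {"/camera/image", "/lidar/points", "/perception/objects",
           "/v2x/spat", "/diagnostics"}
    print(sorted(prune(bag, "planner", SUBSCRIPTIONS, PUBLISHERS, strict=False)))
    print(sorted(prune(bag, "planner", SUBSCRIPTIONS, PUBLISHERS, strict=True)))
```

In this toy graph the non-strict pruner drops `/camera/image` and `/lidar/points`, which is the class of mistake described above; strict mode keeps them and still discards `/diagnostics`, which nothing upstream of the planner consumes.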

Lesson

A regression harness is a contract. If it lies once, no one trusts it again.

04 · The result

Lead time collapsed.

97 → 11 min · reproduction · Median time-to-reproduce

28,000 · nightly tests · Gating PRs against the planner

−88% · lead time · Bug filed → fix merged

flakes accepted · Strict-mode contract on the pruner

05 · The artifact

What it produces.

Per-PR comment with reproduction link, failure clip, and discordance trace.
Nightly leaderboard of slowest scenarios — the lab's “hot bags.”
Foxglove layouts authored once, regenerated on every failure.

06 · The reflection

What I'd add.

A coverage map across the scenario taxonomy. Right now we know how fast the harness is and how much it catches; we do not yet know where the holes are in the scenario space. Building that map is next semester's project.
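
One possible shape for such a map, as a rough sketch only: count stored bags per cell of the taxonomy and list the cells with no bags. The two axes here (`MANEUVERS` and `CONDITIONS`) and their values are invented for illustration; the actual taxonomy has not been defined yet.

```python
from collections import Counter
from itertools import product

# Hypothetical taxonomy axes; the real ones are still to be chosen.
MANEUVERS = ("lane_change", "merge", "unprotected_left")
CONDITIONS = ("clear", "night", "rain")


def coverage_map(bag_tags):
    """Count bags per (maneuver, condition) cell; unknown tags are ignored."""
    counts = Counter(bag_tags)
    return {cell: counts.get(cell, 0) for cell in product(MANEUVERS, CONDITIONS)}


def holes(coverage):
    """Return the taxonomy cells that no stored bag exercises."""
    return sorted(cell for cell, count in coverage.items() if count == 0)


if __name__ == "__main__":
    tags = [("merge", "rain"), ("merge", "rain"), ("lane_change", "clear")]
    for cell in holes(coverage_map(tags)):
        print(cell)
```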