BELIV Lab @ ASU · ROS2 · C++ · Connected vehicles
The regression set ran while we slept.
A nightly replay and simulation-validation harness for connected-vehicle perception/planning regressions. The lab had a backlog of failures that nobody could reproduce by lunch. We made reproduction cheap.
28k nightly tests
97 → 11 min reproduction
−88% lead time
CI/CD gating
01 · The scene
The reproducibility gap.
BELIV had the most interesting connected-vehicle data on campus and the worst reproducibility story I'd seen. A planner regression flagged at 11pm took a research assistant 97 minutes the next morning to reproduce. By the time they had it, the person who wrote the offending change was already in their next standup.
02 · The system
What we built.
A two-tier replay harness. Tier one captures every regression as a minimal reproducing bag: only the topics, frames, and V2X messages necessary for that scenario. Tier two replays that bag against the planner under test on every PR, as a CI gate.
- Bag pruner traces dependencies through subscribed topics to decide what to keep.
- Bag store backed by content-addressed storage; identical scenarios dedupe.
- Per-PR runner sharded across the lab's GPU box and a CARLA fallback.
- Foxglove layout auto-generated for each failed bag: click the link in the PR comment and you're looking at the failure.
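The dedupe behavior of a content-addressed store can be sketched in a few lines. This is a minimal illustration, not the harness's actual store: `BagStore`, `put`, and `get` are hypothetical names, the store lives in memory, and FNV-1a stands in for the cryptographic digest (for example SHA-256) a real store would want.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// FNV-1a 64-bit hash. Assumption: a stand-in for a cryptographic digest,
// chosen only so the sketch needs no external library.
inline std::uint64_t fnv1a64(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Hypothetical in-memory bag store keyed by the hash of the bag's bytes.
class BagStore {
public:
    // Stores the bag under its content hash and returns the key.
    // A bag whose bytes are already present is not stored a second time.
    std::uint64_t put(std::string bytes) {
        const std::uint64_t key = fnv1a64(bytes);
        blobs_.try_emplace(key, std::move(bytes));
        return key;
    }

    // Returns the stored bytes, or nullptr if the key is unknown.
    const std::string* get(std::uint64_t key) const {
        const auto it = blobs_.find(key);
        return it == blobs_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return blobs_.size(); }

private:
    std::unordered_map<std::uint64_t, std::string> blobs_;
};
```

Because the key is derived from the bytes, two regressions that prune down to the same bag share one entry, and the PR comment can link to the bag by key.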
03 · The hard part
Pruning without lying.
The hardest bug we hit was a pruner that dropped a topic the planner did not subscribe to but the perception node did. The pruned bag replayed clean and the planner passed, a false pass, because perception never saw the input behind the original failure. We added a strict mode that records the full subscription graph and refuses to prune anything reachable from the unit under test. That single change moved reproducibility from “mostly works” to “trusted by the lab.”
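The strict-mode rule amounts to a reachability walk over the subscription graph. The sketch below assumes the graph has already been recorded as two maps; `SubscriptionGraph`, `required_topics`, `can_prune`, and the topic names in the usage note are all hypothetical, not the harness's API.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical recorded graph: which topics each node subscribes to,
// and which nodes publish each topic.
struct SubscriptionGraph {
    std::map<std::string, std::vector<std::string>> subscribes;  // node  -> topics
    std::map<std::string, std::vector<std::string>> publishers;  // topic -> nodes
};

// Breadth-first walk upstream from the unit under test. Every topic
// subscribed to by the unit, or by any node feeding it, is required.
std::set<std::string> required_topics(const SubscriptionGraph& g,
                                      const std::string& unit_under_test) {
    std::set<std::string> topics;
    std::set<std::string> seen_nodes{unit_under_test};
    std::queue<std::string> frontier;
    frontier.push(unit_under_test);

    while (!frontier.empty()) {
        const std::string node = frontier.front();
        frontier.pop();
        const auto subs = g.subscribes.find(node);
        if (subs == g.subscribes.end()) continue;
        for (const std::string& topic : subs->second) {
            topics.insert(topic);
            const auto pubs = g.publishers.find(topic);
            if (pubs == g.publishers.end()) continue;
            for (const std::string& upstream : pubs->second) {
                // insert().second is true only the first time a node is seen
                if (seen_nodes.insert(upstream).second) frontier.push(upstream);
            }
        }
    }
    return topics;
}

// Strict mode: a topic may be pruned only if it is not required.
bool can_prune(const std::set<std::string>& required, const std::string& topic) {
    return required.count(topic) == 0;
}
```

With a planner that subscribes to `/perception/objects` and a perception node that publishes it while subscribing to `/camera/image` and `/v2x/bsm`, the walk from the planner keeps all three topics, so the perception-only inputs can no longer be dropped.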
04 · The result
Lead time collapsed.
- 97 → 11 min reproduction: median time-to-reproduce.
- 28,000 nightly tests: gating PRs against the planner.
- −88% lead time: bug filed → fix merged.
- 0 flakes accepted: strict-mode contract on the pruner.
05 · The artifact
What it produces.
- Per-PR comment with reproduction link, failure clip, and discordance trace.
- Nightly leaderboard of the slowest scenarios, the lab's “hot bags.”
- Foxglove layouts authored once, regenerated on every failure.
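The "hot bags" leaderboard is essentially a top-N selection by replay time. A minimal sketch, assuming each nightly run yields a bag identifier and a wall-clock duration; `ScenarioRun` and `hot_bags` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one nightly replay.
struct ScenarioRun {
    std::string bag_id;   // identifier of the replayed bag
    double wall_seconds;  // wall-clock time the replay took
};

// Returns the n slowest runs, slowest first. Takes the vector by value
// so the caller's data is left untouched.
std::vector<ScenarioRun> hot_bags(std::vector<ScenarioRun> runs, std::size_t n) {
    n = std::min(n, runs.size());
    const auto middle = runs.begin() + static_cast<std::ptrdiff_t>(n);
    // partial_sort orders only the first n elements, which is all we keep.
    std::partial_sort(runs.begin(), middle, runs.end(),
                      [](const ScenarioRun& a, const ScenarioRun& b) {
                          return a.wall_seconds > b.wall_seconds;
                      });
    runs.resize(n);
    return runs;
}
```

`std::partial_sort` avoids fully sorting tens of thousands of nightly results when only the first few rows are shown.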
06 · The reflection
What I'd add.
A coverage map across the scenario taxonomy. Right now we know how fast the harness is and how much it catches; we do not yet know where the holes are in the scenario space. Building that map is next semester's project.