The Kalyanam Archive

01 · The scene

The eval harness was the product.

Most teams ship AI features by vibes. A prompt change feels good, gets merged, and breaks something three days later that nobody can reproduce. We built a grading harness before we built the assistant. Every change had to pass it to land.

That decision changed everything downstream — what we measured, what we cached, what we routed where, and what we were willing to ship.

02 · The system

Tool-calling agent over a graded RAG layer.

Inbound questions arrive on a WebSocket. A router classifies intent (retrieve, act, escalate) and dispatches to one of several tool subgraphs. Retrieval is pgvector with a learned re-ranker. Tools are typed, idempotent, and traced through OpenTelemetry. Every response carries a citation set; the assistant cannot make a claim it cannot point to a source for.
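A minimal sketch of the route-then-dispatch flow described above. Only the three intent labels come from the text. The keyword classifier, the stub subgraphs, the citation ids, and the choice to escalate when an answer has no citations are all assumptions made to keep the example runnable; a production router would use a model rather than regexes.

```typescript
// Intent labels are from the text; every other name here is hypothetical.
type Intent = 'retrieve' | 'act' | 'escalate';

type Answer = { text: string; citations: string[] };

type Subgraph = (question: string) => Answer;

// Stand-in classifier: keyword rules instead of a model, purely for illustration.
function classifyIntent(question: string): Intent {
  const q = question.toLowerCase();
  if (/\b(complaint|legal|harass)/.test(q)) return 'escalate';
  if (/\b(schedule|cancel|update|book)\b/.test(q)) return 'act';
  return 'retrieve';
}

// Stub subgraphs with made-up citation ids.
const subgraphs: Record<Intent, Subgraph> = {
  retrieve: (q) => ({ text: `Found a source for: ${q}`, citations: ['doc:example-1'] }),
  act: (q) => ({ text: `Done: ${q}`, citations: ['tool:example-calendar'] }),
  escalate: () => ({ text: 'Routing you to a person.', citations: ['policy:example-escalation'] }),
};

// Assumed policy: an answer with an empty citation set is replaced by an escalation.
function handle(question: string): { intent: Intent; answer: Answer } {
  const intent = classifyIntent(question);
  const answer = subgraphs[intent](question);
  if (answer.citations.length === 0) {
    return { intent: 'escalate', answer: subgraphs.escalate(question) };
  }
  return { intent, answer };
}
```

Checking the citation set at the dispatch boundary, rather than inside each subgraph, keeps the grounding rule in one place.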

Tool schema: typed and idempotent

type LookupCandidate = {
  name: 'lookup_candidate';
  args: { id: string };
  side_effect: false;
  cache: 'ttl_5m';
};

type EscalateToHuman = {
  name: 'escalate';
  args: { reason: 'low_confidence' | 'sensitive' };
  side_effect: true;
  cache: 'never';
};
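One way an executor could honor the `side_effect` and `cache` fields on the two tool types above. This is a sketch, not the production executor: `ToolExecutor`, `ToolSpec`, `TTL_MS`, and the injectable clock are hypothetical names, and tool implementations are synchronous here for brevity.

```typescript
// Field names (side_effect, cache) and policy values come from the schema above;
// the executor itself is an illustrative assumption.
type CachePolicy = 'ttl_5m' | 'never';

type ToolSpec = { name: string; side_effect: boolean; cache: CachePolicy };

const TTL_MS: Record<CachePolicy, number> = { ttl_5m: 5 * 60 * 1000, never: 0 };

class ToolExecutor {
  private cache = new Map<string, { value: unknown; expiresAt: number }>();
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  run<A, R>(spec: ToolSpec, args: A, impl: (args: A) => R): R {
    // Side-effecting tools are never served from cache, whatever the policy says.
    const cacheable = !spec.side_effect && spec.cache !== 'never';
    if (!cacheable) return impl(args);

    // Cache key is the tool name plus its serialized arguments.
    const key = `${spec.name}:${JSON.stringify(args)}`;
    const hit = this.cache.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value as R;

    const value = impl(args);
    this.cache.set(key, { value, expiresAt: this.now() + TTL_MS[spec.cache] });
    return value;
  }
}
```

Under these assumptions, a repeated `lookup_candidate` call with the same `id` inside five minutes reuses the cached value, while `escalate` always runs.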

Routing is the thing most teams under-build. We wrote a small classifier that picked the cheapest model that could pass the eval suite for the question class. That alone accounted for most of the inference savings.
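The selection rule can be sketched as a lookup over prior harness results: sort candidate models by cost and take the first one whose pass rate for the question class clears a threshold. The question classes, the cost field, the threshold, and `cheapestPassingModel` are all hypothetical; the text does not say how the production classifier stored or consumed eval results.

```typescript
// Every name and field below is illustrative, not taken from the production system.
type ModelOption = { model: string; costPer1kTokens: number };

type QuestionClass = 'faq' | 'status_lookup' | 'policy_reasoning';

// passRates[model][questionClass] = share of eval cases passed in a prior harness run.
type PassRates = Record<string, Record<QuestionClass, number>>;

function cheapestPassingModel(
  options: ModelOption[],
  passRates: PassRates,
  questionClass: QuestionClass,
  threshold: number,
): string | undefined {
  return [...options]
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens) // cheapest first
    .find((o) => (passRates[o.model]?.[questionClass] ?? 0) >= threshold)?.model;
}
```

Returning `undefined` when no model clears the bar leaves the fallback (escalate, or use the largest model anyway) as an explicit decision for the caller.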

03 · The hard part

Latency and grounding pull in opposite directions.

Adding retrieval improves the grounded-resolution rate but adds latency. Adding tool calls improves correctness, but each sequential call adds its own tail, so tail latency grows roughly with the number of tools invoked. Caching helps, but only for the most popular questions, and that set shifts every week. The eval harness let us tune all of these against each other instead of guessing.
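To make "tune against each other" concrete, here is a sketch that picks the configuration with the best grounded rate among those that fit a p99 latency budget. The record shape, the budget, and `bestUnderBudget` are assumptions for illustration; the text does not describe how the team actually compared configurations.

```typescript
// Hypothetical shape of one harness run for one configuration.
type ConfigResult = {
  name: string;
  groundedRate: number; // share of answers backed by citations, in [0, 1]
  p99Ms: number; // end-to-end p99 latency in milliseconds
};

// Keep only configurations inside the latency budget, then take the best grounded rate.
function bestUnderBudget(results: ConfigResult[], p99BudgetMs: number): ConfigResult | undefined {
  return results
    .filter((r) => r.p99Ms <= p99BudgetMs)
    .reduce<ConfigResult | undefined>(
      (best, r) => (best === undefined || r.groundedRate > best.groundedRate ? r : best),
      undefined,
    );
}
```

Treating latency as a hard constraint and grounding as the objective is one choice among several; the roles could equally be swapped.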

Note

We treat eval categories the way SREs treat SLOs. A regression on grounded resolution blocks a release the same way a p99 regression does.
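A release gate in this spirit might compare per-category pass rates on a candidate change against a stored baseline and fail if any category drops. This is a sketch under assumptions: the category names, the `tolerance` allowance for run-to-run noise, and `evalGate` itself are hypothetical.

```typescript
// category -> pass rate in [0, 1]
type CategoryScores = Record<string, number>;

type GateResult = { pass: boolean; regressions: string[] };

// `tolerance` is an assumed allowance for noise between harness runs.
function evalGate(
  baseline: CategoryScores,
  candidate: CategoryScores,
  tolerance = 0.01,
): GateResult {
  const regressions: string[] = [];
  for (const [category, base] of Object.entries(baseline)) {
    const score = candidate[category];
    // A category missing from the candidate run counts as a regression.
    if (score === undefined || score < base - tolerance) {
      regressions.push(category);
    }
  }
  return { pass: regressions.length === 0, regressions };
}
```

In CI, a non-empty `regressions` list would map to a failing check, so a drop in any single category blocks the merge even when the others improve.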

04 · The result

Numbers that survived a quarter.

11,000+ weekly chats (across recruiter & candidate surfaces)
83 → 95% grounded resolution (citation-backed answers)
−34% median latency (p50, end-to-end)
−38% inference spend (routing + caching + pruned evals)
+19% self-service completion (without escalation)
silent drift (eval harness on every PR)

Result

The biggest unlock was not a prompt or a model. It was making bad changes painful to merge.

05 · The artifact

A no-confidential-data demo.

The production assistant lives behind enterprise auth. The portfolio demo runs on synthetic recruiting data with the same routing, retrieval, and eval shape. You can inspect the citation graph and trace a single request all the way through.

/lab → AI Product Cockpit — interactive tool-call graph.
Eval harness pattern open-sourced as a starter repo (link on GitHub).

06 · The reflection

What I'd build differently.

Two things. First, treat the eval set as a versioned product — it deserves the same review rigor as the model. Second, invest earlier in negative evals: the classes of question we want the assistant to refuse, escalate, or remain silent on. Those evals catch more drift than positive ones do.
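Both ideas can be sketched together: an eval set that carries a version, and cases whose expected behavior is something other than answering. The `Expected` labels mirror the refuse, escalate, and stay-silent classes named above; the types, the version string format, and `runEvalSet` are hypothetical.

```typescript
// 'answer' marks a positive case; the other three are negative evals.
type Expected = 'answer' | 'refuse' | 'escalate' | 'silent';

type EvalCase = { id: string; question: string; expected: Expected };

// The version travels with the cases so results can be tied to an exact eval set.
type EvalSet = { version: string; cases: EvalCase[] };

// Stand-in for the assistant: maps a question to the behavior it exhibited.
type Assistant = (question: string) => Expected;

function runEvalSet(set: EvalSet, assistant: Assistant) {
  const failures = set.cases
    .filter((c) => assistant(c.question) !== c.expected)
    .map((c) => c.id);
  const negatives = set.cases.filter((c) => c.expected !== 'answer');
  const negativeFailures = negatives.filter((c) => failures.includes(c.id)).length;
  return {
    version: set.version,
    failures,
    // Reported separately so drift on refusals is not hidden by positive cases.
    negativePassRate: negatives.length === 0 ? 1 : 1 - negativeFailures / negatives.length,
  };
}
```

Reporting the negative pass rate on its own line makes an assistant that starts answering everything show up immediately, even if its positive cases still pass.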