Paradox · AI product engineering
Production AI that gets graded like a pull request.
Eval-driven recruiter and candidate assistant flows. RAG, tool-calling, routing, caching, and an evaluation harness that gates every prompt change. The results are not anecdotal — they are CI-defended.
11k+ weekly chats
83 → 95% grounded resolution
−34% median latency
−38% inference spend
01 · The scene
The eval harness was the product.
Most teams ship AI features by vibes. A prompt change feels good, gets merged, and breaks something three days later that nobody can reproduce. We built a grading harness before we built the assistant. Every change had to pass it to land.
That decision changed everything downstream — what we measured, what we cached, what we routed where, and what we were willing to ship.
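A minimal sketch of what such a gate could look like, assuming a hypothetical case format in which each eval case lists the sources an answer must cite. The names `EvalCase`, `Answer`, `isGrounded`, and `runGate` are illustrative and not taken from the production harness, and the assistant is modeled as a synchronous function to keep the example short.

```typescript
// Hypothetical shapes for illustration; the real harness is not shown in this write-up.
type EvalCase = { id: string; question: string; mustCite: string[] };
type Answer = { text: string; citations: string[] };
type Assistant = (question: string) => Answer;
type GateResult = { passRate: number; failures: string[]; passed: boolean };

// A case passes when every required source appears in the answer's citation set.
function isGrounded(c: EvalCase, a: Answer): boolean {
  return c.mustCite.every((src) => a.citations.includes(src));
}

// Run every case, collect the failing ids, and compare the pass rate to a threshold.
function runGate(cases: EvalCase[], assistant: Assistant, minPassRate: number): GateResult {
  const failures = cases
    .filter((c) => !isGrounded(c, assistant(c.question)))
    .map((c) => c.id);
  // An empty suite counts as a failed gate rather than a vacuous pass.
  const passRate = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { passRate, failures, passed: passRate >= minPassRate };
}
```

In CI, a job along these lines would exit non-zero whenever `passed` is false, which is what turns a prompt change into something that can be blocked like any other failing test.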
02 · The system
Tool-calling agent over a graded RAG layer.
Inbound questions arrive over a WebSocket. A router classifies intent (retrieve, act, escalate) and dispatches to one of several tool subgraphs. Retrieval is pgvector with a learned re-ranker. Tools are typed, idempotent, and traced through OpenTelemetry. Every response carries a citation set; the assistant cannot make a claim it cannot point to a source for.
Tool schema: typed and idempotent

type LookupCandidate = {
  name: 'lookup_candidate';
  args: { id: string };
  side_effect: false;
  cache: 'ttl_5m';
};

type EscalateToHuman = {
  name: 'escalate';
  args: { reason: 'low_confidence' | 'sensitive' };
  side_effect: true;
  cache: 'never';
};

Routing is the thing most teams under-build. We wrote a small classifier that picked the cheapest model that could pass the eval suite for the question class. That alone accounted for most of the inference savings.
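The routing idea can be sketched as a lookup over per-class eval results. This is an assumption-laden illustration, not the production classifier: the `ModelOption` shape, the `pickModel` helper, the model names, costs, and pass rates are all hypothetical, and the question classes are assumed to match the router's three intents.

```typescript
// Hypothetical catalog; names, costs, and pass rates are made up for illustration.
type QuestionClass = 'retrieve' | 'act' | 'escalate';
type ModelOption = {
  name: string;
  costPer1kTokens: number;
  evalPassRate: Record<QuestionClass, number>; // assumed to come from the eval suite
};

// Return the cheapest model whose eval pass rate for this class clears the bar,
// or undefined when no model qualifies.
function pickModel(
  models: ModelOption[],
  cls: QuestionClass,
  minPassRate: number,
): ModelOption | undefined {
  return models
    .filter((m) => m.evalPassRate[cls] >= minPassRate) // filter copies, so sort leaves the input untouched
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens)[0];
}

const catalog: ModelOption[] = [
  { name: 'small', costPer1kTokens: 0.1, evalPassRate: { retrieve: 0.97, act: 0.8, escalate: 0.99 } },
  { name: 'large', costPer1kTokens: 1.0, evalPassRate: { retrieve: 0.99, act: 0.96, escalate: 0.99 } },
];
```

With this catalog and a 0.95 bar, a `retrieve` question would go to `small` and an `act` question to `large`. When nothing qualifies, a caller would need its own fallback, for example escalating to a human.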
03 · The hard part
Latency and grounding pull in opposite directions.
Adding retrieval improves the grounded-resolution rate but adds latency. Adding tool calls improves correctness, but each sequential call adds its own latency, so tail latency grows with the number of tools invoked. Caching helps, but only on the popular head of the question distribution, and that head shifts every week. The eval harness let us tune all of these against each other instead of guessing.
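As one illustration of the caching side of that trade-off, here is a small sketch of a tool-result cache that honors the `'ttl_5m'` and `'never'` policies from the tool schema. The `createToolCache` helper, its injectable clock, and the key format are assumptions made for this example rather than the production design.

```typescript
type CachePolicy = 'ttl_5m' | 'never';
type Entry = { value: unknown; expiresAt: number };

const TTL_MS = 5 * 60 * 1000; // five minutes, matching 'ttl_5m'

// `now` is injectable so expiry can be exercised without waiting on a real clock.
function createToolCache(now: () => number = () => Date.now()) {
  const entries = new Map<string, Entry>();
  let hits = 0;
  let misses = 0;

  function call<T>(tool: string, args: Record<string, string>, policy: CachePolicy, run: () => T): T {
    // Side-effecting tools are marked 'never' and always execute.
    if (policy === 'never') return run();
    // Note: JSON.stringify is sensitive to key order; fine for single-argument tools.
    const key = `${tool}:${JSON.stringify(args)}`;
    const hit = entries.get(key);
    if (hit !== undefined && hit.expiresAt > now()) {
      hits += 1;
      return hit.value as T;
    }
    misses += 1;
    const value = run();
    entries.set(key, { value, expiresAt: now() + TTL_MS });
    return value;
  }

  return { call, stats: () => ({ hits, misses }) };
}
```

Tracking hits and misses per tool makes it possible to see when the set of frequently asked questions has shifted and the cache has stopped paying for itself.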
04 · The result
Numbers that survived a quarter.
- 11,000+ weekly chats (across recruiter & candidate surfaces)
- 83 → 95% grounded resolution (citation-backed answers)
- −34% median latency (p50, end-to-end)
- −38% inference spend (routing + caching + pruned evals)
- +19% self-service completion (without escalation)
- 0 silent drift (eval harness on every PR)
05 · The artifact
A no-confidential-data demo.
The production assistant lives behind enterprise auth. The portfolio demo runs on synthetic recruiting data with the same routing, retrieval, and eval shape. You can inspect the citation graph and trace a single request all the way through.
- /lab → AI Product Cockpit — interactive tool-call graph.
- Eval harness pattern open-sourced as a starter repo (link in GitHub).
06 · The reflection
What I'd build differently.
Two things. First, treat the eval set as a versioned product — it deserves the same review rigor as the model. Second, invest earlier in negative evals: the classes of question we want the assistant to refuse, escalate, or remain silent on. Those evals catch more drift than positive ones do.
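A negative eval could be expressed as a case whose expected outcome is a behavior rather than an answer. The sketch below is hypothetical: the `Behavior` labels mirror the refuse, escalate, and remain-silent classes named above, while `NegativeCase`, `Decide`, and `failedNegatives` are invented for illustration.

```typescript
// Hypothetical labels for what the assistant did with a question.
type Behavior = 'answer' | 'refuse' | 'escalate' | 'silent';

// A negative case never expects an answer.
type NegativeCase = { id: string; question: string; expected: Exclude<Behavior, 'answer'> };

// Stand-in for running the assistant and labeling its behavior.
type Decide = (question: string) => Behavior;

// Return the ids of negative cases where the assistant did something other than expected.
function failedNegatives(cases: NegativeCase[], decide: Decide): string[] {
  return cases.filter((c) => decide(c.question) !== c.expected).map((c) => c.id);
}
```

A gate built on this would typically require zero failures, since a single confident answer to a question that should have been escalated is the kind of drift that matters most.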