Blue Yonder · Data infrastructure
140 terabytes a month, observed.
Throughput +2.1×, compute cost −27%, and an incident triage path that collapsed from 41 minutes to 12. The kind of work that does not screenshot well but shows up in oncall calm.
170M
records / day
140 TB
monthly volume
+2.1×
throughput
41 → 12m
triage
01 · The scene
The bill that made the case.
The first conversation about this work was a finance review of compute spend. The number was real, the trend was wrong, and the team had a pile of jobs that had grown slower over time without anyone catching it. Nobody's fault — Spark gives you so many ways to be slow that you cannot watch them all.
02 · The system
Two streams: the pipeline, and its observability.
The pipeline itself: Kafka → Flink stateful enrichment → Iceberg landing → Spark transformation → analytics warehouse. Volume is roughly 140 TB / month across feature and event streams. The work I owned was making it observable, then making it cheap.
Observability:
- Per-job SLO dashboards with throughput, cost, and freshness as first-class metrics.
- Schema-drift detector emitting Kafka events when an upstream changes shape.
- Run-time DAG that tags every Spark stage with the upstream contract it depends on.
03 · The hard part
Cost is hidden until you make it visible.
Most of the wins came from columnar work. Switching feature generation from row-based UDFs to Arrow-aware vectorized expressions unblocked the planner. After that, an honest review of skew exposed a small handful of partitions doing 80% of the work; rebalancing those was a one-line change once we could see them.
Lesson
04 · The result
Throughput, cost, and oncall.
170M
records / day
Quality / SLO dashboards
+2.1×
throughput
Spark + Arrow rewrite
−27%
compute cost
Across 140 TB / month
41 → 12 min
triage
Incident detected → root cause located
< 5 min
drift detection
Schema-drift detector
0
silent freshness misses
After SLO instrumentation
05 · The artifact
Dashboards, runbooks, and a quieter pager.
- SLO dashboard template (Grafana).
- Schema-drift detector pattern.
- Spark / Arrow rewrite playbook for the team.
06 · The reflection
What I'd add.
A first-class lineage layer. Right now we track contracts per stage, but we do not have an end-to-end provenance trail from source event to warehouse fact table. That would change incident triage from 12 minutes to 2.