Blue Yonder · Data infrastructure

140 terabytes a month, observed.

Throughput +2.1×, compute cost −27%, and an incident triage path that collapsed from 41 minutes to 12. The kind of work that does not screenshot well but shows up in oncall calm.

  • 170M records / day
  • 140 TB monthly volume
  • +2.1× throughput
  • 41 → 12 min triage

Spark · Apache Arrow · Kafka · Flink · Iceberg · Airflow · OpenTelemetry · Grafana

01 · The scene

The bill that made the case.

The first conversation about this work was a finance review of compute spend. The number was real and the trend was heading the wrong way: the team had a pile of jobs that had grown slower over time without anyone catching it. Nobody's fault; Spark gives you so many ways to be slow that you cannot watch them all.

02 · The system

Two streams: the pipeline, and its observability.

The pipeline itself: Kafka → Flink stateful enrichment → Iceberg landing → Spark transformation → analytics warehouse. Volume is roughly 140 TB / month across feature and event streams. The work I owned was making it observable, then making it cheap.

Observability:

  • Per-job SLO dashboards with throughput, cost, and freshness as first-class metrics.
  • Schema-drift detector emitting Kafka events when an upstream changes shape.
  • Run-time DAG that tags every Spark stage with the upstream contract it depends on.
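
To make the schema-drift idea concrete, here is a minimal sketch of the comparison step, not the production detector. It assumes a contract can be reduced to a flat mapping of field name to type name; `DriftEvent`, `detect_drift`, and the sample fields are hypothetical names for illustration. Publishing to Kafka is left out, and the JSON line only shows what an event payload might look like.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DriftEvent:
    """One difference between the expected contract and the observed schema."""
    field: str
    kind: str  # "added", "removed", or "type_changed"
    before: Optional[str]
    after: Optional[str]


def detect_drift(expected: Dict[str, str], observed: Dict[str, str]) -> List[DriftEvent]:
    """Compare two flat schemas and return one event per changed field."""
    events = []
    # Walk the union of field names in sorted order so output is deterministic.
    for name in sorted(expected.keys() | observed.keys()):
        before = expected.get(name)
        after = observed.get(name)
        if before is None:
            events.append(DriftEvent(name, "added", None, after))
        elif after is None:
            events.append(DriftEvent(name, "removed", before, None))
        elif before != after:
            events.append(DriftEvent(name, "type_changed", before, after))
    return events


# Hypothetical contract and upstream schema, for illustration only.
contract = {"user_id": "string", "amount": "double", "ts": "timestamp"}
observed = {"user_id": "string", "amount": "string", "ts": "timestamp", "channel": "string"}

events = detect_drift(contract, observed)
for event in events:
    # In a real detector this payload would go to a Kafka topic instead of stdout.
    print(json.dumps(asdict(event)))
```

An unchanged schema yields an empty list, so the caller only has something to emit when an upstream actually changes shape.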

03 · The hard part

Cost is hidden until you make it visible.

Most of the wins came from columnar work. Switching feature generation from row-based UDFs to Arrow-aware vectorized expressions unblocked the planner. After that, an honest review of skew exposed a small handful of partitions doing 80% of the work; rebalancing those was a one-line change once we could see them.
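
The skew review can be sketched as a small calculation over per-partition row counts. This is an illustration under assumptions, not the tooling used on the project: `skew_report` is a hypothetical helper, and the counts are made-up numbers chosen so that two partitions carry 80% of the rows. In Spark, counts like these could be gathered by grouping on `spark_partition_id()`, for example.

```python
from typing import Dict, List, Tuple


def skew_report(rows_per_partition: Dict[int, int], top_n: int = 3) -> Tuple[List[int], float]:
    """Return the top_n heaviest partition ids and their share of all rows."""
    total = sum(rows_per_partition.values())
    if total == 0:
        return [], 0.0
    # Sort partition ids by row count, heaviest first, and keep the top_n.
    heaviest = sorted(rows_per_partition, key=rows_per_partition.get, reverse=True)[:top_n]
    share = sum(rows_per_partition[p] for p in heaviest) / total
    return heaviest, share


# Made-up row counts per partition id, for illustration only.
counts = {0: 50, 1: 4200, 2: 50, 3: 3800, 4: 100, 5: 100, 6: 100, 7: 100, 8: 1000, 9: 500}

heaviest, share = skew_report(counts, top_n=2)
print(f"partitions {heaviest} hold {share:.0%} of rows")
```

A share far above `top_n / len(counts)` is the signal worth putting on a dashboard; how to rebalance once it is visible depends on the job.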

Lesson

Performance work in data pipelines is mostly visibility work. Once the cost curve is visible per job, optimization is a calm Tuesday afternoon.

04 · The result

Throughput, cost, and oncall.

  • 170M records / day: quality / SLO dashboards
  • +2.1× throughput: Spark + Arrow rewrite
  • −27% compute cost: across 140 TB / month
  • 41 → 12 min triage: incident detected → root cause located
  • < 5 min drift detection: schema-drift detector
  • 0 silent freshness misses: after SLO instrumentation

05 · The artifact

Dashboards, runbooks, and a quieter pager.

  • SLO dashboard template (Grafana).
  • Schema-drift detector pattern.
  • Spark / Arrow rewrite playbook for the team.

06 · The reflection

What I'd add.

A first-class lineage layer. Right now we track contracts per stage, but we do not have an end-to-end provenance trail from source event to warehouse fact table. With that trail in place, I would expect incident triage to drop from 12 minutes to about 2.