KD

Canyon Exchange · C++ · Linux · Low-latency systems

A market that lives below 100 microseconds.

A from-scratch matching engine designed for honesty, not benchmarks. Lock-free ring buffers, hugepages, core pinning, io_uring on the order ingress path, and a measurement methodology I am willing to defend in a room full of strangers.

8.7M msgs / sec · 87µs p99 latency · 0 allocs in hot path · io_uring ingress

C++20 · Linux · io_uring · eBPF · perf · flamegraph · hugepages · NUMA · lock-free

01 · The scene

The price-time-priority story most people skip.

Most matching engines sold as “low latency” pass that test only with a sanitized load. They batch. They drop the tail. They report p50 and call it a benchmark. I wanted a market that behaved the same way under pressure as it did at idle, because the parts of trading I find interesting are not the easy ones.

Canyon Exchange started as a private experiment. It became the project I'd hand to a recruiter who asks “show me real systems work.”

02 · The system

Architecture in three layers.

A single-writer matching core sits behind a multi-producer / single-consumer ingress ring. Every order arrives via io_uring, deserializes into a slab-allocated struct, and lands in a per-instrument shard with strict price-time priority. Outbound fills route through an SPMC ring back to a network thread that fans out to clients.

C++ · order book shard
// Each shard owns a side. No locks across shards.
struct alignas(64) Shard {
  PriceLevel levels[MAX_LEVELS];     // contiguous, cacheline-aligned
  RingBuffer<Order, 1<<20> ingress;  // SPSC, lock-free
  RingBuffer<Fill,  1<<20> egress;
  // Match runs on a pinned core; ingress is fed by io_uring.
  void match() noexcept;
};
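
For readers who have not written one, here is a minimal sketch of a single-producer / single-consumer ring of the general kind the shard refers to. It is not the project's RingBuffer: the name SpscRing, the try_push / try_pop interface, and the 64-byte cacheline size are assumptions made for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <type_traits>

// Hypothetical SPSC ring (illustrative only, not the project's RingBuffer).
// Exactly one thread may call try_push and exactly one may call try_pop.
template <typename T, std::size_t N>
class SpscRing {
  static_assert(N >= 2 && (N & (N - 1)) == 0, "capacity must be a power of two");
  static_assert(std::is_trivially_copyable_v<T>, "slots are copied by value");

 public:
  // Producer side: returns false when the ring is full.
  bool try_push(const T& value) noexcept {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == N) return false;                // full
    slots_[head & (N - 1)] = value;
    head_.store(head + 1, std::memory_order_release);  // publish the slot
    return true;
  }

  // Consumer side: returns std::nullopt when the ring is empty.
  std::optional<T> try_pop() noexcept {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return std::nullopt;             // empty
    T value = slots_[tail & (N - 1)];
    tail_.store(tail + 1, std::memory_order_release);  // hand the slot back
    return value;
  }

 private:
  // Each index sits on its own cacheline (64 bytes assumed) so the two
  // threads do not false-share.
  alignas(64) std::atomic<std::size_t> head_{0};  // written by the producer only
  alignas(64) std::atomic<std::size_t> tail_{0};  // written by the consumer only
  alignas(64) T slots_[N];
};
```

The indices only ever grow and are masked on access, so full and empty are distinguished by head - tail without wasting a slot. A ring with 1<<20 entries is large, so an instance would normally live in static or heap storage, not on the stack.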

The matching loop never allocates. Every transient lives in arenas reset between batches. The ingress thread is pinned to one core; the matcher to another on the same NUMA node. Hugepages back the rings to keep TLB misses out of the histogram.
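
The "arenas reset between batches" idea can be sketched as a bump allocator: one allocation at startup, pointer arithmetic afterwards, and an O(1) reset. This is an illustrative sketch under assumptions, not the engine's code; the Arena class, its method names, and the choice of a heap-backed buffer (where the real system might use hugepage-backed memory) are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical bump arena (illustrative only). It allocates once in the
// constructor and never again. Objects placed in it must be trivially
// destructible, because reset() does not run destructors.
class Arena {
 public:
  explicit Arena(std::size_t bytes)
      : buf_(std::make_unique<std::byte[]>(bytes)), cap_(bytes) {}

  // Returns `bytes` of storage aligned to `align` (must be a power of two),
  // or nullptr when the arena is exhausted.
  void* alloc(std::size_t bytes,
              std::size_t align = alignof(std::max_align_t)) noexcept {
    const auto base = reinterpret_cast<std::uintptr_t>(buf_.get());
    const auto mask = static_cast<std::uintptr_t>(align) - 1;
    const std::uintptr_t aligned = (base + used_ + mask) & ~mask;
    const std::size_t start = static_cast<std::size_t>(aligned - base);
    if (start > cap_ || bytes > cap_ - start) return nullptr;
    used_ = start + bytes;
    return buf_.get() + start;
  }

  // Called between batches: drops every transient at once.
  void reset() noexcept { used_ = 0; }

  std::size_t used() const noexcept { return used_; }

 private:
  std::unique_ptr<std::byte[]> buf_;
  std::size_t cap_;
  std::size_t used_ = 0;
};
```

In a loop shaped like the one described, each batch would call alloc() for its scratch data and finish with reset(), so the allocator never appears in the steady-state profile.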

03 · The hard part

Tail latency is a moral position.

The interesting numbers are not throughput; throughput is just “more cores, more shards.” The interesting number is what happens at the worst microsecond of the day.

Three things moved p99 the most:

  1. Pulling order metadata into a fixed-size cacheline-aligned struct so a level scan stays in L2.
  2. Switching ingress from a write-then-notify futex pattern to io_uring with kernel polling, which eliminated wake-up jitter.
  3. Removing the end-of-list branch from the level walk by sentinel-terminating the linked list, so the branch predictor had one less thing to guess.
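
The third item can be illustrated with a small sketch. It assumes an ask-side list ordered from best to worst price and a sentinel node whose price no order can reach. The Level layout, kAskSentinel, and crossing_qty are hypothetical names, not taken from the engine.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical ask-side level node; field names are illustrative.
struct Level {
  std::int64_t price;  // in ticks
  std::int64_t qty;    // resting quantity at this price
  Level* next;         // next-worse price; the sentinel points to itself
};

// Assumption: real limit prices are always strictly below this value.
inline constexpr std::int64_t kSentinelPrice =
    std::numeric_limits<std::int64_t>::max();

// Every list ends in this node, so `next` is never nullptr.
inline Level kAskSentinel{kSentinelPrice, 0, &kAskSentinel};

// Total resting quantity a buy order with limit price `limit` would cross.
// The only loop condition is the price comparison: the sentinel's price
// fails it, so there is no separate end-of-list test per node.
inline std::int64_t crossing_qty(const Level* best, std::int64_t limit) noexcept {
  assert(limit < kSentinelPrice);
  std::int64_t total = 0;
  for (const Level* lvl = best; lvl->price <= limit; lvl = lvl->next) {
    total += lvl->qty;
  }
  return total;
}
```

An empty book is simply a list whose head is the sentinel, so the empty case needs no special handling either.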

Beyond those three, returns diminished. I could have squeezed out another few microseconds with DPDK or another kernel-bypass stack, but I want this engine to be honest first and faster later.

04 · The result

Measured, not estimated.

8.7M msgs / sec · Single shard, single matcher core
87µs p99 · Order in → ack out, end-to-end
46µs p50 · Same path, same harness
0 allocs / hot path · Verified with tcmalloc + perf
< 0.5% CV across runs · 10× 30-second runs, warm cache
flamegraph evidence · Linked in repo, methodology in README

Result

The top of the flamegraph is now match() itself. The system is finally spending its time on the thing it's supposed to do.
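
For reference, the summary statistics quoted in this section can be derived from raw samples along the following lines. This is a sketch under stated assumptions, not the repo's harness: it assumes nearest-rank percentiles over per-order latencies and takes CV to mean the sample standard deviation divided by the mean of a per-run metric. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
// Preconditions: samples is non-empty and 0 < p <= 100.
// Takes the vector by value because it sorts its own copy.
inline double percentile(std::vector<double> samples, double p) {
  std::sort(samples.begin(), samples.end());
  const double n = static_cast<double>(samples.size());
  const auto rank = static_cast<std::size_t>(std::ceil(p * n / 100.0));
  return samples[std::max<std::size_t>(rank, 1) - 1];
}

// Coefficient of variation: sample standard deviation divided by the mean.
// Preconditions: at least two values and a non-zero mean.
inline double coefficient_of_variation(const std::vector<double>& xs) {
  const double n = static_cast<double>(xs.size());
  double mean = 0.0;
  for (double x : xs) mean += x;
  mean /= n;
  double sum_sq = 0.0;
  for (double x : xs) sum_sq += (x - mean) * (x - mean);
  return std::sqrt(sum_sq / (n - 1.0)) / mean;
}
```

With one value per run (for example each run's p99), a CV below 0.005 corresponds to the "< 0.5%" figure. Nearest-rank is only one of several percentile definitions, so a harness that interpolates will report slightly different tails.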

05 · The artifact

What you can actually see.

06 · The reflection

What I'd do next.

The next move is kernel bypass. DPDK on the network path, then a deeper cacheline-density analysis on the level structure. After that, a replay harness that captures real packet streams from a public exchange feed and lets me regress against historical volatility, not synthetic load. The system has earned a real adversary.