Canyon Exchange · C++ · Linux · Low-latency systems
A market that lives below 100 microseconds.
A from-scratch matching engine designed for honesty, not benchmarks. Lock-free ring buffers, hugepages, core pinning, io_uring on the order ingress path, and a measurement methodology I am willing to defend in a room full of strangers.
8.7M msgs / sec
87µs p99 latency
0 allocs in hot path
io_uring ingress
01 · The scene
The price-time-priority story most people skip.
Most matching engines sold as “low latency” earn that label only under a sanitized load. They batch. They drop the tail. They report p50 and call it a benchmark. I wanted a market that behaved the same way under pressure as it did at idle, because the parts of trading I find interesting are not the easy ones.
Canyon Exchange started as a private experiment. It became the project I'd hand to a recruiter who asks “show me real systems work.”
02 · The system
Architecture in three layers.
A single-writer matching core sits behind a multi-producer / single-consumer ingress ring. Every order arrives via io_uring, deserializes into a slab-allocated struct, and lands in a per-instrument shard with strict price-time priority. Outbound fills route through an SPMC ring back to a network thread that fans out to clients.
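To make "strict price-time priority" concrete, here is a minimal sketch of one side of a book. It is an illustration only: `AskBook`, `RestingOrder`, and `Trade` are hypothetical names, and the `std::map` and `std::deque` containers are chosen for readability. They allocate, so this is not how an allocation-free hot path would be laid out.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Hypothetical types for illustration; not the engine's real layout.
struct RestingOrder { std::uint64_t id; std::uint32_t qty; };
struct Trade { std::uint64_t maker_id; std::uint64_t price; std::uint32_t qty; };

// Ask side only: lowest price first, FIFO within a price level.
class AskBook {
public:
    void add(std::uint64_t id, std::uint64_t price, std::uint32_t qty) {
        assert(qty > 0);
        levels_[price].push_back({id, qty});  // arrival order is time priority
    }

    // Match an incoming buy with a limit price against resting asks.
    std::vector<Trade> match_buy(std::uint64_t limit, std::uint32_t qty) {
        std::vector<Trade> trades;
        while (qty > 0 && !levels_.empty()) {
            auto best = levels_.begin();           // best (lowest) ask price
            if (best->first > limit) break;        // nothing left at an acceptable price
            auto& queue = best->second;
            RestingOrder& maker = queue.front();   // oldest order at this price
            const std::uint32_t fill = qty < maker.qty ? qty : maker.qty;
            trades.push_back({maker.id, best->first, fill});
            qty -= fill;
            maker.qty -= fill;
            if (maker.qty == 0) queue.pop_front(); // fully filled maker leaves the queue
            if (queue.empty()) levels_.erase(best);
        }
        return trades;
    }

private:
    std::map<std::uint64_t, std::deque<RestingOrder>> levels_;  // sorted ascending by price
};
```

A buy that crosses several levels consumes the cheapest level first, and within a level the earliest arrival first. An unfilled remainder is simply dropped here; a real engine would rest it on the bid side.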
C++ · order book shard
// Each shard owns a side. No locks across shards.
struct alignas(64) Shard {
    PriceLevel levels[MAX_LEVELS];      // contiguous, cacheline-aligned
    RingBuffer<Order, 1<<20> ingress;   // SPSC, lock-free
    RingBuffer<Fill, 1<<20> egress;
    // Match runs on a pinned core; ingress is fed by io_uring.
    void match() noexcept;
};
The matching loop never allocates. Every transient lives in arenas reset between batches. The ingress thread is pinned to one core; the matcher to another on the same NUMA node. Hugepages back the rings to keep TLB misses out of the histogram.
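The shard struct names a `RingBuffer` type without showing it. The following is a minimal sketch of what a lock-free SPSC ring of that shape could look like, assuming exactly one producer thread and one consumer thread. `SpscRing`, `try_push`, and `try_pop` are hypothetical names, not the repo's API, and hugepage backing is left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical SPSC ring: one thread calls try_push, one other thread calls try_pop.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");

public:
    // Producer side. Returns false when the ring is full.
    bool try_push(const T& value) noexcept {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;
        slots_[head & (N - 1)] = value;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side. Returns false when the ring is empty.
    bool try_pop(T& out) noexcept {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = slots_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // hand the slot back
        return true;
    }

private:
    // Separate cache lines so producer and consumer indices do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};  // written only by the producer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written only by the consumer
    alignas(64) T slots_[N];
};
```

The indices grow monotonically and are masked on access, which keeps full and empty distinguishable without sacrificing a slot. The release store on each index pairs with the acquire load on the other side, so a consumer that sees the new `head_` also sees the slot contents.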
03 · The hard part
Tail latency is a moral position.
The interesting numbers are not throughput; throughput is just “more cores, more shards.” The interesting number is what happens at the worst microsecond of the day.
Three things moved p99 the most:
- Pulling order metadata into a fixed-size cacheline-aligned struct so a level scan stays in L2.
- Switching ingress from a write-then-notify futex pattern to io_uring with kernel polling. Eliminated wake-up jitter.
- Removing branches in the level walk by sentinel-terminating the linked list, so the branch predictor never had to guess.
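The sentinel idea in the last bullet can be sketched as follows. This is an assumption-laden illustration, not the engine's code: `LevelNode`, `kSentinelPrice`, `first_at_or_above`, and `takeable_qty` are hypothetical names, and the list is assumed to be sorted by ascending price and to always end in a sentinel node priced at the maximum value.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical level node, padded to one cache line.
struct alignas(64) LevelNode {
    std::uint64_t price;
    std::uint32_t total_qty;
    LevelNode* next;
};

// The sentinel carries a price no real order can have.
constexpr std::uint64_t kSentinelPrice = std::numeric_limits<std::uint64_t>::max();

// First level priced at or above `price`. The sentinel guarantees termination,
// so the loop has a single price comparison and no null check.
inline const LevelNode* first_at_or_above(const LevelNode* head, std::uint64_t price) noexcept {
    while (head->price < price) head = head->next;
    return head;
}

// Total quantity an incoming buy at `limit` could take from the ask list.
inline std::uint64_t takeable_qty(const LevelNode* head, std::uint64_t limit) noexcept {
    assert(limit < kSentinelPrice);  // the sentinel must compare above every valid limit
    std::uint64_t total = 0;
    for (; head->price <= limit; head = head->next) total += head->total_qty;
    return total;
}
```

With a null-terminated list each iteration tests both the pointer and the price. The sentinel folds the end-of-list case into the price comparison, leaving one branch per step.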
Below that, returns diminished. I could have squeezed another few microseconds with DPDK or another kernel-bypass path, but I want this engine to be honest first and faster later.
04 · The result
Measured, not estimated.
- 8.7M msgs / sec: single shard, single matcher core
- 87µs p99: order in → ack out, end-to-end
- 46µs p50: same path, same harness
- 0 allocs / hot path: verified with tcmalloc + perf
- < 0.5% CV across runs: 10× 30-second runs, warm cache
- Flamegraph evidence: linked in repo · methodology in README
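For readers who want to check how figures like p50, p99, and CV are derived from raw samples, here is a minimal sketch. It assumes nearest-rank percentiles and the sample standard deviation (n - 1 denominator); the README's actual methodology may differ, and `percentile` and `coefficient_of_variation` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Takes a copy so it can sort.
inline std::uint64_t percentile(std::vector<std::uint64_t> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    const std::size_t rank = static_cast<std::size_t>(std::ceil(p * n / 100.0));
    return samples[rank - 1];
}

// Coefficient of variation across runs: sample standard deviation over mean.
inline double coefficient_of_variation(const std::vector<double>& runs) {
    assert(runs.size() >= 2);
    const double n = static_cast<double>(runs.size());
    double mean = 0.0;
    for (double r : runs) mean += r;
    mean /= n;
    double sum_sq = 0.0;
    for (double r : runs) sum_sq += (r - mean) * (r - mean);
    return std::sqrt(sum_sq / (n - 1.0)) / mean;
}
```

Feeding per-order latencies from one run into `percentile` gives that run's p50 and p99; feeding one summary number per run (for example each run's p99) into `coefficient_of_variation` gives the run-to-run spread.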
Result
match() itself. The system is finally spending its time on the thing it's supposed to do.
05 · The artifact
What you can actually see.
- GitHub · Canyon-Exchange — source, benchmarks, methodology, flamegraphs.
- Live order book visualizer in /lab → Trading Engine Floor.
- p50 / p95 / p99 histogram bundled with the repo.
- Reproduction script: ./bench/run.sh --dur=30s --shards=1 --producers=4
06 · The reflection
What I'd do next.
The next move is kernel bypass. DPDK on the network path, then a deeper cacheline-density analysis on the level structure. After that, a replay harness that captures real packet streams from a public exchange feed and lets me regress against historical volatility, not synthetic load. The system has earned a real adversary.