QuillCache

A Mooncake/Dynamo-style distributed KV cache pool and control plane for LLM serving — in Rust, with identity-governed safe reuse and a crash-consistent persistent tier.

distributed KV byte pool | DRAM / SSD tiers | transfer engine · TCP / RDMA | cross-node fetch | identity-safe reuse | crash-consistent SSD | persistent ART index | Dynamo cost router | 9 crates · MIT

QuillCache sits beside real inference engines (vLLM, SGLang) and owns the KV cache as a resource — the byte pool, the transfer engine, the residency index, and the cache-aware control plane. It replicates the architecture of NVIDIA Dynamo and Moonshot’s Mooncake, at a size one person can read end-to-end and measure. It does not run models — the CUDA tier moves and quantizes KV bytes (the data path), not inference compute.

Headline results

Real code, no simulation — every number is measured against in-tree baselines. 45 tests pass; fmt + clippy clean.

9.96 µs

ART prefix-scan (p50)

Holt (persistent ART) residency index, versus 16.8 µs for RocksDB/LSM and 494 µs for an O(N) flat map. Recovery: 2.6 ms.
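The gap between an ordered index and a flat map comes from how a prefix scan is answered: an ordered structure seeks to the prefix and walks only the matching keys, while a flat map must test every key. A minimal sketch of that difference follows, using the standard library's `BTreeMap` as a stand-in for the ART. It is not Holt's API; the key layout and the `NodeId` type are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical residency value: the node that holds a KV block.
pub type NodeId = u32;

/// Ordered index: seek to the first key >= prefix, then walk while keys
/// still start with the prefix. Work is one seek plus the number of matches.
pub fn prefix_scan_ordered<'a>(
    index: &'a BTreeMap<Vec<u8>, NodeId>,
    prefix: &[u8],
) -> Vec<(&'a [u8], NodeId)> {
    index
        .range(prefix.to_vec()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.as_slice(), *v))
        .collect()
}

/// Flat map: no key order, so every entry is inspected (O(N)),
/// and the hits are sorted afterwards to give a stable result.
pub fn prefix_scan_flat<'a>(
    index: &'a HashMap<Vec<u8>, NodeId>,
    prefix: &[u8],
) -> Vec<(&'a [u8], NodeId)> {
    let mut hits: Vec<_> = index
        .iter()
        .filter(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.as_slice(), *v))
        .collect();
    hits.sort();
    hits
}
```

Both functions return the same entries; only the amount of work differs, which is what the p50 figures above compare.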

1× vs 10.6×

Write amplification

ART writes each record once; LSM rewrites through compaction. Measured from RocksDB’s own stats, not assumed.
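Write amplification is conventionally the ratio of bytes the storage engine writes to the device to bytes the application asked it to write. The sketch below shows only that arithmetic; the function name is hypothetical, and how QuillCache extracts the two counters from RocksDB's statistics is not shown here.

```rust
/// Write amplification = bytes written to the device / bytes written by the user.
/// Returns None when no user bytes were written, since the ratio is undefined.
pub fn write_amplification(device_bytes: u64, user_bytes: u64) -> Option<f64> {
    if user_bytes == 0 {
        None
    } else {
        Some(device_bytes as f64 / user_bytes as f64)
    }
}
```

A store that writes each record exactly once has equal counters and a ratio of 1.0; a store whose compaction rewrites data has a device counter several times larger than the user counter.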

96.8%

Unsafe reuse caught

Share of a naive content-hash cache’s hits on a collision workload that are unsafe: cross-tenant leaks plus cross-adapter/model errors. The guard serves none of them.
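The idea behind an identity guard can be sketched in a few lines: a hit on the content hash is served only when the requester's identity matches the identity the block was stored under. The `Identity` fields (tenant, model, adapter) and all type names below are assumptions chosen for illustration, not QuillCache's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical identity under which a KV block was produced.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Identity {
    pub tenant: String,
    pub model: String,
    pub adapter: Option<String>,
}

pub struct Entry {
    pub owner: Identity,
    pub kv_bytes: Vec<u8>,
}

#[derive(Default)]
pub struct GuardedCache {
    by_content: HashMap<u64, Entry>,
}

impl GuardedCache {
    pub fn insert(&mut self, content_hash: u64, owner: Identity, kv_bytes: Vec<u8>) {
        self.by_content.insert(content_hash, Entry { owner, kv_bytes });
    }

    /// Naive lookup: any requester with a matching content hash gets the bytes.
    pub fn get_naive(&self, content_hash: u64) -> Option<&[u8]> {
        self.by_content.get(&content_hash).map(|e| e.kv_bytes.as_slice())
    }

    /// Guarded lookup: the hash must match AND the requester must be the owner.
    pub fn get_guarded(&self, content_hash: u64, requester: &Identity) -> Option<&[u8]> {
        self.by_content
            .get(&content_hash)
            .filter(|e| &e.owner == requester)
            .map(|e| e.kv_bytes.as_slice())
    }
}
```

With the same content hash, `get_naive` serves a block stored by another tenant, while `get_guarded` treats that request as a miss.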

crash-consistent

Persistent SSD tier

Object-first atomic publish + WAL: complete blocks recover, half-written or corrupted blocks are dropped, and no dangling pointers remain.
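One plausible reading of "object-first atomic publish + WAL" is sketched below: the block is written to a temporary file, synced, and renamed into place before a WAL record naming it is appended, and recovery keeps only blocks whose bytes match the logged checksum. The file layout, the `wal.log` name, the text record format, and the FNV-1a checksum are all assumptions for illustration, not QuillCache's on-disk format. The sketch also assumes block names contain no spaces and omits the directory fsync a production version would need.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, BufRead, BufReader, Write};
use std::path::Path;

/// FNV-1a, used here only as a stand-in for a real checksum such as CRC32C.
fn checksum(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Publish the object first, then log it, so the WAL never names
/// an object that has not been fully written and renamed into place.
pub fn publish(dir: &Path, name: &str, data: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{name}.tmp"));
    let mut f = File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?; // bytes are durable before the block becomes visible
    fs::rename(&tmp, dir.join(name))?; // atomic replace on POSIX filesystems
    let mut wal = OpenOptions::new()
        .create(true)
        .append(true)
        .open(dir.join("wal.log"))?;
    writeln!(wal, "{name} {}", checksum(data))?;
    wal.sync_all()
}

/// Replay the WAL and keep a block only if its record parses and the
/// object on disk matches the logged checksum.
pub fn recover(dir: &Path) -> io::Result<Vec<String>> {
    let wal = match File::open(dir.join("wal.log")) {
        Ok(f) => f,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(Vec::new()),
        Err(e) => return Err(e),
    };
    let mut live = Vec::new();
    for line in BufReader::new(wal).lines() {
        let line = line?;
        let Some((name, sum)) = line.split_once(' ') else { continue };
        let Ok(sum) = sum.parse::<u64>() else { continue };
        match fs::read(dir.join(name)) {
            Ok(bytes) if checksum(&bytes) == sum => live.push(name.to_string()),
            _ => {} // missing, half-written, or corrupted: drop the block
        }
    }
    Ok(live)
}
```

Because the rename happens before the WAL append, a crash between the two steps leaves an unreferenced object rather than a dangling pointer; the checksum comparison in `recover` handles objects damaged after publication.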

real KV bytes

Distributed byte pool

DRAM + SSD tiers that hold actual KV blocks, with index-located cross-node fetch over the transfer engine — verified over TCP.
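The lookup order implied above (local DRAM, then local SSD, then a remote node located through the index) can be sketched as follows. Everything here is a simplified model under stated assumptions: the tiers are in-memory maps, the `Transfer` trait and `Loopback` fake are hypothetical stand-ins for the transfer engine, and the `residency` map stands in for the residency index.

```rust
use std::collections::HashMap;

pub type BlockId = u64;
pub type NodeId = u32;

/// Where a block was served from.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Source {
    LocalDram,
    LocalSsd,
    Remote(NodeId),
}

/// Hypothetical transport interface; TCP or RDMA would sit behind it.
pub trait Transfer {
    fn fetch(&self, node: NodeId, block: BlockId) -> Option<Vec<u8>>;
}

/// In-memory fake transport used to exercise the sketch.
#[derive(Default)]
pub struct Loopback {
    pub remote: HashMap<(NodeId, BlockId), Vec<u8>>,
}

impl Transfer for Loopback {
    fn fetch(&self, node: NodeId, block: BlockId) -> Option<Vec<u8>> {
        self.remote.get(&(node, block)).cloned()
    }
}

pub struct Pool<T: Transfer> {
    pub dram: HashMap<BlockId, Vec<u8>>,
    pub ssd: HashMap<BlockId, Vec<u8>>, // stands in for on-disk blocks
    pub residency: HashMap<BlockId, NodeId>, // stands in for the residency index
    pub transfer: T,
}

impl<T: Transfer> Pool<T> {
    /// Try the cheapest tier first; go to the network only when the index
    /// says another node holds the block.
    pub fn get(&self, block: BlockId) -> Option<(Source, Vec<u8>)> {
        if let Some(b) = self.dram.get(&block) {
            return Some((Source::LocalDram, b.clone()));
        }
        if let Some(b) = self.ssd.get(&block) {
            return Some((Source::LocalSsd, b.clone()));
        }
        let node = *self.residency.get(&block)?;
        self.transfer
            .fetch(node, block)
            .map(|b| (Source::Remote(node), b))
    }
}
```

A block absent from both local tiers and from the index is a plain miss, so no network round trip is spent on it.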

1.7%

Cost of safety

Identity-guard overhead on a realistic, mostly-same-identity workload. It rises to 47.8% only in the adversarial all-collision case.

Where it sits

The engines run the model and own the live HBM KV; QuillCache holds the offloaded KV bytes (DRAM / SSD), moves them between nodes, indexes where they live, and decides routing and reuse.

Explore

Overview: What it is, and what's wired online vs. a tested unit vs. reserved.

Architecture: Gateway, control plane, KV cache pool, distribution, and how a request flows through them.

ART vs LSM storage study: Which storage engine fits a residency index, compared on prefix-scan, write-amp, and recovery.

Identity-safe reuse: Why content-hash caches leak, and the guard that costs ~1.7%.

Crash-consistent tier: Object-first atomic publish + WAL recovery, proven by test.

Mooncake / Dynamo mapping: Every reference-design component, mapped to a crate.