QuillCache
A Mooncake/Dynamo-style distributed KV cache pool and control plane for LLM serving — in Rust, with identity-governed safe reuse and a crash-consistent persistent tier.
distributed KV byte pool | DRAM / SSD tiers | transfer engine · TCP / RDMA | cross-node fetch | identity-safe reuse | crash-consistent SSD | persistent ART index | Dynamo cost router | 9 crates · MIT
QuillCache sits beside real inference engines (vLLM, SGLang) and owns the KV cache as a resource — the byte pool, the transfer engine, the residency index, and the cache-aware control plane. It replicates the architecture of NVIDIA Dynamo and Moonshot’s Mooncake, at a size one person can read end-to-end and measure. It does not run models — the CUDA tier moves and quantizes KV bytes (the data path), not inference compute.
Headline results
Real code, no simulation — every number is measured against in-tree baselines.
45 tests pass; fmt + clippy clean.
9.96 µs
ART prefix-scan (p50)
Holt (persistent ART) residency index, compared with RocksDB/LSM (16.8 µs) and an O(N) flat map (494 µs). Recovery: 2.6 ms.
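To illustrate why an ordered index can answer a prefix query faster than a flat map, here is a minimal sketch. It uses the standard library's `BTreeMap` as a stand-in for an ordered structure; it is not Holt or an ART, and the byte-string key layout and the function names are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Ordered-index prefix scan: seek to the first key >= `prefix`, then walk
/// forward only while keys still start with `prefix`.
/// `BTreeMap` is a stand-in for an ordered index, not an ART.
pub fn prefix_scan_ordered<'a>(
    index: &'a BTreeMap<Vec<u8>, u64>,
    prefix: &[u8],
) -> Vec<(&'a [u8], u64)> {
    index
        .range(prefix.to_vec()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.as_slice(), *v))
        .collect()
}

/// Flat-map baseline: every entry is inspected, so the cost is O(N)
/// regardless of how many keys match.
pub fn prefix_scan_flat<'a>(
    entries: &'a [(Vec<u8>, u64)],
    prefix: &[u8],
) -> Vec<(&'a [u8], u64)> {
    entries
        .iter()
        .filter(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.as_slice(), *v))
        .collect()
}
```

Both functions return the same matches; the ordered version touches only the matching range plus one extra key, while the flat version visits every entry.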
1× vs 10.6×
Write amplification
ART writes each record once; LSM rewrites through compaction. Measured from RocksDB’s own stats, not assumed.
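Write amplification is the ratio of bytes physically written to storage to bytes the application logically wrote. The sketch below shows that arithmetic; the struct and field names are hypothetical and do not correspond to RocksDB's statistics names or to QuillCache's code.

```rust
/// Byte counters for one measurement window.
/// Field names are hypothetical, chosen for this illustration only.
pub struct WriteCounters {
    /// Bytes the application asked the store to write.
    pub user_bytes: u64,
    /// Bytes written when in-memory data is first persisted.
    pub flush_bytes: u64,
    /// Bytes rewritten later by background compaction (0 for a write-once store).
    pub compaction_bytes: u64,
}

/// Write amplification = physical bytes written / logical bytes written.
/// Returns `None` when nothing was written, to avoid dividing by zero.
pub fn write_amplification(c: &WriteCounters) -> Option<f64> {
    if c.user_bytes == 0 {
        return None;
    }
    let physical = c.flush_bytes + c.compaction_bytes;
    Some(physical as f64 / c.user_bytes as f64)
}
```

Under this definition, a store that persists each record exactly once and never rewrites it yields 1.0, and every compaction pass that rewrites existing data pushes the ratio higher.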
96.8%
Unsafe reuse caught
Share of a naive content-hash cache’s hits on a collision workload that are unsafe: cross-tenant leaks plus cross-adapter/model errors. The guard serves 0 of them.
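One way to see the difference between a content-hash cache and an identity-guarded one is the sketch below. It assumes identity consists of tenant, model, and optional adapter, and that the guard works by folding identity into the lookup key; both are assumptions for illustration, and all type and method names are hypothetical rather than QuillCache's actual API.

```rust
use std::collections::HashMap;

/// Who a KV block belongs to. The exact fields are an assumption.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Identity {
    pub tenant: String,
    pub model: String,
    pub adapter: Option<String>,
}

/// Naive cache: keyed on content hash alone, so any requester whose
/// prompt hashes to the same value gets the block.
#[derive(Default)]
pub struct NaiveCache {
    blocks: HashMap<u64, Vec<u8>>,
}

impl NaiveCache {
    pub fn insert(&mut self, content_hash: u64, kv: Vec<u8>) {
        self.blocks.insert(content_hash, kv);
    }
    pub fn get(&self, content_hash: u64) -> Option<&[u8]> {
        self.blocks.get(&content_hash).map(|v| v.as_slice())
    }
}

/// Guarded cache: identity is part of the key, so a hit requires both the
/// same content hash and the same tenant/model/adapter.
#[derive(Default)]
pub struct GuardedCache {
    blocks: HashMap<(u64, Identity), Vec<u8>>,
}

impl GuardedCache {
    pub fn insert(&mut self, content_hash: u64, owner: Identity, kv: Vec<u8>) {
        self.blocks.insert((content_hash, owner), kv);
    }
    /// Clones the identity to build the key; acceptable for a sketch.
    pub fn get(&self, content_hash: u64, requester: &Identity) -> Option<&[u8]> {
        self.blocks
            .get(&(content_hash, requester.clone()))
            .map(|v| v.as_slice())
    }
}
```

With the same content hash, the naive cache returns tenant A's block to tenant B, while the guarded cache returns `None` for B and still serves A.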
crash-consistent
Persistent SSD tier
Object-first atomic publish + WAL: complete blocks recover, half-written / corrupted blocks dropped, no dangling pointers.
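The recovery behavior described above can be sketched as follows. This is one reading of "object-first atomic publish + WAL", not the project's implementation: the object is written to a temporary file, synced, and renamed into place before a fixed-size WAL record makes it visible. The file names, the 24-byte record layout, and the FNV-1a checksum are all assumptions, and a production version would also sync the parent directory after the rename.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

/// FNV-1a checksum, used here only to keep the sketch dependency-free.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in data {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn read_u64(bytes: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(bytes);
    u64::from_le_bytes(a)
}

/// Object first: write a temp file, sync it, atomically rename it into place.
/// Only then append the WAL record (id, length, checksum) that publishes it.
pub fn publish(dir: &Path, id: u64, data: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{}.tmp", id));
    let obj = dir.join(format!("{}.blk", id));
    let mut f = File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?;
    fs::rename(&tmp, &obj)?;

    let mut rec = Vec::with_capacity(24);
    rec.extend_from_slice(&id.to_le_bytes());
    rec.extend_from_slice(&(data.len() as u64).to_le_bytes());
    rec.extend_from_slice(&fnv1a(data).to_le_bytes());
    let mut wal = OpenOptions::new()
        .create(true)
        .append(true)
        .open(dir.join("wal.log"))?;
    wal.write_all(&rec)?;
    wal.sync_all()
}

/// Recovery: replay the WAL and keep only blocks whose object exists and
/// matches the recorded length and checksum. A torn trailing record is
/// skipped by `chunks_exact`.
pub fn recover(dir: &Path) -> io::Result<Vec<u64>> {
    let mut log = Vec::new();
    match File::open(dir.join("wal.log")) {
        Ok(mut f) => {
            f.read_to_end(&mut log)?;
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(Vec::new()),
        Err(e) => return Err(e),
    }
    let mut live = Vec::new();
    for rec in log.chunks_exact(24) {
        let (id, len, sum) = (read_u64(&rec[0..8]), read_u64(&rec[8..16]), read_u64(&rec[16..24]));
        match fs::read(dir.join(format!("{}.blk", id))) {
            Ok(data) if data.len() as u64 == len && fnv1a(&data) == sum => live.push(id),
            _ => {} // missing, truncated, or corrupted object: drop it
        }
    }
    Ok(live)
}
```

Because the WAL record is appended only after the object is durable under its final name, a crash before the append leaves an unreferenced file rather than a pointer to missing data.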
real KV bytes
Distributed byte pool
DRAM + SSD tiers that hold actual KV blocks, with index-located cross-node fetch over the transfer engine — verified over TCP.
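A minimal sketch of an index-located fetch, assuming a residency index that maps each block to a node and tier. The `Transport` trait is a hypothetical stand-in for the transfer engine, and in-memory maps stand in for the DRAM and SSD tiers; none of these names come from the QuillCache crates.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Dram,
    Ssd,
}

/// Where the residency index says a block lives.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Location {
    pub node: u32,
    pub tier: Tier,
}

/// Hypothetical stand-in for the transfer engine (TCP or RDMA in the real system).
pub trait Transport {
    fn fetch(&self, node: u32, block: u64) -> Option<Vec<u8>>;
}

/// In-memory transport for illustration: (node, block) -> bytes.
pub struct InMemoryPeers(pub HashMap<(u32, u64), Vec<u8>>);

impl Transport for InMemoryPeers {
    fn fetch(&self, node: u32, block: u64) -> Option<Vec<u8>> {
        self.0.get(&(node, block)).cloned()
    }
}

pub struct Node<T: Transport> {
    pub id: u32,
    pub dram: HashMap<u64, Vec<u8>>,
    pub ssd: HashMap<u64, Vec<u8>>,
    pub index: HashMap<u64, Location>,
    pub transport: T,
}

impl<T: Transport> Node<T> {
    /// Consult the index first; read the local tier it names, or fetch
    /// from the owning node when the block lives elsewhere.
    pub fn get(&self, block: u64) -> Option<Vec<u8>> {
        let loc = self.index.get(&block)?;
        if loc.node != self.id {
            return self.transport.fetch(loc.node, block);
        }
        match loc.tier {
            Tier::Dram => self.dram.get(&block).cloned(),
            Tier::Ssd => self.ssd.get(&block).cloned(),
        }
    }
}
```

The index lookup comes first so that a miss costs one map probe and a remote block costs exactly one transfer, rather than probing every tier on every node.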
1.7%
Cost of safety
Identity-guard overhead on a realistic, mostly-same-identity workload; it rises to 47.8% only in the adversarial all-collision case.
Where it sits
The engines run the model and own the live HBM KV; QuillCache holds the offloaded KV bytes (DRAM / SSD), moves them between nodes, indexes where they live, and decides routing and reuse.
Explore
Overview: What it is, what's wired online vs a tested unit vs reserved.
Architecture: Gateway, control plane, KV cache pool, distribution, and how a request flows.
ART vs LSM storage study: Which storage engine fits a residency index (prefix-scan, write-amp, recovery).
Identity-safe reuse: Why content-hash caches leak, and the guard that costs ~1.7%.
Crash-consistent tier: Object-first atomic publish + WAL recovery, proven by test.
Mooncake / Dynamo mapping: Every reference-design component, mapped to a crate.