How it fits together
QuillCache is the KV cache pool and the control plane. The engines run the model and own the live HBM KV; QuillCache holds the offloaded KV bytes (DRAM / SSD), moves them between nodes, indexes where they live, and decides routing and reuse.
The layers
- Gateway (`quillcache-gateway`): an OpenAI-compatible proxy. It parses the request, derives block hashes, asks the control plane for a plan, streams the upstream response through, and attaches `x-quillcache-*` decision headers before the first token (so they don’t add to TTFT).
- Control plane / Conductor (`quillcache-control` + `quillcache-router`): picks a worker with the Dynamo cost function (cache-locality vs load), runs the identity guard, and reads/writes the residency index.
- KV cache pool / data plane (`quillcache-store`): `StoreDataPlane` manages HBM/DRAM/SSD tiers and moves real bytes on demotion/eviction; `LocalKvStore` is the byte pool, with a WAL-backed crash-consistent SSD tier.
- Distribution (`quillcache-store` + `quillcache-transfer`): `PooledStore` serves a block locally or fetches it from the peer the residency index located, resolving the address through the `NodeRegistry` and moving it over the transfer engine.
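The local-or-peer lookup described for the distribution layer can be sketched as follows. This is a minimal illustration, not the project's code: `PooledStoreSketch`, its fields, the `get` signature, and `fake_transfer` are all hypothetical names, and plain dicts stand in for the residency index, the `NodeRegistry`, and the transfer engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class PooledStoreSketch:
    """Hypothetical stand-in for PooledStore; every field name is an assumption."""
    node_id: str
    transfer: Callable[[str, str], Optional[bytes]]  # (address, block hash) -> bytes
    local: Dict[str, bytes] = field(default_factory=dict)    # this node's byte pool
    residency: Dict[str, str] = field(default_factory=dict)  # block hash -> owning node
    registry: Dict[str, str] = field(default_factory=dict)   # node id -> address

    def get(self, block_hash: str) -> Tuple[Optional[bytes], str]:
        """Return (bytes, source), where source is 'local', 'peer', or 'miss'."""
        data = self.local.get(block_hash)
        if data is not None:
            return data, "local"
        # Not held locally: ask the residency index who owns the block.
        owner = self.residency.get(block_hash)
        address = self.registry.get(owner) if owner is not None else None
        if address is None or owner == self.node_id:
            return None, "miss"  # unknown block, unknown peer, or a stale self-entry
        data = self.transfer(address, block_hash)
        return (data, "peer") if data is not None else (None, "miss")


# A fake transfer engine backed by a dict, standing in for TCP / RDMA.
peer_pools = {"10.0.0.2:7000": {"blk-b": b"kv-bytes-b"}}


def fake_transfer(address: str, block_hash: str) -> Optional[bytes]:
    return peer_pools.get(address, {}).get(block_hash)


store = PooledStoreSketch(
    node_id="node-1",
    transfer=fake_transfer,
    local={"blk-a": b"kv-bytes-a"},
    residency={"blk-a": "node-1", "blk-b": "node-2", "blk-c": "node-3"},
    registry={"node-2": "10.0.0.2:7000"},
)
```

Here `store.get("blk-a")` is served from the local pool, `store.get("blk-b")` goes through the registry and the fake transfer, and `store.get("blk-c")` is a miss because its owner has no registered address.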
A request, end to end
- Parse: the gateway reads the request and derives the prompt’s block hashes (a shared system prompt yields shared leading blocks).
- Plan: the control plane looks up only this request’s blocks in the residency index (`locate`, not a full snapshot), scores workers with the Dynamo cost function, and emits per-block actions (local hit / fetch / recompute / decode).
- Guard: `audit_reuse` refuses any content-matching block that belongs to a different identity, since reusing it would be a privacy leak or a correctness error.
- Route + load feedback: the chosen engine’s in-flight count is bumped before the upstream call and released after, so the cost function’s load term is live (it spreads requests under load instead of dog-piling the cache-hot engine).
- Observe: on success the request’s prefix blocks are recorded as resident on that engine, so the next request for the same prefix sees a local hit.
- Stream: the upstream body is forwarded chunk by chunk; the first chunk’s arrival is timed against the request’s SLO budget for live goodput.
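The parse, plan, guard, and load-feedback steps can be sketched in a few functions. Everything here is an assumption made for illustration: the chained SHA-256 block hashing, the block size, the `audit_reuse_sketch` signature, and especially the cost formula (cache misses plus a weighted in-flight count), which is a simplified stand-in rather than the actual Dynamo cost function. The `decode` action is omitted.

```python
import hashlib
from contextlib import contextmanager

BLOCK_TOKENS = 4  # assumed block size, illustrative only


def block_hashes(tokens, block_tokens=BLOCK_TOKENS):
    """Chain-hash full blocks so each hash covers the whole prefix up to it."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_tokens + 1, block_tokens):
        block = tokens[start:start + block_tokens]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest.hex()[:16])
        parent = digest
    return hashes


def audit_reuse_sketch(block_identity, request_identity):
    """Guard (assumed signature): a content match is reusable only within one identity."""
    return block_identity == request_identity


def plan(hashes, identity, residency, in_flight, load_weight=1.0):
    """Pick the cheapest worker and emit one action per block.

    residency: block hash -> (worker, identity); in_flight: worker -> running requests.
    """
    # Look up only this request's blocks, then drop any the guard refuses.
    located = {h: residency[h] for h in hashes if h in residency}
    reusable = {h: worker for h, (worker, owner) in located.items()
                if audit_reuse_sketch(owner, identity)}

    def cost(worker):
        misses = sum(1 for h in hashes if reusable.get(h) != worker)
        return misses + load_weight * in_flight[worker]

    best = min(sorted(in_flight), key=cost)  # sorted() makes ties deterministic
    actions = []
    for h in hashes:
        holder = reusable.get(h)
        if holder is None:
            actions.append("recompute")
        elif holder == best:
            actions.append("local_hit")
        else:
            actions.append("fetch")
    return best, actions


@contextmanager
def in_flight_slot(in_flight, worker):
    """Bump the worker's in-flight count for the duration of the upstream call."""
    in_flight[worker] += 1
    try:
        yield
    finally:
        in_flight[worker] -= 1


system = list(range(8))  # a shared system prompt spanning two full blocks
req_a = block_hashes(system + [100, 101, 102, 103])
req_b = block_hashes(system + [200, 201, 202, 203])
```

Because each hash folds in its parent, `req_a` and `req_b` share their first two hashes and differ on the third. With the two shared blocks recorded on one worker under the same identity, `plan` prefers that worker while it is idle and switches to fetching from it once its in-flight count outweighs the cache advantage.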
Two seams keep it pluggable
- `IndexBackend`: the residency index (Memory / Holt-ART / RocksDB-LSM), compared in the storage study.
- `DataPlane` + `Transfer` + `NodeRegistry`: the byte tiers, the wire (TCP / RDMA), and node discovery (in-memory now, etcd later).
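One way to picture these seams is as small interfaces that callers depend on instead of concrete backends. The sketch below uses Python `Protocol` classes for that purpose. The method names (`record`, `locate`, `resolve`) and the helper `peers_to_contact` are assumptions for illustration and are not taken from the QuillCache code; only the in-memory variants are shown.

```python
from typing import Dict, Iterable, Optional, Protocol


class IndexBackendSketch(Protocol):
    """Assumed shape of the residency-index seam."""
    def record(self, block_hash: str, node_id: str) -> None: ...
    def locate(self, block_hashes: Iterable[str]) -> Dict[str, str]: ...


class NodeRegistrySketch(Protocol):
    """Assumed shape of the node-discovery seam."""
    def resolve(self, node_id: str) -> Optional[str]: ...


class MemoryIndex:
    """In-memory index; a persistent backend would expose the same two methods."""
    def __init__(self) -> None:
        self._where: Dict[str, str] = {}

    def record(self, block_hash: str, node_id: str) -> None:
        self._where[block_hash] = node_id

    def locate(self, block_hashes: Iterable[str]) -> Dict[str, str]:
        # Return entries only for the requested blocks, never a full snapshot.
        return {h: self._where[h] for h in block_hashes if h in self._where}


class InMemoryRegistry:
    """In-memory node discovery; an etcd-backed registry could replace it later."""
    def __init__(self, addresses: Dict[str, str]) -> None:
        self._addresses = dict(addresses)

    def resolve(self, node_id: str) -> Optional[str]:
        return self._addresses.get(node_id)


def peers_to_contact(index: IndexBackendSketch,
                     registry: NodeRegistrySketch,
                     block_hashes: Iterable[str]) -> Dict[str, str]:
    """Map each located block to a peer address using only the two interfaces."""
    out: Dict[str, str] = {}
    for block_hash, node_id in index.locate(block_hashes).items():
        address = registry.resolve(node_id)
        if address is not None:
            out[block_hash] = address
    return out


index = MemoryIndex()
index.record("blk-a", "node-2")
registry = InMemoryRegistry({"node-2": "10.0.0.2:7000"})
```

Since `peers_to_contact` only touches the protocol methods, swapping `MemoryIndex` for another backend with the same methods would not change the caller.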