How it fits together

QuillCache is the KV cache pool and the control plane. The engines run the model and own the live HBM KV; QuillCache holds the offloaded KV bytes (DRAM / SSD), moves them between nodes, indexes where they live, and decides routing and reuse.

The layers

Gateway (quillcache-gateway) — an OpenAI-compatible proxy. It parses the request, derives block hashes, asks the control plane for a plan, streams the upstream response through, and attaches x-quillcache-* decision headers before the first token (so they don’t add to TTFT).
Control plane / Conductor (quillcache-control + quillcache-router) — picks a worker with the Dynamo cost function (cache-locality vs load), runs the identity guard, and reads/writes the residency index.
KV cache pool / data plane (quillcache-store) — StoreDataPlane manages HBM/DRAM/SSD tiers and moves real bytes on demotion/eviction; LocalKvStore is the byte pool, with a WAL-backed crash-consistent SSD tier.
Distribution (quillcache-store + quillcache-transfer) — PooledStore serves a block locally or fetches it from the peer the residency index located, resolving the address through the NodeRegistry and moving it over the transfer engine.
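The distribution path can be illustrated with a small sketch: serve a block locally, otherwise ask the residency index which peer holds it, resolve that peer's address, and fetch the bytes. Everything here is an assumption made for illustration (the class name `PooledStoreSketch`, the dict-based index and registry, and the `transfer` callable); it is not the real `PooledStore`, `NodeRegistry`, or transfer-engine API.

```python
from typing import Callable, Dict, Optional


class PooledStoreSketch:
    """Hypothetical stand-in for PooledStore: local hit first, then a peer fetch."""

    def __init__(
        self,
        node_id: str,
        residency: Dict[str, str],   # block hash -> node id holding it (stand-in for the residency index)
        registry: Dict[str, str],    # node id -> address (stand-in for NodeRegistry)
        transfer: Callable[[str, str], Optional[bytes]],  # (address, block hash) -> bytes
    ) -> None:
        self.node_id = node_id
        self.local: Dict[str, bytes] = {}
        self.residency = residency
        self.registry = registry
        self.transfer = transfer

    def get(self, block_hash: str) -> Optional[bytes]:
        # 1. Serve from the local byte pool when possible.
        if block_hash in self.local:
            return self.local[block_hash]
        # 2. Ask the residency index which node holds the block.
        owner = self.residency.get(block_hash)
        if owner is None or owner == self.node_id:
            return None  # nobody else has it; the caller would recompute
        # 3. Resolve the owner's address through the registry.
        address = self.registry.get(owner)
        if address is None:
            return None  # owner is unknown to the registry
        # 4. Move the bytes over the transfer engine.
        return self.transfer(address, block_hash)
```

Returning `None` on every miss keeps the sketch short; a real store would likely distinguish "not resident anywhere" from "peer unreachable" so the planner can pick between recompute and retry.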

A request, end to end

Parse — the gateway reads the request and derives the prompt’s block hashes (a shared system prompt yields shared leading blocks).
Plan — the control plane looks up only this request’s blocks in the residency index (locate, not a full snapshot), scores workers with the Dynamo cost function, and emits per-block actions (local hit / fetch / recompute / decode).
Guard — audit_reuse refuses any content-matching block that belongs to a different identity, because reusing it would be a privacy leak or a correctness error.
Route + load feedback — the chosen engine’s in-flight count is bumped before the upstream call and released after, so the cost function’s load term is live: under load the router spreads requests instead of dog-piling the cache-hot engine.
Observe — on success the request’s prefix blocks are recorded as resident on that engine, so the next request for the same prefix sees a local hit.
Stream — the upstream body is forwarded chunk by chunk; the first chunk’s arrival time is compared against the request’s SLO budget to measure live goodput.
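The steps above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the QuillCache implementation: the block size, the chained SHA-256 hashing, the dict-based residency and ownership maps, and the `handle` and `upstream` names are all hypothetical. In particular, the cost formula (blocks still to prefill plus a weighted in-flight count) is a simplified stand-in for the Dynamo cost function, chosen only to show the cache-locality versus load trade-off.

```python
import hashlib
from typing import Callable, Dict, List, Set

BLOCK_SIZE = 4  # tokens per block; an illustrative value only


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Parse: hash each full block chained onto its parent, so equal prefixes give equal leading hashes."""
    hashes: List[str] = []
    parent = b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes


def leading_hits(hashes: List[str], resident: Set[str]) -> int:
    """Count the leading blocks already resident on one worker."""
    count = 0
    for h in hashes:
        if h not in resident:
            break
        count += 1
    return count


def pick_worker(hashes: List[str], residency: Dict[str, Set[str]],
                in_flight: Dict[str, int], load_weight: float = 1.0) -> str:
    """Plan: lowest cost wins (simplified stand-in for the real cost function)."""
    def cost(worker: str) -> float:
        missing = len(hashes) - leading_hits(hashes, residency.get(worker, set()))
        return missing + load_weight * in_flight.get(worker, 0)
    return min(sorted(in_flight), key=cost)  # sorted() makes ties deterministic


def audit_reuse(hashes: List[str], block_owner: Dict[str, str], identity: str) -> List[str]:
    """Guard: a block is reusable only if it was recorded under the same identity."""
    return [h for h in hashes if block_owner.get(h) == identity]


def handle(tokens: List[int], identity: str, residency: Dict[str, Set[str]],
           in_flight: Dict[str, int], block_owner: Dict[str, str],
           upstream: Callable[[str, List[int]], str]):
    hashes = block_hashes(tokens)
    worker = pick_worker(hashes, residency, in_flight)
    reusable = audit_reuse(hashes, block_owner, identity)
    in_flight[worker] += 1            # Route: bump before the upstream call
    try:
        result = upstream(worker, tokens)
    finally:
        in_flight[worker] -= 1        # release after, even if the call fails
    # Observe: reached only on success, so failed requests record nothing.
    residency.setdefault(worker, set()).update(hashes)
    for h in hashes:
        block_owner.setdefault(h, identity)
    return worker, reusable, result
```

Because each block hash covers its parent hash, two prompts that share a system prompt produce identical leading hashes and diverge from the first differing block onward. The `try`/`finally` around the upstream call is what keeps the load term honest when a request errors out.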

Two seams keep it pluggable

IndexBackend — the residency index (Memory / Holt-ART / RocksDB-LSM), compared in the storage study.
DataPlane + Transfer + NodeRegistry — the byte tiers, the wire (TCP / RDMA), and node discovery (in-memory now, etcd later).
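To make the first seam concrete, here is a hedged sketch of what a pluggable residency index could look like. The method names `record` and `locate` and the in-memory class are assumptions made for illustration; the real `IndexBackend` interface may have a different shape.

```python
from typing import Dict, Iterable, Protocol, Set


class IndexBackend(Protocol):
    """Hypothetical shape of the residency-index seam."""

    def record(self, block_hash: str, node_id: str) -> None: ...

    def locate(self, block_hashes: Iterable[str]) -> Dict[str, Set[str]]: ...


class MemoryIndex:
    """In-memory backend; an ART- or LSM-backed one would expose the same two methods."""

    def __init__(self) -> None:
        self._nodes_by_block: Dict[str, Set[str]] = {}

    def record(self, block_hash: str, node_id: str) -> None:
        self._nodes_by_block.setdefault(block_hash, set()).add(node_id)

    def locate(self, block_hashes: Iterable[str]) -> Dict[str, Set[str]]:
        # Look up only the requested blocks; unknown blocks are left out.
        return {
            h: set(self._nodes_by_block[h])
            for h in block_hashes
            if h in self._nodes_by_block
        }


def nodes_holding(index: IndexBackend, block_hash: str) -> Set[str]:
    """Caller code depends on the protocol only, so backends can be swapped."""
    return index.locate([block_hash]).get(block_hash, set())
```

Since `nodes_holding` is typed against the protocol rather than a concrete class, replacing `MemoryIndex` with another backend would not change any caller.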