How it fits together

QuillCache is the KV cache pool and the control plane. The engines run the model and own the live HBM KV; QuillCache holds the offloaded KV bytes (DRAM / SSD), moves them between nodes, indexes where they live, and decides routing and reuse.

Architecture diagram (layers):

  • Gateway: OpenAI-compatible proxy · streaming · decision headers · SLO goodput
  • Control plane / Conductor: router (Dynamo cost fn) · identity guard · residency index (Holt ART / RocksDB / Memory)
  • KV cache pool / data plane: StoreDataPlane (HBM/DRAM/SSD tiers, real bytes) · LocalKvStore (WAL crash-consistent SSD)
  • Distribution: PooledStore (cross-node fetch) · NodeRegistry (etcd analogue) · Transfer (TCP / RDMA, reserved)
  • Engines (external): vLLM / SGLang, which run the model and own live HBM KV
  • CUDA device tier: HBM↔host copies · FP8 quantize (feature-gated)

The layers

  • Gateway (quillcache-gateway) — an OpenAI-compatible proxy. It parses the request, derives block hashes, asks the control plane for a plan, streams the upstream response through, and attaches x-quillcache-* decision headers before the first token (so they don’t add to TTFT).
  • Control plane / Conductor (quillcache-control + quillcache-router) — picks a worker with the Dynamo cost function (cache-locality vs load), runs the identity guard, and reads/writes the residency index.
  • KV cache pool / data plane (quillcache-store) — StoreDataPlane manages HBM/DRAM/SSD tiers and moves real bytes on demotion/eviction; LocalKvStore is the byte pool, with a WAL-backed crash-consistent SSD tier.
  • Distribution (quillcache-store + quillcache-transfer) — PooledStore serves a block locally or fetches it from the peer the residency index located, resolving the address through the NodeRegistry and moving it over the transfer engine.
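
The local-or-fetch path described for the Distribution layer can be sketched roughly as follows. This is a minimal illustration, not the quillcache-store API: the `get` method, the field names, the dict-backed residency index, and the `fake_transfer` callable are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class NodeRegistry:
    """Hypothetical registry: maps a node id to a network address."""
    addresses: Dict[str, str] = field(default_factory=dict)

    def resolve(self, node_id: str) -> Optional[str]:
        return self.addresses.get(node_id)


@dataclass
class PooledStore:
    """Hypothetical pooled store: serve locally, else fetch from a peer."""
    node_id: str
    local: Dict[str, bytes]       # block hash -> KV bytes held on this node
    residency: Dict[str, str]     # block hash -> id of the node holding it
    registry: NodeRegistry
    transfer: Callable[[str, str], Optional[bytes]]  # (address, hash) -> bytes

    def get(self, block_hash: str) -> Optional[bytes]:
        # 1. Serve the block locally when this node already holds it.
        if block_hash in self.local:
            return self.local[block_hash]
        # 2. Otherwise ask the residency index which peer holds it.
        owner = self.residency.get(block_hash)
        if owner is None or owner == self.node_id:
            return None  # no peer has it; the caller would recompute
        # 3. Resolve the peer's address through the registry.
        address = self.registry.resolve(owner)
        if address is None:
            return None
        # 4. Move the bytes over the transfer engine.
        return self.transfer(address, block_hash)


# Stand-in for a TCP transfer: looks the block up in a fake remote pool.
peer_blocks = {"10.0.0.2:7000": {"b2": b"kv-b2"}}


def fake_transfer(address: str, block_hash: str) -> Optional[bytes]:
    return peer_blocks.get(address, {}).get(block_hash)


store = PooledStore(
    node_id="node-a",
    local={"b1": b"kv-b1"},
    residency={"b1": "node-a", "b2": "node-b"},
    registry=NodeRegistry({"node-b": "10.0.0.2:7000"}),
    transfer=fake_transfer,
)
```

Here `store.get("b1")` is served from the local pool, `store.get("b2")` goes through the registry and the transfer callable, and an unknown hash returns `None` so the caller can fall back to recompute.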

A request, end to end

  1. Parse — the gateway reads the request and derives the prompt’s block hashes (a shared system prompt yields shared leading blocks).
  2. Plan — the control plane looks up only this request’s blocks in the residency index (locate, not a full snapshot), scores workers with the Dynamo cost function, and emits per-block actions (local hit / fetch / recompute / decode).
  3. Guard — audit_reuse refuses any content-matching block that belongs to a different identity (reusing it would be a privacy leak or a correctness error).
  4. Route + load feedback — the chosen engine’s in-flight count is bumped before the upstream call and released after, so the cost function’s load term reflects live load (requests spread out under load instead of piling onto the cache-hot engine).
  5. Observe — on success the request’s prefix blocks are recorded as resident on that engine, so the next request for the same prefix sees a local hit.
  6. Stream — the upstream body is forwarded chunk by chunk; the first chunk’s arrival is timed against the request’s SLO budget for live goodput.
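
The parse, plan, guard, and load-feedback steps above can be sketched as a small toy pipeline. Everything here is an assumption made for illustration: the block size, the chained SHA-256 hashing scheme, the signature of `audit_reuse`, and especially the cost formula, which is a simple stand-in for "blocks still to prefill plus load" and not the actual Dynamo cost function.

```python
import hashlib
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional, Set

BLOCK_TOKENS = 4  # assumed block size, kept tiny for the example


def block_hashes(tokens: List[int], block_tokens: int = BLOCK_TOKENS) -> List[str]:
    """Hash each full block chained onto its predecessor, so requests that
    share a prefix also share their leading block hashes."""
    hashes: List[str] = []
    parent = b""
    full = len(tokens) - len(tokens) % block_tokens  # drop the partial tail
    for start in range(0, full, block_tokens):
        chunk = ",".join(map(str, tokens[start:start + block_tokens])).encode()
        digest = hashlib.sha256(parent + b"|" + chunk).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes


def audit_reuse(block: str, identity: str, owners: Dict[str, str]) -> bool:
    """Hypothetical guard: reuse only blocks written under the same identity."""
    return owners.get(block) == identity


def pick_worker(blocks: List[str], identity: str,
                resident: Dict[str, Set[str]], owners: Dict[str, str],
                in_flight: Dict[str, int],
                load_weight: float = 1.0) -> Optional[str]:
    """Stand-in cost: blocks still to prefill + load_weight * in-flight count."""
    best, best_cost = None, float("inf")
    for worker in sorted(in_flight):
        hits = 0
        for b in blocks:  # count the leading run of reusable resident blocks
            if b in resident.get(worker, set()) and audit_reuse(b, identity, owners):
                hits += 1
            else:
                break
        cost = (len(blocks) - hits) + load_weight * in_flight[worker]
        if cost < best_cost:
            best, best_cost = worker, cost
    return best


@contextmanager
def in_flight_slot(in_flight: Dict[str, int], worker: str) -> Iterator[None]:
    """Bump the in-flight count before the upstream call, release it after."""
    in_flight[worker] += 1
    try:
        yield
    finally:
        in_flight[worker] -= 1


system = [1, 2, 3, 4, 5, 6, 7, 8]            # shared system prompt
ha = block_hashes(system + [9, 10, 11, 12])  # earlier request
hb = block_hashes(system + [20, 21, 22, 23])  # new request, same prefix
resident = {"engine-0": set(ha), "engine-1": set()}
owners = {h: "tenant-a" for h in ha}
in_flight = {"engine-0": 0, "engine-1": 0}
idle_choice = pick_worker(hb, "tenant-a", resident, owners, in_flight)
busy_choice = pick_worker(hb, "tenant-a", resident, owners,
                          {"engine-0": 5, "engine-1": 0})
```

With both engines idle, the cache-hot `engine-0` wins because two of the three blocks are already resident there. Once `engine-0` carries five in-flight requests, the load term outweighs the cache benefit and the choice moves to `engine-1`.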

Two seams keep it pluggable

  • IndexBackend — the residency index (Memory / Holt-ART / RocksDB-LSM), compared in the storage study.
  • DataPlane + Transfer + NodeRegistry — the byte tiers, the wire (TCP / RDMA), and node discovery (in-memory now, etcd later).
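
One way to picture the IndexBackend seam is as a small interface that each backend satisfies. The sketch below is an assumption about its shape, not the real trait: the method names `record` and `locate`, the `MemoryIndex` class, and the `resident_on` helper are all hypothetical, and only an in-memory backend is shown.

```python
from typing import Dict, Iterable, List, Protocol, Set


class IndexBackend(Protocol):
    """Hypothetical shape of the residency-index seam."""

    def record(self, block: str, node: str) -> None:
        """Mark a block as resident on a node."""
        ...

    def locate(self, blocks: Iterable[str]) -> Dict[str, Set[str]]:
        """Return holders for the requested blocks only, not a full snapshot."""
        ...


class MemoryIndex:
    """In-memory backend; an ART- or LSM-backed one would expose the same methods."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Set[str]] = {}

    def record(self, block: str, node: str) -> None:
        self._nodes.setdefault(block, set()).add(node)

    def locate(self, blocks: Iterable[str]) -> Dict[str, Set[str]]:
        return {b: set(self._nodes[b]) for b in blocks if b in self._nodes}


def resident_on(index: IndexBackend, blocks: List[str], node: str) -> int:
    """Count how many of the given blocks the index places on a node.
    Depends only on the interface, so any backend can be swapped in."""
    found = index.locate(blocks)
    return sum(1 for b in blocks if node in found.get(b, set()))


index = MemoryIndex()
index.record("b1", "engine-0")
index.record("b2", "engine-0")
index.record("b2", "engine-1")
index.record("b9", "engine-1")
located = index.locate(["b1", "b2", "b3"])
```

Because callers such as `resident_on` only use `record` and `locate`, changing the backend would not touch the routing code, which is the point of keeping the index behind a seam.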