Identity-safe reuse

A KV block’s content hash is computed from its tokens, so the same tokens produce the same hash — regardless of which tenant sent them, which LoRA adapter is active, or which model/tokenizer version is loaded. But the KV tensors depend on all of those. A cache that reuses on content hash alone — which is what the data-plane caches (Mooncake / LMCache / KVBM) key on — will serve blocks it must not:

  • across tenants → a privacy leak (one tenant’s cached state served to another);
  • across adapters / models / tokenizers → a correctness error (numerically wrong KV).
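The failure mode above can be reproduced in a few lines. This is a minimal sketch, not code from Mooncake, LMCache, or KVBM: `content_hash` and `naive_lookup` are hypothetical names, the hash is the standard library's `DefaultHasher`, and the cached "KV" is just a label recording whose state it is. It shows that a cache keyed on token content alone cannot tell two tenants apart.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical content hash: depends only on the token ids in the block.
fn content_hash(tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    tokens.hash(&mut h);
    h.finish()
}

/// A naive cache keyed on content hash alone. The stored value is a label
/// saying whose KV state it is, so the leak is visible in the output.
fn naive_lookup(cache: &HashMap<u64, &'static str>, tokens: &[u32]) -> Option<&'static str> {
    cache.get(&content_hash(tokens)).copied()
}

fn main() {
    // Illustrative token ids for a shared system prompt.
    let shared_prefix = [101u32, 2023, 2003, 1037, 2291, 25732];

    // tenant-a fills the cache with KV computed under its own identity.
    let mut cache = HashMap::new();
    cache.insert(content_hash(&shared_prefix), "kv-from-tenant-a");

    // tenant-b sends the same tokens: same hash, so the naive cache hits
    // and hands back tenant-a's state.
    let served = naive_lookup(&cache, &shared_prefix);
    println!("tenant-b was served: {:?}", served);
}
```

Nothing in the key distinguishes the second request from the first, which is why the identity has to travel with the block rather than be inferred from its content.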

The guard

QuillCache makes the reuse contract explicit. Every block carries an IdentityScope (model · tokenizer · adapter · tenant), and LocalKvStore::get serves a block only when the requester’s identity matches:

pub fn get(&mut self, key: &KvBlockKey) -> Result<Bytes, StoreError> {
    // exact identity + content match -> serve.
    // content resident under a *different* identity -> refuse:
    //     Err(StoreError::Unsafe(ReuseViolation::Tenant | Adapter | Model | Tokenizer))
    // otherwise -> Err(StoreError::NotFound)
}
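One way the three outcomes in that contract could be implemented is sketched below. Only the names `IdentityScope`, `KvBlockKey`, `LocalKvStore::get`, `StoreError`, and `ReuseViolation` come from the text above; the field layout, the `put` method, the `first_mismatch` helper, the order in which mismatches are reported, and the use of `Vec<u8>` in place of `Bytes` (to stay within the standard library) are all assumptions made for illustration, not QuillCache's actual implementation.

```rust
use std::collections::HashMap;

/// Assumed shape of the identity attached to every block.
#[derive(Clone, Debug, PartialEq, Eq)]
struct IdentityScope {
    model: String,
    tokenizer: String,
    adapter: String,
    tenant: String,
}

/// Assumed key layout: content hash plus the requester's identity.
#[derive(Clone, Debug, PartialEq, Eq)]
struct KvBlockKey {
    content_hash: u64,
    scope: IdentityScope,
}

#[derive(Debug, PartialEq, Eq)]
enum ReuseViolation {
    Tenant,
    Adapter,
    Model,
    Tokenizer,
}

#[derive(Debug, PartialEq, Eq)]
enum StoreError {
    Unsafe(ReuseViolation),
    NotFound,
}

/// Blocks indexed by content hash; each entry keeps the scope it was computed under.
#[derive(Default)]
struct LocalKvStore {
    blocks: HashMap<u64, Vec<(IdentityScope, Vec<u8>)>>,
}

/// Hypothetical helper: names the first field that differs (order is an assumption).
fn first_mismatch(have: &IdentityScope, want: &IdentityScope) -> ReuseViolation {
    if have.model != want.model {
        ReuseViolation::Model
    } else if have.tokenizer != want.tokenizer {
        ReuseViolation::Tokenizer
    } else if have.adapter != want.adapter {
        ReuseViolation::Adapter
    } else {
        ReuseViolation::Tenant
    }
}

impl LocalKvStore {
    /// Hypothetical insert path, needed only to make the sketch runnable.
    fn put(&mut self, key: KvBlockKey, bytes: Vec<u8>) {
        self.blocks.entry(key.content_hash).or_default().push((key.scope, bytes));
    }

    fn get(&mut self, key: &KvBlockKey) -> Result<Vec<u8>, StoreError> {
        let resident = match self.blocks.get(&key.content_hash) {
            Some(entries) => entries,
            None => return Err(StoreError::NotFound),
        };
        // Exact identity + content match -> serve.
        if let Some((_, bytes)) = resident.iter().find(|(s, _)| *s == key.scope) {
            return Ok(bytes.clone());
        }
        // Content resident under a different identity -> refuse and say why.
        match resident.first() {
            Some((s, _)) => Err(StoreError::Unsafe(first_mismatch(s, &key.scope))),
            None => Err(StoreError::NotFound),
        }
    }
}

fn main() {
    let scope = |tenant: &str| IdentityScope {
        model: "m1".into(),
        tokenizer: "t1".into(),
        adapter: "none".into(),
        tenant: tenant.into(),
    };
    let mut store = LocalKvStore::default();
    store.put(KvBlockKey { content_hash: 42, scope: scope("tenant-a") }, vec![1, 2, 3]);

    let same = store.get(&KvBlockKey { content_hash: 42, scope: scope("tenant-a") });
    let other = store.get(&KvBlockKey { content_hash: 42, scope: scope("tenant-b") });
    let miss = store.get(&KvBlockKey { content_hash: 7, scope: scope("tenant-a") });
    println!("{:?} / {:?} / {:?}", same, other, miss);
}
```

Keeping the refusal distinct from `NotFound` is what lets a caller report the refusal separately from an ordinary miss, instead of silently recomputing.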

The same check runs inline on the live gateway: after a tenant-a request caches a prefix, a tenant-b request for the same content returns x-quillcache-local-hits: 0 and x-quillcache-reuse-refused: 2 — it refuses to serve tenant A’s KV to tenant B, and says so.

Measured

On a collision-heavy workload (one popular prefix shared across many identities — the multi-tenant shared-system-prompt / shared-RAG case):

policy                    | content-hash hits | unsafe served | safe reuse kept
naive (content hash only) | 12400             | 12000 (96.8%) |
identity guard            |                   | 0             | 4800
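The percentage in the naive row is the unsafe-served count divided by the content-hash hits. A small sketch of that arithmetic, with `unsafe_share_pct` as a hypothetical helper name:

```rust
/// Share of content-hash hits that crossed an identity boundary, in percent.
fn unsafe_share_pct(unsafe_served: u64, content_hash_hits: u64) -> f64 {
    if content_hash_hits == 0 {
        return 0.0; // no hits, nothing could have been served unsafely
    }
    100.0 * unsafe_served as f64 / content_hash_hits as f64
}

fn main() {
    // Figures from the naive row: 12000 unsafe out of 12400 hits.
    println!("{:.1}%", unsafe_share_pct(12_000, 12_400)); // prints "96.8%"
}
```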

The guard eliminates all unsafe reuse while preserving safe same-identity reuse. It is also precise rather than blunt: on a realistic, mostly-same-identity mix, the overhead (forced recomputes as a fraction of all reuse work) drops to 1.7%, compared with 47.8% on the adversarial all-collision case. Safety is near-free exactly where it matters.