Ingest Buffer Architecture (English)
The ingest buffer is a per-level staging area for SSTables—typically promoted from L0—designed to absorb bursts, reduce overlap, and unlock parallel compaction without touching the main level tables immediately. It combines fixed sharding, adaptive scheduling, and optional ingest-only merge to keep write amplification and contention low.
flowchart LR
L0["L0 SSTables"] -->|moveToIngest| Ingest["Ingest Buffer (sharded)"]
subgraph levelN["Level N"]
Ingest -->|ingest-only compact| MainTables["Main Tables"]
Ingest -->|ingest-merge| Ingest
end
Ingest -.read path merge.-> ClientReads["Reads/Iterators"]
Design Highlights
- Sharded by key prefix: ingest tables are routed into fixed shards (top bits of the first byte). Sharding cuts cross-range overlap and enables safe parallel drain.
- Snapshot-friendly reads: ingest tables are read under the level
RLock, and iterators hold table refs so mmap-backed data stays valid without additional snapshots. - Two ingest paths:
- Ingest-only compaction: drain ingest → main level (or next level) with optional multi-shard parallelism guarded by
compact.State. - Ingest-merge: compact ingest tables back into ingest (stay in-place) to drop superseded versions before promoting, reducing downstream write amplification.
- Ingest-only compaction: drain ingest → main level (or next level) with optional multi-shard parallelism guarded by
- IngestMode enum: plans carry an
IngestModewithIngestNone,IngestDrain, andIngestKeep.IngestDraincorresponds to ingest-only, whileIngestKeepcorresponds to ingest-merge. - Adaptive scheduling:
- Shard selection is driven by
compact.PickShardOrder/compact.PickShardByBacklogusing per-shard size, age, and density. - Shard parallelism scales with backlog score (based on shard size/target file size) bounded by
IngestShardParallelism. - Batch size scales with shard backlog to drain faster under pressure.
- Ingest-merge triggers when backlog score exceeds
IngestBacklogMergeScore(default 2.0), with dynamic lowering under extreme backlog/age.
- Shard selection is driven by
- Observability: expvar/stats expose ingest-only vs ingest-merge counts, duration, and tables processed, plus ingest size/value density per level/shard.
Configuration
IngestShardParallelism: max shards to compact in parallel (defaultmax(NumCompactors/2, 2), auto-scaled by backlog).IngestCompactBatchSize: base batch size per ingest compaction (auto-boosted by shard backlog).IngestBacklogMergeScore: backlog score threshold to trigger ingest-merge (default 2.0).
Benefits
- Lower write amplification: bursty L0 SSTables land in ingest first; ingest-merge prunes duplicates before full compaction.
- Reduced contention: sharding +
compact.Stateallow parallel ingest drain with minimal overlap. - Predictable reads: ingest is part of the read snapshot, so moving tables in/out does not change read semantics.
- Tunable and observable: knobs for parallelism and merge aggressiveness, with per-path metrics to guide tuning.
Future Work
- Deeper adaptive policies (IO/latency-aware), richer shard-level metrics, and more exhaustive parallel/restart testing under fault injection.