MemTable Flush Pipeline
NoKV’s flush path converts immutable memtables into L0 SST files, then advances the manifest WAL checkpoint and reclaims obsolete WAL segments. The queue and timing bookkeeping live directly in engine/lsm/flush_runtime.go; SST persistence and manifest install are in engine/lsm/table_builder.go and engine/lsm/level_manager.go.
1. Responsibilities
- Persistence: materialize immutable memtables into SST files.
- Ordering: publish SST metadata to manifest only after the SST is durably installed (strict mode).
- Cleanup: remove WAL segments once checkpoint and raft constraints allow removal.
- Observability: export queue/build/release timing through flush metrics.
2. Concrete Flush Queue
flowchart LR
Active[Active MemTable]
Immutable[Immutable MemTable]
FlushQ[flush queue]
Build[Build SST]
Install[Install SST]
Release[Release MemTable]
Active -->|threshold reached| Immutable --> FlushQ
FlushQ --> Build --> Install --> Release --> Active
- Enqueue: lsm.submitFlush pushes the immutable memtable into the concrete flush queue and records wait-start time.
- Build: worker pulls the next task, builds the SST (levelManager.flush -> openTable -> tableBuilder.flush).
- Install: after SST + manifest edits succeed, the worker records install timing.
- Release: worker removes the immutable from memory, closes the memtable, records release timing, and completes the task.
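The stage bookkeeping above can be pictured with a small Go sketch. It is not the code in engine/lsm/flush_runtime.go: flushTask, flushQueue, stageTimes, and runOne are hypothetical names, the queue is a plain buffered channel, and build and install are folded into a single callback for brevity.

```go
package main

import (
	"fmt"
	"time"
)

// flushTask is a hypothetical stand-in for one queued immutable memtable.
type flushTask struct {
	id         int
	enqueuedAt time.Time
}

// stageTimes holds the durations recorded for a single task.
type stageTimes struct {
	wait, build, release time.Duration
}

// flushQueue is a simplified FIFO backed by a buffered channel.
type flushQueue struct {
	tasks chan flushTask
}

func newFlushQueue(capacity int) *flushQueue {
	return &flushQueue{tasks: make(chan flushTask, capacity)}
}

// submit enqueues a task and stamps the wait-start time.
func (q *flushQueue) submit(id int) {
	q.tasks <- flushTask{id: id, enqueuedAt: time.Now()}
}

// runOne pulls the next task, then runs build and release in order,
// timing each stage. A failed build skips release.
func (q *flushQueue) runOne(build, release func(id int) error) (stageTimes, error) {
	task := <-q.tasks
	var st stageTimes
	st.wait = time.Since(task.enqueuedAt)

	start := time.Now()
	if err := build(task.id); err != nil {
		return st, fmt.Errorf("build task %d: %w", task.id, err)
	}
	st.build = time.Since(start)

	start = time.Now()
	if err := release(task.id); err != nil {
		return st, fmt.Errorf("release task %d: %w", task.id, err)
	}
	st.release = time.Since(start)
	return st, nil
}

func main() {
	q := newFlushQueue(4)
	q.submit(1)
	var order []string
	st, err := q.runOne(
		func(id int) error { order = append(order, "build+install"); return nil },
		func(id int) error { order = append(order, "release"); return nil },
	)
	fmt.Println(order, st.wait >= 0, err)
}
```

Stamping the enqueue time on the task itself keeps the wait measurement independent of how many workers drain the queue.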
3. SST Persistence Modes
Flush uses two write modes controlled by Options.ManifestSync:
- Fast path (ManifestSync=false)
  - Writes SST directly to final filename with O_CREATE|O_EXCL.
  - No temp file/rename step.
  - Highest throughput, weaker crash-consistency guarantees.
- Strict path (ManifestSync=true)
  - Writes to "<table>.tmp.<pid>.<ns>".
  - tmp.Sync() to persist SST bytes.
  - RenameNoReplace(tmp, final) installs file atomically. If unsupported by platform/filesystem, returns vfs.ErrRenameNoReplaceUnsupported.
  - SyncDir(workdir) is called before manifest edit so directory entry is durable.
This is the durability ordering used by current code.
4. Execution Path in Code
- lsm.Set/lsm.SetBatch detects walSize + estimate > MemTableSize and rotates memtable.
- Rotated memtable is submitted to the flush queue (lsm.submitFlush).
- Worker executes levelManager.flush(mt):
  - iterates memtable entries,
  - builds SST via tableBuilder,
  - prepares manifest edits: EditAddFile + EditLogPointer.
- In strict mode, SyncDir runs before manifest.LogEdits(...).
- On successful manifest commit, table is added to L0 and wal.RemoveSegment runs when allowed.
5. Recovery Notes
- Startup rebuild (levelManager.build) validates manifest SST entries against disk.
- Missing or unreadable SSTs fail startup; normal restart does not repair manifest state by deleting referenced files.
- Temp SST names are only used in strict mode and are created in WorkDir with suffix .tmp.<pid>.<ns> (not a dedicated tmp/ directory).
6. Metrics & CLI
flushRuntime.stats() feeds StatsSnapshot.Flush:
- pending, queue, active
- wait/build/release totals, counts, last, max
- completed
Use:
nokv stats --workdir <dir>
to inspect flush backlog and latency.
7. Related Tests
- engine/lsm/flush_runtime_test.go: queue lifecycle and timing counters.
- db_test.go::TestRecoveryWALReplayRestoresData: replay still restores data after crash before flush completion.
- db_test.go::TestRecoveryFailsOnMissingSST and db_test.go::TestRecoveryFailsOnCorruptSST: startup fails when manifest SSTs are missing or corrupt.
See also recovery.md, memtable.md, and wal.md.