Crash Recovery Playbook
This document describes how NoKV restores state after an abnormal exit, and which tests validate each recovery contract.
1. Recovery Phases
flowchart TD
Start[DB.Open]
Verify[runRecoveryChecks]
WalOpen[wal.Open]
LSM[lsm.NewLSM]
Manifest[manifest replay + table load]
WALReplay[WAL replay to memtables]
VLog[valueLog recover]
Flush[submit immutable flush backlog]
Stats[stats/start background loops]
Start --> Verify --> WalOpen --> LSM --> Manifest --> WALReplay --> VLog --> Flush --> Stats
- Pre-flight verification: DB.runRecoveryChecks runs manifest.Verify, wal.VerifyDir, and per-bucket vlog.VerifyDir.
- WAL manager reopen: wal.Open reopens latest segment and rebuilds counters.
- Manifest replay + SST load: levelManager.build replays manifest version and opens SST files.
- Stale SST cleanup: if a manifest SST is missing or unreadable/corrupt, it is marked stale and removed from manifest (EditDeleteFile) so startup can continue.
- WAL replay: lsm.recovery replays post-checkpoint WAL records into memtables.
- Flush backlog restore: recovered immutable memtables are resubmitted to flush.Manager.
- ValueLog recovery: value-log managers reconcile on-disk files with manifest metadata, trim torn tails, and drop stale/orphan segments.
- Runtime restart: metrics and periodic workers start again.
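The value-log reconciliation step can be pictured as a pure classification of segment IDs. The sketch below is illustrative only: `segmentPlan`, `planSegments`, and the map-based manifest view are hypothetical names and shapes, not NoKV's actual types, and torn-tail trimming is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// segmentPlan lists what a reconciliation pass would do with each
// value-log segment found on disk. Hypothetical type for illustration.
type segmentPlan struct {
	Keep   []uint32 // tracked by the manifest and still valid
	Stale  []uint32 // tracked by the manifest but marked invalid
	Orphan []uint32 // present on disk but unknown to the manifest
}

// planSegments compares on-disk segment IDs against manifest metadata.
// manifestValid maps a segment ID to its "valid" flag; an ID missing
// from the map is treated as an orphan.
func planSegments(manifestValid map[uint32]bool, onDisk []uint32) segmentPlan {
	var p segmentPlan
	for _, id := range onDisk {
		valid, tracked := manifestValid[id]
		switch {
		case !tracked:
			p.Orphan = append(p.Orphan, id)
		case !valid:
			p.Stale = append(p.Stale, id)
		default:
			p.Keep = append(p.Keep, id)
		}
	}
	// Sort each group so the plan is deterministic.
	for _, s := range [][]uint32{p.Keep, p.Stale, p.Orphan} {
		s := s
		sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	}
	return p
}

func main() {
	manifest := map[uint32]bool{1: true, 2: false, 3: true}
	plan := planSegments(manifest, []uint32{3, 1, 2, 7})
	fmt.Printf("%+v\n", plan)
}
```

In this model, both the `Stale` and `Orphan` groups would be deleted from disk, which matches the "drop stale/orphan segments" behaviour described above.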
2. Failure Scenarios & Tests
| Failure Point | Expected Recovery Behaviour | Tests |
|---|---|---|
| WAL tail truncated | Replay stops safely at truncated tail, preserving valid prefix records | wal/manager_test.go::TestManagerReplayHandlesTruncate |
| Crash before memtable flush install | WAL replay restores user data not yet flushed to SST | db_test.go::TestRecoveryWALReplayRestoresData |
| Manifest references missing SST | Startup removes stale manifest entry and continues | db_test.go::TestRecoveryCleansMissingSSTFromManifest |
| Manifest references corrupt/unreadable SST | Startup removes stale entry and continues | db_test.go::TestRecoveryCleansCorruptSSTFromManifest |
| ValueLog stale segment (manifest marked invalid) | Recovery deletes stale file from disk | db_test.go::TestRecoveryRemovesStaleValueLogSegment |
| ValueLog orphan segment (disk only) | Recovery deletes orphan file not tracked by manifest | db_test.go::TestRecoveryRemovesOrphanValueLogSegment |
| Manifest rewrite interrupted | Recovery keeps using CURRENT-selected manifest and data remains readable | db_test.go::TestRecoveryManifestRewriteCrash |
| ValueLog contains records absent from LSM/WAL | Recovery does not replay vlog as source-of-truth | db_test.go::TestRecoverySkipsValueLogReplay |
3. Recovery Tooling
3.1 Targeted tests
go test ./... -run 'Recovery|ReplayHandlesTruncate'
Set RECOVERY_TRACE_METRICS=1 to emit RECOVERY_METRIC ... lines in tests.
3.2 Script harness
RECOVERY_TRACE_METRICS=1 ./scripts/recovery_scenarios.sh
Outputs are saved under artifacts/recovery/.
3.3 CLI checks
- nokv manifest --workdir <dir>: verify level files, WAL pointer, vlog metadata.
- nokv stats --workdir <dir>: confirm flush backlog converges.
- nokv vlog --workdir <dir>: inspect vlog segment state.
4. Operational Signals
Watch these fields during restart:
- flush.queue_length
- wal.segment_count
- value_log.heads
- value_log.segments
- value_log.pending_deletes
If flush.queue_length remains high after replay, inspect flush worker throughput and manifest sync settings.
5. Notes on Consistency Model
- WAL + manifest remain the authoritative recovery chain for LSM state.
- ValueLog is reconciled/validated but is not replayed as a mutation source.
- In strict flush mode (ManifestSync=true), SST install ordering is SST Sync -> RenameNoReplace -> SyncDir -> manifest edit.
For deeper internals, see flush.md, manifest.md, and wal.md.