WAL Subsystem
NoKV’s write-ahead log mirrors RocksDB’s durability model and is implemented as a compact Go module similar to Badger’s journal. WAL appends happen alongside memtable writes (via `lsm.Set`), while values routed to the value log are written before the WAL so that replay always sees durable value pointers.
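A minimal sketch of that ordering follows; every name in it is a stand-in, not NoKV’s actual internals:

```go
// Sketch of the commit ordering described above; all names are stand-ins.
package walorder

// commit makes a separated value durable in the value log first, then appends
// the WAL record (which now carries only a pointer), and only then updates the
// memtable, so replay always sees resolvable value pointers.
func commit(
	vlogWrite func(value []byte) (ptr []byte, err error), // value-log append
	walAppend func(record []byte) error,                  // WAL append
	memPut func(key, val []byte) error,                   // memtable insert
	key, value []byte,
	separate bool, // value-size threshold decision, made elsewhere
) error {
	v := value
	if separate {
		ptr, err := vlogWrite(value)
		if err != nil {
			return err
		}
		v = ptr // the WAL records the pointer, never the large value
	}
	record := append(append([]byte(nil), key...), v...) // stand-in encoding
	if err := walAppend(record); err != nil {
		return err
	}
	return memPut(key, v)
}
```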
1. File Layout & Naming
- Location: `${Options.WorkDir}/wal/`.
- Naming pattern: `%05d.wal` (e.g. `00001.wal`).
- Rotation threshold: configurable via `wal.Config.SegmentSize` (defaults to 64 MiB, minimum 64 KiB).
- Verification: `wal.VerifyDir` ensures the directory exists prior to `DB.Open`.
On open, `wal.Manager` scans the directory, sorts segment IDs, and resumes appending to the highest one, mirroring how RocksDB re-opens its MANIFEST and WAL sequence files. The scan is sketched below.
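A self-contained sketch of that open-time scan, assuming only the `%05d.wal` naming pattern above (the real logic lives in `wal.Manager`):

```go
// Sketch of the open-time scan, assuming the %05d.wal naming pattern.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// highestSegment returns the numerically largest segment ID found in dir.
func highestSegment(dir string) (uint64, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	ids := make([]uint64, 0, len(entries))
	for _, e := range entries {
		name := e.Name()
		if !strings.HasSuffix(name, ".wal") {
			continue
		}
		id, err := strconv.ParseUint(strings.TrimSuffix(name, ".wal"), 10, 64)
		if err != nil {
			continue // ignore files that don't match the pattern
		}
		ids = append(ids, id)
	}
	if len(ids) == 0 {
		return 0, fmt.Errorf("no WAL segments in %s", dir)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids[len(ids)-1], nil
}

func main() {
	if id, err := highestSegment(filepath.Join("workdir", "wal")); err == nil {
		fmt.Printf("resume appends in %05d.wal\n", id)
	}
}
```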
2. Record Format
```
uint32 length   (big-endian, includes type + payload)
uint8  type
[]byte payload
uint32 checksum (CRC32 Castagnoli over type + payload)
```
- Checksums use `kv.CastagnoliCrcTable`, the same polynomial used by RocksDB (Castagnoli). Record encoding/decoding lives in `wal/record.go` (sketched below).
- The type byte allows mixing LSM mutations with raft log/state/snapshot records in the same WAL segment.
- Appends are buffered by `bufio.Writer` so batches become single system calls.
- Replay stops cleanly at truncated tails; tests simulate torn writes by truncating the final bytes and verifying replay remains idempotent (`wal/manager_test.go::TestReplayTruncatedTail`).
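To make the framing concrete, here is a sketch of encoding and decoding one record per the layout above; helper names are illustrative, and the canonical implementation is `wal/record.go`:

```go
// Sketch of the record framing above; the canonical code is wal/record.go.
package walrecord

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
)

// Same polynomial the document attributes to kv.CastagnoliCrcTable.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeRecord frames one record as: length | type | payload | checksum,
// where length and the CRC both cover type + payload.
func encodeRecord(typ byte, payload []byte) []byte {
	buf := make([]byte, 4+1+len(payload)+4)
	binary.BigEndian.PutUint32(buf[0:4], uint32(1+len(payload)))
	buf[4] = typ
	copy(buf[5:], payload)
	sum := crc32.Checksum(buf[4:5+len(payload)], castagnoli)
	binary.BigEndian.PutUint32(buf[5+len(payload):], sum)
	return buf
}

// decodeRecord reverses encodeRecord; torn tails fail before the CRC check.
func decodeRecord(buf []byte) (typ byte, payload []byte, err error) {
	if len(buf) < 9 { // length + type + empty payload + checksum
		return 0, nil, errors.New("truncated header")
	}
	n := binary.BigEndian.Uint32(buf[0:4])
	if n == 0 {
		return 0, nil, errors.New("zero-length record")
	}
	if uint64(len(buf)) < uint64(n)+8 {
		return 0, nil, errors.New("truncated body") // clean end of log
	}
	body := buf[4 : 4+n]
	want := binary.BigEndian.Uint32(buf[4+n : 4+n+4])
	if crc32.Checksum(body, castagnoli) != want {
		return 0, nil, errors.New("checksum mismatch")
	}
	return body[0], body[1:], nil
}
```

Note how a truncated tail surfaces as a "truncated body" error before the checksum is consulted, which is exactly the condition replay treats as a clean end of the log.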
3. Public API (Go)
```go
mgr, _ := wal.Open(wal.Config{Dir: path})
infos, _ := mgr.Append(batchPayload)
_ = mgr.Sync()
_ = mgr.Rotate()
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
	// reapply to memtable
	return nil
})
```
Key behaviours:
- `Append` automatically calls `ensureCapacity` to decide when to rotate; it returns `EntryInfo{SegmentID, Offset, Length}` for each payload so higher layers can build value pointers or manifest checkpoints (see the sketch below).
- `Sync` flushes the active file (used for `Options.SyncWrites`).
- `Rotate` forces a new segment (used after flush/compaction checkpoints, similar to RocksDB’s `LogFileManager::SwitchLog`).
- `Replay` iterates segments in numeric order, forwarding each payload to the callback. Errors abort replay so recovery can surface corruption early.
- Metrics (`wal.Manager.Metrics`) reveal the active segment ID, total segments, and number of removed segments; these feed directly into `StatsSnapshot` and `nokv stats` output.
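As a usage example, the snippet below continues the API example above and derives a checkpoint position from the last `EntryInfo`; the `Offset + Length` arithmetic assumes both fields share an integer type, so verify against `wal.EntryInfo` before relying on it:

```go
// Continuation of the snippet above: derive a checkpoint from the last append.
infos, err := mgr.Append(batchPayload)
if err == nil && len(infos) > 0 {
	_ = mgr.Sync() // durability point when Options.SyncWrites is set
	last := infos[len(infos)-1]
	ckptSegment := last.SegmentID           // what EditLogPointer persists
	ckptOffset := last.Offset + last.Length // first byte past the batch; types assumed
	_, _ = ckptSegment, ckptOffset          // hand these to the manifest layer
}
```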
Compared with Badger: Badger keeps a single vlog for both data and durability. NoKV splits WAL (durability) from vlog (value separation), matching RocksDB’s separation of WAL and blob files.
4. Integration Points
| Call Site | Purpose |
|---|---|
| `lsm.memTable.set` | Encodes each entry (`kv.EncodeEntry`) and appends to the WAL before inserting into the skiplist. |
| `DB.commitWorker` | Commit worker applies batched writes via `writeToLSM`, which flows into `lsm.Set` and thus the WAL. |
| `DB.Set` | Direct write path: calls `lsm.Set`, which appends to the WAL and updates the memtable. |
| `manifest.Manager.LogEdit` | Uses `EntryInfo.SegmentID` to persist the WAL checkpoint (`EditLogPointer`). This acts as the log number seen in RocksDB manifest entries. |
| `lsm/flush.Manager.Update` | Once an SST is installed, WAL segments older than the checkpoint are released (`wal.Manager.Remove`). |
| `db.runRecoveryChecks` | Ensures WAL directory invariants before manifest replay, similar to Badger’s directory bootstrap. |
5. Metrics & Observability
`Stats.collect` reads the manager metrics and exposes them as:

- `NoKV.WAL.ActiveSegment`
- `NoKV.WAL.SegmentCount`
- `NoKV.WAL.RemovedSegments`
The CLI command `nokv stats --workdir <dir>` prints these alongside backlog, making WAL health visible without manual inspection. In high-throughput scenarios the active segment ID mirrors RocksDB’s LOG number growth.
6. WAL Watchdog (Auto GC)
The WAL watchdog runs inside the DB process to keep WAL backlog in check and surface warnings when raft-typed records dominate the log. It:
- Samples WAL metrics plus per-segment metrics and combines them with `manifest.RaftPointerSnapshot()` to compute removable segments.
- Removes up to `WALAutoGCMaxBatch` segments when at least `WALAutoGCMinRemovable` are eligible.
- Exposes counters (`WALAutoGCRuns`/`Removed`/`LastUnix`) and warning state (`WALTypedRecordRatio`/`Warning`/`Reason`) through `StatsSnapshot`.
Relevant options (see `options.go` for defaults), tuned as in the sketch below:

- `EnableWALWatchdog`
- `WALAutoGCInterval`
- `WALAutoGCMinRemovable`
- `WALAutoGCMaxBatch`
- `WALTypedRecordWarnRatio`
- `WALTypedRecordWarnSegments`
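A hypothetical tuning sketch follows. It assumes these fields sit on the top-level `Options` struct with the obvious types (`time.Duration`, integers, a float ratio) and that the DB is opened via an `Open(Options)` entry point; the values are illustrative, not recommendations:

```go
// Hypothetical tuning sketch; field placement, types, and the Open(Options)
// entry point are assumptions, and the values are illustrative only.
opts := nokv.Options{
	WorkDir:                    "./data",
	EnableWALWatchdog:          true,
	WALAutoGCInterval:          30 * time.Second, // sampling cadence
	WALAutoGCMinRemovable:      2,                // skip runs with a tiny backlog
	WALAutoGCMaxBatch:          8,                // cap removals per run
	WALTypedRecordWarnRatio:    0.5,              // warn when raft records dominate
	WALTypedRecordWarnSegments: 4,                // ...across at least this many segments
}
db, err := nokv.Open(opts)
```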
7. Recovery Walkthrough
- `wal.Open` reopens the highest segment, leaving the file pointer at the end (`switchSegmentLocked`).
- `manifest.Manager` supplies the WAL checkpoint (segment + offset) while building the version. Replay skips entries up to this checkpoint (see the sketch below), ensuring we only reapply writes not yet materialised in SSTables.
- `wal.Manager.Replay` (invoked by the LSM recovery path) rebuilds memtables from entries newer than the manifest checkpoint. Value-log recovery only validates/truncates segments and does not reapply data.
- If the final record is partially written, the CRC mismatch stops replay and the segment is truncated during recovery tests, mimicking RocksDB’s tolerant behaviour.
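A sketch of that checkpoint-aware replay, assuming a `(ckptSegment, ckptOffset)` pair loaded from the manifest and a hypothetical `applyToMemtable` helper; the comparison logic is illustrative rather than lifted from the recovery path:

```go
// Checkpoint-aware replay; ckptSegment/ckptOffset come from the manifest and
// applyToMemtable is a hypothetical helper.
err := mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
	seen := info.SegmentID < ckptSegment ||
		(info.SegmentID == ckptSegment && info.Offset < ckptOffset)
	if seen {
		return nil // already materialised in an SSTable; skip
	}
	return applyToMemtable(payload) // any error aborts replay immediately
})
```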
8. Operational Tips
- Configure `SyncOnWrite` for synchronous durability (the default is asynchronous, matching RocksDB’s default). For durability-critical deployments, consider enabling it to emulate Badger’s `SyncWrites`.
- After large flushes, forcing `Rotate` keeps WAL files short, reducing replay time.
- Archived WAL segments can be copied alongside manifest files for hot-backup strategies; since the manifest contains the WAL log number, such snapshots behave like RocksDB’s `Checkpoint`s (see the sketch below).
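For the backup tip, a self-contained sketch under stated assumptions: the directory names beyond the documented `wal/` are guesses, and it presumes the copied state is quiescent (e.g. segments already rotated), so treat it as a starting point rather than a supported tool:

```go
// Hot-backup sketch; directory names beyond wal/ are assumptions, and it
// presumes the copied state is quiescent (e.g. taken after Rotate).
package backup

import (
	"io"
	"os"
	"path/filepath"
)

// snapshot copies manifest files and WAL segments into dst. Because the
// manifest records the WAL log number, the pair restores consistently.
func snapshot(workDir, dst string) error {
	for _, sub := range []string{"manifest", "wal"} { // "manifest" is a guess
		srcDir := filepath.Join(workDir, sub)
		dstDir := filepath.Join(dst, sub)
		if err := os.MkdirAll(dstDir, 0o755); err != nil {
			return err
		}
		entries, err := os.ReadDir(srcDir)
		if err != nil {
			return err
		}
		for _, e := range entries {
			if e.IsDir() {
				continue
			}
			if err := copyFile(filepath.Join(srcDir, e.Name()), filepath.Join(dstDir, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}

func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}
```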
9. Truncation Metadata
- `raftstore/engine/wal_storage` keeps a per-group index of `[firstIndex, lastIndex]` spans for each WAL record so it can map raft log indices back to the segment that stored them.
- When a log is truncated (either via snapshot or future compaction hooks), the manifest is updated via `LogRaftTruncate` with the index/term, segment ID (`RaftLogPointer.SegmentIndex`), and byte offset (`RaftLogPointer.TruncatedOffset`) that delimit the remaining WAL data.
- `lsm/levelManager.canRemoveWalSegment` blocks garbage collection whenever any raft group still references a segment through its truncation metadata (see the sketch below), preventing slow followers from losing required WAL history while letting aggressively compacted groups release older segments earlier.
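An illustrative predicate in the spirit of `lsm/levelManager.canRemoveWalSegment`; the types are stand-ins, and the real check also consults the flush checkpoint recorded in the manifest:

```go
// Illustrative predicate in the spirit of lsm/levelManager.canRemoveWalSegment;
// types are stand-ins for the real manifest structures.
package walgc

type raftLogPointer struct {
	SegmentIndex uint64 // oldest WAL segment this raft group still references
}

// canRemoveWalSegment allows GC only for segments behind the flush checkpoint
// that no raft group's truncation metadata still points into.
func canRemoveWalSegment(seg, checkpointSeg uint64, groups map[uint64]raftLogPointer) bool {
	if seg >= checkpointSeg {
		return false // not yet covered by an installed SSTable
	}
	for _, p := range groups {
		if p.SegmentIndex <= seg {
			return false // a slow follower may still need this segment
		}
	}
	return true
}
```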
For broader context, read the architecture overview and flush pipeline documents.