Value Log (vlog) Design
NoKV keeps the LSM tree lean by separating large values into sequential value log (vlog) files. The module is split across three files:

- `vlog/manager.go` – owns the open file set, rotation, and segment lifecycle helpers.
- `vlog/io.go` – append/read/iterate/verify/sample IO paths.
- `vlog.go` – integrates the manager with the DB write path, discard statistics, and garbage collection (GC).
The design echoes BadgerDB’s value log while remaining manifest-driven like RocksDB’s blob_db: vlog metadata (head pointer, pending deletions) is persisted inside the manifest so recovery can reconstruct the exact state without scanning the filesystem.
1. Layering (Engine View)
The value log is split into three layers so IO can stay reusable while DB policy lives in the core package:
- DB policy layer (`vlog.go`, `vlog_gc.go`) – integrates the manager with the DB write path, persists vlog metadata in the manifest, and drives GC scheduling based on discard stats.
- Value-log manager (`vlog/`) – owns segment lifecycle (open/rotate/remove), encodes/decodes entries, and exposes append/read/sample APIs without touching MVCC or LSM policy (see the interface sketch below).
- File IO (`file/`) – mmap-backed `LogFile` primitives (open/close/truncate, read/write, read-only remap) shared by WAL/vlog/SST. Vlog currently uses `LogFile` directly instead of an intermediate store abstraction.
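These seams can be pictured as Go interfaces. The sketch below is illustrative only: the type and method names are stand-ins, not NoKV's actual declarations.

```go
package layering

// ValuePtr locates a value inside a segment (illustrative stand-in).
type ValuePtr struct{ Fid, Offset, Len uint32 }

// Entry is the unit the manager encodes; it mirrors the record format
// described in section 3.
type Entry struct {
	Key, Value []byte
	Meta       byte
	ExpiresAt  uint64
}

// file/ layer: raw mmap-backed segment primitives, no codec knowledge.
type segmentFile interface {
	Append(b []byte) (offset uint32, err error)
	Bytes(offset, size uint32) ([]byte, error)
	Truncate(size int64) error
}

// vlog/ layer: entry codec + segment lifecycle, no MVCC or LSM policy.
type valueLogManager interface {
	AppendEntry(e Entry) (ValuePtr, error)
	Read(ptr ValuePtr) (val []byte, unlock func(), err error)
	Rewind(head ValuePtr) error
	Remove(fid uint32) error
}
```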
2. Directory Layout & Naming
```
<workdir>/
  vlog/
    00000.vlog
    00001.vlog
    ...
```
- Files are named `%05d.vlog` and live under `workdir/vlog/`. `Manager.populate` discovers existing segments at open (see the discovery sketch below).
- `Manager` tracks the active file ID (`activeID`) and byte offset; `Manager.Head` exposes these so the manifest can checkpoint them (`manifest.EditValueLogHead`).
- Files created after a crash but never linked in the manifest are removed during `valueLog.reconcileManifest`.
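For illustration, discovery of the `%05d.vlog` pattern could look like the following sketch. The helper names are hypothetical; NoKV performs this work inside `Manager.populate`.

```go
package vlogdir

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// segmentName renders the %05d.vlog pattern, e.g. fid 1 -> "00001.vlog".
func segmentName(fid uint32) string {
	return fmt.Sprintf("%05d.vlog", fid)
}

// discoverSegments lists existing *.vlog files under workdir/vlog/,
// sorted by file ID, ignoring anything that does not match the pattern.
func discoverSegments(workdir string) ([]uint32, error) {
	entries, err := os.ReadDir(filepath.Join(workdir, "vlog"))
	if err != nil {
		return nil, err
	}
	var fids []uint32
	for _, e := range entries {
		name := e.Name()
		if !strings.HasSuffix(name, ".vlog") {
			continue
		}
		id, err := strconv.ParseUint(strings.TrimSuffix(name, ".vlog"), 10, 32)
		if err != nil {
			continue // foreign file; not a segment
		}
		fids = append(fids, uint32(id))
	}
	sort.Slice(fids, func(i, j int) bool { return fids[i] < fids[j] })
	return fids, nil
}
```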
3. Record Format
The vlog uses the shared encoding helper (`kv.EncodeEntryTo`), so entries written to the value log and the WAL are byte-identical.
```
+--------+----------+------+-----------+-----------+-------+
| KeyLen | ValueLen | Meta | ExpiresAt | Key bytes | Value |
+--------+----------+------+-----------+-----------+-------+
+ CRC32 (4 B) trailer appended after Value
```
- Header fields are varint-encoded (`kv.EntryHeader`).
- The first 20 bytes of every segment are reserved (`kv.ValueLogHeaderSize`) for future metadata; iteration always skips this fixed header.
- `kv.EncodeEntry` and the entry iterator (`kv.EntryIterator`) perform the layout work, and each append finishes with a CRC32 to detect torn writes (see the encoder sketch below).
- `vlog.VerifyDir` scans all segments with `sanitizeValueLog` to trim corrupted tails after crashes, mirroring RocksDB's `blob_file::Sanitize`. Badger performs a similar truncation pass at startup.
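A standalone sketch of that layout follows, with fields in the order the diagram shows. The real `kv.EncodeEntryTo` may differ in details; the CRC polynomial and byte order here are assumptions.

```go
package vlogcodec

import (
	"encoding/binary"
	"hash/crc32"
)

// encodeEntry appends KeyLen, ValueLen (varints), Meta, ExpiresAt
// (varint), then key and value bytes, and finishes with a CRC32 over
// everything written for this record so torn writes are detectable.
func encodeEntry(buf []byte, meta byte, expiresAt uint64, key, val []byte) []byte {
	start := len(buf)
	var tmp [binary.MaxVarintLen64]byte
	buf = append(buf, tmp[:binary.PutUvarint(tmp[:], uint64(len(key)))]...)
	buf = append(buf, tmp[:binary.PutUvarint(tmp[:], uint64(len(val)))]...)
	buf = append(buf, meta)
	buf = append(buf, tmp[:binary.PutUvarint(tmp[:], expiresAt)]...)
	buf = append(buf, key...)
	buf = append(buf, val...)
	var crc [4]byte
	binary.BigEndian.PutUint32(crc[:], crc32.ChecksumIEEE(buf[start:]))
	return append(buf, crc[:]...)
}
```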
4. Manager API Surface
```go
mgr, _ := vlog.Open(vlog.Config{Dir: "...", MaxSize: 1 << 29})
ptr, _ := mgr.AppendEntry(entry)
ptrs, _ := mgr.AppendEntries(entries, writeMask)
val, unlock, _ := mgr.Read(ptr)
unlock()             // release per-file lock
_ = mgr.Rewind(*ptr) // roll back a partially written batch
_ = mgr.Remove(fid)  // close + delete file
```
Key behaviours:
- Append + Rotate – `Manager.AppendEntry` encodes and appends into the active file. The reservation path handles rotation when the active segment would exceed `MaxSize`; manual rotation is rare.
- Crash recovery – `Manager.Rewind` truncates the active file and removes newer files when a write batch fails mid-flight. `valueLog.write` uses this to guarantee idempotent WAL/value-log ordering.
- Safe reads – `Manager.Read` returns an mmap-backed slice plus an unlock callback. Active segments take a per-file `RWMutex`, while sealed segments use a pin/unpin path to avoid long-held locks; callers that need ownership should copy the bytes before releasing the lock (see the sketch below).
- Verification – `VerifyDir` validates entire directories (used by the CLI and recovery) by parsing headers and CRCs.
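The safe-read contract deserves a miniature example. This is a minimal sketch assuming the `Read` signature shown above; the interface and struct here are stand-ins, not NoKV types.

```go
package vlogread

// ValuePtr is a stand-in modelled on the API surface above.
type ValuePtr struct{ Fid, Offset, Len uint32 }

type valueReader interface {
	Read(ptr ValuePtr) (val []byte, unlock func(), err error)
}

// readOwned returns bytes the caller owns: the slice handed back by
// Read aliases the mmap and is only valid until unlock() runs.
func readOwned(mgr valueReader, ptr ValuePtr) ([]byte, error) {
	val, unlock, err := mgr.Read(ptr)
	if err != nil {
		return nil, err
	}
	defer unlock() // drops the per-file RWMutex (active) or pin (sealed)
	out := make([]byte, len(val))
	copy(out, val) // own the bytes before the mapping can move or unmap
	return out, nil
}
```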
Compared with RocksDB's blob manager, the surface is intentionally small: NoKV treats the manager as an append-only log with rewind semantics, while RocksDB maintains index structures inside the blob file metadata.
5. Integration with DB Writes
```mermaid
sequenceDiagram
    participant Commit as commitWorker
    participant Mgr as vlog.Manager
    participant WAL as wal.Manager
    participant Mem as MemTable
    Commit->>Mgr: AppendEntries(entries, writeMask)
    Mgr-->>Commit: ValuePtr list
    Commit->>WAL: Append(entries+ptrs)
    Commit->>Mem: apply to skiplist
```
- `valueLog.write` builds a write mask for each batch, then delegates to `Manager.AppendEntries`. Entries staying in the LSM (`shouldWriteValueToLSM`) receive zero-value pointers.
- Rotation is handled inside the manager when the reserved bytes would exceed `MaxSize`. The WAL append happens after the value log append so crash replay observes consistent pointers.
- Any error triggers `Manager.Rewind` back to the saved head pointer, removing new files and truncating partial bytes. `vlog_test.go` exercises both append- and rotate-failure paths.
- `Txn.Commit` and batched writes share the same pipeline: the commit worker always writes the value log first, then applies to WAL/memtable, keeping MVCC and durability ordering consistent. A sketch of this ordering follows the list.
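Put together, the ordering could be sketched as below. Every type and signature here is an illustrative stand-in for the commit worker's actual plumbing.

```go
package commit

import "errors"

type ValuePtr struct{ Fid, Offset, Len uint32 }
type Entry struct{ Key, Value []byte }

// Stand-in interfaces for the three participants in the diagram above.
type vlogAPI interface {
	Head() ValuePtr
	AppendEntries(entries []Entry, mask []bool) ([]ValuePtr, error)
	Rewind(head ValuePtr) error
}
type walAPI interface{ Append(entries []Entry, ptrs []ValuePtr) error }
type memtableAPI interface{ Apply(entries []Entry, ptrs []ValuePtr) error }

// commitBatch writes the value log first, then WAL, then memtable; any
// failure rewinds the vlog so replay never observes dangling pointers.
func commitBatch(v vlogAPI, w walAPI, m memtableAPI, entries []Entry, mask []bool) error {
	head := v.Head() // snapshot (fid, offset) before the batch
	ptrs, err := v.AppendEntries(entries, mask)
	if err != nil {
		return errors.Join(err, v.Rewind(head)) // vlog rolled back; WAL untouched
	}
	if err := w.Append(entries, ptrs); err != nil {
		return errors.Join(err, v.Rewind(head)) // keep vlog and WAL consistent
	}
	return m.Apply(entries, ptrs) // finally publish to the skiplist
}
```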
Badger follows the same ordering (value log first, then write batch). RocksDB’s blob DB instead embeds blob references into the WAL entry before the blob file write, relying on two-phase commit between WAL and blob; NoKV avoids the extra coordination by reusing a single batching loop.
6. Discard Statistics & GC
```mermaid
flowchart LR
    FlushMgr -- "obsolete ptrs" --> DiscardStats
    DiscardStats -->|"batch json"| writeCh
    valuePtr["valueLog.newValuePtr(lfDiscardStatsKey)"]
    writeCh --> valuePtr
    valueLog -- "GC trigger" --> Manager
```
- `lfDiscardStats` aggregates per-file discard counts from `lsm.FlushTable` completion (`valueLog.lfDiscardStats.push` inside `lsm/flush`). Once the in-memory counter crosses `discardStatsFlushThreshold`, it marshals the map into JSON and writes it back through the DB pipeline under the special key `!NoKV!discard`. `valueLog.flushDiscardStats` consumes those stats, ensuring they are persisted even across crashes. During recovery `valueLog.populateDiscardStats` replays the JSON payload to repopulate the in-memory map.
- GC uses `discardRatio = discardedBytes / totalBytes` derived from `Manager.Sample`, which applies windowed iteration based on configurable ratios (the decision is sketched after this list). If a file exceeds the configured threshold, `valueLog.doRunGC` rewrites live entries into the current head (using `Manager.Append`) and then `valueLog.rewrite` schedules deletion edits in the manifest.
- Sampling behaviour is controlled by `Options.ValueLogGCSampleSizeRatio` (default 0.10 of the file) and `Options.ValueLogGCSampleCountRatio` (default 1% of the configured entry limit). Setting either to `<= 0` keeps the default heuristics. `Options.ValueLogGCSampleFromHead` starts sampling from the beginning instead of a random window.
- Completed deletions are logged via `lsm.LogValueLogDelete` so the manifest can skip them during replay. When GC rotates to a new head, `valueLog.updateHead` records the pointer and bumps the `NoKV.ValueLog.HeadUpdates` counter.
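The trigger itself reduces to a ratio test. A minimal sketch, assuming illustrative field names for what `Manager.Sample` reports:

```go
package gc

// sampleResult approximates what Manager.Sample reports for one window;
// the field names are illustrative.
type sampleResult struct {
	TotalBytes     int64 // bytes examined in the sampled window
	DiscardedBytes int64 // bytes whose pointers the LSM no longer references
}

// shouldRewrite applies discardRatio = discardedBytes / totalBytes; GC
// only rewrites a file when the sampled ratio crosses the threshold.
func shouldRewrite(s sampleResult, threshold float64) bool {
	if s.TotalBytes == 0 {
		return false // empty sample: nothing to decide on
	}
	return float64(s.DiscardedBytes)/float64(s.TotalBytes) >= threshold
}
```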
RocksDB’s blob GC leans on compaction iterators to discover obsolete blobs. NoKV, like Badger, leverages flush/compaction discard stats so GC does not need to rescan SSTs.
7. Recovery Semantics
- `DB.Open` restores the manifest and fetches the last persisted head pointer.
- `valueLog.open` launches `flushDiscardStats` and iterates every vlog file via `valueLog.replayLog`. Files marked invalid in the manifest are removed; valid ones are registered in the manager's file map.
- `valueLog.replayLog` streams entries to validate checksums and trims torn tails (see the tail-trimming sketch below); it does not reapply data into the LSM. WAL replay remains the sole source of committed state. `Manager.VerifyDir` trims torn records so replay never sees corrupt payloads.
- After validation, `valueLog.populateDiscardStats` rehydrates discard counters from the persisted JSON entry if present.
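The tail-trimming step can be pictured as the sketch below. The `iterate` callback stands in for the real `kv.EntryIterator`, and 20 is `kv.ValueLogHeaderSize` from section 3.

```go
package recovery

import (
	"errors"
	"io"
	"os"
)

// sanitizeTail walks records from the fixed 20-byte segment header and
// truncates at the first record that fails to decode or checksum.
func sanitizeTail(seg *os.File, iterate func(at int64) (next int64, err error)) error {
	off := int64(20) // kv.ValueLogHeaderSize
	for {
		next, err := iterate(off)
		if errors.Is(err, io.EOF) {
			return nil // clean tail: every record verified
		}
		if err != nil {
			return seg.Truncate(off) // torn or corrupt record: drop the tail
		}
		off = next
	}
}
```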
The flow mirrors Badger’s vlog scanning but keeps state reconstruction anchored on WAL + manifest checkpoints, similar to RocksDB’s reliance on MANIFEST to mark blob files live or obsolete.
8. Observability & CLI
- Metrics in `stats.go` report segment counts, pending deletions, discard queue depth, and GC head pointer via `expvar` (illustrated below).
- `nokv vlog --workdir <dir>` loads a manager in read-only mode and prints the current head plus per-file status (valid, GC candidate). It invokes `vlog.VerifyDir` before describing segments.
- Recovery traces controlled by `RECOVERY_TRACE_METRICS` log every head movement and file removal, aiding pressure testing of GC edge cases. For ad-hoc diagnostics, enable `Options.ValueLogVerbose` to emit replay/GC messages to stdout.
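Hypothetical `expvar` wiring in the spirit of `stats.go`; of the metric names below, only `NoKV.ValueLog.HeadUpdates` is taken from this document, the rest are illustrative.

```go
package stats

import "expvar"

var (
	segments    = expvar.NewInt("NoKV.ValueLog.Segments")    // illustrative name
	pendingDels = expvar.NewInt("NoKV.ValueLog.PendingDels") // illustrative name
	headUpdates = expvar.NewInt("NoKV.ValueLog.HeadUpdates") // counter named above
)

// onHeadUpdate mirrors valueLog.updateHead bumping the counter after GC
// rotates to a new head segment.
func onHeadUpdate() { headUpdates.Add(1) }
```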
9. Quick Comparison
| Capability | RocksDB BlobDB | BadgerDB | NoKV |
|---|---|---|---|
| Head tracking | In MANIFEST (blob log number + offset) | Internal to vlog directory | Manifest entry via `EditValueLogHead` |
| GC trigger | Compaction sampling, blob garbage score | Discard stats from LSM tables | Discard stats flushed through `lfDiscardStats` |
| Failure recovery | Blob DB and WAL coordinate two-phase commits | Replays value log then LSM | Rewind-on-error + manifest-backed deletes |
| Read path | Separate blob cache | Direct read + checksum | `Manager.Read` with copy + per-file lock |
By anchoring the vlog state in the manifest and exposing rewind/verify primitives, NoKV maintains the determinism of RocksDB while keeping Badger’s simple sequential layout.
10. Further Reading
- `docs/recovery.md` – failure matrix covering append crashes, GC interruptions, and manifest rewrites.
- `docs/cache.md` – how vlog-backed entries interact with the block cache.
- `docs/stats.md` – metric names surfaced for monitoring.