NoKV Documentation
This book collects the project docs under docs/ and makes them readable via
mdBook + GitHub Pages. Use the table of contents on the left to navigate.
Notes:
- The content here mirrors the files in docs/.
- Personal notes can be added in notes.md.
NoKV Architecture Overview
NoKV delivers a hybrid storage engine that can operate as a standalone embedded KV store or as a TinyKv-compatible distributed service. This document captures the key building blocks, how they interact, and the execution flow from client to disk.
1. High-Level Layout
┌─────────────────────────┐ TinyKv gRPC ┌─────────────────────────┐
│ raftstore Service │◀──────────────▶ │ raftstore/client │
└───────────┬─────────────┘ │ (Get / Scan / Mutate) │
│ └─────────────────────────┘
│ ReadCommand / ProposeCommand
▼
┌─────────────────────────┐
│ store.Store / peer.Peer │ ← multi-Raft region lifecycle
│ ├ Manifest snapshot │
│ ├ Router / RegionHooks │
│ └ transport (gRPC) │
└───────────┬─────────────┘
│ Apply via kv.Apply
▼
┌─────────────────────────┐
│ kv.Apply + percolator │
│ ├ Get / Scan │
│ ├ Prewrite / Commit │
│ └ Latch manager │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Embedded NoKV core │
│ ├ WAL Manager │
│ ├ MemTable / Flush │
│ ├ ValueLog + GC │
│ └ Manifest / Stats │
└─────────────────────────┘
- Embedded mode uses NoKV.Open directly: WAL→MemTable→SST durability, ValueLog separation, MVCC semantics, rich stats.
- Distributed mode layers raftstore on top: multi-Raft regions reuse the same WAL/Manifest, expose metrics, and serve TinyKv RPCs.
- Clients obtain leader-aware routing, automatic NotLeader/EpochNotMatch retries, and two-phase commit helpers.
2. Embedded Engine
2.1 WAL & MemTable
- wal.Manager appends [len|payload|crc] records, rotates segments, and replays logs on crash.
- MemTable accumulates writes until full, then enters the flush queue; flush.Manager runs Prepare → Build → Install → Release, logs edits, and releases WAL segments.
- Writes are handled by a single commit worker that performs value-log append first, then WAL/memtable apply, keeping durability ordering simple and consistent.
2.2 ValueLog
- Large values are written to the ValueLog before the WAL append; the resulting ValuePtr is stored in WAL/LSM so replay can recover.
- vlog.Manager tracks the active head and uses flush discard stats to trigger GC; the manifest records new heads and removed segments.
2.3 Manifest
- manifest.Manager stores SST metadata, WAL checkpoints, ValueLog metadata, and (importantly) Region descriptors used by raftstore.
- CURRENT provides crash-safe pointer updates; Region state is replicated through manifest edits.
2.4 LSM Compaction & Ingest Buffer
- compact.Manager drives compaction cycles; lsm.levelManager supplies table metadata and executes the plan.
- Planning is split: compact.PlanFor* selects table IDs + key ranges, then LSM resolves IDs back to tables and runs the merge.
- compact.State guards overlapping key ranges and tracks in-flight table IDs.
- Ingest shard selection is policy-driven in compact (PickShardOrder / PickShardByBacklog) while the ingest buffer remains in lsm.
flowchart TD
Manager["compact.Manager"] --> LSM["lsm.levelManager"]
LSM -->|TableMeta snapshot| Planner["compact.PlanFor*"]
Planner --> Plan["compact.Plan (fid+range)"]
Plan -->|resolvePlanLocked| Exec["LSM executor"]
Exec --> State["compact.State guard"]
Exec --> Build["subcompact/build SST"]
Build --> Manifest["manifest edits"]
L0["L0 tables"] -->|moveToIngest| Ingest["ingest buffer shards"]
Ingest -->|ingest-only| Main["Main tables"]
Ingest -->|ingest-merge| Ingest
2.5 MVCC
- txn.go exposes MVCC transactions with timestamps from oracle.
- The percolator package implements Prewrite/Commit/ResolveLock/CheckTxnStatus; kv.Apply simply dispatches Raft commands to these helpers.
- Watermarks (utils.WaterMark) gate read snapshots and commit visibility. They are synchronous (no goroutine/channel) and advance with a single mutex + atomics instead of select/cond waits (see the sketch below).
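A minimal sketch of this mutex-plus-atomics watermark pattern; the waterMark type and its methods are illustrative names, not the actual utils.WaterMark API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// waterMark is a synchronous watermark: Begin/Done run under one mutex, while
// the published "done until" value is an atomic so snapshot readers never block.
type waterMark struct {
	mu        sync.Mutex
	pending   map[uint64]int // begun-but-unfinished count per timestamp
	lastBegun uint64
	doneUntil atomic.Uint64 // every ts <= doneUntil is fully finished
}

func newWaterMark() *waterMark { return &waterMark{pending: map[uint64]int{}} }

func (w *waterMark) Begin(ts uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[ts]++
	if ts > w.lastBegun {
		w.lastBegun = ts
	}
}

func (w *waterMark) Done(ts uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pending[ts]--; w.pending[ts] <= 0 {
		delete(w.pending, ts)
	}
	// Advance past every timestamp that is either finished or was never begun.
	next := w.doneUntil.Load()
	for next < w.lastBegun && w.pending[next+1] == 0 {
		next++
	}
	w.doneUntil.Store(next)
}

// DoneUntil is the read-snapshot gate: a reader may safely observe every
// commit with commitTs <= DoneUntil().
func (w *waterMark) DoneUntil() uint64 { return w.doneUntil.Load() }

func main() {
	w := newWaterMark()
	w.Begin(1)
	w.Begin(2)
	w.Done(2)                  // ts 1 is still pending, so the mark cannot move
	fmt.Println(w.DoneUntil()) // 0
	w.Done(1)
	fmt.Println(w.DoneUntil()) // 2
}
```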
2.6 Write Pipeline & Backpressure
- Writes enqueue into a commit queue (db_write.go) where requests are coalesced into batches before a commit worker drains them (see the sketch below).
- The commit worker always writes the value log first (when needed), then applies WAL/LSM updates; SyncWrites adds a WAL fsync step.
- Batch sizing adapts to backlog (WriteBatchMaxCount/Size, WriteBatchWait), and hot-key pressure can expand batch limits temporarily to drain spikes.
- Backpressure is enforced in two places: LSM throttling toggles db.blockWrites when L0 backlog grows, and HotRing can reject hot keys via WriteHotKeyLimit.
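A small, self-contained sketch of the coalescing commit queue described above; the request/commitQueue types and the applyBatch placeholder are hypothetical stand-ins for the real db_write.go pipeline.

```go
package main

import (
	"fmt"
	"sync"
)

// request is one write waiting to be committed.
type request struct {
	key, val []byte
	err      error
	wg       sync.WaitGroup
}

// commitQueue coalesces requests into batches that a single worker drains,
// preserving the ordering above: value log first, then WAL/memtable.
type commitQueue struct {
	reqCh    chan *request
	maxBatch int
}

func newCommitQueue(maxBatch int) *commitQueue {
	q := &commitQueue{reqCh: make(chan *request, 1024), maxBatch: maxBatch}
	go q.worker()
	return q
}

func (q *commitQueue) Submit(key, val []byte) error {
	r := &request{key: key, val: val}
	r.wg.Add(1)
	q.reqCh <- r
	r.wg.Wait() // block until the worker has applied the batch
	return r.err
}

func (q *commitQueue) worker() {
	for first := range q.reqCh {
		batch := []*request{first}
		// Opportunistically drain whatever else is queued, up to maxBatch.
	drain:
		for len(batch) < q.maxBatch {
			select {
			case r := <-q.reqCh:
				batch = append(batch, r)
			default:
				break drain
			}
		}
		err := q.applyBatch(batch)
		for _, r := range batch {
			r.err = err
			r.wg.Done()
		}
	}
}

func (q *commitQueue) applyBatch(batch []*request) error {
	// 1. append large values to the value log (placeholder)
	// 2. append the batch to the WAL (placeholder)
	// 3. apply to the memtable (placeholder)
	fmt.Printf("committed batch of %d requests\n", len(batch))
	return nil
}

func main() {
	q := newCommitQueue(128)
	_ = q.Submit([]byte("k"), []byte("v"))
}
```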
3. Replication Layer (raftstore)
| Package | Responsibility |
|---|---|
store | Region catalog, router, RegionMetrics, Region hooks, manifest integration, helpers such as StartPeer / SplitRegion. |
peer | Wraps etcd/raft RawNode, handles Ready pipeline, snapshot resend queue, backlog instrumentation. |
engine | WALStorage/DiskStorage/MemoryStorage, reusing the DB’s WAL while keeping manifest metadata in sync. |
transport | gRPC transport for Raft Step messages, connection management, retries/blocks/TLS. Also acts as the host for TinyKv RPC. |
kv | TinyKv RPC handler plus kv.Apply bridging Raft commands to MVCC logic. |
server | ServerConfig + New combine DB, Store, transport, and TinyKv service into a reusable node instance. |
3.1 Bootstrap Sequence
- raftstore.NewServer wires DB, store configuration (StoreID, hooks, scheduler), Raft config, and transport address. It registers TinyKv RPC on the shared gRPC server and sets transport.SetHandler(store.Step).
- The CLI (nokv serve) or application enumerates Manifest.RegionSnapshot() and calls Store.StartPeer for every Region containing the local store: peer.Config includes Raft params, transport, kv.NewEntryApplier, WAL/Manifest handles, and Region metadata.
- StartPeer performs router registration, regionManager bookkeeping, an optional Peer.Bootstrap with the initial peer list, and the leader campaign.
- Peers from other stores can be configured through transport.SetPeer(storeID, addr), allowing dynamic updates from a scheduler.
3.2 Command Paths
- ReadCommand (KvGet / KvScan): validate Region & leader, flush pending Ready, then run commandApplier (i.e. kv.Apply in read mode) to fetch data directly from the DB. This yields leader-strong reads without a Raft round trip.
- ProposeCommand (write): encode the request, push it through the Router to the leader peer, replicate via Raft, and apply in kv.Apply, which maps to MVCC operations.
3.3 Transport
- The gRPC server handles Step RPCs and TinyKv RPCs on the same endpoint; peers are registered via SetPeer.
- Retry policies (WithRetry) and TLS credentials are configurable. Tests cover partitions, blocked peers, and slow followers.
4. TinyKv Service
raftstore/kv/service.go exposes pb.TinyKv RPCs:
| RPC | Execution | Result |
|---|---|---|
KvGet | store.ReadCommand → kv.Apply GET | pb.GetResponse / RegionError |
KvScan | store.ReadCommand → kv.Apply SCAN | pb.ScanResponse / RegionError |
KvPrewrite | store.ProposeCommand → percolator.Prewrite | pb.PrewriteResponse |
KvCommit | store.ProposeCommand → percolator.Commit | pb.CommitResponse |
KvResolveLock | percolator.ResolveLock | pb.ResolveLockResponse |
KvCheckTxnStatus | percolator.CheckTxnStatus | pb.CheckTxnStatusResponse |
nokv serve is the CLI entry point—open the DB, construct raftstore.Server, register peers, start local Raft peers, and display a manifest summary (Regions, key ranges, peers). scripts/run_local_cluster.sh builds the CLI, writes a minimal region manifest, launches multiple nokv serve processes on localhost, and handles cleanup on Ctrl+C.
5. Client Workflow
raftstore/client offers a leader-aware client with retry logic and convenient helpers:
- Initialization: provide []StoreEndpoint + []RegionConfig describing region boundaries and known leaders.
- Reads: Get and Scan pick the leader store for a key range, issue TinyKv RPCs, and retry on NotLeader/EpochNotMatch.
- Writes: Mutate bundles operations per region and drives Prewrite/Commit (primary first, secondaries after); Put and Delete are convenience wrappers using the same 2PC path.
- Timestamps: clients must supply startVersion/commitVersion. For distributed demos, reuse the TSO sample under scripts/tso to obtain globally increasing values before calling TwoPhaseCommit.
- Bootstrap helpers: scripts/run_local_cluster.sh --config raft_config.example.json builds the binaries, seeds manifests via nokv-config manifest, launches the stores declared in the config, and starts the HTTP TSO allocator when the tso block is present.
Example (two regions)
- Regions [a,m) and [m,+∞), each led by a different store.
- Mutate(ctx, primary="alfa", mutations, startTs, commitTs, ttl) prewrites and commits across the relevant regions.
- Get/Scan retry automatically if the leader changes (see the retry sketch below).
- See raftstore/server/server_client_integration_test.go for a full end-to-end example using real raftstore.Server instances.
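A rough sketch of the leader-aware retry loop the client performs; the router type, error values, and getOnce helper below are hypothetical stand-ins, not the real raftstore/client internals, which decode pb.RegionError from the RPC response.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical routing errors standing in for NotLeader / EpochNotMatch.
var (
	errNotLeader     = errors.New("not leader")
	errEpochNotMatch = errors.New("epoch not match")
)

type router struct {
	leader map[string]string // region identifier -> leader store address
}

// getOnce stands in for one TinyKv KvGet RPC against a specific store.
func getOnce(ctx context.Context, store string, key []byte) ([]byte, error) {
	fmt.Printf("KvGet %q via %s\n", key, store)
	return []byte("value"), nil
}

// Get retries on routing errors: refresh the leader hint on NotLeader and the
// region descriptor on EpochNotMatch, then try again up to maxRetries.
func (r *router) Get(ctx context.Context, region string, key []byte, maxRetries int) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		val, err := getOnce(ctx, r.leader[region], key)
		switch {
		case err == nil:
			return val, nil
		case errors.Is(err, errNotLeader):
			// In the real client the error carries the new leader's store ID.
			lastErr = err
		case errors.Is(err, errEpochNotMatch):
			// Re-fetch region boundaries/epoch before retrying.
			lastErr = err
		default:
			return nil, err
		}
	}
	return nil, lastErr
}

func main() {
	r := &router{leader: map[string]string{"a": "127.0.0.1:20170"}}
	v, err := r.Get(context.Background(), "a", []byte("alfa"), 8)
	fmt.Println(string(v), err)
}
```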
6. Failure Handling
- Manifest edits capture Region metadata, WAL checkpoints, and ValueLog pointers. Restart simply reads CURRENT and replays edits.
- WAL replay reconstructs memtables and Raft groups; ValueLog recovery trims partial records.
- Stats.StartStats resumes metrics sampling immediately after restart, making it easy to verify recovery correctness via nokv stats.
7. Observability & Tooling
- StatsSnapshot publishes flush/compaction/WAL/VLog/txn/region metrics. nokv stats and the expvar endpoint expose the same data.
- nokv regions inspects Manifest-backed Region metadata.
- nokv serve advertises Region samples on startup (ID, key range, peers) for quick verification.
- Scripts:
  - scripts/run_local_cluster.sh – launch a multi-node TinyKv cluster locally.
  - scripts/recovery_scenarios.sh – crash-recovery test harness.
  - scripts/transport_chaos.sh – inject network faults and observe transport metrics.
8. When to Use NoKV
- Embedded: call NoKV.Open and use the MVCC store locally.
- Distributed: deploy nokv serve nodes and use raftstore/client (or any TinyKv gRPC client) to perform reads, scans, and 2PC writes.
- Observability-first: inspection via CLI or expvar is built-in; Region, WAL, Flush, and Raft metrics are accessible without extra instrumentation.
See also docs/raftstore.md for deeper internals and docs/testing.md for coverage details.
Configuration & Options
NoKV exposes two configuration surfaces:
- Runtime options for the embedded engine (Options in options.go).
- Cluster topology for distributed mode (raft_config.example.json via config.LoadFile / Validate).
1. Runtime Options (Embedded Engine)
NoKV.NewDefaultOptions() returns a tuned baseline. Override fields before
calling NoKV.Open(opt).
Key option groups (see options.go for the full list):
- Paths & durability: WorkDir, SyncWrites, ManifestSync, ManifestRewriteThreshold
- Write pipeline: WriteBatchMaxCount, WriteBatchMaxSize, WriteBatchWait, CommitPipelineDepth, CommitApplyConcurrency
- Value log: ValueThreshold, ValueLogFileSize, ValueLogMaxEntries, ValueLogGCInterval, ValueLogGCDiscardRatio, ValueLogGCSampleSizeRatio, ValueLogGCSampleCountRatio, ValueLogGCSampleFromHead
- LSM & compaction: MemTableSize, MemTableEngine, SSTableMaxSz, NumCompactors, NumLevelZeroTables, IngestCompactBatchSize, IngestBacklogMergeScore, CompactionValueWeight, CompactionValueAlertThreshold
- Caches: BlockCacheSize, BloomCacheSize
- Hot key throttling: WriteHotKeyLimit, HotWriteBurstThreshold, HotWriteBatchMultiplier, HotRingEnabled, HotRingTopK, decay/window settings
- WAL watchdog: EnableWALWatchdog, WALAutoGCInterval, WALAutoGCMinRemovable, WALAutoGCMaxBatch, WALTypedRecordWarnRatio, WALTypedRecordWarnSegments
- Raft lag warnings (stats only): RaftLagWarnSegments
Example:
opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./data"
opt.SyncWrites = true
opt.ValueThreshold = 1024
opt.WriteBatchMaxCount = 128
db := NoKV.Open(opt)
defer db.Close()
2. Raft Topology File
raft_config.example.json is the single source of truth for distributed
topology. It is consumed by scripts, cmd/nokv-redis, and the config package.
Minimal shape:
{
"max_retries": 8,
"tso": { "listen_addr": "127.0.0.1:9494", "advertise_url": "http://127.0.0.1:9494" },
"stores": [
{ "store_id": 1, "listen_addr": "127.0.0.1:20170", "addr": "127.0.0.1:20170" }
],
"regions": [
{
"id": 1,
"start_key": "-",
"end_key": "-",
"epoch": { "version": 1, "conf_version": 1 },
"peers": [{ "store_id": 1, "peer_id": 101 }],
"leader_store_id": 1
}
]
}
Notes:
- start_key/end_key accept plain strings, hex:<bytes>, or base64. Use "-" or empty for unbounded ranges.
- stores define both host and docker addresses for local runs vs containers.
- leader_store_id is optional; clients use it for initial routing hints.
Programmatic loading:
cfg, _ := config.LoadFile("raft_config.example.json")
if err := cfg.Validate(); err != nil { /* handle */ }
Related tools:
- scripts/run_local_cluster.sh --config raft_config.example.json
- go run ./cmd/nokv-redis --raft-config raft_config.example.json
CLI (cmd/nokv) Reference
The nokv command provides operational visibility similar to RocksDB’s ldb and Badger’s badger CLI, but emits JSON to integrate easily with scripts and CI pipelines.
Installation
go install ./cmd/nokv
Use GOBIN if you prefer a custom binary directory.
Shared Flags
- --workdir <path> – location of the NoKV database (must contain CURRENT).
- --json – emit structured JSON (default is human-readable tables).
- --expvar <url> – for the stats command, pull metrics from a running process exposing expvar.
- --no-region-metrics – for stats offline mode; skip attaching RegionMetrics and report manifest-only figures.
Subcommands
nokv stats
- Reads StatsSnapshot either offline (--workdir) or via HTTP (--expvar).
- Output fields include:
  - flush_queue_length, flush_wait_ms, flush_build_ms
  - compaction_backlog, wal_active_segment, wal_segments_removed
  - vlog_head, vlog_segments, vlog_pending_deletes, vlog_discard_queue
  - txns_active, txns_committed, txns_conflicts
  - region_total (plus region_new, region_running, region_removing, region_tombstone, region_other)
  - hot_keys (Top-N hits captured by hotring)
- Example:
nokv stats --workdir ./testdata/db --json | jq '.flush_queue_length'
nokv manifest
- Parses the manifest using manifest.Manager.Version().
- Reports per-level file counts, smallest/largest keys, WAL checkpoint, and ValueLog metadata.
- Helpful for verifying flush/compaction results and ensuring manifest rewrites succeeded.
nokv vlog
- Lists vlog segments with status flags (active, candidate_for_gc, deleted).
- Shows head file/offset and pending GC actions.
- Use after running GC or recovery to confirm stale segments are purged.
Integration Tips
- Combine with RECOVERY_TRACE_METRICS=1 to cross-check logs: run tests, then inspect CLI output to ensure metrics match expectations.
- In CI, capture JSON output and diff against golden files to detect regressions (see cmd/nokv/main_test.go).
- When comparing against RocksDB/Badger, treat nokv manifest + nokv vlog as equivalents to ldb manifest_dump and Badger's badger inspect vlog commands.
For architecture context, see architecture.md and the module deep dives.
nokv regions – Dumps the manifest-backed Region catalog (ID/state/key range/peers). Supports --json for automation and complements the Region metrics shown in nokv stats.
Memtable Design & Lifecycle
NoKV’s write path mirrors RocksDB: every write lands in the WAL and an in-memory memtable backed by a selectable in-memory index (skiplist or ART). The implementation lives in lsm/memtable.go and ties directly into the flush manager (lsm/flush).
1. Structure
type memTable struct {
lsm *LSM
segmentID uint32 // WAL segment backing this memtable
index memIndex
maxVersion uint64
walSize int64
}
The memtable index is an interface that can be backed by either a skiplist or ART:
type memIndex interface {
Add(*kv.Entry)
Search([]byte) kv.ValueStruct
NewIterator(*utils.Options) utils.Iterator
MemSize() int64
IncrRef()
DecrRef()
}
- Memtable engine – Options.MemTableEngine selects skiplist (default) or art via newMemIndex. Skiplist favors simpler writes; ART favors tighter memory and ordered scans.
- Arena sizing – utils.NewSkiplist uses arenaSizeFor; utils.NewART uses arenaSizeForART to reserve more space for variable node payloads and prefix spills.
- WAL coupling – every Set uses kv.EncodeEntry to materialise the payload to the active WAL segment before inserting into the chosen index. walSize tracks how much of the segment is consumed so flush can release it later.
- Segment ID – LSM.NewMemtable atomically increments levels.maxFID, switches the WAL to a new segment (wal.Manager.SwitchSegment), and tags the memtable with that FID. This matches RocksDB's logfile_number field.
- ART specifics – ART stores prefix-compressed inner nodes (Node4/16/48/256), uses optimistic version checks for reads with localized locks for writes, and its iterators walk the tree in key order.
2. Lifecycle
sequenceDiagram
participant WAL
participant MT as MemTable
participant Flush
participant Manifest
WAL->>MT: Append+Set(entry)
MT->>Flush: freeze (Size() >= limit)
Flush->>Manifest: LogPointer + AddFile
Manifest-->>Flush: ack
Flush->>WAL: Release segments ≤ segmentID
- Active → Immutable – when mt.Size() crosses the threshold (Options.MemTableSize), the memtable is swapped out and pushed onto the flush queue. The new active memtable triggers another WAL segment switch.
- Flush – the flush manager drains immutable memtables, builds SSTables, logs manifest edits, and releases the WAL segment ID recorded in memTable.segmentID once the SST is durably installed.
- Recovery – LSM.recovery scans WAL files, reopens memtables per segment (most recent becomes active), and deletes segments ≤ the manifest's log pointer. Entries are replayed via wal.Manager.ReplaySegment into fresh indexes, rebuilding maxVersion for the oracle.
Badger follows the same pattern, while RocksDB often uses skiplist-backed arenas with reference counting—NoKV reuses Badger’s arena allocator for simplicity.
3. Read Semantics
- memTable.Get looks up the chosen index and returns a copy of the entry. MVCC versions stay encoded in the key suffix (KeyWithTs), so iterators naturally merge across memtables and SSTables.
- MemTable.IncrRef/DecrRef delegate to the index, allowing iterators to hold references while the flush manager processes immutable tables—mirroring RocksDB's MemTable::Ref/Unref lifecycle.
- WAL-backed values that exceed the value threshold are stored as pointers; the memtable stores the encoded pointer, and the transaction/iterator logic reads from the vlog on demand.
4. Integration with Other Subsystems
| Subsystem | Interaction |
|---|---|
| Transactions | Txn.commitAndSend writes entries into the active memtable after WAL append; pending writes bypass the memtable until commit so per-txn isolation is preserved. |
| Manifest | Flush completion logs EditLogPointer(segmentID) so restart can discard WAL files already persisted into SSTs. |
| Stats | Stats.Snapshot pulls FlushPending/Active/Queue counters via lsm.FlushMetrics, exposing how many immutables are waiting. |
| Value Log | lsm.flush emits discard stats keyed by segmentID, letting the value log GC know when entries become obsolete. |
5. Comparison
| Aspect | RocksDB | BadgerDB | NoKV |
|---|---|---|---|
| Data structure | Skiplist + arena | Skiplist + arena | Skiplist or ART + arena |
| WAL linkage | logfile_number per memtable | Segment ID stored in vlog entries | segmentID on memTable, logged via manifest |
| Recovery | Memtable replays from WAL, referencing MANIFEST | Replays WAL segments | Replays WAL segments, prunes ≤ manifest log pointer |
| Flush trigger | Size/entries/time | Size-based | Size-based with explicit queue metrics |
6. Operational Notes
- Tuning Options.MemTableSize affects WAL segment count and flush latency. Larger memtables reduce flush churn but increase crash recovery time.
- Monitor NoKV.Stats.Flush.* metrics to catch stalled immutables—an ever-growing queue often indicates slow SST builds or manifest contention.
- Because memtables carry WAL segment IDs, deleting WAL files manually can lead to recovery failures; always rely on the engine's manifest-driven cleanup.
See docs/flush.md for the end-to-end flush scheduler and [docs/architecture.md](architecture.md#3-end-to-end-write-flow) for where memtables sit in the write pipeline.
MemTable Flush Pipeline
NoKV’s flush subsystem translates immutable memtables into persisted SSTables while coordinating WAL checkpoints and ValueLog discard statistics. The code lives in lsm/flush and is tightly integrated with DB.doWrites and manifest.Manager.
1. Responsibilities
- Reliability – ensure immutables become SSTables atomically, and failures are recoverable.
- Coordination – release WAL segments only after manifest commits, and feed discard stats to ValueLog GC.
- Observability – expose queue depth, stage durations, and task counts through Stats.collect and the CLI.
Compared with RocksDB: the stage transitions mirror RocksDB’s flush job lifecycle (PickMemTable, WriteLevel0Table, InstallMemTable), while the discard stats channel is inspired by Badger’s integration with vlog GC.
2. Stage Machine
flowchart LR
Active[Active MemTable]
Immutable[Immutable MemTable]
FlushQ[flush.Manager queue]
Build[StageBuild]
Install[StageInstall]
Release[StageRelease]
Active -->|threshold reached| Immutable --> FlushQ
FlushQ --> Build --> Install --> Release --> Active
- StagePrepare – Manager.Submit assigns a task ID, records enqueue time, and bumps queue metrics.
- StageBuild – Manager.Next hands tasks to background workers. buildTable serialises data into a temporary .sst.tmp using lsm/builder.go.
- StageInstall – manifest edits (EditAddFile, EditLogPointer) are logged. Only on success is the temp file renamed and the WAL checkpoint advanced.
- StageRelease – metrics record release duration, discard stats are flushed to valueLog.lfDiscardStats, and wal.Manager.Remove drops obsolete segments.
Manager.Update transitions between stages and collects timing data (WaitNs, BuildNs, ReleaseNs). These appear as NoKV.Flush.Queue, NoKV.Flush.BuildAvgMs, etc., in CLI output.
3. Key Types
type Task struct {
ID uint64
SegmentID uint32
Stage Stage
Data any // memtable pointer, temp file info, etc.
Err error
}
type Manager struct {
queue []*Task
active map[uint64]*Task
cond *sync.Cond
// atomic metrics fields (pending, queueLen, waitNs...)
}
- Stage enumerates StagePrepare, StageBuild, StageInstall, StageRelease.
- Metrics aggregates pending/active counts and nanosecond accumulators; the CLI converts them to human-friendly durations.
- The queue uses condition variables to coordinate between background workers and producers; the design avoids busy waiting, unlike some RocksDB flush queues (see the sketch below).
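A condition-variable queue in this spirit can be sketched as follows; taskQueue and its methods are illustrative, not the actual flush.Manager.

```go
package main

import (
	"fmt"
	"sync"
)

// taskQueue lets producers Submit task IDs while background workers block in
// Next until work arrives, with no busy waiting.
type taskQueue struct {
	mu     sync.Mutex
	cond   *sync.Cond
	queue  []uint64
	closed bool
}

func newTaskQueue() *taskQueue {
	q := &taskQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *taskQueue) Submit(id uint64) {
	q.mu.Lock()
	q.queue = append(q.queue, id)
	q.mu.Unlock()
	q.cond.Signal() // wake one waiting worker
}

// Next blocks until a task is available or the queue is closed and drained.
func (q *taskQueue) Next() (uint64, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.queue) == 0 && !q.closed {
		q.cond.Wait()
	}
	if len(q.queue) == 0 {
		return 0, false
	}
	id := q.queue[0]
	q.queue = q.queue[1:]
	return id, true
}

func (q *taskQueue) Close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := newTaskQueue()
	done := make(chan struct{})
	go func() {
		for id, ok := q.Next(); ok; id, ok = q.Next() {
			fmt.Println("flush task", id)
		}
		close(done)
	}()
	q.Submit(1)
	q.Submit(2)
	q.Close()
	<-done
}
```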
4. Execution Path in Code
- DB.applyBatches detects when the active memtable is full and hands it to lsm.LSM.scheduleFlush, which calls flush.Manager.Submit.
- Background goroutines call Next to retrieve tasks; lsm.(*LSM).runFlushMemTable performs the build and install phases.
- lsm.(*LSM).installLevel0Table writes the manifest edit and renames the SST (atomic os.Rename, same as RocksDB's flush job).
- After install, valueLog.updateDiscardStats is called so GC can reclaim vlog entries belonging to dropped keys.
- Once release completes, wal.Manager.Remove evicts segments whose entries are fully represented in SSTs, matching RocksDB's LogFileManager::PurgeObsoleteLogs.
5. Recovery Considerations
- Before Install – temp files remain in tmp/. On restart, no manifest entry exists, so lsm.LSM.replayManifest ignores them and the memtable is rebuilt from WAL.
- After Install but before Release – the manifest records the SST while WAL segments may still exist. Recovery sees the edit, ensures the file exists, and release metrics resume from StageRelease.
- Metrics – because timing data is stored atomically in the manager, recovery resets counters but does not prevent the CLI from reporting backlog immediately after restart.
RocksDB uses flush job logs; NoKV reuses metrics and CLI output for similar visibility.
6. Observability & CLI
- StatsSnapshot.Flush.Queue – number of pending tasks.
- StatsSnapshot.Flush.WaitMs – average wait time before build.
- StatsSnapshot.Flush.BuildMs – average build duration.
- StatsSnapshot.Flush.Completed – cumulative tasks finished.
The CLI command nokv stats --workdir <dir> prints these metrics alongside compaction and transaction statistics, enabling operators to detect stalled flush workers or WAL backlog quickly.
7. Interplay with ValueLog GC
Flush completion sends discard stats via db.lsm.SetDiscardStatsCh(&(db.vlog.lfDiscardStats.flushChan)). ValueLog GC uses this feed to determine how much of each vlog segment is obsolete, similar to Badger’s discard ratio heuristic. Without flush-driven stats, vlog GC would have to rescan SSTables, so this channel is crucial for keeping GC cheap.
8. Testing Matrix
- lsm/flush/manager_test.go (implicit via lsm/lsm_test.go) validates stage transitions and metrics.
- db_recovery_test.go covers crash scenarios before/after install, ensuring WAL replay plus manifest reconciliation recovers gracefully.
- Future additions: inject write failures during StageBuild to test retry logic, analogous to RocksDB's simulated IO errors.
See the recovery plan and testing matrix for more context.
Compaction & Cache Strategy
NoKV’s compaction pipeline borrows the leveled‑LSM layout from RocksDB, but layers it with an ingest buffer, lightweight cache telemetry, and simple concurrency guards so the implementation stays approachable while still handling bursty workloads.
1. Overview
Compactions are orchestrated by compact.Manager with lsm.levelManager implementing the executor hooks. Each level owns two lists of tables:
- tables – the canonical sorted run for the level.
- ingest – a staging buffer that temporarily holds SSTables moved from the level above when there is not yet enough work (or bandwidth) to do a full merge.
The compaction manager periodically calls into the executor to build a list of compact.Priority entries. The priorities consider three signals:
- L0 table count – loosely capped by Options.NumLevelZeroTables.
- Level size vs target – computed by levelTargets(), which dynamically adjusts the “base” level depending on total data volume.
- Ingest buffer backlog – if a level's ingest shards have data, they receive elevated scores so staged tables are merged promptly.
The highest adjusted score is processed first. L0 compactions can either move tables into the ingest buffer of the base level (cheap re‑parenting) or compact directly into a lower level when the overlap warrants it.
Planning now happens via compact.Plan: LSM snapshots table metadata into compact.TableMeta, compact.PlanFor* selects table IDs + key ranges, and LSM resolves the plan back to *table before executing.
2. Ingest Buffer
moveToIngest (see lsm/executor.go) performs a metadata-only migration:
- Records a manifest.EditDeleteFile for the source level.
- Logs a new manifest.EditAddFile targeting the destination level.
- Removes the table from thisLevel.tables and appends it to nextLevel.ingest.
This keeps write amplification low when many small L0 tables arrive at once. Reads still see the newest data because levelHandler.searchIngestSST checks ingest before consulting tables.
Compaction tests (lsm/compaction_cache_test.go) now assert that after calling moveToIngest the table disappears from the source level and shows up in the ingest buffer.
3. Concurrency Guards
To prevent overlapping compactions:
- compact.State.CompareAndAdd tracks the key range of each in-flight compaction per level.
- Attempts to register a compaction whose ranges intersect an existing one are rejected.
- When a compaction finishes, compact.State.Delete removes the ranges and table IDs from the guard.
This mechanism is intentionally simple—just a mutex‐protected slice—yet effective in tests (TestCompactStatusGuards) that simulate back‑to‑back registrations on the same key range.
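An illustrative version of such a guard, assuming half-open key ranges; rangeGuard and its methods only mirror the idea of compact.State.CompareAndAdd/Delete and are not the real implementation.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// keyRange is a half-open [start, end) span of user keys.
type keyRange struct{ start, end []byte }

func (a keyRange) overlaps(b keyRange) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// rangeGuard admits a new compaction only if its key range does not intersect
// any in-flight range on the same level; state is a mutex-protected slice.
type rangeGuard struct {
	mu       sync.Mutex
	inFlight map[int][]keyRange // level -> active compaction ranges
}

func newRangeGuard() *rangeGuard { return &rangeGuard{inFlight: map[int][]keyRange{}} }

func (g *rangeGuard) CompareAndAdd(level int, r keyRange) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, cur := range g.inFlight[level] {
		if cur.overlaps(r) {
			return false // an overlapping compaction is already running
		}
	}
	g.inFlight[level] = append(g.inFlight[level], r)
	return true
}

func (g *rangeGuard) Delete(level int, r keyRange) {
	g.mu.Lock()
	defer g.mu.Unlock()
	ranges := g.inFlight[level]
	for i, cur := range ranges {
		if bytes.Equal(cur.start, r.start) && bytes.Equal(cur.end, r.end) {
			g.inFlight[level] = append(ranges[:i], ranges[i+1:]...)
			return
		}
	}
}

func main() {
	g := newRangeGuard()
	fmt.Println(g.CompareAndAdd(1, keyRange{[]byte("a"), []byte("m")})) // true
	fmt.Println(g.CompareAndAdd(1, keyRange{[]byte("f"), []byte("z")})) // false (overlap)
	g.Delete(1, keyRange{[]byte("a"), []byte("m")})
	fmt.Println(g.CompareAndAdd(1, keyRange{[]byte("f"), []byte("z")})) // true
}
```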
4. Cache Telemetry
NoKV’s cache is split into three parts (lsm/cache.go):
| Component | Purpose | Metrics hook |
|---|---|---|
| Block cache (hot) | LRU list capturing most recent hits (typically L0/L1). | cacheMetrics.recordBlock(level, hit) |
| Block cache (cold) | CLOCK cache for deeper levels, keeping the memory footprint bounded. | Same as above |
| Bloom cache | Stores decoded bloom filters to reduce disk touches. | recordBloom(hit) |
CacheMetrics() on DB surfaces hits/misses per layer, which is especially helpful when tuning ingest behaviour—if L0/L1 cache misses spike, the ingest buffer likely needs to be drained faster. TestCacheHotColdMetrics verifies that the hot and cold tiers tick the counters as expected.
5. Interaction with Value Log
Compaction informs value‑log GC via discard statistics:
- During subcompact, every entry merged out is inspected. If it stores a ValuePtr, the amount is added to the discard map.
- At the end of subcompaction, the accumulated discard map is pushed through setDiscardStatsCh.
- valueLog receives the stats and can safely rewrite or delete vlog segments with predominantly obsolete data.
This tight coupling keeps the value log from growing indefinitely after heavy overwrite workloads.
6. Testing Checklist
Relevant tests to keep compaction healthy:
- lsm/compaction_cache_test.go
  - TestCompactionMoveToIngest – ensures metadata migration works and the ingest buffer grows.
  - TestCacheHotColdMetrics – validates cache hit accounting.
  - TestCompactStatusGuards – checks overlap detection.
- lsm/lsm_test.go
  - TestCompact / TestHitStorage – end‑to‑end verification that data remains queryable across memtable flushes and compactions.
When adding new compaction heuristics or cache tiers, extend these tests (or introduce new ones) so the behaviour stays observable.
7. Practical Tips
- Tune Options.IngestCompactBatchSize when ingest queues build up; increasing it lets a single move cover more tables.
- Observe DB.CacheMetrics() and DB.CompactionStats() via the CLI (nokv stats) to decide whether you need more compaction workers or bigger caches.
- For workloads dominated by range scans, consider increasing Options.BlockCacheSize to keep more L0/L1 blocks in the user-space cache; cold data relies on the OS page cache.
- Keep an eye on NoKV.ValueLog.GcRuns and ValueLogHeadUpdates; if compactions are generating discard stats but the value log head doesn't move, GC thresholds may be too conservative.
With these mechanisms, NoKV stays resilient under bursty writes while keeping the code path small and discoverable—ideal for learning or embedding. Dive into the source files referenced above for deeper implementation details.
Ingest Buffer Architecture
The ingest buffer is a per-level staging area for SSTables—typically promoted from L0—designed to absorb bursts, reduce overlap, and unlock parallel compaction without touching the main level tables immediately. It combines fixed sharding, adaptive scheduling, and optional ingest-only merge to keep write amplification and contention low.
flowchart LR
L0["L0 SSTables"] -->|moveToIngest| Ingest["Ingest Buffer (sharded)"]
subgraph levelN["Level N"]
Ingest -->|ingest-only compact| MainTables["Main Tables"]
Ingest -->|ingest-merge| Ingest
end
Ingest -.read path merge.-> ClientReads["Reads/Iterators"]
Design Highlights
- Sharded by key prefix: ingest tables are routed into fixed shards (top bits of the first byte). Sharding cuts cross-range overlap and enables safe parallel drain.
- Snapshot-friendly reads: ingest tables are read under the level RLock, and iterators hold table refs so mmap-backed data stays valid without additional snapshots.
- Two ingest paths:
  - Ingest-only compaction: drain ingest → main level (or next level) with optional multi-shard parallelism guarded by compact.State.
  - Ingest-merge: compact ingest tables back into ingest (stay in-place) to drop superseded versions before promoting, reducing downstream write amplification.
- IngestMode enum: plans carry an IngestMode with IngestNone, IngestDrain, and IngestKeep. IngestDrain corresponds to ingest-only, while IngestKeep corresponds to ingest-merge.
- Adaptive scheduling:
  - Shard selection is driven by compact.PickShardOrder / compact.PickShardByBacklog using per-shard size, age, and density.
  - Shard parallelism scales with backlog score (based on shard size / target file size) bounded by IngestShardParallelism.
  - Batch size scales with shard backlog to drain faster under pressure.
  - Ingest-merge triggers when backlog score exceeds IngestBacklogMergeScore (default 2.0), with dynamic lowering under extreme backlog/age.
- Observability: expvar/stats expose ingest-only vs ingest-merge counts, duration, and tables processed, plus ingest size/value density per level/shard.
Configuration
- IngestShardParallelism: max shards to compact in parallel (default max(NumCompactors/2, 2), auto-scaled by backlog).
- IngestCompactBatchSize: base batch size per ingest compaction (auto-boosted by shard backlog).
- IngestBacklogMergeScore: backlog score threshold to trigger ingest-merge (default 2.0); a toy scoring sketch follows.
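A toy version of backlog-score-driven shard ordering, under the assumption that the score combines relative size and age; shardStat, backlogScore, and pickShardOrder are hypothetical names, not the compact package's API.

```go
package main

import (
	"fmt"
	"sort"
)

// shardStat is an illustrative per-shard view of the ingest buffer.
type shardStat struct {
	ID        int
	SizeBytes int64
	AgeSec    float64
}

// backlogScore: size relative to the target file size, nudged by age so stale
// shards eventually win even if they are not the largest.
func backlogScore(s shardStat, targetFileSize int64) float64 {
	score := float64(s.SizeBytes) / float64(targetFileSize)
	return score * (1 + s.AgeSec/60)
}

// pickShardOrder sorts shards by descending score; shards whose score crosses
// the merge threshold become ingest-merge candidates.
func pickShardOrder(shards []shardStat, targetFileSize int64, mergeScore float64) ([]int, []int) {
	sort.Slice(shards, func(i, j int) bool {
		return backlogScore(shards[i], targetFileSize) > backlogScore(shards[j], targetFileSize)
	})
	var order, merge []int
	for _, s := range shards {
		order = append(order, s.ID)
		if backlogScore(s, targetFileSize) >= mergeScore {
			merge = append(merge, s.ID)
		}
	}
	return order, merge
}

func main() {
	shards := []shardStat{
		{ID: 0, SizeBytes: 64 << 20, AgeSec: 10},
		{ID: 1, SizeBytes: 256 << 20, AgeSec: 120},
	}
	order, merge := pickShardOrder(shards, 64<<20, 2.0)
	fmt.Println("drain order:", order, "merge candidates:", merge)
}
```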
Benefits
- Lower write amplification: bursty L0 SSTables land in ingest first; ingest-merge prunes duplicates before full compaction.
- Reduced contention: sharding + compact.State allow parallel ingest drain with minimal overlap.
- Predictable reads: ingest is part of the read snapshot, so moving tables in/out does not change read semantics.
- Tunable and observable: knobs for parallelism and merge aggressiveness, with per-path metrics to guide tuning.
Future Work
- Deeper adaptive policies (IO/latency-aware), richer shard-level metrics, and more exhaustive parallel/restart testing under fault injection.
WAL Subsystem
NoKV’s write-ahead log mirrors RocksDB’s durability model and is implemented as a compact Go module similar to Badger’s journal. WAL appends happen alongside memtable writes (via lsm.Set), while values that are routed to the value log are written before the WAL so that replay always sees durable value pointers.
1. File Layout & Naming
- Location: ${Options.WorkDir}/wal/.
- Naming pattern: %05d.wal (e.g. 00001.wal).
- Rotation threshold: configurable via wal.Config.SegmentSize (defaults to 64 MiB, minimum 64 KiB).
- Verification: wal.VerifyDir ensures the directory exists prior to DB.Open.
On open, wal.Manager scans the directory, sorts segment IDs, and resumes the highest ID—exactly how RocksDB re-opens its MANIFEST and WAL sequence files.
2. Record Format
uint32 length (big-endian, includes type + payload)
uint8 type
[]byte payload
uint32 checksum (CRC32 Castagnoli over type + payload)
- Checksums use kv.CastagnoliCrcTable, the same polynomial used by RocksDB (Castagnoli). Record encoding/decoding lives in wal/record.go (see the sketch below).
- The type byte allows mixing LSM mutations with raft log/state/snapshot records in the same WAL segment.
- Appends are buffered by bufio.Writer so batches become single system calls.
- Replay stops cleanly at truncated tails; tests simulate torn writes by truncating the final bytes and verifying replay remains idempotent (wal/manager_test.go::TestReplayTruncatedTail).
3. Public API (Go)
mgr, _ := wal.Open(wal.Config{Dir: path})
infos, _ := mgr.Append(batchPayload)
_ = mgr.Sync()
_ = mgr.Rotate()
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
// reapply to memtable
return nil
})
Key behaviours:
- Append automatically calls ensureCapacity to decide when to rotate; it returns EntryInfo{SegmentID, Offset, Length} for each payload so higher layers can build value pointers or manifest checkpoints.
- Sync flushes the active file (used for Options.SyncWrites).
- Rotate forces a new segment (used after flush/compaction checkpoints, similar to RocksDB's LogFileManager::SwitchLog).
- Replay iterates segments in numeric order, forwarding each payload to the callback. Errors abort replay so recovery can surface corruption early.
- Metrics (wal.Manager.Metrics) reveal the active segment ID, total segments, and number of removed segments—these feed directly into StatsSnapshot and nokv stats output.
Compared with Badger: Badger keeps a single vlog for both data and durability. NoKV splits WAL (durability) from vlog (value separation), matching RocksDB’s separation of WAL and blob files.
4. Integration Points
| Call Site | Purpose |
|---|---|
lsm.memTable.set | Encodes each entry (kv.EncodeEntry) and appends to WAL before inserting into the skiplist. |
DB.commitWorker | Commit worker applies batched writes via writeToLSM, which flows into lsm.Set and thus WAL. |
DB.Set | Direct write path: calls lsm.Set, which appends to WAL and updates the memtable. |
manifest.Manager.LogEdit | Uses EntryInfo.SegmentID to persist the WAL checkpoint (EditLogPointer). This acts as the log number seen in RocksDB manifest entries. |
lsm/flush.Manager.Update | Once an SST is installed, WAL segments older than the checkpoint are released (wal.Manager.Remove). |
db.runRecoveryChecks | Ensures WAL directory invariants before manifest replay, similar to Badger’s directory bootstrap. |
5. Metrics & Observability
Stats.collect reads the manager metrics and exposes them as:
- NoKV.WAL.ActiveSegment
- NoKV.WAL.SegmentCount
- NoKV.WAL.RemovedSegments
The CLI command nokv stats --workdir <dir> prints these alongside backlog, making WAL health visible without manual inspection. In high-throughput scenarios the active segment ID mirrors RocksDB’s LOG number growth.
6. WAL Watchdog (Auto GC)
The WAL watchdog runs inside the DB process to keep WAL backlog in check and surface warnings when raft-typed records dominate the log. It:
- Samples WAL metrics + per-segment metrics and combines them with manifest.RaftPointerSnapshot() to compute removable segments.
- Removes up to WALAutoGCMaxBatch segments when at least WALAutoGCMinRemovable are eligible.
- Exposes counters (WALAutoGCRuns/Removed/LastUnix) and warning state (WALTypedRecordRatio/Warning/Reason) through StatsSnapshot.
Relevant options (see options.go for defaults):
- EnableWALWatchdog
- WALAutoGCInterval
- WALAutoGCMinRemovable
- WALAutoGCMaxBatch
- WALTypedRecordWarnRatio
- WALTypedRecordWarnSegments
7. Recovery Walkthrough
- wal.Open reopens the highest segment, leaving the file pointer at the end (switchSegmentLocked).
- manifest.Manager supplies the WAL checkpoint (segment + offset) while building the version. Replay skips entries up to this checkpoint, ensuring we only reapply writes not yet materialised in SSTables.
- wal.Manager.Replay (invoked by the LSM recovery path) rebuilds memtables from entries newer than the manifest checkpoint. Value-log recovery only validates/truncates segments and does not reapply data.
- If the final record is partially written, the CRC mismatch stops replay and the segment is truncated during recovery (exercised by the recovery tests), mimicking RocksDB's tolerant behaviour.
8. Operational Tips
- Configure SyncWrites for synchronous durability (the default is asynchronous, like RocksDB). For durability-sensitive deployments, enabling it matches Badger's SyncWrites behaviour.
- After large flushes, forcing Rotate keeps WAL files short, reducing replay time.
- Archived WAL segments can be copied alongside manifest files for hot-backup strategies—since the manifest contains the WAL log number, snapshots behave like RocksDB's Checkpoints.
9. Truncation Metadata
- raftstore/engine/wal_storage keeps a per-group index of [firstIndex, lastIndex] spans for each WAL record so it can map raft log indices back to the segment that stored them.
- When a log is truncated (either via snapshot or future compaction hooks), the manifest is updated via LogRaftTruncate with the index/term, segment ID (RaftLogPointer.SegmentIndex), and byte offset (RaftLogPointer.TruncatedOffset) that delimit the remaining WAL data.
- lsm/levelManager.canRemoveWalSegment blocks garbage collection whenever any raft group still references a segment through its truncation metadata, preventing slow followers from losing required WAL history while letting aggressively compacted groups release older segments earlier.
For broader context, read the architecture overview and flush pipeline documents.
Value Log (vlog) Design
NoKV keeps the LSM tree lean by separating large values into sequential value log (vlog) files. The module is split between
- vlog/manager.go – owns the open file set, rotation, and segment lifecycle helpers.
- vlog/io.go – append/read/iterate/verify/sample IO paths.
- vlog.go – integrates the manager with the DB write path, discard statistics, and garbage collection (GC).
The design echoes BadgerDB’s value log while remaining manifest-driven like RocksDB’s blob_db: vlog metadata (head pointer, pending deletions) is persisted inside the manifest so recovery can reconstruct the exact state without scanning the filesystem.
1. Layering (Engine View)
The value log is split into three layers so IO can stay reusable while DB policy lives in the core package:
- DB policy layer (vlog.go, vlog_gc.go) – integrates the manager with the DB write path, persists vlog metadata in the manifest, and drives GC scheduling based on discard stats.
- Value-log manager (vlog/) – owns segment lifecycle (open/rotate/remove), encodes/decodes entries, and exposes append/read/sample APIs without touching MVCC or LSM policy.
- File IO (file/) – mmap-backed LogFile primitives (open/close/truncate, read/write, read-only remap) shared by WAL/vlog/SST. Vlog currently uses LogFile directly instead of an intermediate store abstraction.
2. Directory Layout & Naming
<workdir>/
vlog/
00000.vlog
00001.vlog
...
- Files are named %05d.vlog and live under workdir/vlog/. Manager.populate discovers existing segments at open.
- Manager tracks the active file ID (activeID) and byte offset; Manager.Head exposes these so the manifest can checkpoint them (manifest.EditValueLogHead).
- Files created after a crash but never linked in the manifest are removed during valueLog.reconcileManifest.
3. Record Format
The vlog uses the shared encoding helper (kv.EncodeEntryTo), so entries written to the value log and the WAL are byte-identical.
+--------+----------+------+-------------+-----------+-------+
| KeyLen | ValueLen | Meta | ExpiresAt | Key bytes | Value |
+--------+----------+------+-------------+-----------+-------+
+ CRC32 (4 B)
- Header fields are varint-encoded (kv.EntryHeader).
- The first 20 bytes of every segment are reserved (kv.ValueLogHeaderSize) for future metadata; iteration always skips this fixed header.
- kv.EncodeEntry and the entry iterator (kv.EntryIterator) perform the layout work, and each append finishes with a CRC32 to detect torn writes.
- vlog.VerifyDir scans all segments with sanitizeValueLog to trim corrupted tails after crashes, mirroring RocksDB's blob_file::Sanitize. Badger performs a similar truncation pass at startup.
4. Manager API Surface
mgr, _ := vlog.Open(vlog.Config{Dir: "...", MaxSize: 1<<29})
ptr, _ := mgr.AppendEntry(entry)
ptrs, _ := mgr.AppendEntries(entries, writeMask)
val, unlock, _ := mgr.Read(ptr)
unlock() // release per-file lock
_ = mgr.Rewind(*ptr) // rollback partially written batch
_ = mgr.Remove(fid) // close + delete file
Key behaviours:
- Append + Rotate – Manager.AppendEntry encodes and appends into the active file. The reservation path handles rotation when the active segment would exceed MaxSize; manual rotation is rare.
- Crash recovery – Manager.Rewind truncates the active file and removes newer files when a write batch fails mid-flight. valueLog.write uses this to guarantee idempotent WAL/value log ordering.
- Safe reads – Manager.Read returns an mmap-backed slice plus an unlock callback. Active segments take a per-file RWMutex, while sealed segments use a pin/unpin path to avoid long-held locks; callers that need ownership should copy the bytes before releasing the lock.
- Verification – VerifyDir validates entire directories (used by CLI and recovery) by parsing headers and CRCs.
Compared with RocksDB’s blob manager the surface is intentionally small—NoKV treats the manager as an append-only log with rewind semantics, while RocksDB maintains index structures inside the blob file metadata.
5. Integration with DB Writes
sequenceDiagram
participant Commit as commitWorker
participant Mgr as vlog.Manager
participant WAL as wal.Manager
participant Mem as MemTable
Commit->>Mgr: AppendEntries(entries, writeMask)
Mgr-->>Commit: ValuePtr list
Commit->>WAL: Append(entries+ptrs)
Commit->>Mem: apply to skiplist
- valueLog.write builds a write mask for each batch, then delegates to Manager.AppendEntries. Entries staying in LSM (shouldWriteValueToLSM) receive zero-value pointers.
- Rotation is handled inside the manager when the reserved bytes would exceed MaxSize. The WAL append happens after the value log append so crash replay observes consistent pointers.
- Any error triggers Manager.Rewind back to the saved head pointer, removing new files and truncating partial bytes. vlog_test.go exercises both append- and rotate-failure paths.
- Txn.Commit and batched writes share the same pipeline: the commit worker always writes the value log first, then applies to WAL/memtable, keeping MVCC and durability ordering consistent.
Badger follows the same ordering (value log first, then write batch). RocksDB’s blob DB instead embeds blob references into the WAL entry before the blob file write, relying on two-phase commit between WAL and blob; NoKV avoids the extra coordination by reusing a single batching loop.
6. Discard Statistics & GC
flowchart LR
FlushMgr -- "obsolete ptrs" --> DiscardStats
DiscardStats -->|"batch json"| writeCh
valuePtr["valueLog.newValuePtr(lfDiscardStatsKey)"]
writeCh --> valuePtr
valueLog -- "GC trigger" --> Manager
- lfDiscardStats aggregates per-file discard counts from lsm.FlushTable completion (valueLog.lfDiscardStats.push inside lsm/flush). Once the in-memory counter crosses discardStatsFlushThreshold, it marshals the map into JSON and writes it back through the DB pipeline under the special key !NoKV!discard.
- valueLog.flushDiscardStats consumes those stats, ensuring they are persisted even across crashes. During recovery valueLog.populateDiscardStats replays the JSON payload to repopulate the in-memory map.
- GC uses discardRatio = discardedBytes/totalBytes derived from Manager.Sample, which applies windowed iteration based on configurable ratios. If a file exceeds the configured threshold, valueLog.doRunGC rewrites live entries into the current head (using Manager.Append) and then valueLog.rewrite schedules deletion edits in the manifest.
  - Sampling behaviour is controlled by Options.ValueLogGCSampleSizeRatio (default 0.10 of the file) and Options.ValueLogGCSampleCountRatio (default 1% of the configured entry limit). Setting either to <=0 keeps the default heuristics. Options.ValueLogGCSampleFromHead starts sampling from the beginning instead of a random window.
- Completed deletions are logged via lsm.LogValueLogDelete so the manifest can skip them during replay. When GC rotates to a new head, valueLog.updateHead records the pointer and bumps the NoKV.ValueLog.HeadUpdates counter.
RocksDB’s blob GC leans on compaction iterators to discover obsolete blobs. NoKV, like Badger, leverages flush/compaction discard stats so GC does not need to rescan SSTs.
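A minimal sketch of the discard-ratio decision, assuming per-segment byte counters fed by the discard stats; segmentStat and shouldGC are hypothetical names, not the vlog package API.

```go
package main

import "fmt"

// segmentStat is an illustrative per-vlog-file view built from discard stats.
type segmentStat struct {
	FileID         uint32
	TotalBytes     int64
	DiscardedBytes int64
}

// shouldGC applies the rule above: rewrite a segment once the discard ratio
// crosses the configured threshold (ValueLogGCDiscardRatio-style).
func shouldGC(s segmentStat, discardRatio float64) bool {
	if s.TotalBytes == 0 {
		return false
	}
	return float64(s.DiscardedBytes)/float64(s.TotalBytes) >= discardRatio
}

func main() {
	seg := segmentStat{FileID: 7, TotalBytes: 512 << 20, DiscardedBytes: 400 << 20}
	if shouldGC(seg, 0.5) {
		fmt.Printf("vlog %05d.vlog is a GC candidate\n", seg.FileID)
	}
}
```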
7. Recovery Semantics
- DB.Open restores the manifest and fetches the last persisted head pointer.
- valueLog.open launches flushDiscardStats and iterates every vlog file via valueLog.replayLog. Files marked invalid in the manifest are removed; valid ones are registered in the manager's file map.
- valueLog.replayLog streams entries to validate checksums and trims torn tails; it does not reapply data into the LSM. WAL replay remains the sole source of committed state.
- Manager.VerifyDir trims torn records so replay never sees corrupt payloads.
- After validation, valueLog.populateDiscardStats rehydrates discard counters from the persisted JSON entry if present.
The flow mirrors Badger’s vlog scanning but keeps state reconstruction anchored on WAL + manifest checkpoints, similar to RocksDB’s reliance on MANIFEST to mark blob files live or obsolete.
8. Observability & CLI
- Metrics in stats.go report segment counts, pending deletions, discard queue depth, and the GC head pointer via expvar.
- nokv vlog --workdir <dir> loads a manager in read-only mode and prints the current head plus file status (valid, gc candidate). It invokes vlog.VerifyDir before describing segments.
- Recovery traces controlled by RECOVERY_TRACE_METRICS log every head movement and file removal, aiding pressure testing of GC edge cases. For ad-hoc diagnostics, enable Options.ValueLogVerbose to emit replay/GC messages to stdout.
9. Quick Comparison
| Capability | RocksDB BlobDB | BadgerDB | NoKV |
|---|---|---|---|
| Head tracking | In MANIFEST (blob log number + offset) | Internal to vlog directory | Manifest entry via EditValueLogHead |
| GC trigger | Compaction sampling, blob garbage score | Discard stats from LSM tables | Discard stats flushed through lfDiscardStats |
| Failure recovery | Blob DB and WAL coordinate two-phase commits | Replays value log then LSM | Rewind-on-error + manifest-backed deletes |
| Read path | Separate blob cache | Direct read + checksum | Manager.Read with copy + per-file lock |
By anchoring the vlog state in the manifest and exposing rewind/verify primitives, NoKV maintains the determinism of RocksDB while keeping Badger’s simple sequential layout.
10. Further Reading
- docs/recovery.md – failure matrix covering append crashes, GC interruptions, and manifest rewrites.
- docs/cache.md – how vlog-backed entries interact with the block cache.
- docs/stats.md – metric names surfaced for monitoring.
Manifest & Version Management
The manifest keeps the source of truth for SST files, WAL checkpoints, and ValueLog heads. NoKV’s implementation (manifest/manager.go, manifest/codec.go, manifest/types.go) borrows RocksDB’s VersionEdit + CURRENT pattern while adding metadata required for value separation.
1. File Layout
WorkDir/
CURRENT # stores the active MANIFEST file name
MANIFEST-000001 # log of manifest edits
MANIFEST-000002 # newer file after rewrite
- CURRENT is atomically swapped via a CURRENT.tmp → CURRENT rename (sketched below).
- Each MANIFEST-* contains a series of binary edits prefixed by the magic string "NoKV" (encoding lives in manifest/codec.go).
- During manifest.Open, loadCurrent opens the file referenced by CURRENT; if missing, createNew bootstraps an empty manifest.
type EditType uint8
const (
EditAddFile EditType = iota
EditDeleteFile
EditLogPointer
EditValueLogHead
EditDeleteValueLog
EditUpdateValueLog
EditRaftPointer
EditRegion
)
Each edit serialises one logical action:
- EditAddFile / EditDeleteFile – manage SST metadata (FileMeta: level, fileID, size, key bounds, timestamps).
- EditLogPointer – persists the latest WAL segment + offset checkpoint, analogous to RocksDB's log_number and prev_log_number fields.
- EditValueLogHead – records the head pointer for vlog append, ensuring recovery resumes from the correct file/offset.
- EditDeleteValueLog – marks a vlog segment logically deleted (GC has reclaimed it).
- EditUpdateValueLog – updates metadata for an existing vlog file (used when GC rewrites a segment).
- EditRaftPointer – persists raft-group WAL progress (segment, offset, applied/truncated index & term, etc.).
- EditRegion – persists Region metadata (key range, epoch, peers, lifecycle state).
manifest.Manager.apply interprets each edit and updates the in-memory Version structure, which is consumed by LSM initialisation and value log recovery.
3. Version Structure
type Version struct {
Levels map[int][]FileMeta
LogSegment uint32
LogOffset uint64
ValueLogs map[uint32]ValueLogMeta
ValueLogHead ValueLogMeta
RaftPointers map[uint64]RaftLogPointer
Regions map[uint64]RegionMeta
}
- Levels mirrors the LSM tree levels; during recovery lsm.LSM loads files per level.
- LogSegment/LogOffset ensure WAL replay starts exactly where persistent state ended.
- ValueLogs holds metadata for every known vlog file; ValueLogHead caches the active head for quick access.
Compared with RocksDB: RocksDB’s manifest stores blob file metadata when BlobDB is enabled. NoKV integrates vlog metadata natively to avoid a separate blob manifest.
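A simplified apply loop over a Version-like structure, to make the replay step concrete; the types below are trimmed-down stand-ins for the real manifest types, not the actual manifest.Manager.apply code.

```go
package main

import "fmt"

type fileMeta struct {
	Level  int
	FileID uint64
}

type version struct {
	Levels     map[int][]fileMeta
	LogSegment uint32
	LogOffset  uint64
}

type editType uint8

const (
	editAddFile editType = iota
	editDeleteFile
	editLogPointer
)

type edit struct {
	Type       editType
	File       fileMeta
	LogSegment uint32
	LogOffset  uint64
}

// apply folds one edit into the in-memory version, the same way replay folds
// the whole manifest log into a Version at open time.
func (v *version) apply(e edit) {
	switch e.Type {
	case editAddFile:
		v.Levels[e.File.Level] = append(v.Levels[e.File.Level], e.File)
	case editDeleteFile:
		files := v.Levels[e.File.Level]
		for i, f := range files {
			if f.FileID == e.File.FileID {
				v.Levels[e.File.Level] = append(files[:i], files[i+1:]...)
				break
			}
		}
	case editLogPointer:
		v.LogSegment, v.LogOffset = e.LogSegment, e.LogOffset
	}
}

func main() {
	v := &version{Levels: map[int][]fileMeta{}}
	v.apply(edit{Type: editAddFile, File: fileMeta{Level: 0, FileID: 12}})
	v.apply(edit{Type: editLogPointer, LogSegment: 3, LogOffset: 4096})
	fmt.Printf("%+v\n", *v)
}
```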
4. Lifecycle
sequenceDiagram
participant DB
participant Manifest
participant CURRENT
DB->>Manifest: Open(dir)
Manifest->>CURRENT: read file name
Manifest->>Manifest: replay edits → Version
DB->>Manifest: LogEdit(EditAddFile+LogPointer)
Manifest->>Manifest: append edit
Manifest-->>DB: updated Version
Note over Manifest,CURRENT: On rewrite -> write tmp -> rename CURRENT
- Open/Rebuild – replay reads all edits, applying them sequentially (bufio.Reader ensures streaming). If any edit fails to decode, recovery aborts so operators can inspect the manifest, similar to RocksDB's strictness.
- LogEdit – obtains the mutex, appends the encoded edit, flushes, and updates the in-memory Version before returning.
- Rewrite – when the manifest grows beyond Options.ManifestRewriteThreshold, the manager writes a new MANIFEST-xxxxxx containing a full snapshot of the current Version, fsyncs it, updates CURRENT, and removes the old file. This mirrors RocksDB's max_manifest_file_size behavior while keeping recovery simple.
- Close – flushes and closes the underlying file handle; the version stays available for introspection via Manager.Version() (used by CLI).
5. Interaction with Other Modules
| Module | Manifest usage |
|---|---|
lsm | installLevel0Table logs EditAddFile + EditLogPointer to checkpoint WAL progress. Compaction deletes old files via EditDeleteFile. |
wal | Manifest’s log pointer tells WAL replay where to resume. |
vlog | valueLog.rewrite writes EditUpdateValueLog / EditDeleteValueLog after GC, ensuring stale segments are not reopened. |
CLI | nokv manifest reads manifest.Manager.Version() and prints levels, vlog head, and deletion status. |
Badger keeps a separate value.log directory without manifest-level bookkeeping; NoKV’s integrated manifest avoids scanning the filesystem during recovery.
6. Recovery Scenarios
- Missing SST file – if MANIFEST references 000123.sst but the file is absent, db_recovery_test.go::TestRecoveryCleansMissingSSTFromManifest verifies that recovery removes the edit, mimicking RocksDB's lost-table handling.
- ValueLog deletion – TestRecoveryRemovesStaleValueLogSegment ensures EditDeleteValueLog entries trigger file removal during recovery.
- Manifest rewrite crash – TestRecoveryManifestRewriteCrash simulates a crash after writing the new manifest but before updating CURRENT; recovery still points to the old manifest and resumes safely, exactly like RocksDB's two-phase rewrite.
- Stale WAL pointer – WAL replay respects LogSegment/Offset; tests cover truncated WALs to confirm idempotency.
7. CLI Output
nokv manifest --workdir <dir> --json prints:
- Level file counts and key ranges.
- wal_log_segment / wal_log_offset checkpoint.
- value_log_head metadata.
- List of vlog files with valid status (mirroring RocksDB's blob file dump).
This structured output enables automated validation in CI and ad-hoc audits.
8. Extensibility
- Column families – add a column family identifier to FileMeta and extend edits accordingly, as RocksDB does.
- Snapshots – persistent snapshots can be derived from manifest versions (keep a copy of the current Version and WAL pointer).
- Remote manifests – similar to RocksDB's remote compaction, storing manifests in object storage is straightforward because edits are append-only.
For end-to-end recovery context, see recovery.md and the architecture overview.
File Abstractions
The file package encapsulates direct file-system interaction for WAL, SST, and value-log files. It provides portable mmap helpers, allocation primitives, and log file wrappers.
1. Core Types
| Type | Purpose | Key Methods |
|---|---|---|
Options | Parameter bag for opening files (FID, path, size). | Used by WAL/vlog managers. |
CoreFile | Interface abstracting platform-specific operations. | NewReader, Bytes, Sync, Delete. |
MmapFile | Cross-platform mmap wrapper. | OpenMmapFile, AppendBuffer, Truncature, Sync. |
LogFile | Value-log specific helper built on MmapFile. | Open, Write, Read, DoneWriting, EncodeEntry. |
Darwin-specific builds live alongside (mmap_darwin.go, sstable_darwin.go) ensuring the package compiles on macOS without manual tuning.
2. Mmap Management
- OpenMmapFile opens or creates a file, optionally extending it to maxSz, then mmaps it. The returned MmapFile exposes Data []byte and the underlying *os.File handle.
- Writes grow the map on demand: AppendBuffer checks if the write would exceed the current mapping and calls Truncature to expand (doubling up to 1 GiB increments).
- Sync flushes dirty pages (mmap.Msync), while Delete unmaps, truncates, closes, and removes the file—used when dropping SSTs or value-log segments.
RocksDB relies on custom Env implementations for portability; NoKV keeps the logic in Go, relying on build tags for OS differences.
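A sketch of the doubling growth rule described above (double until a step would exceed 1 GiB, then grow linearly in 1 GiB steps); nextMmapSize is a hypothetical helper, not the file package API.

```go
package main

import "fmt"

// nextMmapSize returns how large the mapping should become so that "needed"
// bytes fit, doubling the current size until the increment hits 1 GiB.
func nextMmapSize(current, needed int64) int64 {
	const maxStep = 1 << 30 // 1 GiB
	size := current
	for size < needed {
		step := size
		if step > maxStep {
			step = maxStep
		}
		if step == 0 {
			step = 1 << 20 // start from 1 MiB for empty files
		}
		size += step
	}
	return size
}

func main() {
	fmt.Println(nextMmapSize(64<<20, 200<<20))   // doubles: 64 -> 128 -> 256 MiB
	fmt.Println(nextMmapSize(2<<30, 2<<30+4096)) // grows by a single 1 GiB step
}
```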
3. LogFile Semantics
LogFile wraps MmapFile to simplify value-log operations:
lf := &file.LogFile{}
_ = lf.Open(&file.Options{FID: 1, FileName: "00001.vlog", MaxSz: 1<<29})
ptr, _ := lf.EncodeEntry(entry, buf, offset)
_ = lf.Write(offset, buf.Bytes())
_ = lf.DoneWriting(nextOffset)
- Open mmaps the file and records the current size (guarded to < 4 GiB).
- Read validates offsets against both the mmap length and the tracked size, preventing partial reads when GC or drop operations shrink the file.
- EncodeEntry uses the shared kv.EntryHeader and CRC32 helpers to produce the exact on-disk layout consumed by vlog.Manager and wal.Manager.
- DoneWriting syncs, truncates to the provided offset, reinitialises the mmap, and keeps the file open in read-write mode—supporting subsequent appends.
- Rewind (via vlog.Manager.Rewind) leverages LogFile.Truncate and Init to roll back partial batches after errors.
4. SST Helpers
While SSTable builders/readers live under lsm/table.go, they rely on file helpers to map index/data blocks efficiently. The build tags (sstable_linux.go, sstable_darwin.go) provide OS-specific tuning for direct I/O hints or mmap flags.
5. Comparison
| Engine | Approach |
|---|---|
| RocksDB | C++ Env & random-access file wrappers. |
| Badger | y.File abstraction with mmap. |
| NoKV | Go-native mmap wrappers with explicit log helpers. |
By keeping all filesystem primitives in one package, NoKV ensures WAL, vlog, and SST layers share consistent behaviour (sync semantics, truncation rules) and simplifies testing (mmap_linux_test.go).
6. Operational Notes
- Value-log and WAL segments rely on DoneWriting/Truncate to seal files; avoid manipulating files externally or mmap metadata may desynchronise.
- LogFile.AddSize updates the cached size used by reads—critical when rewinding or rewriting segments.
- SyncDir (see mmap_linux.go) is invoked when new files are created to persist directory entries, similar to RocksDB’s Env::FsyncDir.
For more on how these primitives plug into higher layers, see docs/wal.md and docs/vlog.md.
Cache & Bloom Filters
NoKV’s LSM tier layers a multi-level block cache with bloom filter caching to accelerate lookups. The implementation is in lsm/cache.go.
1. Components
| Component | Purpose | Source |
|---|---|---|
cache.indexs + indexHot | Table index cache (fid → *pb.TableIndex) reused across reopen + small CLOCK hot tier fed by HotRing hits. | utils/cache |
blockCache | Ristretto-based block cache (L0/L1 only) with per-table direct slots; hot block tier (small CLOCK) keeps hotspot blocks resident. | lsm/cache.go |
bloomCache + hot | LRU cache of bloom filter bitsets per SST plus small CLOCK hot tier to protect frequent filters. | lsm/cache.go |
cacheMetrics | Atomic hit/miss counters for L0/L1 blocks and blooms. | lsm/cache.go#L30-L110 |
Badger uses a similar block cache split (Pinner/Cache) while RocksDB exposes block cache(s) via the BlockBasedTableOptions. NoKV keeps it Go-native and GC-friendly.
1.1 Index Cache & Handles
- SSTable metadata stays with the table struct, while decoded protobuf indexes are stored in cache.indexs. Lookups first hit the cache before falling back to disk.
- SST handles are reopened on demand for lower levels. L0/L1 tables keep their file descriptors pinned, while deeper levels close them once no iterator is using the table.
2. Block Cache Strategy
- User-space block cache (L0/L1, parsed blocks, Ristretto LFU-ish)
- Small hot tier (CLOCK) for hotspot blocks
- Deeper levels rely on OS page cache + mmap readahead
- Options.BlockCacheSize sets capacity in blocks (cost=1 per block). Entries keep parsed blocks (data slice + offsets/baseKey/checksum), so hits avoid re-parsing.
- Hot tier: requests marked hot (prefetch/hotspot reads) promote blocks into the small CLOCK hot set derived from the main capacity, making them harder to evict under long-tail traffic.
- Per-table direct slots (table.cacheSlots[idx]) give a lock-free fast path. Misses fall back to the shared Ristretto cache (approx LFU with admission).
- Evictions clear the table slot via OnEvict; the user-space cache only tracks L0/L1 blocks. Deeper levels depend on the OS page cache.
- Access patterns: getBlock also updates hit/miss metrics for L0/L1; deeper levels bypass the cache and do not affect metrics.
flowchart LR Read --> CheckHot CheckHot -->|hit| Return CheckHot -->|miss| LoadFromTable["LoadFromTable (mmap + OS page cache)"] LoadFromTable --> InsertHot InsertHot --> Return
By default only L0 and L1 blocks are cached (level > 1 short-circuits), reflecting the higher re-use for top levels.
3. Bloom Cache
- bloomCache stores the raw filter bitset (utils.Filter) per table ID. Entries are deep-copied (SafeCopy) to avoid sharing memory with mmaps.
- The main tier is LRU with a tiny CLOCK hot set to protect frequently hit filters from being washed out by scans.
- Capacity is controlled by Options.BloomCacheSize; the hot CLOCK tier auto-scales from a few dozen up to a few hundred entries.
- Bloom hits/misses are recorded via cacheMetrics.recordBloom, feeding into StatsSnapshot.BloomHitRate.
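As a rough illustration of how the two knobs fit together, the sketch below sets both fields on Options before opening the DB. Only BlockCacheSize/BloomCacheSize come from this section; the constructor helper, the remaining option fields, and the concrete sizes are assumptions to check against options.go.
opt := NoKV.NewDefaultOptions("./work") // hypothetical helper; build Options however your setup does
opt.BlockCacheSize = 4096               // parsed L0/L1 blocks kept in the Ristretto cache (cost = 1 per block)
opt.BloomCacheSize = 1024               // bloom filter bitsets kept in the LRU + CLOCK hot tier
db, err := NoKV.Open(opt)
if err != nil {
    panic(err)
}
defer db.Close()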
4. Metrics & Observability
cache.metricsSnapshot() produces:
type CacheMetrics struct {
L0Hits, L0Misses uint64
L1Hits, L1Misses uint64
BloomHits, BloomMisses uint64
IndexHits, IndexMisses uint64
}
Stats.Snapshot converts these into hit rates. Monitor them alongside the block cache sizes to decide when to scale memory.
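As a quick illustration, the counters reduce to hit rates with plain division (mirroring what Stats.Snapshot does); the helper below is a standalone sketch rather than the engine’s own code.
// hitRate turns a hit/miss counter pair into a ratio, guarding against empty counters.
func hitRate(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}
// Given a CacheMetrics value m:
//   l0    := hitRate(m.L0Hits, m.L0Misses)
//   bloom := hitRate(m.BloomHits, m.BloomMisses)
//   index := hitRate(m.IndexHits, m.IndexMisses)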
5. Hot Integration (HotRing)
- Hot detection: HotRing counts on read/write paths raise a hot flag once thresholds are met; only hot keys trigger prefetch.
- Cache promotion: hot hits/prefetch promote blocks into the CLOCK hot tier and promote indexes/Blooms into their CLOCK tiers; cold data stays in the main cache to avoid pollution.
- Compaction coupling: HotRing top-k feeds compaction scoring; levels/ingest shards covering hot ranges get higher scores to trim overlap sooner.
- Tuning: Hot thresholds come from HotRing options (window/decay configurable); hot tier capacities are small and derived from existing cache sizes.
6. Interaction with Value Log
- Keys stored as value pointers (large values) still populate block cache entries for the key/index block. The value payload is read directly from the vlog (valueLog.read), so block cache hit rates remain meaningful.
- Discard stats from flushes can demote cached blocks via cache.dropBlock, ensuring obsolete SST data leaves the cache quickly.
7. Comparison
| Feature | RocksDB | BadgerDB | NoKV |
|---|---|---|---|
| Hot/cold tiers | Configurable multiple caches | Single cache | Ristretto (hot) + OS page cache (cold) |
| Bloom cache | Enabled per table, no explicit cache | Optional | Dedicated LRU storing filters |
| Metrics | Block cache stats via GetAggregatedIntProperty | Limited | NoKV.Stats.Cache.* hit rates |
8. Operational Tips
- If bloom hit rate falls below ~60%, consider increasing bits-per-key or Bloom cache size.
- Track nokv stats --json cache metrics over time; drops often indicate iterator misuse or working-set shifts.
More on SST layout lives in docs/manifest.md and docs/architecture.md.
HotRing – Hot Key Tracking
hotring is NoKV’s built-in hot-key tracker. It samples read/write access frequency per key and exposes the hottest entries to the stats subsystem and CLI. The implementation resides in hotring/.
1. Motivation
- Cache hints – DB.prefetchLoop (see db.go) consumes hot keys to schedule asynchronous reads into the block cache.
- Operational insight – StatsSnapshot.HotKeys and nokv stats --json surface the hottest keys, aiding debugging of traffic hotspots.
- Throttling – HotRing.TouchAndClamp enables simple rate caps: once a key crosses a threshold, callers can back off or log alerts.
Compared with RocksDB (which exposes block access stats via perf_context) and Badger (which lacks built-in hot-key reporting), NoKV offers a lightweight but concurrent-friendly tracker out of the box.
2. Data Structure
HotRing
buckets[] -> per-bucket lock-free linked list (Node)
hashFn -> hash(key) -> uint32
hashMask -> selects bucket (power of two size)
- Each bucket stores a sorted linked list of Node ordered by (tag, key), where tag is derived from the upper bits of the hash. Head pointers are atomic.Pointer[Node], so readers walk the list without taking locks; writers use CAS to splice nodes while preserving order.
- defaultTableBits = 12 → 4096 buckets by default (NewHotRing). The mask ensures cheap modulo operations.
- Nodes keep a count (int32) updated atomically and a next pointer stored via unsafe.Pointer. Sliding-window state is guarded by a tiny per-node spin lock instead of a process-wide mutex.
flowchart LR Key(key) -->|hash| Bucket["buckets[index] (atomic head)"] Bucket --> Node1 Node1 --> Node2 Node2 --> Node3 Node3 --> Nil[(nil)]
3. Core Operations
| Method | Behaviour | Notes |
|---|---|---|
Touch | Insert or increment key’s counter. | CAS-splices a new node if missing, then increments (window-aware when enabled). |
Frequency | Read-only counter lookup. | Lock-free lookup; uses sliding-window totals when configured. |
TouchAndClamp | Increment unless count >= limit, returning (count, limited). | Throttling follows sliding-window totals so hot bursts clamp quickly. |
TopN | Snapshot hottest keys sorted by count desc. | Walks buckets without locks, then sorts a copy. |
KeysAbove | Return all keys with counters ≥ threshold. | Handy for targeted throttling or debugging hot shards. |
Bucket ordering is preserved by findOrInsert, which CASes either the bucket head or the predecessor’s next pointer to splice new nodes. Reads never take locks; only per-node sliding-window updates spin briefly to avoid data races.
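A minimal usage sketch; the constructor arguments follow the hotring example later in this book, and the exact types returned by TopN/KeysAbove are elided, so treat the snippet as illustrative rather than the package’s definitive API.
ring := hotring.NewHotRing(12, nil) // 2^12 buckets, default hash function
ring.Touch("user:42")               // record a successful read
ring.Touch("user:42")
hot := ring.TopN(16)                // hottest keys, as surfaced in StatsSnapshot.HotKeys
noisy := ring.KeysAbove(100)        // keys whose counters crossed a threshold
_, _ = hot, noisy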
4. Integration Points
- DB reads – Txn.Get and iterators call db.recordRead, which in turn invokes HotRing.Touch for every successful lookup. Writes touch the ring only when Options.WriteHotKeyLimit is set, so throttling can clamp abusive keys.
- Stats – StatsSnapshot copies hot.TopN into HotKeys. expvar publishes the same view under NoKV.Stats.HotKeys for automation.
- Caching – lsm/cache can promote blocks referenced by frequently touched keys, keeping the hot tier warm.
5. Comparisons
| Engine | Approach |
|---|---|
| RocksDB | External – TRACE / perf_context requires manual sampling. |
| Badger | None built-in. |
| NoKV | In-process ring with expvar/CLI export and throttling helpers. |
The HotRing emphasises simplicity: lock-free bucket lists with atomic counters (plus optional per-node window tracking), avoiding sketches while staying light enough for hundreds of thousands of hot keys.
6. Operational Tips
- Options.HotRingTopK controls how many keys show up in stats; default 16. Increase it when investigating workloads with broad hot sets.
- Combine TouchAndClamp with request middleware to detect abusive tenants: when limited is true, log the key and latency impact.
- Resetting the ring is as simple as instantiating a new HotRing—useful for benchmarks that require clean counters between phases.
For end-to-end examples see docs/stats.md and the CLI walkthrough in docs/cli.md.
7. Write-Path Throttling
Options.WriteHotKeyLimit wires HotRing into the write path. When set to a positive integer, every call to DB.Set* or transactional Txn.Set* invokes HotRing.TouchAndClamp with the limit. Once a key (optionally scoped by column family via cfHotKey) reaches the limit, the write is rejected with utils.ErrHotKeyWriteThrottle. This keeps pathological tenants or hot shards from overwhelming a single Raft group without adding heavyweight rate-limiters to the client stack.
Operational hints:
- StatsSnapshot.HotWriteLimited and the CLI line Write.HotKeyThrottled expose how many writes were rejected since the process started.
- Applications should surface utils.ErrHotKeyWriteThrottle to callers (e.g. HTTP 429) so clients can back off.
- Prefetching continues to run independently—only writes are rejected; reads still register hotness so the cache layer knows what to prefetch.
- Set the limit conservatively (e.g. a few dozen) and pair it with richer HotRing analytics (top-K stats, expvar export) to identify outliers before tuning.
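A hedged sketch of surfacing the error at an HTTP boundary; the handler shape and the db.Set signature are assumptions, only utils.ErrHotKeyWriteThrottle and the 429 mapping come from the notes above.
// Assumes imports: errors, net/http, NoKV "github.com/feichai0017/NoKV", and its utils package.
func putHandler(db *NoKV.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		key := []byte(r.URL.Query().Get("key"))
		val := []byte(r.URL.Query().Get("val"))
		if err := db.Set(key, val); err != nil { // db.Set signature assumed
			if errors.Is(err, utils.ErrHotKeyWriteThrottle) {
				// HotRing clamped this key: tell the client to back off.
				http.Error(w, "hot key, retry later", http.StatusTooManyRequests)
				return
			}
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}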
8. Time-Based Decay & Sliding Window
HotRing now exposes two complementary controls so “old” hotspots fade away automatically:
- Periodic decay (Options.HotRingDecayInterval + HotRingDecayShift) – every interval the global counters are right-shifted (count >>= shift). This keeps TopN and stats output focused on recent traffic even if writes stop abruptly.
- Sliding window (Options.HotRingWindowSlots + HotRingWindowSlotDuration) – per-key windows split time into slots, each lasting slotDuration. Touch only accumulates inside the current slot; once the window slides past, the stale contribution is dropped. TouchAndClamp and Frequency use the sliding-window total, so write throttling reflects short-term pressure instead of lifetime counts.
Disable either mechanism by setting the interval/durations to zero. Typical starting points:
| Option | Default value | Effect |
|---|---|---|
HotRingDecayInterval | 1s | Halve legacy counters once per second. |
HotRingDecayShift | 1 | Simple divide-by-two decay. |
HotRingWindowSlots | 8 | Keep ~8 buckets of recency data. |
HotRingWindowSlotDuration | 250ms | Roughly 2s window for throttling. |
With both enabled, the decay loop keeps background stats tidy while the sliding window powers precise, short-term throttling logic.
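For illustration, wiring both mechanisms through Options could look like the snippet below; the option names and defaults come from the table above, while the opt variable and the chosen WriteHotKeyLimit are assumptions.
opt.HotRingDecayInterval = time.Second                 // right-shift counters once per second
opt.HotRingDecayShift = 1                              // divide-by-two decay
opt.HotRingWindowSlots = 8                             // ~8 slots of recency data
opt.HotRingWindowSlotDuration = 250 * time.Millisecond // roughly a 2 s window for throttling
opt.WriteHotKeyLimit = 64                              // clamp a key once it exceeds 64 hits in the window
When constructing a ring directly, EnableSlidingWindow and EnableDecay (shown in the hotring design note at the end of this book) achieve the same effect.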
Transaction & MVCC Design
NoKV provides snapshot-isolated transactions backed by a lightweight oracle that hands out timestamps, tracks conflicts, and coordinates with the write pipeline. The implementation lives entirely in txn.go with metrics surfaced via stats.go.
1. Components at a Glance
| Component | Purpose | Key Functions |
|---|---|---|
oracle | Issues read/commit timestamps, performs conflict checks, persists watermark progress. | readTs, newCommitTs, doneCommit |
Txn | User-facing transaction state: pending writes, read-set fingerprints, MVCC metadata. | SetEntry, Get, Commit |
pendingWritesIterator | Allows iterator merge to see unflushed txn writes. | newPendingWritesIterator |
| Metrics | Tracks counts of started/committed/conflicted txns. | trackTxnStart, txnMetricsSnapshot |
The oracle is initialised during DB.Open, sharing lineage with BadgerDB’s MVCC model. Unlike RocksDB—which relies on WriteBatch/TwoPhaseCommit extensions—transactions are first-class citizens, and the core engine enforces ordering.
2. Timestamp & Conflict Flow
sequenceDiagram
participant Client
participant DB
participant Oracle
participant Commit as commitWorker
participant Mgr as vlog.Manager
participant WAL
participant Mem as MemTable
Client->>DB: NewTransaction(update)
DB->>Oracle: readTs()
Oracle-->>DB: snapshot ts (nextTxnTs-1)
Client->>DB: Set/Delete/Get
DB->>Txn: stage pendingWrites, record read hashes
Client->>DB: Commit
DB->>Oracle: newCommitTs(txn)
alt conflict
Oracle-->>DB: ErrConflict
else success
Oracle-->>DB: commitTs
DB->>Commit: batch requests
Commit->>Mgr: AppendEntries(entries, writeMask)
Commit->>WAL: Append(entries with commitTs)
Commit->>Mem: apply to skiplist
DB->>Oracle: doneCommit(commitTs)
end
- Start – DB.newTransaction calls oracle.readTs, which waits for all prior commits to finish (txnMark.WaitForMark) so new readers see a consistent snapshot. In distributed deployments, clients must obtain the startVersion themselves (see Timestamp sources).
- Reads – Txn.Get first checks pendingWrites; otherwise it merges LSM iterators and value-log pointers under the read timestamp. For update transactions the read key fingerprint is recorded in Txn.reads via addReadKey.
- Conflict detection – When Options.DetectConflicts is enabled, oracle.newCommitTs iterates oracle.committedTxns and compares read fingerprints against keys written by newer commits. This mirrors Badger’s optimistic strategy.
- Commit timestamp – newCommitTs increments nextTxnTs, registers the commit in txnMark, and stores the conflict key set for future comparisons.
- Apply – Txn.commitAndSend assigns the commit timestamp to each pending entry (kv.KeyWithTs), enqueues them through sendToWriteCh, and returns a callback that waits for the batch completion. Only after the callback runs does the oracle’s doneCommit release the commit watermark.
- Value log ordering – As with non-transactional writes, the commit worker runs valueLog.write (which calls Manager.AppendEntries) before the WAL append. On failure vlog.manager.Rewind ensures partial writes do not leak.
RocksDB’s default WriteBatch lacks conflict detection, relying on application-level locking; NoKV ships with snapshot isolation and optional detection, similar to Badger’s Txn but with integrated metrics and iterator pooling.
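A condensed sketch of the optimistic flow described above; NewTransaction/Commit/Discard and utils.ErrConflict appear in this section, while the Set signature and the retry helper are assumptions.
txn := db.NewTransaction(true) // update=true: reads are fingerprinted for conflict checks
defer txn.Discard()
// Stage a write; it lives in pendingWrites until Commit.
if err := txn.Set([]byte("balance:42"), []byte("100")); err != nil { // signature assumed
	return err
}
// Commit asks the oracle for a commitTs; a newer commit over our read set yields ErrConflict.
if err := txn.Commit(); errors.Is(err, utils.ErrConflict) {
	return runAgain() // hypothetical retry helper: rebuild the txn on a fresh snapshot
} else if err != nil {
	return err
}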
3. Data Structures
Oracle Watermarks
oracle{
nextTxnTs // next commit timestamp to assign
txnMark // watermark waiting for WAL/vlog durability
readMark // tracks oldest active read timestamp
committedTxns[] // sliding window of conflict key sets
}
- txnMark/readMark are utils.WaterMark instances. They guarantee all writes with timestamp ≤ readTs are durable before a new read snapshot begins, mirroring Badger’s approach to avoid reading half-committed data.
- cleanupCommittedTransactions prunes conflict history based on the oldest outstanding read, preventing unbounded memory use.
Txn State
type Txn struct {
readTs uint64
commitTs uint64
pendingWrites map[string]*kv.Entry
conflictKeys map[uint64]struct{}
reads []uint64
numIterators int32
discarded bool
update bool
}
- Pending writes retain the caller’s entry pointers until commit; NoKV copies values only when moving them into the write batch.
- Read fingerprints use kv.MemHash, so conflict detection is order-independent and compact.
- MVCC versions are encoded in the key suffix (KeyWithTs), matching the LSM’s descending version order.
Iterator Integration
- Txn.newPendingWritesIterator materialises staged entries as a sorted slice, allowing transaction iterators to merge them with memtables/SST tables. This ensures Txn.NewIterator sees writes immediately without affecting other snapshots.
- Txn.numIterators enforces that all iterators close before commit/discard—helpful for catching resource leaks in tests (txn_iterator_test.go).
4. Commit & Error Handling
| Stage | Failure Handling |
|---|---|
| Conflict | oracle.newCommitTs returns (0, true); Txn.Commit surfaces utils.ErrConflict and leaves state untouched. |
| Value log append | valueLog.write rewinds via Manager.Rewind; req.Wait returns the error so callers can retry safely. |
| WAL append | sendToWriteCh propagates WAL errors; commit watermark is cleared immediately in that case. |
| Callback mode | Txn.CommitWith schedules runTxnCallback on a goroutine; user callbacks always execute (success or error). |
The final call to Txn.Discard runs regardless of success, marking the read watermark done and decrementing the oracle’s active counter.
5. Comparisons
| Feature | RocksDB | BadgerDB | NoKV |
|---|---|---|---|
| Isolation | Optional (WritePrepared/2PC) | Snapshot isolation | Snapshot isolation with WaterMark barriers |
| Conflict detection | External | Optional optimistic | Optional optimistic keyed by utils.MemHash |
| Iterator view | Snapshot handles, manual merging | Built-in | Built-in with pending write iterator |
| Metrics | rocksdb.transactions.* when enabled | Basic stats | NoKV.Txns.* expvar counters + CLI |
NoKV inherits Badger’s optimistic concurrency but strengthens durability ordering by coupling commits with the same write pipeline that non-transactional writes use. Compared with RocksDB’s transactional library, the Go implementation remains lightweight and requires no external locks.
6. Operational Notes
- Long-running reads: watch NoKV.Txns.Active and oracle.readMark.DoneUntil()—slow consumers keep old versions alive, delaying vlog GC and compaction reclamation.
- Non-transactional APIs: DB.Set/Get/Del and SetCF/GetCF/DelCF use a MaxUint64 sentinel version for “latest”. Do not mix these writes with MVCC/Txn writes in the same database.
- Managed mode: exposing Txn.SetEntry with pre-set versions allows replication/replay flows. Because commit timestamps may diverge, transaction markers are only set when all entries share a single commitTs.
- Throttling: combine HotRing.TouchAndClamp with per-transaction analytics to detect hot-key write storms that lead to frequent conflicts.
See docs/testing.md for the regression matrix covering conflict detection, iterator semantics, and managed timestamps.
7. Timestamp Sources
Replica nodes do not generate timestamps during TinyKV RPC handling; the values sent in KvPrewrite/KvCommit are applied verbatim. For teaching and prototyping you can pick from two approaches:
- Single-client experiments – choose monotonically increasing integers in your client code (as shown in raftstore/client/client_test.go).
- Shared allocator – run the sample TSO service under scripts/tso to hand out globally increasing timestamps:
go run ./scripts/tso --addr 127.0.0.1:9494 --start 100
# request one timestamp
curl -s http://127.0.0.1:9494/tso
# request a batch of 16
curl -s "http://127.0.0.1:9494/tso?batch=16"
Each call returns JSON ({"timestamp":123,"count":1}), where timestamp is the first value in the allocated range. Clients can use the first value for startVersion, or the entire range to provision multiple transactions. This keeps the learning focus on the Percolator flow while demonstrating how production systems would obtain globally ordered timestamps.
RaftStore Deep Dive
raftstore powers NoKV’s distributed mode by layering multi-Raft replication on top of the embedded storage engine. This note explains the major packages, the boot and command paths, how transport and storage interact, and the supporting tooling for observability and testing.
1. Package Structure
| Package | Responsibility |
|---|---|
store | Orchestrates peer set, command pipeline, region manager, scheduler/heartbeat loops; exposes helpers such as StartPeer, ProposeCommand, SplitRegion. |
peer | Wraps etcd/raft RawNode, drives Ready processing (persist to WAL, send messages, apply entries), tracks snapshot resend/backlog. |
engine | WALStorage/DiskStorage/MemoryStorage across all Raft groups, leveraging the NoKV WAL while keeping manifest metadata in sync. |
transport | gRPC transport with retry/TLS/backpressure; exposes the raft Step RPC and can host additional services (TinyKv). |
kv | TinyKv RPC implementation, bridging Raft commands to MVCC operations via kv.Apply. |
server | ServerConfig + New that bind DB, Store, transport, and TinyKv server into a reusable node primitive. |
2. Boot Sequence
-
Construct Server
srv, _ := raftstore.NewServer(raftstore.ServerConfig{ DB: db, Store: raftstore.StoreConfig{StoreID: 1}, Raft: myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true}, TransportAddr: "127.0.0.1:20160", })- A gRPC transport is created, the TinyKv service is registered, and
transport.SetHandler(store.Step)wires raft Step handling. store.Storeloadsmanifest.RegionSnapshot()to rebuild the Region catalog (router + metrics).
- A gRPC transport is created, the TinyKv service is registered, and
-
Start local peers
- CLI (
nokv serve) iterates the manifest snapshot and callsStore.StartPeerfor every region that includes the local store. - Each
peer.Configcarries raft parameters, the transport reference,kv.NewEntryApplier, WAL/manifest handles, and Region metadata. StartPeerregisters the peer through the peer-set/routing layer and may bootstrap or campaign for leadership.
- CLI (
-
Peer connectivity
transport.SetPeer(storeID, addr)defines outbound raft connections; the CLI exposes it via--peer storeID=addr.- Additional services can reuse the same gRPC server through
transport.WithServerRegistrar.
3. Command Execution
Read (strong leader read)
- kv.Service.KvGet builds pb.RaftCmdRequest and invokes Store.ReadCommand.
- validateCommand ensures the region exists, epoch matches, and the local peer is leader; a RegionError is returned otherwise.
- peer.Flush() drains pending Ready, guaranteeing the latest committed log is applied.
- commandApplier (i.e. kv.Apply) runs GET/SCAN directly against the DB, using MVCC readers to honour locks and version visibility.
Write (via Propose)
- Write RPCs (Prewrite/Commit/…) call Store.ProposeCommand, encoding the command and routing to the leader peer.
- The leader appends the encoded request to raft, replicates, and once committed the command pipeline hands data to kv.Apply, which maps Prewrite/Commit/ResolveLock to the percolator package.
- engine.WALStorage persists raft entries/state snapshots and updates manifest raft pointers. This keeps WAL GC and raft truncation aligned.
4. Transport
- The gRPC transport listens on TransportAddr, serving both the raft Step RPC and TinyKv RPC.
- SetPeer updates the mapping of remote store IDs to addresses; BlockPeer can be used by tests or chaos tooling.
- Configurable retry/backoff/timeout options mirror production requirements. Tests cover message loss, blocked peers, and partitions.
5. Storage Backend (engine)
- WALStorage piggybacks on the embedded WAL: each Raft group writes typed entries, HardState, and snapshots into the shared log.
- LogRaftPointer and LogRaftTruncate edit manifest metadata so WAL GC knows how far it can compact per group.
- Alternative storage backends (DiskStorage, MemoryStorage) are available for tests and special scenarios.
6. TinyKv RPC Integration
| RPC | Execution Path | Notes |
|---|---|---|
KvGet / KvScan | ReadCommand → kv.Apply (read mode) | No raft round-trip; leader-only. |
KvPrewrite / KvCommit / KvBatchRollback / KvResolveLock / KvCheckTxnStatus | ProposeCommand → command pipeline → raft log → kv.Apply | Pipeline matches proposals with apply results; MVCC latch manager prevents write conflicts. |
The cmd/nokv serve command uses raftstore.Server internally and prints a manifest summary (key ranges, peers) so operators can verify the node’s view at startup.
7. Client Interaction (raftstore/client)
- Region-aware routing with NotLeader/EpochNotMatch retry.
- Mutate splits mutations by region and performs two-phase commit (primary first). Put/Delete are convenience wrappers. Scan transparently walks region boundaries.
- End-to-end coverage lives in raftstore/server/server_client_integration_test.go, which launches real servers, uses the client to write and delete keys, and verifies the results.
8. Control Plane & Region Operations
8.1 Topology & Routing
- Topology is sourced from raft_config.example.json (via config.LoadFile) and reused by scripts, Docker Compose, and the Redis gateway.
- The client builds a static region map ([]RegionConfig) and store endpoints from the same file; there is no dynamic PD-style reconfiguration today.
- The built-in scheduler currently emits leader-transfer operations only (see raftstore/scheduler), acting as a minimal control plane.
8.2 Split / Merge
- Split: leaders call Store.ProposeSplit, which writes a split AdminCommand into the parent region’s raft log. On apply, Store.SplitRegion updates the parent range/epoch and starts the child peer.
- Merge: leaders call Store.ProposeMerge, writing a merge AdminCommand. On apply, the target region range/epoch is expanded and the source peer is stopped/removed from the manifest.
- These operations are explicit and are not auto-triggered by size/traffic heuristics; a higher-level controller could call the same APIs.
9. Observability
- store.RegionMetrics() feeds into StatsSnapshot, making region counts and backlog visible via expvar and nokv stats.
- nokv regions shows manifest-backed regions: ID, range, peers, state.
- scripts/transport_chaos.sh exercises transport metrics under faults; scripts/run_local_cluster.sh spins up multi-node clusters for manual inspection.
Store internals at a glance
| Component | File | Responsibility |
|---|---|---|
| Peer set | peer_set.go | Tracks active peers, synchronises router registration, exposes thread-safe lookups/iteration. |
| Command pipeline | command_pipeline.go | Assigns request IDs, records proposals, matches apply results, returns responses/errors to callers. |
| Region manager | region_manager.go | Validates state transitions, writes manifest edits, updates peer metadata, triggers region hooks. |
| Operation scheduler | operation_scheduler.go | Buffers planner output, enforces cooldown & burst limits, dispatches leader transfers or other operations. |
| Heartbeat loop | heartbeat_loop.go | Periodically publishes region/store heartbeats and re-runs the planner to produce scheduling actions. |
| Global registry | global.go | Records live stores for CLI/scripting (Store.Close() automatically unregisters instances). |
10. Extending raftstore
- Adding peers: update the manifest with new Region metadata, then call Store.StartPeer on the target node.
- Follower or lease reads: extend ReadCommand to include ReadIndex or leader lease checks; the current design only serves leader reads.
- Scheduler integration: pair RegionSnapshot() and RegionMetrics() with an external scheduler (PD-like) for dynamic balancing.
This layering keeps the embedded storage engine intact while providing a production-ready replication path, robust observability, and straightforward integration in both CLI and programmatic contexts.
Crash Recovery Playbook
This playbook documents how NoKV rebuilds state after a crash and which automated checks ensure correctness. It ties together WAL replay, manifest reconciliation, ValueLog GC, and flush pipelines—mirroring RocksDB’s layered recovery while incorporating Badger-style value log hygiene.
1. Recovery Phases
flowchart TD
Start[DB.Open]
Verify[runRecoveryChecks]
Manifest[manifest.Open → replay]
WAL[wal.Manager.Replay]
VLog[valueLog.recover]
Flush[Recreate memtables]
Stats[Stats.StartStats]
Start --> Verify --> Manifest --> WAL --> VLog --> Flush --> Stats
- Directory verification – DB.runRecoveryChecks calls manifest.Verify, wal.VerifyDir, and initialises the vlog directory. Missing directories fail fast.
- Manifest replay – manifest.Open reads CURRENT, replays EditAddFile/DeleteFile, EditLogPointer, and vlog edits into an in-memory Version.
- WAL replay – wal.Manager.Replay processes segments newer than the manifest checkpoint, rebuilding memtables from committed entries.
- ValueLog reconciliation – valueLog.recover scans existing .vlog files, drops segments marked invalid, and trims torn tails to the last valid entry.
- Flush backlog – immutable memtables recreated from the WAL are resubmitted to flush.Manager; temporary .sst.tmp files are either reinstalled or cleaned up.
- Stats bootstrap – the metrics goroutine restarts so CLI commands immediately reflect queue backlogs and GC status.
This mirrors RocksDB’s DBImpl::Recover while extending to handle value log metadata automatically.
2. Failure Scenarios & Expected Outcomes
| Failure Point | Example Simulation | Expected Recovery Behaviour | Tests |
|---|---|---|---|
| WAL tail truncation | truncate last 2 bytes of 000005.wal | Replay stops at truncated record, previously flushed SST remains intact | wal/manager_test.go::TestReplayTruncatedTail |
| Flush crash before install | crash after writing .sst.tmp | WAL replay rebuilds memtable; temp file removed; no manifest edit present | db_recovery_test.go::TestRecoveryWALReplayRestoresData |
| Flush crash after install | crash after logging manifest edit but before WAL release | Manifest still lists SST; recovery verifies file exists and releases WAL on reopen | db_recovery_test.go::TestRecoveryCleansMissingSSTFromManifest |
| ValueLog GC crash | delete edit written, file still on disk | Recovery removes stale .vlog file and keeps manifest consistent | db_recovery_test.go::TestRecoveryRemovesStaleValueLogSegment |
| Manifest rewrite crash | new MANIFEST written, CURRENT not updated | Recovery keeps using old manifest; stale temp file cleaned | db_recovery_test.go::TestRecoveryManifestRewriteCrash |
| Transaction in-flight | crash between WAL append and memtable update | WAL replay reapplies entry; transactions remain atomic because commit order is vlog → WAL → memtable | txn_test.go::TestTxnCommitPersists |
3. Automation & Tooling
3.1 Go Test Matrix
GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./... -run 'Recovery'
- Exercises WAL replay, manifest cleanup, vlog GC, and managed transaction recovery.
- Set RECOVERY_TRACE_METRICS=1 to emit structured logs (key/value pairs) for each scenario.
3.2 Shell Script Harness
scripts/recovery_scenarios.sh orchestrates the matrix end-to-end:
- Spins up a temporary database, injects writes, and crashes at chosen checkpoints.
- Reopens the database and validates via CLI (nokv stats, nokv manifest, nokv vlog).
- Archives logs under artifacts/recovery/<scenario>.log for CI inspection.
3.3 CLI Validation
- nokv manifest --workdir <dir>: confirm WAL checkpoint, level files, vlog head.
- nokv stats --workdir <dir>: observe flush backlog drop to zero after replay.
- nokv vlog --workdir <dir>: ensure stale segments disappear after GC recovery.
These commands give the same insight as RocksDB’s ldb manifest_dump or Badger’s CLI but with JSON output for automation.
4. Metrics Emitted During Recovery
When RECOVERY_TRACE_METRICS=1:
- RECOVERY_METRIC phase="manifest" ... – manifest replay progress.
- RECOVERY_METRIC phase="wal" segment=... offset=... – WAL records applied.
- RECOVERY_METRIC phase="vlog_gc" fid=... action="delete" – vlog cleanup status.
StatsSnapshot also exposes:
- NoKV.Flush.Queue – remaining flush tasks.
- NoKV.ValueLog.HeadFID – head file after recovery.
- NoKV.Txns.Active – should reset to zero post-recovery.
5. Comparison with RocksDB & Badger
| Aspect | RocksDB | BadgerDB | NoKV |
|---|---|---|---|
| WAL replay | DBImpl::RecoverLogFiles replays per log number | Journal (value log) is replayed into LSM | Dedicated WAL manager with manifest checkpoint, plus vlog trim |
| Manifest reconciliation | Removes missing files, handles CURRENT rewrite | Minimal manifest (mainly tables) | Tracks SST + vlog metadata; auto-cleans missing SST/vlog |
| Value log recovery | Optional (BlobDB) requires external blob manifest | Primary log, re-scanned on start | Manifest-backed head + discard stats to avoid rescan |
| Tooling | ldb for manifest dump | badger CLI | nokv CLI with JSON output |
NoKV inherits RocksDB’s strict manifest semantics and Badger’s value log durability, yielding deterministic restart behaviour even under mixed workloads.
6. Extending the Matrix
Future enhancements to cover:
- Compaction crash – simulate partial compaction output and verify manifest rollback.
- Prefetch queue state – ensure hot-key prefetch map resets cleanly.
- Raft integration – once replication is added, validate raft log catch-up interacts correctly with WAL replay.
Contributions adding new recovery scenarios should update this document and the shell harness to keep observability aligned.
Stats & Observability Pipeline
NoKV exposes internal health via the Go expvar package and the nokv stats CLI. The statistics subsystem is implemented in stats.go and runs continuously once the DB is open.
1. Architecture
flowchart TD
subgraph Collectors
Flush[lsm.FlushMetrics]
Levels[lsm.CompactionStats]
VLog[valueLog.metrics]
WAL[wal.Manager.Metrics]
Txn[oracle.txnMetricsSnapshot]
Cache[lsm.CacheMetrics]
Hot[hotring.TopN]
end
Collectors --> Stats
Stats -->|expvar publish| Runtime
Stats -->|Snapshot| CLI
- newStats wires together reusable expvar.Int/Float gauges (avoiding duplicates if the process restarts an embedded DB).
- Stats.StartStats launches a goroutine that ticks every 5s (configurable via Stats.interval) to refresh values.
- Stats.Snapshot can be called on-demand (e.g. by the CLI) without mutating expvar state.
2. Snapshot Fields
| Field | Source | Description |
|---|---|---|
Entries | lsm.EntryCount() | Total MVCC entries (L0-Ln + memtables). Mirrors Stats.EntryNum for backwards compat. |
FlushPending/Queue/Active | lsm.FlushMetrics() | Pending immutables, queue length, workers currently building SSTs. |
FlushWait/Build/ReleaseMs | Derived from WaitNs/BuildNs/ReleaseNs averages | End-to-end latency of flush pipeline stages. |
CompactionBacklog/MaxScore | lsm.CompactionStats() | How many level files await compaction and the hottest score. |
ValueLogSegments/PendingDel/DiscardQueue/Head | valueLog.metrics() | Tracks vlog utilisation and GC backlog. |
WALActiveSegment/SegmentCount/Removed/ActiveSize | wal.Manager.Metrics() | Observes WAL rotation cadence and current segment byte usage (pairs with raft lag metrics). |
WALTypedRecordRatio/Warning/Reason | WAL backlog watchdog (Stats.Snapshot) | Tracks ratio of raft typed records in the WAL and surfaces warnings with reasons when exceeding thresholds. |
WALAutoGCRuns/Removed/LastUnix | WAL backlog watchdog | Automated WAL GC passes, total segments removed, and the Unix timestamp of the last run. |
WriteQueueDepth/Entries/Bytes | writeMetrics.snapshot() | Size of the asynchronous write queue. |
WriteAvg* | writeMetrics averages | Request wait times, vlog latency, apply latency. |
WriteBatchesTotal | writeMetrics | Lifetime batches processed. |
HotWriteLimited | db.hotWriteLimited | Number of write attempts rejected by Options.WriteHotKeyLimit (HotRing write throttling). |
WriteThrottleActive | db.blockWrites | Indicates when writes are being throttled. |
TxnsActive/Started/Committed/Conflicts | oracle.txnMetricsSnapshot() | MVCC activity counters. |
HotKeys | hotring.TopN() | Top-K hot key counts. |
BlockL0/L1/BloomHitRate | lsm.CacheMetrics() | Block and bloom cache hit ratios. |
IndexHitRate | lsm.CacheMetrics() | SST index block cache hit rate. |
IteratorReused | iteratorPool.reused() | Frequency of iterator pooling hits. |
RaftGroupCount/LaggingGroups/MaxLagSegments/LagWarnThreshold/RaftLagWarning | manifest.RaftPointerSnapshot() | Tracks follower backlogs; LagWarnThreshold comes from Options.RaftLagWarnSegments, and RaftLagWarning toggles when any group exceeds it. |
RegionTotal/New/Running/Removing/Tombstone/Other | store.RegionMetrics | Multi-Raft region state distribution. CLI attaches the first available RegionMetrics by default; pass --no-region-metrics to disable. |
All values are exported under the NoKV.* namespace via expvar (see newStats).
3. CLI & JSON Output
- nokv stats --workdir <dir> prints a human-readable table (queue lengths, throughput, hot keys, region totals). It automatically attaches RegionMetrics when available; add --no-region-metrics to produce a manifest-only snapshot.
- When RaftLagWarning=true the CLI emits an extra Raft.Warning line; it also surfaces Regions.Total (...) so operators can quickly gauge Region lifecycle health.
- nokv stats --json emits the raw snapshot for automation. Example snippet:
{
"entries": 1048576,
"flush_queue_length": 2,
"vlog_head": {"fid": 5, "offset": 184320},
"hot_keys": [{"key": "user:123", "count": 42}]
}
The CLI internally instantiates a read-only DB handle, calls Stats.Snapshot, and formats the response—no background goroutine is needed.
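Since everything is plain expvar, any HTTP client can read the same data from /debug/vars; the sketch below only assumes a process exporting expvar at the given (illustrative) address and the NoKV.* key prefix described in this chapter.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:9100/debug/vars") // expvar endpoint; address is illustrative
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var vars map[string]json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
		panic(err)
	}
	// Dump every NoKV.* gauge; individual keys (e.g. NoKV.Txns.Active) can be decoded further.
	for name, raw := range vars {
		if strings.HasPrefix(name, "NoKV.") {
			fmt.Printf("%s = %s\n", name, raw)
		}
	}
}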
4. Integration with Other Modules
| Module | Contribution |
|---|---|
| WAL | wal.Manager.Metrics() counts active/removable segments, aiding post-recovery validation. |
| Value Log | valueLog.metrics() exposes GC backlog, enabling alerting when discard queues stall. |
| HotRing | Publishes hot key JSON via expvar so dashboards can visualise top offenders. |
| Transactions | Oracle counters help gauge contention (high conflicts → tune workload). |
| Cache | Hit rates clarify whether cache sizing (hot/cold tier) needs adjustment. |
5. Comparisons
| Engine | Observability |
|---|---|
| RocksDB | iostats, perf_context, ldb commands. Requires manual parsing. |
| Badger | Prometheus metrics (optional). |
| NoKV | Built-in expvar gauges + CLI + recovery trace toggles. |
NoKV emphasises zero-dependency observability. Everything is consumable via HTTP /debug/vars or the CLI, making it easy to integrate with Go services.
6. Operational Guidance
- Watch FlushQueueLength and CompactionBacklog together—if both grow, increase flush workers or adjust level sizes.
- ValueLogDiscardQueue > 0 for extended periods indicates GC is blocked; inspect NoKV.ValueLog.GcRuns and consider tuning thresholds.
- WriteThrottleActive toggling frequently suggests L0 is overwhelmed; cross-check BlockL0HitRate and compaction metrics.
- HotWriteLimited climbing steadily means HotRing write throttling is firing—surface utils.ErrHotKeyWriteThrottle to clients and investigate abusive keys via the HotKeys list.
- RaftLagWarning toggling to true means at least one follower lags the leader by more than Options.RaftLagWarnSegments; inspect Raft.Warning from the CLI and consider snapshot resend or throttling the offending node.
- Regions.Total should match the expected cluster topology; sustained Removing/Tombstone counts indicate stalled cleanup—investigate split/merge logic or stuck replicas.
Refer to docs/testing.md for scripted checks that validate stats during CI runs.
Testing & Validation Matrix
This document inventories NoKV’s automated coverage and provides guidance for extending tests. It aligns module-level unit tests, integration suites, and benchmarking harnesses with the architectural features described elsewhere.
1. Quick Commands
# All unit + integration tests (uses local module caches)
GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./...
# Focused transaction suite
go test ./... -run '^TestTxn|TestConflict|TestTxnIterator'
# Crash recovery scenarios
RECOVERY_TRACE_METRICS=1 ./scripts/recovery_scenarios.sh
# gRPC transport chaos tests + watchdog metrics
CHAOS_TRACE_METRICS=1 ./scripts/transport_chaos.sh
# Sample timestamp allocator (TSO) for multi-client transaction tests
go run ./scripts/tso --addr 127.0.0.1:9494 --start 100
# Local three-node cluster (includes manifest bootstrap + optional TSO)
./scripts/run_local_cluster.sh --config ./raft_config.example.json
# Tear down with Ctrl+C
# Docker-compose sandbox (3 nodes + TSO)
docker compose up --build
docker compose down -v
# Build RocksDB locally (installs into ./third_party/rocksdb/dist by default)
./scripts/build_rocksdb.sh
# YCSB baseline (records=1e6, ops=1e6, warmup=1e5, conc=16)
./scripts/run_benchmarks.sh
# YCSB with RocksDB (requires CGO, `benchmark_rocksdb`, and the RocksDB build above)
LD_LIBRARY_PATH="$(pwd)/third_party/rocksdb/dist/lib:${LD_LIBRARY_PATH}" \
CGO_CFLAGS="-I$(pwd)/third_party/rocksdb/dist/include" \
CGO_LDFLAGS="-L$(pwd)/third_party/rocksdb/dist/lib -lrocksdb -lz -lbz2 -lsnappy -lzstd -llz4" \
YCSB_ENGINES="nokv,badger,rocksdb" ./scripts/run_benchmarks.sh
# One-click script (auto-detect RocksDB, supports `YCSB_*` env vars to override defaults)
./scripts/run_benchmarks.sh
# Quick smoke run (smaller dataset)
NOKV_RUN_BENCHMARKS=1 YCSB_RECORDS=10000 YCSB_OPS=50000 YCSB_WARM_OPS=0 \
./scripts/run_benchmarks.sh -ycsb_workloads=A -ycsb_engines=nokv
Tip: Pin GOCACHE/GOMODCACHE in CI to keep build artefacts local and avoid permission issues.
2. Module Coverage Overview
| Module | Tests | Coverage Highlights | Gaps / Next Steps |
|---|---|---|---|
| WAL | wal/manager_test.go | Segment rotation, sync semantics, replay tolerance for truncation, directory bootstrap. | Add IO fault injection, concurrent append stress. |
| LSM / Flush / Compaction | lsm/lsm_test.go, lsm/compact_test.go, lsm/flush/*_test.go | Memtable correctness, iterator merging, flush pipeline metrics, compaction scheduling. | Extend backpressure assertions, test cache hot/cold split. |
| Manifest | manifest/manager_test.go, manifest/levels_test.go | CURRENT swap safety, rewrite crash handling, vlog metadata persistence. | Simulate partial edit corruption, column family extensions. |
| ValueLog | vlog/vlog_test.go, vlog/gc_test.go | ValuePtr encoding/decoding, GC rewrite, concurrent iterator safety. | Long-running GC with transactions, discard ratio edge cases. |
| Transactions / Oracle | txn_test.go, txn_iterator_test.go, txn_metrics_test.go | MVCC timestamps, conflict detection, iterator snapshots, metrics accounting. | Mixed workload fuzzing, managed transactions with TTL. |
| DB Integration | db_test.go, db_recovery_test.go, db_recovery_managed_test.go | End-to-end writes, recovery, managed vs. unmanaged transactions, throttle behaviour. | Combine ValueLog GC + compaction stress, multi-DB interference. |
| CLI & Stats | cmd/nokv/main_test.go, stats_test.go | Golden JSON output, stats snapshot correctness, hot key ranking. | CLI error handling, expvar HTTP integration tests. |
| Redis Gateway | cmd/nokv-redis/backend_embedded_test.go, cmd/nokv-redis/server_test.go, cmd/nokv-redis/backend_raft_test.go | Embedded backend semantics (NX/XX, TTL, counters), RESP parser, raft backend config wiring & TSO discovery. | End-to-end multi-region CRUD with raft backend, TTL lock cleanup under failures. |
| Scripts & Tooling | scripts/scripts_test.go, cmd/nokv-config/main_test.go | serve_from_config.sh address scoping (host/docker) and manifest skipping, nokv-config JSON/simple formats, manifest logging CLI. | Golden coverage for run_local_cluster.sh, failure-path diagnostics. |
| Benchmark | benchmark/ycsb_test.go, benchmark/ycsb_runner.go | YCSB throughput/latency comparisons across engines with detailed percentile + operation mix reporting. | Automate multi-node deployments, add more workloads (D/E/F) and multi-GB datasets. |
3. System Scenarios
| Scenario | Coverage | Focus |
|---|---|---|
| Crash recovery | db_recovery_test.go, scripts/recovery_scenarios.sh | WAL replay, missing SST cleanup, vlog GC restart, manifest rewrite safety. |
| WAL pointer desync | raftstore/engine/wal_storage_test.go::TestWALStorageDetectsTruncatedSegment | Detects manifest pointer offsets beyond truncated WAL tails to avoid silent corruption. |
| Transaction contention | TestConflict, TestTxnReadAfterWrite, TestTxnDiscard | Oracle watermark handling, conflict errors, managed commit path. |
| Value separation + GC | vlog/gc_test.go, db_recovery_test.go::TestRecoveryRemovesStaleValueLogSegment | GC correctness, manifest integration, iterator stability. |
| Iterator consistency | txn_iterator_test.go, lsm/iterator_test.go | Snapshot visibility, merging iterators across levels and memtables. |
| Throttling / backpressure | lsm/compact_test.go, db_test.go::TestWriteThrottle | L0 backlog triggers, flush queue growth, metrics observation. |
| Distributed TinyKv client | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, raftstore/transport/grpc_transport_test.go::TestGRPCTransportManualTicksDriveElection | Region-aware routing, NotLeader retries, manual tick-driven elections, cross-region 2PC sequencing. |
| Performance regression | benchmark package | Compare NoKV vs Badger/RocksDB, produce human-readable reports under benchmark/benchmark_results. |
4. Observability in Tests
- RECOVERY_METRIC logs – produced when RECOVERY_TRACE_METRICS=1; consumed by the recovery script and helpful when triaging CI failures.
- TRANSPORT_METRIC logs – emitted by scripts/transport_chaos.sh when CHAOS_TRACE_METRICS=1, capturing gRPC watchdog counters during network partitions and retries.
- Stats snapshots – stats_test.go verifies JSON structure so CLI output remains backwards compatible.
- Benchmark artefacts – stored under benchmark/benchmark_results/*.txt for historical comparison. Aligns with README instructions.
5. Extending Coverage
- Property-based testing – integrate testing/quick or third-party generators to randomise transaction sequences (Badger uses similar fuzz tests for transaction ordering).
- Stress harness – add a Go-based stress driver to run mixed read/write workloads for hours, capturing metrics akin to RocksDB’s db_stress tool.
- Distributed readiness – when Raft or replication is introduced, craft tests that validate WAL shipping combined with manifest updates.
- CLI smoke tests – simulate corrupted directories to ensure CLI emits actionable errors.
Keep this matrix updated when adding new modules or scenarios so documentation and automation remain aligned.
Scripts Overview
NoKV ships a small collection of helper scripts to streamline local experimentation, demos, diagnostics, and automation. This page summarises what each script does, how to use it, and which shared configuration it consumes.
Cluster helpers
scripts/run_local_cluster.sh
- Purpose – builds nokv, nokv-config, and nokv-tso, reads raft_config.json, seeds manifests, and starts the TinyKv nodes (plus TSO when configured). If a store directory already contains a manifest (CURRENT), the seeding step is skipped so previously bootstrapped data is reused.
- Usage
./scripts/run_local_cluster.sh --config ./raft_config.example.json --workdir ./artifacts/cluster
--config defaults to the repository’s raft_config.example.json; --workdir chooses the data root (./artifacts/cluster by default). For every entry under stores the script creates store-<id>, calls nokv-config manifest, and, if tso.listen_addr is set, launches nokv-tso. The script runs in the foreground—press Ctrl+C to stop all spawned processes.
❗️ Shutdown / restart note — To avoid WAL/manifest mismatches, always stop the script with Ctrl+C and wait for the Shutting down... message. If you crash the process or the host, clean the workdir (rm -rf ./artifacts/cluster) before starting again; otherwise the replay step may panic when it encounters truncated WAL segments.
scripts/bootstrap_from_config.sh
- Purpose – manifest-only bootstrap, typically used in Docker Compose before the nodes start. Stores that already hold a manifest are detected and skipped.
- Usage
./scripts/bootstrap_from_config.sh --config /etc/nokv/raft_config.json --path-template /data/store-{id}
The script iterates over every store in the config and writes Region metadata via nokv-config manifest into the provided path template.
scripts/serve_from_config.sh
- Purpose – translate raft_config.json into a nokv serve command, avoiding manual --peer lists. It resolves peer IDs from the region metadata and maps every peer (other than the local store) to its advertised address so that gRPC transport works out of the box.
- Usage
./scripts/serve_from_config.sh \
  --config ./raft_config.json \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1 \
  --scope local   # use --scope docker inside containers
--scope decides whether to use the local addresses or the container-friendly ones. The script assembles all peer mappings (excluding the local store) and execs nokv serve.
Diagnostics & benchmarking
| Script | Purpose |
|---|---|
scripts/recovery_scenarios.sh | Runs crash-recovery scenarios across WAL/manifest/vlog. Set RECOVERY_TRACE_METRICS=1 to collect metrics under artifacts/recovery/. |
scripts/transport_chaos.sh | Injects disconnects/blocks/delay into the raftstore transport to observe behaviour under faulty networks. |
scripts/run_benchmarks.sh | Executes the comparison benchmarks (NoKV vs Badger/RocksDB). |
scripts/analyze_pprof.sh | Aggregates CPU/heap profiles from pprof_output/ and renders SVG/PNG summaries. |
scripts/debug.sh | Convenience wrapper around dlv test for targeted debugging. |
scripts/gen.sh | Generates mock data or helper artefacts (see inline comments for details). |
Other helpers
scripts/tso
A small Go program (not shell) that exposes an HTTP timestamp oracle:
go run ./scripts/tso --addr 0.0.0.0:9494 --start 100
run_local_cluster.sh and Docker Compose invoke it automatically when tso.listen_addr is present in the shared config.
Relationship with nokv-config
- nokv-config stores/regions/tso provide structured views over raft_config.json, making it easy for scripts and CI to query the topology.
- nokv-config manifest writes Region metadata into manifests and replaces the historical manifestctl binary.
- cmd/nokv-redis reads the same config; when --tso-url is omitted it falls back to the tso section.
- Go tools or custom scripts can import github.com/feichai0017/NoKV/config and call config.LoadFile/Validate to consume the same raft_config.json, avoiding divergent schemas.
Maintaining a single raft_config.json keeps local scripts, Docker Compose, Redis gateway, and automated tests aligned.
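A small Go sketch of consuming the shared file; LoadFile and Validate are named above, but whether Validate hangs off the returned value (as assumed here) should be checked against the config package.
package main

import (
	"fmt"

	"github.com/feichai0017/NoKV/config"
)

func main() {
	// Load the same topology file used by the scripts, Docker Compose, and the Redis gateway.
	cfg, err := config.LoadFile("raft_config.json")
	if err != nil {
		panic(err)
	}
	if err := cfg.Validate(); err != nil { // receiver form assumed
		panic(err)
	}
	fmt.Printf("loaded topology: %+v\n", cfg)
}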
Redis Gateway
cmd/nokv-redis exposes NoKV through a RESP-compatible endpoint. The gateway reuses the engine’s MVCC/transaction semantics and can operate in two modes:
| Mode | Description | Key flags |
|---|---|---|
Embedded (embedded) | Opens a local *NoKV.DB work directory. Commands (SET, SET NX/XX, EX/PX/EXAT/PXAT, MSET, INCR/DECR, DEL, MGET, EXISTS, …) run inside db.Update / db.View, providing atomic single-key updates and snapshot reads across multiple keys. | --workdir <dir> |
Raft (raft) | Routes requests through raftstore/client and a TinyKv cluster. Writes execute via TwoPhaseCommit; TTL metadata is stored under !redis:ttl!<key>. When --tso-url is omitted, the gateway consults the tso block in raft_config.json and falls back to a local oracle if the block is absent. | --raft-config <file>--tso-url http://host:port (optional) |
Usage examples
Embedded backend
go run ./cmd/nokv-redis \
--addr 127.0.0.1:6380 \
--workdir ./work_redis \
--metrics-addr 127.0.0.1:9100 # optional expvar endpoint
Validate with redis-cli -p 6380 ping. Metrics are exposed at http://127.0.0.1:9100/debug/vars under the NoKV.Redis key.
Raft backend
-
Start TinyKv and, if configured, the TSO using the helper script or Docker Compose. Both consume
raft_config.example.json, initialise manifests for each store, and launchnokv-tsoautomatically whentso.listen_addris present:./scripts/run_local_cluster.sh # or: docker compose up --build -
Run the gateway:
go run ./cmd/nokv-redis \ --addr 127.0.0.1:6380 \ --raft-config raft_config.example.jsonSupply
--tso-urlonly when you need to override the config file; otherwise the gateway usestso.advertise_url(orlisten_addr) from the same JSON. If the block is missing, it falls back to the embedded timestamp oracle.
Supported commands
- String operations: GET, SET, SET NX/XX, EX/PX/EXAT/PXAT, DEL, MGET, MSET, EXISTS
- Integer operations: INCR, DECR, INCRBY, DECRBY
- Utility: PING, ECHO, QUIT
In both modes write commands are atomic. The Raft backend batches multi-key updates (MSET, DEL, …) into a single TwoPhaseCommit, matching the embedded semantics. Reads use snapshot transactions locally (db.View) and leader reads with TTL checks remotely.
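For a quick smoke test without redis-cli, the gateway can be driven with raw RESP over TCP; the snippet below is a standalone sketch of the client-side framing (not gateway code) against the example address used earlier.
package main

import (
	"bufio"
	"fmt"
	"net"
)

// respCommand encodes an array of bulk strings, the framing redis-cli uses for every command.
func respCommand(args ...string) []byte {
	out := fmt.Sprintf("*%d\r\n", len(args))
	for _, a := range args {
		out += fmt.Sprintf("$%d\r\n%s\r\n", len(a), a)
	}
	return []byte(out)
}

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:6380") // gateway address from the examples above
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	r := bufio.NewReader(conn)

	for _, cmd := range [][]string{
		{"SET", "greeting", "hello"},
		{"GET", "greeting"},
		{"INCR", "counter"},
	} {
		if _, err := conn.Write(respCommand(cmd...)); err != nil {
			panic(err)
		}
		line, _ := r.ReadString('\n') // reading the first reply line is enough for a smoke test
		fmt.Printf("%v -> %s", cmd, line)
	}
}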
Configuration file
raft_config.example.json is shared by scripts/run_local_cluster.sh, Docker Compose, and the Redis gateway. Important fields:
- stores – store ID, gRPC address, and optional container listen/advertise addresses.
- regions – region ID, start/end keys (use hex:<bytes> for binary data), epoch, peer list, leader store ID.
- max_retries – maximum retries for region errors in the distributed client.
Use nokv-config to inspect or validate the configuration:
nokv-config stores --config raft_config.json
nokv-config regions --config raft_config.json --format json | jq '.[] | {id:.id, peers:.peers}'
For Go tooling, import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate to reuse the same schema and defaults across CLIs, scripts, and applications.
Metrics
With --metrics-addr enabled the gateway publishes NoKV.Redis on /debug/vars, for example:
{
"commands_total": 128,
"errors_total": 0,
"connections_active": 1,
"connections_accepted": 4,
"commands_per_operation": {
"PING": 4,
"SET": 32,
"GET": 64,
"MGET": 8,
"DEL": 10,
"INCR": 10
}
}
These counters are part of the process-wide expvar output and can be scraped alongside the rest of NoKV’s metrics.
Notes
Use this folder to capture per-debug or per-investigation notes. Keep entries short, factual, and easy to skim.
Add a new note
- Create a new file in docs/notes/ named YYYY-MM-DD-short-title.md.
- Add it to docs/SUMMARY.md under Notes.
- Use the template below to keep entries consistent.
Template
Context
Symptom
Repro
Investigation
Root cause
Fix
Follow-ups
2026-01-16 mmap choice
These are informal notes meant to spell out the reasoning behind the mmap choice more clearly, especially around what SSTable and VLog are, the scenarios they serve, and how reads and writes interact.
Concepts and roles
SSTables are the LSM’s core persistent files: sorted by key, immutable, and internally composed of indexes, data blocks, and filters, so they sit on almost every read path. The VLog stores larger values; writes are appended sequentially, the LSM keeps only a value pointer, and reads go back to the VLog for the payload. In one sentence: SSTables are read-heavy and immutable, while the VLog is written sequentially but read randomly.
Read/write interaction
The diagram below shows the main write and read paths; the key point is that reads almost always touch an SSTable, while the VLog only participates when the value has been separated out.
flowchart LR
W[Write] --> M[Memtable + WAL]
M --> C{Value large?}
C -- no --> I[Inline value]
C -- yes --> V[Append to VLog]
V --> P[Store ValuePtr]
M --> F[Flush/Compaction]
F --> S[SSTable]
R[Read] --> Q[Memtable/LSM search]
Q --> T{Inline?}
T -- yes --> U[Return value]
T -- no --> G[ValuePtr -> VLog read]
G --> U
An intuitive comparison of I/O approaches
The core advantage of mmap is that random reads are cheap, system calls are few, and reads land directly on the OS page-cache path. Its drawbacks are just as clear: RSS and page-cache usage are hard to control, writes must get msync semantics right, and cross-platform details vary considerably. By comparison, pread or buffered reads combined with a self-managed cache make memory and behaviour easier to control, but introduce extra copies and system-call overhead. Direct I/O bypasses the page cache and avoids polluting it, but the engineering complexity is high and it is not always faster for random reads.
Why SSTables are a better fit for mmap
SSTables are immutable and read frequently, so their mappings are stable and rarely need to be remapped, which keeps the engineering cost of mmap low and the payoff obvious. Since the read path is dominated by random reads, mmap turns many of them into lightweight page faults, and together with the OS page cache this naturally captures the hot spots. Adopting mmap for SSTables is therefore usually a predictable and reasonable choice.
Why we also use mmap for the VLog
Our current implementation routes the VLog through mmap as well: the read path can obtain slices directly via Bytes/View, and writes append through the mmap buffer and persist with msync, which keeps the implementation simple and consistent with the SSTable style. The cost is that VLog files tend to be larger and their random reads more scattered, so the risk of page-cache pollution is significantly higher and RSS fluctuations are more likely. If the hot/cold distribution of values is unstable, the caching benefit of mmap does not necessarily outweigh its side effects.
Comparison with Badger's approach
Badger leans towards mmap for SSTables while preferring FileIO or pread for the value log, precisely to reduce the impact of large files on the page cache and keep the hot set concentrated on SSTable blocks. It also offers configurable modes, but the overall bias reflects one idea: hot data should mainly be driven by SSTables, and the VLog should consume the page cache sparingly.
I/O choices on Linux
On Linux we can combine several I/O techniques: plain read/pread/write; mmap paired with madvise hints about access patterns; posix_fadvise or readahead for prefetch hints; and, for finer-grained control, O_DIRECT for direct I/O or io_uring for asynchronous I/O. The file package already contains a basic io_uring framework, and future work on stronger asynchronous reads/writes or concurrency scheduling can build on it.
Summary
The read-heavy, immutable nature of SSTables makes mmap a fairly safe default, while the large files and random reads of the VLog make its costs more visible. The current implementation favours engineering simplicity, but in the long run the VLog is probably better served by a pread-plus-small-cache strategy, with mmap offered as an optional mode once the hot set is stable.
2026-01-16 hotring design
Like the earlier mmap choice entry, this is a narrative note covering HotRing's design motivation, its interaction flow, and my own reading of it. It is not a "theoretically optimal" structure, but it is light, fast, and practical enough.
Motivation
In an LSM system hot spots are rarely uniform: a small set of keys keeps amplifying cache churn, read amplification, and write conflicts. HotRing's job is to make those hot spots visible quickly, so that monitoring, throttling, and tuning can find the real heat sources instead of staring at fuzzy global metrics.
Interaction flow
HotRing does not change the read or write path; it records access frequency as a side channel. Successful reads call Touch, and writes call TouchAndClamp when WriteHotKeyLimit is enabled. The stats subsystem periodically pulls TopN, and the CLI can display the hot keys directly.
flowchart LR R[Read path] --> L[LSM lookup] L --> H[HotRing.Touch] W[Write path] --> C[HotRing.TouchAndClamp] H --> B[Bucket list update] C --> B B --> S[TopN snapshot] S --> X[Stats/CLI/Debug]
Example code
ring := hotring.NewHotRing(12, nil)
ring.EnableSlidingWindow(8, 250*time.Millisecond)
ring.EnableDecay(time.Second, 1)
ring.Touch("user:42")
count, limited := ring.TouchAndClamp("user:42", 128)
if limited {
// log an alert, or trigger write throttling
_ = count
}
hot := ring.TopN(16)
_ = hot
Structural intuition and implementation choices
Internally HotRing is "fixed buckets plus sorted linked lists". A key is first hashed to a bucket and then ordered within it by tag + key. The read path is lock-free and the write path inserts nodes with CAS, avoiding the jitter a global lock would introduce. It deliberately avoids complex approximate structures and keeps the data structure simple, so it can sit on the read/write path long-term without becoming a burden.
For time semantics it offers two levers: the sliding window lets bursty hot spots surface quickly, and decay lets historical hot spots fade away naturally. Stacked together, the result tracks the intuition of "actual hotness" much more closely.
Personal takeaways
The most interesting thing about HotRing is not that it is clever but that it is good enough and stable. It turns hot keys from invisible into visible without being so complex that it creates new hot spots of its own. Much of the time what engineering really needs is a hot-key detector that works right away, not a structure that is theoretically prettier but more expensive.