NoKV Documentation

This book collects the project docs under docs/ and makes them readable via mdBook + GitHub Pages. Use the table of contents on the left to navigate.

Notes:

  • The content here mirrors the files in docs/.
  • Personal notes can be added in notes.md.

NoKV Architecture Overview

NoKV delivers a hybrid storage engine that can operate as a standalone embedded KV store or as a TinyKv-compatible distributed service. This document captures the key building blocks, how they interact, and the execution flow from client to disk.


1. High-Level Layout

┌─────────────────────────┐   TinyKv gRPC   ┌─────────────────────────┐
│ raftstore Service       │◀──────────────▶ │ raftstore/client        │
└───────────┬─────────────┘                 │  (Get / Scan / Mutate)  │
            │                               └─────────────────────────┘
            │ ReadCommand / ProposeCommand
            ▼
┌─────────────────────────┐
│ store.Store / peer.Peer │  ← multi-Raft region lifecycle
│  ├ Manifest snapshot    │
│  ├ Router / RegionHooks │
│  └ transport (gRPC)     │
└───────────┬─────────────┘
            │ Apply via kv.Apply
            ▼
┌─────────────────────────┐
│ kv.Apply + percolator   │
│  ├ Get / Scan           │
│  ├ Prewrite / Commit    │
│  └ Latch manager        │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Embedded NoKV core      │
│  ├ WAL Manager          │
│  ├ MemTable / Flush     │
│  ├ ValueLog + GC        │
│  └ Manifest / Stats     │
└─────────────────────────┘
  • Embedded mode uses NoKV.Open directly: WAL→MemTable→SST durability, ValueLog separation, MVCC semantics, rich stats.
  • Distributed mode layers raftstore on top: multi-Raft regions reuse the same WAL/Manifest, expose metrics, and serve TinyKv RPCs.
  • Clients obtain leader-aware routing, automatic NotLeader/EpochNotMatch retries, and two-phase commit helpers.

2. Embedded Engine

2.1 WAL & MemTable

  • wal.Manager appends [len|payload|crc] records, rotates segments, and replays logs on crash.
  • MemTable accumulates writes until full, then enters the flush queue; flush.Manager runs Prepare → Build → Install → Release, logs edits, and releases WAL segments.
  • Writes are handled by a single commit worker that performs value-log append first, then WAL/memtable apply, keeping durability ordering simple and consistent.

2.2 ValueLog

  • Large values are written to the ValueLog before the WAL append; the resulting ValuePtr is stored in WAL/LSM so replay can recover.
  • vlog.Manager tracks the active head and uses flush discard stats to trigger GC; manifest records new heads and removed segments.

2.3 Manifest

  • manifest.Manager stores SST metadata, WAL checkpoints, ValueLog metadata, and (importantly) Region descriptors used by raftstore.
  • CURRENT provides crash-safe pointer updates; Region state is replicated through manifest edits.

2.4 LSM Compaction & Ingest Buffer

  • compact.Manager drives compaction cycles; lsm.levelManager supplies table metadata and executes the plan.
  • Planning is split: compact.PlanFor* selects table IDs + key ranges, then LSM resolves IDs back to tables and runs the merge.
  • compact.State guards overlapping key ranges and tracks in-flight table IDs.
  • Ingest shard selection is policy-driven in compact (PickShardOrder / PickShardByBacklog) while the ingest buffer remains in lsm.
flowchart TD
  Manager["compact.Manager"] --> LSM["lsm.levelManager"]
  LSM -->|TableMeta snapshot| Planner["compact.PlanFor*"]
  Planner --> Plan["compact.Plan (fid+range)"]
  Plan -->|resolvePlanLocked| Exec["LSM executor"]
  Exec --> State["compact.State guard"]
  Exec --> Build["subcompact/build SST"]
  Build --> Manifest["manifest edits"]
  L0["L0 tables"] -->|moveToIngest| Ingest["ingest buffer shards"]
  Ingest -->|ingest-only| Main["Main tables"]
  Ingest -->|ingest-merge| Ingest

2.5 MVCC

  • txn.go exposes MVCC transactions with timestamps from oracle.
  • percolator package implements Prewrite/Commit/ResolveLock/CheckTxnStatus; kv.Apply simply dispatches Raft commands to these helpers.
  • Watermarks (utils.WaterMark) gate read snapshots and commit visibility. They are synchronous (no goroutine/channel) and advance with a single mutex + atomics to reduce select/cond wait.
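
A minimal sketch of that shape, assuming a simplified utils.WaterMark with Begin/Done/DoneUntil (illustrative only, not the actual implementation):

// Illustrative watermark: Begin registers a timestamp, Done retires it, and
// DoneUntil reports the highest timestamp with no earlier pending work.
package watermark

import (
    "sync"
    "sync/atomic"
)

type WaterMark struct {
    mu        sync.Mutex
    pending   map[uint64]int // timestamp -> outstanding Begin count
    doneUntil atomic.Uint64  // readable without taking the mutex
}

func New() *WaterMark { return &WaterMark{pending: make(map[uint64]int)} }

func (w *WaterMark) Begin(ts uint64) {
    w.mu.Lock()
    w.pending[ts]++
    w.mu.Unlock()
}

func (w *WaterMark) Done(ts uint64) {
    w.mu.Lock()
    defer w.mu.Unlock()
    if w.pending[ts]--; w.pending[ts] <= 0 {
        delete(w.pending, ts)
    }
    // Advance doneUntil to just below the earliest still-pending timestamp
    // (timestamps are assumed to start at 1).
    limit := ts
    for k := range w.pending {
        if k <= limit {
            limit = k - 1
        }
    }
    if limit > w.doneUntil.Load() {
        w.doneUntil.Store(limit)
    }
}

// DoneUntil gates read snapshots and commit visibility without blocking.
func (w *WaterMark) DoneUntil() uint64 { return w.doneUntil.Load() }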

2.6 Write Pipeline & Backpressure

  • Writes enqueue into a commit queue (db_write.go) where requests are coalesced into batches before a commit worker drains them.
  • The commit worker always writes the value log first (when needed), then applies WAL/LSM updates; SyncWrites adds a WAL fsync step.
  • Batch sizing adapts to backlog (WriteBatchMaxCount/Size, WriteBatchWait) and hot-key pressure can expand batch limits temporarily to drain spikes.
  • Backpressure is enforced in two places: LSM throttling toggles db.blockWrites when L0 backlog grows, and HotRing can reject hot keys via WriteHotKeyLimit.
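
A rough sketch of the coalescing loop with placeholder types (the real db_write.go also applies WriteBatchWait, byte-size limits, and the hot-key batch expansion mentioned above):

// entry and request are placeholders; the real structures live in db_write.go.
type entry struct{ Key, Value []byte }

type request struct {
    entries []*entry
    done    chan error
}

// commitWorker drains the queue, coalescing up to maxCount requests per
// batch, then hands the batch to writeBatch (value log first, then WAL and
// memtable apply).
func commitWorker(reqCh <-chan *request, maxCount int, writeBatch func([]*request) error) {
    for first := range reqCh {
        batch := []*request{first}
    coalesce:
        for len(batch) < maxCount { // WriteBatchMaxCount-style cap
            select {
            case next, ok := <-reqCh:
                if !ok {
                    break coalesce
                }
                batch = append(batch, next)
            default:
                break coalesce // queue drained, stop coalescing
            }
        }
        err := writeBatch(batch)
        for _, r := range batch {
            r.done <- err
        }
    }
}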

3. Replication Layer (raftstore)

Package responsibilities:

  • store – Region catalog, router, RegionMetrics, Region hooks, manifest integration, helpers such as StartPeer / SplitRegion.
  • peer – Wraps etcd/raft RawNode, handles the Ready pipeline, snapshot resend queue, backlog instrumentation.
  • engine – WALStorage/DiskStorage/MemoryStorage, reusing the DB’s WAL while keeping manifest metadata in sync.
  • transport – gRPC transport for Raft Step messages, connection management, retries/blocks/TLS. Also acts as the host for TinyKv RPC.
  • kv – TinyKv RPC handler plus kv.Apply bridging Raft commands to MVCC logic.
  • server – ServerConfig + New combine DB, Store, transport, and TinyKv service into a reusable node instance.

3.1 Bootstrap Sequence

  1. raftstore.NewServer wires DB, store configuration (StoreID, hooks, scheduler), Raft config, and transport address. It registers TinyKv RPC on the shared gRPC server and sets transport.SetHandler(store.Step).
  2. CLI (nokv serve) or application enumerates Manifest.RegionSnapshot() and calls Store.StartPeer for every Region containing the local store:
    • peer.Config includes Raft params, transport, kv.NewEntryApplier, WAL/Manifest handles, Region metadata.
    • Router registration, regionManager bookkeeping, optional Peer.Bootstrap with initial peer list, leader campaign.
  3. Peers from other stores can be configured through transport.SetPeer(storeID, addr), allowing dynamic updates from a scheduler.

3.2 Command Paths

  • ReadCommand (KvGet/KvScan): validate Region & leader, flush pending Ready, then run commandApplier (i.e. kv.Apply in read mode) to fetch data directly from the DB. This yields leader-strong reads without a Raft round trip.
  • ProposeCommand (write): encode the request, push through Router to the leader peer, replicate via Raft, and apply in kv.Apply which maps to MVCC operations.

3.3 Transport

  • gRPC server handles Step RPCs and TinyKv RPCs on the same endpoint; peers are registered via SetPeer.
  • Retry policies (WithRetry) and TLS credentials are configurable. Tests cover partitions, blocked peers, and slow followers.

4. TinyKv Service

raftstore/kv/service.go exposes pb.TinyKv RPCs:

  • KvGet: store.ReadCommand → kv.Apply GET → pb.GetResponse / RegionError
  • KvScan: store.ReadCommand → kv.Apply SCAN → pb.ScanResponse / RegionError
  • KvPrewrite: store.ProposeCommand → percolator.Prewrite → pb.PrewriteResponse
  • KvCommit: store.ProposeCommand → percolator.Commit → pb.CommitResponse
  • KvResolveLock: percolator.ResolveLock → pb.ResolveLockResponse
  • KvCheckTxnStatus: percolator.CheckTxnStatus → pb.CheckTxnStatusResponse

nokv serve is the CLI entry point: it opens the DB, constructs raftstore.Server, registers peers, starts the local Raft peers, and displays a manifest summary (Regions, key ranges, peers). scripts/run_local_cluster.sh builds the CLI, writes a minimal region manifest, launches multiple nokv serve processes on localhost, and handles cleanup on Ctrl+C.


5. Client Workflow

raftstore/client offers a leader-aware client with retry logic and convenient helpers:

  • Initialization: provide []StoreEndpoint + []RegionConfig describing region boundaries and known leaders.
  • Reads: Get and Scan pick the leader store for a key range, issue TinyKv RPCs, and retry on NotLeader/EpochNotMatch.
  • Writes: Mutate bundles operations per region and drives Prewrite/Commit (primary first, secondaries after); Put and Delete are convenience wrappers using the same 2PC path.
  • Timestamps: clients must supply startVersion/commitVersion. For distributed demos, reuse the TSO sample under scripts/tso to obtain globally increasing values before calling TwoPhaseCommit.
  • Bootstrap helpers: scripts/run_local_cluster.sh --config raft_config.example.json builds the binaries, seeds manifests via nokv-config manifest, launches the stores declared in the config, and starts the HTTP TSO allocator when the tso block is present.

Example (two regions)

  1. Regions [a,m) and [m,+∞), each led by a different store.
  2. Mutate(ctx, primary="alfa", mutations, startTs, commitTs, ttl) prewrites and commits across the relevant regions.
  3. Get/Scan retries automatically if the leader changes.
  4. See raftstore/server/server_client_integration_test.go for a full end-to-end example using real raftstore.Server instances.
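
A hedged sketch of that flow; client.New, client.Mutation, OpPut, and tso.Next are placeholder names rather than the actual raftstore/client API, which the integration test above demonstrates:

// Illustrative only: constructor, types, and signatures are placeholders.
cli := client.New(
    []client.StoreEndpoint{ /* one endpoint per store */ },
    []client.RegionConfig{ /* [a,m) and [m,+inf) with leader hints */ },
)

startTs := tso.Next()  // callers supply timestamps, e.g. via scripts/tso
commitTs := tso.Next()

muts := []client.Mutation{
    {Op: client.OpPut, Key: []byte("alfa"), Value: []byte("1")},
    {Op: client.OpPut, Key: []byte("mike"), Value: []byte("2")}, // lands in the second region
}
// Prewrite + Commit across both regions, primary first; NotLeader and
// EpochNotMatch responses are retried before an error is returned.
if err := cli.Mutate(ctx, []byte("alfa"), muts, startTs, commitTs, ttl); err != nil {
    return err
}
val, err := cli.Get(ctx, []byte("mike"), commitTs+1) // leader-aware read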

6. Failure Handling

  • Manifest edits capture Region metadata, WAL checkpoints, and ValueLog pointers. Restart simply reads CURRENT and replays edits.
  • WAL replay reconstructs memtables and Raft groups; ValueLog recovery trims partial records.
  • Stats.StartStats resumes metrics sampling immediately after restart, making it easy to verify recovery correctness via nokv stats.

7. Observability & Tooling

  • StatsSnapshot publishes flush/compaction/WAL/VLog/txn/region metrics. nokv stats and the expvar endpoint expose the same data.
  • nokv regions inspects Manifest-backed Region metadata.
  • nokv serve advertises Region samples on startup (ID, key range, peers) for quick verification.
  • Scripts:
    • scripts/run_local_cluster.sh – launch a multi-node TinyKv cluster locally.
    • scripts/recovery_scenarios.sh – crash-recovery test harness.
    • scripts/transport_chaos.sh – inject network faults and observe transport metrics.

8. When to Use NoKV

  • Embedded: call NoKV.Open, use the MVCC store locally.
  • Distributed: deploy nokv serve nodes, use raftstore/client (or any TinyKv gRPC client) to perform reads, scans, and 2PC writes.
  • Observability-first: inspection via CLI or expvar is built-in; Region, WAL, Flush, and Raft metrics are accessible without extra instrumentation.

See also docs/raftstore.md for deeper internals and docs/testing.md for coverage details.

Configuration & Options

NoKV exposes two configuration surfaces:

  1. Runtime options for the embedded engine (Options in options.go).
  2. Cluster topology for distributed mode (raft_config.example.json via config.LoadFile/Validate).

1. Runtime Options (Embedded Engine)

NoKV.NewDefaultOptions() returns a tuned baseline. Override fields before calling NoKV.Open(opt).

Key option groups (see options.go for the full list):

  • Paths & durability
    • WorkDir, SyncWrites, ManifestSync, ManifestRewriteThreshold
  • Write pipeline
    • WriteBatchMaxCount, WriteBatchMaxSize, WriteBatchWait
    • CommitPipelineDepth, CommitApplyConcurrency
  • Value log
    • ValueThreshold, ValueLogFileSize, ValueLogMaxEntries
    • ValueLogGCInterval, ValueLogGCDiscardRatio
    • ValueLogGCSampleSizeRatio, ValueLogGCSampleCountRatio, ValueLogGCSampleFromHead
  • LSM & compaction
    • MemTableSize, MemTableEngine, SSTableMaxSz, NumCompactors
    • NumLevelZeroTables, IngestCompactBatchSize, IngestBacklogMergeScore
    • CompactionValueWeight, CompactionValueAlertThreshold
  • Caches
    • BlockCacheSize, BloomCacheSize
  • Hot key throttling
    • WriteHotKeyLimit, HotWriteBurstThreshold, HotWriteBatchMultiplier
    • HotRingEnabled, HotRingTopK, decay/window settings
  • WAL watchdog
    • EnableWALWatchdog, WALAutoGCInterval
    • WALAutoGCMinRemovable, WALAutoGCMaxBatch
    • WALTypedRecordWarnRatio, WALTypedRecordWarnSegments
  • Raft lag warnings (stats only)
    • RaftLagWarnSegments

Example:

opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./data"
opt.SyncWrites = true
opt.ValueThreshold = 1024
opt.WriteBatchMaxCount = 128
db := NoKV.Open(opt)
defer db.Close()

2. Raft Topology File

raft_config.example.json is the single source of truth for distributed topology. It is consumed by scripts, cmd/nokv-redis, and the config package.

Minimal shape:

{
  "max_retries": 8,
  "tso": { "listen_addr": "127.0.0.1:9494", "advertise_url": "http://127.0.0.1:9494" },
  "stores": [
    { "store_id": 1, "listen_addr": "127.0.0.1:20170", "addr": "127.0.0.1:20170" }
  ],
  "regions": [
    {
      "id": 1,
      "start_key": "-",
      "end_key": "-",
      "epoch": { "version": 1, "conf_version": 1 },
      "peers": [{ "store_id": 1, "peer_id": 101 }],
      "leader_store_id": 1
    }
  ]
}

Notes:

  • start_key / end_key accept plain strings, hex:<bytes>, or base64. Use "-" or empty for unbounded ranges.
  • stores define both host and docker addresses for local runs vs containers.
  • leader_store_id is optional; clients use it for initial routing hints.

Programmatic loading:

cfg, _ := config.LoadFile("raft_config.example.json")
if err := cfg.Validate(); err != nil { /* handle */ }

Related tools:

  • scripts/run_local_cluster.sh --config raft_config.example.json
  • go run ./cmd/nokv-redis --raft-config raft_config.example.json

CLI (cmd/nokv) Reference

The nokv command provides operational visibility similar to RocksDB’s ldb and Badger’s badger CLI, but emits JSON to integrate easily with scripts and CI pipelines.


Installation

go install ./cmd/nokv

Use GOBIN if you prefer a custom binary directory.


Shared Flags

  • --workdir <path> – location of the NoKV database (must contain CURRENT).
  • --json – emit structured JSON (default is human-readable tables).
  • --expvar <url> – for stats command, pull metrics from a running process exposing expvar.
  • --no-region-metrics – for stats offline mode; skip attaching RegionMetrics and report manifest-only figures.

Subcommands

nokv stats

  • Reads StatsSnapshot either offline (--workdir) or via HTTP (--expvar).
  • Output fields include:
    • flush_queue_length, flush_wait_ms, flush_build_ms
    • compaction_backlog, wal_active_segment, wal_segments_removed
    • vlog_head, vlog_segments, vlog_pending_deletes, vlog_discard_queue
    • txns_active, txns_committed, txns_conflicts
    • region_total (plus region_new, region_running, region_removing, region_tombstone, region_other)
    • hot_keys (Top-N hits captured by hotring)
  • Example:
nokv stats --workdir ./testdata/db --json | jq '.flush_queue_length'

nokv manifest

  • Parses the manifest using manifest.Manager.Version().
  • Reports per-level file counts, smallest/largest keys, WAL checkpoint, and ValueLog metadata.
  • Helpful for verifying flush/compaction results and ensuring manifest rewrites succeeded.

nokv vlog

  • Lists vlog segments with status flags (active, candidate_for_gc, deleted).
  • Shows head file/offset and pending GC actions.
  • Use after running GC or recovery to confirm stale segments are purged.

Integration Tips

  • Combine with RECOVERY_TRACE_METRICS=1 to cross-check logs: run tests, then inspect CLI output to ensure metrics match expectations.
  • In CI, capture JSON output and diff against golden files to detect regressions (see cmd/nokv/main_test.go).
  • When comparing against RocksDB/Badger, treat nokv manifest + nokv vlog as equivalents to ldb manifest_dump and Badger’s badger inspect vlog commands.

For architecture context, see architecture.md and the module deep dives.

  • nokv regions – Dumps the manifest-backed Region catalog (ID/state/key range/peers). Supports --json for automation and complements the Region metrics shown in nokv stats.

Memtable Design & Lifecycle

NoKV’s write path mirrors RocksDB: every write lands in the WAL and an in-memory memtable backed by a selectable in-memory index (skiplist or ART). The implementation lives in lsm/memtable.go and ties directly into the flush manager (lsm/flush).


1. Structure

type memTable struct {
    lsm        *LSM
    segmentID  uint32       // WAL segment backing this memtable
    index      memIndex
    maxVersion uint64
    walSize    int64
}

The memtable index is an interface that can be backed by either a skiplist or ART:

type memIndex interface {
    Add(*kv.Entry)
    Search([]byte) kv.ValueStruct
    NewIterator(*utils.Options) utils.Iterator
    MemSize() int64
    IncrRef()
    DecrRef()
}
  • Memtable engine – Options.MemTableEngine selects skiplist (default) or art via newMemIndex. Skiplist favors simpler writes; ART favors tighter memory and ordered scans.
  • Arena sizing – utils.NewSkiplist uses arenaSizeFor; utils.NewART uses arenaSizeForART to reserve more space for variable node payloads and prefix spills.
  • WAL coupling – every Set uses kv.EncodeEntry to materialise the payload to the active WAL segment before inserting into the chosen index. walSize tracks how much of the segment is consumed so flush can release it later.
  • Segment ID – LSM.NewMemtable atomically increments levels.maxFID, switches the WAL to a new segment (wal.Manager.SwitchSegment), and tags the memtable with that FID. This matches RocksDB’s logfile_number field.
  • ART specifics – ART stores prefix-compressed inner nodes (Node4/16/48/256), uses optimistic version checks for reads with localized locks for writes, and iterators walk the tree in key order.
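
The Set path above, reduced to a sketch (signatures and error handling are simplified relative to lsm/memtable.go; Append is shown returning a single EntryInfo):

// Sketch only: encode the payload, make it durable in the WAL, then insert
// into the selected in-memory index.
func (mt *memTable) set(e *kv.Entry) error {
    payload := kv.EncodeEntry(e)            // same layout as the vlog encoding
    info, err := mt.lsm.wal.Append(payload) // durable before the index insert
    if err != nil {
        return err
    }
    mt.walSize += int64(info.Length) // consumed bytes; lets flush release the segment later
    if e.Version > mt.maxVersion {
        mt.maxVersion = e.Version
    }
    mt.index.Add(e) // skiplist or ART, per Options.MemTableEngine
    return nil
}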

2. Lifecycle

sequenceDiagram
    participant WAL
    participant MT as MemTable
    participant Flush
    participant Manifest
    WAL->>MT: Append+Set(entry)
    MT->>Flush: freeze (Size() >= limit)
    Flush->>Manifest: LogPointer + AddFile
    Manifest-->>Flush: ack
    Flush->>WAL: Release segments ≤ segmentID
  1. Active → Immutable – when mt.Size() crosses thresholds (Options.MemTableSize), the memtable is swapped out and pushed onto the flush queue. The new active memtable triggers another WAL segment switch.
  2. Flush – the flush manager drains immutable memtables, builds SSTables, logs manifest edits, and releases the WAL segment ID recorded in memTable.segmentID once the SST is durably installed.
  3. Recovery – LSM.recovery scans WAL files, reopens memtables per segment (most recent becomes active), and deletes segments ≤ the manifest’s log pointer. Entries are replayed via wal.Manager.ReplaySegment into fresh indexes, rebuilding maxVersion for the oracle.

Badger follows the same pattern, while RocksDB often uses skiplist-backed arenas with reference counting—NoKV reuses Badger’s arena allocator for simplicity.


3. Read Semantics

  • memTable.Get looks up the chosen index and returns a copy of the entry. MVCC versions stay encoded in the key suffix (KeyWithTs), so iterators naturally merge across memtables and SSTables.
  • MemTable.IncrRef/DecrRef delegate to the index, allowing iterators to hold references while the flush manager processes immutable tables—mirroring RocksDB’s MemTable::Ref/Unref lifecycle.
  • WAL-backed values that exceed the value threshold are stored as pointers; the memtable stores the encoded pointer, and the transaction/iterator logic reads from the vlog on demand.

4. Integration with Other Subsystems

  • Transactions – Txn.commitAndSend writes entries into the active memtable after WAL append; pending writes bypass the memtable until commit so per-txn isolation is preserved.
  • Manifest – Flush completion logs EditLogPointer(segmentID) so restart can discard WAL files already persisted into SSTs.
  • Stats – Stats.Snapshot pulls FlushPending/Active/Queue counters via lsm.FlushMetrics, exposing how many immutables are waiting.
  • Value Log – lsm.flush emits discard stats keyed by segmentID, letting the value log GC know when entries become obsolete.

5. Comparison

  • Data structure – RocksDB: skiplist + arena; BadgerDB: skiplist + arena; NoKV: skiplist or ART + arena.
  • WAL linkage – RocksDB: logfile_number per memtable; BadgerDB: segment ID stored in vlog entries; NoKV: segmentID on memTable, logged via manifest.
  • Recovery – RocksDB: memtable replays from WAL, referencing MANIFEST; BadgerDB: replays WAL segments; NoKV: replays WAL segments, prunes ≤ manifest log pointer.
  • Flush trigger – RocksDB: size/entries/time; BadgerDB: size-based; NoKV: size-based with explicit queue metrics.

6. Operational Notes

  • Tuning Options.MemTableSize affects WAL segment count and flush latency. Larger memtables reduce flush churn but increase crash recovery time.
  • Monitor NoKV.Stats.Flush.* metrics to catch stalled immutables—an ever-growing queue often indicates slow SST builds or manifest contention.
  • Because memtables carry WAL segment IDs, deleting WAL files manually can lead to recovery failures; always rely on the engine’s manifest-driven cleanup.

See docs/flush.md for the end-to-end flush scheduler and [docs/architecture.md](architecture.md#3-end-to-end-write-flow) for where memtables sit in the write pipeline.

MemTable Flush Pipeline

NoKV’s flush subsystem translates immutable memtables into persisted SSTables while coordinating WAL checkpoints and ValueLog discard statistics. The code lives in lsm/flush and is tightly integrated with DB.doWrites and manifest.Manager.


1. Responsibilities

  1. Reliability – ensure immutables become SSTables atomically, and failures are recoverable.
  2. Coordination – release WAL segments only after manifest commits, and feed discard stats to ValueLog GC.
  3. Observability – expose queue depth, stage durations, and task counts through Stats.collect and the CLI.

Compared with RocksDB: the stage transitions mirror RocksDB’s flush job lifecycle (PickMemTable, WriteLevel0Table, InstallMemTable), while the discard stats channel is inspired by Badger’s integration with vlog GC.


2. Stage Machine

flowchart LR
    Active[Active MemTable]
    Immutable[Immutable MemTable]
    FlushQ[flush.Manager queue]
    Build[StageBuild]
    Install[StageInstall]
    Release[StageRelease]

    Active -->|threshold reached| Immutable --> FlushQ
    FlushQ --> Build --> Install --> Release --> Active
  • StagePrepare – Manager.Submit assigns a task ID, records enqueue time, and bumps queue metrics.
  • StageBuild – Manager.Next hands tasks to background workers. buildTable serialises data into a temporary .sst.tmp using lsm/builder.go.
  • StageInstall – manifest edits (EditAddFile, EditLogPointer) are logged. Only on success is the temp file renamed and the WAL checkpoint advanced.
  • StageRelease – metrics record release duration, discard stats are flushed to valueLog.lfDiscardStats, and wal.Manager.Remove drops obsolete segments.

Manager.Update transitions between stages and collects timing data (WaitNs, BuildNs, ReleaseNs). These appear as NoKV.Flush.Queue, NoKV.Flush.BuildAvgMs, etc., in CLI output.


3. Key Types

type Task struct {
    ID        uint64
    SegmentID uint32
    Stage     Stage
    Data      any      // memtable pointer, temp file info, etc.
    Err       error
}

type Manager struct {
    queue  []*Task
    active map[uint64]*Task
    cond   *sync.Cond
    // atomic metrics fields (pending, queueLen, waitNs...)
}
  • Stage enumerates StagePrepare, StageBuild, StageInstall, StageRelease.
  • Metrics aggregates pending/active counts and nanosecond accumulators; the CLI converts them to human-friendly durations.
  • The queue uses condition variables to coordinate between background workers and producers; the design avoids busy waiting, unlike some RocksDB flush queues.

4. Execution Path in Code

  1. DB.applyBatches detects when the active memtable is full and hands it to lsm.LSM.scheduleFlush, which calls flush.Manager.Submit.
  2. Background goroutines call Next to retrieve tasks; lsm.(*LSM).runFlushMemTable performs the build and install phases.
  3. lsm.(*LSM).installLevel0Table writes the manifest edit and renames the SST (atomic os.Rename, same as RocksDB’s flush job).
  4. After install, valueLog.updateDiscardStats is called so GC can reclaim vlog entries belonging to dropped keys.
  5. Once release completes, wal.Manager.Remove evicts segments whose entries are fully represented in SSTs, matching RocksDB’s LogFileManager::PurgeObsoleteLogs.
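
Putting the stages together, a background worker loop looks roughly like this (Submit/Next/Update signatures are simplified; the real loop lives in lsm):

// Simplified worker: the real code reports per-stage timings through
// Manager.Update and handles retries and shutdown more carefully.
func flushWorker(m *flush.Manager, l *LSM) {
    for {
        task := m.Next() // blocks until a task is ready to build
        if task == nil {
            return // manager closed
        }
        m.Update(task, flush.StageBuild)
        tmpSST, err := l.buildTable(task) // write <id>.sst.tmp
        if err == nil {
            m.Update(task, flush.StageInstall)
            err = l.installLevel0Table(task, tmpSST) // manifest edit + atomic rename
        }
        task.Err = err
        m.Update(task, flush.StageRelease) // release WAL segments <= task.SegmentID, record metrics
    }
}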

5. Recovery Considerations

  • Before Install – temp files remain in tmp/. On restart, no manifest entry exists, so lsm.LSM.replayManifest ignores them and the memtable is rebuilt from WAL.
  • After Install but before Release – manifest records the SST while WAL segments may still exist. Recovery sees the edit, ensures the file exists, and release metrics resume from StageRelease.
  • Metrics – because timing data is stored atomically in the manager, recovery resets counters but does not prevent the CLI from reporting backlog immediately after restart.

RocksDB uses flush job logs; NoKV reuses metrics and CLI output for similar visibility.


6. Observability & CLI

  • StatsSnapshot.Flush.Queue – number of pending tasks.
  • StatsSnapshot.Flush.WaitMs – average wait time before build.
  • StatsSnapshot.Flush.BuildMs – average build duration.
  • StatsSnapshot.Flush.Completed – cumulative tasks finished.

The CLI command nokv stats --workdir <dir> prints these metrics alongside compaction and transaction statistics, enabling operators to detect stalled flush workers or WAL backlog quickly.


7. Interplay with ValueLog GC

Flush completion sends discard stats via db.lsm.SetDiscardStatsCh(&(db.vlog.lfDiscardStats.flushChan)). ValueLog GC uses this feed to determine how much of each vlog segment is obsolete, similar to Badger’s discard ratio heuristic. Without flush-driven stats, vlog GC would have to rescan SSTables, so this channel is crucial for keeping GC cheap.


8. Testing Matrix

  • lsm/flush/manager_test.go (implicit via lsm/lsm_test.go) validates stage transitions and metrics.
  • db_recovery_test.go covers crash scenarios before/after install, ensuring WAL replay plus manifest reconciliation recovers gracefully.
  • Future additions: inject write failures during StageBuild to test retry logic, analogous to RocksDB’s simulated IO errors.

See the recovery plan and testing matrix for more context.

Compaction & Cache Strategy

NoKV’s compaction pipeline borrows the leveled‑LSM layout from RocksDB, but layers it with an ingest buffer, lightweight cache telemetry, and simple concurrency guards so the implementation stays approachable while still handling bursty workloads.


1. Overview

Compactions are orchestrated by compact.Manager with lsm.levelManager implementing the executor hooks. Each level owns two lists of tables:

  • tables – the canonical sorted run for the level.
  • ingest – a staging buffer that temporarily holds SSTables moved from the level above when there is not yet enough work (or bandwidth) to do a full merge.

The compaction manager periodically calls into the executor to build a list of compact.Priority entries. The priorities consider three signals:

  1. L0 table count – loosely capped by Options.NumLevelZeroTables.
  2. Level size vs target – computed by levelTargets(), which dynamically adjusts the “base” level depending on total data volume.
  3. Ingest buffer backlog – if a level’s ingest shards have data, they receive elevated scores so staged tables are merged promptly.

The highest adjusted score is processed first. L0 compactions can either move tables into the ingest buffer of the base level (cheap re‑parenting) or compact directly into a lower level when the overlap warrants it.

Planning now happens via compact.Plan: LSM snapshots table metadata into compact.TableMeta, compact.PlanFor* selects table IDs + key ranges, and LSM resolves the plan back to *table before executing.
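
A simplified version of the scoring (the real compact heuristics also weigh value density and adjust the base level dynamically):

// Illustrative priority score, not the exact compact.Manager formula: L0 is
// scored by table count, other levels by size versus target, and ingest
// backlog lifts the score so staged tables are merged promptly.
func levelScore(level, l0Tables, maxL0 int, sizeBytes, targetBytes, ingestBytes int64) float64 {
    if level == 0 {
        return float64(l0Tables) / float64(maxL0) // maxL0 = Options.NumLevelZeroTables
    }
    score := float64(sizeBytes) / float64(targetBytes)
    if ingestBytes > 0 {
        score += float64(ingestBytes) / float64(targetBytes) // ingest backlog boost
    }
    return score
}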


2. Ingest Buffer

moveToIngest (see lsm/executor.go) performs a metadata-only migration:

  1. Records a manifest.EditDeleteFile for the source level.
  2. Logs a new manifest.EditAddFile targeting the destination level.
  3. Removes the table from thisLevel.tables and appends it to nextLevel.ingest.

This keeps write amplification low when many small L0 tables arrive at once. Reads still see the newest data because levelHandler.searchIngestSST checks ingest before consulting tables.

Compaction tests (lsm/compaction_cache_test.go) now assert that after calling moveToIngest the table disappears from the source level and shows up in the ingest buffer.


3. Concurrency Guards

To prevent overlapping compactions:

  • compact.State.CompareAndAdd tracks the key range of each in-flight compaction per level.
  • Attempts to register a compaction whose ranges intersect an existing one are rejected.
  • When a compaction finishes, compact.State.Delete removes the ranges and table IDs from the guard.

This mechanism is intentionally simple—just a mutex‐protected slice—yet effective in tests (TestCompactStatusGuards) that simulate back‑to‑back registrations on the same key range.


4. Cache Telemetry

NoKV’s cache is split into three parts (lsm/cache.go):

  • Block cache (hot) – LRU list capturing the most recent hits (typically L0/L1). Metrics hook: cacheMetrics.recordBlock(level, hit).
  • Block cache (cold) – CLOCK cache for deeper levels, keeping the memory footprint bounded. Metrics hook: cacheMetrics.recordBlock(level, hit), same as the hot tier.
  • Bloom cache – Stores decoded bloom filters to reduce disk touches. Metrics hook: recordBloom(hit).

CacheMetrics() on DB surfaces hits/misses per layer, which is especially helpful when tuning ingest behaviour—if L0/L1 cache misses spike, the ingest buffer likely needs to be drained faster. TestCacheHotColdMetrics verifies that the hot and cold tiers tick the counters as expected.


5. Interaction with Value Log

Compaction informs value‑log GC via discard statistics:

  1. During subcompact, every entry merged out is inspected. If it stores a ValuePtr, the amount is added to the discard map.
  2. At the end of subcompaction, the accumulated discard map is pushed through setDiscardStatsCh.
  3. valueLog receives the stats and can safely rewrite or delete vlog segments with predominantly obsolete data.

This tight coupling keeps the value log from growing indefinitely after heavy overwrite workloads.


6. Testing Checklist

Relevant tests to keep compaction healthy:

  • lsm/compaction_cache_test.go
    • TestCompactionMoveToIngest – ensures metadata migration works and the ingest buffer grows.
    • TestCacheHotColdMetrics – validates cache hit accounting.
    • TestCompactStatusGuards – checks overlap detection.
  • lsm/lsm_test.go
    • TestCompact / TestHitStorage – end‑to‑end verification that data remains queryable across memtable flushes and compactions.

When adding new compaction heuristics or cache tiers, extend these tests (or introduce new ones) so the behaviour stays observable.


7. Practical Tips

  • Tune Options.IngestCompactBatchSize when ingest queues build up; increasing it lets a single move cover more tables.
  • Observe DB.CacheMetrics() and DB.CompactionStats() via the CLI (nokv stats) to decide whether you need more compaction workers or bigger caches.
  • For workloads dominated by range scans, consider increasing Options.BlockCacheSize if you want to keep more L0/L1 blocks in the user-space cache; cold data relies on the OS page cache.
  • Keep an eye on NoKV.ValueLog.GcRuns and ValueLogHeadUpdates; if compactions are generating discard stats but the value log head doesn’t move, GC thresholds may be too conservative.

With these mechanisms, NoKV stays resilient under bursty writes while keeping the code path small and discoverable—ideal for learning or embedding. Dive into the source files referenced above for deeper implementation details.

Ingest Buffer Architecture (English)

The ingest buffer is a per-level staging area for SSTables—typically promoted from L0—designed to absorb bursts, reduce overlap, and unlock parallel compaction without touching the main level tables immediately. It combines fixed sharding, adaptive scheduling, and optional ingest-only merge to keep write amplification and contention low.

flowchart LR
  L0["L0 SSTables"] -->|moveToIngest| Ingest["Ingest Buffer (sharded)"]
  subgraph levelN["Level N"]
    Ingest -->|ingest-only compact| MainTables["Main Tables"]
    Ingest -->|ingest-merge| Ingest
  end
  Ingest -.read path merge.-> ClientReads["Reads/Iterators"]

Design Highlights

  • Sharded by key prefix: ingest tables are routed into fixed shards (top bits of the first byte). Sharding cuts cross-range overlap and enables safe parallel drain.
  • Snapshot-friendly reads: ingest tables are read under the level RLock, and iterators hold table refs so mmap-backed data stays valid without additional snapshots.
  • Two ingest paths:
    • Ingest-only compaction: drain ingest → main level (or next level) with optional multi-shard parallelism guarded by compact.State.
    • Ingest-merge: compact ingest tables back into ingest (stay in-place) to drop superseded versions before promoting, reducing downstream write amplification.
  • IngestMode enum: plans carry an IngestMode with IngestNone, IngestDrain, and IngestKeep. IngestDrain corresponds to ingest-only, while IngestKeep corresponds to ingest-merge.
  • Adaptive scheduling:
    • Shard selection is driven by compact.PickShardOrder / compact.PickShardByBacklog using per-shard size, age, and density.
    • Shard parallelism scales with backlog score (based on shard size/target file size) bounded by IngestShardParallelism.
    • Batch size scales with shard backlog to drain faster under pressure.
    • Ingest-merge triggers when backlog score exceeds IngestBacklogMergeScore (default 2.0), with dynamic lowering under extreme backlog/age.
  • Observability: expvar/stats expose ingest-only vs ingest-merge counts, duration, and tables processed, plus ingest size/value density per level/shard.

Configuration

  • IngestShardParallelism: max shards to compact in parallel (default max(NumCompactors/2, 2), auto-scaled by backlog).
  • IngestCompactBatchSize: base batch size per ingest compaction (auto-boosted by shard backlog).
  • IngestBacklogMergeScore: backlog score threshold to trigger ingest-merge (default 2.0).
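
The backlog heuristics reduce to a small sketch (the real compact policy also factors shard age and value density):

// Illustrative only: backlog score is shard size over the target file size;
// it drives both drain parallelism and the ingest-merge decision.
func shardBacklogScore(shardBytes, targetFileSize int64) float64 {
    if targetFileSize <= 0 {
        return 0
    }
    return float64(shardBytes) / float64(targetFileSize)
}

func planIngest(score float64, maxParallel int, mergeScore float64) (parallelism int, merge bool) {
    parallelism = 1 + int(score) // scale drain parallelism with backlog
    if parallelism > maxParallel { // bounded by IngestShardParallelism
        parallelism = maxParallel
    }
    merge = score >= mergeScore // IngestBacklogMergeScore, default 2.0
    return parallelism, merge
}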

Benefits

  • Lower write amplification: bursty L0 SSTables land in ingest first; ingest-merge prunes duplicates before full compaction.
  • Reduced contention: sharding + compact.State allow parallel ingest drain with minimal overlap.
  • Predictable reads: ingest is part of the read snapshot, so moving tables in/out does not change read semantics.
  • Tunable and observable: knobs for parallelism and merge aggressiveness, with per-path metrics to guide tuning.

Future Work

  • Deeper adaptive policies (IO/latency-aware), richer shard-level metrics, and more exhaustive parallel/restart testing under fault injection.

WAL Subsystem

NoKV’s write-ahead log mirrors RocksDB’s durability model and is implemented as a compact Go module similar to Badger’s journal. WAL appends happen alongside memtable writes (via lsm.Set), while values that are routed to the value log are written before the WAL so that replay always sees durable value pointers.


1. File Layout & Naming

  • Location: ${Options.WorkDir}/wal/.
  • Naming pattern: %05d.wal (e.g. 00001.wal).
  • Rotation threshold: configurable via wal.Config.SegmentSize (defaults to 64 MiB, minimum 64 KiB).
  • Verification: wal.VerifyDir ensures the directory exists prior to DB.Open.

On open, wal.Manager scans the directory, sorts segment IDs, and resumes the highest ID—exactly how RocksDB re-opens its MANIFEST and WAL sequence files.


2. Record Format

uint32 length (big-endian, includes type + payload)
uint8  type
[]byte payload
uint32 checksum (CRC32 Castagnoli over type + payload)
  • Checksums use kv.CastagnoliCrcTable, the same polynomial used by RocksDB (Castagnoli). Record encoding/decoding lives in wal/record.go.
  • The type byte allows mixing LSM mutations with raft log/state/snapshot records in the same WAL segment.
  • Appends are buffered by bufio.Writer so batches become single system calls.
  • Replay stops cleanly at truncated tails; tests simulate torn writes by truncating the final bytes and verifying replay remains idempotent (wal/manager_test.go::TestReplayTruncatedTail).
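
Encoding a record in this layout is a few lines; the sketch below follows the format above but is not the code in wal/record.go:

package walrecord

import (
    "encoding/binary"
    "hash/crc32"
)

// Same polynomial as kv.CastagnoliCrcTable.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeRecord lays out [len | type | payload | crc] as described above.
func encodeRecord(typ byte, payload []byte) []byte {
    buf := make([]byte, 4+1+len(payload)+4)
    binary.BigEndian.PutUint32(buf[0:4], uint32(1+len(payload))) // length covers type + payload
    buf[4] = typ
    copy(buf[5:], payload)
    sum := crc32.Checksum(buf[4:5+len(payload)], castagnoli) // CRC over type + payload
    binary.BigEndian.PutUint32(buf[5+len(payload):], sum)
    return buf
}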

3. Public API (Go)

mgr, _ := wal.Open(wal.Config{Dir: path})
infos, _ := mgr.Append(batchPayload)
_ = mgr.Sync()
_ = mgr.Rotate()
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
    // reapply to memtable
    return nil
})

Key behaviours:

  • Append automatically calls ensureCapacity to decide when to rotate; it returns EntryInfo{SegmentID, Offset, Length} for each payload so higher layers can build value pointers or manifest checkpoints.
  • Sync flushes the active file (used for Options.SyncWrites).
  • Rotate forces a new segment (used after flush/compaction checkpoints similar to RocksDB’s LogFileManager::SwitchLog).
  • Replay iterates segments in numeric order, forwarding each payload to the callback. Errors abort replay so recovery can surface corruption early.
  • Metrics (wal.Manager.Metrics) reveal the active segment ID, total segments, and number of removed segments—these feed directly into StatsSnapshot and nokv stats output.

Compared with Badger: Badger keeps a single vlog for both data and durability. NoKV splits WAL (durability) from vlog (value separation), matching RocksDB’s separation of WAL and blob files.


4. Integration Points

  • lsm.memTable.set – Encodes each entry (kv.EncodeEntry) and appends to WAL before inserting into the skiplist.
  • DB.commitWorker – Applies batched writes via writeToLSM, which flows into lsm.Set and thus WAL.
  • DB.Set – Direct write path: calls lsm.Set, which appends to WAL and updates the memtable.
  • manifest.Manager.LogEdit – Uses EntryInfo.SegmentID to persist the WAL checkpoint (EditLogPointer). This acts as the log number seen in RocksDB manifest entries.
  • lsm/flush.Manager.Update – Once an SST is installed, WAL segments older than the checkpoint are released (wal.Manager.Remove).
  • db.runRecoveryChecks – Ensures WAL directory invariants before manifest replay, similar to Badger’s directory bootstrap.

5. Metrics & Observability

Stats.collect reads the manager metrics and exposes them as:

  • NoKV.WAL.ActiveSegment
  • NoKV.WAL.SegmentCount
  • NoKV.WAL.RemovedSegments

The CLI command nokv stats --workdir <dir> prints these alongside backlog, making WAL health visible without manual inspection. In high-throughput scenarios the active segment ID mirrors RocksDB’s LOG number growth.


6. WAL Watchdog (Auto GC)

The WAL watchdog runs inside the DB process to keep WAL backlog in check and surface warnings when raft-typed records dominate the log. It:

  • Samples WAL metrics + per-segment metrics and combines them with manifest.RaftPointerSnapshot() to compute removable segments.
  • Removes up to WALAutoGCMaxBatch segments when at least WALAutoGCMinRemovable are eligible.
  • Exposes counters (WALAutoGCRuns/Removed/LastUnix) and warning state (WALTypedRecordRatio/Warning/Reason) through StatsSnapshot.

Relevant options (see options.go for defaults):

  • EnableWALWatchdog
  • WALAutoGCInterval
  • WALAutoGCMinRemovable
  • WALAutoGCMaxBatch
  • WALTypedRecordWarnRatio
  • WALTypedRecordWarnSegments

7. Recovery Walkthrough

  1. wal.Open reopens the highest segment, leaving the file pointer at the end (switchSegmentLocked).
  2. manifest.Manager supplies the WAL checkpoint (segment + offset) while building the version. Replay skips entries up to this checkpoint, ensuring we only reapply writes not yet materialised in SSTables.
  3. wal.Manager.Replay (invoked by the LSM recovery path) rebuilds memtables from entries newer than the manifest checkpoint. Value-log recovery only validates/truncates segments and does not reapply data.
  4. If the final record is partially written, the CRC mismatch stops replay and the segment is truncated during recovery tests, mimicking RocksDB’s tolerant behaviour.

8. Operational Tips

  • Enable SyncWrites for synchronous durability (the default is asynchronous, matching RocksDB’s default). Durability-sensitive deployments can turn it on to mirror Badger’s SyncWrites behaviour.
  • After large flushes, forcing Rotate keeps WAL files short, reducing replay time.
  • Archived WAL segments can be copied alongside manifest files for hot-backup strategies—since the manifest contains the WAL log number, snapshots behave like RocksDB’s Checkpoints.

9. Truncation Metadata

  • raftstore/engine/wal_storage keeps a per-group index of [firstIndex,lastIndex] spans for each WAL record so it can map raft log indices back to the segment that stored them.
  • When a log is truncated (either via snapshot or future compaction hooks), the manifest is updated via LogRaftTruncate with the index/term, segment ID (RaftLogPointer.SegmentIndex), and byte offset (RaftLogPointer.TruncatedOffset) that delimit the remaining WAL data.
  • lsm/levelManager.canRemoveWalSegment now blocks garbage collection whenever any raft group still references a segment through its truncation metadata, preventing slow followers from losing required WAL history while letting aggressively compacted groups release older segments earlier.
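
A simplified form of that rule, with RaftLogPointer reduced to the one field the check needs (the real manifest type also carries index/term and byte offsets):

// canRemoveWalSegment sketch: a segment may be garbage collected only if it
// is older than the flush checkpoint and no raft group's truncation pointer
// still references it or anything older.
type RaftLogPointer struct {
    SegmentIndex uint32 // first WAL segment still holding this group's log
}

func canRemoveWalSegment(seg, logPointerSegment uint32, raftPtrs map[uint64]RaftLogPointer) bool {
    if seg >= logPointerSegment {
        return false // still needed for memtable replay
    }
    for _, ptr := range raftPtrs {
        if ptr.SegmentIndex <= seg {
            return false // a raft group still depends on this segment
        }
    }
    return true
}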

For broader context, read the architecture overview and flush pipeline documents.

Value Log (vlog) Design

NoKV keeps the LSM tree lean by separating large values into sequential value log (vlog) files. The module is split between

  • vlog/manager.go – owns the open file set, rotation, and segment lifecycle helpers.
  • vlog/io.go – append/read/iterate/verify/sample IO paths.
  • vlog.go – integrates the manager with the DB write path, discard statistics, and garbage collection (GC).

The design echoes BadgerDB’s value log while remaining manifest-driven like RocksDB’s blob_db: vlog metadata (head pointer, pending deletions) is persisted inside the manifest so recovery can reconstruct the exact state without scanning the filesystem.


1. Layering (Engine View)

The value log is split into three layers so IO can stay reusable while DB policy lives in the core package:

  • DB policy layer (vlog.go, vlog_gc.go) – integrates the manager with the DB write path, persists vlog metadata in the manifest, and drives GC scheduling based on discard stats.
  • Value-log manager (vlog/) – owns segment lifecycle (open/rotate/remove), encodes/decodes entries, and exposes append/read/sample APIs without touching MVCC or LSM policy.
  • File IO (file/) – mmap-backed LogFile primitives (open/close/truncate, read/write, read-only remap) shared by WAL/vlog/SST. Vlog currently uses LogFile directly instead of an intermediate store abstraction.

2. Directory Layout & Naming

<workdir>/
  vlog/
    00000.vlog
    00001.vlog
    ...
  • Files are named %05d.vlog and live under workdir/vlog/. Manager.populate discovers existing segments at open.
  • Manager tracks the active file ID (activeID) and byte offset; Manager.Head exposes these so the manifest can checkpoint them (manifest.EditValueLogHead).
  • Files created after a crash but never linked in the manifest are removed during valueLog.reconcileManifest.

3. Record Format

The vlog uses the shared encoding helper (kv.EncodeEntryTo), so entries written to the value log and the WAL are byte-identical.

+--------+----------+------+-------------+-----------+-------+
| KeyLen | ValueLen | Meta | ExpiresAt   | Key bytes | Value |
+--------+----------+------+-------------+-----------+-------+
                                             + CRC32 (4 B)
  • Header fields are varint-encoded (kv.EntryHeader).
  • The first 20 bytes of every segment are reserved (kv.ValueLogHeaderSize) for future metadata; iteration always skips this fixed header.
  • kv.EncodeEntry and the entry iterator (kv.EntryIterator) perform the layout work, and each append finishes with a CRC32 to detect torn writes.
  • vlog.VerifyDir scans all segments with sanitizeValueLog to trim corrupted tails after crashes, mirroring RocksDB’s blob_file::Sanitize. Badger performs a similar truncation pass at startup.

4. Manager API Surface

mgr, _ := vlog.Open(vlog.Config{Dir: "...", MaxSize: 1<<29})
ptr, _ := mgr.AppendEntry(entry)
ptrs, _ := mgr.AppendEntries(entries, writeMask)
val, unlock, _ := mgr.Read(ptr)
unlock()             // release per-file lock
_ = mgr.Rewind(*ptr) // rollback partially written batch
_ = mgr.Remove(fid)  // close + delete file

Key behaviours:

  1. Append + Rotate – Manager.AppendEntry encodes and appends into the active file. The reservation path handles rotation when the active segment would exceed MaxSize; manual rotation is rare.
  2. Crash recovery – Manager.Rewind truncates the active file and removes newer files when a write batch fails mid-flight. valueLog.write uses this to guarantee idempotent WAL/value log ordering.
  3. Safe reads – Manager.Read returns an mmap-backed slice plus an unlock callback. Active segments take a per-file RWMutex, while sealed segments use a pin/unpin path to avoid long-held locks; callers that need ownership should copy the bytes before releasing the lock.
  4. Verification – VerifyDir validates entire directories (used by CLI and recovery) by parsing headers and CRCs.

Compared with RocksDB’s blob manager the surface is intentionally small—NoKV treats the manager as an append-only log with rewind semantics, while RocksDB maintains index structures inside the blob file metadata.


5. Integration with DB Writes

sequenceDiagram
    participant Commit as commitWorker
    participant Mgr as vlog.Manager
    participant WAL as wal.Manager
    participant Mem as MemTable
    Commit->>Mgr: AppendEntries(entries, writeMask)
    Mgr-->>Commit: ValuePtr list
    Commit->>WAL: Append(entries+ptrs)
    Commit->>Mem: apply to skiplist
  1. valueLog.write builds a write mask for each batch, then delegates to Manager.AppendEntries. Entries staying in LSM (shouldWriteValueToLSM) receive zero-value pointers.
  2. Rotation is handled inside the manager when the reserved bytes would exceed MaxSize. The WAL append happens after the value log append so crash replay observes consistent pointers.
  3. Any error triggers Manager.Rewind back to the saved head pointer, removing new files and truncating partial bytes. vlog_test.go exercises both append- and rotate-failure paths.
  4. Txn.Commit and batched writes share the same pipeline: the commit worker always writes the value log first, then applies to WAL/memtable, keeping MVCC and durability ordering consistent.

Badger follows the same ordering (value log first, then write batch). RocksDB’s blob DB instead embeds blob references into the WAL entry before the blob file write, relying on two-phase commit between WAL and blob; NoKV avoids the extra coordination by reusing a single batching loop.


6. Discard Statistics & GC

flowchart LR
  FlushMgr -- "obsolete ptrs" --> DiscardStats
  DiscardStats -->|"batch json"| writeCh
  valuePtr["valueLog.newValuePtr(lfDiscardStatsKey)"]
  writeCh --> valuePtr
  valueLog -- "GC trigger" --> Manager

  • lfDiscardStats aggregates per-file discard counts from lsm.FlushTable completion (valueLog.lfDiscardStats.push inside lsm/flush). Once the in-memory counter crosses discardStatsFlushThreshold, it marshals the map into JSON and writes it back through the DB pipeline under the special key !NoKV!discard.
  • valueLog.flushDiscardStats consumes those stats, ensuring they are persisted even across crashes. During recovery valueLog.populateDiscardStats replays the JSON payload to repopulate the in-memory map.
  • GC uses discardRatio = discardedBytes/totalBytes derived from Manager.Sample, which applies windowed iteration based on configurable ratios. If a file exceeds the configured threshold, valueLog.doRunGC rewrites live entries into the current head (using Manager.Append) and then valueLog.rewrite schedules deletion edits in the manifest.
    • Sampling behaviour is controlled by Options.ValueLogGCSampleSizeRatio (default 0.10 of the file) and Options.ValueLogGCSampleCountRatio (default 1% of the configured entry limit). Setting either to <=0 keeps the default heuristics. Options.ValueLogGCSampleFromHead starts sampling from the beginning instead of a random window.
  • Completed deletions are logged via lsm.LogValueLogDelete so the manifest can skip them during replay. When GC rotates to a new head, valueLog.updateHead records the pointer and bumps the NoKV.ValueLog.HeadUpdates counter.
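
A minimal sketch of the trigger decision (the real doRunGC samples a window of the file via Manager.Sample rather than trusting raw counters):

// pickGCCandidate returns the segment whose discard ratio is highest and at
// least threshold (e.g. Options.ValueLogGCDiscardRatio); illustrative only.
func pickGCCandidate(discard, size map[uint32]int64, threshold float64) (fid uint32, ok bool) {
    bestRatio := threshold
    for candidate, discarded := range discard {
        total := size[candidate]
        if total <= 0 {
            continue
        }
        if ratio := float64(discarded) / float64(total); ratio >= bestRatio {
            fid, bestRatio, ok = candidate, ratio, true
        }
    }
    return fid, ok
}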

RocksDB’s blob GC leans on compaction iterators to discover obsolete blobs. NoKV, like Badger, leverages flush/compaction discard stats so GC does not need to rescan SSTs.


7. Recovery Semantics

  1. DB.Open restores the manifest and fetches the last persisted head pointer.
  2. valueLog.open launches flushDiscardStats and iterates every vlog file via valueLog.replayLog. Files marked invalid in the manifest are removed; valid ones are registered in the manager’s file map.
  3. valueLog.replayLog streams entries to validate checksums and trims torn tails; it does not reapply data into the LSM. WAL replay remains the sole source of committed state.
  4. Manager.VerifyDir trims torn records so replay never sees corrupt payloads.
  5. After validation, valueLog.populateDiscardStats rehydrates discard counters from the persisted JSON entry if present.

The flow mirrors Badger’s vlog scanning but keeps state reconstruction anchored on WAL + manifest checkpoints, similar to RocksDB’s reliance on MANIFEST to mark blob files live or obsolete.


8. Observability & CLI

  • Metrics in stats.go report segment counts, pending deletions, discard queue depth, and GC head pointer via expvar.
  • nokv vlog --workdir <dir> loads a manager in read-only mode and prints current head plus file status (valid, gc candidate). It invokes vlog.VerifyDir before describing segments.
  • Recovery traces controlled by RECOVERY_TRACE_METRICS log every head movement and file removal, aiding pressure testing of GC edge cases. For ad-hoc diagnostics, enable Options.ValueLogVerbose to emit replay/GC messages to stdout.

9. Quick Comparison

  • Head tracking – RocksDB BlobDB: in MANIFEST (blob log number + offset); BadgerDB: internal to the vlog directory; NoKV: manifest entry via EditValueLogHead.
  • GC trigger – RocksDB BlobDB: compaction sampling and blob garbage score; BadgerDB: discard stats from LSM tables; NoKV: discard stats flushed through lfDiscardStats.
  • Failure recovery – RocksDB BlobDB: blob files and WAL coordinate two-phase commits; BadgerDB: replays the value log then the LSM; NoKV: rewind-on-error + manifest-backed deletes.
  • Read path – RocksDB BlobDB: separate blob cache; BadgerDB: direct read + checksum; NoKV: Manager.Read with copy + per-file lock.

By anchoring the vlog state in the manifest and exposing rewind/verify primitives, NoKV maintains the determinism of RocksDB while keeping Badger’s simple sequential layout.


10. Further Reading

  • docs/recovery.md – failure matrix covering append crashes, GC interruptions, and manifest rewrites.
  • docs/cache.md – how vlog-backed entries interact with the block cache.
  • docs/stats.md – metric names surfaced for monitoring.

Manifest & Version Management

The manifest keeps the source of truth for SST files, WAL checkpoints, and ValueLog heads. NoKV’s implementation (manifest/manager.go, manifest/codec.go, manifest/types.go) borrows RocksDB’s VersionEdit + CURRENT pattern while adding metadata required for value separation.


1. File Layout

WorkDir/
  CURRENT             # stores the active MANIFEST file name
  MANIFEST-000001     # log of manifest edits
  MANIFEST-000002     # newer file after rewrite
  • CURRENT is atomically swapped via a CURRENT.tmp → CURRENT rename.
  • Each MANIFEST-* contains a series of binary edits prefixed by the magic string "NoKV" (encoding lives in manifest/codec.go).
  • During manifest.Open, loadCurrent opens the file referenced by CURRENT; if missing, createNew bootstraps an empty manifest.

2. Edit Types

type EditType uint8
const (
    EditAddFile EditType = iota
    EditDeleteFile
    EditLogPointer
    EditValueLogHead
    EditDeleteValueLog
    EditUpdateValueLog
    EditRaftPointer
    EditRegion
)

Each edit serialises one logical action:

  • EditAddFile / EditDeleteFile – manage SST metadata (FileMeta: level, fileID, size, key bounds, timestamps).
  • EditLogPointer – persists the latest WAL segment + offset checkpoint, analogous to RocksDB’s log_number and prev_log_number fields.
  • EditValueLogHead – records the head pointer for vlog append, ensuring recovery resumes from the correct file/offset.
  • EditDeleteValueLog – marks a vlog segment logically deleted (GC has reclaimed it).
  • EditUpdateValueLog – updates metadata for an existing vlog file (used when GC rewrites a segment).
  • EditRaftPointer – persists raft-group WAL progress (segment, offset, applied/truncated index & term, etc.).
  • EditRegion – persists Region metadata (key range, epoch, peers, lifecycle state).

manifest.Manager.apply interprets each edit and updates the in-memory Version structure, which is consumed by LSM initialisation and value log recovery.
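
In outline, apply is a switch over the edit type; the Edit field names below are illustrative, while the Version fields match the structure shown in the next section:

// Sketch of manifest.Manager.apply. The Edit layout and removeFile helper
// are hypothetical; the real decoding lives in manifest/codec.go.
func (v *Version) apply(e Edit) {
    switch e.Type {
    case EditAddFile:
        v.Levels[e.File.Level] = append(v.Levels[e.File.Level], e.File)
    case EditDeleteFile:
        v.Levels[e.File.Level] = removeFile(v.Levels[e.File.Level], e.File)
    case EditLogPointer:
        v.LogSegment, v.LogOffset = e.LogSegment, e.LogOffset
    case EditValueLogHead:
        v.ValueLogHead = e.ValueLog
    case EditUpdateValueLog:
        v.ValueLogs[e.ValueLog.FID] = e.ValueLog
    case EditDeleteValueLog:
        delete(v.ValueLogs, e.ValueLog.FID)
    case EditRaftPointer:
        v.RaftPointers[e.GroupID] = e.RaftPointer
    case EditRegion:
        v.Regions[e.Region.ID] = e.Region
    }
}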


3. Version Structure

type Version struct {
    Levels       map[int][]FileMeta
    LogSegment   uint32
    LogOffset    uint64
    ValueLogs    map[uint32]ValueLogMeta
    ValueLogHead ValueLogMeta
    RaftPointers map[uint64]RaftLogPointer
    Regions      map[uint64]RegionMeta
}
  • Levels mirrors the LSM tree levels; during recovery lsm.LSM loads files per level.
  • LogSegment/LogOffset ensure WAL replay starts exactly where persistent state ended.
  • ValueLogs holds metadata for every known vlog file; ValueLogHead caches the active head for quick access.

Compared with RocksDB: RocksDB’s manifest stores blob file metadata when BlobDB is enabled. NoKV integrates vlog metadata natively to avoid a separate blob manifest.


4. Lifecycle

sequenceDiagram
    participant DB
    participant Manifest
    participant CURRENT
    DB->>Manifest: Open(dir)
    Manifest->>CURRENT: read file name
    Manifest->>Manifest: replay edits → Version
    DB->>Manifest: LogEdit(EditAddFile+LogPointer)
    Manifest->>Manifest: append edit
    Manifest-->>DB: updated Version
    Note over Manifest,CURRENT: On rewrite -> write tmp -> rename CURRENT
  • Open/Rebuild – replay reads all edits, applying them sequentially (bufio.Reader ensures streaming). If any edit fails to decode, recovery aborts so operators can inspect the manifest, similar to RocksDB’s strictness.
  • LogEdit – obtains the mutex, appends the encoded edit, flushes, and updates the in-memory Version before returning.
  • Rewrite – when the manifest grows beyond Options.ManifestRewriteThreshold, the manager writes a new MANIFEST-xxxxxx containing a full snapshot of the current Version, fsyncs it, updates CURRENT, and removes the old file. This mirrors RocksDB’s max_manifest_file_size behavior while keeping recovery simple.
  • Close – flushes and closes the underlying file handle; the version stays available for introspection via Manager.Version() (used by CLI).
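
The CURRENT swap at the heart of rewrite is a classic temp-file-plus-rename; a trimmed sketch (error paths and directory fsync omitted):

package manifestswap

import (
    "os"
    "path/filepath"
)

// setCurrent atomically points CURRENT at a new MANIFEST file. Rename is
// atomic on POSIX filesystems, so readers see either the old or the new
// name, never a partial write.
func setCurrent(dir, manifestName string) error {
    tmp := filepath.Join(dir, "CURRENT.tmp")
    if err := os.WriteFile(tmp, []byte(manifestName+"\n"), 0o644); err != nil {
        return err
    }
    return os.Rename(tmp, filepath.Join(dir, "CURRENT"))
}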

5. Interaction with Other Modules

  • lsm – installLevel0Table logs EditAddFile + EditLogPointer to checkpoint WAL progress. Compaction deletes old files via EditDeleteFile.
  • wal – The manifest’s log pointer tells WAL replay where to resume.
  • vlog – valueLog.rewrite writes EditUpdateValueLog / EditDeleteValueLog after GC, ensuring stale segments are not reopened.
  • CLI – nokv manifest reads manifest.Manager.Version() and prints levels, vlog head, and deletion status.

Badger keeps a separate value.log directory without manifest-level bookkeeping; NoKV’s integrated manifest avoids scanning the filesystem during recovery.


6. Recovery Scenarios

  1. Missing SST file – if MANIFEST references 000123.sst but the file is absent, db_recovery_test.go::TestRecoveryCleansMissingSSTFromManifest verifies that recovery removes the edit, mimicking RocksDB’s lost table handling.
  2. ValueLog deletion – TestRecoveryRemovesStaleValueLogSegment ensures EditDeleteValueLog entries trigger file removal during recovery.
  3. Manifest rewrite crash – TestRecoveryManifestRewriteCrash simulates a crash after writing the new manifest but before updating CURRENT; recovery still points to the old manifest and resumes safely, exactly like RocksDB’s two-phase rewrite.
  4. Stale WAL pointer – WAL replay respects LogSegment/Offset; tests cover truncated WALs to confirm idempotency.

7. CLI Output

nokv manifest --workdir <dir> --json prints:

  • Level file counts and key ranges.
  • wal_log_segment / wal_log_offset checkpoint.
  • value_log_head metadata.
  • List of vlog files with valid status (mirroring RocksDB’s blob file dump).

This structured output enables automated validation in CI and ad-hoc audits.


8. Extensibility

  • Column families – add a column family identifier to FileMeta and extend edits accordingly, as RocksDB does.
  • Snapshots – persistent snapshots can be derived from manifest versions (keep a copy of the current Version and WAL pointer).
  • Remote manifests – similar to RocksDB’s remote compaction, storing manifests in object storage is straightforward because edits are append-only.

For end-to-end recovery context, see recovery.md and the architecture overview.

File Abstractions

The file package encapsulates direct file-system interaction for WAL, SST, and value-log files. It provides portable mmap helpers, allocation primitives, and log file wrappers.


1. Core Types

| Type | Purpose | Key Methods |
| --- | --- | --- |
| Options | Parameter bag for opening files (FID, path, size). | Used by WAL/vlog managers. |
| CoreFile | Interface abstracting platform-specific operations. | NewReader, Bytes, Sync, Delete. |
| MmapFile | Cross-platform mmap wrapper. | OpenMmapFile, AppendBuffer, Truncature, Sync. |
| LogFile | Value-log specific helper built on MmapFile. | Open, Write, Read, DoneWriting, EncodeEntry. |

Darwin-specific builds live alongside (mmap_darwin.go, sstable_darwin.go), ensuring the package compiles on macOS without manual tuning.


2. Mmap Management

  • OpenMmapFile opens or creates a file, optionally extending it to maxSz, then mmaps it. The returned MmapFile exposes Data []byte and the underlying *os.File handle.
  • Writes grow the map on demand: AppendBuffer checks if the write would exceed the current mapping and calls Truncature to expand (doubling up to 1 GiB increments).
  • Sync flushes dirty pages (mmap.Msync), while Delete unmaps, truncates, closes, and removes the file—used when dropping SSTs or value-log segments.

RocksDB relies on custom Env implementations for portability; NoKV keeps the logic in Go, relying on build tags for OS differences.
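
To make the flow concrete, here is a hedged usage sketch. Only the method names come from the table above; the OpenMmapFile and AppendBuffer signatures shown are assumptions for illustration, so check the file package for the real ones.

// Hedged sketch: method names are real, signatures are assumed.
mf, err := file.OpenMmapFile("000001.vlog", os.O_CREATE|os.O_RDWR, 64<<20)
if err != nil {
    panic(err)
}
payload := []byte("record bytes")
// AppendBuffer grows the mapping via Truncature when the write would exceed
// the current mapping size.
if err := mf.AppendBuffer(0, payload); err != nil {
    panic(err)
}
if err := mf.Sync(); err != nil { // msync dirty pages
    panic(err)
}
_ = mf.Data[:len(payload)] // reads are plain slices over the mapping
// mf.Delete() would unmap, truncate, close, and remove the file when the
// segment is dropped.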


3. LogFile Semantics

LogFile wraps MmapFile to simplify value-log operations:

lf := &file.LogFile{}
// Open mmaps 00001.vlog and records its current size (guarded to < 4 GiB).
_ = lf.Open(&file.Options{FID: 1, FileName: "00001.vlog", MaxSz: 1 << 29})
// EncodeEntry serialises the entry (header + CRC32) into buf.
ptr, _ := lf.EncodeEntry(entry, buf, offset)
_ = ptr
// Write copies the encoded bytes into the mapping at offset.
_ = lf.Write(offset, buf.Bytes())
// DoneWriting syncs, truncates to nextOffset, and re-mmaps for further appends.
_ = lf.DoneWriting(nextOffset)
  • Open mmaps the file and records current size (guarded to < 4 GiB).
  • Read validates offsets against both the mmap length and tracked size, preventing partial reads when GC or drop operations shrink the file.
  • EncodeEntry uses the shared kv.EntryHeader and CRC32 helpers to produce the exact on-disk layout consumed by vlog.Manager and wal.Manager.
  • DoneWriting syncs, truncates to the provided offset, reinitialises the mmap, and keeps the file open in read-write mode—supporting subsequent appends.
  • Rewind (via vlog.Manager.Rewind) leverages LogFile.Truncate and Init to roll back partial batches after errors.

4. SST Helpers

While SSTable builders/readers live under lsm/table.go, they rely on file helpers to map index/data blocks efficiently. The build tags (sstable_linux.go, sstable_darwin.go) provide OS-specific tuning for direct I/O hints or mmap flags.


5. Comparison

| Engine | Approach |
| --- | --- |
| RocksDB | C++ Env & random-access file wrappers. |
| Badger | y.File abstraction with mmap. |
| NoKV | Go-native mmap wrappers with explicit log helpers. |

By keeping all filesystem primitives in one package, NoKV ensures WAL, vlog, and SST layers share consistent behaviour (sync semantics, truncation rules) and simplifies testing (mmap_linux_test.go).


6. Operational Notes

  • Value-log and WAL segments rely on DoneWriting/Truncate to seal files; avoid manipulating files externally or mmap metadata may desynchronise.
  • LogFile.AddSize updates the cached size used by reads—critical when rewinding or rewriting segments.
  • SyncDir (see mmap_linux.go) is invoked when new files are created to persist directory entries, similar to RocksDB’s Env::FsyncDir.

For more on how these primitives plug into higher layers, see docs/wal.md and docs/vlog.md.

Cache & Bloom Filters

NoKV’s LSM tier layers a multi-level block cache with bloom filter caching to accelerate lookups. The implementation is in lsm/cache.go.


1. Components

| Component | Purpose | Source |
| --- | --- | --- |
| cache.indexs + indexHot | Table index cache (fid → *pb.TableIndex) reused across reopen + small CLOCK hot tier fed by HotRing hits. | utils/cache |
| blockCache | Ristretto-based block cache (L0/L1 only) with per-table direct slots; hot block tier (small CLOCK) keeps hotspot blocks resident. | lsm/cache.go |
| bloomCache + hot | LRU cache of bloom filter bitsets per SST plus small CLOCK hot tier to protect frequent filters. | lsm/cache.go |
| cacheMetrics | Atomic hit/miss counters for L0/L1 blocks and blooms. | lsm/cache.go#L30-L110 |

Badger uses a similar block cache split (Pinner/Cache) while RocksDB exposes block cache(s) via the BlockBasedTableOptions. NoKV keeps it Go-native and GC-friendly.


1.1 Index Cache & Handles

  • SSTable metadata stays with the table struct, while decoded protobuf indexes are stored in cache.indexs. Lookups first hit the cache before falling back to disk.
  • SST handles are reopened on demand for lower levels. L0/L1 tables keep their file descriptors pinned, while deeper levels close them once no iterator is using the table.

2. Block Cache Strategy

User-space block cache (L0/L1, parsed blocks, Ristretto LFU-ish)
Small hot tier (CLOCK) for hotspot blocks
Deeper levels rely on OS page cache + mmap readahead
  • Options.BlockCacheSize sets capacity in blocks (cost=1 per block). Entries keep parsed blocks (data slice + offsets/baseKey/checksum), so hits avoid re-parsing.
  • Hot tier: requests marked hot (prefetch/hotspot reads) promote blocks into the small CLOCK hot set derived from the main capacity, making them harder to evict under long-tail traffic.
  • Per-table direct slots (table.cacheSlots[idx]) give a lock-free fast path. Misses fall back to the shared Ristretto cache (approx LFU with admission).
  • Evictions clear the table slot via OnEvict; user-space cache only tracks L0/L1 blocks. Deeper levels depend on the OS page cache.
  • Access patterns: getBlock also updates hit/miss metrics for L0/L1; deeper levels bypass the cache and do not affect metrics.
flowchart LR
  Read --> CheckHot
  CheckHot -->|hit| Return
  CheckHot -->|miss| LoadFromTable["LoadFromTable (mmap + OS page cache)"]
  LoadFromTable --> InsertHot
  InsertHot --> Return

By default only L0 and L1 blocks are cached (level > 1 short-circuits), reflecting the higher re-use for top levels.


3. Bloom Cache

  • bloomCache stores the raw filter bitset (utils.Filter) per table ID. Entries are deep-copied (SafeCopy) to avoid sharing memory with mmaps.
  • Main tier is LRU with a tiny CLOCK hot set to protect frequently hit filters from being washed out by scans.
  • Capacity is controlled by Options.BloomCacheSize; the hot CLOCK tier auto-scales from a few dozen up to a few hundred entries.
  • Bloom hits/misses are recorded via cacheMetrics.recordBloom, feeding into StatsSnapshot.BloomHitRate.

4. Metrics & Observability

cache.metricsSnapshot() produces:

type CacheMetrics struct {
    L0Hits, L0Misses uint64
    L1Hits, L1Misses uint64
    BloomHits, BloomMisses uint64
    IndexHits, IndexMisses uint64
}

Stats.Snapshot converts these into hit rates. Monitor them alongside the block cache sizes to decide when to scale memory.
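
The conversion from counters to ratios is straightforward; the helper below is a minimal sketch of that arithmetic (not the actual Stats.Snapshot code) operating on the CacheMetrics struct shown above.

// hitRate sketches how the snapshot turns raw counters into ratios.
func hitRate(hits, misses uint64) float64 {
    total := hits + misses
    if total == 0 {
        return 0
    }
    return float64(hits) / float64(total)
}

// Example: l0 := hitRate(m.L0Hits, m.L0Misses)
//          bloom := hitRate(m.BloomHits, m.BloomMisses)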


5. Hot Integration (HotRing)

  • Hot detection: HotRing counts on read/write paths raise a hot flag once thresholds are met; only hot keys trigger prefetch.
  • Cache promotion: hot hits/prefetch promote blocks into the CLOCK hot tier and promote indexes/Blooms into their CLOCK tiers; cold data stays in the main cache to avoid pollution.
  • Compaction coupling: HotRing top-k feeds compaction scoring; levels/ingest shards covering hot ranges get higher scores to trim overlap sooner.
  • Tuning: Hot thresholds come from HotRing options (window/decay configurable); hot tier capacities are small and derived from existing cache sizes.

6. Interaction with Value Log

  • Keys stored as value pointers (large values) still populate block cache entries for the key/index block. The value payload is read directly from the vlog (valueLog.read), so block cache hit rates remain meaningful.
  • Discard stats from flushes can demote cached blocks via cache.dropBlock, ensuring obsolete SST data leaves the cache quickly.

7. Comparison

| Feature | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Hot/cold tiers | Configurable multiple caches | Single cache | Ristretto (hot) + OS page cache (cold) |
| Bloom cache | Enabled per table, no explicit cache | Optional | Dedicated LRU storing filters |
| Metrics | Block cache stats via GetAggregatedIntProperty | Limited | NoKV.Stats.Cache.* hit rates |

8. Operational Tips

  • If bloom hit rate falls below ~60%, consider increasing bits-per-key or Bloom cache size.
  • Track nokv stats --json cache metrics over time; drops often indicate iterator misuse or working-set shifts.

More on SST layout lives in docs/manifest.md and docs/architecture.md.

HotRing – Hot Key Tracking

hotring is NoKV’s built-in hot-key tracker. It samples read/write access frequency per key and exposes the hottest entries to the stats subsystem and CLI. The implementation resides in hotring/.


1. Motivation

  • Cache hints – DB.prefetchLoop (see db.go) consumes hot keys to schedule asynchronous reads into the block cache.
  • Operational insight – StatsSnapshot.HotKeys and nokv stats --json surface the hottest keys, aiding debugging of traffic hotspots.
  • Throttling – HotRing.TouchAndClamp enables simple rate caps: once a key crosses a threshold, callers can back off or log alerts.

Compared with RocksDB (which exposes block access stats via perf_context) and Badger (which lacks built-in hot-key reporting), NoKV offers a lightweight but concurrent-friendly tracker out of the box.


2. Data Structure

HotRing
  buckets[] -> per-bucket lock-free linked list (Node)
  hashFn   -> hash(key) -> uint32
  hashMask -> selects bucket (power of two size)
  • Each bucket stores a sorted linked list of Node ordered by (tag, key), where tag is derived from the upper bits of the hash. Head pointers are atomic.Pointer[Node], so readers walk the list without taking locks; writers use CAS to splice nodes while preserving order.
  • defaultTableBits = 12 → 4096 buckets by default (NewHotRing). The mask ensures cheap modulo operations.
  • Nodes keep a count (int32) updated atomically and a next pointer stored via unsafe.Pointer. Sliding-window state is guarded by a tiny per-node spin lock instead of a process-wide mutex.
flowchart LR
  Key(key) -->|hash| Bucket["buckets[index] (atomic head)"]
  Bucket --> Node1
  Node1 --> Node2
  Node2 --> Node3
  Node3 --> Nil[(nil)]

3. Core Operations

| Method | Behaviour | Notes |
| --- | --- | --- |
| Touch | Insert or increment key’s counter. | CAS-splices a new node if missing, then increments (window-aware when enabled). |
| Frequency | Read-only counter lookup. | Lock-free lookup; uses sliding-window totals when configured. |
| TouchAndClamp | Increment unless count >= limit, returning (count, limited). | Throttling follows sliding-window totals so hot bursts clamp quickly. |
| TopN | Snapshot hottest keys sorted by count desc. | Walks buckets without locks, then sorts a copy. |
| KeysAbove | Return all keys with counters ≥ threshold. | Handy for targeted throttling or debugging hot shards. |

Bucket ordering is preserved by findOrInsert, which CASes either the bucket head or the predecessor’s next pointer to splice new nodes. Reads never take locks; only per-node sliding-window updates spin briefly to avoid data races.
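
A short usage sketch of these operations, mirroring the example in the HotRing design note later in this book (the nil second argument to NewHotRing stands in for optional configuration):

ring := hotring.NewHotRing(12, nil) // 2^12 = 4096 buckets

ring.Touch("user:42") // read-path accounting

count, limited := ring.TouchAndClamp("user:42", 128) // clamp writes at 128
if limited {
    _ = count // back off, log, or reject the hot key
}

hot := ring.TopN(16) // hottest keys for stats/CLI
_ = hot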


4. Integration Points

  • DB reads – Txn.Get and iterators call db.recordRead, which in turn invokes HotRing.Touch for every successful lookup. Writes touch the ring only when Options.WriteHotKeyLimit is set, so throttling can clamp abusive keys.
  • Stats – StatsSnapshot copies hot.TopN into HotKeys. expvar publishes the same view under NoKV.Stats.HotKeys for automation.
  • Caching – lsm/cache can promote blocks referenced by frequently touched keys, keeping the hot tier warm.

5. Comparisons

| Engine | Approach |
| --- | --- |
| RocksDB | External – TRACE / perf_context requires manual sampling. |
| Badger | None built-in. |
| NoKV | In-process ring with expvar/CLI export and throttling helpers. |

The HotRing emphasises simplicity: lock-free bucket lists with atomic counters (plus optional per-node window tracking), avoiding sketches while staying light enough for hundreds of thousands of hot keys.


6. Operational Tips

  • Options.HotRingTopK controls how many keys show up in stats; default 16. Increase it when investigating workloads with broad hot sets.
  • Combine TouchAndClamp with request middleware to detect abusive tenants: when limited is true, log the key and latency impact.
  • Resetting the ring is as simple as instantiating a new HotRing—useful for benchmarks that require clean counters between phases.

For end-to-end examples see docs/stats.md and the CLI walkthrough in docs/cli.md.


7. Write-Path Throttling

Options.WriteHotKeyLimit wires HotRing into the write path. When set to a positive integer, every call to DB.Set* or transactional Txn.Set* invokes HotRing.TouchAndClamp with the limit. Once a key (optionally scoped by column family via cfHotKey) reaches the limit, the write is rejected with utils.ErrHotKeyWriteThrottle. This keeps pathological tenants or hot shards from overwhelming a single Raft group without adding heavyweight rate-limiters to the client stack.
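
A hedged sketch of how an application might surface the error inside an HTTP handler; the db.Set signature and the handler wiring are assumptions, only utils.ErrHotKeyWriteThrottle comes from the behaviour described above.

// Sketch: translate HotRing write throttling into an HTTP 429 back-off signal.
if err := db.Set(key, value); err != nil {
    if errors.Is(err, utils.ErrHotKeyWriteThrottle) {
        w.WriteHeader(http.StatusTooManyRequests) // client should retry later
        return
    }
    http.Error(w, err.Error(), http.StatusInternalServerError)
    return
}
w.WriteHeader(http.StatusOK)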

Operational hints:

  • StatsSnapshot.HotWriteLimited and the CLI line Write.HotKeyThrottled expose how many writes were rejected since the process started.
  • Applications should surface utils.ErrHotKeyWriteThrottle to callers (e.g. HTTP 429) so clients can back off.
  • Prefetching continues to run independently—only writes are rejected; reads still register hotness so the cache layer knows what to prefetch.
  • Set the limit conservatively (e.g. a few dozen) and pair it with richer HotRing analytics (top-K stats, expvar export) to identify outliers before tuning.

8. Time-Based Decay & Sliding Window

HotRing now exposes two complementary controls so “old” hotspots fade away automatically:

  1. Periodic decay (Options.HotRingDecayInterval + HotRingDecayShift)
    Every interval the global counters are right-shifted (count >>= shift). This keeps TopN and stats output focused on recent traffic even if writes stop abruptly.
  2. Sliding window (Options.HotRingWindowSlots + HotRingWindowSlotDuration)
    Per-key windows split time into slots, each lasting slotDuration. Touch only accumulates inside the current slot; once the window slides past, the stale contribution is dropped. TouchAndClamp and Frequency use the sliding-window total, so write throttling reflects short-term pressure instead of lifetime counts.

Disable either mechanism by setting the interval/durations to zero. Typical starting points:

| Option | Default value | Effect |
| --- | --- | --- |
| HotRingDecayInterval | 1s | Halve legacy counters once per second. |
| HotRingDecayShift | 1 | Simple divide-by-two decay. |
| HotRingWindowSlots | 8 | Keep ~8 buckets of recency data. |
| HotRingWindowSlotDuration | 250ms | Roughly 2s window for throttling. |

With both enabled, the decay loop keeps background stats tidy while the sliding window powers precise, short-term throttling logic.

Transaction & MVCC Design

NoKV provides snapshot-isolated transactions backed by a lightweight oracle that hands out timestamps, tracks conflicts, and coordinates with the write pipeline. The implementation lives entirely in txn.go with metrics surfaced via stats.go.


1. Components at a Glance

| Component | Purpose | Key Functions |
| --- | --- | --- |
| oracle | Issues read/commit timestamps, performs conflict checks, persists watermark progress. | readTs, newCommitTs, doneCommit |
| Txn | User-facing transaction state: pending writes, read-set fingerprints, MVCC metadata. | SetEntry, Get, Commit |
| pendingWritesIterator | Allows iterator merge to see unflushed txn writes. | newPendingWritesIterator |
| Metrics | Tracks counts of started/committed/conflicted txns. | trackTxnStart, txnMetricsSnapshot |

The oracle is initialised during DB.Open, sharing lineage with BadgerDB’s MVCC model. Unlike RocksDB—which relies on WriteBatch/TwoPhaseCommit extensions—transactions are first-class citizens, and the core engine enforces ordering.


2. Timestamp & Conflict Flow

sequenceDiagram
    participant Client
    participant DB
    participant Oracle
    participant Commit as commitWorker
    participant Mgr as vlog.Manager
    participant WAL
    participant Mem as MemTable
    Client->>DB: NewTransaction(update)
    DB->>Oracle: readTs()
    Oracle-->>DB: snapshot ts (nextTxnTs-1)
    Client->>DB: Set/Delete/Get
    DB->>Txn: stage pendingWrites, record read hashes
    Client->>DB: Commit
    DB->>Oracle: newCommitTs(txn)
    alt conflict
        Oracle-->>DB: ErrConflict
    else success
        Oracle-->>DB: commitTs
        DB->>Commit: batch requests
        Commit->>Mgr: AppendEntries(entries, writeMask)
        Commit->>WAL: Append(entries with commitTs)
        Commit->>Mem: apply to skiplist
        DB->>Oracle: doneCommit(commitTs)
    end
  1. Start – DB.newTransaction calls oracle.readTs, which waits for all prior commits to finish (txnMark.WaitForMark) so new readers see a consistent snapshot. In distributed deployments, clients must obtain the startVersion themselves (see Timestamp sources).
  2. Reads – Txn.Get first checks pendingWrites; otherwise it merges LSM iterators and value-log pointers under the read timestamp. For update transactions the read key fingerprint is recorded in Txn.reads via addReadKey.
  3. Conflict detection – When Options.DetectConflicts is enabled, oracle.newCommitTs iterates oracle.committedTxns and compares read fingerprints against keys written by newer commits. This mirrors Badger’s optimistic strategy.
  4. Commit timestamp – newCommitTs increments nextTxnTs, registers the commit in txnMark, and stores the conflict key set for future comparisons.
  5. Apply – Txn.commitAndSend assigns the commit timestamp to each pending entry (kv.KeyWithTs), enqueues them through sendToWriteCh, and returns a callback that waits for the batch completion. Only after the callback runs does the oracle’s doneCommit release the commit watermark.
  6. Value log ordering – As with non-transactional writes, the commit worker runs valueLog.write (which calls Manager.AppendEntries) before the WAL append. On failure vlog.manager.Rewind ensures partial writes do not leak.

RocksDB’s default WriteBatch lacks conflict detection, relying on application-level locking; NoKV ships with snapshot isolation and optional detection, similar to Badger’s Txn but with integrated metrics and iterator pooling.
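
On the caller side the optimistic flow above becomes a simple retry loop. The sketch below follows that flow; the exact method signatures (NewTransaction, Set, Commit, Discard) are assumptions based on the names used in this document.

// commitWithRetry sketches optimistic conflict handling; signatures assumed.
func commitWithRetry(db *NoKV.DB, key, value []byte) error {
    for attempt := 0; attempt < 3; attempt++ {
        txn := db.NewTransaction(true) // update txn → snapshot readTs
        if err := txn.Set(key, value); err != nil {
            txn.Discard()
            return err
        }
        err := txn.Commit() // newCommitTs + conflict check + write pipeline
        txn.Discard()       // always releases the read watermark
        if err == nil {
            return nil
        }
        if !errors.Is(err, utils.ErrConflict) {
            return err // non-conflict errors are not retryable here
        }
        // Conflict: a newer commit wrote a key this txn read; retry with a
        // fresh snapshot.
    }
    return utils.ErrConflict
}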


3. Data Structures

Oracle Watermarks

oracle{
  nextTxnTs       // next commit timestamp to assign
  txnMark         // watermark waiting for WAL/vlog durability
  readMark        // tracks oldest active read timestamp
  committedTxns[] // sliding window of conflict key sets
}
  • txnMark / readMark are utils.WaterMark instances. They guarantee all writes with timestamp ≤ readTs are durable before a new read snapshot begins, mirroring Badger’s approach to avoid reading half-committed data.
  • cleanupCommittedTransactions prunes conflict history based on the oldest outstanding read, preventing unbounded memory use.

Txn State

type Txn struct {
    readTs   uint64
    commitTs uint64
    pendingWrites map[string]*kv.Entry
    conflictKeys  map[uint64]struct{}
    reads         []uint64
    numIterators  int32
    discarded     bool
    update        bool
}
  • Pending writes retain the caller’s entry pointers until commit; NoKV copies values only when moving them into the write batch.
  • Read fingerprints use kv.MemHash, so conflict detection is order-independent and compact.
  • MVCC versions are encoded in the key suffix (KeyWithTs), matching the LSM’s descending version order.

Iterator Integration

  • Txn.newPendingWritesIterator materialises staged entries as a sorted slice, allowing transaction iterators to merge them with memtables/SST tables. This ensures Txn.NewIterator sees writes immediately without affecting other snapshots.
  • Txn.numIterators enforces that all iterators close before commit/discard—helpful for catching resource leaks in tests (txn_iterator_test.go).

4. Commit & Error Handling

| Stage | Failure Handling |
| --- | --- |
| Conflict | oracle.newCommitTs returns (0, true); Txn.Commit surfaces utils.ErrConflict and leaves state untouched. |
| Value log append | valueLog.write rewinds via Manager.Rewind; req.Wait returns the error so callers can retry safely. |
| WAL append | sendToWriteCh propagates WAL errors; commit watermark is cleared immediately in that case. |
| Callback mode | Txn.CommitWith schedules runTxnCallback on a goroutine; user callbacks always execute (success or error). |

The final call to Txn.Discard runs regardless of success, marking the read watermark done and decrementing the oracle’s active counter.


5. Comparisons

| Feature | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Isolation | Optional (WritePrepared/2PC) | Snapshot isolation | Snapshot isolation with WaterMark barriers |
| Conflict detection | External | Optional optimistic | Optional optimistic keyed by utils.MemHash |
| Iterator view | Snapshot handles, manual merging | Built-in | Built-in with pending write iterator |
| Metrics | rocksdb.transactions.* when enabled | Basic stats | NoKV.Txns.* expvar counters + CLI |

NoKV inherits Badger’s optimistic concurrency but strengthens durability ordering by coupling commits with the same write pipeline that non-transactional writes use. Compared with RocksDB’s transactional library, the Go implementation remains lightweight and requires no external locks.


6. Operational Notes

  • Long-running reads: watch NoKV.Txns.Active and oracle.readMark.DoneUntil()—slow consumers keep old versions alive, delaying vlog GC and compaction reclamation.
  • Non-transactional APIs: DB.Set/Get/Del and SetCF/GetCF/DelCF use a MaxUint64 sentinel version for “latest”. Do not mix these writes with MVCC/Txn writes in the same database.
  • Managed mode: exposing Txn.SetEntry with pre-set versions allows replication/replay flows. Because commit timestamps may diverge, transaction markers are only set when all entries share a single commitTs.
  • Throttling: combine HotRing.TouchAndClamp with per-transaction analytics to detect hot-key write storms that lead to frequent conflicts.

See docs/testing.md for the regression matrix covering conflict detection, iterator semantics, and managed timestamps.


7. Timestamp Sources

Replica nodes do not generate timestamps during TinyKV RPC handling; the values sent in KvPrewrite/KvCommit are applied verbatim. For teaching and prototyping you can pick from two approaches:

  • Single-client experiments – choose monotonically increasing integers in your client code (as shown in raftstore/client/client_test.go).

  • Shared allocator – run the sample TSO service under scripts/tso to hand out globally increasing timestamps:

    go run ./scripts/tso --addr 127.0.0.1:9494 --start 100
    
    # request one timestamp
    curl -s http://127.0.0.1:9494/tso
    # request a batch of 16
    curl -s "http://127.0.0.1:9494/tso?batch=16"
    

    Each call returns JSON ({"timestamp":123,"count":1}), where timestamp is the first value in the allocated range. Clients can use the first value for startVersion, or the entire range to provision multiple transactions. This keeps the learning focus on the Percolator flow while demonstrating how production systems would obtain globally ordered timestamps.
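
For completeness, a small Go client for the TSO endpoint above; the JSON field names follow the example response shown.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

type tsoResponse struct {
    Timestamp uint64 `json:"timestamp"`
    Count     int    `json:"count"`
}

func main() {
    // Ask for a batch; use the first value as startVersion or consume the range.
    resp, err := http.Get("http://127.0.0.1:9494/tso?batch=16")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var ts tsoResponse
    if err := json.NewDecoder(resp.Body).Decode(&ts); err != nil {
        panic(err)
    }
    fmt.Println("startVersion:", ts.Timestamp, "range size:", ts.Count)
}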

RaftStore Deep Dive

raftstore powers NoKV’s distributed mode by layering multi-Raft replication on top of the embedded storage engine. This note explains the major packages, the boot and command paths, how transport and storage interact, and the supporting tooling for observability and testing.


1. Package Structure

| Package | Responsibility |
| --- | --- |
| store | Orchestrates peer set, command pipeline, region manager, scheduler/heartbeat loops; exposes helpers such as StartPeer, ProposeCommand, SplitRegion. |
| peer | Wraps etcd/raft RawNode, drives Ready processing (persist to WAL, send messages, apply entries), tracks snapshot resend/backlog. |
| engine | WALStorage/DiskStorage/MemoryStorage across all Raft groups, leveraging the NoKV WAL while keeping manifest metadata in sync. |
| transport | gRPC transport with retry/TLS/backpressure; exposes the raft Step RPC and can host additional services (TinyKv). |
| kv | TinyKv RPC implementation, bridging Raft commands to MVCC operations via kv.Apply. |
| server | ServerConfig + New that bind DB, Store, transport, and TinyKv server into a reusable node primitive. |

2. Boot Sequence

  1. Construct Server

    srv, _ := raftstore.NewServer(raftstore.ServerConfig{
        DB: db,
        Store: raftstore.StoreConfig{StoreID: 1},
        Raft: myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
        TransportAddr: "127.0.0.1:20160",
    })
    
    • A gRPC transport is created, the TinyKv service is registered, and transport.SetHandler(store.Step) wires raft Step handling.
    • store.Store loads manifest.RegionSnapshot() to rebuild the Region catalog (router + metrics).
  2. Start local peers

    • CLI (nokv serve) iterates the manifest snapshot and calls Store.StartPeer for every region that includes the local store.
    • Each peer.Config carries raft parameters, the transport reference, kv.NewEntryApplier, WAL/manifest handles, and Region metadata.
    • StartPeer registers the peer through the peer-set/routing layer and may bootstrap or campaign for leadership.
  3. Peer connectivity

    • transport.SetPeer(storeID, addr) defines outbound raft connections; the CLI exposes it via --peer storeID=addr.
    • Additional services can reuse the same gRPC server through transport.WithServerRegistrar.

3. Command Execution

Read (strong leader read)

  1. kv.Service.KvGet builds pb.RaftCmdRequest and invokes Store.ReadCommand.
  2. validateCommand ensures the region exists, epoch matches, and the local peer is leader; a RegionError is returned otherwise.
  3. peer.Flush() drains pending Ready, guaranteeing the latest committed log is applied.
  4. commandApplier (i.e. kv.Apply) runs GET/SCAN directly against the DB, using MVCC readers to honour locks and version visibility.

Write (via Propose)

  1. Write RPCs (Prewrite/Commit/…) call Store.ProposeCommand, encoding the command and routing to the leader peer.
  2. The leader appends the encoded request to raft, replicates, and once committed the command pipeline hands data to kv.Apply, which maps Prewrite/Commit/ResolveLock to the percolator package.
  3. engine.WALStorage persists raft entries/state snapshots and updates manifest raft pointers. This keeps WAL GC and raft truncation aligned.

4. Transport

  • gRPC transport listens on TransportAddr, serving both raft Step RPC and TinyKv RPC.
  • SetPeer updates the mapping of remote store IDs to addresses; BlockPeer can be used by tests or chaos tooling.
  • Configurable retry/backoff/timeout options mirror production requirements. Tests cover message loss, blocked peers, and partitions.

5. Storage Backend (engine)

  • WALStorage piggybacks on the embedded WAL: each Raft group writes typed entries, HardState, and snapshots into the shared log.
  • LogRaftPointer and LogRaftTruncate edit manifest metadata so WAL GC knows how far it can compact per group.
  • Alternative storage backends (DiskStorage, MemoryStorage) are available for tests and special scenarios.

6. TinyKv RPC Integration

| RPC | Execution Path | Notes |
| --- | --- | --- |
| KvGet / KvScan | ReadCommand → kv.Apply (read mode) | No raft round-trip; leader-only. |
| KvPrewrite / KvCommit / KvBatchRollback / KvResolveLock / KvCheckTxnStatus | ProposeCommand → command pipeline → raft log → kv.Apply | Pipeline matches proposals with apply results; MVCC latch manager prevents write conflicts. |

The cmd/nokv serve command uses raftstore.Server internally and prints a manifest summary (key ranges, peers) so operators can verify the node’s view at startup.


7. Client Interaction (raftstore/client)

  • Region-aware routing with NotLeader/EpochNotMatch retry.
  • Mutate splits mutations by region and performs two-phase commit (primary first). Put / Delete are convenience wrappers.
  • Scan transparently walks region boundaries.
  • End-to-end coverage lives in raftstore/server/server_client_integration_test.go, which launches real servers, uses the client to write and delete keys, and verifies the results.

8. Control Plane & Region Operations

8.1 Topology & Routing

  • Topology is sourced from raft_config.example.json (via config.LoadFile) and reused by scripts, Docker Compose, and the Redis gateway.
  • The client builds a static region map ([]RegionConfig) and store endpoints from the same file; there is no dynamic PD-style reconfiguration today.
  • The built-in scheduler currently emits leader-transfer operations only (see raftstore/scheduler), acting as a minimal control plane.

8.2 Split / Merge

  • Split: leaders call Store.ProposeSplit, which writes a split AdminCommand into the parent region’s raft log. On apply, Store.SplitRegion updates the parent range/epoch and starts the child peer.
  • Merge: leaders call Store.ProposeMerge, writing a merge AdminCommand. On apply, the target region range/epoch is expanded and the source peer is stopped/removed from the manifest.
  • These operations are explicit and are not auto-triggered by size/traffic heuristics; a higher-level controller could call the same APIs.

9. Observability

  • store.RegionMetrics() feeds into StatsSnapshot, making region counts and backlog visible via expvar and nokv stats.
  • nokv regions shows manifest-backed regions: ID, range, peers, state.
  • scripts/transport_chaos.sh exercises transport metrics under faults; scripts/run_local_cluster.sh spins up multi-node clusters for manual inspection.

Store internals at a glance

| Component | File | Responsibility |
| --- | --- | --- |
| Peer set | peer_set.go | Tracks active peers, synchronises router registration, exposes thread-safe lookups/iteration. |
| Command pipeline | command_pipeline.go | Assigns request IDs, records proposals, matches apply results, returns responses/errors to callers. |
| Region manager | region_manager.go | Validates state transitions, writes manifest edits, updates peer metadata, triggers region hooks. |
| Operation scheduler | operation_scheduler.go | Buffers planner output, enforces cooldown & burst limits, dispatches leader transfers or other operations. |
| Heartbeat loop | heartbeat_loop.go | Periodically publishes region/store heartbeats and re-runs the planner to produce scheduling actions. |
| Global registry | global.go | Records live stores for CLI/scripting (Store.Close() automatically unregisters instances). |

10. Extending raftstore

  • Adding peers: update the manifest with new Region metadata, then call Store.StartPeer on the target node.
  • Follower or lease reads: extend ReadCommand to include ReadIndex or leader lease checks; current design only serves leader reads.
  • Scheduler integration: pair RegionSnapshot() and RegionMetrics() with an external scheduler (PD-like) for dynamic balancing.

This layering keeps the embedded storage engine intact while providing a production-ready replication path, robust observability, and straightforward integration in both CLI and programmatic contexts.

Crash Recovery Playbook

This playbook documents how NoKV rebuilds state after a crash and which automated checks ensure correctness. It ties together WAL replay, manifest reconciliation, ValueLog GC, and flush pipelines—mirroring RocksDB’s layered recovery while incorporating Badger-style value log hygiene.


1. Recovery Phases

flowchart TD
    Start[DB.Open]
    Verify[runRecoveryChecks]
    Manifest[manifest.Open → replay]
    WAL[wal.Manager.Replay]
    VLog[valueLog.recover]
    Flush[Recreate memtables]
    Stats[Stats.StartStats]

    Start --> Verify --> Manifest --> WAL --> VLog --> Flush --> Stats
  1. Directory verification – DB.runRecoveryChecks calls manifest.Verify, wal.VerifyDir, and initialises the vlog directory. Missing directories fail fast.
  2. Manifest replay – manifest.Open reads CURRENT, replays EditAddFile/DeleteFile, EditLogPointer, and vlog edits into an in-memory Version.
  3. WAL replay – wal.Manager.Replay processes segments newer than the manifest checkpoint, rebuilding memtables from committed entries.
  4. ValueLog reconciliation – valueLog.recover scans existing .vlog files, drops segments marked invalid, and trims torn tails to the last valid entry.
  5. Flush backlog – Immutable memtables recreated from WAL are resubmitted to flush.Manager; temporary .sst.tmp files are either reinstalled or cleaned up.
  6. Stats bootstrap – the metrics goroutine restarts so CLI commands immediately reflect queue backlogs and GC status.

This mirrors RocksDB’s DBImpl::Recover while extending to handle value log metadata automatically.


2. Failure Scenarios & Expected Outcomes

| Failure Point | Example Simulation | Expected Recovery Behaviour | Tests |
| --- | --- | --- | --- |
| WAL tail truncation | truncate last 2 bytes of 000005.wal | Replay stops at truncated record, previously flushed SST remains intact | wal/manager_test.go::TestReplayTruncatedTail |
| Flush crash before install | crash after writing .sst.tmp | WAL replay rebuilds memtable; temp file removed; no manifest edit present | db_recovery_test.go::TestRecoveryWALReplayRestoresData |
| Flush crash after install | crash after logging manifest edit but before WAL release | Manifest still lists SST; recovery verifies file exists and releases WAL on reopen | db_recovery_test.go::TestRecoveryCleansMissingSSTFromManifest |
| ValueLog GC crash | delete edit written, file still on disk | Recovery removes stale .vlog file and keeps manifest consistent | db_recovery_test.go::TestRecoveryRemovesStaleValueLogSegment |
| Manifest rewrite crash | new MANIFEST written, CURRENT not updated | Recovery keeps using old manifest; stale temp file cleaned | db_recovery_test.go::TestRecoveryManifestRewriteCrash |
| Transaction in-flight | crash between WAL append and memtable update | WAL replay reapplies entry; transactions remain atomic because commit order is vlog → WAL → memtable | txn_test.go::TestTxnCommitPersists |
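
As a concrete illustration of the first row, the truncation itself needs nothing beyond the os package; the helper below sketches only the simulation step, while reopening the store and asserting on the replayed data is left to the tests listed above.

package recoverysketch

import "os"

// truncateTail chops the last n bytes off a WAL segment to simulate a torn
// write, as in the "WAL tail truncation" scenario above. Sketch only.
func truncateTail(path string, n int64) error {
    info, err := os.Stat(path)
    if err != nil {
        return err
    }
    if info.Size() < n {
        n = info.Size()
    }
    return os.Truncate(path, info.Size()-n)
}

// e.g. truncateTail(dir+"/000005.wal", 2) before reopening the store.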

3. Automation & Tooling

3.1 Go Test Matrix

GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./... -run 'Recovery'
  • Exercises WAL replay, manifest cleanup, vlog GC, and managed transaction recovery.
  • Set RECOVERY_TRACE_METRICS=1 to emit structured logs (key/value pairs) for each scenario.

3.2 Shell Script Harness

scripts/recovery_scenarios.sh orchestrates the matrix end-to-end:

  1. Spins up a temporary database, injects writes, and crashes at chosen checkpoints.
  2. Reopens the database and validates via CLI (nokv stats, nokv manifest, nokv vlog).
  3. Archives logs under artifacts/recovery/<scenario>.log for CI inspection.

3.3 CLI Validation

  • nokv manifest --workdir <dir>: confirm WAL checkpoint, level files, vlog head.
  • nokv stats --workdir <dir>: observe flush backlog drop to zero after replay.
  • nokv vlog --workdir <dir>: ensure stale segments disappear after GC recovery.

These commands give the same insight as RocksDB’s ldb manifest_dump or Badger’s CLI but with JSON output for automation.


4. Metrics Emitted During Recovery

When RECOVERY_TRACE_METRICS=1:

  • RECOVERY_METRIC phase="manifest" ... – manifest replay progress.
  • RECOVERY_METRIC phase="wal" segment=... offset=... – WAL records applied.
  • RECOVERY_METRIC phase="vlog_gc" fid=... action="delete" – vlog cleanup status.

StatsSnapshot also exposes:

  • NoKV.Flush.Queue – remaining flush tasks.
  • NoKV.ValueLog.HeadFID – head file after recovery.
  • NoKV.Txns.Active – should reset to zero post-recovery.

5. Comparison with RocksDB & Badger

| Aspect | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| WAL replay | DBImpl::RecoverLogFiles replays per log number | Journal (value log) is replayed into LSM | Dedicated WAL manager with manifest checkpoint, plus vlog trim |
| Manifest reconciliation | Removes missing files, handles CURRENT rewrite | Minimal manifest (mainly tables) | Tracks SST + vlog metadata; auto-cleans missing SST/vlog |
| Value log recovery | Optional (BlobDB) requires external blob manifest | Primary log, re-scanned on start | Manifest-backed head + discard stats to avoid rescan |
| Tooling | ldb for manifest dump | badger CLI | nokv CLI with JSON output |

NoKV inherits RocksDB’s strict manifest semantics and Badger’s value log durability, yielding deterministic restart behaviour even under mixed workloads.


6. Extending the Matrix

Future enhancements to cover:

  • Compaction crash – simulate partial compaction output and verify manifest rollback.
  • Prefetch queue state – ensure hot-key prefetch map resets cleanly.
  • Raft integration – once replication is added, validate raft log catch-up interacts correctly with WAL replay.

Contributions adding new recovery scenarios should update this document and the shell harness to keep observability aligned.

Stats & Observability Pipeline

NoKV exposes internal health via the Go expvar package and the nokv stats CLI. The statistics subsystem is implemented in stats.go and runs continuously once the DB is open.


1. Architecture

flowchart TD
    subgraph Collectors
        Flush[lsm.FlushMetrics]
        Levels[lsm.CompactionStats]
        VLog[valueLog.metrics]
        WAL[wal.Manager.Metrics]
        Txn[oracle.txnMetricsSnapshot]
        Cache[lsm.CacheMetrics]
        Hot[hotring.TopN]
    end
    Collectors --> Stats
    Stats -->|expvar publish| Runtime
    Stats -->|Snapshot| CLI
  • newStats wires together reusable expvar.Int/Float gauges (avoiding duplicates if the process restarts an embedded DB).
  • Stats.StartStats launches a goroutine that ticks every 5s (configurable via Stats.interval) to refresh values.
  • Stats.Snapshot can be called on-demand (e.g. CLI) without mutating expvar state.

2. Snapshot Fields

| Field | Source | Description |
| --- | --- | --- |
| Entries | lsm.EntryCount() | Total MVCC entries (L0-Ln + memtables). Mirrors Stats.EntryNum for backwards compat. |
| FlushPending/Queue/Active | lsm.FlushMetrics() | Pending immutables, queue length, workers currently building SSTs. |
| FlushWait/Build/ReleaseMs | Derived from WaitNs/BuildNs/ReleaseNs averages | End-to-end latency of flush pipeline stages. |
| CompactionBacklog/MaxScore | lsm.CompactionStats() | How many level files await compaction and the hottest score. |
| ValueLogSegments/PendingDel/DiscardQueue/Head | valueLog.metrics() | Tracks vlog utilisation and GC backlog. |
| WALActiveSegment/SegmentCount/Removed/ActiveSize | wal.Manager.Metrics() | Observes WAL rotation cadence and current segment byte usage (pairs with raft lag metrics). |
| WALTypedRecordRatio/Warning/Reason | WAL backlog watchdog (Stats.Snapshot) | Tracks ratio of raft typed records in the WAL and surfaces warnings with reasons when exceeding thresholds. |
| WALAutoGCRuns/Removed/LastUnix | WAL backlog watchdog | Automated WAL GC passes, total segments removed, and the Unix timestamp of the last run. |
| WriteQueueDepth/Entries/Bytes | writeMetrics.snapshot() | Size of the asynchronous write queue. |
| WriteAvg* | writeMetrics averages | Request wait times, vlog latency, apply latency. |
| WriteBatchesTotal | writeMetrics | Lifetime batches processed. |
| HotWriteLimited | db.hotWriteLimited | Number of write attempts rejected by Options.WriteHotKeyLimit (HotRing write throttling). |
| WriteThrottleActive | db.blockWrites | Indicates when writes are being throttled. |
| TxnsActive/Started/Committed/Conflicts | oracle.txnMetricsSnapshot() | MVCC activity counters. |
| HotKeys | hotring.TopN() | Top-K hot key counts. |
| BlockL0/L1/BloomHitRate | lsm.CacheMetrics() | Block and bloom cache hit ratios. |
| IndexHitRate | lsm.CacheMetrics() | SST index block cache hit rate. |
| IteratorReused | iteratorPool.reused() | Frequency of iterator pooling hits. |
| RaftGroupCount/LaggingGroups/MaxLagSegments/LagWarnThreshold/RaftLagWarning | manifest.RaftPointerSnapshot() | Tracks follower backlogs; LagWarnThreshold comes from Options.RaftLagWarnSegments, and RaftLagWarning toggles when any group exceeds it. |
| RegionTotal/New/Running/Removing/Tombstone/Other | store.RegionMetrics | Multi-Raft region state distribution. CLI attaches the first available RegionMetrics by default; pass --no-region-metrics to disable. |

All values are exported under the NoKV.* namespace via expvar (see newStats).


3. CLI & JSON Output

  • nokv stats --workdir <dir> prints a human-readable table (queue lengths, throughput, hot keys, region totals). It automatically attaches RegionMetrics when available; add --no-region-metrics to produce a manifest-only snapshot.
  • When RaftLagWarning=true the CLI emits an extra Raft.Warning line; it also surfaces Regions.Total (...) so operators can quickly gauge Region lifecycle health.
  • nokv stats --json emits the raw snapshot for automation. Example snippet:
{
  "entries": 1048576,
  "flush_queue_length": 2,
  "vlog_head": {"fid": 5, "offset": 184320},
  "hot_keys": [{"key": "user:123", "count": 42}]
}

The CLI internally instantiates a read-only DB handle, calls Stats.Snapshot, and formats the response—no background goroutine is needed.
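
Because everything lands in expvar, automation can also scrape the standard /debug/vars endpoint instead of shelling out to the CLI. The sketch below assumes the embedding process serves expvar over HTTP; the address is illustrative.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    resp, err := http.Get("http://127.0.0.1:9100/debug/vars")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var vars map[string]json.RawMessage
    if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
        panic(err)
    }
    // Gauges are published under the NoKV.* namespace (see newStats).
    for name, raw := range vars {
        if strings.HasPrefix(name, "NoKV.") {
            fmt.Printf("%s = %s\n", name, raw)
        }
    }
}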


4. Integration with Other Modules

| Module | Contribution |
| --- | --- |
| WAL | wal.Manager.Metrics() counts active/removable segments, aiding post-recovery validation. |
| Value Log | valueLog.metrics() exposes GC backlog, enabling alerting when discard queues stall. |
| HotRing | Publishes hot key JSON via expvar so dashboards can visualise top offenders. |
| Transactions | Oracle counters help gauge contention (high conflicts → tune workload). |
| Cache | Hit rates clarify whether cache sizing (hot/cold tier) needs adjustment. |

5. Comparisons

| Engine | Observability |
| --- | --- |
| RocksDB | iostats, perf_context, ldb commands. Requires manual parsing. |
| Badger | Prometheus metrics (optional). |
| NoKV | Built-in expvar gauges + CLI + recovery trace toggles. |

NoKV emphasises zero-dependency observability. Everything is consumable via HTTP /debug/vars or the CLI, making it easy to integrate with Go services.


6. Operational Guidance

  • Watch FlushQueueLength and CompactionBacklog together—if both grow, increase flush workers or adjust level sizes.
  • ValueLogDiscardQueue > 0 for extended periods indicates GC is blocked; inspect NoKV.ValueLog.GcRuns and consider tuning thresholds.
  • WriteThrottleActive toggling frequently suggests L0 is overwhelmed; cross-check BlockL0HitRate and compaction metrics.
  • HotWriteLimited climbing steadily means HotRing write throttling is firing—surface utils.ErrHotKeyWriteThrottle to clients and investigate abusive keys via the HotKeys list.
  • RaftLagWarning toggling to true means at least one follower lags the leader by more than Options.RaftLagWarnSegments; inspect Raft.Warning from the CLI and consider snapshot resend or throttling the offending node.
  • Regions.Total should match the expected cluster topology; sustained Removing/Tombstone counts indicate stalled cleanup—investigate split/merge logic or stuck replicas.

Refer to docs/testing.md for scripted checks that validate stats during CI runs.

Testing & Validation Matrix

This document inventories NoKV’s automated coverage and provides guidance for extending tests. It aligns module-level unit tests, integration suites, and benchmarking harnesses with the architectural features described elsewhere.


1. Quick Commands

# All unit + integration tests (uses local module caches)
GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./...

# Focused transaction suite
go test ./... -run '^TestTxn|TestConflict|TestTxnIterator'

# Crash recovery scenarios
RECOVERY_TRACE_METRICS=1 ./scripts/recovery_scenarios.sh

# gRPC transport chaos tests + watchdog metrics
CHAOS_TRACE_METRICS=1 ./scripts/transport_chaos.sh

# Sample timestamp allocator (TSO) for multi-client transaction tests
go run ./scripts/tso --addr 127.0.0.1:9494 --start 100

# Local three-node cluster (includes manifest bootstrap + optional TSO)
./scripts/run_local_cluster.sh --config ./raft_config.example.json
# Tear down with Ctrl+C

# Docker-compose sandbox (3 nodes + TSO)
docker compose up --build
docker compose down -v

# Build RocksDB locally (installs into ./third_party/rocksdb/dist by default)
./scripts/build_rocksdb.sh
# YCSB baseline (records=1e6, ops=1e6, warmup=1e5, conc=16)
./scripts/run_benchmarks.sh
# YCSB with RocksDB (requires CGO, `benchmark_rocksdb`, and the RocksDB build above)
LD_LIBRARY_PATH="$(pwd)/third_party/rocksdb/dist/lib:${LD_LIBRARY_PATH}" \
CGO_CFLAGS="-I$(pwd)/third_party/rocksdb/dist/include" \
CGO_LDFLAGS="-L$(pwd)/third_party/rocksdb/dist/lib -lrocksdb -lz -lbz2 -lsnappy -lzstd -llz4" \
YCSB_ENGINES="nokv,badger,rocksdb" ./scripts/run_benchmarks.sh
# One-click script (auto-detect RocksDB, supports `YCSB_*` env vars to override defaults)
./scripts/run_benchmarks.sh
# Quick smoke run (smaller dataset)
NOKV_RUN_BENCHMARKS=1 YCSB_RECORDS=10000 YCSB_OPS=50000 YCSB_WARM_OPS=0 \
./scripts/run_benchmarks.sh -ycsb_workloads=A -ycsb_engines=nokv

Tip: Pin GOCACHE/GOMODCACHE in CI to keep build artefacts local and avoid permission issues.


2. Module Coverage Overview

| Module | Tests | Coverage Highlights | Gaps / Next Steps |
| --- | --- | --- | --- |
| WAL | wal/manager_test.go | Segment rotation, sync semantics, replay tolerance for truncation, directory bootstrap. | Add IO fault injection, concurrent append stress. |
| LSM / Flush / Compaction | lsm/lsm_test.go, lsm/compact_test.go, lsm/flush/*_test.go | Memtable correctness, iterator merging, flush pipeline metrics, compaction scheduling. | Extend backpressure assertions, test cache hot/cold split. |
| Manifest | manifest/manager_test.go, manifest/levels_test.go | CURRENT swap safety, rewrite crash handling, vlog metadata persistence. | Simulate partial edit corruption, column family extensions. |
| ValueLog | vlog/vlog_test.go, vlog/gc_test.go | ValuePtr encoding/decoding, GC rewrite, concurrent iterator safety. | Long-running GC with transactions, discard ratio edge cases. |
| Transactions / Oracle | txn_test.go, txn_iterator_test.go, txn_metrics_test.go | MVCC timestamps, conflict detection, iterator snapshots, metrics accounting. | Mixed workload fuzzing, managed transactions with TTL. |
| DB Integration | db_test.go, db_recovery_test.go, db_recovery_managed_test.go | End-to-end writes, recovery, managed vs. unmanaged transactions, throttle behaviour. | Combine ValueLog GC + compaction stress, multi-DB interference. |
| CLI & Stats | cmd/nokv/main_test.go, stats_test.go | Golden JSON output, stats snapshot correctness, hot key ranking. | CLI error handling, expvar HTTP integration tests. |
| Redis Gateway | cmd/nokv-redis/backend_embedded_test.go, cmd/nokv-redis/server_test.go, cmd/nokv-redis/backend_raft_test.go | Embedded backend semantics (NX/XX, TTL, counters), RESP parser, raft backend config wiring & TSO discovery. | End-to-end multi-region CRUD with raft backend, TTL lock cleanup under failures. |
| Scripts & Tooling | scripts/scripts_test.go, cmd/nokv-config/main_test.go | serve_from_config.sh address scoping (host/docker) and manifest skipping, nokv-config JSON/simple formats, manifest logging CLI. | Golden coverage for run_local_cluster.sh, failure-path diagnostics. |
| Benchmark | benchmark/ycsb_test.go, benchmark/ycsb_runner.go | YCSB throughput/latency comparisons across engines with detailed percentile + operation mix reporting. | Automate multi-node deployments, add more workloads (D/E/F) and multi-GB datasets. |

3. System Scenarios

| Scenario | Coverage | Focus |
| --- | --- | --- |
| Crash recovery | db_recovery_test.go, scripts/recovery_scenarios.sh | WAL replay, missing SST cleanup, vlog GC restart, manifest rewrite safety. |
| WAL pointer desync | raftstore/engine/wal_storage_test.go::TestWALStorageDetectsTruncatedSegment | Detects manifest pointer offsets beyond truncated WAL tails to avoid silent corruption. |
| Transaction contention | TestConflict, TestTxnReadAfterWrite, TestTxnDiscard | Oracle watermark handling, conflict errors, managed commit path. |
| Value separation + GC | vlog/gc_test.go, db_recovery_test.go::TestRecoveryRemovesStaleValueLogSegment | GC correctness, manifest integration, iterator stability. |
| Iterator consistency | txn_iterator_test.go, lsm/iterator_test.go | Snapshot visibility, merging iterators across levels and memtables. |
| Throttling / backpressure | lsm/compact_test.go, db_test.go::TestWriteThrottle | L0 backlog triggers, flush queue growth, metrics observation. |
| Distributed TinyKv client | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, raftstore/transport/grpc_transport_test.go::TestGRPCTransportManualTicksDriveElection | Region-aware routing, NotLeader retries, manual tick-driven elections, cross-region 2PC sequencing. |
| Performance regression | benchmark package | Compare NoKV vs Badger/RocksDB, produce human-readable reports under benchmark/benchmark_results. |

4. Observability in Tests

  • RECOVERY_METRIC logs – produced when RECOVERY_TRACE_METRICS=1; consumed by recovery script and helpful when triaging CI failures.
  • TRANSPORT_METRIC logs – emitted by scripts/transport_chaos.sh when CHAOS_TRACE_METRICS=1, capturing gRPC watchdog counters during network partitions and retries.
  • Stats snapshots – stats_test.go verifies JSON structure so CLI output remains backwards compatible.
  • Benchmark artefacts – stored under benchmark/benchmark_results/*.txt for historical comparison. Aligns with README instructions.

5. Extending Coverage

  1. Property-based testing – integrate testing/quick or third-party generators to randomise transaction sequences (Badger uses similar fuzz tests for transaction ordering).
  2. Stress harness – add a Go-based stress driver to run mixed read/write workloads for hours, capturing metrics akin to RocksDB’s db_stress tool.
  3. Distributed readiness – when Raft or replication is introduced, craft tests that validate WAL shipping combined with manifest updates.
  4. CLI smoke tests – simulate corrupted directories to ensure CLI emits actionable errors.

Keep this matrix updated when adding new modules or scenarios so documentation and automation remain aligned.

Scripts Overview

NoKV ships a small collection of helper scripts to streamline local experimentation, demos, diagnostics, and automation. This page summarises what each script does, how to use it, and which shared configuration it consumes.


Cluster helpers

scripts/run_local_cluster.sh

  • Purpose – builds nokv, nokv-config, and nokv-tso, reads raft_config.json, seeds manifests, and starts the TinyKv nodes (plus TSO when configured). If a store directory already contains a manifest (CURRENT), the seeding step is skipped so previously bootstrapped data is reused.
  • Usage
    ./scripts/run_local_cluster.sh --config ./raft_config.example.json --workdir ./artifacts/cluster
    

--config defaults to the repository’s raft_config.example.json; --workdir chooses the data root (./artifacts/cluster by default). For every entry under stores the script creates store-<id>, calls nokv-config manifest, and, if tso.listen_addr is set, launches nokv-tso. The script runs in the foreground—press Ctrl+C to stop all spawned processes.

❗️ Shutdown / restart note — To avoid WAL/manifest mismatches, always stop the script with Ctrl+C and wait for the Shutting down... message. If you crash the process or the host, clean the workdir (rm -rf ./artifacts/cluster) before starting again; otherwise the replay step may panic when it encounters truncated WAL segments.

scripts/bootstrap_from_config.sh

  • Purpose – manifest-only bootstrap, typically used in Docker Compose before the nodes start. Stores that already hold a manifest are detected and skipped.
  • Usage
    ./scripts/bootstrap_from_config.sh --config /etc/nokv/raft_config.json --path-template /data/store-{id}
    
    The script iterates over every store in the config and writes Region metadata via nokv-config manifest into the provided path template.

scripts/serve_from_config.sh

  • Purpose – translate raft_config.json into a nokv serve command, avoiding manual --peer lists. It resolves peer IDs from the region metadata and maps every peer (other than the local store) to its advertised address so that gRPC transport works out of the box.
  • Usage
    ./scripts/serve_from_config.sh \
        --config ./raft_config.json \
        --store-id 1 \
        --workdir ./artifacts/cluster/store-1 \
        --scope local   # use --scope docker inside containers
    
    --scope decides whether to use the local addresses or the container-friendly ones. The script assembles all peer mappings (excluding the local store) and execs nokv serve.

Diagnostics & benchmarking

| Script | Purpose |
| --- | --- |
| scripts/recovery_scenarios.sh | Runs crash-recovery scenarios across WAL/manifest/vlog. Set RECOVERY_TRACE_METRICS=1 to collect metrics under artifacts/recovery/. |
| scripts/transport_chaos.sh | Injects disconnects/blocks/delay into the raftstore transport to observe behaviour under faulty networks. |
| scripts/run_benchmarks.sh | Executes the comparison benchmarks (NoKV vs Badger/RocksDB). |
| scripts/analyze_pprof.sh | Aggregates CPU/heap profiles from pprof_output/ and renders SVG/PNG summaries. |
| scripts/debug.sh | Convenience wrapper around dlv test for targeted debugging. |
| scripts/gen.sh | Generates mock data or helper artefacts (see inline comments for details). |

Other helpers

scripts/tso

A small Go program (not shell) that exposes an HTTP timestamp oracle:

go run ./scripts/tso --addr 0.0.0.0:9494 --start 100

run_local_cluster.sh and Docker Compose invoke it automatically when tso.listen_addr is present in the shared config.


Relationship with nokv-config

  • nokv-config stores / regions / tso provide structured views over raft_config.json, making it easy for scripts and CI to query the topology.
  • nokv-config manifest writes Region metadata into manifests and replaces the historical manifestctl binary.
  • cmd/nokv-redis reads the same config; when --tso-url is omitted it falls back to the tso section.
  • Go tools or custom scripts can import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate to consume the same raft_config.json, avoiding divergent schemas.

Maintaining a single raft_config.json keeps local scripts, Docker Compose, Redis gateway, and automated tests aligned.

Redis Gateway

cmd/nokv-redis exposes NoKV through a RESP-compatible endpoint. The gateway reuses the engine’s MVCC/transaction semantics and can operate in two modes:

| Mode | Description | Key flags |
| --- | --- | --- |
| Embedded (embedded) | Opens a local *NoKV.DB work directory. Commands (SET, SET NX/XX, EX/PX/EXAT/PXAT, MSET, INCR/DECR, DEL, MGET, EXISTS, …) run inside db.Update / db.View, providing atomic single-key updates and snapshot reads across multiple keys. | --workdir <dir> |
| Raft (raft) | Routes requests through raftstore/client and a TinyKv cluster. Writes execute via TwoPhaseCommit; TTL metadata is stored under !redis:ttl!<key>. When --tso-url is omitted, the gateway consults the tso block in raft_config.json and falls back to a local oracle if the block is absent. | --raft-config <file>; --tso-url http://host:port (optional) |

Usage examples

Embedded backend

go run ./cmd/nokv-redis \
  --addr 127.0.0.1:6380 \
  --workdir ./work_redis \
  --metrics-addr 127.0.0.1:9100  # optional expvar endpoint

Validate with redis-cli -p 6380 ping. Metrics are exposed at http://127.0.0.1:9100/debug/vars under the NoKV.Redis key.

Raft backend

  1. Start TinyKv and, if configured, the TSO using the helper script or Docker Compose. Both consume raft_config.example.json, initialise manifests for each store, and launch nokv-tso automatically when tso.listen_addr is present:

    ./scripts/run_local_cluster.sh
    # or: docker compose up --build
    
  2. Run the gateway:

    go run ./cmd/nokv-redis \
      --addr 127.0.0.1:6380 \
      --raft-config raft_config.example.json
    

    Supply --tso-url only when you need to override the config file; otherwise the gateway uses tso.advertise_url (or listen_addr) from the same JSON. If the block is missing, it falls back to the embedded timestamp oracle.

Supported commands

  • String operations: GET, SET, SET NX/XX, EX/PX/EXAT/PXAT, DEL, MGET, MSET, EXISTS
  • Integer operations: INCR, DECR, INCRBY, DECRBY
  • Utility: PING, ECHO, QUIT

In both modes write commands are atomic. The Raft backend batches multi-key updates (MSET, DEL, …) into a single TwoPhaseCommit, matching the embedded semantics. Reads use snapshot transactions locally (db.View) and leader reads with TTL checks remotely.
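
For a dependency-free smoke test you can speak RESP to the gateway directly; the snippet below hand-encodes SET and GET (any standard Redis client library works just as well).

package main

import (
    "bufio"
    "fmt"
    "net"
)

func main() {
    conn, err := net.Dial("tcp", "127.0.0.1:6380")
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    r := bufio.NewReader(conn)

    // SET greeting hello, encoded as a RESP array of bulk strings.
    fmt.Fprint(conn, "*3\r\n$3\r\nSET\r\n$8\r\ngreeting\r\n$5\r\nhello\r\n")
    reply, _ := r.ReadString('\n')
    fmt.Println("SET reply:", reply) // expect +OK

    // GET greeting
    fmt.Fprint(conn, "*2\r\n$3\r\nGET\r\n$8\r\ngreeting\r\n")
    length, _ := r.ReadString('\n') // bulk length, e.g. $5
    value, _ := r.ReadString('\n')  // payload
    fmt.Println("GET reply:", length, value)
}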

Configuration file

raft_config.example.json is shared by scripts/run_local_cluster.sh, Docker Compose, and the Redis gateway. Important fields:

  • stores – store ID, gRPC address, and optional container listen/advertise addresses
  • regions – region ID, start/end keys (use hex:<bytes> for binary data), epoch, peer list, leader store ID
  • max_retries – maximum retries for region errors in the distributed client

Use nokv-config to inspect or validate the configuration:

nokv-config stores --config raft_config.json
nokv-config regions --config raft_config.json --format json | jq '.[] | {id:.id, peers:.peers}'

For Go tooling, import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate to reuse the same schema and defaults across CLIs, scripts, and applications.

Metrics

With --metrics-addr enabled the gateway publishes NoKV.Redis on /debug/vars, for example:

{
  "commands_total": 128,
  "errors_total": 0,
  "connections_active": 1,
  "connections_accepted": 4,
  "commands_per_operation": {
    "PING": 4,
    "SET": 32,
    "GET": 64,
    "MGET": 8,
    "DEL": 10,
    "INCR": 10
  }
}

These counters are part of the process-wide expvar output and can be scraped alongside the rest of NoKV’s metrics.

Notes

Use this folder to capture per-debug or per-investigation notes. Keep entries short, factual, and easy to skim.

Add a new note

  1. Create a new file in docs/notes/ named YYYY-MM-DD-short-title.md.
  2. Add it to docs/SUMMARY.md under Notes.
  3. Use the template below to keep entries consistent.

Template

Context

Symptom

Repro

Investigation

Root cause

Fix

Follow-ups

2026-01-16 mmap choice

These are informal notes meant to spell out the reasoning behind the mmap choice, especially the definitions, usage scenarios, and read/write interaction of SSTable and VLog.

Concepts and positioning

SSTable is the core persistent file of the LSM tree: key-ordered, immutable, and composed of index blocks, data blocks, and filters, so it shows up on virtually every read path. VLog stores larger values and is written by sequential append; the LSM keeps only a value pointer and goes back to the VLog on read. In one sentence: SSTables are read-heavy and immutable, while the VLog is written sequentially but read randomly.

Read/write interaction

The diagram below shows the main write and read paths. The key point is that reads almost always touch SSTables, while the VLog participates only when the value is stored out of line.

flowchart LR
  W[Write] --> M[Memtable + WAL]
  M --> C{Value large?}
  C -- no --> I[Inline value]
  C -- yes --> V[Append to VLog]
  V --> P[Store ValuePtr]
  M --> F[Flush/Compaction]
  F --> S[SSTable]

  R[Read] --> Q[Memtable/LSM search]
  Q --> T{Inline?}
  T -- yes --> U[Return value]
  T -- no --> G[ValuePtr -> VLog read]
  G --> U

An intuitive comparison of the IO options

mmap’s core advantages are cheap random reads, few system calls, and reads that land directly on the OS page cache path. Its drawbacks are equally clear: RSS and page cache usage are hard to control, writes must handle msync semantics carefully, and cross-platform details differ quite a bit. By contrast, pread or buffered reads with a self-managed cache make memory and behaviour easier to control, but they add extra copies and system-call overhead. Direct I/O can bypass the page cache and avoid pollution, but it carries high engineering complexity and is not always faster for random reads.

Why SSTables suit mmap

SSTables are immutable and read frequently, so the mapping is stable and rarely needs remapping, which keeps mmap’s engineering cost low and its benefit obvious. Since the read path is dominated by random reads, mmap turns many of them into lightweight page faults that, together with the OS page cache, naturally serve hot data. Using mmap for SSTables is therefore a predictable and reasonable choice.

Why we also use mmap for the VLog

Our current implementation routes the VLog through mmap as well: the read path gets slices directly via Bytes/View, and writes append through the mmap buffer and rely on msync for durability. This keeps the implementation simple and consistent with the SSTable style. The cost is that VLog files tend to be larger, random reads are more scattered, the risk of page cache pollution is noticeably higher, and RSS fluctuates more easily. If the hot/cold distribution of values is unstable, the caching benefit from mmap may not offset these side effects.

Comparison with Badger’s approach

Badger leans towards mmap for SSTables while preferring FileIO or pread for its value log, precisely to reduce the impact of large files on the page cache and keep hot data concentrated in SSTable blocks. It also offers configurable modes, but the overall leaning reflects one principle: hotspots should be driven mainly by SSTables, and the value log should consume the page cache sparingly.

IO choices on Linux

On Linux we can combine several IO mechanisms: plain read/pread/write, mmap with madvise hints about access patterns, posix_fadvise or readahead for prefetch hints, O_DIRECT for direct I/O when finer-grained control is needed, or io_uring for asynchronous IO. The file package already contains a basic io_uring framework; stronger asynchronous reads/writes or concurrent scheduling can be built on top of it later.

Summary

The read-heavy, immutable nature of SSTables makes mmap a fairly safe default, whereas the VLog’s large files and random reads make mmap’s costs more visible. The current implementation favours engineering simplicity, but in the long run the VLog is probably better served by pread plus a small cache, with mmap kept as an optional mode once hotspots are stable.

2026-01-16 hotring design

Like the earlier mmap choice entry, this is a narrative note covering HotRing’s design motivation, its interaction flow, and my own understanding of it. It is not a theoretically optimal structure, but it is light, fast, and practical enough.

Design motivation

In an LSM system, hotspots are rarely uniform: a small set of keys keeps amplifying cache churn, read amplification, and write conflicts. HotRing’s role is to make those hotspots visible quickly, so that during monitoring, throttling, or tuning we can find the real heat sources instead of staring at a pile of blurry global metrics.

Interaction flow

HotRing does not change the read/write paths; it records access frequency as a side channel. Successful reads call Touch, and writes call TouchAndClamp when WriteHotKeyLimit is enabled. The stats subsystem periodically pulls TopN, and the CLI can display hotspots directly.

flowchart LR
  R[Read path] --> L[LSM lookup]
  L --> H[HotRing.Touch]
  W[Write path] --> C[HotRing.TouchAndClamp]
  H --> B[Bucket list update]
  C --> B
  B --> S[TopN snapshot]
  S --> X[Stats/CLI/Debug]

Example code

ring := hotring.NewHotRing(12, nil)
ring.EnableSlidingWindow(8, 250*time.Millisecond)
ring.EnableDecay(time.Second, 1)

ring.Touch("user:42")
count, limited := ring.TouchAndClamp("user:42", 128)
if limited {
    // log an alert here, or trigger write throttling
    _ = count
}

hot := ring.TopN(16)
_ = hot

Structural intuition and implementation choices

Internally HotRing is “fixed buckets + ordered linked lists”. A key is hashed to a bucket and then ordered within the bucket by tag + key. Reads are lock-free, and writes use CAS to splice nodes, avoiding the jitter of a global lock. It introduces no complex approximate structures, keeping the data structure deliberately simple so that it can sit on the read/write paths long-term without becoming a burden.

For time semantics it offers two mechanisms: the sliding window lets bursty hotspots surface quickly, while the decay mechanism lets historical hotspots fade naturally. Stacked together, the results match the intuition of “actual heat” much better.

Personal takeaways

The most interesting thing about HotRing is not that it is clever but that it is good enough and stable. It turns hotspots from invisible into visible without being so complex that it creates new hotspots of its own. What engineering often really needs is a hot-key detector that works quickly, not a structure that is theoretically prettier but more expensive.