
NoKV — Documentation

NoKV

An open-source namespace metadata substrate for distributed filesystems, object storage, and AI dataset metadata.

Native fsmeta primitives · Own LSM · Own Raft · Own MVCC · Own control plane


NoKV is the open-source counterpart of the “stateless schema layer + transactional KV” pattern that powers Meta Tectonic (over ZippyDB), Google Colossus (over Bigtable), and DeepSeek 3FS (over FoundationDB). The headline service is fsmeta, a namespace metadata API for distributed filesystems / object storage / AI dataset metadata.

The interesting part isn’t the feature list. The interesting part is that layer separation is enforced in code: the fsmeta executor consumes a narrow TxnRunner; the default OpenWithRaftstore adapter owns raftstore wiring; meta/root keeps only lifecycle / authority truth; the storage engine never learns that a namespace exists.
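
As a rough illustration of how small that seam can be, the sketch below shows one possible shape for such a runner. It is not the actual fsmeta TxnRunner definition; the Txn methods and names are assumptions for illustration only.

// Illustrative sketch only: the real fsmeta TxnRunner and Txn shapes may differ.
package sketch

import "context"

// Txn is a deliberately small per-transaction surface; method names are assumptions.
type Txn interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
	Delete(key []byte) error
}

// TxnRunner is the only storage-facing dependency the fsmeta executor would need
// in this sketch: run one transactional closure against whatever backend the
// adapter wires in (raftstore in distributed mode, an embedded DB in tests).
type TxnRunner interface {
	RunTxn(ctx context.Context, fn func(txn Txn) error) error
}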

This site is the technical docs hub. For the project landing page, headline benchmarks, and the Why NoKV vs X? matrix, see the root README.


🧭 Three Audiences, One Substrate

| | DFS frontend | Object storage namespace | AI dataset metadata |
|---|---|---|---|
| Consumer shape | FUSE / NFS / SMB driver | S3-compatible HTTP gateway | training pipeline / scheduler |
| fsmeta primitives used | ReadDirPlus, WatchSubtree, SnapshotSubtree, RenameSubtree | ReadDirPlus for LIST, WatchSubtree for bucket events, SnapshotSubtree for versions, RenameSubtree for prefix moves | SnapshotSubtree for dataset versions, WatchSubtree for checkpoint notification, ReadDirPlus for batch metadata fetch |
| Comparable industrial pattern | Tectonic / Colossus / 3FS / HopsFS | Tectonic / Colossus over object layer | Mooncake / Quiver / 3FS dataset layer |

All three consume the same rooted truth in meta/root and the same native primitives in fsmeta — schema is not specialized to any single consumer.

Deep dive: fsmeta positioning · namespace authority events umbrella


📑 If You Read Only Three Pages

Start here:

  1. fsmeta.md — namespace metadata service (the headline). Primitives, lifecycle authority, deployment.
  2. architecture.md — three-layer architecture. Where each module lives, what each layer is allowed to know.
  3. control_and_execution_protocols.md — the contract between control plane (coordinator/), execution plane (raftstore/), and rooted truth (meta/root/).

For the authority schema behind those primitives, read notes/2026-04-25-namespace-authority-events-umbrella.md.


🗺️ Read By Interest

🗂️ Namespace metadata service (fsmeta) — the primary product

| Topic | Doc |
|---|---|
| Complete reference (primitives + lifecycle + deployment) | fsmeta.md |
| Positioning v5 (DFS / OSS / AI three-audience) | notes/2026-04-24-fsmeta-positioning.md |
| Namespace authority events umbrella (Mount / SubtreeAuthority / SnapshotEpoch / QuotaFence schema) | notes/2026-04-25-namespace-authority-events-umbrella.md |
| Snapshot subtree MVCC epoch | notes/2026-04-25-snapshot-subtree-mvcc-epoch.md |
| Benchmark results | fsmeta.md · benchmark/fsmeta/results/ |

🔬 Correctness models

| Topic | Location |
|---|---|
| TLA+ / TLC models for control-plane and metadata transition safety | spec/ · spec/README.md |
| Checked artifacts | spec/artifacts/ |

🏛️ Distributed runtime — the layer below fsmeta

| Topic | Doc |
|---|---|
| Rooted truth kernel (meta/root) | rooted_truth.md |
| Coordinator (route / TSO / heartbeats / WatchRootEvents stream) | coordinator.md |
| Coordinator ↔ meta/root deployment separation | notes/2026-04-12-coordinator-meta-separation.md |
| Coordinator-driven store registry and rooted membership | coordinator.md · rooted_truth.md |
| Raftstore overview (store / peer / admin) | raftstore.md |
| Control-plane ↔ execution-plane contract | control_and_execution_protocols.md |
| Standalone → distributed migration | migration.md |
| Recovery model | recovery.md |
| Percolator MVCC 2PC + AssertionNotExist | percolator.md |
| Runtime call chains (sequence diagrams) | runtime.md |

🔧 Storage engine internals — the foundation

The single-node substrate that everything sits on. Independently usable as an embedded Go LSM + Raft library.

| Topic | Doc |
|---|---|
| High-level architecture | architecture.md |
| WAL discipline and replay | wal.md |
| MemTable + ART/SkipList (ART pinned for fsmeta) | memtable.md |
| Flush pipeline | flush.md |
| Leveled compaction + landing buffer | compaction.md · landing_buffer.md |
| Value log (KV separation + GC) | vlog.md |
| Manifest semantics | manifest.md |
| Range filter | range_filter.md |
| Block / row cache | cache.md |
| VFS abstraction + FaultFS | vfs.md · file.md |
| Hot-key observer (Thermos) | thermos.md |
| Entry / error model | entry.md · errors.md |

🛠️ Operations and tooling

| Topic | Doc |
|---|---|
| CLI reference (nokv — stats / manifest / regions / mount / quota / migrate) | cli.md |
| nokv-fsmeta standalone gRPC gateway | fsmeta.md |
| Configuration (one JSON file shared by all binaries) | config.md |
| Cluster demo | demo.md |
| Scripts layout | scripts.md |
| Stats / expvar / metrics (4 namespaces: executor, watch, quota, mount) | stats.md |
| Testing strategy (failpoints, chaos, restart, migration) | testing.md |

📒 Notable design decision records

All notes under notes/ are dated decision records — they explain the why, not just the what.


🏗️ Architecture at a Glance

NoKV Architecture

Layer 1  fsmeta            ← namespace primitives (Create / ReadDirPlus / WatchSubtree / RenameSubtree / SnapshotSubtree / Link / Unlink with link-count GC)
   │
Layer 2  meta/root         ← rooted authority truth (Mount / SubtreeAuthority / SnapshotEpoch / QuotaFence)
         coordinator       ← routing, TSO, store discovery, root-event publish + WatchRootEvents stream
         raftstore         ← per-region Raft + apply observer
         percolator        ← 2PC + MVCC + AssertionNotExist + commit-ts retry
   │
Layer 3  engine            ← LSM + ART memtable + WAL + value log (with per-CF/prefix value separation policy: fsm\x00 → AlwaysInline)

Four boundaries enforced in code:

  1. fsmeta-first API. Metadata operations expose filesystem/object-namespace shapes directly, instead of forcing users to assemble them from raw KV calls.
  2. Layer separation enforced. The fsmeta executor consumes a narrow TxnRunner; the default runtime adapter owns raftstore wiring; lower layers do not import fsmeta.
  3. Multi-gateway-safe. Quota fences are rooted truth; usage counters are data-plane keys updated in the same Percolator transaction as metadata mutations. Subtree handoff uses rooted events plus runtime repair.
  4. Root-event driven lifecycle. coordinator.WatchRootEvents pushes mount retire / quota fence / pending handoff changes after bootstrap; the configured monitor interval serves only as reconnect backoff.

⚡ Quick Start

Bring up a full cluster + register a mount + use fsmeta

# 1. Build binaries
make build

# 2. Launch full cluster: meta-root + coordinator + 3 stores + fsmeta gateway
./scripts/dev/cluster.sh --config ./raft_config.example.json
# (Or: docker compose up -d  — includes mount-init bootstrap)

# 3. Register a mount (rooted authority)
nokv mount register --coordinator-addr 127.0.0.1:2379 \
  --mount default --root-inode 1 --schema-version 1

# 4. (Optional) Set a quota fence
nokv quota set --coordinator-addr 127.0.0.1:2379 \
  --mount default --limit-bytes 10737418240 --limit-inodes 10000000

# 5. Use fsmeta from any gRPC client (Go typed client at fsmeta/client/)
#    or embedded Go: see fsmetaexec.OpenWithRaftstore in the root README

# 6. Inspect runtime state
curl http://127.0.0.1:9101/debug/vars | jq '.nokv_fsmeta_executor, .nokv_fsmeta_watch, .nokv_fsmeta_quota, .nokv_fsmeta_mount'
nokv stats --workdir ./artifacts/cluster/store-1

Full walkthrough: getting_started.md · CLI reference: cli.md


🔗 Jump Points

| Jump point | Description |
|---|---|
| fsmeta service | The headline product — namespace metadata API |
| Formal specs | TLA+ / TLC models for transition safety |
| CLI surface | nokv — stats, manifest, regions, mount, quota, migrate |
| Topology config | One JSON file shared by scripts, Docker, all CLI |
| Coordinator | Route / TSO / heartbeat / root-event subscribe |
| Rooted truth | meta/root typed event log |
| Percolator / MVCC | 2PC primitives in distributed mode |
| Runtime call chains | Function-level sequence diagrams |
| Testing | Failpoints, chaos, restart, migration |
| SUMMARY.md | Full mdbook table of contents |

Open-source namespace metadata substrate for DFS, OSS, and AI dataset metadata.
Built from scratch — no external storage engine, no external Raft library, no external coordinator.

Getting Started

This guide gets you from zero to a running NoKV cluster (or an embedded DB) in a few minutes.

Prerequisites

  • Go 1.26+
  • Git
  • (Optional) Docker + Docker Compose for containerized runs

Option A: Local cluster script

The launcher script below brings up the 333 separated layout: 3 replicated meta-root peers (Truth plane), 1 coordinator (Service plane), and the stores declared in the config (Execution plane).

./scripts/dev/cluster.sh --config ./raft_config.example.json

The launcher stays attached, streams meta-root/Coordinator/store logs to the terminal, and also writes them under ./artifacts/cluster/.

If you stop one store and want to restart it later, restart it against the same workdir:

./scripts/ops/serve-store.sh \
  --config ./raft_config.example.json \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1

On restart, NoKV recovers hosted peers from local metadata in the store workdir. The config file is only used to start stores and Coordinator. Runtime clients discover store addresses from Coordinator heartbeats. Do not rerun scripts/ops/bootstrap.sh or treat scripts/dev/cluster.sh as the restart path for an already-running store.

Inspect stats

go run ./cmd/nokv stats --workdir ./artifacts/cluster/store-1

Option B: Docker Compose

This runs the full cluster (3 meta-root + 3 coordinator + 3 store + fsmeta gateway) in containers with the published GHCR image.

docker compose up -d
docker compose logs -f

To force-refresh :latest before startup, use:

make docker-up

make docker-up pulls the published image first. If the GHCR package is not published or public yet, it falls back to a local Docker build.

For local Docker development builds from this checkout:

docker compose up -d --build

Local builds are tagged as the configured NoKV image. If you build locally and then want to return to the published :latest, run the pull command above. For reproducible runs, pin a published SHA tag:

NOKV_IMAGE_TAG=<commit-sha> docker compose up -d

Tear down:

docker compose down -v

Embedded Usage (single-process)

Use NoKV as a library when you do not need raftstore.

package main

import (
	"fmt"
	"log"

	NoKV "github.com/feichai0017/NoKV"
)

func main() {
	opt := NoKV.NewDefaultOptions()
	opt.WorkDir = "./workdir-demo"

	db, err := NoKV.Open(opt)
	if err != nil {
		log.Fatalf("open failed: %v", err)
	}
	defer db.Close()

	key := []byte("hello")
	if err := db.Set(key, []byte("world")); err != nil {
		log.Fatalf("set failed: %v", err)
	}

	entry, err := db.Get(key)
	if err != nil {
		log.Fatalf("get failed: %v", err)
	}
	fmt.Printf("value=%s\n", entry.Value)
}

Note:

  • DB.Get returns detached entries (do not call DecrRef).
  • DB.GetInternalEntry returns borrowed entries and callers must call DecrRef exactly once.
  • DB.SetWithTTL accepts time.Duration (relative TTL). DB.Set/DB.SetBatch/DB.SetWithTTL reject nil values; use DB.Del or DB.DeleteRange(start,end) for deletes. A short usage sketch follows this list.
  • DB.NewIterator exposes user-facing entries, while DB.NewInternalIterator scans raw internal keys (cf+user_key+ts).
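
A minimal sketch of those contracts, continuing the example above (add "time" to the imports; the exact argument shapes of SetWithTTL and GetInternalEntry are assumptions, not verified signatures):

// Hedged sketch: SetWithTTL and GetInternalEntry argument shapes are assumed.
if err := db.SetWithTTL([]byte("session"), []byte("state"), 5*time.Minute); err != nil {
	log.Fatalf("set with ttl failed: %v", err)
}

pub, err := db.Get([]byte("session")) // detached copy: never call DecrRef on it
if err != nil {
	log.Fatalf("get failed: %v", err)
}
fmt.Printf("value=%s\n", pub.Value)

internal, err := db.GetInternalEntry([]byte("session")) // borrowed: release exactly once
if err != nil {
	log.Fatalf("internal get failed: %v", err)
}
internal.DecrRef()

// Deletes go through Del / DeleteRange; Set with a nil value is rejected.
if err := db.Del([]byte("session")); err != nil {
	log.Fatalf("del failed: %v", err)
}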

Benchmarks

Micro benchmarks:

go test -bench=. -run=^$ ./...

YCSB (default: NoKV + Badger + Pebble, workloads A-F):

make bench

Override defaults with env vars:

YCSB_RECORDS=1000000 YCSB_OPS=1000000 YCSB_CONC=8 make bench

Detailed benchmark methodology and latest result snapshots are maintained in: benchmark/README.md.

Cleanup

If a local run crashes or you want a clean slate:

make clean

Troubleshooting

  • WAL replay errors after crash: wipe the workdir and restart the cluster.
  • Port conflicts: adjust addresses in raft_config.example.json.
  • Slow startup: reduce YCSB_RECORDS or YCSB_OPS when benchmarking locally.

NoKV Architecture Overview

NoKV delivers a hybrid storage engine that can operate as a standalone embedded KV store or as a distributed NoKV service. The distributed RPC surface follows a TinyKV/TiKV-style region + MVCC design, but the service identity and deployment model are NoKV’s own. This document captures the key building blocks, how they interact, and the execution flow from client to disk.

Read this page if you want the shortest route from “what is NoKV” to “which package owns which part of the system”.

This architecture is also meant to support NoKV as a maintainable and extensible distributed storage research platform. The point is not only to describe how the current system runs, but to make the package boundaries, lifecycle ownership, and experiment surfaces explicit enough that new storage-engine, metadata, control-plane, and distributed-runtime ideas can be added without rebuilding the repository around each new topic.

At a high level, the codebase is organized around four long-lived layers:

  • Root facade and runtime surface – the top-level DB APIs and thin system entrypoints.
  • Single-node engine substrate – engine/* owns WAL, LSM, manifest, value log, file, and VFS mechanics.
  • Distributed execution and control plane – raftstore/*, meta/*, and coordinator/* host replicated execution, rooted metadata, and cluster control logic.
  • Experiment and evidence layer – benchmark/*, scripts, and docs keep evaluation and design claims attached to the implementation.

Reader Map

  • If you care about the embedded engine, focus on sections 2 and 5.
  • If you care about distributed runtime ownership, focus on sections 3, 4, and 5.
  • If you care about migration and recovery, read this page together with migration.md and recovery.md.

1. High-Level Layout

┌─────────────────────────┐   NoKV gRPC     ┌─────────────────────────┐
│ raftstore Service       │◀──────────────▶ │ raftstore/client        │
└───────────┬─────────────┘                 │  (Get / Scan / Mutate)  │
            │                               └─────────────────────────┘
            │ ReadCommand / ProposeCommand
            ▼
┌─────────────────────────┐
│ store.Store / peer.Peer │  ← multi-Raft region lifecycle
│  ├ Local peer catalog   │
│  ├ Router / region catalog │
│  └ transport (gRPC)     │
└───────────┬─────────────┘
            │ Apply via kv.Apply
            ▼
┌─────────────────────────┐
│ kv.Apply + percolator   │
│  ├ Get / Scan           │
│  ├ Prewrite / Commit    │
│  └ Latch manager        │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Embedded NoKV core      │
│  ├ WAL Manager          │
│  ├ MemTable / Flush     │
│  ├ ValueLog + GC        │
│  └ Manifest / Stats     │
└─────────────────────────┘
  • Embedded mode uses NoKV.Open directly: WAL→MemTable→SST durability, ValueLog separation, non-transactional APIs with internal version ordering, and rich stats.
  • Distributed mode layers raftstore on top: multi-Raft regions reuse the same WAL, keep store-local recovery metadata separate from storage manifest state, expose metrics, and serve NoKV RPCs.
  • Control plane split: raft_config provides bootstrap topology; Coordinator provides runtime routing/TSO/control-plane state in cluster mode.
  • Clients obtain leader-aware routing, automatic NotLeader/EpochNotMatch retries, and two-phase commit helpers.

Same system, two shapes

flowchart LR
    App["App / CLI / fsmeta client"]
    App --> Embedded["Embedded NoKV DB"]
    App --> RPC["NoKV RPC / raftstore/client"]

    subgraph "Standalone shape"
        Embedded --> Core["WAL + LSM + VLog + MVCC"]
    end

    subgraph "Distributed shape"
        RPC --> Server["server.Node"]
        Server --> Store["store.Store"]
        Store --> Peer["peer.Peer"]
        Peer --> Core
        Store --> Coordinator["Coordinator"]
    end

    Embedded -. migrate init / seed .-> Store

Detailed Runtime Paths

For function-level call chains with sequence diagrams (embedded write/read, iterator scan, distributed read/write via Raft apply), see docs/runtime.md.


2. Embedded Engine

Code entry points

If you want to inspect the embedded side first, start here:

opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./workdir"

db, err := NoKV.Open(opt)
if err != nil {
    panic(err)
}
defer db.Close()

_ = db.Set([]byte("hello"), []byte("world"))
entry, _ := db.Get([]byte("hello"))
fmt.Println(string(entry.Value))

Then read:

  • db.go
  • engine/lsm/
  • engine/wal/
  • vlog.go

2.1 WAL & MemTable

  • wal.Manager appends [len|type|payload|crc] records (typed WAL), rotates segments, and replays logs on crash. A framing sketch follows this list.
  • MemTable accumulates writes until full, then enters the flush queue; the concrete flush runtime runs Enqueue → Build → Install → Release, logs edits, and releases WAL segments.
  • Writes are handled by a single commit worker that performs value-log append first, then WAL/memtable apply, keeping durability ordering simple and consistent.
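
For intuition, here is a sketch of the [len|type|payload|crc] framing; field widths, byte order, and checksum choice are assumptions rather than wal.Manager's actual on-disk layout.

// Illustrative framing only: not the real wal.Manager encoding.
package sketch

import (
	"encoding/binary"
	"hash/crc32"
)

func encodeRecord(recType byte, payload []byte) []byte {
	buf := make([]byte, 0, 4+1+len(payload)+4)
	buf = binary.LittleEndian.AppendUint32(buf, uint32(len(payload))) // len
	buf = append(buf, recType)                                        // type
	buf = append(buf, payload...)                                     // payload
	return binary.LittleEndian.AppendUint32(buf, crc32.ChecksumIEEE(buf)) // crc over len|type|payload
}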

2.2 ValueLog

  • Large values are written to the ValueLog before the WAL append; the resulting ValuePtr is stored in WAL/LSM so replay can recover.
  • vlog.Manager tracks the active head and uses flush discard stats to trigger GC; manifest records new heads and removed segments.

2.3 Manifest

  • manifest.Manager stores only storage-engine metadata: SST metadata, WAL checkpoints, and ValueLog metadata. Store-local raft replay pointers live in raftstore/localmeta.
  • CURRENT provides crash-safe pointer updates for storage-engine metadata. Region descriptors are no longer stored in the storage manifest.

2.4 LSM Compaction & Landing Buffer

  • lsm.compaction drives compaction cycles; lsm.levelManager supplies table metadata and executes the plan.
  • Planning is split inside lsm: PlanFor* selects table IDs + key ranges, then LSM resolves IDs back to tables and runs the merge.
  • lsm.State guards overlapping key ranges and tracks in-flight table IDs.
  • Landing shard selection is policy-driven in lsm (PickShardOrder / PickShardByBacklog) while the landing buffer remains in lsm.
flowchart TD
  Manager["lsm.compaction"] --> LSM["lsm.levelManager"]
  LSM -->|TableMeta snapshot| Planner["PlanFor*"]
  Planner --> Plan["lsm.Plan (fid+range)"]
  Plan -->|resolvePlanLocked| Exec["LSM executor"]
  Exec --> State["lsm.State guard"]
  Exec --> Build["subcompact/build SST"]
  Build --> Manifest["manifest edits"]
  L0["L0 tables"] -->|moveToLanding| Landing["landing buffer shards"]
  Landing -->|LandingDrain: landing-only| Main["Main tables"]
  Landing -->|LandingKeep: landing-merge| Landing

2.5 Distributed Transaction Path

  • percolator implements Prewrite/Commit/ResolveLock/CheckTxnStatus; kv.Apply dispatches raft commands to these helpers.
  • MVCC timestamps come from the distributed client/Coordinator TSO flow, not from an embedded standalone transaction API.
  • Watermarks (utils.WaterMark) are used in durability/visibility coordination; they have no background goroutine and advance via mutex + atomics.

2.6 Write Pipeline & Backpressure

  • Writes enqueue into a commit queue inside db.go where requests are coalesced into batches before a commit worker drains them.
  • The commit worker always writes the value log first (when needed), then applies WAL/LSM updates; SyncWrites adds a WAL fsync step.
  • Batch sizing adapts to backlog through WriteBatchMaxCount, WriteBatchMaxSize, and WriteBatchWait.
  • Backpressure is enforced in two places: LSM throttling toggles db.blockWrites when L0 backlog grows, and Thermos can reject hot keys via WriteHotKeyLimit. An options sketch follows this list.
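
A hedged example of those knobs, assuming they sit directly on the Options value returned by NewDefaultOptions (the "time" import is needed; values are illustrative, not recommendations):

// Hedged sketch: field placement and types are assumptions based on the option
// names above, not a verified Options definition.
opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./workdir-tuned"
opt.SyncWrites = true                     // add a WAL fsync step to each commit batch
opt.WriteBatchMaxCount = 256              // max coalesced requests per commit batch
opt.WriteBatchMaxSize = 4 << 20           // max coalesced bytes per commit batch
opt.WriteBatchWait = 2 * time.Millisecond // how long the worker waits to fill a batch
opt.WriteHotKeyLimit = 10000              // Thermos hot-key write rejection threshold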

2.7 Ref-Count Lifecycle Contracts

NoKV uses fail-fast reference counting for internal pooled/owned objects. DecrRef underflow is treated as a lifecycle bug and panics.

| Object | Owned by | Borrowed by | Release rule |
|---|---|---|---|
| kv.Entry (pooled) | internal write/read pipelines | codec iterator, memtable/lsm internal reads, request batches | Must call DecrRef exactly once per borrow. |
| kv.Entry (detached public result) | caller | none | Returned by DB.Get; must not call DecrRef. |
| kv.Entry (borrowed internal result) | caller | yes (DecrRef) | Returned by DB.GetInternalEntry; caller must release exactly once. |
| request | commit queue/worker | waiter path (Wait) | IncrRef on enqueue; Wait does one DecrRef; zero returns request to pool and releases entries. |
| table | level/main+landing lists, block cache | table iterators, prefetch workers | Removed tables are decremented once after manifest+in-memory swap; zero deletes SST. |
| Skiplist / ART index | memtable | iterators | Iterator creation increments index ref; iterator Close decrements; double-close is idempotent. |

3. Replication Layer (raftstore)

Code entry points

If you want to inspect the distributed side first, start here:

srv, err := server.NewNode(server.Config{
    Storage: server.Storage{MVCC: db, Raft: db.RaftLog()},
    Store: store.Config{StoreID: 1},
    Raft: myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
    TransportAddr: "127.0.0.1:20160",
})
if err != nil {
    panic(err)
}
defer srv.Close()

Then read:

  • raftstore/server/node.go
  • raftstore/store/store.go
  • raftstore/peer/peer.go
  • raftstore/raftlog/wal_storage.go
  • raftstore/localmeta/store.go
| Package | Responsibility |
|---|---|
| store | Region catalog/runtime root, router, RegionMetrics, scheduler + command runtimes, helpers such as StartPeer / SplitRegion. |
| peer | Wraps etcd/raft RawNode, handles Ready pipeline, snapshot resend queue, backlog instrumentation. |
| raftlog | WALStorage/DiskStorage/MemoryStorage, reusing the DB's WAL while keeping store-local raft replay metadata in sync. |
| transport | gRPC transport for Raft Step messages, connection management, retries/blocks/TLS. Also acts as the host for NoKV RPC. |
| kv | NoKV RPC handler plus kv.Apply bridging Raft commands to MVCC logic. |
| server | Config + NewNode combine DB, Store, transport, and NoKV service into a reusable node instance. |

3.1 Bootstrap Sequence

  1. server.NewNode wires DB, store configuration (StoreID, hooks, scheduler), Raft config, and transport address. It registers NoKV RPC on the shared gRPC server and sets transport.SetHandler(store.Step).
  2. CLI (nokv serve) or application enumerates the local peer catalog and calls Store.StartPeer for every Region containing the local store:
    • peer.Config includes Raft params, transport, kv.NewEntryApplier, peer storage, and Region metadata.
    • Router registration, regionManager bookkeeping, optional Peer.Bootstrap with initial peer list, leader campaign.
  3. Peers from other stores can be configured through transport.SetPeer(peerID, addr) (raft peer ID). In cluster mode, runtime routing/control-plane decisions come from Coordinator.

3.2 Command Paths

  • ReadCommand (KvGet/KvScan): validate Region & leader, execute Raft ReadIndex (LinearizableRead) and WaitApplied, then run commandApplier (i.e. kv.Apply in read mode) to fetch data from the DB. This yields leader-strong reads with an explicit Raft linearizability barrier.
  • ProposeCommand (write): encode the request, push through Router to the leader peer, replicate via Raft, and apply in kv.Apply which maps to MVCC operations.

3.3 Transport

  • gRPC server handles Step RPCs and NoKV RPCs on the same endpoint; peers are registered via SetPeer.
  • Retry policies (WithRetry) and TLS credentials are configurable. Tests cover partitions, blocked peers, and slow followers.

4. NoKV Service

raftstore/kv/service.go exposes pb.NoKV RPCs:

| RPC | Execution | Result |
|---|---|---|
| KvGet | store.ReadCommand → kv.Apply GET | pb.GetResponse / RegionError |
| KvScan | store.ReadCommand → kv.Apply SCAN | pb.ScanResponse / RegionError |
| KvPrewrite | store.ProposeCommand → percolator.Prewrite | pb.PrewriteResponse |
| KvCommit | store.ProposeCommand → percolator.Commit | pb.CommitResponse |
| KvResolveLock | percolator.ResolveLock | pb.ResolveLockResponse |
| KvCheckTxnStatus | percolator.CheckTxnStatus | pb.CheckTxnStatusResponse |

nokv serve is the CLI entry point—open the DB, construct server.Node, register peers, start local Raft peers, and display a local peer catalog summary (Regions, key ranges, peers). scripts/dev/cluster.sh builds the CLI, seeds local peer catalogs, and launches the 333 separated layout (3 meta-root peers + 1 coordinator + all configured stores) on localhost, handling cleanup on Ctrl+C.

The RPC request/response shape is intentionally close to TinyKV/TiKV so the MVCC and region semantics remain familiar, but the service name exposed on the wire is pb.NoKV.


5. Client Workflow

raftstore/client offers a leader-aware client with retry logic and convenient helpers:

  • Initialization: provide a Coordinator-backed RegionResolver (GetRegionByKey) and StoreResolver (GetStore) so runtime routing and store discovery are Coordinator-driven.
  • Reads: Get and Scan pick the leader store for a key range, issue NoKV RPCs, and retry on NotLeader/EpochNotMatch.
  • Writes: Mutate bundles operations per region and drives Prewrite/Commit (primary first, secondaries after); Put and Delete are convenience wrappers using the same 2PC path.
  • Timestamps: clients must supply startVersion/commitVersion. For distributed demos, use Coordinator (nokv coordinator) to obtain globally increasing values before calling TwoPhaseCommit.
  • Bootstrap helpers: scripts/dev/cluster.sh --config raft_config.example.json builds the binaries, seeds local peer catalogs via nokv-config catalog, launches the 3 meta-root peers + coordinator, and starts the stores declared in the config.

Example (two regions)

  1. Regions [a,m) and [m,+∞), each led by a different store.
  2. Mutate(ctx, primary="alfa", mutations, startTs, commitTs, ttl) prewrites and commits across the relevant regions (a hedged sketch follows this list).
  3. Get/Scan retries automatically if the leader changes.
  4. See raftstore/server/node_test.go for a full end-to-end example using real server.Node instances.
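
A hedged sketch of step 2: only the Mutate call shape is taken from the example above; the client constructor, mutation type, and TSO helper names below are assumptions, not the real raftstore/client API.

// Sketch only: see raftstore/server/node_test.go for the real end-to-end version.
cli := client.New(regionResolver, storeResolver) // Coordinator-backed resolvers (assumed constructor)

startTs := nextTSO(ctx)  // obtain from Coordinator TSO (assumed helper)
commitTs := nextTSO(ctx)

muts := []client.Mutation{ // assumed mutation shape
	{Op: client.OpPut, Key: []byte("alfa"), Value: []byte("1")}, // lands in region [a,m)
	{Op: client.OpPut, Key: []byte("mike"), Value: []byte("2")}, // lands in region [m,+inf)
}

// Primary is prewritten first, secondaries after, then commit per region.
if err := cli.Mutate(ctx, []byte("alfa"), muts, startTs, commitTs, ttl); err != nil {
	log.Fatalf("mutate failed: %v", err)
}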

6. Failure Handling

  • Manifest edits capture only storage metadata, WAL checkpoints, and ValueLog pointers. Store-local region recovery state and raft replay pointers are loaded from raftstore/localmeta.
  • WAL replay reconstructs memtables and Raft groups; ValueLog recovery trims partial records.
  • Stats.StartStats resumes metrics sampling immediately after restart, making it easy to verify recovery correctness via nokv stats.

7. Observability & Tooling

  • StatsSnapshot publishes flush/compaction/WAL/VLog/raft/region/hot/cache metrics. nokv stats and the expvar endpoint expose the same data.
  • nokv regions inspects the local peer catalog.
  • nokv serve advertises Region samples on startup (ID, key range, peers) for quick verification.
  • Inspect scheduler/control-plane state via Coordinator APIs/metrics.
  • Scripts:
    • scripts/dev/cluster.sh – launch a multi-node NoKV cluster locally.
    • RECOVERY_TRACE_METRICS=1 go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v – crash-recovery validation.
    • CHAOS_TRACE_METRICS=1 go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport – inject network faults and observe transport metrics.

8. When to Use NoKV

  • Embedded: call NoKV.Open, use the local non-transactional DB APIs.
  • Distributed: deploy nokv serve nodes, use raftstore/client (or any NoKV gRPC client) to perform reads, scans, and 2PC writes.
  • Observability-first: inspection via CLI or expvar is built-in; Region, WAL, Flush, and Raft metrics are accessible without extra instrumentation.

See also docs/raftstore.md for deeper internals, docs/coordinator.md for control-plane details, and docs/testing.md for coverage details.

Runtime Call Chains (Current)

This document focuses on the current execution paths in NoKV and maps API calls to concrete functions in the codebase.

It intentionally describes only what is running today.


1. API Surface Snapshot

| Mode | Read APIs | Write APIs | Txn APIs |
|---|---|---|---|
| Embedded (NoKV.DB) | Get, NewIterator, NewInternalIterator | Set, SetBatch, SetWithTTL, Del, DeleteRange, ApplyInternalEntries | N/A (no standalone local txn API) |
| Distributed (raftstore/kv) | KvGet, KvBatchGet, KvScan | N/A direct write | KvPrewrite, KvCommit, KvBatchRollback, KvResolveLock, KvCheckTxnStatus |

Core entry points for each path are listed in the function-level chains below.


2. Embedded Write Path (Set / SetBatch / SetWithTTL / Del / DeleteRange)

2.1 Function-Level Chain

  1. DB.Set / DB.SetBatch / DB.SetWithTTL / DB.Del / DB.DeleteRange allocates monotonic non-transactional versions and creates internal-key entries via kv.NewInternalEntry.
  2. DB.ApplyInternalEntries validates each internal key via kv.SplitInternalKey, then calls batchSet.
  3. batchSet enqueues request (sendToWriteCh -> enqueueCommitRequest -> bounded MPSC commit queue).
  4. commitWorker acquires a long-lived utils.MPSCConsumer[*commitRequest] and drains a batch:
    • vlog.write(requests) writes large values first and produces ValuePtr.
    • applyRequests -> writeToLSM -> lsm.SetBatch.
    • if SyncWrites uses the dedicated sync pipeline, committed-but-unsynced batches are handed off to syncWorker; otherwise commitWorker performs wal.Sync() inline when required.
  5. lsm.SetBatch writes one atomic batch by delegating to memTable.setBatch; inside that step:
    • wal.AppendEntryBatch
    • mem index insert.

2.2 Sequence Diagram

sequenceDiagram
    participant U as User API
    participant DB as DB.Set/SetBatch/SetWithTTL/Del/DeleteRange
    participant Q as "commitQueue (MPSCQueue)"
    participant W as commitWorker
    participant V as vlog.write
    participant L as lsm.SetBatch
    participant M as memTable.setBatch
    participant WAL as wal.AppendEntryBatch
    U->>DB: Set/SetBatch/SetWithTTL/Del/DeleteRange
    DB->>DB: NewInternalEntry + ApplyInternalEntries
    DB->>Q: sendToWriteCh / enqueueCommitRequest
    Q->>W: nextCommitBatch
    W->>V: write(requests)
    V-->>W: ValuePtr for large values
    W->>L: writeToLSM(entries)
    L->>M: setBatch(entries)
    M->>WAL: AppendEntryBatch(entries)
    M-->>L: index.Add(...)
    L-->>W: success

3. Embedded Read Path (Get / GetInternalEntry)

3.1 Function-Level Chain

  1. DB.Get builds InternalKey(CFDefault, userKey, nonTxnMaxVersion).
  2. loadBorrowedEntry calls lsm.Get for the newest visible internal record.
  3. If value is pointer (BitValuePointer), read real bytes via vlog.read, clear pointer bit.
  4. PopulateInternalMeta ensures CF/Version cache matches internal key.
  5. DB.Get returns detached public entry via cloneEntry (user key + copied value).
  6. DB.GetInternalEntry returns borrowed internal entry (caller must DecrRef).

3.2 Sequence Diagram

sequenceDiagram
    participant U as User API
    participant DB as DB.Get/GetInternalEntry
    participant LSM as lsm.Get
    participant VLOG as vlog.read
    U->>DB: Get(userKey)
    DB->>DB: InternalKey(CFDefault,userKey,nonTxnMaxVersion)
    DB->>LSM: Get(internalKey)
    LSM-->>DB: pooled Entry (internal key)
    alt BitValuePointer set
        DB->>VLOG: read(ValuePtr)
        VLOG-->>DB: raw value bytes
    end
    DB->>DB: PopulateInternalMeta
    alt Get (public)
        DB->>DB: cloneEntry -> user key/value copy
        DB-->>U: detached Entry
    else GetInternalEntry
        DB-->>U: borrowed internal Entry (DecrRef required)
    end

4. Iterator Paths

4.1 Public Iterator (DB.NewIterator)

  1. Build merged internal iterator: lsm.NewIterators + lsm.NewMergeIterator.
  2. Seek converts user key to internal seek key (CFDefault + nonTxnMaxVersion).
  3. populate/materialize:
    • parse internal key (kv.SplitInternalKey)
    • apply bounds on user key
    • optionally resolve vlog pointer
    • expose user-key item.

4.2 Internal Iterator (DB.NewInternalIterator)

  • Directly returns merged iterator over internal keys.
  • No user-key rewrite; caller handles kv.SplitInternalKey.
flowchart TD
  A["DB.NewIterator"] --> B["lsm.NewMergeIterator(internal keys)"]
  B --> C["Seek(userKey -> InternalKey)"]
  C --> D["populate/materialize"]
  D --> E["SplitInternalKey + bounds check"]
  E --> F{"BitValuePointer?"}
  F -- no --> G["Expose inline value"]
  F -- yes --> H["vlog.read(ValuePtr)"]
  H --> G
  G --> I["Item.Entry uses user key"]

5. Distributed Read Path (KvGet / KvBatchGet / KvScan)

5.1 Function-Level Chain

  1. raftstore/kv.Service builds RaftCmdRequest from NoKV RPC.
  2. Store.ReadCommand:
    • validateCommand (region/epoch/leader/key-range)
    • peer.LinearizableRead
    • peer.WaitApplied
    • commandApplier(req) (injected as kv.Apply).
  3. kv.Apply executes:
    • handleGet -> percolator.Reader.GetLock + GetValue
    • handleScan -> iterate CFWrite, resolve visible versions.

5.2 Sequence Diagram

sequenceDiagram
    participant C as NoKV Client
    participant SVC as kv.Service
    participant ST as Store.ReadCommand
    participant P as peer.LinearizableRead
    participant AP as kv.Apply
    participant R as percolator.Reader
    participant DB as DB
    C->>SVC: KvGet/KvBatchGet/KvScan
    SVC->>ST: ReadCommand(RaftCmdRequest)
    ST->>ST: validateCommand
    ST->>P: LinearizableRead + WaitApplied
    ST->>AP: commandApplier(req)
    AP->>R: GetLock/GetValue or scan CFWrite
    R->>DB: GetInternalEntry/NewInternalIterator
    AP-->>SVC: RaftCmdResponse
    SVC-->>C: NoKV response

6. Distributed Write Path (2PC via Raft Apply)

6.1 Function-Level Chain

  1. Client (raftstore/client) runs Mutate / TwoPhaseCommit by region.
  2. RPC layer (kv.Service) sends write commands through Store.ProposeCommand.
  3. Raft replication commits log entries; apply path invokes kv.Apply.
  4. kv.Apply dispatches to percolator.Prewrite/Commit/BatchRollback/ResolveLock/CheckTxnStatus.
  5. Percolator mutators call applyVersionedOps:
    • build entries via kv.NewInternalEntry
    • call db.ApplyInternalEntries
    • release refs (DecrRef).
  6. Storage then follows the same embedded write pipeline (MPSC commit queue -> vlog -> LSM/WAL, with optional sync handoff).

6.2 Sequence Diagram

sequenceDiagram
    participant CL as raftstore/client
    participant SVC as kv.Service
    participant ST as Store.ProposeCommand
    participant RF as Raft replicate/apply
    participant AP as kv.Apply
    participant TXN as percolator.txn
    participant DB as DB.ApplyInternalEntries
    CL->>SVC: KvPrewrite/KvCommit/...
    SVC->>ST: ProposeCommand
    ST->>RF: route to leader + replicate
    RF->>AP: apply committed command
    AP->>TXN: Prewrite/Commit/Resolve...
    TXN->>DB: applyVersionedOps -> ApplyInternalEntries
    DB-->>TXN: write result
    AP-->>SVC: RaftCmdResponse
    SVC-->>CL: NoKV response

7. Entry Ownership and Refcount Rules

| Source | Returned entry type | Key form | Caller action |
|---|---|---|---|
| DB.GetInternalEntry | Borrowed pooled | Internal key | Must call DecrRef() once |
| DB.Get | Detached copy | User key | Must not call DecrRef() |
| percolator.applyVersionedOps temporary entries | Borrowed pooled | Internal key | Always DecrRef() after ApplyInternalEntries |
| LSM.Get / memtable reads | Borrowed pooled | Internal key | Upstream owner must release |

8. Key/Value Shape by Stage

| Stage | Entry.Key | Entry.Value | Notes |
|---|---|---|---|
| User write before queue | Internal key (CF + user key + ts) | Raw user bytes | Built by NewInternalEntry |
| After vlog step | Internal key | Inline value or ValuePtr.Encode() | Pointer marked by BitValuePointer |
| LSM/WAL stored form | Internal key | Encoded value payload | Used by replay/flush/compaction |
| GetInternalEntry output | Internal key | Raw value bytes (pointer resolved) | Internal caller view |
| Get / public iterator output | User key | Raw value bytes | External caller view |

Control-Plane and Execution-Plane Protocols

This note defines NoKV’s protocol line for:

  • the control plane
  • the execution plane

and the next-stage evolution around them.

The current implementation status is intentionally minimal but no longer just a design sketch:

  • control-plane protocol v1 is implemented and exposed through Coordinator RPCs plus meta/root storage semantics
  • execution-plane protocol v1 is implemented as a store-local contract with a small admin diagnostics API surface

The point of this document is to keep those two lines coordinated instead of letting them drift into separate, implicit rule sets.

The control plane focuses on the contract between:

  • meta/root
  • coordinator

The execution plane focuses on the contract between:

  • coordinator
  • raftstore
  • local durable state (raftstore/localmeta, raft log, restart replay)

The purpose of this document is not to replace Raft or redesign the data plane. The purpose is to make NoKV’s existing cross-plane behavior explicit, testable, and evolvable.

The control plane is protocolized around four ideas:

  • Freshness
  • CatchUp
  • Transition
  • DegradedMode

These four ideas already existed in partial form inside the implementation. The current work turns them into a stable vocabulary, explicit invariants, and a clear rollout line.

The execution plane is protocolized around four matching ideas:

  • Admission
  • ExecutionTarget
  • PublishBoundary
  • RestartState

0. Current Status

The control plane now has a minimal implemented v1.

Implemented and exposed through pb/coordinator/coordinator.proto, coordinator/server, coordinator/rootview, and tests:

  • route-read Freshness
  • RootToken
  • root_lag
  • DegradedMode
  • CatchUpState
  • TransitionID
  • PublishRootEventResponse.assessment as a pre-persist lifecycle assessment

This means the protocol is no longer only a design direction. It is already the formal serving contract for key Coordinator APIs.

Not implemented in v1:

  • richer transition phases such as Published / Stalled
  • a fuller catch-up action surface exposed through API
  • automatic recovery policy derived from protocol state
  • broad client-side policy that consumes every protocol field

So the right description today is:

control-plane protocol v1 is implemented and in use, while richer scheduler/runtime policy is not implemented in v1.

The execution plane is in a different state.

Today, raftstore has a minimal implemented v1 inside raftstore/store.

Already implemented and exercised through store-local types, raftstore/admin, runtime state, and tests:

  • explicit Admission classes and reasons on read / write / topology entry points
  • explicit topology ExecutionOutcome
  • explicit topology PublishState
  • explicit RestartState derived from raftstore/localmeta + raft replay pointers
  • terminal publish failures retained as visible retry state instead of silent drop
  • admin diagnostics exposure through pb/admin/admin.proto ExecutionStatus

Not implemented as first-class execution protocol fields yet:

  • request validation and routing
  • context propagation
  • detailed local leader admission diagnostics
  • detailed per-attempt scheduler retry/backoff policy
  • metrics for planned truth -> execute -> terminal truth latency
  • richer degraded local scheduler states

The current landing is still mostly store-local and spread across:

  • raftstore/store
  • raftstore/peer
  • raftstore/raftlog
  • raftstore/localmeta

So the right description there is:

execution-plane protocol v1 now exists as a minimal named runtime contract, with store-local state and admin-visible diagnostics, while broader metrics, policy, and richer executor states are not implemented in v1.


1. Intent

NoKV already has the right building blocks:

  • rooted truth events
  • checkpoint + committed tail
  • watch-first tail subscription
  • rebuildable coordinator/catalog
  • explicit planned and terminal topology events

Before v1, these pieces mostly existed as implementation mechanics. The control plane now has a formal minimum contract, while several policy extensions remain intentionally outside v1:

  • when a follower read is fresh enough
  • when a follower must reload
  • when retained tail catch-up is no longer enough
  • what phase a topology change is in
  • what “degraded” actually means to callers

The design goal is to keep turning these implicit behaviors into a formal protocol.

That protocol should be:

  • small
  • explicit
  • observable
  • testable
  • compatible with the current architecture

2. Scope

This document covers both planes, but not at the same implementation depth.

For the control plane, it defines the behavior of:

  • rooted truth consumption
  • control-plane view freshness
  • rooted catch-up progression
  • topology transition lifecycle
  • degraded operating modes

For the execution plane, it defines the protocol direction for:

  • request admission
  • transition execution
  • terminal truth publication
  • restart and local recovery alignment
  • degraded local behavior around scheduler / queue / publish boundaries

It does not redefine:

  • Raft replication
  • Percolator / 2PC transaction semantics
  • store-local recovery metadata
  • storage-engine internals unrelated to distributed lifecycle

This document should be read as two linked contracts:

control plane = durable truth + materialized view + serving contract

execution plane = admitted work + local execution + publish/restart contract


3. Protocol Objects

The naming set should remain compact, stable, and precise.

3.1 RootToken

RootToken is the rooted truth position already incorporated by some materialized view.

It is the control-plane equivalent of:

  • “what truth have I already consumed?”

It should be treated as:

  • monotonic
  • comparable
  • portable across control-plane nodes

RootToken is not just an internal storage cursor. It is the anchor for:

  • freshness
  • catch-up state
  • read eligibility
  • transition causality

3.2 Freshness

Freshness is the serving contract attached to a read.

It answers:

  • how fresh did the caller ask for?
  • how fresh was the returned answer?

3.3 CatchUpState

CatchUpState describes how far one Coordinator node has converged on rooted truth.

It answers:

  • can this node serve route reads?
  • can it satisfy bounded-freshness reads?
  • must it reload?
  • must it install bootstrap?

3.4 Transition

Transition is one rooted topology change that moves through a formal lifecycle.

Examples:

  • peer addition
  • peer removal
  • region split
  • region merge
  • region tombstone

Transition is not just a single event. It is a causally tracked change with:

  • identity
  • source truth position
  • phase
  • progress

3.5 DegradedMode

DegradedMode is the externally visible restriction level of the control plane.

It answers:

  • what kind of reads may still be served?
  • are rooted writes currently allowed?
  • should clients retry elsewhere?
  • is the node usable only as a stale view?

4. Naming Set

The protocol should use one stable vocabulary across:

  • API
  • code
  • logs
  • metrics
  • tests
  • docs

4.1 Read classes

  • Strong
    • requires leader-grade freshness
  • Bounded
    • allows follower service within explicit lag limits
  • BestEffort
    • allows stale cache service

These names are short and carry clear serving intent.

4.2 Catch-up actions

  • Reload
    • rebuild catalog from rooted storage
  • Advance
    • acknowledge rooted tail progress without a full rebuild
  • Bootstrap
    • install a fresh checkpoint because retained tail is insufficient
  • Reject
    • deny freshness-sensitive reads until convergence improves

4.3 Catch-up states

  • Fresh
  • Lagging
  • BootstrapRequired
  • Recovering
  • Unavailable

4.4 Degraded modes

  • Healthy
  • CoordinatorDegraded
  • RootLagging
  • RootUnavailable
  • ViewOnly

ViewOnly is deliberately chosen over more vague names like ExecutionOnly. This section only defines control-plane behavior, so the right question is:

can this node still expose a stale view?


5. Freshness Contract

The control plane should stop treating all successful reads as equivalent.

Every control-plane read should:

  1. declare the requested freshness class
  2. optionally declare a rooted lower bound
  3. receive an explicit served freshness result

5.1 Why this matters

Today, follower reads are effectively:

“good enough if the follower recently reloaded and is not too far behind”

That is practical, but not a protocol.

Without a formal freshness contract:

  • clients cannot reason about route read quality
  • tests cannot assert serving guarantees precisely
  • degraded modes remain guesswork
  • control-plane correctness is partly hidden in implementation details

5.2 Request fields

Control-plane read RPCs should be able to express:

  • freshness
    • Strong, Bounded, or BestEffort
  • required_root_token
    • optional lower bound on rooted truth already incorporated
  • max_root_lag
    • optional bound on acceptable rooted lag

Not every caller will need all three fields. But the protocol should have room for them.

5.3 Response fields

Control-plane read RPCs should return:

  • served_root_token
  • served_freshness
  • served_by_leader
  • degraded_mode

Optional future fields:

  • root_lag
  • freshness_reason

5.4 Serving rules

Strong

Should be served only when:

  • the node is rooted leader
  • and the serving catalog has incorporated at least the requested RootToken

If this is not true, the server should reject rather than silently downgrade.

Bounded

May be served by a follower when:

  • the node is not in BootstrapRequired, Recovering, or Unavailable
  • and lag is within declared bounds
  • and the served token satisfies required_root_token if one was requested

If bounds cannot be satisfied, the server should reject rather than silently serve stale data.

BestEffort

May be served from the current materialized catalog so long as:

  • the catalog exists
  • the node is not fully unavailable

This class exists to make stale service explicit instead of accidental.
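
Taken together, the three rules reduce to a small admission decision. The sketch below is illustrative Go, not the Coordinator's implementation; type names, string-typed enums, and fields are assumptions.

// Illustrative only: mirrors the serving rules above; reject rather than silently downgrade.
package sketch

type readRequest struct {
	Freshness         string // "Strong", "Bounded", or "BestEffort"
	RequiredRootToken uint64 // optional lower bound; 0 means none
	MaxRootLag        uint64 // only meaningful for Bounded
}

type nodeView struct {
	IsLeader     bool
	CatalogToken uint64 // rooted truth already incorporated by this node's catalog
	RootLag      uint64
	CatchUpState string // Fresh / Lagging / BootstrapRequired / Recovering / Unavailable
	HasCatalog   bool
}

func admitRead(req readRequest, v nodeView) bool {
	switch req.Freshness {
	case "Strong":
		return v.IsLeader && v.CatalogToken >= req.RequiredRootToken
	case "Bounded":
		if v.CatchUpState == "BootstrapRequired" || v.CatchUpState == "Recovering" || v.CatchUpState == "Unavailable" {
			return false
		}
		return v.RootLag <= req.MaxRootLag && v.CatalogToken >= req.RequiredRootToken
	default: // BestEffort: stale service is allowed, but only explicitly
		return v.HasCatalog && v.CatchUpState != "Unavailable"
	}
}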

5.5 First rollout target

The first RPC that should adopt this contract is:

  • GetRegionByKey

That gives the system a clear, high-value place to prove the model before wider rollout.


6. Rooted Catch-Up Protocol

NoKV already has a good catch-up foundation:

  • checkpoint
  • committed tail
  • watch-first subscription
  • bootstrap install when retained tail is insufficient

The next step is to give that behavior a formal state machine.

6.1 Catch-up state definitions

Fresh

The node’s materialized catalog is sufficiently close to rooted truth to serve:

  • Bounded
  • BestEffort

and, if leader, possibly Strong.

Lagging

The node is behind, but still within retained-tail recovery range.

This means:

  • further rooted tail observation may repair the gap
  • bootstrap install is not yet mandatory
  • some bounded reads may need to be rejected

BootstrapRequired

The node is too far behind for retained tail replay.

This means:

  • a plain reload from retained tail is not sufficient
  • a new checkpoint/bootstrap install is required
  • freshness-sensitive reads should be rejected

Recovering

The node is actively rebuilding its materialized control-plane view.

This means:

  • catalog may be in transition
  • only explicitly allowed stale reads may be served

Unavailable

The node cannot presently produce a valid control-plane view.

This means:

  • no rooted freshness contract can be satisfied
  • the server should fail reads except possibly future explicit diagnostics

6.2 Catch-up actions

Reload

Used when rooted truth advanced in a way that requires rebuilding the materialized catalog.

Advance

Used when rooted tail progressed, but the catalog does not need a full rebuild.

Bootstrap

Used when the node must install a checkpoint because retained tail can no longer bridge the gap.

Reject

Used when the node should refuse freshness-sensitive serving until it converges further.

6.3 Protocol outputs

The rooted subscription path should eventually expose a structured result like the one below (a Go sketch follows this list):

  • root_token_before
  • root_token_after
  • catch_up_state
  • catch_up_action
  • reload_required
  • bootstrap_required
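
An illustrative Go shape for that result, with field names taken from the list above (the type itself is a sketch, not an existing API):

// Illustrative shape for the rooted subscription result described above.
package sketch

type CatchUpResult struct {
	RootTokenBefore   uint64
	RootTokenAfter    uint64
	CatchUpState      string // Fresh / Lagging / BootstrapRequired / Recovering / Unavailable
	CatchUpAction     string // Reload / Advance / Bootstrap / Reject
	ReloadRequired    bool
	BootstrapRequired bool
}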

6.4 Why protocolizing this matters

Without explicit catch-up semantics:

  • tests can only assert indirect effects
  • follower-read serving policy stays implicit
  • degraded-mode logic gets duplicated
  • future clients cannot reason about retries properly

This is one of the strongest places for NoKV to become distinctive.


7. Transition Lifecycle Protocol

NoKV already records rooted topology intent and rooted completion. That is the start of a lifecycle, not yet the full protocol.

The next stage is to make transition tracking first-class.

7.1 Transition identity

Every topology transition should have a stable TransitionID.

TransitionID should be:

  • deterministic
  • durable
  • safe to log, surface, and test against

It should not require callers to infer identity from region ID, event kind, or timing alone.

7.2 Transition source

Every transition should record:

  • source rooted epoch or token
  • target topology intent
  • the event that created it

This makes causality explicit:

  • what truth position created this transition?
  • what later truth position superseded it?

7.3 Phase definitions

Planned

The rooted lifecycle assessment says the transition exists as an intended topology change, but the scheduler/control-plane runtime has not yet admitted it for forward progress.

This is the phase used by:

  • AssessRootEvent
  • PublishRootEventResponse.assessment

Admitted

The rooted transition is currently pending or open, and the scheduler/control-plane runtime has admitted it for execution progress.

This is the phase used by:

  • ListTransitions

It is intentionally runtime-facing. It does not appear in PublishRootEventResponse.assessment, because that response reports a pre-persist lifecycle assessment rather than post-admission runtime state.

Completed

The rooted lifecycle says the requested transition target is already satisfied. For a plan event, this usually means the requested topology is already present.

Cancelled

The rooted lifecycle says the requested transition target was cancelled.

Conflicted

The rooted lifecycle says a different pending transition already owns progress for the same target.

Superseded

The rooted lifecycle says a newer rooted topology already superseded this transition target.

Aborted

The rooted lifecycle says an apply or terminal event does not match the current pending rooted target.
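
The phases form a small closed set; an illustrative Go enumeration (not the actual pb or internal type) makes the vocabulary concrete:

// Illustrative enumeration of the transition phases defined above.
package sketch

type TransitionPhase int

const (
	PhasePlanned    TransitionPhase = iota // rooted intent exists; not yet admitted by the runtime
	PhaseAdmitted                          // pending/open and admitted for execution progress
	PhaseCompleted                         // requested target is already satisfied
	PhaseCancelled                         // requested target was cancelled
	PhaseConflicted                        // a different pending transition owns the same target
	PhaseSuperseded                        // a newer rooted topology superseded this target
	PhaseAborted                           // apply/terminal event does not match the pending target
)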

7.4 Why lifecycle matters

A formal lifecycle enables:

  • clear scheduling decisions
  • proper retry/backoff
  • stuck transition recovery
  • scheduler/control-plane runtime clarity
  • precise testing around publish boundaries

Without it, the system keeps relying on partial signals scattered across:

  • rooted events
  • in-memory views
  • runtime heuristics

8. Degraded Semantics

NoKV already has some degraded behavior:

  • followers serve stale route views
  • route cache may survive Coordinator outages
  • scheduler paths may degrade

These behaviors should become explicit protocol states.

8.1 Mode definitions

Healthy

Normal serving mode.

Rooted truth, catalog freshness, and serving guarantees are all within policy.

CoordinatorDegraded

The Coordinator process is alive, but not all control-plane functions can be performed normally.

Examples:

  • partial RPC surface availability
  • write restrictions while leadership is unsettled

RootLagging

Rooted truth exists, but this node’s materialized catalog is behind allowed freshness bounds.

This is not full unavailability. It is a serving restriction mode.

RootUnavailable

The rooted backend cannot currently provide enough truth to support valid control-plane service.

In this mode:

  • truth-sensitive reads fail
  • rooted writes fail
  • diagnostics may still be exposed

ViewOnly

The node may still expose a stale materialized catalog, but cannot satisfy freshness-sensitive contracts.

This mode is useful because it makes “stale but useful” explicit.

8.2 Why this should be formal

Without explicit degraded modes, callers only see:

  • transport failure
  • not leader
  • route unavailable

Those errors do not express the actual system state.

A real degraded protocol lets callers answer:

  • retry elsewhere?
  • retry later?
  • accept stale?
  • fail fast?

8.3 Relationship to freshness

DegradedMode and Freshness are related but not identical.

  • Freshness is the contract requested and served for one read
  • DegradedMode is the broader operating condition of the serving node

A node may be:

  • Healthy and still reject a Strong read because it is not leader
  • RootLagging and still serve BestEffort
  • ViewOnly and still serve diagnostics

That distinction should remain sharp.
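
A hedged sketch of how a caller could consume the two signals together: the response field names follow sections 5.3 and 8.4, while the decision policy itself is only an example, not prescribed client behavior.

// Illustrative client-side policy: freshness is per-read, degraded mode is per-node.
package sketch

type routeReply struct {
	ServedFreshness string // freshness actually served for this read
	DegradedMode    string // Healthy / CoordinatorDegraded / RootLagging / RootUnavailable / ViewOnly
	ServedByLeader  bool
}

// nextStep answers the four questions above: retry elsewhere, retry later,
// accept stale, or use the result.
func nextStep(r routeReply, acceptStale bool) string {
	switch r.DegradedMode {
	case "RootUnavailable":
		return "retry later" // truth-sensitive service cannot succeed on this node right now
	case "RootLagging", "ViewOnly":
		if acceptStale && r.ServedFreshness == "BestEffort" {
			return "accept stale" // stale but explicit
		}
		return "retry elsewhere"
	default:
		return "use result"
	}
}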

8.4 Current coordinator contract

The current implementation already enforces a concrete degraded-mode contract at the Coordinator RPC boundary.

Metadata reads (GetRegionByKey)

  • Freshness=BEST_EFFORT
    • serves from the local materialized catalog even when meta/root is currently unavailable
    • returns degraded_mode=ROOT_UNAVAILABLE when the rooted snapshot cannot be reloaded
    • returns degraded_mode=ROOT_LAGGING when the local catalog trails rooted truth
  • Freshness=BOUNDED
    • rejects when meta/root is unavailable
    • rejects when root_lag > max_root_lag
    • rejects when catch-up is still BOOTSTRAP_REQUIRED
  • Freshness=STRONG
    • rejects on followers
    • rejects whenever root_lag > 0
    • rejects when meta/root is unavailable

In all cases, successful replies carry the current answerability witness:

  • served_root_token
  • current_root_token
  • root_lag
  • catch_up_state
  • degraded_mode
  • serving_class
  • sync_health

Duty-gated writes (AllocID, TSO, scheduler decisions)

These do not have a degraded fallback.

  • the local coordinator must first campaign / renew the rooted lease
  • the rooted lease must still be active for the local holder
  • the rooted era must not already be sealed
  • the rooted duty mask must admit the requested action

If any of those fail, the request is rejected instead of falling back to stale local state. This is the current boundary between:

  • read-path degradation
  • write-path fail-stop admission

Lifecycle mutations (Seal, Confirm, Close, Reattach)

Lifecycle mutations are stricter than hot-path duty admission:

  • they always re-read rooted state from storage before mutating
  • they reject any stale-holder / expired-lease / sealed-era view
  • they treat finality as a rooted safety condition, not a best-effort hint

That is why seal / confirm / close / reattach do not use the cached mirror admission path.

8.5 Operational diagnostics

DiagnosticsSnapshot() now exports both:

  • the current degraded serving state (root, lease, audit, handover_witness)
  • cumulative Eunomia counters under eunomia_metrics

eunomia_metrics is grouped into:

  • tenure_era_transitions_total
  • handover_stage_transitions_total
  • gate_rejections_total
  • guarantee_violations_total

The guarantee_violations_total buckets map directly to the four Eunomia guarantees:

  • primacy
  • inheritance
  • silence
  • finality

9. API Direction

The most valuable first implementation step is at the Coordinator RPC boundary.

9.1 Read-side API direction

Read APIs should conceptually grow:

  • freshness
  • required_root_token
  • max_root_lag

Read responses should conceptually expose:

  • served_root_token
  • served_freshness
  • degraded_mode
  • served_by_leader

9.2 Write-side API direction

Leader-only writes should remain leader-only.

Write requests should continue to require:

  • rooted leadership
  • expected cluster epoch where applicable

Write responses should eventually expose:

  • accepted_root_token
  • transition_id where topology change is involved

This makes a write result more precise than:

  • accepted = true

9.3 Diagnostics API direction

The control plane will likely also benefit from an explicit diagnostics surface.

Conceptually, that should expose:

  • current rooted token
  • catalog rooted token
  • catch-up state
  • degraded mode
  • leader identity knowledge
  • lag estimate

This may become:

  • a dedicated diagnostics RPC
  • metrics
  • CLI output

or all three.


10. Storage and Catalog Direction

To support the protocol above, the Coordinator catalog should become rooted-token aware.

At minimum, the materialized control-plane view should track:

  • catalog_root_token
  • catalog_updated_at
  • catch_up_state
  • degraded_mode

Optional future metadata (a struct sketch follows this list):

  • root_lag
  • last_reload_reason
  • leader_observed
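
An illustrative Go shape for that per-node view metadata, with field names mirroring the lists above (the struct is a sketch, not the current coordinator/catalog type):

// Sketch of rooted-token-aware catalog metadata.
// Derived serving state only; never a second durable truth source.
package sketch

import "time"

type catalogMeta struct {
	CatalogRootToken uint64
	CatalogUpdatedAt time.Time
	CatchUpState     string
	DegradedMode     string

	// Optional future fields from the list above.
	RootLag          uint64
	LastReloadReason string
	LeaderObserved   bool
}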

10.1 Ownership rule

This design does not change truth ownership.

The ownership line remains:

  • meta/root owns durable truth
  • coordinator/catalog owns materialized serving state

The catalog should become more informative, not more authoritative.

10.2 Materialization rule

The catalog must remain:

  • rebuildable
  • discardable
  • follower-local

It should never become a second durable truth source.

That is a core invariant.


11. Invariants

This protocol should preserve the following invariants.

11.1 Truth ownership invariant

Only meta/root owns durable control-plane truth.

11.2 Materialization invariant

coordinator/catalog is always derived state, never authority.

11.3 Monotonic token invariant

The materialized rooted token of one node must never move backward.

11.4 No silent downgrade invariant

If a caller requests Strong or bounded freshness and the node cannot satisfy it, the server should reject rather than silently serve BestEffort.

11.5 Explicit stale service invariant

If stale service is allowed, the response should say so explicitly.

11.6 Transition identity invariant

Every control-plane transition must be referencable as a stable object, not just inferred from event timing.


12. Rollout State

The rollout stays incremental, but the first protocol line is already in use.

Phase 1: Freshness

Status: implemented

Delivered outcomes:

  • GetRegionByKey can express requested freshness
  • route responses disclose served freshness and rooted token
  • follower-read behavior is no longer implicit

Phase 2: Catch-Up

Status: minimal v1 implemented

Delivered outcomes:

  • CatchUpState
  • formal bootstrap-required boundary
  • rooted lag awareness in serving decisions

Still open:

  • a wider public CatchUpAction surface
  • more explicit recovery diagnostics

Phase 3: Transition

Status: minimal v1 implemented

Delivered outcomes:

  • durable TransitionID
  • explicit phase semantics across:
    • ListTransitions
    • AssessRootEvent
    • PublishRootEvent
  • publish-time pre-persist lifecycle assessment

Still open:

  • richer runtime phases
  • stuck / timeout diagnosis

Phase 4: DegradedMode

Status: minimal v1 implemented

Delivered outcomes:

  • explicit degraded semantics in route responses
  • route-serving rejection under rooted lag / rooted unavailability

Still open:

  • broader surfacing through metrics and diagnostics
  • tighter client retry policy based on degraded state

13. What Not To Do

The following are intentionally out of scope for this line of work:

  • inventing a new general-purpose consensus algorithm
  • replacing Raft in the mainline system
  • redesigning 2PC before control-plane semantics are explicit
  • collapsing rooted truth and catalog into one mixed layer
  • treating stale follower service as an undocumented optimization

NoKV’s control-plane innovation should come from stronger semantics and clearer ownership, not from unnecessary reinvention of already mature primitives.


14. Current Practical Naming Guidance

If this protocol starts landing in code, the implementation should prefer:

  • RootToken
  • Freshness
  • CatchUpState
  • CatchUpAction
  • TransitionID
  • DegradedMode

For execution-plane work, prefer:

  • Admission
  • ExecutionTarget
  • ExecutionOutcome
  • PublishState
  • RestartState

Avoid reintroducing weaker names like:

  • state kind
  • stale mode
  • sync status
  • reload reason as the primary protocol object

Those may still exist as helper fields, but the public model should stay anchored to the smaller protocol vocabulary above.


15. Execution-Plane Protocol

The execution plane is the contract between:

  • raftstore
  • local leader peer runtime
  • local durable recovery state
  • the control-plane publish boundary

Its job is different from the control plane.

The control plane answers:

  • what topology truth exists?
  • how fresh is the served view?
  • what transition lifecycle is visible globally?

The execution plane answers:

  • may this request enter local execution now?
  • what target is being executed?
  • how far has local execution progressed?
  • has terminal truth been published yet?
  • what state is safe to recover after restart?

15.1 Why this matters

Without an explicit execution-plane protocol, the system keeps important distributed safety semantics hidden in code paths such as:

  • request validation and cancellation
  • queue admission and local degradation
  • planned truth publication before local execution
  • terminal truth publication after local apply
  • restart reconciliation between localmeta, raft durable state, and Coordinator

Those are not low-level implementation details. They are correctness boundaries.

15.2 Protocol objects

The execution plane should be formalized around the following objects.

Admission

Admission is the local decision about whether one request may enter execution.

It should answer:

  • is the local peer leader?
  • is the region epoch valid?
  • is the peer hosted and runnable?
  • is the request cancelled or timed out already?
  • is the queue or scheduler allowed to accept more work?

The important design rule is that admission must be explicit, not an accidental mix of local checks and fallback retries.

ExecutionTarget

ExecutionTarget is the concrete unit of work the execution plane is trying to carry out.

Examples:

  • one read command
  • one raft write proposal
  • one peer change target
  • one split target
  • one merge target

For topology changes, ExecutionTarget must remain causally tied to the rooted transition object created by the control plane.

ExecutionOutcome

ExecutionOutcome is the local state reached by an admitted target.

Minimal useful states are:

  • Rejected
  • Queued
  • Proposed
  • Committed
  • Applied
  • Failed

This is the minimum needed to stop conflating “accepted by API”, “replicated by raft”, and “applied to local state”.

PublishState

PublishState tracks the boundary between local apply and control-plane truth publication.

This is a first-class boundary in NoKV’s architecture:

  • planned truth is published before execution
  • terminal truth is published after local apply

The protocol must therefore distinguish:

  • NotRequired
  • Pending
  • Published
  • PublishFailed

This is the exact boundary where split/merge/peer-change correctness otherwise turns into invisible best-effort behavior.

RestartState

RestartState describes whether one store can safely resume from local durable state.

It should answer:

  • is local peer metadata self-consistent?
  • is the local raft replay pointer usable?
  • does the store need Coordinator catch-up only, or local rebuild first?
  • is startup safe, degraded, or fatal?

This object exists to stop restart behavior from being an implicit composition of:

  • raftstore/localmeta
  • raft log replay
  • ad hoc bootstrap logic

15.3 Request classes and admission

Execution-plane v1 should start by distinguishing three request classes:

  • Read
    • local leader read admission
    • read-index / wait-applied preconditions
    • cancellation and deadline propagation
  • Write
    • raft proposal admission
    • proposal tracking through commit/apply
    • retryable local rejection vs fatal local rejection
  • Topology
    • peer change
    • split
    • merge
    • explicit coupling to planned and terminal rooted truth

These classes do not need separate RPC protocols, but they do need stable admission outcomes. At minimum, those outcomes should distinguish:

  • NotLeader
  • EpochMismatch
  • NotHosted
  • Canceled
  • TimedOut
  • QueueSaturated
  • SchedulerDegraded
  • Accepted

Without this line, request behavior remains split across store-local branches instead of becoming one coherent executor contract.
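
A minimal sketch of those outcomes as one Go enumeration could look like the following; the constant names mirror the list above, while the type name and the retryability split are assumptions made for illustration.

// Illustrative admission outcome enumeration (names follow the list above).
type AdmissionOutcome int

const (
    AdmissionAccepted AdmissionOutcome = iota
    AdmissionNotLeader
    AdmissionEpochMismatch
    AdmissionNotHosted
    AdmissionCanceled
    AdmissionTimedOut
    AdmissionQueueSaturated
    AdmissionSchedulerDegraded
)

// Retryable reports whether a rejection is worth retrying locally after a
// route/lease refresh; this particular classification is an assumption.
func (o AdmissionOutcome) Retryable() bool {
    switch o {
    case AdmissionNotLeader, AdmissionEpochMismatch,
        AdmissionQueueSaturated, AdmissionSchedulerDegraded:
        return true
    default:
        return false
    }
}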

15.4 Publish lifecycle

Execution-plane v1 should also make the publish boundary explicit for topology work.

The minimal lifecycle is:

  1. PlannedPublished
  2. LocallyExecuting
  3. Applied
  4. TerminalPublishPending
  5. TerminalPublished
  6. TerminalPublishFailed

The important rule is that Applied and TerminalPublished are different states. Local execution success does not mean global lifecycle completion until terminal truth is durably published.

This is the boundary that should align:

  • raftstore/store/transition_builder.go
  • raftstore/store/transition_executor.go
  • raftstore/store/transition_outcome.go
  • raftstore/store/scheduler_runtime.go
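
A minimal sketch of that lifecycle as explicit Go states; only the phase names come from the list above, the type and helper are illustrative.

// Illustrative publish lifecycle phases (names follow the minimal lifecycle above).
type PublishPhase int

const (
    PlannedPublished PublishPhase = iota
    LocallyExecuting
    Applied
    TerminalPublishPending
    TerminalPublished
    TerminalPublishFailed
)

// Complete is true only once terminal truth is durably published;
// Applied on its own never counts as lifecycle completion.
func (p PublishPhase) Complete() bool {
    return p == TerminalPublished
}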

15.5 First landing points

Execution-plane protocol v1 landed first in the places that already carried the boundary implicitly:

  • raftstore/store/command_ops.go
    • request admission and context semantics
  • raftstore/store/command_pipeline.go
    • request lifecycle states visible to callers
  • raftstore/store/scheduler_runtime.go
    • queue overflow / degraded local behavior
  • raftstore/store/transition_builder.go
    • execution target construction from rooted truth
  • raftstore/store/transition_executor.go
    • local execution and apply boundary
  • raftstore/store/transition_outcome.go
    • terminal truth publication result
  • raftstore/localmeta
    • restart state and local recovery truth

These files still do not expose a new public API. But they now share one explicit local protocol vocabulary instead of inventing those semantics independently.

15.6 Execution invariants

The execution-plane protocol should preserve the following invariants.

Admission invariant

Every externally visible rejection should map to a stable admission reason, not only a transport error or generic retry exhaustion.

No skipped publish boundary invariant

If local apply completed but terminal truth publication did not, the system must surface that state explicitly. It must not be silently treated as fully complete.

Restart truth boundary invariant

Restart must derive hosted peer truth from local durable state, not from bootstrap config. Static config may resolve addresses, but must not overwrite runtime truth.

No hidden drop invariant

Queue overflow, scheduler degradation, and publish retry loss must be explicit protocol states or metrics-backed outcomes, not silent local behavior.

15.7 Minimal rollout target

Execution-plane protocol v1 started small.

The minimum useful delivered line is now:

  1. request admission
  2. topology execution outcome
  3. publish boundary state
  4. restart state

That is enough to formalize the most dangerous boundaries without trying to protocolize every internal raft detail.


16. Priority and Rollout Order

The next protocol work should avoid widening either protocol; the current v1 contracts need to stay small, observable, and well tested first.

16.1 What is implemented now

The control plane has a minimal, externally visible contract:

  • freshness classes
  • rooted token / lag
  • degraded serving state
  • transition identity

The execution plane now has a minimal internal contract:

  • admission class / reason
  • topology outcome
  • publish state
  • restart state
  • admin-visible ExecutionStatus

That is enough for v1. It gives tests and operators names for the important boundaries without turning raftstore into a policy engine.

16.2 What should not happen next

The wrong next step would be to keep enriching lifecycle phases and diagnostic fields before the existing v1 state proves stable under recovery and integration tests.

That would create a vocabulary mismatch:

  • control plane claims richer transition semantics than the executor can act on
  • execution plane reports more states than the coordinator can use safely

The safer order is:

  1. Keep control-plane v1 and execution-plane v1 narrow.
  2. Add tests around the existing publish/restart/admission states before adding new states.
  3. Only then tighten control-plane v1 toward richer scheduler/runtime phases.

In short:

stabilize both v1 contracts first, then deepen scheduler/runtime semantics.

WAL Subsystem

NoKV’s write-ahead log mirrors RocksDB’s durability model and is implemented as a compact Go module similar to Badger’s journal. WAL appends happen alongside memtable writes (via lsm.SetBatch in the DB write pipeline), while values that are routed to the value log are written before the WAL so that replay always sees durable value pointers.


1. File Layout & Naming

  • Location: ${Options.WorkDir}/wal/.
  • Naming pattern: %05d.wal (e.g. 00001.wal).
  • Rotation threshold: configurable via wal.Config.SegmentSize (defaults to 64 MiB, minimum 64 KiB).
  • Verification: wal.VerifyDir ensures the directory exists prior to DB.Open.

On open, wal.Manager scans the directory, sorts segment IDs, and resumes the highest ID—exactly how RocksDB re-opens its MANIFEST and WAL sequence files.


2. Record Format

uint32 length (big-endian, includes type + payload)
uint8  type
[]byte payload
uint32 checksum (CRC32 Castagnoli over type + payload)
  • Checksums use kv.CastagnoliCrcTable, the same polynomial used by RocksDB (Castagnoli). Record encoding/decoding lives in engine/wal/record.go.
  • The type byte allows mixing LSM mutations with raft log/state/snapshot records in the same WAL segment.
  • Appends are buffered by bufio.Writer so batches become single system calls.
  • Replay stops cleanly at truncated tails; tests simulate torn writes by truncating the final bytes and verifying replay remains idempotent (engine/wal/manager_test.go::TestManagerReplayHandlesTruncate).
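
The following sketch shows how one record could be framed under that layout; it is illustrative only and is not the encoder in engine/wal/record.go.

import (
    "encoding/binary"
    "hash/crc32"
)

// encodeRecord frames one record: big-endian length over type+payload,
// the type byte, the payload, then a CRC32 (Castagnoli) over type+payload.
func encodeRecord(recType byte, payload []byte) []byte {
    buf := make([]byte, 4+1+len(payload)+4)
    binary.BigEndian.PutUint32(buf[0:4], uint32(1+len(payload))) // length = type + payload
    buf[4] = recType
    copy(buf[5:], payload)
    sum := crc32.Checksum(buf[4:5+len(payload)], crc32.MakeTable(crc32.Castagnoli))
    binary.BigEndian.PutUint32(buf[5+len(payload):], sum)
    return buf
}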

3. Public API (Go)

mgr, _ := wal.Open(wal.Config{Dir: path})
info, _ := mgr.AppendEntry(wal.DurabilityFlushed, entry)
batchInfo, _ := mgr.AppendEntryBatch(wal.DurabilityFlushed, entries)
typedInfos, _ := mgr.AppendRecords(wal.DurabilityFsyncBatched, wal.Record{
    Type:    wal.RecordTypeRaftEntry,
    Payload: raftPayload,
})
_ = mgr.Sync()
_ = mgr.Rotate()
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
    // reapply to memtable
    return nil
})

Key behaviours:

  • AppendEntry/AppendEntryBatch/AppendRecords automatically call ensureCapacity to decide when to rotate; they return EntryInfo{SegmentID, Offset, Length} so upper layers can keep storage WAL checkpoints and store-local raft replay pointers.
  • Every append must declare a DurabilityPolicy: DurabilityBuffered (writer buffer only), DurabilityFlushed (OS page cache), DurabilityFsync (fsync now), or DurabilityFsyncBatched (fsync contract reserved for the group-commit pipeline).
  • Sync flushes the active file (used for Options.SyncWrites).
  • Rotate forces a new segment (used after flush/compaction checkpoints similar to RocksDB’s LogFileManager::SwitchLog).
  • Replay iterates segments in numeric order, forwarding each payload to the callback. Errors abort replay so recovery can surface corruption early.
  • Metrics (wal.Manager.Metrics) reveal the active segment ID, total segments, and number of removed segments—these feed directly into StatsSnapshot and nokv stats output.
  • RegisterRetention lets LSM and raft participants publish the oldest WAL segment they still need. RemoveSegment rejects retained segments with ErrSegmentRetained.

Compared with Badger: Badger keeps a single vlog for both data and durability. NoKV splits WAL (durability) from vlog (value separation), matching RocksDB’s separation of WAL and blob files.


4. Integration Points

| Call Site | Purpose |
| --- | --- |
| lsm.memTable.setBatch | Encodes each entry (kv.EncodeEntry) and appends to WAL before inserting into the active memtable index (ART by default, skiplist when explicitly selected). |
| DB.commitWorker | Commit worker applies batched writes via writeToLSM, which calls lsm.SetBatch and appends one WAL entry-batch record per request batch. |
| DB.Set / DB.SetBatch / DB.SetWithTTL / DB.Del / DB.DeleteRange / DB.ApplyInternalEntries | User/internal writes all flow through the same commit queue and eventually reach lsm.SetBatch + WAL append. |
| engine/lsm/level_manager.go::flush | Persists WAL checkpoint via manifest.LogEdits(EditAddFile, EditLogPointer) during flush install. |
| engine/lsm/level_manager.go::flush + engine/lsm/levelManager.canRemoveWalSegment | Removes obsolete WAL segments after storage checkpoint and raftstore/localmeta replay constraints are satisfied. |
| db.runRecoveryChecks | Ensures WAL directory invariants before manifest replay, similar to Badger’s directory bootstrap. |

5. Metrics & Observability

Stats.collect reads the manager metrics and exposes them as:

  • NoKV.Stats.wal.active_segment
  • NoKV.Stats.wal.segment_count
  • NoKV.Stats.wal.segments_removed

The CLI command nokv stats --workdir <dir> prints these alongside backlog, making WAL health visible without manual inspection. In high-throughput scenarios the active segment ID mirrors RocksDB’s LOG number growth.


6. WAL Watchdog (Auto GC)

The WAL watchdog runs inside the DB process to keep WAL backlog in check and surface warnings when raft-typed records dominate the log. It:

  • Samples WAL metrics + per-segment metrics and combines them with raftstore/localmeta raft pointer snapshots to compute removable segments.
  • Filters those candidates through all registered WAL retention participants before deleting any segment.
  • Removes up to WALAutoGCMaxBatch segments when at least WALAutoGCMinRemovable are eligible.
  • Exposes counters (wal.auto_gc_runs/removed/last_unix) and warning state (wal.typed_record_ratio/warning/reason) through StatsSnapshot.WAL.

Relevant options (see options.go for defaults):

  • EnableWALWatchdog
  • WALAutoGCInterval
  • WALAutoGCMinRemovable
  • WALAutoGCMaxBatch
  • WALTypedRecordWarnRatio
  • WALTypedRecordWarnSegments

For hard admission control, wal.Config.MaxSegments rejects segment growth with ErrWALBackpressure once the WAL reaches the configured segment cap.


7. Recovery Walkthrough

  1. wal.Open reopens the highest segment, leaving the file pointer at the end (switchSegmentLocked).
  2. manifest.Manager supplies the WAL checkpoint (segment + offset) while building the version. Replay skips entries up to this checkpoint, ensuring we only reapply writes not yet materialised in SSTables.
  3. wal.Manager.Replay (invoked by the LSM recovery path) rebuilds memtables from entries newer than the manifest checkpoint. Value-log recovery only validates/truncates segments and does not reapply data.
  4. If the final record is partially written, the CRC mismatch stops replay and the segment is truncated during recovery tests, mimicking RocksDB’s tolerant behaviour.
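
A hedged sketch of that checkpoint-aware replay, assuming the checkpoint is available as the manifest Version’s (LogSegment, LogOffset) pair, that wal.EntryInfo exposes SegmentID/Offset as shown earlier, and that the field types line up; applyToMemtable is a hypothetical reapply hook.

ckptSeg, ckptOff := version.LogSegment, version.LogOffset
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
    if info.SegmentID < ckptSeg ||
        (info.SegmentID == ckptSeg && info.Offset < ckptOff) {
        return nil // already materialised in SSTables; skip
    }
    return applyToMemtable(payload) // hypothetical reapply into a fresh index
})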

8. Operational Tips

  • Configure Options.SyncWrites=true for DB-level synchronous durability. Raft log/state/snapshot appends use the WAL DurabilityFsyncBatched contract directly.
  • After large flushes, forcing Rotate keeps WAL files short, reducing replay time.
  • Archived WAL segments can be copied alongside manifest files for hot-backup strategies—since the manifest contains the WAL log number, snapshots behave like RocksDB’s Checkpoints.

9. Truncation Metadata

  • raftstore/raftlog/wal_storage keeps a per-group index of [firstIndex,lastIndex] spans for each WAL record so it can map raft log indices back to the segment that stored them.
  • When a log is truncated (either via snapshot or future compaction hooks), the store-local metadata in raftstore/localmeta is updated with the index/term, segment ID (RaftLogPointer.SegmentIndex), and byte offset (RaftLogPointer.TruncatedOffset) that delimit the remaining WAL data.
  • engine/lsm/levelManager.canRemoveWalSegment blocks garbage collection whenever any raft group still references a segment through that store-local truncation metadata, preventing slow followers from losing required WAL history while letting aggressively compacted groups release older segments earlier.

For broader context, read the architecture overview and flush pipeline documents.

Memtable Design & Lifecycle

NoKV’s write path mirrors RocksDB: every write lands in the WAL and an in-memory memtable backed by a selectable in-memory index (skiplist or ART). The implementation lives in engine/lsm/memtable.go and ties directly into the concrete flush queue in engine/lsm/flush_runtime.go.


1. Structure

type memTable struct {
    lsm        *LSM
    segmentID  uint32   // WAL segment backing this memtable
    index      memIndex // skiplist or ART index (see memIndex below)
    maxVersion uint64   // highest entry version seen by this memtable
    walSize    int64    // bytes of the backing WAL segment consumed so far
}

The memtable index is an interface that can be backed by either a skiplist or ART:

type memIndex interface {
    Add(*kv.Entry)
    Search([]byte) ([]byte, kv.ValueStruct)
    NewIterator(*utils.Options) utils.Iterator
    MemSize() int64
    IncrRef()
    DecrRef()
}
  • Memtable engine – Options.MemTableEngine selects art (default) or skiplist via newMemIndex. ART is not a generic trie: it is an internal-key-only memtable index. It uses a reversible mem-comparable route key so trie ordering matches the LSM internal-key comparator; skiplist remains available as the simpler baseline alternative.
  • Arena sizing – both utils.NewSkiplist and utils.NewART use arenaSizeFor to derive arena capacity from Options.MemTableSize.
  • WAL coupling – every Set uses kv.EncodeEntry to materialise the payload to the active WAL segment before inserting into the chosen index. walSize tracks how much of the segment is consumed so flush can release it later.
  • Segment ID – LSM.NewMemtable atomically increments levels.maxFID, switches the WAL to a new segment (wal.Manager.SwitchSegment), and tags the memtable with that FID. This matches RocksDB’s logfile_number field.
  • ART specifics – ART stores prefix-compressed inner nodes (Node4/16/48/256). Each leaf keeps both the private route key used by the trie and the original canonical internal key returned to callers. The concurrency model is pure copy-on-write payload/node cloning with CAS installs: published payloads are immutable, reads stay lock-free, and writers only publish fully-built cloned payloads.

2. Lifecycle

sequenceDiagram
    participant WAL
    participant MT as MemTable
    participant Flush
    participant Manifest
    WAL->>MT: Append+Set(entry)
    MT->>Flush: freeze (walSize + incomingEstimate > limit)
    Flush->>Manifest: LogPointer + AddFile
    Manifest-->>Flush: ack
    Flush->>WAL: Release segments ≤ segmentID
  1. Active → Immutable – when mt.walSize + estimate exceeds Options.MemTableSize, the memtable is rotated and pushed onto the flush queue. The new active memtable triggers another WAL segment switch.
  2. Flush – the flush queue drains immutable memtables, builds SSTables, logs manifest edits, and releases the WAL segment ID recorded in memTable.segmentID once the SST is durably installed.
  3. Recovery – LSM.recovery scans WAL files, reopens memtables per segment (most recent becomes active), and deletes segments ≤ the manifest’s log pointer. Entries are replayed via wal.Manager.ReplaySegment into fresh indexes and the active in-memory state is rebuilt.

Badger follows the same pattern, while RocksDB often uses skiplist-backed arenas with reference counting—NoKV reuses Badger’s arena allocator for simplicity.


3. Read Semantics

  • memTable.Get looks up the chosen index and returns a borrowed, ref-counted *kv.Entry from the internal pool. The index search returns the matched internal key plus value struct, so memtable hit entries carry the concrete version key instead of the query sentinel key. Internal callers must release borrowed entries with DecrRef when done.
  • MemTable.IncrRef/DecrRef delegate to the index, allowing iterators to hold references while the flush manager processes immutable tables—mirroring RocksDB’s MemTable::Ref/Unref lifecycle.
  • WAL-backed values that exceed the value threshold are stored as pointers; the memtable stores the encoded pointer, and the transaction/iterator logic reads from the vlog on demand.
  • DB.Get returns detached entries; callers must not call DecrRef on them.
  • DB.GetInternalEntry returns borrowed entries; callers must call DecrRef exactly once.
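
A short usage sketch of that borrow/detach contract; the exact signatures of DB.GetInternalEntry and DB.Get, and the consumer helpers, are assumptions.

// Borrowed entry: release exactly once when done.
entry, err := db.GetInternalEntry(key)
if err != nil {
    return err
}
defer entry.DecrRef()
processInternal(entry) // hypothetical internal consumer

// Detached entry: caller-owned, never call DecrRef on it.
detached, err := db.Get(key)
if err != nil {
    return err
}
serve(detached) // hypothetical read-path consumer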

4. Integration with Other Subsystems

| Subsystem | Interaction |
| --- | --- |
| Distributed 2PC | kv.Apply + percolator write committed MVCC versions through the same WAL/memtable pipeline in raft mode. |
| Manifest | Flush completion logs EditLogPointer(segmentID) so restart can discard WAL files already persisted into SSTs. |
| Stats | Stats.Snapshot pulls FlushPending/Active/Queue counters via lsm.FlushMetrics, exposing how many immutables are waiting. |
| Value Log | lsm.flush emits discard stats keyed by segmentID, letting the value log GC know when entries become obsolete. |

5. Comparison

| Aspect | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Data structure | Skiplist + arena | Skiplist + arena | Skiplist or ART + arena (art default) |
| WAL linkage | logfile_number per memtable | Segment ID stored in vlog entries | segmentID on memTable, logged via manifest |
| Recovery | Memtable replays from WAL, referencing MANIFEST | Replays WAL segments | Replays WAL segments, prunes ≤ manifest log pointer |
| Flush trigger | Size/entries/time | Size-based | WAL-size budget (walSize) with explicit queue metrics |

6. Operational Notes

  • Tuning Options.MemTableSize affects WAL segment count and flush latency. Larger memtables reduce flush churn but increase crash recovery time.
  • ART currently uses noticeably more memindex arena memory than skiplist because it stores both route keys and original internal keys in leaves; in local measurements the ART memindex is roughly 2x the skiplist memindex footprint for the same key/value set.
  • Monitor NoKV.Stats.flush.* fields to catch stalled immutables—an ever-growing queue often indicates slow SST builds or manifest contention.
  • Because memtables carry WAL segment IDs, deleting WAL files manually can lead to recovery failures; always rely on the engine’s manifest-driven cleanup.

See docs/flush.md for the end-to-end flush scheduler and [docs/architecture.md](architecture.md#3-end-to-end-write-flow) for where memtables sit in the write pipeline.

MemTable Flush Pipeline

NoKV’s flush path converts immutable memtables into L0 SST files, then advances the manifest WAL checkpoint and reclaims obsolete WAL segments. The queue and timing bookkeeping live directly in engine/lsm/flush_runtime.go; SST persistence and manifest install are in engine/lsm/table_builder.go and engine/lsm/level_manager.go.


1. Responsibilities

  1. Persistence: materialize immutable memtables into SST files.
  2. Ordering: publish SST metadata to manifest only after the SST is durably installed (strict mode).
  3. Cleanup: remove WAL segments once checkpoint and raft constraints allow removal.
  4. Observability: export queue/build/release timing through flush metrics.

2. Concrete Flush Queue

flowchart LR
    Active[Active MemTable]
    Immutable[Immutable MemTable]
    FlushQ[flush queue]
    Build[Build SST]
    Install[Install SST]
    Release[Release MemTable]

    Active -->|threshold reached| Immutable --> FlushQ
    FlushQ --> Build --> Install --> Release --> Active
  • Enqueue: lsm.submitFlush pushes the immutable memtable into the concrete flush queue and records wait-start time.
  • Build: worker pulls the next task, builds the SST (levelManager.flush -> openTable -> tableBuilder.flush).
  • Install: after SST + manifest edits succeed, the worker records install timing.
  • Release: worker removes the immutable from memory, closes the memtable, records release timing, and completes the task.

3. SST Persistence Modes

Flush uses two write modes controlled by Options.ManifestSync:

  1. Fast path (ManifestSync=false)

    • Writes SST directly to final filename with O_CREATE|O_EXCL.
    • No temp file/rename step.
    • Highest throughput, weaker crash-consistency guarantees.
  2. Strict path (ManifestSync=true)

    • Writes to "<table>.tmp.<pid>.<ns>".
    • tmp.Sync() to persist SST bytes.
    • RenameNoReplace(tmp, final) installs file atomically. If unsupported by platform/filesystem, returns vfs.ErrRenameNoReplaceUnsupported.
    • SyncDir(workdir) is called before manifest edit so directory entry is durable.

This is the durability ordering used by current code.
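
A hedged sketch of that strict-path ordering, using the vfs helpers named on this page; writeSST, syncFile, and the edit values are illustrative stand-ins rather than real functions.

// Strict install: SST bytes, atomic rename, directory fsync, then manifest edit.
tmp := finalPath + ".tmp.<pid>.<ns>" // placeholder for the documented temp suffix
if err := writeSST(tmp, blocks); err != nil {
    return err
}
if err := syncFile(tmp); err != nil { // persist SST bytes (tmp.Sync())
    return err
}
if err := fs.RenameNoReplace(tmp, finalPath); err != nil {
    return err // may be vfs.ErrRenameNoReplaceUnsupported
}
if err := vfs.SyncDir(fs, workdir); err != nil { // make the directory entry durable
    return err
}
return manifest.LogEdits(addFileEdit, logPointerEdit) // publish only after install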


4. Execution Path in Code

  1. lsm.Set/lsm.SetBatch detects walSize + estimate > MemTableSize and rotates memtable.
  2. Rotated memtable is submitted to the flush queue (lsm.submitFlush).
  3. Worker executes levelManager.flush(mt):
    • iterates memtable entries,
    • builds SST via tableBuilder,
    • prepares manifest edits: EditAddFile + EditLogPointer.
  4. In strict mode, SyncDir runs before manifest.LogEdits(...).
  5. On successful manifest commit, table is added to L0 and wal.RemoveSegment runs when allowed.

5. Recovery Notes

  • Startup rebuild (levelManager.build) validates manifest SST entries against disk.
  • Missing or unreadable SSTs fail startup; normal restart does not repair manifest state by deleting referenced files.
  • Temp SST names are only used in strict mode and are created in WorkDir with suffix .tmp.<pid>.<ns> (not a dedicated tmp/ directory).

6. Metrics & CLI

flushRuntime.stats() feeds StatsSnapshot.Flush:

  • pending, queue, active
  • wait/build/release totals, counts, last, max
  • completed

Use:

nokv stats --workdir <dir>

to inspect flush backlog and latency.


7. Tests

  • engine/lsm/flush_runtime_test.go: queue lifecycle and timing counters.
  • db_test.go::TestRecoveryWALReplayRestoresData: replay still restores data after crash before flush completion.
  • db_test.go::TestRecoveryFailsOnMissingSST and db_test.go::TestRecoveryFailsOnCorruptSST: startup fails when manifest SSTs are missing or corrupt.

See also recovery.md, memtable.md, and wal.md.

Compaction & Cache Strategy

NoKV’s compaction pipeline borrows the leveled‑LSM layout from RocksDB, but layers it with a landing buffer, lightweight cache telemetry, and simple concurrency guards so the implementation stays approachable while still handling bursty workloads.


1. Overview

Compactions are orchestrated by lsm.compaction with lsm.levelManager supplying scheduling input and executing the plan. Each level owns two lists of tables:

  • tables – the canonical sorted run for the level.
  • landing – a landing buffer that temporarily holds SSTables moved from the level above when there is not yet enough work (or bandwidth) to do a full merge.

The compaction runtime periodically calls into the picker to build a list of Priority entries. The priorities consider three signals:

  1. L0 table count – loosely capped by Options.NumLevelZeroTables.
  2. Level size vs target – computed by levelTargets(), which dynamically adjusts the “base” level depending on total data volume.
  3. Landing buffer backlog – if a level’s landing shards have data, they receive elevated scores so landed tables are merged promptly.

The highest adjusted score is processed first. L0 compactions can either move tables into the landing buffer of the base level (cheap re‑parenting) or compact directly into a lower level when the overlap warrants it.

Planning now happens via Plan: LSM snapshots table metadata into TableMeta, PlanFor* selects table IDs + key ranges, and LSM resolves the plan back to *table before executing.


2. Landing Buffer

moveToLanding (see engine/lsm/compaction_executor.go) performs a metadata-only migration:

  1. Records a manifest.EditDeleteFile for the source level.
  2. Logs a new manifest.EditAddFile targeting the destination level.
  3. Removes the table from thisLevel.tables and appends it to nextLevel.landing.

This keeps write amplification low when many small L0 tables arrive at once. Reads still see the newest data because levelHandler.searchLNSST checks the landing buffer before consulting the canonical level tables.

Compaction tests (engine/lsm/compaction_test.go) assert that after calling moveToLanding the table disappears from the source level and shows up in the landing buffer.


3. Concurrency Guards

To prevent overlapping compactions:

  • State.CompareAndAdd tracks the key range of each in-flight compaction per level.
  • Attempts to register a compaction whose ranges intersect an existing one are rejected.
  • When a compaction finishes, State.Delete removes the ranges and table IDs from the guard.

This mechanism is intentionally simple—just a mutex‐protected slice—yet effective in tests (TestCompactStatusGuards) that simulate back‑to‑back registrations on the same key range.


4. Cache Telemetry

NoKV’s cache strategy has two explicit user-space caches plus direct bloom probing from decoded table indexes (engine/lsm/cache.go):

| Component | Purpose | Metrics hook |
| --- | --- | --- |
| Block cache | Ristretto cache for L0/L1 blocks. | cacheMetrics.recordBlock(level, hit) |
| OS page cache path | Deeper levels bypass user-space cache and rely on mmap + kernel page cache. | Same as above |
| Bloom filters | Embedded in pb.TableIndex and probed directly from the decoded index. | No separate cache layer |

Cache hit/miss signals are exported through StatsSnapshot.Cache (and surfaced by nokv stats / expvar), which is especially helpful when tuning landing behaviour. If L0/L1 cache misses spike, the landing buffer likely needs to be drained faster. TestCacheHotColdMetrics verifies cache hit accounting.


5. Interaction with Value Log

Compaction informs value‑log GC via discard statistics:

  1. During subcompact, every entry merged out is inspected. If it stores a ValuePtr, the amount is added to the discard map.
  2. At the end of subcompaction, the accumulated discard map is pushed through setDiscardStatsCh.
  3. valueLog receives the stats and can safely rewrite or delete vlog segments with predominantly obsolete data.

This tight coupling keeps the value log from growing indefinitely after heavy overwrite workloads.


6. Testing Checklist

Relevant tests to keep compaction healthy:

  • engine/lsm/compaction_test.go
    • TestCompactionMoveToLanding – ensures metadata migration works and the landing buffer grows.
    • TestCompactStatusGuards – checks overlap detection.
  • engine/lsm/cache_test.go
    • TestCacheHotColdMetrics – validates cache hit accounting.
  • engine/lsm/lsm_test.go
    • TestCompact / TestHitStorage – end‑to‑end verification that data remains queryable across memtable flushes and compactions.

When adding new compaction heuristics or cache behaviour, extend these tests (or introduce new ones) so the behaviour stays observable.


7. Practical Tips

  • Tune Options.LandingCompactBatchSize when landing queues build up; increasing it lets a single move cover more tables.
  • Observe NoKV.Stats.cache.* and NoKV.Stats.compaction.* via the CLI (nokv stats) to decide whether you need more compaction workers or bigger caches.
  • For workloads dominated by range scans, consider increasing Options.BlockCacheBytes if you want to keep more L0/L1 blocks in the user-space cache; cold data relies on the OS page cache.
  • Keep an eye on NoKV.Stats.value_log.gc (for example gc_runs and head_updates); if compactions are generating discard stats but the value log head doesn’t move, GC thresholds may be too conservative.

With these mechanisms, NoKV stays resilient under bursty writes while keeping the code path small and discoverable—ideal for learning or embedding. Dive into the source files referenced above for deeper implementation details.

Landing Buffer Architecture

The landing buffer is a per-level landing area for SSTables—typically promoted from L0—designed to absorb bursts, reduce overlap, and unlock parallel compaction without touching the main level tables immediately. It combines fixed sharding, adaptive scheduling, and optional LandingKeep (landing-merge) passes to keep write amplification and contention low.

flowchart LR
  L0["L0 SSTables"] -->|moveToLanding| Landing["Landing Buffer (sharded)"]
  subgraph levelN["Level N"]
    Landing -->|LandingDrain: landing-only| MainTables["Main Tables"]
    Landing -->|LandingKeep: landing-merge| Landing
  end
  Landing -.read path merge.-> ClientReads["Reads/Iterators"]

Design Highlights

  • Sharded by key prefix: landing tables are routed into fixed shards (top bits of the first byte). Sharding cuts cross-range overlap and enables safe parallel drain.
  • Snapshot-friendly reads: landing tables are read under the level RLock, and iterators hold table refs so mmap-backed data stays valid without additional snapshots.
  • Two landing paths:
    • Landing-only compaction: drain landing → main level (or next level) with optional multi-shard parallelism guarded by LandingMode.
    • Landing-merge: compact landing tables back into landing (stay in-place) to drop superseded versions before promoting, reducing downstream write amplification.
  • LandingMode enum: plans carry a LandingMode with LandingNone, LandingDrain, and LandingKeep. LandingDrain corresponds to landing-only (drain into main tables), while LandingKeep corresponds to landing-merge (compact within landing).
  • Adaptive scheduling:
    • Shard selection is driven by PickShardOrder / PickShardByBacklog using per-shard size, age, and density.
    • Shard parallelism scales with backlog score (based on shard size/target file size) bounded by LandingShardParallelism.
    • Batch size scales with shard backlog to drain faster under pressure.
    • Landing-merge triggers when backlog score exceeds LandingBacklogMergeScore (default 2.0), with dynamic lowering under extreme backlog/age.
  • Observability: expvar/stats expose LandingDrain vs LandingKeep counts, duration, and tables processed, plus landing size/value density per level/shard.
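
As a rough sketch of that per-shard decision (treating the backlog score as shard size over target file size, per the scheduling notes above; the surrounding variables are illustrative):

// Illustrative choice between the two landing paths for one shard.
score := float64(shardBytes) / float64(targetFileSize)
switch {
case score >= opts.LandingBacklogMergeScore: // default 2.0
    plan.Mode = LandingKeep // landing-merge: compact within the landing buffer
case shardBytes > 0:
    plan.Mode = LandingDrain // landing-only: drain into the main tables
default:
    plan.Mode = LandingNone // nothing to schedule for this shard
}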

Configuration

  • LandingShardParallelism: max shards to compact in parallel (default max(NumCompactors/2, 2), auto-scaled by backlog).
  • LandingCompactBatchSize: base batch size per landing compaction (auto-boosted by shard backlog).
  • LandingBacklogMergeScore: backlog score threshold to trigger LandingKeep/landing-merge (default 2.0).

Benefits

  • Lower write amplification: bursty L0 SSTables land in the landing buffer first; LandingKeep/landing-merge prunes duplicates before full compaction.
  • Reduced contention: sharding + State allow parallel landing drain with minimal overlap.
  • Predictable reads: landing is part of the read snapshot, so moving tables in/out does not change read semantics.
  • Tunable and observable: knobs for parallelism and merge aggressiveness, with per-path metrics to guide tuning.

Future Work

  • Deeper adaptive policies (IO/latency-aware), richer shard-level metrics, and more exhaustive parallel/restart testing under fault injection.

Value Log (vlog) Design

NoKV keeps the LSM tree lean by separating large values into sequential value log (vlog) files. The module is split between

  • engine/vlog/manager.go – owns the open file set, rotation, and segment lifecycle helpers.
  • engine/vlog/io.go – append/read/iterate/verify/sample IO paths.
  • vlog.go – integrates the manager with the DB write path, discard statistics, and garbage collection (GC).

The design echoes BadgerDB’s value log while remaining manifest-driven like RocksDB’s blob_db: vlog metadata (head pointer, pending deletions) is persisted inside the manifest so recovery can reconstruct the exact state without scanning the filesystem.


1. Layering (Engine View)

The value log is split into three layers so IO can stay reusable while DB policy lives in the core package:

  • DB policy layer (vlog.go, vlog_gc.go) – integrates the manager with the DB write path, persists vlog metadata in the manifest, and drives GC scheduling based on discard stats.
  • Value-log manager (engine/vlog/) – owns segment lifecycle (open/rotate/remove), encodes/decodes entries, and exposes append/read/sample APIs without touching MVCC or LSM policy.
  • File IO (engine/file/) – mmap-backed LogFile primitives (open/close/truncate, read/write, read-only remap) shared by WAL/vlog/SST. Vlog currently uses LogFile directly instead of an intermediate store abstraction.

2. Directory Layout & Naming

<workdir>/
  vlog/
    bucket-000/
      00000.vlog
      00001.vlog
    bucket-001/
      00000.vlog
      00001.vlog
    ...
  • Files are named %05d.vlog and live under workdir/vlog/bucket-XXX/ when Options.ValueLogBucketCount > 1. Manager.populate discovers existing segments at open.
  • Manager tracks the active file ID (activeID) and byte offset; Manager.Head exposes these so the manifest can checkpoint them (manifest.EditValueLogHead).
  • Files created after a crash but never linked in the manifest are removed during valueLog.reconcileManifest.
  • Production defaults keep ordinary hash bucketization enabled (ValueLogBucketCount=16). Bucket selection is plain hash routing; it no longer depends on Thermos-based hot/cold classification.

3. Record Format

The vlog uses the shared encoding helper (kv.EncodeEntryTo), so entries written to the value log and the WAL are byte-identical.

+--------+----------+------+-------------+-----------+-------+
| KeyLen | ValueLen | Meta | ExpiresAt   | Key bytes | Value |
+--------+----------+------+-------------+-----------+-------+
                                             + CRC32 (4 B)
  • Header fields are varint-encoded (kv.EntryHeader).
  • The first 20 bytes of every segment are reserved (kv.ValueLogHeaderSize) for future metadata; iteration always skips this fixed header.
  • kv.EncodeEntry and the entry iterator (kv.EntryIterator) perform the layout work, and each append finishes with a CRC32 to detect torn writes.
  • vlog.VerifyDir scans all segments with sanitizeValueLog to trim corrupted tails after crashes, mirroring RocksDB’s blob_file::Sanitize. Badger performs a similar truncation pass at startup.

4. Manager API Surface

mgr, _ := vlog.Open(vlog.Config{Dir: "...", MaxSize: 1<<29})
ptr, _ := mgr.AppendEntry(entry)
ptrs, _ := mgr.AppendEntries(entries, writeMask)
val, unlock, _ := mgr.Read(ptr)
unlock()             // release per-file lock
_ = mgr.Rewind(*ptr) // rollback partially written batch
_ = mgr.Remove(fid)  // close + delete file

Key behaviours:

  1. Append + Rotate – Manager.AppendEntry encodes and appends into the active file. The reservation path handles rotation when the active segment would exceed MaxSize; manual rotation is rare.
  2. Crash recovery – Manager.Rewind truncates the active file and removes newer files when a write batch fails mid-flight. valueLog.write uses this to guarantee idempotent WAL/value log ordering.
  3. Safe reads – Manager.Read returns an mmap-backed slice plus an unlock callback. Active segments take a per-file RWMutex, while sealed segments use a pin/unpin path to avoid long-held locks; callers that need ownership should copy the bytes before releasing the lock.
  4. Verification – VerifyDir validates entire directories (used by CLI and recovery) by parsing headers and CRCs.

Compared with RocksDB’s blob manager the surface is intentionally small—NoKV treats the manager as an append-only log with rewind semantics, while RocksDB maintains index structures inside the blob file metadata.


5. Integration with DB Writes

sequenceDiagram
    participant Commit as commitWorker
    participant Mgr as vlog.Manager
    participant WAL as wal.Manager
    participant Mem as MemTable
    Commit->>Mgr: AppendEntries(entries, writeMask)
    Mgr-->>Commit: ValuePtr list
    Commit->>WAL: Append(entries+ptrs)
    Commit->>Mem: apply to active memtable index
  1. valueLog.write builds a write mask for each batch, then delegates to Manager.AppendEntries. Entries staying in LSM (shouldWriteValueToLSM) receive zero-value pointers.
  2. Rotation is handled inside the manager when the reserved bytes would exceed MaxSize. The WAL append happens after the value log append so crash replay observes consistent pointers.
  3. Any error triggers Manager.Rewind back to the saved head pointer, removing new files and truncating partial bytes. vlog_test.go exercises both append- and rotate-failure paths.
  4. All batched writes share the same pipeline: the commit worker always writes the value log first, then applies to WAL/memtable, keeping ordering and durability consistent.
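
A hedged sketch of that ordering (value log first, then WAL, then memtable) with Rewind as the rollback; the variable names, the head-capture step, and pointer attachment are illustrative and partly elided.

// Illustrative commit ordering for one batch.
head := vlogMgr.Head() // remember where this batch starts
ptrs, err := vlogMgr.AppendEntries(entries, writeMask)
if err != nil {
    _ = vlogMgr.Rewind(head) // undo partially written vlog bytes
    return err
}
_ = ptrs // value pointers travel with the entries into WAL/LSM
if _, err := walMgr.AppendEntryBatch(wal.DurabilityFlushed, entries); err != nil {
    _ = vlogMgr.Rewind(head)
    return err
}
return lsm.SetBatch(entries) // apply to the active memtable index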

Badger follows the same ordering (value log first, then write batch). RocksDB’s blob DB instead embeds blob references into the WAL entry before the blob file write, relying on two-phase commit between WAL and blob; NoKV avoids the extra coordination by reusing a single batching loop.


6. Discard Statistics & GC

flowchart LR
  Compaction -- "obsolete ptrs" --> DiscardStats
  DiscardStats -->|"batch json"| writeCh
  valuePtr["valueLog.newValuePtr(lfDiscardStatsKey)"]
  writeCh --> valuePtr
  valueLog -- "GC trigger" --> Manager

  • lfDiscardStats aggregates per-file discard counts from LSM compaction workers (engine/lsm/compaction_executor.go, subcompact -> updateDiscardStats). Once the in-memory counter crosses discardStatsFlushThreshold, it marshals the map into JSON and writes it back through the DB pipeline under the special key !NoKV!discard.
  • valueLog.flushDiscardStats consumes those stats, ensuring they are persisted even across crashes. During recovery valueLog.populateDiscardStats replays the JSON payload to repopulate the in-memory map.
  • GC uses discardRatio = discardedBytes/totalBytes derived from Manager.Sample, which applies windowed iteration based on configurable ratios. If a file exceeds the configured threshold, valueLog.doRunGC rewrites live entries into the current head via db.batchSet (the normal commit pipeline) and then valueLog.rewrite triggers manifest delete edits through removeValueLogFile.
    • Sampling behaviour is controlled by Options.ValueLogGCSampleSizeRatio (default 0.10 of the file) and Options.ValueLogGCSampleCountRatio (default 1% of the configured entry limit). Setting either to <=0 keeps the default heuristics. Options.ValueLogGCSampleFromHead starts sampling from the beginning instead of a random window.
  • Completed deletions are logged via lsm.LogValueLogDelete so the manifest can skip them during replay. When GC rotates to a new head, valueLog.updateHead records the pointer and increments NoKV.Stats.value_log.gc.head_updates.

RocksDB’s blob GC leans on compaction iterators to discover obsolete blobs. NoKV, like Badger, leverages flush/compaction discard stats so GC does not need to rescan SSTs.


7. GC Scheduling & Parallelism

NoKV runs value-log GC with bucket-aware parallelism while protecting the LSM from overload:

  • ValueLogGCParallelism controls the maximum number of concurrent GC tasks. When set to <= 0, it auto-tunes to max(NumCompactors/2, 1) and is capped by the bucket count.
  • Each bucket has a lock-free guard, so no two GC jobs run in the same bucket at once.
  • A lightweight semaphore limits total concurrency without blocking the GC scheduler thread.

Pressure-aware throttling

Compaction backlog and score feed into the GC scheduler:

  • Reduce: when compaction backlog or max score crosses ValueLogGCReduceBacklog / ValueLogGCReduceScore, GC parallelism is halved.
  • Skip: when compaction backlog or max score crosses ValueLogGCSkipBacklog / ValueLogGCSkipScore, GC is skipped for that tick.

This keeps GC throughput high under light load but avoids compaction starvation under pressure.
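
A self-contained sketch of that policy using the thresholds above; the option struct, field grouping, and exact comparison directions are assumptions for illustration (the real knobs live in options.go).

// Hypothetical option subset for illustration only.
type gcPressureOptions struct {
    Parallelism   int     // ValueLogGCParallelism (<= 0 means auto-tune)
    NumCompactors int
    BucketCount   int     // ValueLogBucketCount
    ReduceBacklog int     // ValueLogGCReduceBacklog
    ReduceScore   float64 // ValueLogGCReduceScore
    SkipBacklog   int     // ValueLogGCSkipBacklog
    SkipScore     float64 // ValueLogGCSkipScore
}

// gcParallelism returns how many GC tasks this tick may run; 0 means skip.
func gcParallelism(o gcPressureOptions, backlog int, maxScore float64) int {
    p := o.Parallelism
    if p <= 0 { // auto-tune to max(NumCompactors/2, 1)
        p = o.NumCompactors / 2
        if p < 1 {
            p = 1
        }
    }
    if p > o.BucketCount {
        p = o.BucketCount // never more concurrent GC tasks than buckets
    }
    if backlog >= o.SkipBacklog || maxScore >= o.SkipScore {
        return 0 // skip this tick under heavy compaction pressure
    }
    if backlog >= o.ReduceBacklog || maxScore >= o.ReduceScore {
        p /= 2 // halve under moderate pressure
        if p < 1 {
            p = 1
        }
    }
    return p
}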


8. Recovery Semantics

  1. DB.Open restores the manifest and fetches the last persisted head pointer.
  2. valueLog.open launches flushDiscardStats and iterates every vlog file via valueLog.replayLog. Files marked invalid in the manifest are removed; valid ones are registered in the manager’s file map.
  3. valueLog.replayLog streams entries to validate checksums and trims torn tails; it does not reapply data into the LSM. WAL replay remains the sole source of committed state.
  4. Manager.VerifyDir trims torn records so replay never sees corrupt payloads.
  5. After validation, valueLog.populateDiscardStats rehydrates discard counters from the persisted JSON entry if present.

The flow mirrors Badger’s vlog scanning but keeps state reconstruction anchored on WAL + manifest checkpoints, similar to RocksDB’s reliance on MANIFEST to mark blob files live or obsolete.


9. Observability & CLI

  • Metrics in stats.go report segment counts, pending deletions, discard queue depth, and GC head pointer via expvar.
  • GC scheduling exposes NoKV.Stats.value_log.gc (including gc_parallelism, gc_active, gc_scheduled, gc_throttled, gc_skipped, gc_rejected) for diagnostics.
  • nokv vlog --workdir <dir> loads a manager in read-only mode and prints current head plus file status (valid, gc candidate). It invokes vlog.VerifyDir before describing segments.
  • Recovery traces controlled by RECOVERY_TRACE_METRICS log every head movement and file removal, aiding pressure testing of GC edge cases. For ad-hoc diagnostics, enable Options.ValueLogVerbose to emit replay/GC messages through the process slog logger.

10. Quick Comparison

| Capability | RocksDB BlobDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Head tracking | In MANIFEST (blob log number + offset) | Internal to vlog directory | Manifest entry via EditValueLogHead |
| GC trigger | Compaction sampling, blob garbage score | Discard stats from LSM tables | Discard stats flushed through lfDiscardStats |
| Failure recovery | Blob DB and WAL coordinate two-phase commits | Replays value log then LSM | Rewind-on-error + manifest-backed deletes |
| Read path | Separate blob cache | Direct read + checksum | Manager.Read with copy + per-file lock |

By anchoring the vlog state in the manifest and exposing rewind/verify primitives, NoKV maintains the determinism of RocksDB while keeping Badger’s simple sequential layout.


11. Further Reading

  • docs/recovery.md – failure matrix covering append crashes, GC interruptions, and manifest rewrites.
  • docs/cache.md – how vlog-backed entries interact with the block cache.
  • docs/stats.md – metric names surfaced for monitoring.

Manifest & Version Management

The manifest is NoKV’s metadata log for:

  • SST files (EditAddFile / EditDeleteFile)
  • WAL checkpoint (EditLogPointer)
  • value-log metadata (EditValueLogHead, EditDeleteValueLog, EditUpdateValueLog)

Implementation: engine/manifest/manager.go, engine/manifest/codec.go, engine/manifest/types.go.


1. Files on Disk

WorkDir/
  CURRENT
  MANIFEST-000001
  MANIFEST-000002
  • CURRENT stores the active manifest filename.
  • CURRENT is updated via CURRENT.tmp -> CURRENT rename.
  • MANIFEST-* stores append-only encoded edits.

2. In-Memory Version Model

type Version struct {
    Levels       map[int][]FileMeta
    LogSegment   uint32
    LogOffset    uint64
    ValueLogs    map[ValueLogID]ValueLogMeta
    ValueLogHead map[uint32]ValueLogMeta
}
  • Levels: per-level SST metadata.
  • LogSegment/LogOffset: WAL replay checkpoint.
  • ValueLogs + ValueLogHead: all known vlog segments and per-bucket active heads.

3. Edit Append Semantics

Manager.LogEdits(edits...) does:

  1. Encode edits to a buffer.
  2. Write encoded bytes to current manifest file.
  3. Conditionally call manifest.Sync() when:
    • Manager.syncWrites == true, and
    • at least one edit type requires sync (Add/DeleteFile, LogPointer, value-log edits).
  4. Apply edits to in-memory Version.
  5. Trigger manifest rewrite if size crosses threshold.

SetSync(bool) and SetRewriteThreshold(int64) are configured by LSM options.
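
An illustrative, simplified shape of that sequence (not the actual method in engine/manifest/manager.go; the helper names and fields are invented for the sketch):

// Illustrative append path: encode, write, conditional fsync, apply, maybe rewrite.
func (m *Manager) logEditsSketch(edits ...Edit) error {
    m.mu.Lock()
    defer m.mu.Unlock()

    buf := encodeEdits(edits)                       // 1. encode edits to a buffer
    if _, err := m.current.Write(buf); err != nil { // 2. append to the active MANIFEST file
        return err
    }
    if m.syncWrites && anyRequiresSync(edits) {     // 3. conditional Sync()
        if err := m.current.Sync(); err != nil {
            return err
        }
    }
    m.version.apply(edits)                          // 4. apply to the in-memory Version
    if m.size() > m.rewriteThreshold {              // 5. rewrite when over threshold
        return m.rewrite()
    }
    return nil
}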


4. Rewrite Flow

When rewrite threshold is exceeded (or Rewrite() is called):

  1. Create next MANIFEST-xxxxxx.
  2. Write a full snapshot of current Version.
  3. Flush writer, and Sync() the new manifest when syncWrites is enabled.
  4. Update CURRENT to point to new file.
  5. Reopen the new manifest for appends and remove old manifest file.

If rewrite fails before CURRENT update, restart continues using previous manifest.


5. Interaction with Other Modules

| Module | Manifest usage |
| --- | --- |
| engine/lsm/level_manager.go::flush | Logs EditAddFile + EditLogPointer after SST install; compaction logs add/delete edits. |
| engine/lsm/level_manager.go::build | During startup, missing/corrupt SST entries are marked stale and cleaned via EditDeleteFile. |
| wal | Replays from manifest checkpoint (LogSegment, LogOffset). |
| vlog | Persists head/update/delete metadata and uses manifest state for stale/orphan cleanup on startup. |
| raftstore | Does not own manifest state. Store-local region catalogs and raft WAL replay checkpoints live in raftstore/localmeta; runtime routing state lives in Coordinator storage. |

6. Recovery-Relevant Guarantees

  1. Manifest append is ordered by single manager mutex.
  2. WAL replay starts from manifest checkpoint.
  3. Restart replays only storage-engine metadata.
  4. CURRENT indirection protects against partial manifest rewrite publication.

7. Operational Commands

nokv manifest --workdir <dir>

Useful fields:

  • log_pointer.segment, log_pointer.offset
  • levels[*].files
  • value_log_heads
  • value_logs[*].valid

See recovery.md and flush.md for startup and flush ordering details.

VFS

The vfs package provides a small filesystem abstraction used by WAL, manifest, SST, value-log, and raftstore paths.


1. Core Interfaces

vfs.FS includes:

  • file open/create: OpenHandle, OpenFileHandle
  • path ops: MkdirAll, Remove, RemoveAll, Rename, RenameNoReplace, Stat, ReadDir, Glob
  • helpers: ReadFile, WriteFile, Truncate, Hostname

vfs.File includes:

  • read/write/seek APIs
  • Sync, Truncate, Close, Stat, Name

vfs.Ensure(fs) maps nil to OSFS.


2. Rename Semantics

Rename:

  • normal rename/move semantics.
  • target replacement behavior is platform-dependent (os.Rename behavior).

RenameNoReplace:

  • contract: fail with os.ErrExist when destination already exists.
  • on Linux, uses renameat2(..., RENAME_NOREPLACE).
  • on macOS, uses renamex_np(..., RENAME_EXCL).
  • when atomic no-replace rename is unsupported by platform/filesystem, returns vfs.ErrRenameNoReplaceUnsupported (no non-atomic fallback).

3. Directory Sync Helper

vfs.SyncDir(fs, dir) fsyncs directory metadata to persist entry updates (create/rename/remove).

This is used in strict durability paths (for example SST install before manifest publication) to guarantee directory entry persistence.


4. Fault Injection (FaultFS)

FaultFS wraps an underlying FS and can inject failures by operation/path.

  • Rule helpers: FailOnceRule, FailOnNthRule, FailAfterNthRule, FailOnceRenameRule
  • File-handle faults: write/sync/close/truncate
  • Rename fault matching supports src/dst targeting

Used to test manifest/WAL/recovery failure paths deterministically.


5. Current Implementation Set

  • OSFS: production implementation (Go os package).
  • FaultFS: failure-injection wrapper over any FS.

No in-memory FS is included yet.


6. Design Notes

  • Keep storage code decoupled from direct os.* calls.
  • Make crash/failure tests reproducible.
  • Keep API minimal and only add operations required by real storage call sites.


File Abstractions

The engine/file package encapsulates direct file-system interaction for WAL, SST, and value-log files. It provides portable mmap helpers, allocation primitives, and log file wrappers.


1. Core Types

| Type | Purpose | Key Methods |
| --- | --- | --- |
| Options | Parameter bag for opening files (FID, path, size). | Used by WAL/vlog managers. |
| MmapFile | Cross-platform mmap wrapper. | OpenMmapFile, AppendBuffer, Truncate, Sync. |
| LogFile | Value-log specific helper built on MmapFile. | Open, Write, Read, DoneWriting, Truncate, Bootstrap. |

Darwin-specific builds live alongside (mmap_darwin.go, sstable_darwin.go) ensuring the package compiles on macOS without manual tuning.


2. Mmap Management

  • OpenMmapFile opens or creates a file, optionally extending it to maxSz, then mmaps it. The returned MmapFile exposes Data []byte and the underlying *os.File handle.
  • Writes grow the map on demand: AppendBuffer checks if the write would exceed the current mapping and calls Truncate to expand (doubling up to 1 GiB increments).
  • Sync flushes dirty pages (mmap.Msync), while Delete unmaps, truncates, closes, and removes the file—used when dropping SSTs or value-log segments.

RocksDB relies on custom Env implementations for portability; NoKV keeps the logic in Go, relying on build tags for OS differences.


3. LogFile Semantics

LogFile wraps MmapFile to simplify value-log operations:

lf := &file.LogFile{}
_ = lf.Open(&file.Options{FID: 1, FileName: "00001.vlog", MaxSz: 1<<29})
var buf bytes.Buffer
payload, _ := kv.EncodeEntry(&buf, entry)
_ = lf.Write(offset, payload)
_ = lf.DoneWriting(offset + uint32(len(payload)))
  • Open mmaps the file and records current size (guarded to < 4 GiB).
  • Read validates offsets against both the mmap length and tracked size, preventing partial reads when GC or drop operations shrink the file.
  • Entry encoding uses shared helpers in kv (kv.EncodeEntry / kv.EncodeEntryTo); LogFile focuses on write/read/truncate + durability semantics.
  • DoneWriting guarantees durability for both data bytes [0, offset) and the file metadata (size).
    • Sequence: It flushes dirty pages (msync), truncates the file to offset, and performs a file-descriptor level sync (fsync) to ensure the new file size is persisted on disk before returning.
    • Contract: Success implies that after a crash, the file size will not exceed offset, and all data prior to offset is safe.
    • After syncing, it reinitializes the mmap and keeps the file open in read-write mode for potential subsequent appends (if logic allows) or prepares it for read-only consumption.
  • Rewind (via vlog.Manager.Rewind) leverages LogFile.Truncate and Init to roll back partial batches after errors.

4. SST Helpers

While SSTable builders/readers live under engine/lsm/table.go, they rely on engine/file helpers to map index/data blocks efficiently. The build tags (sstable_linux.go, sstable_darwin.go) provide OS-specific tuning for direct I/O hints or mmap flags.


5. Comparison

| Engine | Approach |
| --- | --- |
| RocksDB | C++ Env & random-access file wrappers. |
| Badger | y.File abstraction with mmap. |
| NoKV | Go-native mmap wrappers with explicit log helpers. |

By keeping all filesystem primitives in one package, NoKV ensures WAL, vlog, and SST layers share consistent behaviour (sync semantics, truncation rules) and simplifies testing (engine/file/mmap_linux_test.go).


6. Operational Notes

  • DoneWriting provides strong crash-consistency guarantees. Even on filesystems where ftruncate metadata persistence is asynchronous, the explicit post-truncate fsync ensures the file size is durable upon success.
  • Value-log and WAL segments rely on DoneWriting/Truncate to seal files; avoid manipulating files externally, or the mmap metadata may desynchronise.
  • LogFile updates cached size internally on Write/Truncate, so read bounds stay consistent during rewrite/rewind flows.
  • vfs.SyncDir is used by strict durability flows to persist directory entry changes (create/rename/remove). For example, strict SST flush calls SyncDir(workdir) before manifest publication.

For more on how these primitives plug into higher layers, see docs/wal.md and docs/vlog.md.

Cache & Bloom Filters

NoKV’s LSM tier layers a multi-level block cache with decoded index caching to accelerate lookups. Bloom filters remain embedded in SST indexes and are probed directly from pb.TableIndex. The implementation is in engine/lsm/cache.go.


1. Components

| Component | Purpose | Source |
| --- | --- | --- |
| cache.indexes | Byte-budgeted W-TinyLFU cache for decoded table indexes (fid → *pb.TableIndex). | utils/cache |
| blockCache | Ristretto-based block cache (L0/L1 only) with per-table direct slots. | engine/lsm/cache.go |
| cacheMetrics | Atomic hit/miss counters for L0/L1 blocks and indexes. | engine/lsm/cache.go#L30-L110 |

Badger exposes separate block/index cache budgets while Pebble uses a unified cache budget. NoKV keeps block and index caches explicit; bloom filters piggyback on the decoded table index already held by each live SST.


1.1 Index Cache & Handles

  • SSTable metadata stays with the table struct, while decoded protobuf indexes are stored in cache.indexes. Lookups first hit the cache before falling back to disk.
  • SST handles are reopened on demand for lower levels. L0/L1 tables keep their file descriptors pinned, while deeper levels close them once no iterator is using the table.

2. Block Cache Strategy

  • User-space block cache: L0/L1, parsed blocks, Ristretto (LFU-ish)
  • Deeper levels: OS page cache + mmap readahead
  • Options.BlockCacheBytes sets the block-cache budget in bytes. Entries are admitted with an estimated block footprint, so larger blocks naturally consume more of the budget.
  • Per-table direct slots (table.cacheSlots[idx]) give a lock-free fast path. Misses fall back to the shared Ristretto cache (approx LFU with admission).
  • Evictions clear the table slot via OnEvict; user-space cache only tracks L0/L1 blocks. Deeper levels depend on the OS page cache.
  • Access patterns: getBlock also updates hit/miss metrics for L0/L1; deeper levels bypass the cache and do not affect metrics.
flowchart LR
  Read --> CheckCache
  CheckCache -->|hit| Return
  CheckCache -->|miss| LoadFromTable["LoadFromTable (mmap + OS page cache)"]
  LoadFromTable --> InsertCache
  InsertCache --> Return

By default only L0 and L1 blocks are cached (level > 1 short-circuits), reflecting the higher re-use for top levels.


3. Bloom Filters

  • Bloom filters are stored inside pb.TableIndex and probed directly from the decoded index already held by table.idx.
  • There is no separate bloom-filter cache layer; this avoids a redundant hot-path mutex/LRU hop on every point lookup.
  • indexCache keeps the existing W-TinyLFU admission path and budgets decoded pb.TableIndex payloads using the protobuf-encoded size (proto.Size).

4. Metrics & Observability

cache.metricsSnapshot() produces:

type CacheMetrics struct {
    L0Hits, L0Misses uint64
    L1Hits, L1Misses uint64
    IndexHits, IndexMisses uint64
}

Stats.Snapshot converts these into hit rates. Monitor them alongside the block cache sizes to decide when to scale memory.
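
As a rough usage sketch (only the CacheMetrics fields above come from the source; the snapshot accessor and helper are assumptions), hit rates can be derived like this:

// hitRate mirrors what Stats.Snapshot computes from the counters above.
func hitRate(hits, misses uint64) float64 {
    total := hits + misses
    if total == 0 {
        return 0
    }
    return float64(hits) / float64(total)
}

m := snapshot.Cache // hypothetical: a CacheMetrics sample taken from Stats.Snapshot
fmt.Printf("L0=%.2f L1=%.2f index=%.2f\n",
    hitRate(m.L0Hits, m.L0Misses),
    hitRate(m.L1Hits, m.L1Misses),
    hitRate(m.IndexHits, m.IndexMisses))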


5. Thermos Integration

Thermos is no longer part of cache warmup or read-path prefetch. Cache behavior is now independent of Thermos and driven only by:

  • iterator/table prefetch settings
  • block/index cache budgets
  • normal read traffic

The only remaining Thermos integration is optional write throttling.


6. Interaction with Value Log

  • Keys stored as value pointers (large values) still populate block cache entries for the key/index block. The value payload is read directly from the vlog (valueLog.read), so block cache hit rates remain meaningful.
  • Discard stats from flushes can demote cached blocks via cache.dropBlock, ensuring obsolete SST data leaves the cache quickly.

7. Comparison

| Feature | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Block cache policy | Configurable multiple caches | Single cache | Ristretto for L0/L1 + OS page cache for deeper levels |
| Bloom filter storage | Per table | Per table | Embedded in decoded table indexes |
| Metrics | Block cache stats via GetAggregatedIntProperty | Limited | NoKV.Stats.cache.* hit rates |

8. Operational Tips

  • If point-read false positives become expensive, tune bloom bits-per-key at SST build time rather than adding another filter cache layer.
  • Track nokv stats --json cache metrics over time; drops often indicate iterator misuse or working-set shifts.
  • Benchmark tooling accepts cache sizes in MB and converts them into these byte-budget fields before opening the engine.

More on SST layout lives in docs/manifest.md and docs/architecture.md.

Range Filter

NoKV’s range filter is a read-path pruning layer for the LSM tree. It is inspired by the paper GRF: A Global Range Filter for LSM-Trees with Shape Encoding, but it is deliberately more conservative and much simpler.

The current implementation is best described as:

  • in-memory only
  • correctness-first
  • advisory pruning, not a source of truth
  • table-level pruning plus table-internal block-range pruning

It is designed to reduce unnecessary SST and block probes on point lookups and narrow bounded scans without changing recovery or manifest semantics.


1. Goals

The range filter exists to cut down the part of the read path that answers:

  • which SSTs are even worth probing
  • which block range inside a chosen SST is even worth scanning

This is most valuable for:

  • point misses
  • point hits on non-overlapping levels with many SSTs
  • narrow bounded iterators

It is not intended to change write behavior, compaction policy, or persistence format.


2. Inspiration

The design direction comes from the SIGMOD 2024 GRF paper:

The paper’s main idea is stronger than what NoKV currently implements:

  • a true global range filter
  • shape encoding tied to LSM shape and run IDs
  • version/snapshot-aware maintenance
  • maintenance rules that assume more deterministic compaction behavior

NoKV does not implement full GRF or Shape Encoding. The current design only borrows the high-level idea:

  • do pruning before expensive per-run / per-table probes
  • keep the pruning layer global enough to matter
  • preserve correctness by falling back when uncertain

3. Current Design

3.1 Per-level table spans

Each levelHandler maintains a rangeFilter built from its current table set:

Each span records:

  • minimum base key (CF + user key)
  • maximum base key (CF + user key)
  • table pointer

For non-overlapping levels, the filter supports binary-search candidate lookup. For overlapping or small levels, it falls back to a conservative linear filter.

The implementation deliberately works on base-key semantics rather than pure user-key semantics so different column families are never collapsed into the same pruning range.
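
A simplified sketch of that per-level filter follows. The type and function names are illustrative (the real structures live with the levelHandler), but the binary-search candidate lookup is the mechanism described above for non-overlapping levels.

// tableSpan is an illustrative stand-in for one per-table span entry.
type tableSpan struct {
    minBase []byte // smallest base key (CF + user key) in the table
    maxBase []byte // largest base key in the table
    tbl     *table // hypothetical table handle
}

// candidate returns the single table on a non-overlapping level whose span may
// contain base, or nil when the whole level can be pruned. spans are kept
// sorted by maxBase, so sort.Search finds the first table that could reach base.
func candidate(spans []tableSpan, base []byte) *table {
    i := sort.Search(len(spans), func(i int) bool {
        return bytes.Compare(spans[i].maxBase, base) >= 0
    })
    if i < len(spans) && bytes.Compare(spans[i].minBase, base) <= 0 {
        return spans[i].tbl
    }
    return nil
}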

3.2 Point-read pruning

Point lookups first prune candidate tables:

If a non-overlapping level collapses to a single exact candidate, NoKV uses a thinner point-read path:

This avoids the heavier generic iterator-style path for the common “one table could contain this key” case.

3.3 Bounded iterator pruning

Bounded iterators use the same per-level filter to prune whole tables before iterator assembly:

Inside a table, NoKV then uses SST block base keys from the decoded table index to shrink the block range that the iterator needs to touch:

This is the current “table-internal” stage of the design.

3.4 Rebuild behavior

The filter is rebuilt when a level’s table set changes, for example after:

  • sort/install
  • compaction replacement
  • table deletion

The filter is not updated on every write. It is rebuilt on table-set changes under the existing level ownership rules.


4. Correctness Model

The range filter is intentionally not authoritative.

Rules:

  • if the filter is unsure, fall back
  • if the level overlaps, fall back
  • if the level is too small to amortize pruning overhead, fall back
  • never allow false-negative pruning

This is why the design is safe to deploy without changing startup, manifest, or recovery behavior.

L0 is handled conservatively:

  • it remains overlap-first
  • it does not use the non-overlapping exact-candidate fast path

That choice trades peak possible speed for simpler correctness.


5. What This Is Not

This document describes a practical NoKV optimization, not a claim of full GRF compatibility.

NoKV currently does not implement:

  • Shape Encoding
  • a persisted global range filter
  • run-ID encoding
  • snapshot/version-aware filter state
  • compaction-policy constraints required by full GRF

This is deliberate. Those features would couple the filter much more tightly to compaction scheduling, version management, and recovery semantics.


6. Observability

Range-filter behavior is exported through LSM diagnostics and top-level stats:

Current counters include:

  • PointCandidates
  • PointPruned
  • BoundedCandidates
  • BoundedPruned
  • Fallbacks

These counters are useful for deciding whether a workload is actually benefiting from pruning or mostly falling back.


7. Performance Notes

Microbenchmarks show strong gains when candidate-pruning is the dominant cost:

  • point misses
  • point hits on many-table non-overlapping levels
  • narrow bounded scans

System-level YCSB behavior is more nuanced:

  • read-heavy workloads benefit more clearly
  • mixed read/write workloads depend more on L0 shape, block loads, flush, and compaction pressure

This is expected. The range filter optimizes query planning and table/block selection. It does not remove the rest of the read path.


8. Why NoKV Stops Short of Full GRF

Full paper alignment is not the current goal.

Reasons:

  1. NoKV’s current compaction and level behavior does not naturally satisfy the paper’s stronger deterministic requirements.
  2. A persisted global filter would add metadata, rebuild, and recovery complexity.
  3. The current implementation already captures the most practical early win:
    • prune whole tables first
    • then prune block ranges inside the chosen table
  4. Current bottlenecks for read-heavy workloads are now more visible in:
    • L0 overlap handling
    • block loading
    • remaining table probe cost

For NoKV, this simpler design is a better engineering tradeoff today.


9. Source Map

Primary implementation files:

Related benchmark coverage:

Thermos

thermos is NoKV’s optional internal hotspot detector. The package lives in thermos/ and is no longer part of the default read path, LSM compaction planning, or value-log bucket routing.

Current Role

The production role is deliberately narrow:

  • detect write-hot keys when Options.ThermosEnabled is enabled
  • enforce Options.WriteHotKeyLimit via Thermos.TouchAndClamp
  • publish write-hot snapshots through StatsSnapshot.Hot.WriteKeys and StatsSnapshot.Hot.WriteRing

If ThermosEnabled=false, the DB runs without Thermos tracking on the hot path.

What Thermos Does Not Control Anymore

These integrations were intentionally removed from the default engine path:

  • read-path hot tracking and asynchronous read prefetch
  • LSM compaction scoring based on hot-key overlap
  • value-log hot/cold bucket routing
  • hot-write batch enlargement heuristics

The default value-log configuration keeps ordinary hash bucketization enabled through ValueLogBucketCount, but bucket selection no longer depends on Thermos.

Data Structure

Thermos remains a concurrent in-memory frequency tracker with:

  • sharded hash buckets
  • lock-free bucket lists for lookup/insert
  • atomic counters per node
  • optional sliding-window and rotation support

These capabilities are still useful for optional write throttling and operational diagnostics, even though they are no longer wired into unrelated subsystems.

Relevant Options

| Option | Meaning |
| --- | --- |
| ThermosEnabled | Master switch for write-hot tracking. |
| WriteHotKeyLimit | Reject writes once a single key exceeds the configured threshold. |
| ThermosBits | Bucket count (2^bits) for the tracker. |
| ThermosTopK | Number of hot keys exported in stats. |
| ThermosRotationInterval | Optional dual-ring rotation. |
| ThermosWindowSlots / ThermosWindowSlotDuration | Optional sliding-window tracking. |
| ThermosNodeCap / ThermosNodeSampleBits | Bound in-memory growth. |

Write Throttling

When both ThermosEnabled and WriteHotKeyLimit > 0 are set, NoKV records write frequency by CF + UserKey and returns utils.ErrHotKeyWriteThrottle once the limit is reached.

This path exists to protect the engine from pathological skew. It is intentionally independent from cache warming, compaction, and value-log routing.
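
A hedged usage sketch (the option wiring and Set signature are assumptions; the sentinel and option names come from this page):

opts.ThermosEnabled = true
opts.WriteHotKeyLimit = 10000 // clamp any single CF + UserKey above this write frequency

if err := db.Set(key, value); errors.Is(err, utils.ErrHotKeyWriteThrottle) {
    // pathological skew on this key: back off or shed load instead of retrying immediately
}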

Stats

Thermos contributes only write-side observability:

  • StatsSnapshot.Hot.WriteKeys
  • StatsSnapshot.Hot.WriteRing
  • StatsSnapshot.Write.HotKeyLimited

The CLI surfaces the same data under nokv stats.

Design Position

Thermos should be understood as:

  • an optional internal detector
  • an optional write throttling tool
  • not a required performance feature
  • not a default read-path or value-log optimization

That narrower scope keeps the core engine path simpler and makes Thermos easier to reason about.

Entry Lifecycle, Encoding, and Field Semantics

kv.Entry is the core record container that flows through user APIs, commit batching, WAL/value-log codecs, memtable indexes, SST blocks, and iterators.

This document explains:

  1. How key/value bytes are encoded.
  2. What Entry.Key / Entry.Value mean at each stage.
  3. Which fields are authoritative vs derived.
  4. How PopulateInternalMeta keeps internal fields consistent.
  5. Borrowed-vs-detached ownership rules.

1. Structure Overview

Source: engine/kv/entry.go, engine/kv/key.go, engine/kv/value.go

type Entry struct {
    Key          []byte
    Value        []byte
    ExpiresAt    uint64
    CF           ColumnFamily
    Meta         byte
    Version      uint64
    Offset       uint32
    Hlen         int
    ValThreshold int64
    ref          int32
}

Important interpretation:

  • Key is the canonical source of truth for internal records.
  • CF and Version are cached/derived fields for convenience.
  • Value can represent either:
    • inline value bytes, or
    • encoded ValuePtr bytes (Meta has BitValuePointer).

2. Encoding Layers

2.1 Internal Key Encoding

Source: engine/kv/key.go

InternalKey(cf, userKey, ts) layout:

  • 4-byte CF header: 0xFF 'C' 'F' <cf-byte>
  • raw user key bytes
  • 8-byte big-endian descending timestamp (MaxUint64 - ts)

Helpers:

  • SplitInternalKey(internal) -> (cf, userKey, ts, ok)
  • SplitBaseKey(base) -> (cf, userKey, ok)
  • Timestamp(key) / InternalToBaseKey(internal) / BaseKey(cf, userKey) / SameBaseKey(a, b)
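
For intuition, the layout can be written out directly. This is an illustrative re-implementation of the byte layout above, not the actual kv.InternalKey code:

func internalKey(cf byte, userKey []byte, ts uint64) []byte {
    buf := make([]byte, 0, 4+len(userKey)+8)
    buf = append(buf, 0xFF, 'C', 'F', cf) // 4-byte CF header
    buf = append(buf, userKey...)         // raw user key bytes
    var suffix [8]byte
    binary.BigEndian.PutUint64(suffix[:], math.MaxUint64-ts) // descending timestamp
    return append(buf, suffix[:]...)
}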

2.2 ValueStruct Encoding

Source: engine/kv/value.go

ValueStruct layout:

  • Meta (1B)
  • ExpiresAt (uvarint)
  • Value (raw bytes)

ValueStruct does not store Version; the version is always taken from the internal key.

2.3 Entry Record Encoding (WAL / Vlog record payload)

Source: engine/kv/entry_codec.go

Entry codec layout:

  • header (uvarints: keyLen/valueLen/meta/expiresAt)
  • key bytes
  • value bytes
  • crc32

DecodeEntryFrom now calls PopulateInternalMeta() after key decode so internal records get consistent CF/Version immediately.


3. Stage-by-Stage Meaning of Key and Value

3.1 User write (DB.Set, DB.SetBatch, DB.SetWithTTL, DB.Del, DB.DeleteRange, DB.ApplyInternalEntries)

Source: db.go

  • Set/SetBatch/SetWithTTL/Del/DeleteRange use NewInternalEntry(...):
    • Key: encoded internal key.
    • Value: user value bytes.
  • ApplyInternalEntries validates internal key, then writes back parsed CF/Version from key before entering write pipeline.

3.2 Commit worker: vlog then LSM apply

Source: db.go, vlog.go

  • Before LSM.SetBatch, large values are replaced by ValuePtr.Encode() bytes and BitValuePointer is set.
  • Small values stay inline.
  • Key remains internal key throughout.

3.3 WAL replay / vlog iteration decode

Source: engine/kv/entry_codec.go, engine/lsm/memtable.go

  • Decoded records carry internal key bytes in Key.
  • Value is record payload bytes (inline value or pointer bytes).
  • CF/Version are derived from key via PopulateInternalMeta.

3.4 Memtable index lookup

Source: engine/lsm/memtable.go, engine/index/skiplist.go, engine/index/art.go

  • memIndex.Search(...) returns (matchedInternalKey, ValueStruct).
  • memTable.Get assembles pooled Entry from this and calls PopulateInternalMeta.
  • For a miss, Key is nil and the value struct is zero (a sentinel, not a valid record).

3.5 SST / iterator decode

Source: engine/lsm/table_builder.go, engine/lsm/table.go

  • Block iterator reconstructs internal key + value struct.
  • It calls PopulateInternalMeta before exposing item entry.
  • table.Search clones key/value and re-populates internal metadata.

3.6 Internal read API (GetInternalEntry)

Source: db.go

  • loadBorrowedEntry fetches from LSM.
  • If BitValuePointer is set:
    • decode pointer from Value
    • read real bytes from vlog
    • replace Value with actual value bytes
    • clear BitValuePointer
  • Return entry with internal key still in Key, and CF/Version re-populated.

3.7 Public read API (Get, public iterator)

Source: db.go, iterator.go

  • Public APIs convert internal key to user key for external consumers.
  • Returned entry is detached copy (DB.Get) or iterator materialized object.

3.8 Runtime State Flow Diagram

flowchart TD
  A["DB.Set/SetWithTTL/Del"] --> B["kv.NewInternalEntry"]
  B --> C["DB.ApplyInternalEntries"]
  C --> D["commitWorker: vlog.write"]
  D --> E["LSM/WAL persist"]
  E --> F["DB.GetInternalEntry"]
  F --> G{"BitValuePointer?"}
  G -- yes --> H["vlog.read + clear pointer bit"]
  G -- no --> I["inline value"]
  H --> J["PopulateInternalMeta"]
  I --> J
  J --> K["DB.Get cloneEntry -> user key/value"]
  J --> L["DB.NewIterator materialize -> user key/value"]

4. Field Validity Matrix

| Context | Key | Value | CF/Version | Ownership |
| --- | --- | --- | --- | --- |
| Internal write path | Internal key | Inline value or ptr-bytes (large values) | Valid (set at build/validate time) | Borrowed/pooled |
| Decoded WAL/vlog record | Internal key | Encoded record payload value bytes | Valid after PopulateInternalMeta | Borrowed/pooled |
| GetInternalEntry return | Internal key | Real value bytes (pointer resolved) | Valid (re-populated before return) | Borrowed/pooled |
| DB.Get return | User key | Real value bytes | Valid for external semantics | Detached copy |
| Memtable miss sentinel | nil | nil | CFDefault/0 | Borrowed/pooled sentinel |

Rule of thumb:

  • Internal code should treat Key as authoritative.
  • CF/Version should be expected to match SplitInternalKey(Key) after PopulateInternalMeta.

5. PopulateInternalMeta Semantics

Source: engine/kv/entry.go

func (e *Entry) PopulateInternalMeta() bool

Behavior:

  1. Parse e.Key as internal key.
  2. If parse succeeds:
    • e.CF = parsedCF
    • e.Version = parsedTS
    • return true
  3. If parse fails:
    • reset cache fields to safe defaults (CFDefault, Version=0)
    • return false
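
Written out as a sketch (the real method lives in engine/kv/entry.go; SplitInternalKey is the helper from section 2.1), that behavior is roughly:

func (e *Entry) PopulateInternalMeta() bool {
    cf, _, ts, ok := SplitInternalKey(e.Key)
    if !ok {
        // not a valid internal key: reset cached fields to safe defaults
        e.CF, e.Version = CFDefault, 0
        return false
    }
    e.CF, e.Version = cf, ts
    return true
}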

Why it exists:

  • pooled entries can carry stale cached fields if not reset carefully;
  • parse-once helper gives a single normalization point at codec/index/read boundaries;
  • keeps key-derived metadata (CF/Version) authoritative and consistent.

6. Ownership and Refcount States

Source: engine/kv/entry.go, docs/architecture.md

Borrowed entry (internal)

Returned by:

  • DecodeEntryFrom
  • memTable.Get
  • LSM.Get internals
  • DB.GetInternalEntry

Contract:

  • caller must call DecrRef() exactly once.
  • double release panics (underflow guard).

Detached entry (public)

Returned by:

  • DB.Get (via clone)

Contract:

  • do not call DecrRef().
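
A hedged usage sketch of the two contracts (argument and return shapes are assumptions; the method names and DecrRef come from this section):

ent, err := db.GetInternalEntry(internalKey) // borrowed/pooled entry
if err != nil {
    return err
}
defer ent.DecrRef() // exactly once; a second release trips the underflow guard
// ... use ent.Key / ent.Value only while the reference is held ...

got, _ := db.Get(userKey) // detached copy: never call DecrRef on it
_ = got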

7. Contributor Rules

  1. Any new internal decode boundary should either:
    • call PopulateInternalMeta, or
    • immediately parse key with SplitInternalKey and set fields consistently.
  2. Do not reintroduce value-side version caches.
  3. Keep internal APIs internal-key-first; only external APIs should expose user keys.
  4. Preserve borrowed/detached ownership contracts in comments and tests.

Error Handling Guide

This document defines how NoKV should own, define, and propagate errors.


1. Ownership Rules

  1. Domain errors stay in domain packages.
  2. Cross-cutting runtime errors may live in utils only when shared by multiple subsystems.
  3. Command-local/business-flow errors should be unexported (errXxx) and stay in command/service packages.

Examples:

  • kv: entry codec/read-decode errors.
  • vfs: filesystem contract errors.
  • coordinator/catalog: control-plane validation/conflict errors.

2. Propagation Rules

  1. Wrap with %w when crossing package boundaries.
  2. Match via errors.Is, not string compare.
  3. Keep stable sentinel values for retryable / control-flow decisions.
  4. Add context in upper layers; do not lose original cause.
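
A minimal sketch of these rules (the reading helper is hypothetical; the ErrBadChecksum sentinel appears in the error map below):

// wrap at the package boundary and keep the original cause
func readRecord(path string) error {
    if err := decodeRecord(path); err != nil { // decodeRecord is hypothetical
        return fmt.Errorf("vlog: read record %s: %w", path, err)
    }
    return nil
}

// match by sentinel at the caller, never by string comparison
if err := readRecord(p); errors.Is(err, kv.ErrBadChecksum) {
    // corrupt tail record: let the truncation/restart logic in upper layers decide
}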

3. Naming Rules

  1. Exported sentinels use ErrXxx.
  2. Error text should be lowercase and package-scoped when useful (for example coordinator/catalog: ..., coordinator/idalloc: ..., vfs: ...).
  3. Avoid duplicate sentinels with identical semantics in different packages.

4. Current Error Map

Shared runtime sentinels

  • utils/error.go: common cross-package sentinels such as invalid request, key/value validation errors, throttling, and lifecycle guards.

Domain-specific sentinels

  • engine/kv/entry_codec.go: ErrBadChecksum, ErrPartialEntry
  • engine/vfs/vfs.go: ErrRenameNoReplaceUnsupported
  • engine/lsm/compaction.go: compaction planner/runtime domain errors
  • raftstore/peer/errors.go: peer lifecycle/state errors
  • pb/errorpb.proto: region/store routing protobuf errors (RegionError, StoreNotMatch, RegionNotFound, KeyNotInRegion, …)
  • engine/wal/errors.go: WAL encode/decode and segment errors
  • coordinator/catalog/errors.go: Coordinator metadata and range validation errors

5. Propagation in Hot Paths

  1. Embedded write path (DB.Set* -> commit worker -> LSM/WAL):
    • validation returns direct sentinel (ErrEmptyKey, ErrNilValue, ErrInvalidRequest);
    • storage boundary errors are wrapped with context and preserved via %w.
  2. Distributed command path (kv.Service -> Store.*Command -> kv.Apply):
    • region/leader/store/range failures are mapped to errorpb messages in protobuf responses;
    • execution failures return Go errors to RPC layer and are translated to gRPC status.
  3. Recovery/replay path (WAL/Vlog/Manifest):
    • partial/corrupt records return domain sentinels and are handled by truncation or restart logic in upper layers.

RaftStore Deep Dive

raftstore powers NoKV’s distributed mode by layering multi-Raft replication on top of the embedded storage engine. Its RPC surface is exposed as the NoKV gRPC service, while the command model still tracks the TinyKV/TiKV region + MVCC design. This note explains the major packages, the boot and command paths, how transport and storage interact, and the supporting tooling for observability and testing.

Read this page if you want to answer one question precisely: which package owns which responsibility once NoKV leaves standalone mode and becomes a cluster.

For the protocol line that now minimally formalizes raftstore request admission, transition execution, publish boundaries, and restart state, read docs/control_and_execution_protocols.md together with this file.

Reader Map

  • Start with section 1 if you want package ownership.
  • Jump to section 3 if you want request execution.
  • Jump to section 8 if you care about split/merge and control-plane behavior.
  • Read this together with coordinator.md and migration.md if your focus is lifecycle rather than just request flow.

1. Package Structure

| Package | Responsibility |
| --- | --- |
| store | Orchestrates peer set, command pipeline, region manager, scheduler/heartbeat loops; exposes helpers such as StartPeer, ProposeCommand, SplitRegion. |
| peer | Wraps etcd/raft RawNode, drives Ready processing (persist to WAL, send messages, apply entries), tracks snapshot resend/backlog. |
| raftlog | WALStorage/DiskStorage/MemoryStorage across all Raft groups, leveraging the NoKV WAL while tracking store-local raft replay metadata. |
| meta | Store-local durable metadata: peer catalog for restart and raft WAL replay checkpoints for replay/GC. |
| transport | gRPC transport with retry/TLS/backpressure; exposes the raft Step RPC and can host additional services (NoKV). |
| kv | NoKV RPC implementation, bridging Raft commands to MVCC operations via kv.Apply. |
| server | Config + NewNode that bind storage, Store, transport, and the shared KV/admin RPC surfaces into a reusable node primitive. |

Runtime ownership sketch

flowchart TD
    Client["Client / fsmeta gateway / CLI"]
    Client --> KV["kv.Service"]
    Client --> Coordinator["Coordinator"]

    subgraph "Node runtime"
        Server["server.Node"] --> Store["store.Store"]
        Store --> Router["router"]
        Store --> Peer["peer.Peer"]
        Store --> Scheduler["scheduler runtime"]
        Store --> Admin["admin service"]
        Store --> Meta["raftstore/localmeta"]
        Peer --> Engine["raftstore/raftlog"]
        Peer --> Apply["kv.Apply"]
    end

    Apply --> DB["NoKV DB"]

2. Boot Sequence

  1. Construct Server

    srv, _ := server.NewNode(server.Config{
        Storage: server.Storage{MVCC: db, Raft: db.RaftLog()},
        Store: store.Config{StoreID: 1},
        Raft: myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
        TransportAddr: "127.0.0.1:20160",
    })
    
    • A gRPC transport is created, the NoKV service is registered, and transport.SetHandler(store.Step) wires raft Step handling.
    • store.Store loads the local peer catalog from raftstore/localmeta to rebuild the Region catalog (router + metrics).
  2. Start local peers

    • CLI (nokv serve) loads the local peer catalog and calls Store.StartPeer for every region that includes the local store.
    • Each peer.Config carries raft parameters, the transport reference, kv.NewEntryApplier, peer storage, and Region metadata.
    • StartPeer registers the peer through the peer-set/routing layer and may bootstrap or campaign for leadership.
  3. Peer connectivity

    • transport.SetPeer(peerID, addr) defines outbound raft connections; the CLI resolves those peer endpoints from local metadata plus stable storeID -> addr mapping.
    • Additional services can reuse the same gRPC server through transport.WithServerRegistrar.

Minimal mental model

  • Server is the node wiring root.
  • Store is the runtime owner for what this node hosts.
  • Peer is one region replica’s state machine and raft runtime.
  • Meta is a local recovery mirror, not cluster truth.
  • Coordinator is control plane, not a writer of local truth.

3. Command Execution

Read (strong leader read)

  1. kv.Service.KvGet builds pb.RaftCmdRequest and invokes Store.ReadCommand.
  2. validateCommand ensures the region exists, epoch matches, and the local peer is leader; a RegionError is returned otherwise.
  3. peer.LinearizableRead obtains a safe read index, then peer.WaitApplied waits until local apply index reaches it.
  4. commandApplier (i.e. kv.Apply) runs GET/SCAN against the DB using MVCC readers to honor locks and version visibility.

Write (via Propose)

  1. Write RPCs (Prewrite/Commit/…) call Store.ProposeCommand, encoding the command and routing to the leader peer.
  2. The leader appends the encoded request to raft, replicates, and once committed the command pipeline hands data to kv.Apply, which maps Prewrite/Commit/ResolveLock to the percolator package.
  3. engine.WALStorage persists raft entries/state snapshots and updates raftstore/localmeta raft pointers. This keeps WAL GC and raft truncation aligned without polluting the storage manifest.
  4. Raft apply only accepts command-encoded payloads (RaftCmdRequest). Legacy raw KV payloads are rejected as unsupported.

Command flow diagram

sequenceDiagram
    participant C as NoKV Client
    participant SVC as kv.Service
    participant ST as store.Store
    participant PR as peer.Peer
    participant RF as raft log/replication
    participant AP as kv.Apply
    participant DB as NoKV DB

    rect rgb(237, 247, 255)
      C->>SVC: KvGet/KvScan
      SVC->>ST: ReadCommand
      ST->>PR: LinearizableRead + WaitApplied
      ST->>AP: commandApplier(req)
      AP->>DB: internal read path
      AP-->>C: read response
    end

    rect rgb(241, 253, 244)
      C->>SVC: KvPrewrite/KvCommit/...
      SVC->>ST: ProposeCommand
      ST->>RF: route + replicate
      RF->>AP: apply committed entry
      AP->>DB: percolator mutate -> ApplyInternalEntries
      AP-->>C: write response
    end

4. Transport

  • gRPC transport listens on TransportAddr, serving both raft Step RPC and NoKV RPC.
  • SetPeer updates the mapping of remote store IDs to addresses; BlockPeer can be used by tests or chaos tooling.
  • Configurable retry/backoff/timeout options mirror production requirements. Tests cover message loss, blocked peers, and partitions.

5. Storage Backend (engine)

  • WALStorage piggybacks on the embedded WAL: each Raft group writes typed entries, HardState, and snapshots into the shared log.
  • raftstore/localmeta persists the store-local raft replay pointer used by WAL GC and replay.
  • Alternative storage backends (DiskStorage, MemoryStorage) are available for tests and special scenarios.

6. NoKV RPC Integration

| RPC | Execution Path | Notes |
| --- | --- | --- |
| KvGet / KvScan | ReadCommand → LinearizableRead (ReadIndex) + WaitApplied → kv.Apply (read mode) | Leader-only strong read with Raft linearizability barrier. |
| KvPrewrite / KvCommit / KvBatchRollback / KvResolveLock / KvCheckTxnStatus | ProposeCommand → command pipeline → raft log → kv.Apply | Pipeline matches proposals with apply results; MVCC latch manager prevents write conflicts. |

The cmd/nokv serve command uses server.Node internally and prints a local peer catalog summary (key ranges, peers) so operators can verify the node’s recovery view at startup.


7. Client Interaction (raftstore/client)

  • Region-aware routing with NotLeader/EpochNotMatch retry.
  • Mutate splits mutations by region and performs two-phase commit (primary first). Put / Delete are convenience wrappers.
  • Scan transparently walks region boundaries.
  • End-to-end coverage lives in raftstore/server/node_test.go, which launches real nodes, uses the client to write and delete keys, and verifies the results.
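
A hedged usage sketch (the constructor and config fields are assumptions; Put, Delete, Scan, and the retry behavior come from this section):

cli, err := client.New(client.Config{CoordinatorAddrs: []string{"127.0.0.1:2379"}}) // hypothetical constructor
if err != nil {
    log.Fatal(err)
}
defer cli.Close()

_ = cli.Put(ctx, []byte("k1"), []byte("v1")) // convenience wrapper over Mutate (2PC, primary first)
_ = cli.Delete(ctx, []byte("k2"))

// Scan walks region boundaries transparently; NotLeader/EpochNotMatch are retried internally.
pairs, _ := cli.Scan(ctx, []byte("a"), []byte("z"), 100)
_ = pairs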

8. Control Plane & Region Operations

8.1 Topology & Routing

  • Topology is sourced from raft_config.example.json (via config.LoadFile) and reused by scripts and Docker Compose as bootstrap metadata.
  • Runtime routing is Coordinator-first: raftstore/client resolves Regions by key through GetRegionByKey and caches route entries for retries.
  • raft_config regions are treated as bootstrap/deployment metadata and are not the runtime source of truth once Coordinator is available.
  • Coordinator is the only control-plane source of truth for runtime scheduling/routing.

Why the layering matters

The design rule is:

  • RaftAdmin executes membership operations against the current leader store.
  • Coordinator decides and observes at the cluster level.
  • Store/Peer own local truth and apply.

That split is what keeps migration and scheduling from becoming a second, parallel truth path.

8.2 Split / Merge

  • Split: leaders call Store.ProposeSplit, which writes a split AdminCommand into the parent region’s raft log. On apply, Store.SplitRegion updates the parent range/epoch and starts the child peer.
  • Merge: leaders call Store.ProposeMerge, writing a merge AdminCommand. On apply, the target region range/epoch is expanded and the source peer is stopped/removed from the manifest.
  • These operations are explicit/manual and are not auto-triggered by size/traffic heuristics.

9. Observability

  • store.RegionMetrics() feeds into StatsSnapshot, making region counts and backlog visible via expvar and nokv stats.
  • nokv regions shows the local peer catalog used for store recovery: ID, range, peers, state. It is a store-local recovery view, not cluster routing authority.
  • CHAOS_TRACE_METRICS=1 go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport exercises transport metrics under faults; scripts/dev/cluster.sh spins up multi-node clusters for manual inspection.

Store internals at a glance

| Component | File | Responsibility |
| --- | --- | --- |
| Store facade | store.go | Store construction/wiring and shared component ownership (router, region manager, command pipeline, scheduler runtime). |
| Peer lifecycle | peer_lifecycle.go | Start/stop peers, router registration, lifecycle hooks, and store shutdown sequencing. |
| Command ops | command_ops.go | Region/epoch/key-range validation and read/propose request handling. |
| Admin ops | admin_ops.go | Split/merge proposal handling and applied admin command side effects. |
| Membership ops | membership_ops.go | Conf-change proposal helpers and local region metadata updates after membership changes. |
| Region catalog | region_catalog.go | Public region catalog accessors and region metadata lifecycle operations. |
| Router | router.go | Tracks active peers and dispatches requests/messages to the owning peer. |
| Command pipeline | command_pipeline.go | Assigns request IDs, records proposals, matches apply results, returns responses/errors to callers. |
| Region manager | region_manager.go | Validates state transitions, persists local peer catalog updates, updates peer metadata, publishes scheduler-visible region state. |
| Scheduler runtime | scheduler_runtime.go | Periodically publishes region/store heartbeats, enforces cooldown & burst limits, and applies scheduling actions. |

10. Region Truth and Persistence Roles

NoKV intentionally keeps three different region views, and they do not have equal authority:

  1. Store-local region truth

    • Advanced only through raft apply paths and bootstrap/restart loading.
    • Owned by region_manager.go via:
      • applyRegionMeta
      • applyRegionState
      • applyRegionRemoval
      • loadBootstrapSnapshot
    • This is the runtime source of truth for what a store currently hosts.
  2. Store-local persistent mirror

    • Owned by raftstore/localmeta.
    • Persists:
      • local region catalog entries for restart
      • local raft WAL replay checkpoints
    • This mirror exists for local recovery only. It is not cluster routing authority and must not be treated as consensus truth outside the store.
  3. Coordinator control-plane view

    • Owned by Coordinator and persisted through coordinator/rootview.
    • Built from region/store heartbeats and allocator durability checkpoints.
    • Used for:
      • route lookup
      • scheduler decisions
      • allocator durability
    • Coordinator is not allowed to overwrite local raftstore truth directly.

The resulting rule is simple:

  • raft apply/bootstrap advances local truth
  • raftstore/localmeta mirrors that truth for restart
  • Coordinator observes and schedules from heartbeats

This separation is what prevents parallel truth sources from creeping back into the design.


11. Current Boundaries and Guarantees

  • Reads served through ReadCommand are leader-strong and pass a Raft linearizability barrier (LinearizableRead + WaitApplied).
  • Mutating NoKV RPC commands are serialized through Raft log replication and apply.
  • Command payload format on apply path is strict RaftCmdRequest encoding.
  • Region metadata (range/epoch/peers) is validated before both read and write command execution.
  • The store now records explicit execution-plane admission outcomes for read, write, and topology entry points.
  • Topology transitions now record explicit local execution outcome plus publish boundary state (planned published, terminal pending, terminal published, terminal failed).
  • Terminal publish failure is retained as visible retry state rather than being silently dropped.
  • Restart posture is now exposed explicitly from local durable recovery state (raftstore/localmeta + raft replay pointers).
  • raftstore/admin now exposes execution-plane diagnostics through ExecutionStatus for last admission, topology lifecycle, and restart posture.
  • nokv execution --addr <store-admin-addr> now queries that diagnostics surface directly from the CLI, with optional --region and --transition filters on topology output.
  • store.RegionMetrics + StatsSnapshot provide runtime visibility for region count, backlog, and scheduling health.

Coordinator

Coordinator is NoKV’s control-plane service for distributed mode. It exposes a gRPC API (pb.Coordinator) and is started by:

go run ./cmd/nokv coordinator --addr 127.0.0.1:2379

1. Responsibilities

Coordinator currently owns:

  • Routing: GetRegionByKey
  • Heartbeats: StoreHeartbeat, RegionHeartbeat
  • Region removal: RemoveRegion
  • ID service: AllocID
  • TSO: Tso

Runtime clients (for example cmd/nokv-fsmeta) use Coordinator as the routing source of truth, but Coordinator is not the durable owner of cluster topology truth. Durable truth lives in meta/root.


2. Runtime Architecture

flowchart LR
    Store["nokv serve"] -->|"StoreHeartbeat / RegionHeartbeat"| Coordinator["Coordinator (gRPC)"]
    Gateway["nokv-fsmeta"] -->|"GetRegionByKey / Tso"| Coordinator
    Coordinator --> Cluster["coordinator/catalog.Cluster"]
    Cluster --> Scheduler["leader-transfer hint planner"]

Core implementation units:

  • coordinator/catalog: in-memory cluster metadata model.
  • coordinator/idalloc: monotonic ID allocator used by Coordinator.
  • coordinator/rootview: persistence abstraction (Store) backed by the metadata root.
  • coordinator/server: gRPC service + RPC validation/error mapping.
  • coordinator/client: client wrapper used by store/gateway.
  • coordinator/adapter: scheduler sink that forwards heartbeats into Coordinator.

For the next-stage protocol direction on both the control plane and the paired execution plane, see docs/control_and_execution_protocols.md.

Control-Plane Protocol Status

Coordinator now uses a minimal formal control-plane protocol v1 for its key route and transition surfaces.

Already in active use:

  • route-read Freshness
  • rooted token serving metadata
  • rooted lag exposure
  • DegradedMode
  • CatchUpState
  • TransitionID
  • publish-time lifecycle assessment on PublishRootEvent

This means Coordinator no longer exposes only best-effort implementation behavior. It now returns explicit protocol state that callers, tests, and docs can rely on.

The current protocol is intentionally minimal. It does not yet expose the full future runtime/operator model such as stalled transitions or richer catch-up actions.

Minimal Eunomia vocabulary

The rooted handoff protocol is intentionally small. Docs and operator-facing surfaces should use the same vocabulary as the implementation and the Eunomia research note:

  • Tenure — the currently active authority record
  • Legacy — the retired predecessor era plus the frontier it already consumed
  • Handover — the rooted handoff record for the current successor
  • Era — the monotonic authority era
  • Witness — the operator-visible proof bundle that explains whether the current handoff state is safe

The four guarantees discussed by the docs and runtime metrics are:

  • Primacy — at most one authority era is active
  • Inheritance — the successor must cover the predecessor’s published work
  • Silence — a sealed predecessor must not keep serving
  • Finality — a handoff must not remain permanently half-finished

The mapping to concrete implementation types is direct:

| Doc term | Implementation term |
| --- | --- |
| Tenure | Tenure |
| Legacy | Legacy |
| Handover | Handover |
| Era | Era / era |
| Witness | HandoverWitness / continuation witness fields |
| Frontiers | MandateFrontiers / frontiers / consumed_frontiers |

Do not reintroduce Lease / Seal as public aliases. They are useful informally, but keeping them in formal docs creates two names for the same rooted objects and makes Eunomia harder to explain.


3. Deployment Model

NoKV ships exactly one distributed topology plus the standalone engine shape.

standalone

  • no coordinator
  • no meta/root
  • no control-plane process
  • all truth remains inside the single storage process

This is the default local engine shape. Standalone is not a degraded control plane deployment; it simply has no control plane.

separated meta-root + coordinator

  • three independent nokv meta-root processes own durable rooted truth (replicated raft quorum, the only backend NoKV ships)
  • one or more nokv coordinator processes connect through the remote metadata-root gRPC API
  • Tenure gates singleton Coordinator duties: AllocID, Tso, and scheduler operation planning
  • route reads still come from Coordinator’s rebuildable in-memory view and expose Freshness, RootToken, CatchUpState, and DegradedMode

Keep the same logical split inside every deployment:

  • meta/root/*: durable rooted truth (replicated + gRPC service)
  • coordinator/view + coordinator/catalog: rebuildable routing/scheduling state
  • coordinator/rootview: remote view of meta-root consumed by coordinator/server
  • coordinator/server: gRPC API surface

Product assumptions:

  • exactly three meta-root replicas
  • meta-root is the only place durable rooted truth lives
  • coordinators are stateless relative to rooted truth; only the Tenure differentiates active vs standby
  • no dynamic metadata-root membership
  • no production-grade dynamic coordinator membership manager

4. Persistence (--workdir)

--workdir is required for every formal Coordinator deployment that hosts rooted truth.

separated meta-root + coordinator

Each meta-root process has its own rooted workdir and raft transport address. The coordinator process does not host rooted truth; it only connects to the remote root endpoints through --root-peer nodeID=grpc_addr.

Each meta-root workdir persists two layers of state:

  1. rooted truth state
    • root.events.wal
    • root.checkpoint.binpb
  2. replicated protocol state
    • root.raft.bin
    • contains raft hard state, raft snapshot, and retained raft entries

Each meta-root node must have an isolated workdir. Workdirs are not shared.

Persistence ownership:

  1. meta-root workdirs own durable rooted truth and replicated metadata-root raft state.
  2. coordinator runtime view is rebuildable from remote meta/root.
  3. allocator fences and Tenure are rooted events, not local coordinator files.

--coordinator-id must be a stable configured identity. It is used for Tenure ownership and operator debugging; it should not be generated randomly on each restart.

Rooted bootstrap flow

The Coordinator storage layer rebuilds its region snapshot and allocator checkpoints by replaying rooted truth:

  • region descriptor publish/tombstone events rebuild the route catalog
  • allocator fences rebuild:
    • id_current
    • ts_current

id_current and ts_current are durable allocator fences, not necessarily the last values served. With allocator window preallocation they may be ahead of the last returned ID or timestamp; restart recovery intentionally resumes after the fence and skips unused values.

Startup flow:

  1. Open rooted coordinator/rootview against the 3 meta-root --root-peer endpoints.
  2. Reconstruct a rooted Coordinator snapshot (regions + allocator fences).
  3. Compute starts as max(cli_start, fence+1).
  4. Materialize the rooted region snapshot into coordinator/catalog.Cluster.

Coordinator periodically refreshes rooted state via the meta-root tail stream and rebuilds the service-side view. This avoids allocator rollback and keeps all durable truth inside meta-root.

Region Truth Hierarchy

NoKV intentionally keeps three region views with different authority:

  • Coordinator region catalog: cluster routing truth. Clients and stores must treat Coordinator as the authoritative key-to-region source at the service boundary, but Coordinator rebuilds this view from rooted metadata truth plus heartbeats.
  • raftstore/localmeta local catalog: store-local recovery truth. It exists so one store can restart hosted peers and replay raft WAL checkpoints even if Coordinator is temporarily unavailable.
  • Store.regions runtime catalog: in-memory cache/view rebuilt from local metadata at startup and then advanced by peer lifecycle plus raft apply.

These layers are not interchangeable. Local metadata is recovery state, not cluster routing authority.


5. Config Integration

raft_config.json supports Coordinator endpoint + workdir defaults:

"coordinator": {
  "addr": "127.0.0.1:2379",
  "docker_addr": "nokv-coordinator:2379",
  "work_dir": "./artifacts/cluster/coordinator",
  "docker_work_dir": "/var/lib/nokv-coordinator"
}

Resolution rules:

  • CLI override wins.
  • Otherwise read from config by scope (host / docker).

Helpers:

  • config.ResolveCoordinatorAddr(scope)
  • config.ResolveCoordinatorWorkDir(scope)
  • nokv-config coordinator --field addr|workdir --scope host|docker

Replicated-root transport settings are currently CLI-driven, not config-file driven.


6. Routing Source Convergence

NoKV now uses Coordinator-first routing:

  • raftstore/client resolves regions with GetRegionByKey.
  • raft_config regions are bootstrap/deployment metadata.
  • Runtime route truth comes from Coordinator heartbeats + Coordinator region catalog.

This avoids dual sources drifting over time (config vs Coordinator).


7. Serve Mode Semantics

nokv serve is now Coordinator-only:

  • --coordinator-addr is required.
  • Runtime routing/scheduling control-plane state is sourced from Coordinator.

For restart and recovery, nokv serve intentionally separates runtime truth from deployment metadata:

  • hosted region/peer truth comes from raftstore/localmeta
  • raft durable progress comes from the store workdir (WAL, raft log, local metadata)
  • raft_config.json is used only to resolve static addresses (Coordinator, store listen, store transport)

This means:

  • bootstrap-time config.regions are not replayed during restart
  • runtime split/merge/peer-change results continue to come back from local recovery state
  • --store-addr is an exceptional static address override, not the normal restart path
  • --store-id must match the durable workdir identity when the workdir was already used

The recommended restart shape is therefore:

nokv serve \
  --config ./raft_config.example.json \
  --scope host \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1

serve will:

  1. load the local peer catalog from the store workdir
  2. derive the current remote peer set from local metadata
  3. use config stores only to map storeID -> addr

If static transport overrides are needed, prefer stable store identities:

nokv serve \
  --config ./raft_config.example.json \
  --scope host \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1 \
  --store-addr 2=10.0.0.12:20160

Related CLI behavior:

  • Inspect control-plane state through Coordinator APIs/metrics.
  • nokv coordinator --metrics-addr <host:port> exposes native expvar on /debug/vars.
  • nokv serve --metrics-addr <host:port> exposes store/runtime expvar on /debug/vars.

8. Service Semantics

Coordinator intentionally separates rooted truth leadership from the outer gRPC service surface.

In 3 coordinator + replicated meta:

  • all three coordinator processes may listen and serve RPC
  • only the rooted leader may commit truth writes
  • followers refresh rooted state and serve read/view traffic

In separated mode:

  • meta-root leadership determines which root endpoint accepts truth writes
  • Tenure determines which Coordinator may serve singleton duties
  • non-holder Coordinators may still serve route reads if their rooted view satisfies the caller’s freshness contract

Leader-only writes

These RPCs require rooted leadership:

  • RegionHeartbeat
  • PublishRootEvent
  • RemoveRegion
  • AllocID
  • Tso

Followers return FailedPrecondition with coordinator not leader semantics, and clients are expected to retry against another Coordinator endpoint.

In separated mode, AllocID, Tso, and scheduler operation planning also require the local Coordinator to hold Tenure.

Any-node reads

These RPCs may be served by any Coordinator node:

  • GetRegionByKey
  • StoreHeartbeat handling and store-view inspection

Follower reads are driven by a rooted watch-first tail subscription, with explicit refresh/reload as fallback into coordinator/catalog.Cluster. They are expected to be slightly stale rather than linearizable.

For GetRegionByKey, that follower-service behavior is now explicit in the protocol surface:

  • callers can request Freshness
  • responses include rooted token metadata
  • responses disclose DegradedMode and CatchUpState
  • bounded-freshness reads may be rejected if rooted lag exceeds the requested limit

Client behavior

coordinator/client accepts multiple Coordinator addresses. Write RPCs retry across Coordinator nodes and converge on the rooted leader. Read RPCs may use any available Coordinator endpoint.


9. Deployment Example

NoKV ships exactly one topology: a 3-peer replicated meta-root cluster plus one or more coordinator processes that talk to it over gRPC.

Start three metadata-root peers (peer map is identical on all three; only -addr, -workdir, -node-id, -transport-addr differ):

go run ./cmd/nokv meta-root \
  -addr 127.0.0.1:2380 \
  -workdir ./artifacts/cluster/meta-root-1 \
  -node-id 1 \
  -transport-addr 127.0.0.1:3380 \
  -peer 1=127.0.0.1:3380 \
  -peer 2=127.0.0.1:3381 \
  -peer 3=127.0.0.1:3382

Start one coordinator (add more by giving each a distinct -coordinator-id and -addr; they share the same -root-peer set):

go run ./cmd/nokv coordinator \
  -addr 127.0.0.1:2379 \
  -coordinator-id c1 \
  -root-peer 1=127.0.0.1:2380 \
  -root-peer 2=127.0.0.1:2381 \
  -root-peer 3=127.0.0.1:2382

Current product assumptions:

  • exactly three meta-root replicas
  • meta-root is the only place durable rooted truth lives
  • coordinators are stateless relative to rooted truth; only the Tenure differentiates active vs standby
  • no dynamic metadata-root membership

For local bootstrap, use:

./scripts/dev/cluster.sh --config ./raft_config.example.json

10. Comparison: TinyKV / TiKV

TinyKV (teaching stack)

  • Uses a scheduler server (tinyscheduler) as a separate process.
  • Control plane integrates embedded etcd for metadata persistence.
  • Educational architecture, minimal production hardening.

TiKV (production stack)

  • The Coordinator-equivalent component (PD) is an independent, highly available cluster.
  • PD internally uses embedded etcd (Raft) for durable metadata + leader election.
  • Rich scheduling and balancing policies, rolling updates, robust ops tooling.

NoKV Coordinator (current)

  • Standalone mode has no Coordinator and no metadata-root service.
  • Distributed mode ships exactly one control-plane deployment: separated meta-root + coordinator (see section 3).
  • meta-root is the durable truth service; the coordinator is a remote rooted view/service layer that rebuilds its service-side view from rooted truth.
  • Coordinator persistence is intentionally limited to rooted control-plane truth:
    • region descriptor publish/tombstone events
    • allocator durability (AllocID, TSO)
    • Tenure ownership for separated singleton duties
  • Coordinator is not the durable owner of a store’s local raft/region truth. Store restart truth remains in raftstore/localmeta, while Coordinator keeps routing and scheduling state rebuilt from meta/root.

11. Current Limitations / Next Steps

  • The separated meta-root + coordinator deployment is the only shipped distributed topology, and it keeps a deliberately small HA surface:
    • fixed three meta-root replicas
    • no dynamic metadata membership
    • follower convergence uses watch-first tailing with refresh/reload fallback
  • The deployment is still maturing operationally:
    • use it for control-plane research and failure-domain experiments
    • do not treat it as a hardened production path yet
    • failure/recovery E2E tests and eunomia benchmarks still need to be expanded before stronger claims are made
  • Scheduler policy is intentionally small (leader transfer focused).
  • No advanced placement constraints yet.

These are deliberate scope limits for a fast-moving experimental platform that keeps the rooted truth surface small.

Rooted Truth — meta/root

The meta/root/ tree implements NoKV’s rooted truth kernel: a typed, append-only event log whose committed tail is the single source of truth for cluster-level metadata (coordinator leases, allocator fences, region lifecycle, pending peer/range changes).

If the distributed system has a “brain”, it does not live in the coordinator. It lives here. The coordinator is a service+view on top of this log.


1. Why a separate truth layer

In a typical multi-raft system, the metadata used by the control plane (routes, TSO, leases, scheduling decisions) is either:

  1. Stored inside one of the raft groups (mixed with user data)
  2. Owned by a single coordinator node (coordinator becomes the bottleneck)
  3. Split across ad-hoc persistence files

NoKV makes it explicit: there is a small, typed metadata log with its own durability and replication shape, and everything control-plane-related goes through it. This matches the “virtual consensus” pattern (Delos-lite): the log is the truth; services above are views that can be rebuilt.

The benefits are concrete:

  • Coordinator is stateless at restart — the only persistent thing about a coordinator is its configured holder ID; everything else is rebuilt from meta/root on boot
  • The log can be swapped between local (single-node) and replicated (embedded-raft) backends without changing coordinator code
  • Authority handoff is auditable — every tenure issue / legacy seal / handover event is a committed log record with a cursor

2. Package layout

meta/
├── root/
│   ├── protocol/       # Pure protocol types (Cursor, Frontiers, Handoff, Witness, ...)
│   ├── event/          # Typed events (KindStoreJoined, KindTenure, ...)
│   ├── state/          # Compact applied state (State, Snapshot, ApplyEventToSnapshot)
│   ├── materialize/    # Helpers that build Snapshot from raw events
│   ├── storage/        # Virtual log file layout + checkpoint format
│   │   └── file/       # Actual on-disk file operations
│   ├── backend/
│   │   ├── local/      # Single-node file-backed log
│   │   └── replicated/ # Embedded raft-backed log (quorum durability)
│   └── remote/         # gRPC service + client for remote rooted access
└── wire/               # proto <-> Go conversions (Event, Snapshot, Cursor)

3. What lives in rooted state

meta/root/state/state.go defines State, the applied snapshot. Everything the control plane cares about is here:

type State struct {
    ClusterEpoch    uint64   // bumped on topology event
    MembershipEpoch uint64   // bumped on store join/leave
    LastCommitted   Cursor   // highest committed (term, index)
    IDFence         uint64   // globally fenced ID allocator floor
    TSOFence        uint64   // globally fenced TSO allocator floor
    Tenure          Tenure
    Legacy          Legacy
    Handover        Handover
}

Snapshot wraps State together with descriptors and pending peer/range changes:

type Snapshot struct {
    State               State
    Descriptors         map[uint64]descriptor.Descriptor
    PendingPeerChanges  map[uint64]PendingPeerChange
    PendingRangeChanges map[uint64]PendingRangeChange
}

Every event kind has a deterministic effect on Snapshot. See meta/root/state/snapshot_apply.go and meta/root/state/state.go:ApplyEventToState.


4. The Append protocol

The core interface any backend must satisfy:

type Backend interface {
    Snapshot() (rootstate.Snapshot, error)
    Append(ctx context.Context, events ...rootevent.Event) (rootstate.CommitInfo, error)
    FenceAllocator(ctx context.Context, kind AllocatorKind, min uint64) (uint64, error)
}

Append does five things atomically:

  1. Validate events against the current Snapshot (reject duplicate region IDs, invalid transitions, stale epochs)
  2. Assign each event a committed cursor (Term, Index)
  3. Persist the batch to the backing log
  4. Persist an updated compact Checkpoint
  5. Advance in-memory State + Descriptors + pending maps

After Append returns successfully, callers can observe the new state via Snapshot(). The CommitInfo.Cursor they get back is the globally ordered cursor for the last event in the batch.

FenceAllocator is separate because it’s an authoritative minimum — backends may promote the fence further (e.g., to account for outstanding windows) but must never return a value below min.
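A minimal sketch of driving this contract, assuming an illustrative rootevent.NewStoreJoined constructor (the real constructors live in meta/root/event) and using only the Backend, CommitInfo.Cursor, and Snapshot.State shapes described above (imports omitted, as in the other snippets on this page):

func registerStore(ctx context.Context, b Backend, storeID uint64) error {
    // Append validates against the current Snapshot, persists the batch and
    // checkpoint, and advances in-memory state before returning.
    info, err := b.Append(ctx, rootevent.NewStoreJoined(storeID))
    if err != nil {
        return err // rejected (duplicate ID, stale epoch, ...) or not durable
    }
    // info.Cursor is the globally ordered (term, index) of the last event in the batch.
    log.Printf("store %d committed at cursor %+v", storeID, info.Cursor)

    // Snapshot() now reflects everything up to that cursor.
    snap, err := b.Snapshot()
    if err != nil {
        return err
    }
    log.Printf("membership epoch is now %d", snap.State.MembershipEpoch)
    return nil
}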


5. Single backend: replicated

NoKV ships one meta-root backend: the 3-peer raft-replicated cluster. The historical single-process “local” backend has been removed.

meta/root/replicated/store.go — embedded raft library, quorum-durable commits.

  • exactly 3 replicas, one leader
  • Append proposes a raft log entry; returns after it’s committed to quorum
  • Non-leader nodes reject Append with codes.FailedPrecondition
  • Leader changes trigger IsLeader() / LeaderID() state updates that coordinator consumes
  • On-disk state per peer: root.events.wal, root.checkpoint.binpb, root.raft.bin (raft hard state + snapshot + retained entries)

6. Coordinator commands — how tenure/legacy/handover flow in

In addition to “raw” events, backends expose command APIs for control-plane-specific operations:

ApplyTenure(ctx, cmd TenureCommand) (EunomiaState, error)

ApplyHandover(ctx, cmd HandoverCommand) (EunomiaState, error)

These are validated, typed writes that internally:

  1. Validate the command against current state (e.g., Seal requires an active Tenure, Confirm requires prior Legacy)
  2. Emit the appropriate KindTenure / KindLegacy / KindHandover event
  3. Append through the normal log path
  4. Return the new EunomiaState = { Tenure, Legacy, Handover }

Command-level validation lives in meta/root/state/eunomia.go.


7. Tail subscription — how coordinator consumes

Coordinators don’t poll Snapshot() — they subscribe:

sub := rootstorage.NewTailSubscription(afterToken, waitFn)
advance, err := sub.Next(ctx, fallback)
if err != nil {
    return err
}
if advance.Action == rootstorage.TailCatchUpAction_Reload {
    // backend advanced far; reload snapshot
} else if advance.Action == rootstorage.TailCatchUpAction_Bootstrap {
    // backend advanced past our retention window; install from compact state
}
sub.Acknowledge(advance)

meta/root/storage/virtual_log.go defines:

  • TailToken — opaque position in the log
  • TailAdvance — either new events, a reload signal, or a bootstrap install
  • TailSubscription — stateful iterator that survives across reloads

This is what lets coordinator/ run as a thin service without duplicating rooted storage.


8. Recovery model

On coordinator boot:

  1. Open the replicated backend (each peer via rootreplicated.Open, or connect through coordinator/rootview for the client side)
  2. Call Snapshot() — backend replays/bootstraps internally
  3. Build a TailSubscription from the snapshot’s LastCommitted
  4. Start the tenure campaign loop, which will eventually ApplyTenure(Issue) when it’s leader

If the backend file is corrupted, the coordinator fails fast — it does not try to reconstruct rooted state from raftstore local metadata. The two are deliberately partitioned.


9. Remote access

meta/root/remote/ provides a gRPC service + client. This exists so a raftstore store can read rooted state without being colocated with the replicated backend:

  • RemoteRootService serves Snapshot, Append, WaitForTail, ObserveTail, etc.
  • RemoteRootClient implements the same rootBackend interface by calling over gRPC
  • Leader redirect is automatic: if the target returns NotLeader, client re-dials the returned leader

This is what keeps coordinator/ deployable separately from the rooted log, if you ever want to.


10. Source map

File | Responsibility
meta/root/protocol/types.go | Pure protocol types (no persistence logic)
meta/root/event/types.go | Typed event constructors
meta/root/state/state.go | State, Snapshot, ApplyEventToSnapshot
meta/root/state/eunomia.go | Tenure/Legacy/Handover validation + digest
meta/root/state/transition.go | Cross-event transition rules
meta/root/storage/virtual_log.go | Tail subscription + checkpoint primitives
meta/root/replicated/store.go | The only backend: 3-peer raft-replicated meta-root
meta/root/server/service.go, meta/root/client/client.go | gRPC service + client for meta-root
meta/wire/root.go | proto ↔ Go conversions

Related docs:

Percolator Distributed Transaction Design

This document explains NoKV’s distributed transaction path implemented by percolator/ and executed through raftstore.

The scope here is the current code path:

  • Prewrite
  • Commit
  • BatchRollback
  • ResolveLock
  • CheckTxnStatus
  • MVCC read visibility (KvGet/KvScan through percolator.Reader)

1. Where It Runs

Percolator logic is executed on the Raft apply path:

  1. Client sends NoKV RPC (KvPrewrite, KvCommit, …).
  2. raftstore/kv/service.go wraps it into a RaftCmdRequest.
  3. Store proposes command through Raft.
  4. On apply, raftstore/kv/apply.go dispatches to percolator.*.

sequenceDiagram
    participant C as raftstore/client
    participant S as kv.Service
    participant R as Raft (leader->followers)
    participant A as kv.Apply
    participant P as percolator
    participant DB as NoKV DB
    C->>S: KvPrewrite/KvCommit...
    S->>R: ProposeCommand(RaftCmdRequest)
    R->>A: Apply committed log
    A->>P: percolator.Prewrite/Commit...
    P->>DB: CFDefault/CFLock/CFWrite reads+writes
    A-->>S: RaftCmdResponse
    S-->>C: NoKV RPC response

Key files:

1.1 RPC to Percolator Function Mapping

NoKV RPC | kv.Apply branch | Percolator function
KvPrewrite | CMD_PREWRITE | Prewrite
KvCommit | CMD_COMMIT | Commit
KvBatchRollback | CMD_BATCH_ROLLBACK | BatchRollback
KvResolveLock | CMD_RESOLVE_LOCK | ResolveLock
KvCheckTxnStatus | CMD_CHECK_TXN_STATUS | CheckTxnStatus
KvGet | CMD_GET | Reader.GetLock + Reader.GetValue
KvScan | CMD_SCAN | Reader.GetLock + CFWrite iteration + GetInternalEntry

2. MVCC Data Model

NoKV uses three MVCC column families:

  • CFDefault: stores user values at start_ts
  • CFLock: stores lock metadata at fixed lockColumnTs = MaxUint64
  • CFWrite: stores commit records at commit_ts

2.1 Lock Record

percolator.Lock (encoded by EncodeLock):

  • Primary
  • Ts (start timestamp)
  • TTL
  • Kind (Put/Delete/Lock)
  • MinCommitTs

2.2 Write Record

percolator.Write (encoded by EncodeWrite):

  • Kind
  • StartTs
  • ShortValue (codec supports it; current commit path does not populate it)

3. Concurrency Control: Latches

Before mutating keys, percolator acquires striped latches:

  • latch.Manager hashes keys to stripe mutexes.
  • Stripes are deduplicated and acquired in sorted order to avoid deadlocks.
  • Guard releases in reverse order.

In raftstore/kv, latches are passed explicitly:

  • NewEntryApplier creates one latch.NewManager(512) and reuses it.
  • Apply / NewApplier accept an injected manager; nil falls back to latch.NewManager(512).

This serializes conflicting apply operations on overlapping keys in one node.
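To make the stripe discipline concrete, here is a self-contained toy version of the idea (it is not the real latch.Manager API): keys hash to a fixed pool of mutexes, stripes are deduplicated and locked in ascending order, and the returned release function unlocks in reverse.

package latchsketch

import (
	"hash/fnv"
	"sort"
	"sync"
)

type Manager struct {
	stripes []sync.Mutex
}

func NewManager(n int) *Manager {
	return &Manager{stripes: make([]sync.Mutex, n)}
}

func (m *Manager) stripeOf(key []byte) int {
	h := fnv.New32a()
	h.Write(key)
	return int(h.Sum32()) % len(m.stripes)
}

// Acquire locks every stripe covering keys and returns the matching release func.
func (m *Manager) Acquire(keys [][]byte) (release func()) {
	seen := make(map[int]struct{})
	ids := make([]int, 0, len(keys))
	for _, k := range keys {
		id := m.stripeOf(k)
		if _, dup := seen[id]; dup {
			continue // dedupe: two keys on the same stripe share one lock
		}
		seen[id] = struct{}{}
		ids = append(ids, id)
	}
	sort.Ints(ids) // one global order prevents lock-order deadlocks between callers
	for _, id := range ids {
		m.stripes[id].Lock()
	}
	return func() {
		for i := len(ids) - 1; i >= 0; i-- { // release in reverse acquisition order
			m.stripes[ids[i]].Unlock()
		}
	}
}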


4. Two-Phase Commit Flow

Client side (raftstore/client.Client.TwoPhaseCommit):

  1. Group mutations by region.
  2. Prewrite primary region.
  3. Prewrite secondary regions.
  4. Commit primary region.
  5. Commit secondary regions.

sequenceDiagram
    participant Cli as Client
    participant R1 as Region(primary)
    participant R2 as Region(secondary)
    Cli->>R1: Prewrite(primary + local muts)
    Cli->>R2: Prewrite(secondary muts)
    Cli->>R1: Commit(keys,startTs,commitTs)
    Cli->>R2: Commit(keys,startTs,commitTs)

5. Write-Side Operations

5.1 Prewrite

Prewrite runs mutation-by-mutation:

  1. Check existing lock on key:
    • if lock exists with different Ts -> KeyError.Locked
  2. Check latest committed write:
    • if commit_ts >= req.start_version -> WriteConflict
  3. Apply data intent:
    • Put: write value into CFDefault at start_ts
    • Delete/Lock: delete default value at start_ts (if exists)
  4. Write lock into CFLock at lockColumnTs

5.2 Commit

For each key:

  1. Read lock
  2. If no lock:
    • if write with same start_ts exists -> idempotent success
    • else -> abort (lock not found)
  3. If lock Ts != start_version -> KeyError.Locked
  4. commitKey:
    • if min_commit_ts > commit_version -> CommitTsExpired
    • if write with same start_ts already exists:
      • rollback write -> abort
      • write with a different commit ts -> treat as success, clean up the lock
      • same commit ts -> success
    • else write CFWrite[key@commit_ts] = {kind,start_ts}
    • remove lock from CFLock

5.3 BatchRollback

For each key:

  1. If already has write at start_ts:
    • rollback marker already exists -> success
    • non-rollback write exists -> success (already committed)
  2. Remove lock (if any)
  3. Remove default value at start_ts (if any)
  4. Write rollback marker to CFWrite at start_ts

5.4 ResolveLock

  • commit_version == 0 -> rollback matching locks
  • commit_version > 0 -> commit matching locks
  • Returns number of resolved keys

6. Transaction Status Check

CheckTxnStatus targets the primary key and decides whether txn is alive, committed, or should be rolled back.

Decision order:

  1. Read lock on primary
  2. If lock exists but lock.ts != req.lock_ts -> KeyError.Locked
  3. If lock exists and TTL expired (current_ts >= lock.ts + ttl):
    • rollback primary
    • action = TTLExpireRollback
  4. If lock exists and caller pushes timestamp:
    • min_commit_ts = max(min_commit_ts, caller_start_ts+1)
    • action = MinCommitTsPushed
  5. If no lock, check write by start_ts:
    • committed write -> return commit_version
    • rollback write -> action LockNotExistRollback
  6. If no lock and no write, and rollback_if_not_exist is true:
    • write rollback marker
    • action LockNotExistRollback

7. Read Path Semantics (MVCC Visibility)

KvGet and KvScan read through percolator.Reader:

  1. Check lock first:
    • if lock exists and read_ts >= lock.ts, return locked error
  2. Find visible write in CFWrite:
    • latest commit_ts <= read_ts
  3. Interpret write kind:
    • Delete/Rollback => not found
    • Put => read value from CFDefault at start_ts

Notes:

  • KvScan currently rejects reverse scan.
  • scanWrites uses internal iterator over CFWrite.
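The same visibility rules in a self-contained toy model (Go maps stand in for the three column families, imports omitted; the real path goes through percolator.Reader):

type WriteKind int

const (
	WritePut WriteKind = iota
	WriteDelete
	WriteRollback
)

type Write struct {
	Kind    WriteKind
	StartTs uint64
}

type Lock struct{ Ts uint64 }

type toyStore struct {
	locks    map[string]Lock              // CFLock: at most one lock per key
	writes   map[string]map[uint64]Write  // CFWrite: commit_ts -> write record
	defaults map[string]map[uint64][]byte // CFDefault: start_ts -> value
}

var ErrKeyLocked = errors.New("key is locked")

// Get follows the documented order: lock check, newest visible write, CFDefault fetch.
func (s *toyStore) Get(key string, readTs uint64) ([]byte, bool, error) {
	// 1. A lock with lock.ts <= read_ts blocks the read.
	if l, ok := s.locks[key]; ok && readTs >= l.Ts {
		return nil, false, ErrKeyLocked
	}
	// 2. The newest commit with commit_ts <= read_ts wins.
	var (
		bestTs uint64
		best   Write
		found  bool
	)
	for commitTs, w := range s.writes[key] {
		if commitTs <= readTs && (!found || commitTs > bestTs) {
			bestTs, best, found = commitTs, w, true
		}
	}
	if !found {
		return nil, false, nil
	}
	// 3. Delete / Rollback mean "not found"; Put points back to CFDefault at start_ts.
	if best.Kind != WritePut {
		return nil, false, nil
	}
	val, ok := s.defaults[key][best.StartTs]
	return val, ok, nil
}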

8. Error and Idempotency Behavior

Operation | Idempotency / Conflict behavior
Prewrite | Rejects lock conflicts and write conflicts; returns per-key KeyError list.
Commit | Idempotent for already committed keys with same start_ts; stale/missing lock may abort.
BatchRollback | Safe to repeat; rollback marker prevents duplicate side effects.
ResolveLock | Safe to retry per key set; resolves only matching start_ts locks.
CheckTxnStatus | May push min_commit_ts, rollback expired primary lock, or return committed version.

9. Current Operational Boundaries

  • Percolator execution is tied to NoKV RPC + Raft apply path, with the command shape still following the TinyKV/TiKV MVCC model.
  • Latch scope is process-local when one store shares a single latch.Manager; region correctness still comes from Raft ordering.
  • Write.ShortValue and Write.ExpiresAt are codec fields; current commit path stores primary value bytes in CFDefault and reads from there when short value is not present.

10. Validation and Tests

Primary coverage:

These tests cover 2PC happy path, lock conflicts, status checks, resolve/rollback behavior, and client region-aware retries.

FSMetadata

TL;DR

  • Topic: NoKV’s namespace metadata substrate.
  • Core objects: Mount, Inode, Dentry, SubtreeAuthority, SnapshotEpoch, QuotaFence, UsageCounter.
  • Call chain: fsmeta/client -> fsmeta/server -> fsmeta/exec -> TxnRunner -> raftstore/percolator/coordinator.
  • Code contract: wire is in pb/fsmeta/fsmeta.proto, the executor is in fsmeta/exec, and the default NoKV runtime is fsmeta/exec.OpenWithRaftstore.

1. Conclusion

fsmeta is NoKV’s native metadata service. It isn’t a FUSE frontend, it doesn’t handle object body I/O, and it doesn’t promise full POSIX. What it provides is a metadata substrate that distributed filesystems, object-storage namespaces, and AI dataset metadata can all reuse.

The value of this layer isn’t a particular choice of keys for encoding inodes and dentries. The real boundary is that common namespace operations are exposed as server-side primitives, instead of asking each upper-layer application to stitch a protocol together out of Get / Put / Scan.

2. Current API

The current v1 API is defined by pb/fsmeta/fsmeta.proto. fsmeta/server exposes gRPC; fsmeta/client provides a Go typed client.

RPC | Current semantics
Create | Atomically creates a dentry and inode; the server uses AssertionNotExist to reject duplicate creation.
Lookup | Read a dentry by (mount, parent_inode, name).
ReadDir | Scan one directory page by dentry prefix.
ReadDirPlus | Scan dentries and batch-read inode attrs under the same snapshot version.
WatchSubtree | A prefix-scoped change feed; supports ready, ack, back-pressure, and cursor replay.
SnapshotSubtree | Publishes a stable MVCC read version; subsequent ReadDir / ReadDirPlus can use it to read the snapshot.
RetireSnapshotSubtree | Proactively retire a snapshot epoch.
GetQuotaUsage | Read the persistent quota usage counter for a mount/scope.
RenameSubtree | Atomically move the root dentry of a subtree; descendants follow naturally via inode references.
Link | Create a second dentry for an existing non-directory inode and increment link count in the same transaction.
Unlink | Delete a dentry; decrement link count and delete the inode record when the last link is removed.

3. Data model

fsmeta’s key schema is in fsmeta/keys.go; value schema is in fsmeta/value.go.

Object | Storage location | Notes
Mount metadata key | EncodeMountKey | Reserved mount-level data key; mount lifecycle truth does not live here.
Inode | EncodeInodeKey(mount, inode) | File/directory attributes, including size, mode, link_count.
Dentry | EncodeDentryKey(mount, parent, name) | Mapping from parent/name to inode.
Chunk | EncodeChunkKey(mount, inode, chunk) | Schema is in place; the current fsmeta API doesn’t expose object body / chunk I/O.
Session | EncodeSessionKey(mount, session) | Schema reserved for later session/lease use.
Usage | EncodeUsageKey(mount, scope) | Quota usage counter; scope=0 means mount-wide, non-zero means a direct accounting scope.

Both keys and values carry a magic + schema version. Values use a hand-written binary layout — not JSON.

4. Execution boundary

fsmeta/exec.Executor depends on a single narrow interface:

type TxnRunner interface {
    ReserveTimestamp(ctx context.Context, count uint64) (uint64, error)
    Get(ctx context.Context, key []byte, version uint64) ([]byte, bool, error)
    BatchGet(ctx context.Context, keys [][]byte, version uint64) (map[string][]byte, error)
    Scan(ctx context.Context, startKey []byte, limit uint32, version uint64) ([]KV, error)
    Mutate(ctx context.Context, primary []byte, mutations []*kvrpcpb.Mutation, startVersion, commitVersion, lockTTL uint64) error
}

The default runtime uses OpenWithRaftstore to wire up coordinator, raftstore client, TSO, watch source, mount/quota cache, snapshot publisher, and subtree handoff publisher. Embedded users can use this entry point directly; tests and custom deployments can keep passing in their own TxnRunner.
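As a sketch of how a metadata write can be expressed against that narrow surface: the key encoders EncodeDentryKey / EncodeInodeKey come from fsmeta/keys.go as documented below, the kvrpcpb.Mutation fields follow the TinyKV/TiKV shape, and encodeDentryValue, ErrExists, and defaultLockTTL are hypothetical names. The production Create path also attaches AssertionNotExist instead of the separate read shown here.

func createEntry(ctx context.Context, txn TxnRunner, mount, parent uint64, name string, inode uint64, attrs []byte) error {
	// Reserve two timestamps in one TSO round trip, assuming the returned
	// value is the first of a contiguous window: start and commit version.
	start, err := txn.ReserveTimestamp(ctx, 2)
	if err != nil {
		return err
	}
	commit := start + 1

	dentryKey := EncodeDentryKey(mount, parent, name)
	inodeKey := EncodeInodeKey(mount, inode)

	// Illustrative duplicate-creation check (the real server uses AssertionNotExist).
	if _, exists, err := txn.Get(ctx, dentryKey, start); err != nil {
		return err
	} else if exists {
		return ErrExists
	}

	// One Percolator transaction: the dentry is the primary key, the inode rides along.
	muts := []*kvrpcpb.Mutation{
		{Op: kvrpcpb.Op_Put, Key: dentryKey, Value: encodeDentryValue(inode)},
		{Op: kvrpcpb.Op_Put, Key: inodeKey, Value: attrs},
	}
	return txn.Mutate(ctx, dentryKey, muts, start, commit, defaultLockTTL)
}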

The layering constraints are:

  • Executor does not directly know about raft region / store routing.
  • OpenWithRaftstore is NoKV’s default adapter; it owns the raftstore wiring.
  • meta/root does not store high-frequency inode/dentry data — only lifecycle / authority truth.
  • raftstore and percolator don’t understand fsmeta semantics; they only provide transactions and apply observation.

5. Native primitives

ReadDirPlus

ReadDirPlus is the most direct shape advantage today: one dentry scan plus one BatchGet of inode attrs, all read under the same snapshot version. A generic-KV baseline has to do a point lookup per dentry after the scan, producing an N+1 read pattern.

Strict semantics: if any inode is missing or fails to decode, the whole page returns an error. fsmeta does not return half-true directory pages.
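The shape in code, assuming the documented key encoders, that KV exposes a Value field holding the encoded dentry, and hypothetical decodeInodeID / ErrIncompletePage helpers (the executor's actual implementation differs in details):

func readDirPlus(ctx context.Context, txn TxnRunner, mount, dir uint64, limit uint32, version uint64) ([]KV, map[string][]byte, error) {
	// 1. One range scan over the directory's dentry prefix
	//    (using the empty name as the prefix start key is an assumption here).
	dentries, err := txn.Scan(ctx, EncodeDentryKey(mount, dir, ""), limit, version)
	if err != nil {
		return nil, nil, err
	}
	// 2. One batched inode read at the same MVCC version. The generic-KV
	//    baseline issues one extra Get per dentry instead: the N+1 pattern.
	inodeKeys := make([][]byte, 0, len(dentries))
	for _, d := range dentries {
		inodeKeys = append(inodeKeys, EncodeInodeKey(mount, decodeInodeID(d.Value)))
	}
	attrs, err := txn.BatchGet(ctx, inodeKeys, version)
	if err != nil {
		return nil, nil, err
	}
	// Strict semantics: a missing inode fails the whole page rather than
	// returning a half-true directory listing.
	if len(attrs) != len(inodeKeys) {
		return nil, nil, ErrIncompletePage
	}
	return dentries, attrs, nil
}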

WatchSubtree

WatchSubtree subscribes to an fsmeta key prefix and externally exposes a (region_id, term, index) cursor and a commit_version. Event sources include:

  • a successful CMD_COMMIT;
  • CMD_RESOLVE_LOCK with commit_version != 0.

CMD_PREWRITE, rollback, and diagnostic commands do not produce visible events.

v1 already supports:

  • ready signal;
  • back-pressure window;
  • ack;
  • per-region recent ring;
  • resume cursor replay;
  • ErrWatchCursorExpired when a cursor has expired.

SnapshotSubtree

SnapshotSubtree only publishes a read epoch — it does not copy the directory tree. The token shape is (mount, root_inode, read_version). Subsequent ReadDir / ReadDirPlus use snapshot_version to read the same MVCC view.

SnapshotEpochPublished / SnapshotEpochRetired rooted events are already in place. The data-plane MVCC GC does not yet use these epochs as a retention lower bound — that’s the next piece of work for the GC layer.

RenameSubtree

The current dentry schema references the parent directory via parent_inode_id, so a subtree rename’s physical write volume is the same as a regular rename: delete the old root dentry, write the new root dentry. Descendants don’t have to be rewritten one by one.

The extra semantics live in the authority layer:

  • publish SubtreeHandoffStarted before mutation;
  • the dentry mutation goes through Percolator 2PC;
  • publish SubtreeHandoffCompleted after mutation;
  • the runtime monitor uses WatchRootEvents to discover pending handoffs and complete them.

This design prioritizes guaranteeing that rooted authority can never get stuck in an unknown state forever. In extreme cases it may advance an empty era — but it will not leave behind an orphaned pending handoff that no one will repair.

Link / Unlink

Link is allowed only for non-directory inodes. It creates a new dentry and increments InodeRecord.LinkCount in the same transaction.

Unlink deletes one dentry and updates the inode based on link count:

  • link count > 1: decrement and write back the inode;
  • link count ≤ 1: delete the inode record.

Hard links to directories remain illegal.

Quota Fence

The quota fence is rooted truth; the usage counter is a data-plane key. The write path packs the usage counter mutation and the dentry/inode mutation into the same Percolator transaction.

This solves two problems:

  • multiple nokv-fsmeta gateways won’t each maintain a local counter and breach the limit;
  • a gateway restart won’t lose usage.

Fence changes are pushed to the fsmeta runtime via the coordinator root-event stream; on cache miss, the runtime falls back to querying the coordinator.

6. Rooted truth vs runtime view

Domain | Rooted truth | Runtime view
Mount | MountRegistered / MountRetired | fsmeta mount admission cache; a retired mount closes related watch subscriptions.
Subtree authority | SubtreeAuthorityDeclared / SubtreeHandoffStarted / SubtreeHandoffCompleted | RenameSubtree frontier, pending handoff repair.
Snapshot epoch | SnapshotEpochPublished / SnapshotEpochRetired | snapshot-version reads.
Quota fence | QuotaFenceUpdated | quota fence cache + persisted usage counter keys.
WatchSubtree | Not in meta/root | raftstore apply observer + fsmeta router.

On startup, nokv-fsmeta first pulls ListMounts / ListQuotaFences / ListSubtreeAuthorities for bootstrap, then follows subsequent changes via WatchRootEvents. MonitorInterval is the reconnect backoff after the root-event stream drops — not a steady-state polling interval.

7. Deployment

Docker Compose brings up meta-root, coordinator, raftstore, and fsmeta gateway, and registers a default mount via mount-init:

docker compose up -d

Start the fsmeta gateway directly:

go run ./cmd/nokv-fsmeta \
  --addr 127.0.0.1:8090 \
  --coordinator-addr 127.0.0.1:2390,127.0.0.1:2391,127.0.0.1:2392 \
  --metrics-addr 127.0.0.1:9400

Register a mount:

nokv mount register \
  --coordinator-addr 127.0.0.1:2390,127.0.0.1:2391,127.0.0.1:2392 \
  --mount default \
  --root-inode 1 \
  --schema-version 1

Set quota:

nokv quota set \
  --coordinator-addr 127.0.0.1:2390,127.0.0.1:2391,127.0.0.1:2392 \
  --mount default \
  --limit-bytes 10737418240 \
  --limit-inodes 10000000

8. Metrics

nokv-fsmeta --metrics-addr exposes four expvar groups:

Namespace | Meaning
nokv_fsmeta_executor | transaction retry / retry exhausted.
nokv_fsmeta_watch | subscribers, events, delivered, dropped, overflow, remote source state.
nokv_fsmeta_mount | mount cache hit/miss, admission rejects.
nokv_fsmeta_quota | fence check/reject, cache hit/miss, fence updates, usage mutations.

9. Benchmarks

The fsmeta benchmark lives in benchmark/fsmeta. The core comparison is two paths against the same NoKV cluster:

Driver | Behavior
native-fsmeta | Calls the fsmeta typed API.
generic-kv | Uses the same raftstore/percolator substrate but stitches the metadata schema on the client.

Stage 1 headline: ReadDirPlus average latency 12.0 ms vs 510.3 ms — about 42.5×. Result CSVs are in benchmark/fsmeta/results/.

The WatchSubtree evidence workload lives in the same benchmark package; watch_notify reaches sub-second p95 on a Docker Compose 3-node cluster.

10. Non-goals

  • No FUSE / NFS / SMB frontend.
  • No S3 HTTP gateway or object body I/O.
  • No write of every inode/dentry mutation into meta/root.
  • No recursive materialized snapshot — SnapshotSubtree is an MVCC read epoch.
  • No claim that data-plane MVCC GC already retains by snapshot epoch.

From Standalone to Cluster

NoKV does not treat standalone and distributed mode as two separate products glued together later. The migration path promotes an existing standalone workdir into a distributed seed, then expands that seed into a replicated multi-Raft region without swapping out the storage core.

Read this page if the question that interests you most is not “how fast is NoKV?” but “how does an existing standalone engine become a distributed system without switching data planes?”.

Why This Feature Matters

Most systems make you choose early:

  • start with an embedded engine, then migrate to a different distributed database later
  • start distributed from day one, even when the workload is still small
  • accept dump/import-style migration as an operational tax

NoKV is aiming at a different story:

  1. Start with a serious standalone engine.
  2. Keep the same workdir and the same storage layer as data grows.
  3. Promote that workdir into a distributed seed explicitly.
  4. Expand the seed into a replicated cluster through normal raftstore mechanisms.

That is the core value of this feature. The goal is not “add one more migration command”. The goal is to make standalone and distributed shapes feel like one system with one data plane.

What Is Shipping Now

The current migration design is intentionally conservative:

  • offline only
  • one-shot bootstrap
  • no dual-write cutover
  • no background auto-repair
  • no automatic split or rebalance during promotion
  • one full-range seed region for the initial upgrade

That scope is deliberate. The first version is trying to make the protocol explicit, recoverable, and testable before chasing more automation.

The Core Contract

The migration path must preserve these invariants:

  1. Standalone writes stop before migration starts.
  2. The migrated workdir must not silently reopen as a normal standalone directory.
  3. Bootstrap is the only allowed non-apply path that creates the initial region truth for the promoted directory.
  4. Engine manifest stays storage-engine metadata only.
  5. Store-local region truth stays in raftstore/localmeta.
  6. Coordinator does not create local truth during bootstrap.

Those invariants are what make the feature defensible. Without them, “standalone to cluster” collapses into ad hoc tooling.

The Minimal Operator Flow

flowchart LR
    A["Standalone workdir"] --> B["nokv migrate plan"]
    B --> C["nokv migrate init"]
    C --> D["Seeded workdir"]
    D --> E["nokv serve --coordinator-addr ..."]
    E --> F["Single-store cluster seed"]
    F --> G["nokv migrate expand"]
    G --> H["Replicated cluster"]

Happy-path CLI

# 1. Inspect a standalone workdir
go run ./cmd/nokv migrate plan --workdir ./data/store-1

# 2. Promote it into a single-store seed
go run ./cmd/nokv migrate init \
  --workdir ./data/store-1 \
  --store 1 \
  --region 1 \
  --peer 101

# 3. Start the promoted workdir in distributed mode
go run ./cmd/nokv serve \
  --workdir ./data/store-1 \
  --store-id 1 \
  --coordinator-addr 127.0.0.1:2379

# 4. Expand the seed into more replicas
go run ./cmd/nokv migrate expand \
  --addr 127.0.0.1:20170 \
  --region 1 \
  --target 2:201@127.0.0.1:20171 \
  --target 3:301@127.0.0.1:20172

Local operator wrapper

For local demos and operator workflows, the happy path is wrapped by:

  • scripts/ops/migrate-cluster.sh

That wrapper intentionally delegates state transitions to the formal migration CLI instead of inventing a second control path. It now prints stage banners, shows the local migration status after promotion, and writes final migration reports under:

  • artifacts/migration/summary.txt
  • artifacts/migration/summary.json

For local inspection and automation, the CLI also exposes:

  • nokv migrate status --workdir ...
  • nokv migrate report --workdir ...
  • nokv migrate status --workdir ... --addr <leader-admin> --region <region>
  • nokv migrate report --workdir ... --addr <leader-admin> --region <region>

When --addr is set, the report includes a cluster-aware runtime view:

  • current leader peer
  • current leader store
  • whether the local store is hosted
  • current membership size
  • membership peer list
  • applied index / term on that store

Lifecycle States

stateDiagram-v2
    [*] --> standalone
    standalone --> preparing: migrate init
    preparing --> seeded: catalog + seed snapshot + raft seed
    seeded --> cluster: first successful serve
    cluster --> cluster: expand / transfer-leader / remove-peer

Workdir modes

The promoted workdir exposes an explicit mode contract:

  • standalone
  • preparing
  • seeded
  • cluster

Minimal persisted state looks like this:

{
  "mode": "seeded",
  "store_id": 1,
  "region_id": 1,
  "peer_id": 101
}

Semantics

  • standalone
    • regular standalone engine directory
  • preparing
    • migration is in progress; ordinary standalone open must refuse service
  • seeded
    • the standalone workdir has been promoted into a valid single-store cluster seed
  • cluster
    • the workdir is now operating through the distributed runtime

That mode gate is one of the most important engineering choices in the feature. It prevents a half-migrated directory from being silently treated as a normal local database.

Local promotion checkpoint

During migrate init, NoKV now persists a lightweight local checkpoint alongside the mode file. That checkpoint records which milestone of the standalone-to-seed promotion has already completed, for example:

  • mode-preparing-written
  • local-catalog-persisted
  • seed-snapshot-exported
  • raft-seed-initialized
  • seeded-finalized

nokv migrate status and nokv migrate report surface this checkpoint together with a resume_hint, so interrupted promotion is no longer just “stuck in preparing” without context.

When nokv migrate expand, nokv migrate transfer-leader, or nokv migrate remove-peer is invoked with --workdir, the same checkpoint file is also updated with operator progress:

  • which target store/peer is currently being rolled out
  • how many targets have already completed hosting
  • which leader transfer or peer removal step just completed
  • what migration command should be retried next after an interruption

Post-step validation

Migration commands now also validate the state they just created instead of trusting only the immediate RPC return:

  • migrate init verifies the local catalog, seed snapshot manifest, and local raft pointer all agree on the promoted seed region
  • migrate expand verifies leader membership and target hosted state agree on the new peer
  • migrate transfer-leader verifies the elected leader and target runtime converge on the requested peer
  • migrate remove-peer verifies leader metadata and target runtime both stop advertising the removed peer

What Actually Happens

The migration path is easier to reason about if you separate the steps by ownership.

1. plan: preflight only

nokv migrate plan is read-only. It checks:

  • the standalone workdir is readable
  • recovery state is coherent enough to promote
  • the directory is not already seeded or clustered
  • there is no conflicting local peer catalog
  • the directory is not already poisoned by an incompatible state

No mutation happens here.

2. init: promote the workdir into a seed

nokv migrate init performs the actual promotion.

At a high level it does this:

  1. write mode = preparing
  2. persist a full-range local RegionMeta in raftstore/localmeta
  3. export one full-range SST seed snapshot from the standalone DB
  4. synthesize the initial raft durable state for a single local voter
  5. persist the local raft replay pointer
  6. write mode = seeded

3. serve: reuse the normal distributed startup path

After init, the promoted directory is not started through a special bootstrap runtime. It is opened through normal distributed startup:

  • open the same DB workdir
  • load local recovery metadata from raftstore/localmeta
  • open group-local raft durable state
  • start one local peer
  • serve distributed traffic through the regular raftstore path

That is what keeps the feature architecturally honest.

4. expand: reuse normal replication mechanisms

Once the seed is healthy, normal distributed mechanisms take over:

  1. start empty target stores
  2. call nokv migrate expand against the current leader
  3. leader exports one SST snapshot stream
  4. target imports the streamed snapshot on an empty peer
  5. leader publishes the new membership
  6. wait until the target store reports the peer as hosted

The current implementation keeps rollout explicit and sequential:

  • repeated --target <store>:<peer>[@addr]
  • explicit transfer-leader
  • explicit remove-peer
  • no automatic split or rebalance in the promotion phase

Snapshot Semantics

This feature matters because migration is not implemented as a file dump or a second storage engine.

Current snapshot path

Today NoKV’s migration path uses one SST region snapshot primitive:

  • source side exports one region-scoped external SST snapshot
  • snapshot files are bundled into a transport-safe payload
  • target side imports that payload through the external SST install path
  • internal raft snapshot transport carries the same SST payload when a peer needs snapshot-based catch-up instead of regular log replication

This is a correctness-first choice.

Why that choice is reasonable right now

It keeps the promotion and replication contract simple enough to validate:

  • region-scoped
  • explicit snapshot metadata and checksum
  • install-before-publish boundaries
  • retry and restart semantics that are testable

What this is not trying to be yet

The current path is not pretending to be:

  • zero-copy table transfer
  • SST ingest-based install
  • online rebalance for already sharded data

Those are later upgrades, not prerequisites for the first credible promotion path.

Why Full-Range Seed First

The first promotion step intentionally creates one full-range seed region:

  • StartKey = nil
  • EndKey = nil

That means the migration feature does not need to solve split and rebalance as part of the first upgrade.

This is a good tradeoff.

The first hard problem is not “how to repartition existing data automatically”. The first hard problem is “how to make one existing standalone directory become a valid distributed region with explicit lifecycle and recovery semantics”.

Once that is solid, later split and rebalance work can build on top of a stable seed region instead of being entangled with promotion itself.

Failure Semantics

The migration flow must fail loudly and remain recoverable.

During plan

  • read-only errors are returned directly

During init

If any step after preparing fails:

  • keep mode as preparing
  • reject ordinary standalone startup
  • require explicit operator action:
    • rerun init, or
    • use a future rollback or repair path

This is intentionally strict. Half-migrated state must not quietly behave like a normal standalone database.

During expand

Expansion should not be treated as “best effort copy”. It should remain observable and explicit:

  • leader publication must become visible
  • target install must become hosted
  • interruption before publish must not leave a partially hosted peer
  • restart after install must converge without ghost peers

That is why the current test matrix includes:

  • snapshot interruption before publish
  • restarted follower recovery
  • removed-peer restart
  • repeated transport flap during membership changes
  • Coordinator outage after startup

Test and Validation Story

This feature only counts if it is validated as a lifecycle, not just as a command list.

The current validation story spans:

  • raftstore/migrate/*_test.go
    • migration orchestration and timeout behavior
  • raftstore/admin/service_test.go
    • admin execution surfaces for snapshot export/install
  • raftstore/store/peer_lifecycle_test.go
    • install and publish semantics on the target side
  • raftstore/integration/migration_flow_test.go
    • standalone → seed → cluster happy path
  • raftstore/integration/restart_recovery_test.go
    • restart, dehost, and follow-up membership behavior
  • raftstore/integration/snapshot_interruption_test.go
    • install interrupted before publish
  • raftstore/integration/coordinator_degraded_test.go
    • control-plane degradation after startup

That is the right level of seriousness for the feature. Migration is one of the easiest places for a system to look good in a demo and fail under recovery.

Current Scope and Non-goals

The first version is intentionally narrow.

In scope now

  • offline standalone → seed promotion
  • one full-range seed region
  • explicit plan/init/status/expand/remove-peer/transfer-leader
  • SST region snapshot export/install
  • recovery-aware mode gating

Not in scope yet

  • online cutover
  • dual-write
  • automatic split/rebalance during promotion
  • automatic cluster-wide orchestration
  • SST-based snapshot/install as the default path

Keeping the scope narrow is part of what makes the feature credible.

What Comes Next

The next round of work should not reinvent the protocol. It should productize and sharpen it.

Near-term productization

  1. richer cluster-aware migration status once the seed has booted
  2. structured machine-readable reports that can be consumed by demos and automation
  3. stronger stage-by-stage operator output in scripts/ops/migrate-cluster.sh
  4. clearer recovery guidance for preparing, interrupted install, and partial rollout

Next engineering upgrade

The most important technical upgrade after that is:

  • SST-based snapshot/install for promotion and peer rollout

But that should be treated as an upgrade to the existing migration contract, not a replacement for the contract itself.

Long-term direction

Once standalone → full-range seed → replicated region is solid and easy to operate, later work can move on to:

  • more automated orchestration
  • region split after promotion
  • richer rebalance flows
  • larger-scale rollout behavior

The Feature in One Sentence

NoKV starts with a standalone workdir, promotes that workdir into a distributed seed, and expands it into a replicated cluster without replacing the storage core underneath.

Crash Recovery Playbook

This document describes how NoKV restores state after abnormal exit, and which tests validate each recovery contract.


1. Recovery Phases

flowchart TD
    Start[DB.Open]
    Verify[runRecoveryChecks]
    WalOpen[wal.Open]
    LSM[lsm.NewLSM]
    Manifest[manifest replay + table load]
    WALReplay[WAL replay to memtables]
    VLog[valueLog recover]
    Flush[submit immutable flush backlog]
    Stats[stats/start background loops]

    Start --> Verify --> WalOpen --> LSM --> Manifest --> WALReplay --> VLog --> Flush --> Stats

  1. Pre-flight verification: DB.runRecoveryChecks runs manifest.Verify, wal.VerifyDir, and per-bucket vlog.VerifyDir.
  2. WAL manager reopen: wal.Open reopens latest segment and rebuilds counters.
  3. Manifest replay + SST load: levelManager.build replays manifest version and opens SST files.
  4. Strict SST validation: if a manifest SST is missing or unreadable/corrupt, startup fails and manifest state is left unchanged.
  5. WAL replay: lsm.recovery replays post-checkpoint WAL records into memtables.
  6. Flush backlog restore: recovered immutable memtables are resubmitted to the flush queue.
  7. ValueLog recovery: value-log managers reconcile on-disk files with manifest metadata, trim torn tails, and drop stale/orphan segments.
  8. Runtime restart: metrics and periodic workers start again.

2. Failure Scenarios & Tests

Failure Point | Expected Recovery Behaviour | Tests
WAL tail truncated | Replay stops safely at truncated tail, preserving valid prefix records | engine/wal/manager_test.go::TestManagerReplayHandlesTruncate
Crash before memtable flush install | WAL replay restores user data not yet flushed to SST | db_test.go::TestRecoveryWALReplayRestoresData
Manifest references missing SST | Startup fails fast and the manifest entry is preserved (operator must investigate before continuing) | db_test.go::TestRecoveryFailsOnMissingSST
Manifest references corrupt/unreadable SST | Startup fails fast and the manifest entry is preserved (operator must investigate before continuing) | db_test.go::TestRecoveryFailsOnCorruptSST
ValueLog stale segment (manifest marked invalid) | Recovery deletes stale file from disk | db_test.go::TestRecoveryRemovesStaleValueLogSegment
ValueLog orphan segment (disk only) | Recovery deletes orphan file not tracked by manifest | db_test.go::TestRecoveryRemovesOrphanValueLogSegment
Manifest rewrite interrupted | Recovery keeps using CURRENT-selected manifest and data remains readable | db_test.go::TestRecoveryManifestRewriteCrash
ValueLog contains records absent from LSM/WAL | Recovery does not replay vlog as source-of-truth | db_test.go::TestRecoverySkipsValueLogReplay

3. Recovery Tooling

3.1 Targeted tests

go test ./... -run 'Recovery|ReplayHandlesTruncate'

Set RECOVERY_TRACE_METRICS=1 to emit RECOVERY_METRIC ... lines in tests.

3.2 Targeted harness command

RECOVERY_TRACE_METRICS=1 \
go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v

Outputs are saved under artifacts/recovery/.

3.3 CLI checks

  • nokv manifest --workdir <dir>: verify level files, WAL pointer, vlog metadata.
  • nokv stats --workdir <dir>: confirm flush backlog converges.
  • nokv vlog --workdir <dir>: inspect vlog segment state.

4. Operational Signals

Watch these fields during restart:

  • flush.queue_length
  • wal.segment_count
  • value_log.heads
  • value_log.segments
  • value_log.pending_deletes

If flush.queue_length remains high after replay, inspect flush worker throughput and manifest sync settings.


5. Notes on Consistency Model

  • WAL + manifest remain the authoritative recovery chain for LSM state.
  • ValueLog is reconciled/validated but is not replayed as a mutation source.
  • In strict flush mode (ManifestSync=true), SST install ordering is SST Sync -> RenameNoReplace -> SyncDir -> manifest edit.

For deeper internals, see flush.md, manifest.md, and wal.md.

Configuration & Options

NoKV exposes two configuration surfaces:

  1. Runtime options for the embedded engine (Options in options.go).
  2. Cluster topology for distributed mode (raft_config.example.json via config.LoadFile/Validate).

1. Runtime Options (Embedded Engine)

NoKV.NewDefaultOptions() returns a tuned baseline. Override fields before calling NoKV.Open(opt), which now returns (*DB, error).

Key option groups (see options.go for the full list):

  • Paths & durability
    • WorkDir, SyncWrites, ManifestSync, ManifestRewriteThreshold
  • Write pipeline
    • WriteBatchMaxCount, WriteBatchMaxSize, WriteBatchWait
    • MaxBatchCount, MaxBatchSize
    • WriteThrottleMinRate, WriteThrottleMaxRate
  • Value log
    • ValueThreshold, ValueLogFileSize, ValueLogMaxEntries
    • ValueLogGCInterval, ValueLogGCDiscardRatio
    • ValueLogGCParallelism, ValueLogGCReduceScore, ValueLogGCSkipScore
    • ValueLogGCReduceBacklog, ValueLogGCSkipBacklog
    • ValueLogGCSampleSizeRatio, ValueLogGCSampleCountRatio, ValueLogGCSampleFromHead
    • ValueLogBucketCount
  • LSM & compaction
    • MemTableSize, MemTableEngine, SSTableMaxSz, NumCompactors
    • NumLevelZeroTables, LandingCompactBatchSize, LandingBacklogMergeScore
    • CompactionValueWeight, CompactionValueAlertThreshold
  • Caches
    • BlockCacheBytes, IndexCacheBytes
  • Hot key throttling
    • WriteHotKeyLimit
    • ThermosEnabled, ThermosTopK, decay/window settings
    • ThermosNodeCap, ThermosNodeSampleBits, ThermosRotationInterval
  • WAL watchdog
    • EnableWALWatchdog, WALAutoGCInterval
    • WALAutoGCMinRemovable, WALAutoGCMaxBatch
    • WALTypedRecordWarnRatio, WALTypedRecordWarnSegments
  • Raft lag warnings (stats only)
    • RaftLagWarnSegments

Example:

opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./data"
opt.SyncWrites = true
opt.ValueThreshold = 2048
opt.WriteBatchMaxCount = 128
db, err := NoKV.Open(opt)
if err != nil {
	log.Fatalf("open failed: %v", err)
}
defer db.Close()

Notes:

  • NewDefaultOptions() populates concrete compaction/landing defaults up front. Open() resolves constructor-owned defaults once, then the DB and LSM layers consume the resolved values directly.
  • WriteBatchMaxCount, WriteBatchMaxSize, MaxBatchCount, MaxBatchSize, WriteThrottleMinRate, WriteThrottleMaxRate, and WALBufferSize now also expose concrete defaults through NewDefaultOptions(). If you construct Options manually, leaving these fields at zero lets Open() resolve the constructor defaults.
  • Batch knobs are split by owner:
    • WriteBatchMaxCount / WriteBatchMaxSize bound commit-worker request coalescing.
    • MaxBatchCount / MaxBatchSize bound internal apply/rewrite batches such as batchSet and value-log GC rewrites.
  • Write slowdown is bandwidth-driven: WriteThrottleMaxRate applies when slowdown first becomes active, and pressure lowers the target rate toward WriteThrottleMinRate as compaction debt approaches the stop threshold.
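For example, a write-heavy deployment might pin these knobs explicitly. The field names are the ones listed above; the rate units are assumed to be bytes per second here, and options.go is authoritative:

opt := NoKV.NewDefaultOptions()
opt.WriteBatchMaxCount = 256        // commit-worker request coalescing bound
opt.WriteBatchMaxSize = 8 << 20     // bytes per coalesced commit batch
opt.MaxBatchCount = 1024            // internal apply/rewrite batch bound (batchSet, vlog GC)
opt.WriteThrottleMaxRate = 64 << 20 // target rate when slowdown first becomes active
opt.WriteThrottleMinRate = 8 << 20  // floor as compaction debt approaches the stop threshold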

Load Options From TOML

For convenience, you can load engine options from a TOML file. Unspecified fields keep their defaults from NewDefaultOptions.

opt, err := NoKV.LoadOptionsFile("nokv.options.toml")
if err != nil {
    log.Fatal(err)
}
db, err := NoKV.Open(opt)
if err != nil {
	log.Fatalf("open failed: %v", err)
}
defer db.Close()

Example (TOML):

work_dir = "./data"
mem_table_engine = "art"
value_threshold = 2048
write_hot_key_limit = 128
value_log_gc_interval = "30s"

Notes:

  • Field names are case-insensitive; _ / - / . are ignored.
  • Durations accept Go-style strings (e.g. "30s", "200ms"). Numeric durations are interpreted as nanoseconds.
  • File extensions .toml and .tml are accepted.
  • JSON option files are rejected by design.
  • Unknown fields return an error so typos do not silently pass.

2. Raft Topology File

raft_config.example.json is consumed by every CLI in distributed mode.

Two-layer semantics

The file has two independent layers with different lifecycles. Confusing them is a common deployment mistake, so be explicit:

Layer | Keys | Lifecycle | Source of truth
Address directory | meta_root.peers, coordinator, stores, store_work_dir_template, max_retries | Read on every CLI invocation. Keep in sync with deployed containers/hosts. | This file is the source of truth. Nothing else knows where to dial.
Bootstrap seed | regions | Read only on first startup by scripts/ops/bootstrap.sh. Once a store has CURRENT, bootstrap skips it. | After first bootstrap, meta-root owns the runtime region topology. Inspect with nokv-config regions.

Consequence: editing regions after bootstrap is a no-op for running clusters. Editing addresses is effective on the next CLI invocation (restart / docker compose up).

Precedence

When a value can come from both CLI and config file, CLI wins. Config is a source of defaults:

--root-peer=1=host:2380   → explicit, used
(absent)                  → fall back to meta_root.peers[0].addr

Minimal shape

{
  "max_retries": 8,
  "meta_root": {
    "peers": [
      { "node_id": 1,
        "addr": "127.0.0.1:2380",         // coordinator/host audit tools dial here
        "docker_addr": "nokv-meta-root-1:2380",
        "transport_addr": "127.0.0.1:3380", // sibling meta-root peers dial here for raft
        "docker_transport_addr": "nokv-meta-root-1:2480",
        "work_dir": "./artifacts/cluster/meta-root-1",
        "docker_work_dir": "/var/lib/nokv-meta-root" },
      { "node_id": 2, "...": "..." },
      { "node_id": 3, "...": "..." }
    ]
  },
  "coordinator": {
    "addr": "127.0.0.1:2379",
    "docker_addr": "nokv-coordinator-1:2379,nokv-coordinator-2:2379,nokv-coordinator-3:2379"
  },
  "store_work_dir_template": "./artifacts/cluster/store-{id}",
  "store_docker_work_dir_template": "/var/lib/nokv/store-{id}",
  "stores": [
    { "store_id": 1,
      "listen_addr": "127.0.0.1:20170",
      "addr": "127.0.0.1:20170",
      "docker_listen_addr": "0.0.0.0:20160",
      "docker_addr": "nokv-store-1:20160" }
  ],
  "regions": [
    { "id": 1, "start_key": "", "end_key": "m",
      "epoch": { "version": 1, "conf_version": 1 },
      "peers": [{ "store_id": 1, "peer_id": 101 }],
      "leader_store_id": 1 }
  ]
}

Field notes

  • meta_root.peers: exactly 3 entries. addr is the gRPC service port (coordinators / host audit tools dial it). transport_addr is the raft transport port (sibling meta-root peers dial it for raft messages). They MUST be different ports on the same host.
  • coordinator.addr / docker_addr: may be a single endpoint or comma-separated for multi-coord HA (coord1:2379,coord2:2379,coord3:2379). Gateways and stores use this list to failover on lease-not-held errors.
  • stores[i]: addr is what other processes dial; listen_addr is what the store binds locally. Usually the same on host scope; different on docker scope (0.0.0.0:20160 vs nokv-store-1:20160).
  • Store workdir resolution (ResolveStoreWorkDir):
    1. store-scoped override (stores[i].work_dir / docker_work_dir)
    2. global template (must contain {id})
    3. empty — caller falls back to its own default
  • start_key / end_key accept plain strings, hex:<bytes>, or base64. Empty or "-" means unbounded.
  • leader_store_id is bootstrap metadata only. Runtime routing comes from coordinator (GetRegionByKey), never from this field.

CLI integration

nokv meta-root --config <file> --node-id N resolves --peer, --transport-addr, --workdir from the meta_root section. Explicit flags still override.

nokv coordinator --config <file> resolves --addr from coordinator.addr and --root-peer from meta_root.peers. This is how docker-compose keeps meta-root addresses in a single file.

Programmatic loading:

cfg, err := config.LoadFile("raft_config.example.json")
if err != nil {
    log.Fatal(err)
}
if err := cfg.Validate(); err != nil {
    log.Fatal(err)
}
peers := cfg.MetaRootServicePeers("docker") // id → gRPC addr

The same file drives the script entrypoints and topology inspection tools:

  • scripts/dev/cluster.sh --config raft_config.example.json
  • scripts/ops/serve-meta-root.sh --config ... --node-id 1
  • scripts/ops/serve-coordinator.sh --config ... --coordinator-id c1
  • nokv-config stores / nokv-config regions — query current rooted topology (not the JSON). Use these to diff against the deployment manifest after scheduler operations.

CLI (cmd/nokv) Reference

nokv provides operational visibility similar to RocksDB ldb / Badger CLI, with script-friendly JSON output.


Installation

go install ./cmd/nokv

Shared Flags

  • --workdir <path>: NoKV database directory (must contain CURRENT for manifest commands)
  • --json: JSON output (default is plain text)
  • --expvar <url>: for stats, fetch from /debug/vars
  • --no-region-metrics: for offline stats, skip attaching runtime region metrics

Subcommands

nokv stats

  • Reads StatsSnapshot either offline (--workdir) or online (--expvar)
  • JSON output is nested by domain (not flat)

Common fields:

  • entries
  • flush.pending, flush.queue_length, flush.last_wait_ms
  • compaction.backlog, compaction.max_score
  • value_log.segments, value_log.pending_deletes, value_log.gc.*
  • wal.active_segment, wal.segment_count, wal.typed_record_ratio
  • write.queue_depth, write.queue_entries, write.hot_key_limited
  • region.total, region.running, region.removing
  • hot.write_keys
  • lsm.levels, lsm.value_bytes_total
  • transport.*

Example:

nokv stats --workdir ./testdata/db --json | jq '.flush.queue_length'

nokv execution

  • Queries execution-plane diagnostics from a running raftstore admin endpoint
  • Exposes:
    • last admission decision observed by the store
    • restart-status summary (state, hosted region count, raft-group count, missing raft pointers)
    • current topology transition status as seen by the execution plane
  • Supports plain-text or --json output
  • Common flags:
    • --addr raftstore admin address (required)
    • --region optional region filter
    • --transition optional transition id filter
    • --timeout
    • --json

Example:

nokv execution --addr 127.0.0.1:20161 --json

nokv manifest

  • Reads manifest version state
  • Shows log pointer, per-level file info, and value-log metadata

nokv vlog

  • Lists value-log segments and current head per bucket
  • Useful after GC/recovery checks

nokv regions

  • Dumps the local peer catalog used for store recovery (state/range/epoch/peers)
  • Supports --json

nokv migrate plan

  • Runs a read-only preflight check for standalone -> cluster-seed migration
  • Verifies manifest, WAL, and value-log structure without repairing tails
  • Reports current mode, local catalog occupancy, blockers, and next step

nokv migrate init

  • Converts a standalone workdir into a single-store seeded cluster directory
  • Writes MODE.json, the full-range local region catalog entry, and the initial raft durable metadata
  • Exports a logical seed snapshot under RAFTSTORE_SNAPSHOTS/region-<id>
  • After init, ordinary standalone opens must reject the workdir unless the caller explicitly opts into distributed modes

nokv migrate status

  • Reads MODE.json when present and otherwise reports standalone
  • Shows current mode plus seed identifiers (store, region, peer)

nokv migrate expand

  • Sends one or more AddPeer requests to the leader store’s admin gRPC endpoint
  • Supports sequential rollout with repeated --target <store>:<peer>[@addr]
  • Optionally polls leader and target stores until each new peer is published, hosted, and has applied at least one raft index
  • Common flags:
    • --addr leader store admin address
    • --region
    • --target <store>:<peer>[@addr] (repeatable)
    • --wait overall wait timeout (0 disables waiting)
    • --poll-interval

nokv migrate remove-peer

  • Sends one RemovePeer request to the leader store’s admin gRPC endpoint
  • Optionally waits until the leader metadata drops the peer and the target store no longer reports it as hosted
  • Common flags:
    • --addr leader store admin address
    • --target-addr target store admin address for removal wait checks
    • --region, --peer
    • --wait, --poll-interval

nokv migrate transfer-leader

  • Sends one TransferLeader request to the leader store’s admin gRPC endpoint
  • Optionally waits until the target peer becomes the observed region leader
  • Common flags:
    • --addr current leader store admin address
    • --target-addr target store admin address for leader wait checks
    • --region, --peer
    • --wait, --poll-interval

nokv serve

  • Starts NoKV gRPC service backed by local raftstore
  • Requires --store-id
  • Also requires enough address metadata to reach the Coordinator and all remote raft peers:
    • either explicit flags (--workdir, --addr, --coordinator-addr, --store-addr)
    • or --config <raft_config.json> --scope host|docker
  • Common flags:
    • --addr (default 127.0.0.1:20160)
    • --config, --scope host|docker
    • --metrics-addr (optional expvar endpoint, exposes /debug/vars)
    • --store-addr storeID=address (repeatable override for remote store transport addresses)
    • --election-tick, --heartbeat-tick
    • --raft-max-msg-bytes, --raft-max-inflight
    • --raft-tick-interval, --raft-debug-log

Example:

nokv serve \
  --config ./raft_config.example.json \
  --scope host \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1

Restart semantics:

  • hosted peers come from raftstore/localmeta, not raft_config.json region lines
  • --config is used only to resolve:
    • store listen address
    • Coordinator address
    • storeID -> addr, which serve expands into remote peerID -> addr
  • --store-id must match the durable workdir identity once the workdir has been used
  • --store-addr is only an exceptional static override; it is keyed by stable storeID, not by mutable runtime peerID

nokv coordinator

  • Starts the Coordinator gRPC service. NoKV only supports the separated topology: coordinator always connects to an external 3-peer meta-root cluster via gRPC.
  • Required flags:
    • --coordinator-id (stable lease owner id)
    • --root-peer nodeID=addr (exactly 3 meta-root gRPC endpoints)
  • Common flags:
    • --addr (default 127.0.0.1:2379)
    • --lease-ttl, --lease-renew-before (default 10s / 3s)
    • --root-refresh (default 200ms)
    • --id-start, --ts-start (allocator seeds; only used when the meta-root cluster has no allocator state yet)
    • --config + --scope host|docker (resolves --addr from raft_config.json)
    • --metrics-addr (optional expvar endpoint, exposes /debug/vars)

Example:

nokv coordinator \
  --addr 127.0.0.1:2379 \
  --coordinator-id c1 \
  --root-peer 1=127.0.0.1:2380 \
  --root-peer 2=127.0.0.1:2381 \
  --root-peer 3=127.0.0.1:2382

nokv meta-root

  • Starts one peer of the 3-peer replicated metadata-root cluster. NoKV only supports the replicated topology; single-process local mode has been removed from the CLI.
  • Required flags:
    • --workdir, --node-id, --transport-addr
    • --peer nodeID=addr (repeatable, exactly 3)
  • Common flags:
    • --addr (default 127.0.0.1:2380, gRPC listen)
    • --tick-interval (default 100ms)
    • --metrics-addr (optional expvar endpoint)

Example:

nokv meta-root \
  --addr 127.0.0.1:2380 \
  --workdir ./artifacts/meta-root-1 \
  --node-id 1 \
  --transport-addr 127.0.0.1:3380 \
  --peer 1=127.0.0.1:3380 \
  --peer 2=127.0.0.1:3381 \
  --peer 3=127.0.0.1:3382

Script Helpers

  • scripts/ops/serve-meta-root.sh
    • Starts one replicated meta-root peer and forwards shutdown signals.
    • Requires --workdir, --node-id, --transport-addr, and 3 --peer values.
  • scripts/ops/serve-coordinator.sh
    • Starts one nokv coordinator against an external meta-root cluster.
    • Requires --coordinator-id and 3 --root-peer values (meta-root gRPC endpoints).
  • scripts/ops/serve-store.sh
    • Starts one nokv serve store against an existing durable workdir.
  • scripts/ops/bootstrap.sh
    • Seeds fresh store workdirs from config.regions; not a restart tool.
  • scripts/dev/cluster.sh
    • Dev bootstrap for the 333 separated layout: 3 meta-root + 1 coordinator (remote-rooted) + stores.
    • Uses fixed local root endpoints:
      • gRPC: 127.0.0.1:2380/2381/2382
      • raft transport: 127.0.0.1:3380/3381/3382

Expvar Keys

  • nokv_coordinator
    • Published by nokv coordinator --metrics-addr ...
    • Includes:
      • root_mode
      • rooted read-state summary
      • lease state
      • allocator window state

Integration Tips

  • Combine with RECOVERY_TRACE_METRICS=1 for recovery validation.
  • In CI, compare JSON snapshots to detect observability regressions.
  • Use nokv stats --expvar for online diagnostics and --workdir for offline forensics.

Cluster Demo

One-command demo of the full 333 HA topology (3 meta-root + 3 coordinator + 3 store + 1 fsmeta gateway).

One-shot startup

# Pull image + start every service + run bootstrap once
docker compose up -d
docker compose logs -f

docker compose down -v wipes the data volumes too.

Exposed ports (all bound to 127.0.0.1)

Symmetric port blocks so the three replicas of each role land on three consecutive numbers — easy to remember, easy to script against.

| Service | Port | Purpose |
| --- | --- | --- |
| Meta-root-1 gRPC | 2380 | host-side tools dial rooted state directly |
| Meta-root-2 gRPC | 2381 | |
| Meta-root-3 gRPC | 2382 | |
| Meta-root-1 expvar | 9380 | /debug/vars JSON |
| Meta-root-2 expvar | 9381 | |
| Meta-root-3 expvar | 9382 | |
| Coordinator-1 gRPC | 2390 | |
| Coordinator-2 gRPC | 2391 | |
| Coordinator-3 gRPC | 2392 | |
| Coordinator-1 expvar | 9100 | |
| Coordinator-2 expvar | 9101 | |
| Coordinator-3 expvar | 9102 | |
| Store-1 expvar | 9200 | |
| Store-2 expvar | 9201 | |
| Store-3 expvar | 9202 | |
| FSMeta gRPC | 8090 | filesystem metadata service |
| FSMeta expvar | 9400 | /debug/vars JSON |

Why are meta-root gRPC ports exposed?

Meta-root (2380/2381/2382) is exposed so host-side tools like nokv-config can query rooted state directly for debugging.

For production, don’t expose meta-root publicly. The gRPC API accepts ApplyTenure and ApplyHandover, which are lease-gated but still structurally sensitive. To opt out, delete the ports: block under meta-root-1, meta-root-2, and meta-root-3 in docker-compose.yml. The cluster keeps working because the coordinator and fsmeta dial meta-root over the Docker network, not through host ports.

The same applies to the coordinator gRPC ports (2390/2391/2392): they are convenient for host-side client experiments, but don’t expose them publicly.

Failure drills

Run them straight from the terminal:

  • Stop the active coordinator: docker stop nokv-coordinator-1. The Eunomia lease moves to a standby in 1–3 s; watch the era bump on the surviving coordinators’ /debug/vars.
  • Stop the raft leader meta-root: docker stop nokv-meta-root-1. Raft election lands on a surviving peer; the coordinator lease may churn through one era before settling (~17 s total recovery).
  • Start the stopped container: docker start nokv-coordinator-1. It rejoins quietly as a standby.

Scripts Overview

NoKV now groups shell entrypoints by role instead of keeping every helper flat under scripts/.

Layout

| Path | Role |
| --- | --- |
| scripts/dev | Local development and bootstrap helpers for running a cluster from raft_config*.json. |
| scripts/ops | Operator-style workflows that drive the formal migration CLI. |
| scripts/lib | Shared shell helpers for config lookup, workdir hygiene, and build/bootstrap rules. |
| scripts/*.sh | Tooling or benchmark entrypoints that are still intentionally top-level. |

This split is deliberate:

  • dev scripts are allowed to help with local experiments and smoke tests.
  • ops scripts must treat the migration CLI as source of truth and stay stricter.
  • lib is where shared rules live, so shell semantics do not drift across scripts.

Bootstrap & Local Launch

scripts/dev/cluster.sh

  • Purpose: build nokv and nokv-config, start the canonical 333 separated dev cluster, seed fresh store workdirs, then start stores from raft_config.json.
  • Starts:
    • three nokv meta-root processes (Truth plane; replicated is the only mode)
    • one nokv coordinator process (Service plane; always remote-rooted)
    • all configured stores (Execution plane)
  • Uses shared rules from:
    • scripts/lib/common.sh
    • scripts/lib/config.sh
    • scripts/lib/workdir.sh
  • Example:
    ./scripts/dev/cluster.sh --config ./raft_config.example.json --workdir ./artifacts/cluster
    
  • Notes:
    • --config defaults to ./raft_config.example.json
    • --workdir defaults to ./artifacts/cluster
    • uses fixed local metadata-root gRPC endpoints 127.0.0.1:2380/2381/2382
    • uses fixed local metadata-root raft transport endpoints 127.0.0.1:3380/3381/3382
    • store workdirs are rejected if they contain unexpected files
    • logs stream with [meta-root-<id>] / [coordinator] / [store-<id>] prefixes and are still written to root.log / coordinator.log / server.log
    • this is a bootstrap/dev launcher, not a restart command
    • production-style restarts should run scripts/ops/serve-meta-root.sh, scripts/ops/serve-coordinator.sh, and scripts/ops/serve-store.sh directly against the same durable workdirs

scripts/ops/bootstrap.sh

  • Purpose: seed local peer catalog metadata into a set of store directories derived from a path template.
  • Intended for:
    • Docker Compose bootstrap
    • local static-topology experiments
  • Example:
    ./scripts/ops/bootstrap.sh --config /etc/nokv/raft_config.json --path-template /data/store-{id}
    
  • Notes:
    • skips stores that already contain CURRENT
    • refuses to seed into dirty directories
    • this is bootstrap-only; it does not recover runtime topology
    • use nokv serve / scripts/ops/serve-store.sh to restart an existing store from the same workdir

scripts/ops/serve-store.sh

  • Purpose: thin wrapper around nokv serve for one store.
  • Example:
    ./scripts/ops/serve-store.sh \
      --config ./raft_config.example.json \
      --store-id 1 \
      --workdir ./artifacts/cluster/store-1 \
      --scope local
    
  • Notes:
    • resolves store listen/workdir/Coordinator defaults through nokv serve --config
    • remote peer recovery comes from raftstore/localmeta; config stores only provide storeID -> addr
    • no longer treats config.regions as restart-time topology truth
    • if a static transport override is needed, use --store-addr <store-id>=<addr> rather than a peer-id keyed mapping
    • --scope docker selects container-friendly addresses

scripts/ops/serve-meta-root.sh

  • Purpose: thin wrapper around nokv meta-root for one replicated metadata-root peer.
  • Example:
    ./scripts/ops/serve-meta-root.sh \
      --addr 127.0.0.1:2380 \
      --workdir ./artifacts/cluster/meta-root-1 \
      --node-id 1 \
      --transport-addr 127.0.0.1:3380 \
      --peer 1=127.0.0.1:3380 \
      --peer 2=127.0.0.1:3381 \
      --peer 3=127.0.0.1:3382
    
  • Notes:
    • --peer values are metadata-root raft transport addresses, not gRPC service addresses
    • --workdir, --node-id, --transport-addr, and exactly 3 --peer values are required; there is no single-process local mode
    • forwards shutdown signals to nokv meta-root

scripts/ops/serve-coordinator.sh

  • Purpose: thin wrapper around nokv coordinator for one coordinator process wired to an external 3-peer meta-root cluster.
  • Example:
    ./scripts/ops/serve-coordinator.sh \
      --addr 127.0.0.1:2379 \
      --coordinator-id c1 \
      --root-peer 1=127.0.0.1:2380 \
      --root-peer 2=127.0.0.1:2381 \
      --root-peer 3=127.0.0.1:2382
    
  • Notes:
    • --root-peer values are metadata-root gRPC service addresses, not raft transport
    • exactly 3 --root-peer values are required (mirrors the Truth-plane quorum)
    • --coordinator-id is required (stable lease owner id)
    • forwards shutdown signals to nokv coordinator

Migration Workflow

scripts/ops/migrate-cluster.sh

  • Purpose: one-shot local operator wrapper for the standalone-to-cluster migration path.
  • Drives:
    • nokv migrate plan
    • nokv migrate init
    • nokv migrate expand
    • optional transfer-leader
    • optional remove-peer
  • Example:
    ./scripts/ops/migrate-cluster.sh \
      --config ./raft_config.example.json \
      --workdir ./artifacts/standalone \
      --seed-store 1 \
      --seed-region 1 \
      --seed-peer 101 \
      --target 2:201 \
      --target 3:301 \
      --transfer-leader 201 \
      --remove-peer 101
    
  • Notes:
    • seed workdir must already contain standalone data
    • target store workdirs must be fresh
    • uses the migration CLI as the only source of truth

Shared Shell Rules

scripts/lib/common.sh

  • shared repo-root detection
  • shared build helpers for nokv and nokv-config
  • shared TCP readiness helper

scripts/lib/config.sh

  • shared nokv-config lookups for:
    • store lines
    • region lines
    • Coordinator address
    • Coordinator workdir

scripts/lib/workdir.sh

  • shared workdir hygiene rules:
    • remove stale LOCK
    • reject unexpected directory contents
    • assert a directory is fresh before seeding or bootstrap

This is where shell-level correctness rules should continue to live. New scripts should reuse these helpers instead of open-coding workdir or config parsing logic again.

Tooling & Benchmarks

| Script | Purpose |
| --- | --- |
| scripts/run_benchmarks.sh | Execute YCSB benchmarks (default engines: NoKV/Badger/Pebble, optional RocksDB). |
| scripts/build_rocksdb.sh | Build local RocksDB artifacts used by benchmark comparisons. |
| scripts/debug.sh | Wrap dlv test for focused debugging. |
| scripts/gen.sh | Format protobufs and regenerate Go bindings through Buf. |

For recovery and transport fault validation, use direct Go tests instead of shell wrappers:

RECOVERY_TRACE_METRICS=1 \
go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v

CHAOS_TRACE_METRICS=1 \
go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport

Relationship with nokv-config

  • nokv-config stores / regions / coordinator remain the structured topology source for shell scripts.
  • config.regions remain bootstrap/deployment metadata, not restart-time peer truth.
  • nokv-config catalog writes Region metadata into the local peer catalog.
  • Go tools can import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate directly (a sketch follows at the end of this section).

Maintaining a single raft_config.json still keeps development scripts, Docker Compose, and automated tests aligned. The difference now is that shell behavior is shared and explicit instead of repeated across four separate entrypoints.
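
For Go tooling, a minimal sketch of that direct import is shown below. The exact signatures of config.LoadFile and the Validate call are assumptions here; treat this as the shape of the integration rather than a verbatim API reference.

```go
// Minimal sketch (signatures assumed): load raft_config.json through the
// shared config package and validate it before using it in a Go tool.
package main

import (
	"log"

	"github.com/feichai0017/NoKV/config"
)

func main() {
	cfg, err := config.LoadFile("./raft_config.example.json")
	if err != nil {
		log.Fatalf("load config: %v", err)
	}
	if err := cfg.Validate(); err != nil {
		log.Fatalf("validate config: %v", err)
	}
	log.Println("raft_config loaded and validated")
}
```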

Stats & Observability Pipeline

NoKV exposes runtime health through:

  • StatsSnapshot (structured in-process snapshot)
  • expvar (/debug/vars)
  • nokv stats CLI (plain text or JSON)

The implementation lives in stats.go, and collection runs continuously once the DB is open.


1. Architecture

flowchart TD
    subgraph COLLECTORS["Collectors"]
        LSM["lsm.* metrics"]
        WAL["wal metrics"]
        VLOG["value log metrics"]
        HOT["thermos"]
        REGION["region metrics"]
        TRANSPORT["grpc transport metrics"]
    end
    LSM --> SNAP["Stats.Snapshot()"]
    WAL --> SNAP
    VLOG --> SNAP
    HOT --> SNAP
    REGION --> SNAP
    TRANSPORT --> SNAP
    SNAP --> EXP["Stats.collect -> expvar"]
    SNAP --> CLI["nokv stats"]

Two-layer design:

  • metrics layer: only collects counters/gauges/snapshots.
  • stats layer: aggregates cross-module data and exports.
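
The split can be pictured with a small, illustrative sketch (these are not NoKV's actual types): the metrics layer only records counters, and the stats layer aggregates them into a domain-grouped snapshot for export.

```go
// Illustrative sketch of the two-layer split (not NoKV's real types):
// the metrics layer owns counters; the stats layer builds a read-only,
// domain-grouped snapshot for expvar / CLI export.
package stats

import "sync/atomic"

// Metrics layer: one module-local collector, no cross-module knowledge.
type FlushMetrics struct {
	pending     atomic.Int64
	queueLength atomic.Int64
}

func (m *FlushMetrics) IncPending() { m.pending.Add(1) }

func (m *FlushMetrics) SetQueueLength(n int64) { m.queueLength.Store(n) }

// Stats layer: aggregates collectors into a domain-grouped snapshot.
type Snapshot struct {
	Flush struct {
		Pending     int64 `json:"pending"`
		QueueLength int64 `json:"queue_length"`
	} `json:"flush"`
}

func Collect(f *FlushMetrics) Snapshot {
	var s Snapshot
	s.Flush.Pending = f.pending.Load()
	s.Flush.QueueLength = f.queueLength.Load()
	return s
}
```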

2. Snapshot Schema

StatsSnapshot is now domain-grouped (not flat):

  • entries
  • flush.*
  • compaction.*
  • value_log.* (includes value_log.gc.*)
  • wal.*
  • raft.*
  • write.*
  • region.*
  • hot.*
  • cache.*
  • lsm.*
  • transport.*

Representative fields:

  • flush.pending, flush.queue_length, flush.last_wait_ms
  • compaction.backlog, compaction.max_score, compaction.value_weight
  • value_log.segments, value_log.pending_deletes, value_log.gc.gc_runs
  • wal.active_segment, wal.segment_count, wal.typed_record_ratio
  • raft.group_count, raft.lagging_groups, raft.max_lag_segments
  • write.queue_depth, write.avg_request_wait_ms, write.hot_key_limited
  • region.total, region.running, region.removing, region.tombstone
  • hot.write_keys, hot.write_ring
  • cache.block_l0_hit_rate, cache.bloom_hit_rate, cache.iterator_reused
  • lsm.levels, lsm.value_bytes_total

3. expvar Export

Stats.collect exports a single structured object:

  • NoKV.Stats

All domains (flush, compaction, value_log, wal, raft, write, region, hot, cache, lsm, transport) are nested under this object.

Legacy scalar compatibility keys are removed. Consumers should read fields from NoKV.Stats directly.


4. CLI & JSON

  • nokv stats --workdir <dir>: offline snapshot from local DB
  • nokv stats --expvar <host:port>: snapshot from running process /debug/vars
  • nokv stats --json: machine-readable nested JSON

Example:

{
  "entries": 1048576,
  "flush": {
    "pending": 2,
    "queue_length": 2
  },
  "value_log": {
    "segments": 6,
    "pending_deletes": 1,
    "gc": {
      "gc_runs": 12
    }
  },
  "hot": {
    "write_keys": [
      {"key": "user:123", "count": 42}
    ]
  }
}

5. Operational Guidance

  • flush.queue_length + compaction.backlog both rising: flush/compaction under-provisioned.
  • value_log.discard_queue high for long periods: check value_log.gc.* and compaction pressure.
  • write.throttle_active=true frequently: L0 pressure likely high; inspect cache.block_l0_hit_rate and compaction.
  • write.hot_key_limited increasing: hot key write throttling is active.
  • raft.lag_warning=true: at least one group exceeds lag threshold.
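
One hedged way to turn these rules into an automated check is sketched below. The field names come from the snapshot schema above, while the endpoint address and the exact JSON nesting under NoKV.Stats are assumptions based on this document.

```go
// Minimal monitoring sketch: pull NoKV.Stats from a store's expvar
// endpoint and flag the conditions listed above. The address (Store-1
// expvar from the compose demo) and the JSON layout are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type snapshot struct {
	Flush struct {
		QueueLength int64 `json:"queue_length"`
	} `json:"flush"`
	Compaction struct {
		Backlog int64 `json:"backlog"`
	} `json:"compaction"`
	Write struct {
		ThrottleActive bool `json:"throttle_active"`
	} `json:"write"`
	Raft struct {
		LagWarning bool `json:"lag_warning"`
	} `json:"raft"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:9200/debug/vars")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var vars struct {
		Stats snapshot `json:"NoKV.Stats"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
		log.Fatal(err)
	}
	s := vars.Stats
	if s.Flush.QueueLength > 0 && s.Compaction.Backlog > 0 {
		fmt.Println("flush/compaction possibly under-provisioned")
	}
	if s.Write.ThrottleActive {
		fmt.Println("write throttle active: inspect L0 and compaction pressure")
	}
	if s.Raft.LagWarning {
		fmt.Println("at least one raft group exceeds the lag threshold")
	}
}
```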

6. Comparison

| Engine | Built-in observability |
| --- | --- |
| RocksDB | Rich metrics/perf context, often needs additional tooling/parsing |
| Badger | Optional metrics integrations |
| NoKV | Native expvar + structured snapshot + CLI with offline/online modes |

Testing & Validation Matrix

This document inventories NoKV’s automated coverage and provides guidance for extending tests. It aligns module-level unit tests, integration suites, and benchmarking harnesses with the architectural features described elsewhere.


1. Quick Commands

# All unit + integration tests (uses local module caches)
GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./...

# Focused distributed transaction suite
go test ./percolator/... ./raftstore/client/... -run 'Test.*(Commit|Prewrite|TwoPhaseCommit)'

# Focused distributed migration / membership / restart suite
go test ./raftstore/integration -count=1

# Crash recovery scenarios
RECOVERY_TRACE_METRICS=1 \
go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v

# Protobuf schema hygiene
make proto-check

# gRPC transport chaos tests + watchdog metrics
CHAOS_TRACE_METRICS=1 \
go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport

# Sample Coordinator service for shared TSO / routing in distributed tests
go run ./cmd/nokv coordinator --addr 127.0.0.1:2379 --id-start 1 --ts-start 100 --workdir ./artifacts/coordinator

# Local three-node cluster (includes catalog bootstrap + Coordinator)
./scripts/dev/cluster.sh --config ./raft_config.example.json
# Tear down with Ctrl+C

# Docker-compose sandbox (3 nodes + Coordinator)
docker compose up -d
docker compose down -v

# Build RocksDB locally (installs into ./third_party/rocksdb/dist by default)
./scripts/build_rocksdb.sh
# YCSB baseline (records=1e6, ops=1e6, warmup=1e5, conc=16)
./scripts/run_benchmarks.sh
# YCSB with RocksDB (requires CGO, `benchmark_rocksdb`, and the RocksDB build above)
LD_LIBRARY_PATH="$(pwd)/third_party/rocksdb/dist/lib:${LD_LIBRARY_PATH}" \
CGO_CFLAGS="-I$(pwd)/third_party/rocksdb/dist/include" \
CGO_LDFLAGS="-L$(pwd)/third_party/rocksdb/dist/lib -lrocksdb -lz -lbz2 -lsnappy -lzstd -llz4" \
YCSB_ENGINES="nokv,badger,rocksdb" ./scripts/run_benchmarks.sh
# One-click script (auto-detect RocksDB, supports `YCSB_*` env vars to override defaults)
./scripts/run_benchmarks.sh
# Quick smoke run (smaller dataset)
NOKV_RUN_BENCHMARKS=1 YCSB_RECORDS=10000 YCSB_OPS=50000 YCSB_WARM_OPS=0 \
./scripts/run_benchmarks.sh -ycsb_workloads=A -ycsb_engines=nokv

Tip: Pin GOCACHE/GOMODCACHE in CI to keep build artefacts local and avoid permission issues.


2. Module Coverage Overview

| Module | Tests | Coverage Highlights | Gaps / Next Steps |
| --- | --- | --- | --- |
| WAL | engine/wal/manager_test.go | Segment rotation, sync semantics, replay tolerance for truncation, directory bootstrap. | Add IO fault injection, concurrent append stress. |
| LSM / Flush / Compaction | engine/lsm/lsm_test.go, engine/lsm/picker_test.go, engine/lsm/planner_test.go, engine/lsm/compaction_test.go, engine/lsm/flush_runtime_test.go | Memtable correctness, iterator merging, flush pipeline metrics, compaction scheduling. | Extend backpressure assertions and workload-shape coverage. |
| Manifest | engine/manifest/manager_test.go, engine/lsm/manifest_test.go | CURRENT swap safety, rewrite crash handling, vlog metadata persistence. | Simulate partial edit corruption, column family extensions. |
| ValueLog | engine/vlog/manager_test.go, engine/vlog/io_test.go, vlog_test.go | ValuePtr encoding/decoding, GC rewrite/rewind, concurrent iterator safety. | Long-running GC, discard-ratio edge cases. |
| Percolator / Distributed Txn | percolator/*_test.go, raftstore/client/client_test.go, stats_test.go | Prewrite/Commit/ResolveLock flows, 2PC retries, timestamp-driven MVCC behaviour, metrics accounting. | Mixed multi-region fuzzing with lock TTL and leader churn. |
| DB Integration | db_test.go, db_bench_test.go | End-to-end writes, recovery, and throttle behaviour. | Combine ValueLog GC + compaction stress, multi-DB interference. |
| CLI & Stats | cmd/nokv/main_test.go, stats_test.go | Golden JSON output, stats snapshot correctness, hot key ranking. | CLI error handling, expvar HTTP integration tests. |
| Scripts & Tooling | cmd/nokv-config/main_test.go, cmd/nokv/serve_test.go | nokv-config JSON/simple formats, catalog bootstrap CLI, serve bootstrap behavior. | Add direct shell-script golden tests (currently not present) and failure-path diagnostics for cluster.sh. |
| Distributed Migration & Membership | raftstore/integration/*_test.go, raftstore/migrate/*_test.go, raftstore/admin/service_test.go | Standalone -> seeded -> cluster flow, snapshot install, add/remove peer, leader transfer, restart/dehost recovery, Coordinator outage after startup, quorum-loss context propagation, multi-region 2PC deadline propagation, repeated link flap during membership changes, partitioned follower catch-up, and snapshot-install interruption before publish. | Keep expanding publish-boundary coverage and larger fault matrices around runtime/transport interleavings. |
| Benchmark | benchmark/ycsb/ycsb_test.go, benchmark/ycsb/ycsb_runner.go | YCSB throughput/latency comparisons across engines (A-F) with detailed percentile + operation mix reporting. | Automate multi-node deployments and add longer-running, multi-GB stability baselines. |

3. System Scenarios

| Scenario | Coverage | Focus |
| --- | --- | --- |
| Crash recovery | db_test.go | WAL replay, fail-fast on missing/corrupt SST (manifest preserved for investigation), vlog GC restart, manifest rewrite safety. |
| WAL pointer desync | raftstore/raftlog/wal_storage_test.go::TestWALStorageDetectsTruncatedSegment | Detects store-local raft pointer offsets beyond truncated WAL tails to avoid silent corruption. |
| Distributed transaction contention | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, percolator/*_test.go | Lock conflicts, retries, and 2PC sequencing under region routing. |
| Value separation + GC | engine/vlog/manager_test.go, db_test.go::TestRecoveryRemovesStaleValueLogSegment | GC correctness, manifest integration, iterator stability. |
| Iterator consistency | engine/lsm/iterator_test.go | Snapshot visibility, merging iterators across levels and memtables. |
| Throttling / backpressure | engine/lsm/compaction_test.go, db_test.go::TestWriteThrottle | L0 backlog triggers, flush queue growth, metrics observation. |
| Distributed NoKV client | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, raftstore/transport/grpc_transport_test.go::TestGRPCTransportManualTicksDriveElection | Region-aware routing, NotLeader retries, manual tick-driven elections, cross-region 2PC sequencing. |
| Migration & membership orchestration | raftstore/integration/migration_flow_test.go, raftstore/integration/restart_recovery_test.go, raftstore/integration/coordinator_degraded_test.go, raftstore/integration/snapshot_interruption_test.go, raftstore/integration/context_propagation_test.go, raftstore/integration/transport_chaos_test.go | Seed bootstrap, multi-peer rollout, leader transfer, peer removal, restarted follower recovery, removed-peer dehost after restart, Coordinator outage after startup, quorum-loss read/write timeouts, split-region 2PC deadline propagation, repeated link flap during membership changes, partitioned follower catch-up, transfer-leader retry after partition recovery, and snapshot-install interruption before publish. |
| Performance regression | benchmark package | Compare NoKV vs Badger/Pebble by default (RocksDB optional), produce human-readable reports under benchmark/benchmark_results. |

4. Observability in Tests

  • RECOVERY_METRIC logs – produced when RECOVERY_TRACE_METRICS=1; helpful when triaging targeted recovery suites and CI failures.
  • TRANSPORT_METRIC logs – emitted by transport chaos tests when CHAOS_TRACE_METRICS=1, capturing gRPC watchdog counters during network partitions and retries.
  • Stats snapshots – stats_test.go verifies JSON structure so CLI output remains backwards compatible.
  • Benchmark artefacts – stored under benchmark/results/ for shared suites and under suite-local result directories where applicable, for example benchmark/results/ycsb/.

5. Extending Coverage

  1. Property-based testing – integrate testing/quick or third-party generators to randomise distributed 2PC sequences (prewrite/commit/rollback ordering); a minimal skeleton is sketched after this list.
  2. Stress harness – add a Go-based stress driver to run mixed read/write workloads for hours, capturing metrics akin to RocksDB’s db_stress tool.
  3. Distributed readiness – strengthen raftstore fault-injection and long-run tests (leader transfer, transport chaos, snapshot catch-up) with reproducible CI artifacts.
  4. CLI smoke tests – simulate corrupted directories to ensure CLI emits actionable errors.
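
The property-based direction in item 1 could start from a skeleton like the one below. It drives a toy prewrite/commit/rollback model, not NoKV's transaction code, and is only meant to show the testing/quick harness shape.

```go
// Illustrative property-test skeleton: random op sequences against a toy
// prewrite/commit/rollback model, asserting that a key is never locked
// and committed at the same time. Not NoKV's actual transaction code.
package txnprop_test

import (
	"math/rand"
	"testing"
	"testing/quick"
)

type toyTxn struct {
	locked    bool
	committed bool
}

func (t *toyTxn) prewrite() {
	if !t.committed {
		t.locked = true
	}
}

func (t *toyTxn) commit() {
	if t.locked {
		t.committed = true
		t.locked = false
	}
}

func (t *toyTxn) rollback() {
	t.locked = false
}

func TestRandomTxnSequencesKeepInvariant(t *testing.T) {
	prop := func(seed int64, steps uint8) bool {
		rng := rand.New(rand.NewSource(seed))
		txn := &toyTxn{}
		for i := 0; i < int(steps); i++ {
			switch rng.Intn(3) {
			case 0:
				txn.prewrite()
			case 1:
				txn.commit()
			case 2:
				txn.rollback()
			}
			if txn.locked && txn.committed {
				return false // invariant violated
			}
		}
		return true
	}
	if err := quick.Check(prop, nil); err != nil {
		t.Fatal(err)
	}
}
```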

6. Distributed Test Layers

  • Protocol unit tests: package-local tests under raftstore/peer, raftstore/store, raftstore/admin, raftstore/snapshot, and raftstore/migrate validate one protocol surface at a time.
  • Node-local integration tests: store/admin tests verify snapshot install, membership application, and region runtime publication without booting a full cluster.
  • Multi-node deterministic data-plane integration tests: raftstore/integration uses raftstore/testcluster to boot real stores, wire transports, and drive migration/member flows against live runtimes.
  • Multi-node deterministic control-plane integration tests: coordinator/integration/*_test.go uses coordinator/testcluster to boot 3 coordinator + replicated meta, exercise rooted watch/reload propagation, follower write rejection, allocator-fence/remove-region propagation, and control-plane read staleness without mixing those cases into store/data-plane tests.
  • Restart and recovery suites: raftstore/integration/restart_recovery_test.go covers restarted followers, removed-peer dehost persistence, and leader restart with subsequent membership changes.
  • Control-plane degradation and publish-boundary tests: raftstore/integration/coordinator_degraded_test.go and raftstore/integration/snapshot_interruption_test.go cover live Coordinator outage after startup and failpoint-driven snapshot interruption before peer publication.

When adding new distributed tests:

  • use raftstore/testcluster for store/data-plane behavior
  • use coordinator/testcluster for control-plane / replicated-root behavior
  • avoid embedding ad-hoc cluster bootstrap helpers into feature-specific test files

7. Distributed Fault Matrix

| Fault Class | Current Coverage | Primary Tests | Notes |
| --- | --- | --- | --- |
| Snapshot export/install failure | Covered | raftstore/migrate/expand_test.go, raftstore/store/peer_lifecycle_test.go, raftstore/admin/service_test.go | Covers leader export failure, target install failure, and corrupt payload rejection without partially hosted peers. |
| Membership wait timeouts | Covered | raftstore/migrate/expand_test.go, raftstore/migrate/remove_peer_test.go, raftstore/migrate/transfer_leader_test.go | Verifies timeout surfaces when leader metadata does not publish, target never hosts, peer removal never converges, or leader transfer stalls. |
| Follower restart after snapshot install | Covered | raftstore/integration/restart_recovery_test.go::TestExpandedPeerRestartPreservesRegionAndData | Ensures installed peer persists region metadata and data after restart. |
| Removed peer restart | Covered | raftstore/integration/restart_recovery_test.go::TestRemovedPeerRestartDoesNotRehost | Ensures dehosted peers do not come back after restart. |
| Leader restart with follow-up membership change | Covered | raftstore/integration/restart_recovery_test.go::TestLeaderRestartStillAllowsMembershipChanges | Exercises leadership churn before a later remove-peer operation. |
| Control-plane degraded / Coordinator unavailable | Covered | coordinator/adapter/scheduler_client_test.go, raftstore/store/command_ops_test.go::TestStoreProposeCommandSurvivesSchedulerUnavailable, raftstore/integration/coordinator_degraded_test.go::TestClusterSurvivesCoordinatorUnavailableAfterStartup | Covers both local degraded scheduler semantics and live multi-node Coordinator outage after route cache warmup; new cold-route misses still fail with RouteUnavailable as expected. |
| Scheduler queue overflow / dropped operations | Covered | raftstore/store/scheduler_runtime_test.go::TestStoreSchedulerStatusTracksQueueDrop | Validates local degraded status and dropped operation accounting. |
| Snapshot install interrupted before publish | Covered | raftstore/integration/snapshot_interruption_test.go::TestExpandSnapshotInstallInterruptedBeforePublish, raftstore/store/peer_lifecycle_test.go::TestStoreInstallRegionSnapshotRejectsCorruptPayload | Uses failpoint injection to verify target install aborts without leaving a hosted peer or polluted region metadata, then retries cleanly after restart. |
| Request cancel / deadline propagation | Covered | raftstore/client/client_test.go::TestClientGetHonorsCanceledContextDuringRouteLookup, raftstore/client/client_test.go::TestClientGetHonorsCanceledContextDuringRPC, raftstore/client/client_test.go::TestClientPutHonorsCanceledContextDuringRouteLookup, raftstore/client/client_test.go::TestClientPutHonorsCanceledContextDuringRPC, raftstore/client/client_test.go::TestClientTwoPhaseCommitHonorsCanceledContextDuringMultiRegionRouteLookup, raftstore/client/client_test.go::TestClientTwoPhaseCommitHonorsCanceledContextDuringMultiRegionRPC, raftstore/client/client_test.go::TestClientResolveLocksHonorsCanceledContextDuringMultiRegionRPC, raftstore/integration/context_propagation_test.go::TestClientReadWriteHonorContextUnderQuorumLoss, raftstore/integration/context_propagation_test.go::TestClientTwoPhaseCommitHonorsContextAcrossSplitRegionsUnderPartialQuorumLoss | Verifies read/write paths plus multi-region 2PC and resolve-lock flows preserve caller cancellation/deadlines through route lookup, RPC, and live split-region quorum loss instead of collapsing to generic retry exhaustion. |
| Transport partition / interleave recovery | Covered | raftstore/transport/grpc_transport_test.go::TestGRPCTransportHandlesPartition, raftstore/transport/grpc_transport_test.go::TestGRPCTransportFailpointBeforeSendRPCRecoversAfterClear, raftstore/peer/peer_test.go::TestPeerFailpointAfterReadyAdvanceBeforeSendRecoversOnLaterTicks, raftstore/integration/transport_chaos_test.go::TestPartitionedFollowerCatchesUpAfterRecovery, raftstore/integration/transport_chaos_test.go::TestTransferLeaderRecoversAfterPartitionedTargetReturns, raftstore/integration/transport_chaos_test.go::TestRepeatedLinkFlapConvergesDuringMembershipChanges | Covers low-level gRPC link blocking, send-boundary failpoints, Ready advance/send publication gaps, repeated link flaps during membership operations, and live cluster recovery after follower isolation/restart plus transfer-leader timeout/retry under transport partitions. |
| Split/merge restart safety | Covered | raftstore/store/store_test.go::TestStoreRestartPreservesSplitMergeLocalMeta, raftstore/integration/split_merge_recovery_test.go::TestSplitMergeRestartSafetyAcrossStores | Covers store-local recovery plus live multi-store split -> restart -> merge -> restart flow after making split/merge admin replay idempotent across restart. |

Next fault-matrix additions should focus on:

  • more publish-boundary failpoints around snapshot install and migration init
  • deeper transport/interleave chaos beyond partition + recovery, especially more concurrent membership combinations and repeated multi-link flaps

Keep this matrix updated when adding new modules or scenarios so documentation and automation remain aligned.