
NoKV — Documentation

NoKV

An open-source namespace metadata substrate for distributed filesystems, object storage, and AI dataset metadata.

Native fsmeta primitives · Own LSM · Own Raft · Own MVCC · Own control plane


NoKV is the open-source counterpart of the “stateless schema layer + transactional KV” pattern that powers Meta Tectonic (over ZippyDB), Google Colossus (over Bigtable), and DeepSeek 3FS (over FoundationDB). The headline service is fsmeta, a namespace metadata API for distributed filesystems / object storage / AI dataset metadata.

The interesting part isn’t the feature list. The interesting part is that layer separation is enforced in code: the fsmeta executor consumes a narrow TxnRunner; the default OpenWithRaftstore adapter owns raftstore wiring; meta/root keeps only lifecycle / authority truth; the storage engine never learns that a namespace exists.
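
As a rough illustration of how small that seam can be, the sketch below shows one possible shape for such a runner. It is not the actual fsmeta TxnRunner definition; the Txn methods and names are assumptions for illustration only.

// Illustrative sketch only: the real fsmeta TxnRunner and Txn shapes may differ.
package sketch

import "context"

// Txn is a deliberately small per-transaction surface; method names are assumptions.
type Txn interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
	Delete(key []byte) error
}

// TxnRunner is the only storage-facing dependency the fsmeta executor would need
// in this sketch: run one transactional closure against whatever backend the
// adapter wires in (raftstore in distributed mode, an embedded DB in tests).
type TxnRunner interface {
	RunTxn(ctx context.Context, fn func(txn Txn) error) error
}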

This site is the technical docs hub. For the project landing page, headline benchmarks, and the Why NoKV vs X? matrix, see the root README.


🧭 Three Audiences, One Substrate

| | DFS frontend | Object storage namespace | AI dataset metadata |
|---|---|---|---|
| Consumer shape | FUSE / NFS / SMB driver | S3-compatible HTTP gateway | training pipeline / scheduler |
| fsmeta primitives used | ReadDirPlus, WatchSubtree, SnapshotSubtree, RenameSubtree | ReadDirPlus for LIST, WatchSubtree for bucket events, SnapshotSubtree for versions, RenameSubtree for prefix moves | SnapshotSubtree for dataset versions, WatchSubtree for checkpoint notification, ReadDirPlus for batch metadata fetch |
| Comparable industrial pattern | Tectonic / Colossus / 3FS / HopsFS | Tectonic / Colossus over object layer | Mooncake / Quiver / 3FS dataset layer |

All three consume the same rooted truth in meta/root and the same native primitives in fsmeta — schema is not specialized to any single consumer.

Deep dive: fsmeta positioning · namespace authority events umbrella


📑 If You Read Only Three Pages

Start here:

  1. fsmeta.md — namespace metadata service (the headline). Primitives, lifecycle authority, deployment.
  2. architecture.md — three-layer architecture. Where each module lives, what each layer is allowed to know.
  3. control_and_execution_protocols.md — the contract between control plane (coordinator/), execution plane (raftstore/), and rooted truth (meta/root/).

For the authority schema behind those primitives, read notes/2026-04-25-namespace-authority-events-umbrella.md.


🗺️ Read By Interest

🗂️ Namespace metadata service (fsmeta) — the primary product

| Topic | Doc |
|---|---|
| Complete reference (primitives + lifecycle + deployment) | fsmeta.md |
| Positioning v5 (DFS / OSS / AI three-audience) | notes/2026-04-24-fsmeta-positioning.md |
| Namespace authority events umbrella (Mount / SubtreeAuthority / SnapshotEpoch / QuotaFence schema) | notes/2026-04-25-namespace-authority-events-umbrella.md |
| Snapshot subtree MVCC epoch | notes/2026-04-25-snapshot-subtree-mvcc-epoch.md |
| Benchmark results | fsmeta.md · benchmark/fsmeta/results/ |

🔬 Correctness models

| Topic | Location |
|---|---|
| TLA+ / TLC models for control-plane and metadata transition safety | spec/ · spec/README.md |
| Checked artifacts | spec/artifacts/ |

🏛️ Distributed runtime — the layer below fsmeta

| Topic | Doc |
|---|---|
| Rooted truth kernel (meta/root) | rooted_truth.md |
| Coordinator (route / TSO / heartbeats / WatchRootEvents stream) | coordinator.md |
| Coordinator ↔ meta/root deployment separation | notes/2026-04-12-coordinator-meta-separation.md |
| Coordinator-driven store registry and rooted membership | coordinator.md · rooted_truth.md |
| Raftstore overview (store / peer / admin) | raftstore.md |
| Control-plane ↔ execution-plane contract | control_and_execution_protocols.md |
| Standalone → distributed migration | migration.md |
| Recovery model | recovery.md |
| Percolator MVCC 2PC + AssertionNotExist | percolator.md |
| Runtime call chains (sequence diagrams) | runtime.md |

🔧 Storage engine internals — the foundation

The single-node substrate that everything sits on. Independently usable as an embedded Go LSM + Raft library.

| Topic | Doc |
|---|---|
| High-level architecture | architecture.md |
| WAL discipline and replay | wal.md |
| MemTable + ART/SkipList (ART pinned for fsmeta) | memtable.md |
| Flush pipeline | flush.md |
| Leveled compaction + landing buffer | compaction.md · landing_buffer.md |
| Value log (KV separation + GC) | vlog.md |
| Manifest semantics | manifest.md |
| Range filter | range_filter.md |
| Block / row cache | cache.md |
| VFS abstraction + FaultFS | vfs.md · file.md |
| Hot-key observer (Thermos) | thermos.md |
| Entry / error model | entry.md · errors.md |

🛠️ Operations and tooling

| Topic | Doc |
|---|---|
| CLI reference (nokv — stats / manifest / regions / mount / quota / migrate) | cli.md |
| nokv-fsmeta standalone gRPC gateway | fsmeta.md |
| Configuration (one JSON file shared by all binaries) | config.md |
| Cluster demo | demo.md |
| Scripts layout | scripts.md |
| Stats / expvar / metrics (4 namespaces: executor, watch, quota, mount) | stats.md |
| Testing strategy (failpoints, chaos, restart, migration) | testing.md |

📒 Notable design decision records

All notes under notes/ are dated decision records — they explain the why, not just the what.


🏗️ Architecture at a Glance

NoKV Architecture

Layer 1  fsmeta            ← namespace primitives (Create / ReadDirPlus / WatchSubtree / RenameSubtree / SnapshotSubtree / Link / Unlink with link-count GC)
   │
Layer 2  meta/root         ← rooted authority truth (Mount / SubtreeAuthority / SnapshotEpoch / QuotaFence)
         coordinator       ← routing, TSO, store discovery, root-event publish + WatchRootEvents stream
         raftstore         ← per-region Raft + apply observer
         percolator        ← 2PC + MVCC + AssertionNotExist + commit-ts retry
   │
Layer 3  engine            ← LSM + ART memtable + WAL + value log (with per-CF/prefix value separation policy: fsm\x00 → AlwaysInline)

Four boundaries enforced in code:

  1. fsmeta-first API. Metadata operations expose filesystem/object-namespace shapes directly, instead of forcing users to assemble them from raw KV calls.
  2. Layer separation enforced. The fsmeta executor consumes a narrow TxnRunner; the default runtime adapter owns raftstore wiring; lower layers do not import fsmeta.
  3. Multi-gateway-safe. Quota fences are rooted truth; usage counters are data-plane keys updated in the same Percolator transaction as metadata mutations. Subtree handoff uses rooted events plus runtime repair.
  4. Root-event driven lifecycle. coordinator.WatchRootEvents pushes mount retire / quota fence / pending handoff changes after bootstrap; the configured monitor interval serves only as reconnect backoff.

⚡ Quick Start

Bring up a full cluster + register a mount + use fsmeta

# 1. Build binaries
make build

# 2. Launch full cluster: meta-root + coordinator + 3 stores + fsmeta gateway
./scripts/dev/cluster.sh --config ./raft_config.example.json
# (Or: docker compose up -d  — includes mount-init bootstrap)

# 3. Register a mount (rooted authority)
nokv mount register --coordinator-addr 127.0.0.1:2379 \
  --mount default --root-inode 1 --schema-version 1

# 4. (Optional) Set a quota fence
nokv quota set --coordinator-addr 127.0.0.1:2379 \
  --mount default --limit-bytes 10737418240 --limit-inodes 10000000

# 5. Use fsmeta from any gRPC client (Go typed client at fsmeta/client/)
#    or embedded Go: see fsmetaexec.OpenWithRaftstore in the root README

# 6. Inspect runtime state
curl http://127.0.0.1:9101/debug/vars | jq '.nokv_fsmeta_executor, .nokv_fsmeta_watch, .nokv_fsmeta_quota, .nokv_fsmeta_mount'
nokv stats --workdir ./artifacts/cluster/store-1

Full walkthrough: getting_started.md · CLI reference: cli.md


🔗 Jump Points

| Jump point | Description |
|---|---|
| fsmeta service | The headline product — namespace metadata API |
| Formal specs | TLA+ / TLC models for transition safety |
| CLI surface | nokv — stats, manifest, regions, mount, quota, migrate |
| Topology config | One JSON file shared by scripts, Docker, all CLI |
| Coordinator | Route / TSO / heartbeat / root-event subscribe |
| Rooted truth | meta/root typed event log |
| Percolator / MVCC | 2PC primitives in distributed mode |
| Runtime call chains | Function-level sequence diagrams |
| Testing | Failpoints, chaos, restart, migration |
| SUMMARY.md | Full mdbook table of contents |

Open-source namespace metadata substrate for DFS, OSS, and AI dataset metadata.
Built from scratch — no external storage engine, no external Raft library, no external coordinator.

Getting Started

This guide gets you from zero to a running NoKV cluster (or an embedded DB) in a few minutes.

Prerequisites

  • Go 1.26+
  • Git
  • (Optional) Docker + Docker Compose for containerized runs

Option A: Local cluster script

The launcher script below brings up the 333 separated layout: 3 replicated meta-root peers (Truth plane), 1 coordinator (Service plane), and the stores declared in the config (Execution plane).

./scripts/dev/cluster.sh --config ./raft_config.example.json

The launcher stays attached, streams meta-root/Coordinator/store logs to the terminal, and also writes them under ./artifacts/cluster/.

If you stop one store and want to restart it later, restart it against the same workdir:

./scripts/ops/serve-store.sh \
  --config ./raft_config.example.json \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1

On restart, NoKV recovers hosted peers from local metadata in the store workdir. The config file is only used to start stores and Coordinator. Runtime clients discover store addresses from Coordinator heartbeats. Do not rerun scripts/ops/bootstrap.sh or treat scripts/dev/cluster.sh as the restart path for an already-running store.

Inspect stats

go run ./cmd/nokv stats --workdir ./artifacts/cluster/store-1

Option B: Docker Compose

This runs the full cluster (3 meta-root + 3 coordinator + 3 store + fsmeta gateway) in containers with the published GHCR image.

docker compose up -d
docker compose logs -f

To force-refresh :latest before startup, use:

make docker-up

make docker-up pulls the published image first. If the GHCR package is not published or public yet, it falls back to a local Docker build.

For local Docker development builds from this checkout:

docker compose up -d --build

Local builds are tagged as the configured NoKV image. If you build locally and then want to return to the published :latest, run the pull command above. For reproducible runs, pin a published SHA tag:

NOKV_IMAGE_TAG=<commit-sha> docker compose up -d

Tear down:

docker compose down -v

Embedded Usage (single-process)

Use NoKV as a library when you do not need raftstore.

package main

import (
	"fmt"
	"log"

	NoKV "github.com/feichai0017/NoKV"
)

func main() {
	opt := NoKV.NewDefaultOptions()
	opt.WorkDir = "./workdir-demo"

	db, err := NoKV.Open(opt)
	if err != nil {
		log.Fatalf("open failed: %v", err)
	}
	defer db.Close()

	key := []byte("hello")
	if err := db.Set(key, []byte("world")); err != nil {
		log.Fatalf("set failed: %v", err)
	}

	entry, err := db.Get(key)
	if err != nil {
		log.Fatalf("get failed: %v", err)
	}
	fmt.Printf("value=%s\n", entry.Value)
}

Note:

  • DB.Get returns detached entries (do not call DecrRef).
  • DB.GetInternalEntry returns borrowed entries and callers must call DecrRef exactly once.
  • DB.SetWithTTL accepts time.Duration (relative TTL). DB.Set/DB.SetBatch/DB.SetWithTTL reject nil values; use DB.Del or DB.DeleteRange(start,end) for deletes. A short usage sketch follows this list.
  • DB.NewIterator exposes user-facing entries, while DB.NewInternalIterator scans raw internal keys (cf+user_key+ts).
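
A minimal sketch of those contracts, continuing the example above (add "time" to the imports; the exact argument shapes of SetWithTTL and GetInternalEntry are assumptions, not verified signatures):

// Hedged sketch: SetWithTTL and GetInternalEntry argument shapes are assumed.
if err := db.SetWithTTL([]byte("session"), []byte("state"), 5*time.Minute); err != nil {
	log.Fatalf("set with ttl failed: %v", err)
}

pub, err := db.Get([]byte("session")) // detached copy: never call DecrRef on it
if err != nil {
	log.Fatalf("get failed: %v", err)
}
fmt.Printf("value=%s\n", pub.Value)

internal, err := db.GetInternalEntry([]byte("session")) // borrowed: release exactly once
if err != nil {
	log.Fatalf("internal get failed: %v", err)
}
internal.DecrRef()

// Deletes go through Del / DeleteRange; Set with a nil value is rejected.
if err := db.Del([]byte("session")); err != nil {
	log.Fatalf("del failed: %v", err)
}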

Benchmarks

Micro benchmarks:

go test -bench=. -run=^$ ./...

YCSB (default: NoKV + Badger + Pebble, workloads A-F):

make bench

Override defaults with env vars:

YCSB_RECORDS=1000000 YCSB_OPS=1000000 YCSB_CONC=8 make bench

Detailed benchmark methodology and latest result snapshots are maintained in: benchmark/README.md.

Cleanup

If a local run crashes or you want a clean slate:

make clean

Troubleshooting

  • WAL replay errors after crash: wipe the workdir and restart the cluster.
  • Port conflicts: adjust addresses in raft_config.example.json.
  • Slow startup: reduce YCSB_RECORDS or YCSB_OPS when benchmarking locally.

NoKV Architecture Overview

NoKV delivers a hybrid storage engine that can operate as a standalone embedded KV store or as a distributed NoKV service. The distributed RPC surface follows a TinyKV/TiKV-style region + MVCC design, but the service identity and deployment model are NoKV’s own. This document captures the key building blocks, how they interact, and the execution flow from client to disk.

Read this page if you want the shortest route from “what is NoKV” to “which package owns which part of the system”.

This architecture is also meant to support NoKV as a maintainable and extensible distributed storage research platform. The point is not only to describe how the current system runs, but to make the package boundaries, lifecycle ownership, and experiment surfaces explicit enough that new storage-engine, metadata, control-plane, and distributed-runtime ideas can be added without rebuilding the repository around each new topic.

At a high level, the codebase is organized around four long-lived layers:

  • Root facade and runtime surface – the top-level DB APIs and thin system entrypoints.
  • Single-node engine substrate – engine/* owns WAL, LSM, manifest, value log, file, and VFS mechanics.
  • Distributed execution and control plane – raftstore/*, meta/*, and coordinator/* host replicated execution, rooted metadata, and cluster control logic.
  • Experiment and evidence layer – benchmark/*, scripts, and docs keep evaluation and design claims attached to the implementation.

Reader Map

  • If you care about the embedded engine, focus on sections 2 and 5.
  • If you care about distributed runtime ownership, focus on sections 3, 4, and 5.
  • If you care about migration and recovery, read this page together with migration.md and recovery.md.

1. High-Level Layout

┌─────────────────────────┐   NoKV gRPC     ┌─────────────────────────┐
│ raftstore Service       │◀──────────────▶ │ raftstore/client        │
└───────────┬─────────────┘                 │  (Get / Scan / Mutate)  │
            │                               └─────────────────────────┘
            │ ReadCommand / ProposeCommand
            ▼
┌─────────────────────────┐
│ store.Store / peer.Peer │  ← multi-Raft region lifecycle
│  ├ Local peer catalog   │
│  ├ Router / region catalog │
│  └ transport (gRPC)     │
└───────────┬─────────────┘
            │ Apply via kv.Apply
            ▼
┌─────────────────────────┐
│ kv.Apply + percolator   │
│  ├ Get / Scan           │
│  ├ Prewrite / Commit    │
│  └ Latch manager        │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Embedded NoKV core      │
│  ├ WAL Manager          │
│  ├ MemTable / Flush     │
│  ├ ValueLog + GC        │
│  └ Manifest / Stats     │
└─────────────────────────┘
  • Embedded mode uses NoKV.Open directly: WAL→MemTable→SST durability, ValueLog separation, non-transactional APIs with internal version ordering, and rich stats.
  • Distributed mode layers raftstore on top: multi-Raft regions reuse the same WAL, keep store-local recovery metadata separate from storage manifest state, expose metrics, and serve NoKV RPCs.
  • Control plane split: raft_config provides bootstrap topology; Coordinator provides runtime routing/TSO/control-plane state in cluster mode.
  • Clients obtain leader-aware routing, automatic NotLeader/EpochNotMatch retries, and two-phase commit helpers.

Same system, two shapes

flowchart LR
    App["App / CLI / fsmeta client"]
    App --> Embedded["Embedded NoKV DB"]
    App --> RPC["NoKV RPC / raftstore/client"]

    subgraph "Standalone shape"
        Embedded --> Core["WAL + LSM + VLog + MVCC"]
    end

    subgraph "Distributed shape"
        RPC --> Server["server.Node"]
        Server --> Store["store.Store"]
        Store --> Peer["peer.Peer"]
        Peer --> Core
        Store --> Coordinator["Coordinator"]
    end

    Embedded -. migrate init / seed .-> Store

Detailed Runtime Paths

For function-level call chains with sequence diagrams (embedded write/read, iterator scan, distributed read/write via Raft apply), see docs/runtime.md.


2. Embedded Engine

Code entry points

If you want to inspect the embedded side first, start here:

opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./workdir"

db, err := NoKV.Open(opt)
if err != nil {
    panic(err)
}
defer db.Close()

_ = db.Set([]byte("hello"), []byte("world"))
entry, _ := db.Get([]byte("hello"))
fmt.Println(string(entry.Value))

Then read:

  • db.go
  • engine/lsm/
  • engine/wal/
  • vlog.go

2.1 WAL & MemTable

  • wal.Manager appends [len|type|payload|crc] records (typed WAL), rotates segments, and replays logs on crash. A framing sketch follows this list.
  • MemTable accumulates writes until full, then enters the flush queue; the concrete flush runtime runs Enqueue → Build → Install → Release, logs edits, and releases WAL segments.
  • Writes are handled by a single commit worker that performs value-log append first, then WAL/memtable apply, keeping durability ordering simple and consistent.
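
For intuition, here is a sketch of the [len|type|payload|crc] framing; field widths, byte order, and checksum choice are assumptions rather than wal.Manager's actual on-disk layout.

// Illustrative framing only: not the real wal.Manager encoding.
package sketch

import (
	"encoding/binary"
	"hash/crc32"
)

func encodeRecord(recType byte, payload []byte) []byte {
	buf := make([]byte, 0, 4+1+len(payload)+4)
	buf = binary.LittleEndian.AppendUint32(buf, uint32(len(payload))) // len
	buf = append(buf, recType)                                        // type
	buf = append(buf, payload...)                                     // payload
	return binary.LittleEndian.AppendUint32(buf, crc32.ChecksumIEEE(buf)) // crc over len|type|payload
}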

2.2 ValueLog

  • Large values are written to the ValueLog before the WAL append; the resulting ValuePtr is stored in WAL/LSM so replay can recover.
  • vlog.Manager tracks the active head and uses flush discard stats to trigger GC; manifest records new heads and removed segments.

2.3 Manifest

  • manifest.Manager stores only storage-engine metadata: SST metadata, WAL checkpoints, and ValueLog metadata. Store-local raft replay pointers live in raftstore/localmeta.
  • CURRENT provides crash-safe pointer updates for storage-engine metadata. Region descriptors are no longer stored in the storage manifest.

2.4 LSM Compaction & Landing Buffer

  • lsm.compaction drives compaction cycles; lsm.levelManager supplies table metadata and executes the plan.
  • Planning is split inside lsm: PlanFor* selects table IDs + key ranges, then LSM resolves IDs back to tables and runs the merge.
  • lsm.State guards overlapping key ranges and tracks in-flight table IDs.
  • Landing shard selection is policy-driven in lsm (PickShardOrder / PickShardByBacklog) while the landing buffer remains in lsm.
flowchart TD
  Manager["lsm.compaction"] --> LSM["lsm.levelManager"]
  LSM -->|TableMeta snapshot| Planner["PlanFor*"]
  Planner --> Plan["lsm.Plan (fid+range)"]
  Plan -->|resolvePlanLocked| Exec["LSM executor"]
  Exec --> State["lsm.State guard"]
  Exec --> Build["subcompact/build SST"]
  Build --> Manifest["manifest edits"]
  L0["L0 tables"] -->|moveToLanding| Landing["landing buffer shards"]
  Landing -->|LandingDrain: landing-only| Main["Main tables"]
  Landing -->|LandingKeep: landing-merge| Landing

2.5 Distributed Transaction Path

  • percolator implements Prewrite/Commit/ResolveLock/CheckTxnStatus; kv.Apply dispatches raft commands to these helpers.
  • MVCC timestamps come from the distributed client/Coordinator TSO flow, not from an embedded standalone transaction API.
  • Watermarks (utils.WaterMark) are used in durability/visibility coordination; they have no background goroutine and advance via mutex + atomics.

2.6 Write Pipeline & Backpressure

  • Writes enqueue into a commit queue inside db.go where requests are coalesced into batches before a commit worker drains them.
  • The commit worker always writes the value log first (when needed), then applies WAL/LSM updates; SyncWrites adds a WAL fsync step.
  • Batch sizing adapts to backlog through WriteBatchMaxCount, WriteBatchMaxSize, and WriteBatchWait.
  • Backpressure is enforced in two places: LSM throttling toggles db.blockWrites when L0 backlog grows, and Thermos can reject hot keys via WriteHotKeyLimit. An options sketch follows this list.
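
A hedged example of those knobs, assuming they sit directly on the Options value returned by NewDefaultOptions (the "time" import is needed; values are illustrative, not recommendations):

// Hedged sketch: field placement and types are assumptions based on the option
// names above, not a verified Options definition.
opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./workdir-tuned"
opt.SyncWrites = true                     // add a WAL fsync step to each commit batch
opt.WriteBatchMaxCount = 256              // max coalesced requests per commit batch
opt.WriteBatchMaxSize = 4 << 20           // max coalesced bytes per commit batch
opt.WriteBatchWait = 2 * time.Millisecond // how long the worker waits to fill a batch
opt.WriteHotKeyLimit = 10000              // Thermos hot-key write rejection threshold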

2.7 Ref-Count Lifecycle Contracts

NoKV uses fail-fast reference counting for internal pooled/owned objects. DecrRef underflow is treated as a lifecycle bug and panics.

| Object | Owned by | Borrowed by | Release rule |
|---|---|---|---|
| kv.Entry (pooled) | internal write/read pipelines | codec iterator, memtable/lsm internal reads, request batches | Must call DecrRef exactly once per borrow. |
| kv.Entry (detached public result) | caller | none | Returned by DB.Get; must not call DecrRef. |
| kv.Entry (borrowed internal result) | caller | yes (DecrRef) | Returned by DB.GetInternalEntry; caller must release exactly once. |
| request | commit queue/worker | waiter path (Wait) | IncrRef on enqueue; Wait does one DecrRef; zero returns request to pool and releases entries. |
| table | level/main+landing lists, block cache | table iterators, prefetch workers | Removed tables are decremented once after manifest+in-memory swap; zero deletes SST. |
| Skiplist / ART index | memtable | iterators | Iterator creation increments index ref; iterator Close decrements; double-close is idempotent. |

3. Replication Layer (raftstore)

Code entry points

If you want to inspect the distributed side first, start here:

srv, err := server.NewNode(server.Config{
    Storage: server.Storage{MVCC: db, Raft: db.RaftLog()},
    Store: store.Config{StoreID: 1},
    Raft: myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
    TransportAddr: "127.0.0.1:20160",
})
if err != nil {
    panic(err)
}
defer srv.Close()

Then read:

  • raftstore/server/node.go
  • raftstore/store/store.go
  • raftstore/peer/peer.go
  • raftstore/raftlog/wal_storage.go
  • raftstore/localmeta/store.go
| Package | Responsibility |
|---|---|
| store | Region catalog/runtime root, router, RegionMetrics, scheduler + command runtimes, helpers such as StartPeer / SplitRegion. |
| peer | Wraps etcd/raft RawNode, handles Ready pipeline, snapshot resend queue, backlog instrumentation. |
| raftlog | WALStorage/DiskStorage/MemoryStorage, reusing the DB's WAL while keeping store-local raft replay metadata in sync. |
| transport | gRPC transport for Raft Step messages, connection management, retries/blocks/TLS. Also acts as the host for NoKV RPC. |
| kv | NoKV RPC handler plus kv.Apply bridging Raft commands to MVCC logic. |
| server | Config + NewNode combine DB, Store, transport, and NoKV service into a reusable node instance. |

3.1 Bootstrap Sequence

  1. server.NewNode wires DB, store configuration (StoreID, hooks, scheduler), Raft config, and transport address. It registers NoKV RPC on the shared gRPC server and sets transport.SetHandler(store.Step).
  2. CLI (nokv serve) or application enumerates the local peer catalog and calls Store.StartPeer for every Region containing the local store:
    • peer.Config includes Raft params, transport, kv.NewEntryApplier, peer storage, and Region metadata.
    • Router registration, regionManager bookkeeping, optional Peer.Bootstrap with initial peer list, leader campaign.
  3. Peers from other stores can be configured through transport.SetPeer(peerID, addr) (raft peer ID). In cluster mode, runtime routing/control-plane decisions come from Coordinator.

3.2 Command Paths

  • ReadCommand (KvGet/KvScan): validate Region & leader, execute Raft ReadIndex (LinearizableRead) and WaitApplied, then run commandApplier (i.e. kv.Apply in read mode) to fetch data from the DB. This yields leader-strong reads with an explicit Raft linearizability barrier.
  • ProposeCommand (write): encode the request, push through Router to the leader peer, replicate via Raft, and apply in kv.Apply which maps to MVCC operations.

3.3 Transport

  • gRPC server handles Step RPCs and NoKV RPCs on the same endpoint; peers are registered via SetPeer.
  • Retry policies (WithRetry) and TLS credentials are configurable. Tests cover partitions, blocked peers, and slow followers.

4. NoKV Service

raftstore/kv/service.go exposes pb.NoKV RPCs:

| RPC | Execution | Result |
|---|---|---|
| KvGet | store.ReadCommand → kv.Apply GET | pb.GetResponse / RegionError |
| KvScan | store.ReadCommand → kv.Apply SCAN | pb.ScanResponse / RegionError |
| KvPrewrite | store.ProposeCommand → percolator.Prewrite | pb.PrewriteResponse |
| KvCommit | store.ProposeCommand → percolator.Commit | pb.CommitResponse |
| KvResolveLock | percolator.ResolveLock | pb.ResolveLockResponse |
| KvCheckTxnStatus | percolator.CheckTxnStatus | pb.CheckTxnStatusResponse |

nokv serve is the CLI entry point—open the DB, construct server.Node, register peers, start local Raft peers, and display a local peer catalog summary (Regions, key ranges, peers). scripts/dev/cluster.sh builds the CLI, seeds local peer catalogs, and launches the 333 separated layout (3 meta-root peers + 1 coordinator + all configured stores) on localhost, handling cleanup on Ctrl+C.

The RPC request/response shape is intentionally close to TinyKV/TiKV so the MVCC and region semantics remain familiar, but the service name exposed on the wire is pb.NoKV.


5. Client Workflow

raftstore/client offers a leader-aware client with retry logic and convenient helpers:

  • Initialization: provide a Coordinator-backed RegionResolver (GetRegionByKey) and StoreResolver (GetStore) so runtime routing and store discovery are Coordinator-driven.
  • Reads: Get and Scan pick the leader store for a key range, issue NoKV RPCs, and retry on NotLeader/EpochNotMatch.
  • Writes: Mutate bundles operations per region and drives Prewrite/Commit (primary first, secondaries after); Put and Delete are convenience wrappers using the same 2PC path.
  • Timestamps: clients must supply startVersion/commitVersion. For distributed demos, use Coordinator (nokv coordinator) to obtain globally increasing values before calling TwoPhaseCommit.
  • Bootstrap helpers: scripts/dev/cluster.sh --config raft_config.example.json builds the binaries, seeds local peer catalogs via nokv-config catalog, launches the 3 meta-root peers + coordinator, and starts the stores declared in the config.

Example (two regions)

  1. Regions [a,m) and [m,+∞), each led by a different store.
  2. Mutate(ctx, primary="alfa", mutations, startTs, commitTs, ttl) prewrites and commits across the relevant regions (a hedged sketch follows this list).
  3. Get/Scan retries automatically if the leader changes.
  4. See raftstore/server/node_test.go for a full end-to-end example using real server.Node instances.
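
A hedged sketch of step 2: only the Mutate call shape is taken from the example above; the client constructor, mutation type, and TSO helper names below are assumptions, not the real raftstore/client API.

// Sketch only: see raftstore/server/node_test.go for the real end-to-end version.
cli := client.New(regionResolver, storeResolver) // Coordinator-backed resolvers (assumed constructor)

startTs := nextTSO(ctx)  // obtain from Coordinator TSO (assumed helper)
commitTs := nextTSO(ctx)

muts := []client.Mutation{ // assumed mutation shape
	{Op: client.OpPut, Key: []byte("alfa"), Value: []byte("1")}, // lands in region [a,m)
	{Op: client.OpPut, Key: []byte("mike"), Value: []byte("2")}, // lands in region [m,+inf)
}

// Primary is prewritten first, secondaries after, then commit per region.
if err := cli.Mutate(ctx, []byte("alfa"), muts, startTs, commitTs, ttl); err != nil {
	log.Fatalf("mutate failed: %v", err)
}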

6. Failure Handling

  • Manifest edits capture only storage metadata, WAL checkpoints, and ValueLog pointers. Store-local region recovery state and raft replay pointers are loaded from raftstore/localmeta.
  • WAL replay reconstructs memtables and Raft groups; ValueLog recovery trims partial records.
  • Stats.StartStats resumes metrics sampling immediately after restart, making it easy to verify recovery correctness via nokv stats.

7. Observability & Tooling

  • StatsSnapshot publishes flush/compaction/WAL/VLog/raft/region/hot/cache metrics. nokv stats and the expvar endpoint expose the same data.
  • nokv regions inspects the local peer catalog.
  • nokv serve advertises Region samples on startup (ID, key range, peers) for quick verification.
  • Inspect scheduler/control-plane state via Coordinator APIs/metrics.
  • Scripts:
    • scripts/dev/cluster.sh – launch a multi-node NoKV cluster locally.
    • RECOVERY_TRACE_METRICS=1 go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v – crash-recovery validation.
    • CHAOS_TRACE_METRICS=1 go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport – inject network faults and observe transport metrics.

8. When to Use NoKV

  • Embedded: call NoKV.Open, use the local non-transactional DB APIs.
  • Distributed: deploy nokv serve nodes, use raftstore/client (or any NoKV gRPC client) to perform reads, scans, and 2PC writes.
  • Observability-first: inspection via CLI or expvar is built-in; Region, WAL, Flush, and Raft metrics are accessible without extra instrumentation.

See also docs/raftstore.md for deeper internals, docs/coordinator.md for control-plane details, and docs/testing.md for coverage details.

Runtime Call Chains (Current)

This document focuses on the current execution paths in NoKV and maps API calls to concrete functions in the codebase.

It intentionally describes only what is running today.


1. API Surface Snapshot

| Mode | Read APIs | Write APIs | Txn APIs |
|---|---|---|---|
| Embedded (NoKV.DB) | Get, NewIterator, NewInternalIterator | Set, SetBatch, SetWithTTL, Del, DeleteRange, ApplyInternalEntries | N/A (no standalone local txn API) |
| Distributed (raftstore/kv) | KvGet, KvBatchGet, KvScan | N/A direct write | KvPrewrite, KvCommit, KvBatchRollback, KvResolveLock, KvCheckTxnStatus |

Core entry points for each path are listed in the function-level chains below.


2. Embedded Write Path (Set / SetBatch / SetWithTTL / Del / DeleteRange)

2.1 Function-Level Chain

  1. DB.Set / DB.SetBatch / DB.SetWithTTL / DB.Del / DB.DeleteRange allocates monotonic non-transactional versions and creates internal-key entries via kv.NewInternalEntry.
  2. DB.ApplyInternalEntries validates each internal key via kv.SplitInternalKey, then calls batchSet.
  3. batchSet enqueues request (sendToWriteCh -> enqueueCommitRequest -> bounded MPSC commit queue).
  4. commitWorker acquires a long-lived utils.MPSCConsumer[*commitRequest] and drains a batch:
    • vlog.write(requests) writes large values first and produces ValuePtr.
    • applyRequests -> writeToLSM -> lsm.SetBatch.
    • if SyncWrites uses the dedicated sync pipeline, committed-but-unsynced batches are handed off to syncWorker; otherwise commitWorker performs wal.Sync() inline when required.
  5. lsm.SetBatch writes one atomic batch by delegating to memTable.setBatch; inside that step:
    • wal.AppendEntryBatch
    • mem index insert.

2.2 Sequence Diagram

sequenceDiagram
    participant U as User API
    participant DB as DB.Set/SetBatch/SetWithTTL/Del/DeleteRange
    participant Q as "commitQueue (MPSCQueue)"
    participant W as commitWorker
    participant V as vlog.write
    participant L as lsm.SetBatch
    participant M as memTable.setBatch
    participant WAL as wal.AppendEntryBatch
    U->>DB: Set/SetBatch/SetWithTTL/Del/DeleteRange
    DB->>DB: NewInternalEntry + ApplyInternalEntries
    DB->>Q: sendToWriteCh / enqueueCommitRequest
    Q->>W: nextCommitBatch
    W->>V: write(requests)
    V-->>W: ValuePtr for large values
    W->>L: writeToLSM(entries)
    L->>M: setBatch(entries)
    M->>WAL: AppendEntryBatch(entries)
    M-->>L: index.Add(...)
    L-->>W: success

3. Embedded Read Path (Get / GetInternalEntry)

3.1 Function-Level Chain

  1. DB.Get builds InternalKey(CFDefault, userKey, nonTxnMaxVersion).
  2. loadBorrowedEntry calls lsm.Get for the newest visible internal record.
  3. If value is pointer (BitValuePointer), read real bytes via vlog.read, clear pointer bit.
  4. PopulateInternalMeta ensures CF/Version cache matches internal key.
  5. DB.Get returns detached public entry via cloneEntry (user key + copied value).
  6. DB.GetInternalEntry returns borrowed internal entry (caller must DecrRef).

3.2 Sequence Diagram

sequenceDiagram
    participant U as User API
    participant DB as DB.Get/GetInternalEntry
    participant LSM as lsm.Get
    participant VLOG as vlog.read
    U->>DB: Get(userKey)
    DB->>DB: InternalKey(CFDefault,userKey,nonTxnMaxVersion)
    DB->>LSM: Get(internalKey)
    LSM-->>DB: pooled Entry (internal key)
    alt BitValuePointer set
        DB->>VLOG: read(ValuePtr)
        VLOG-->>DB: raw value bytes
    end
    DB->>DB: PopulateInternalMeta
    alt Get (public)
        DB->>DB: cloneEntry -> user key/value copy
        DB-->>U: detached Entry
    else GetInternalEntry
        DB-->>U: borrowed internal Entry (DecrRef required)
    end

4. Iterator Paths

4.1 Public Iterator (DB.NewIterator)

  1. Build merged internal iterator: lsm.NewIterators + lsm.NewMergeIterator.
  2. Seek converts user key to internal seek key (CFDefault + nonTxnMaxVersion).
  3. populate/materialize:
    • parse internal key (kv.SplitInternalKey)
    • apply bounds on user key
    • optionally resolve vlog pointer
    • expose user-key item.

4.2 Internal Iterator (DB.NewInternalIterator)

  • Directly returns merged iterator over internal keys.
  • No user-key rewrite; caller handles kv.SplitInternalKey.
flowchart TD
  A["DB.NewIterator"] --> B["lsm.NewMergeIterator(internal keys)"]
  B --> C["Seek(userKey -> InternalKey)"]
  C --> D["populate/materialize"]
  D --> E["SplitInternalKey + bounds check"]
  E --> F{"BitValuePointer?"}
  F -- no --> G["Expose inline value"]
  F -- yes --> H["vlog.read(ValuePtr)"]
  H --> G
  G --> I["Item.Entry uses user key"]

5. Distributed Read Path (KvGet / KvBatchGet / KvScan)

5.1 Function-Level Chain

  1. raftstore/kv.Service builds RaftCmdRequest from NoKV RPC.
  2. Store.ReadCommand:
    • validateCommand (region/epoch/leader/key-range)
    • peer.LinearizableRead
    • peer.WaitApplied
    • commandApplier(req) (injected as kv.Apply).
  3. kv.Apply executes:
    • handleGet -> percolator.Reader.GetLock + GetValue
    • handleScan -> iterate CFWrite, resolve visible versions.

5.2 Sequence Diagram

sequenceDiagram
    participant C as NoKV Client
    participant SVC as kv.Service
    participant ST as Store.ReadCommand
    participant P as peer.LinearizableRead
    participant AP as kv.Apply
    participant R as percolator.Reader
    participant DB as DB
    C->>SVC: KvGet/KvBatchGet/KvScan
    SVC->>ST: ReadCommand(RaftCmdRequest)
    ST->>ST: validateCommand
    ST->>P: LinearizableRead + WaitApplied
    ST->>AP: commandApplier(req)
    AP->>R: GetLock/GetValue or scan CFWrite
    R->>DB: GetInternalEntry/NewInternalIterator
    AP-->>SVC: RaftCmdResponse
    SVC-->>C: NoKV response

6. Distributed Write Path (2PC via Raft Apply)

6.1 Function-Level Chain

  1. Client (raftstore/client) runs Mutate / TwoPhaseCommit by region.
  2. RPC layer (kv.Service) sends write commands through Store.ProposeCommand.
  3. Raft replication commits log entries; apply path invokes kv.Apply.
  4. kv.Apply dispatches to percolator.Prewrite/Commit/BatchRollback/ResolveLock/CheckTxnStatus.
  5. Percolator mutators call applyVersionedOps:
    • build entries via kv.NewInternalEntry
    • call db.ApplyInternalEntries
    • release refs (DecrRef).
  6. Storage then follows the same embedded write pipeline (MPSC commit queue -> vlog -> LSM/WAL, with optional sync handoff).

6.2 Sequence Diagram

sequenceDiagram
    participant CL as raftstore/client
    participant SVC as kv.Service
    participant ST as Store.ProposeCommand
    participant RF as Raft replicate/apply
    participant AP as kv.Apply
    participant TXN as percolator.txn
    participant DB as DB.ApplyInternalEntries
    CL->>SVC: KvPrewrite/KvCommit/...
    SVC->>ST: ProposeCommand
    ST->>RF: route to leader + replicate
    RF->>AP: apply committed command
    AP->>TXN: Prewrite/Commit/Resolve...
    TXN->>DB: applyVersionedOps -> ApplyInternalEntries
    DB-->>TXN: write result
    AP-->>SVC: RaftCmdResponse
    SVC-->>CL: NoKV response

7. Entry Ownership and Refcount Rules

| Source | Returned entry type | Key form | Caller action |
|---|---|---|---|
| DB.GetInternalEntry | Borrowed pooled | Internal key | Must call DecrRef() once |
| DB.Get | Detached copy | User key | Must not call DecrRef() |
| percolator.applyVersionedOps temporary entries | Borrowed pooled | Internal key | Always DecrRef() after ApplyInternalEntries |
| LSM.Get / memtable reads | Borrowed pooled | Internal key | Upstream owner must release |

8. Key/Value Shape by Stage

| Stage | Entry.Key | Entry.Value | Notes |
|---|---|---|---|
| User write before queue | Internal key (CF + user key + ts) | Raw user bytes | Built by NewInternalEntry |
| After vlog step | Internal key | Inline value or ValuePtr.Encode() | Pointer marked by BitValuePointer |
| LSM/WAL stored form | Internal key | Encoded value payload | Used by replay/flush/compaction |
| GetInternalEntry output | Internal key | Raw value bytes (pointer resolved) | Internal caller view |
| Get / public iterator output | User key | Raw value bytes | External caller view |

Control-Plane and Execution-Plane Protocols

This note defines NoKV’s protocol line for:

  • the control plane
  • the execution plane

and the next-stage evolution around them.

The current implementation status is intentionally minimal but no longer just a design sketch:

  • control-plane protocol v1 is implemented and exposed through Coordinator RPCs plus meta/root storage semantics
  • execution-plane protocol v1 is implemented as a store-local contract with a small admin diagnostics API surface

The point of this document is to keep those two lines coordinated instead of letting them drift into separate, implicit rule sets.

The control plane focuses on the contract between:

  • meta/root
  • coordinator

The execution plane focuses on the contract between:

  • coordinator
  • raftstore
  • local durable state (raftstore/localmeta, raft log, restart replay)

The purpose of this document is not to replace Raft or redesign the data plane. The purpose is to make NoKV’s existing cross-plane behavior explicit, testable, and evolvable.

The control plane is protocolized around four ideas:

  • Freshness
  • CatchUp
  • Transition
  • DegradedMode

These four ideas already existed in partial form inside the implementation. The current work turns them into a stable vocabulary, explicit invariants, and a clear rollout line.

The execution plane is protocolized around four matching ideas:

  • Admission
  • ExecutionTarget
  • PublishBoundary
  • RestartState

0. Current Status

The control plane now has a minimal implemented v1.

Implemented and exposed through pb/coordinator/coordinator.proto, coordinator/server, coordinator/rootview, and tests:

  • route-read Freshness
  • RootToken
  • root_lag
  • DegradedMode
  • CatchUpState
  • TransitionID
  • PublishRootEventResponse.assessment as a pre-persist lifecycle assessment

This means the protocol is no longer only a design direction. It is already the formal serving contract for key Coordinator APIs.

Not implemented in v1:

  • richer transition phases such as Published / Stalled
  • a fuller catch-up action surface exposed through API
  • automatic recovery policy derived from protocol state
  • broad client-side policy that consumes every protocol field

So the right description today is:

control-plane protocol v1 is implemented and in use, while richer scheduler/runtime policy is not implemented in v1.

The execution plane is in a different state.

Today, raftstore has a minimal implemented v1 inside raftstore/store.

Already implemented and exercised through store-local types, raftstore/admin, runtime state, and tests:

  • explicit Admission classes and reasons on read / write / topology entry points
  • explicit topology ExecutionOutcome
  • explicit topology PublishState
  • explicit RestartState derived from raftstore/localmeta + raft replay pointers
  • terminal publish failures retained as visible retry state instead of silent drop
  • admin diagnostics exposure through pb/admin/admin.proto ExecutionStatus

Not implemented as first-class execution protocol fields yet:

  • request validation and routing
  • context propagation
  • detailed local leader admission diagnostics
  • detailed per-attempt scheduler retry/backoff policy
  • metrics for planned truth -> execute -> terminal truth latency
  • richer degraded local scheduler states

The current landing is still mostly store-local and spread across:

  • raftstore/store
  • raftstore/peer
  • raftstore/raftlog
  • raftstore/localmeta

So the right description there is:

execution-plane protocol v1 now exists as a minimal named runtime contract, with store-local state and admin-visible diagnostics, while broader metrics, policy, and richer executor states are not implemented in v1.


1. Intent

NoKV already has the right building blocks:

  • rooted truth events
  • checkpoint + committed tail
  • watch-first tail subscription
  • rebuildable coordinator/catalog
  • explicit planned and terminal topology events

Before v1, these pieces mostly existed as implementation mechanics. The control plane now has a formal minimum contract, while several policy extensions remain intentionally outside v1:

  • when a follower read is fresh enough
  • when a follower must reload
  • when retained tail catch-up is no longer enough
  • what phase a topology change is in
  • what “degraded” actually means to callers

The design goal is to keep turning these implicit behaviors into a formal protocol.

That protocol should be:

  • small
  • explicit
  • observable
  • testable
  • compatible with the current architecture

2. Scope

This document covers both planes, but not at the same implementation depth.

For the control plane, it defines the behavior of:

  • rooted truth consumption
  • control-plane view freshness
  • rooted catch-up progression
  • topology transition lifecycle
  • degraded operating modes

For the execution plane, it defines the protocol direction for:

  • request admission
  • transition execution
  • terminal truth publication
  • restart and local recovery alignment
  • degraded local behavior around scheduler / queue / publish boundaries

It does not redefine:

  • Raft replication
  • Percolator / 2PC transaction semantics
  • store-local recovery metadata
  • storage-engine internals unrelated to distributed lifecycle

This document should be read as two linked contracts:

control plane = durable truth + materialized view + serving contract

execution plane = admitted work + local execution + publish/restart contract


3. Protocol Objects

The naming set should remain compact, stable, and precise.

3.1 RootToken

RootToken is the rooted truth position already incorporated by some materialized view.

It is the control-plane equivalent of:

  • “what truth have I already consumed?”

It should be treated as:

  • monotonic
  • comparable
  • portable across control-plane nodes

RootToken is not just an internal storage cursor. It is the anchor for:

  • freshness
  • catch-up state
  • read eligibility
  • transition causality

3.2 Freshness

Freshness is the serving contract attached to a read.

It answers:

  • how fresh did the caller ask for?
  • how fresh was the returned answer?

3.3 CatchUpState

CatchUpState describes how far one Coordinator node has converged on rooted truth.

It answers:

  • can this node serve route reads?
  • can it satisfy bounded-freshness reads?
  • must it reload?
  • must it install bootstrap?

3.4 Transition

Transition is one rooted topology change that moves through a formal lifecycle.

Examples:

  • peer addition
  • peer removal
  • region split
  • region merge
  • region tombstone

Transition is not just a single event. It is a causally tracked change with:

  • identity
  • source truth position
  • phase
  • progress

3.5 DegradedMode

DegradedMode is the externally visible restriction level of the control plane.

It answers:

  • what kind of reads may still be served?
  • are rooted writes currently allowed?
  • should clients retry elsewhere?
  • is the node usable only as a stale view?

4. Naming Set

The protocol should use one stable vocabulary across:

  • API
  • code
  • logs
  • metrics
  • tests
  • docs

4.1 Read classes

  • Strong
    • requires leader-grade freshness
  • Bounded
    • allows follower service within explicit lag limits
  • BestEffort
    • allows stale cache service

These names are short and carry clear serving intent.

4.2 Catch-up actions

  • Reload
    • rebuild catalog from rooted storage
  • Advance
    • acknowledge rooted tail progress without a full rebuild
  • Bootstrap
    • install a fresh checkpoint because retained tail is insufficient
  • Reject
    • deny freshness-sensitive reads until convergence improves

4.3 Catch-up states

  • Fresh
  • Lagging
  • BootstrapRequired
  • Recovering
  • Unavailable

4.4 Degraded modes

  • Healthy
  • CoordinatorDegraded
  • RootLagging
  • RootUnavailable
  • ViewOnly

ViewOnly is deliberately chosen over more vague names like ExecutionOnly. This section only defines control-plane behavior, so the right question is:

can this node still expose a stale view?


5. Freshness Contract

The control plane should stop treating all successful reads as equivalent.

Every control-plane read should:

  1. declare the requested freshness class
  2. optionally declare a rooted lower bound
  3. receive an explicit served freshness result

5.1 Why this matters

Today, follower reads are effectively:

“good enough if the follower recently reloaded and is not too far behind”

That is practical, but not a protocol.

Without a formal freshness contract:

  • clients cannot reason about route read quality
  • tests cannot assert serving guarantees precisely
  • degraded modes remain guesswork
  • control-plane correctness is partly hidden in implementation details

5.2 Request fields

Control-plane read RPCs should be able to express:

  • freshness
    • Strong, Bounded, or BestEffort
  • required_root_token
    • optional lower bound on rooted truth already incorporated
  • max_root_lag
    • optional bound on acceptable rooted lag

Not every caller will need all three fields. But the protocol should have room for them.

5.3 Response fields

Control-plane read RPCs should return:

  • served_root_token
  • served_freshness
  • served_by_leader
  • degraded_mode

Optional future fields:

  • root_lag
  • freshness_reason

5.4 Serving rules

Strong

Should be served only when:

  • the node is rooted leader
  • and the serving catalog has incorporated at least the requested RootToken

If this is not true, the server should reject rather than silently downgrade.

Bounded

May be served by a follower when:

  • the node is not in BootstrapRequired, Recovering, or Unavailable
  • and lag is within declared bounds
  • and the served token satisfies required_root_token if one was requested

If bounds cannot be satisfied, the server should reject rather than silently serve stale data.

BestEffort

May be served from the current materialized catalog so long as:

  • the catalog exists
  • the node is not fully unavailable

This class exists to make stale service explicit instead of accidental.
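
Taken together, the three rules reduce to a small admission decision. The sketch below is illustrative Go, not the Coordinator's implementation; type names, string-typed enums, and fields are assumptions.

// Illustrative only: mirrors the serving rules above; reject rather than silently downgrade.
package sketch

type readRequest struct {
	Freshness         string // "Strong", "Bounded", or "BestEffort"
	RequiredRootToken uint64 // optional lower bound; 0 means none
	MaxRootLag        uint64 // only meaningful for Bounded
}

type nodeView struct {
	IsLeader     bool
	CatalogToken uint64 // rooted truth already incorporated by this node's catalog
	RootLag      uint64
	CatchUpState string // Fresh / Lagging / BootstrapRequired / Recovering / Unavailable
	HasCatalog   bool
}

func admitRead(req readRequest, v nodeView) bool {
	switch req.Freshness {
	case "Strong":
		return v.IsLeader && v.CatalogToken >= req.RequiredRootToken
	case "Bounded":
		if v.CatchUpState == "BootstrapRequired" || v.CatchUpState == "Recovering" || v.CatchUpState == "Unavailable" {
			return false
		}
		return v.RootLag <= req.MaxRootLag && v.CatalogToken >= req.RequiredRootToken
	default: // BestEffort: stale service is allowed, but only explicitly
		return v.HasCatalog && v.CatchUpState != "Unavailable"
	}
}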

5.5 First rollout target

The first RPC that should adopt this contract is:

  • GetRegionByKey

That gives the system a clear, high-value place to prove the model before wider rollout.


6. Rooted Catch-Up Protocol

NoKV already has a good catch-up foundation:

  • checkpoint
  • committed tail
  • watch-first subscription
  • bootstrap install when retained tail is insufficient

The next step is to give that behavior a formal state machine.

6.1 Catch-up state definitions

Fresh

The node’s materialized catalog is sufficiently close to rooted truth to serve:

  • Bounded
  • BestEffort

and, if leader, possibly Strong.

Lagging

The node is behind, but still within retained-tail recovery range.

This means:

  • further rooted tail observation may repair the gap
  • bootstrap install is not yet mandatory
  • some bounded reads may need to be rejected

BootstrapRequired

The node is too far behind for retained tail replay.

This means:

  • a plain reload from retained tail is not sufficient
  • a new checkpoint/bootstrap install is required
  • freshness-sensitive reads should be rejected

Recovering

The node is actively rebuilding its materialized control-plane view.

This means:

  • catalog may be in transition
  • only explicitly allowed stale reads may be served

Unavailable

The node cannot presently produce a valid control-plane view.

This means:

  • no rooted freshness contract can be satisfied
  • the server should fail reads except possibly future explicit diagnostics

6.2 Catch-up actions

Reload

Used when rooted truth advanced in a way that requires rebuilding the materialized catalog.

Advance

Used when rooted tail progressed, but the catalog does not need a full rebuild.

Bootstrap

Used when the node must install a checkpoint because retained tail can no longer bridge the gap.

Reject

Used when the node should refuse freshness-sensitive serving until it converges further.

6.3 Protocol outputs

The rooted subscription path should eventually expose a structured result like the one below (a Go sketch follows this list):

  • root_token_before
  • root_token_after
  • catch_up_state
  • catch_up_action
  • reload_required
  • bootstrap_required
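
An illustrative Go shape for that result, with field names taken from the list above (the type itself is a sketch, not an existing API):

// Illustrative shape for the rooted subscription result described above.
package sketch

type CatchUpResult struct {
	RootTokenBefore   uint64
	RootTokenAfter    uint64
	CatchUpState      string // Fresh / Lagging / BootstrapRequired / Recovering / Unavailable
	CatchUpAction     string // Reload / Advance / Bootstrap / Reject
	ReloadRequired    bool
	BootstrapRequired bool
}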

6.4 Why protocolizing this matters

Without explicit catch-up semantics:

  • tests can only assert indirect effects
  • follower-read serving policy stays implicit
  • degraded-mode logic gets duplicated
  • future clients cannot reason about retries properly

This is one of the strongest places for NoKV to become distinctive.


7. Transition Lifecycle Protocol

NoKV already records rooted topology intent and rooted completion. That is the start of a lifecycle, not yet the full protocol.

The next stage is to make transition tracking first-class.

7.1 Transition identity

Every topology transition should have a stable TransitionID.

TransitionID should be:

  • deterministic
  • durable
  • safe to log, surface, and test against

It should not require callers to infer identity from region ID, event kind, or timing alone.

7.2 Transition source

Every transition should record:

  • source rooted epoch or token
  • target topology intent
  • the event that created it

This makes causality explicit:

  • what truth position created this transition?
  • what later truth position superseded it?

7.3 Phase definitions

Planned

The rooted lifecycle assessment says the transition exists as an intended topology change, but the scheduler/control-plane runtime has not yet admitted it for forward progress.

This is the phase used by:

  • AssessRootEvent
  • PublishRootEventResponse.assessment

Admitted

The rooted transition is currently pending or open, and the scheduler/control-plane runtime has admitted it for execution progress.

This is the phase used by:

  • ListTransitions

It is intentionally runtime-facing. It does not appear in PublishRootEventResponse.assessment, because that response reports a pre-persist lifecycle assessment rather than post-admission runtime state.

Completed

The rooted lifecycle says the requested transition target is already satisfied. For a plan event, this usually means the requested topology is already present.

Cancelled

The rooted lifecycle says the requested transition target was cancelled.

Conflicted

The rooted lifecycle says a different pending transition already owns progress for the same target.

Superseded

The rooted lifecycle says a newer rooted topology already superseded this transition target.

Aborted

The rooted lifecycle says an apply or terminal event does not match the current pending rooted target.
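
The phases form a small closed set; an illustrative Go enumeration (not the actual pb or internal type) makes the vocabulary concrete:

// Illustrative enumeration of the transition phases defined above.
package sketch

type TransitionPhase int

const (
	PhasePlanned    TransitionPhase = iota // rooted intent exists; not yet admitted by the runtime
	PhaseAdmitted                          // pending/open and admitted for execution progress
	PhaseCompleted                         // requested target is already satisfied
	PhaseCancelled                         // requested target was cancelled
	PhaseConflicted                        // a different pending transition owns the same target
	PhaseSuperseded                        // a newer rooted topology superseded this target
	PhaseAborted                           // apply/terminal event does not match the pending target
)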

7.4 Why lifecycle matters

A formal lifecycle enables:

  • clear scheduling decisions
  • proper retry/backoff
  • stuck transition recovery
  • scheduler/control-plane runtime clarity
  • precise testing around publish boundaries

Without it, the system keeps relying on partial signals scattered across:

  • rooted events
  • in-memory views
  • runtime heuristics

8. Degraded Semantics

NoKV already has some degraded behavior:

  • followers serve stale route views
  • route cache may survive Coordinator outages
  • scheduler paths may degrade

These behaviors should become explicit protocol states.

8.1 Mode definitions

Healthy

Normal serving mode.

Rooted truth, catalog freshness, and serving guarantees are all within policy.

CoordinatorDegraded

The Coordinator process is alive, but not all control-plane functions can be performed normally.

Examples:

  • partial RPC surface availability
  • write restrictions while leadership is unsettled

RootLagging

Rooted truth exists, but this node’s materialized catalog is behind allowed freshness bounds.

This is not full unavailability. It is a serving restriction mode.

RootUnavailable

The rooted backend cannot currently provide enough truth to support valid control-plane service.

In this mode:

  • truth-sensitive reads fail
  • rooted writes fail
  • diagnostics may still be exposed

ViewOnly

The node may still expose a stale materialized catalog, but cannot satisfy freshness-sensitive contracts.

This mode is useful because it makes “stale but useful” explicit.

8.2 Why this should be formal

Without explicit degraded modes, callers only see:

  • transport failure
  • not leader
  • route unavailable

Those errors do not express the actual system state.

A real degraded protocol lets callers answer:

  • retry elsewhere?
  • retry later?
  • accept stale?
  • fail fast?

8.3 Relationship to freshness

DegradedMode and Freshness are related but not identical.

  • Freshness is the contract requested and served for one read
  • DegradedMode is the broader operating condition of the serving node

A node may be:

  • Healthy and still reject a Strong read because it is not leader
  • RootLagging and still serve BestEffort
  • ViewOnly and still serve diagnostics

That distinction should remain sharp.
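
A hedged sketch of how a caller could consume the two signals together: the response field names follow sections 5.3 and 8.4, while the decision policy itself is only an example, not prescribed client behavior.

// Illustrative client-side policy: freshness is per-read, degraded mode is per-node.
package sketch

type routeReply struct {
	ServedFreshness string // freshness actually served for this read
	DegradedMode    string // Healthy / CoordinatorDegraded / RootLagging / RootUnavailable / ViewOnly
	ServedByLeader  bool
}

// nextStep answers the four questions above: retry elsewhere, retry later,
// accept stale, or use the result.
func nextStep(r routeReply, acceptStale bool) string {
	switch r.DegradedMode {
	case "RootUnavailable":
		return "retry later" // truth-sensitive service cannot succeed on this node right now
	case "RootLagging", "ViewOnly":
		if acceptStale && r.ServedFreshness == "BestEffort" {
			return "accept stale" // stale but explicit
		}
		return "retry elsewhere"
	default:
		return "use result"
	}
}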

8.4 Current coordinator contract

The current implementation already enforces a concrete degraded-mode contract at the Coordinator RPC boundary.

Metadata reads (GetRegionByKey)

  • Freshness=BEST_EFFORT
    • serves from the local materialized catalog even when meta/root is currently unavailable
    • returns degraded_mode=ROOT_UNAVAILABLE when the rooted snapshot cannot be reloaded
    • returns degraded_mode=ROOT_LAGGING when the local catalog trails rooted truth
  • Freshness=BOUNDED
    • rejects when meta/root is unavailable
    • rejects when root_lag > max_root_lag
    • rejects when catch-up is still BOOTSTRAP_REQUIRED
  • Freshness=STRONG
    • rejects on followers
    • rejects whenever root_lag > 0
    • rejects when meta/root is unavailable

In all cases, successful replies carry the current answerability witness:

  • served_root_token
  • current_root_token
  • root_lag
  • catch_up_state
  • degraded_mode
  • serving_class
  • sync_health

Duty-gated writes (AllocID, TSO, scheduler decisions)

These do not have a degraded fallback.

  • the local coordinator must first campaign / renew the rooted lease
  • the rooted lease must still be active for the local holder
  • the rooted era must not already be sealed
  • the rooted duty mask must admit the requested action

If any of those fail, the request is rejected instead of falling back to stale local state. This is the current boundary between:

  • read-path degradation
  • write-path fail-stop admission

Lifecycle mutations (Seal, Confirm, Close, Reattach)

Lifecycle mutations are stricter than hot-path duty admission:

  • they always re-read rooted state from storage before mutating
  • they reject any stale-holder / expired-lease / sealed-era view
  • they treat finality as a rooted safety condition, not a best-effort hint

That is why seal / confirm / close / reattach do not use the cached mirror admission path.

8.5 Operational diagnostics

DiagnosticsSnapshot() now exports both:

  • the current degraded serving state (root, lease, audit, handover_witness)
  • cumulative Eunomia counters under eunomia_metrics

eunomia_metrics is grouped into:

  • tenure_era_transitions_total
  • handover_stage_transitions_total
  • gate_rejections_total
  • guarantee_violations_total

The guarantee_violations_total buckets map directly to the four Eunomia guarantees:

  • primacy
  • inheritance
  • silence
  • finality

9. API Direction

The most valuable first implementation step is at the Coordinator RPC boundary.

9.1 Read-side API direction

Read APIs should conceptually grow:

  • freshness
  • required_root_token
  • max_root_lag

Read responses should conceptually expose:

  • served_root_token
  • served_freshness
  • degraded_mode
  • served_by_leader

9.2 Write-side API direction

Leader-only writes should remain leader-only.

Write requests should continue to require:

  • rooted leadership
  • expected cluster epoch where applicable

Write responses should eventually expose:

  • accepted_root_token
  • transition_id where topology change is involved

This makes a write result more precise than:

  • accepted = true

9.3 Diagnostics API direction

The control plane will likely also benefit from an explicit diagnostics surface.

Conceptually, that should expose:

  • current rooted token
  • catalog rooted token
  • catch-up state
  • degraded mode
  • leader identity knowledge
  • lag estimate

This may become:

  • a dedicated diagnostics RPC
  • metrics
  • CLI output

or all three.


10. Storage and Catalog Direction

To support the protocol above, the Coordinator catalog should become rooted-token aware.

At minimum, the materialized control-plane view should track:

  • catalog_root_token
  • catalog_updated_at
  • catch_up_state
  • degraded_mode

Optional future metadata (a struct sketch follows this list):

  • root_lag
  • last_reload_reason
  • leader_observed
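
An illustrative Go shape for that per-node view metadata, with field names mirroring the lists above (the struct is a sketch, not the current coordinator/catalog type):

// Sketch of rooted-token-aware catalog metadata.
// Derived serving state only; never a second durable truth source.
package sketch

import "time"

type catalogMeta struct {
	CatalogRootToken uint64
	CatalogUpdatedAt time.Time
	CatchUpState     string
	DegradedMode     string

	// Optional future fields from the list above.
	RootLag          uint64
	LastReloadReason string
	LeaderObserved   bool
}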

10.1 Ownership rule

This design does not change truth ownership.

The ownership line remains:

  • meta/root owns durable truth
  • coordinator/catalog owns materialized serving state

The catalog should become more informative, not more authoritative.

10.2 Materialization rule

The catalog must remain:

  • rebuildable
  • discardable
  • follower-local

It should never become a second durable truth source.

That is a core invariant.


11. Invariants

This protocol should preserve the following invariants.

11.1 Truth ownership invariant

Only meta/root owns durable control-plane truth.

11.2 Materialization invariant

coordinator/catalog is always derived state, never authority.

11.3 Monotonic token invariant

The materialized rooted token of one node must never move backward.

11.4 No silent downgrade invariant

If a caller requests Strong or bounded freshness and the node cannot satisfy it, the server should reject rather than silently serve BestEffort.

11.5 Explicit stale service invariant

If stale service is allowed, the response should say so explicitly.

11.6 Transition identity invariant

Every control-plane transition must be referencable as a stable object, not just inferred from event timing.


12. Rollout State

The rollout stays incremental, but the first protocol line is already in use.

Phase 1: Freshness

Status: implemented

Delivered outcomes:

  • GetRegionByKey can express requested freshness
  • route responses disclose served freshness and rooted token
  • follower-read behavior is no longer implicit

Phase 2: Catch-Up

Status: minimal v1 implemented

Delivered outcomes:

  • CatchUpState
  • formal bootstrap-required boundary
  • rooted lag awareness in serving decisions

Still open:

  • a wider public CatchUpAction surface
  • more explicit recovery diagnostics

Phase 3: Transition

Status: minimal v1 implemented

Delivered outcomes:

  • durable TransitionID
  • explicit phase semantics across:
    • ListTransitions
    • AssessRootEvent
    • PublishRootEvent
  • publish-time pre-persist lifecycle assessment

Still open:

  • richer runtime phases
  • stuck / timeout diagnosis

Phase 4: DegradedMode

Status: minimal v1 implemented

Delivered outcomes:

  • explicit degraded semantics in route responses
  • route-serving rejection under rooted lag / rooted unavailability

Still open:

  • broader surfacing through metrics and diagnostics
  • tighter client retry policy based on degraded state

13. What Not To Do

The following are intentionally out of scope for this line of work:

  • inventing a new general-purpose consensus algorithm
  • replacing Raft in the mainline system
  • redesigning 2PC before control-plane semantics are explicit
  • collapsing rooted truth and catalog into one mixed layer
  • treating stale follower service as an undocumented optimization

NoKV’s control-plane innovation should come from stronger semantics and clearer ownership, not from unnecessary reinvention of already mature primitives.


14. Current Practical Naming Guidance

If this protocol starts landing in code, the implementation should prefer:

  • RootToken
  • Freshness
  • CatchUpState
  • CatchUpAction
  • TransitionID
  • DegradedMode

For execution-plane work, prefer:

  • Admission
  • ExecutionTarget
  • ExecutionOutcome
  • PublishState
  • RestartState

Avoid reintroducing weaker names like:

  • state kind
  • stale mode
  • sync status
  • reload reason as the primary protocol object

Those may still exist as helper fields, but the public model should stay anchored to the smaller protocol vocabulary above.


15. Execution-Plane Protocol

The execution plane is the contract between:

  • raftstore
  • local leader peer runtime
  • local durable recovery state
  • the control-plane publish boundary

Its job is different from the control plane.

The control plane answers:

  • what topology truth exists?
  • how fresh is the served view?
  • what transition lifecycle is visible globally?

The execution plane answers:

  • may this request enter local execution now?
  • what target is being executed?
  • how far has local execution progressed?
  • has terminal truth been published yet?
  • what state is safe to recover after restart?

15.1 Why this matters

Without an explicit execution-plane protocol, the system keeps important distributed safety semantics hidden in code paths such as:

  • request validation and cancellation
  • queue admission and local degradation
  • planned truth publication before local execution
  • terminal truth publication after local apply
  • restart reconciliation between localmeta, raft durable state, and Coordinator

Those are not low-level implementation details. They are correctness boundaries.

15.2 Protocol objects

The execution plane should be formalized around the following objects.

Admission

Admission is the local decision about whether one request may enter execution.

It should answer:

  • is the local peer leader?
  • is the region epoch valid?
  • is the peer hosted and runnable?
  • is the request cancelled or timed out already?
  • is the queue or scheduler allowed to accept more work?

The important design rule is that admission must be explicit, not an accidental mix of local checks and fallback retries.

ExecutionTarget

ExecutionTarget is the concrete unit of work the execution plane is trying to carry out.

Examples:

  • one read command
  • one raft write proposal
  • one peer change target
  • one split target
  • one merge target

For topology changes, ExecutionTarget must remain causally tied to the rooted transition object created by the control plane.

ExecutionOutcome

ExecutionOutcome is the local state reached by an admitted target.

Minimal useful states are:

  • Rejected
  • Queued
  • Proposed
  • Committed
  • Applied
  • Failed

This is the minimum needed to stop conflating “accepted by API”, “replicated by raft”, and “applied to local state”.

PublishState

PublishState tracks the boundary between local apply and control-plane truth publication.

This is a first-class boundary in NoKV’s architecture:

  • planned truth is published before execution
  • terminal truth is published after local apply

The protocol must therefore distinguish:

  • NotRequired
  • Pending
  • Published
  • PublishFailed

This is the exact boundary where split/merge/peer-change correctness otherwise turns into invisible best-effort behavior.

RestartState

RestartState describes whether one store can safely resume from local durable state.

It should answer:

  • is local peer metadata self-consistent?
  • is the local raft replay pointer usable?
  • does the store need Coordinator catch-up only, or local rebuild first?
  • is startup safe, degraded, or fatal?

This object exists to stop restart behavior from being an implicit composition of:

  • raftstore/localmeta
  • raft log replay
  • ad hoc bootstrap logic

15.3 Request classes and admission

Execution-plane v1 should start by distinguishing three request classes:

  • Read
    • local leader read admission
    • read-index / wait-applied preconditions
    • cancellation and deadline propagation
  • Write
    • raft proposal admission
    • proposal tracking through commit/apply
    • retryable local rejection vs fatal local rejection
  • Topology
    • peer change
    • split
    • merge
    • explicit coupling to planned and terminal rooted truth

These classes do not need separate RPC protocols, but they do need stable admission outcomes. At minimum, those outcomes should distinguish:

  • NotLeader
  • EpochMismatch
  • NotHosted
  • Canceled
  • TimedOut
  • QueueSaturated
  • SchedulerDegraded
  • Accepted

Without this line, request behavior remains split across store-local branches instead of becoming one coherent executor contract.
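
A minimal sketch of those outcomes as one Go enumeration could look like the following; the constant names mirror the list above, while the type name and the retryability split are assumptions made for illustration.

// Illustrative admission outcome enumeration (names follow the list above).
type AdmissionOutcome int

const (
    AdmissionAccepted AdmissionOutcome = iota
    AdmissionNotLeader
    AdmissionEpochMismatch
    AdmissionNotHosted
    AdmissionCanceled
    AdmissionTimedOut
    AdmissionQueueSaturated
    AdmissionSchedulerDegraded
)

// Retryable reports whether a rejection is worth retrying locally after a
// route/lease refresh; this particular classification is an assumption.
func (o AdmissionOutcome) Retryable() bool {
    switch o {
    case AdmissionNotLeader, AdmissionEpochMismatch,
        AdmissionQueueSaturated, AdmissionSchedulerDegraded:
        return true
    default:
        return false
    }
}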

15.4 Publish lifecycle

Execution-plane v1 should also make the publish boundary explicit for topology work.

The minimal lifecycle is:

  1. PlannedPublished
  2. LocallyExecuting
  3. Applied
  4. TerminalPublishPending
  5. TerminalPublished
  6. TerminalPublishFailed

The important rule is that Applied and TerminalPublished are different states. Local execution success does not mean global lifecycle completion until terminal truth is durably published.

This is the boundary that should align:

  • raftstore/store/transition_builder.go
  • raftstore/store/transition_executor.go
  • raftstore/store/transition_outcome.go
  • raftstore/store/scheduler_runtime.go
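
A minimal sketch of that lifecycle as explicit Go states; only the phase names come from the list above, the type and helper are illustrative.

// Illustrative publish lifecycle phases (names follow the minimal lifecycle above).
type PublishPhase int

const (
    PlannedPublished PublishPhase = iota
    LocallyExecuting
    Applied
    TerminalPublishPending
    TerminalPublished
    TerminalPublishFailed
)

// Complete is true only once terminal truth is durably published;
// Applied on its own never counts as lifecycle completion.
func (p PublishPhase) Complete() bool {
    return p == TerminalPublished
}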

15.5 First landing points

Execution-plane protocol v1 landed first in the places that already carried the boundary implicitly:

  • raftstore/store/command_ops.go
    • request admission and context semantics
  • raftstore/store/command_pipeline.go
    • request lifecycle states visible to callers
  • raftstore/store/scheduler_runtime.go
    • queue overflow / degraded local behavior
  • raftstore/store/transition_builder.go
    • execution target construction from rooted truth
  • raftstore/store/transition_executor.go
    • local execution and apply boundary
  • raftstore/store/transition_outcome.go
    • terminal truth publication result
  • raftstore/localmeta
    • restart state and local recovery truth

These files still do not expose a new public API. But they now share one explicit local protocol vocabulary instead of inventing those semantics independently.

15.6 Execution invariants

The execution-plane protocol should preserve the following invariants.

Admission invariant

Every externally visible rejection should map to a stable admission reason, not only a transport error or generic retry exhaustion.

No skipped publish boundary invariant

If local apply completed but terminal truth publication did not, the system must surface that state explicitly. It must not be silently treated as fully complete.

Restart truth boundary invariant

Restart must derive hosted peer truth from local durable state, not from bootstrap config. Static config may resolve addresses, but must not overwrite runtime truth.

No hidden drop invariant

Queue overflow, scheduler degradation, and publish retry loss must be explicit protocol states or metrics-backed outcomes, not silent local behavior.

15.7 Minimal rollout target

Execution-plane protocol v1 started small.

The minimum useful delivered line is now:

  1. request admission
  2. topology execution outcome
  3. publish boundary state
  4. restart state

That is enough to formalize the most dangerous boundaries without trying to protocolize every internal raft detail.


16. Priority and Rollout Order

The next protocol work should avoid widening either protocol; the current v1 contracts need to stay small, observable, and well tested first.

16.1 What is implemented now

The control plane has a minimal, externally visible contract:

  • freshness classes
  • rooted token / lag
  • degraded serving state
  • transition identity

The execution plane now has a minimal internal contract:

  • admission class / reason
  • topology outcome
  • publish state
  • restart state
  • admin-visible ExecutionStatus

That is enough for v1. It gives tests and operators names for the important boundaries without turning raftstore into a policy engine.

16.2 What should not happen next

The wrong next step would be to keep enriching lifecycle phases and diagnostic fields before the existing v1 state proves stable under recovery and integration tests.

That would create a vocabulary mismatch:

  • control plane claims richer transition semantics than the executor can act on
  • execution plane reports more states than the coordinator can use safely

The safer order is:

  1. Keep control-plane v1 and execution-plane v1 narrow.
  2. Add tests around the existing publish/restart/admission states before adding new states.
  3. Only then tighten control-plane v1 toward richer scheduler/runtime phases.

In short:

stabilize both v1 contracts first, then deepen scheduler/runtime semantics.

WAL Subsystem

NoKV’s write-ahead log mirrors RocksDB’s durability model and is implemented as a compact Go module similar to Badger’s journal. WAL appends happen alongside memtable writes (via lsm.SetBatch in the DB write pipeline), while values that are routed to the value log are written before the WAL so that replay always sees durable value pointers.


1. File Layout & Naming

  • Location: ${Options.WorkDir}/wal/.
  • Naming pattern: %05d.wal (e.g. 00001.wal).
  • Rotation threshold: configurable via wal.Config.SegmentSize (defaults to 64 MiB, minimum 64 KiB).
  • Verification: wal.VerifyDir ensures the directory exists prior to DB.Open.

On open, wal.Manager scans the directory, sorts segment IDs, and resumes the highest ID—exactly how RocksDB re-opens its MANIFEST and WAL sequence files.


2. Record Format

uint32 length (big-endian, includes type + payload)
uint8  type
[]byte payload
uint32 checksum (CRC32 Castagnoli over type + payload)
  • Checksums use kv.CastagnoliCrcTable, the same polynomial used by RocksDB (Castagnoli). Record encoding/decoding lives in engine/wal/record.go.
  • The type byte allows mixing LSM mutations with raft log/state/snapshot records in the same WAL segment.
  • Appends are buffered by bufio.Writer so batches become single system calls.
  • Replay stops cleanly at truncated tails; tests simulate torn writes by truncating the final bytes and verifying replay remains idempotent (engine/wal/manager_test.go::TestManagerReplayHandlesTruncate).
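
The following sketch shows how one record could be framed under that layout; it is illustrative only and is not the encoder in engine/wal/record.go.

import (
    "encoding/binary"
    "hash/crc32"
)

// encodeRecord frames one record: big-endian length over type+payload,
// the type byte, the payload, then a CRC32 (Castagnoli) over type+payload.
func encodeRecord(recType byte, payload []byte) []byte {
    buf := make([]byte, 4+1+len(payload)+4)
    binary.BigEndian.PutUint32(buf[0:4], uint32(1+len(payload))) // length = type + payload
    buf[4] = recType
    copy(buf[5:], payload)
    sum := crc32.Checksum(buf[4:5+len(payload)], crc32.MakeTable(crc32.Castagnoli))
    binary.BigEndian.PutUint32(buf[5+len(payload):], sum)
    return buf
}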

3. Public API (Go)

mgr, _ := wal.Open(wal.Config{Dir: path})
info, _ := mgr.AppendEntry(wal.DurabilityFlushed, entry)
batchInfo, _ := mgr.AppendEntryBatch(wal.DurabilityFlushed, entries)
typedInfos, _ := mgr.AppendRecords(wal.DurabilityFsyncBatched, wal.Record{
    Type:    wal.RecordTypeRaftEntry,
    Payload: raftPayload,
})
_ = mgr.Sync()
_ = mgr.Rotate()
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
    // reapply to memtable
    return nil
})

Key behaviours:

  • AppendEntry/AppendEntryBatch/AppendRecords automatically call ensureCapacity to decide when to rotate; they return EntryInfo{SegmentID, Offset, Length} so upper layers can keep storage WAL checkpoints and store-local raft replay pointers.
  • Every append must declare a DurabilityPolicy: DurabilityBuffered (writer buffer only), DurabilityFlushed (OS page cache), DurabilityFsync (fsync now), or DurabilityFsyncBatched (fsync contract reserved for the group-commit pipeline).
  • Sync flushes the active file (used for Options.SyncWrites).
  • Rotate forces a new segment (used after flush/compaction checkpoints similar to RocksDB’s LogFileManager::SwitchLog).
  • Replay iterates segments in numeric order, forwarding each payload to the callback. Errors abort replay so recovery can surface corruption early.
  • Metrics (wal.Manager.Metrics) reveal the active segment ID, total segments, and number of removed segments—these feed directly into StatsSnapshot and nokv stats output.
  • RegisterRetention lets LSM and raft participants publish the oldest WAL segment they still need. RemoveSegment rejects retained segments with ErrSegmentRetained.

Compared with Badger: Badger keeps a single vlog for both data and durability. NoKV splits WAL (durability) from vlog (value separation), matching RocksDB’s separation of WAL and blob files.


4. Integration Points

| Call Site | Purpose |
| --- | --- |
| lsm.memTable.setBatch | Encodes each entry (kv.EncodeEntry) and appends to WAL before inserting into the active memtable index (ART by default, skiplist when explicitly selected). |
| DB.commitWorker | Commit worker applies batched writes via writeToLSM, which calls lsm.SetBatch and appends one WAL entry-batch record per request batch. |
| DB.Set / DB.SetBatch / DB.SetWithTTL / DB.Del / DB.DeleteRange / DB.ApplyInternalEntries | User/internal writes all flow through the same commit queue and eventually reach lsm.SetBatch + WAL append. |
| engine/lsm/level_manager.go::flush | Persists WAL checkpoint via manifest.LogEdits(EditAddFile, EditLogPointer) during flush install. |
| engine/lsm/level_manager.go::flush + engine/lsm/levelManager.canRemoveWalSegment | Removes obsolete WAL segments after storage checkpoint and raftstore/localmeta replay constraints are satisfied. |
| db.runRecoveryChecks | Ensures WAL directory invariants before manifest replay, similar to Badger’s directory bootstrap. |

5. Metrics & Observability

Stats.collect reads the manager metrics and exposes them as:

  • NoKV.Stats.wal.active_segment
  • NoKV.Stats.wal.segment_count
  • NoKV.Stats.wal.segments_removed

The CLI command nokv stats --workdir <dir> prints these alongside backlog, making WAL health visible without manual inspection. In high-throughput scenarios the active segment ID mirrors RocksDB’s LOG number growth.


6. WAL Watchdog (Auto GC)

The WAL watchdog runs inside the DB process to keep WAL backlog in check and surface warnings when raft-typed records dominate the log. It:

  • Samples WAL metrics + per-segment metrics and combines them with raftstore/localmeta raft pointer snapshots to compute removable segments.
  • Filters those candidates through all registered WAL retention participants before deleting any segment.
  • Removes up to WALAutoGCMaxBatch segments when at least WALAutoGCMinRemovable are eligible.
  • Exposes counters (wal.auto_gc_runs/removed/last_unix) and warning state (wal.typed_record_ratio/warning/reason) through StatsSnapshot.WAL.

Relevant options (see options.go for defaults):

  • EnableWALWatchdog
  • WALAutoGCInterval
  • WALAutoGCMinRemovable
  • WALAutoGCMaxBatch
  • WALTypedRecordWarnRatio
  • WALTypedRecordWarnSegments

For hard admission control, wal.Config.MaxSegments rejects segment growth with ErrWALBackpressure once the WAL reaches the configured segment cap.


7. Recovery Walkthrough

  1. wal.Open reopens the highest segment, leaving the file pointer at the end (switchSegmentLocked).
  2. manifest.Manager supplies the WAL checkpoint (segment + offset) while building the version. Replay skips entries up to this checkpoint, ensuring we only reapply writes not yet materialised in SSTables.
  3. wal.Manager.Replay (invoked by the LSM recovery path) rebuilds memtables from entries newer than the manifest checkpoint. Value-log recovery only validates/truncates segments and does not reapply data.
  4. If the final record is partially written, the CRC mismatch stops replay and the segment is truncated during recovery tests, mimicking RocksDB’s tolerant behaviour.
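
A hedged sketch of that checkpoint-aware replay, assuming the checkpoint is available as the manifest Version’s (LogSegment, LogOffset) pair, that wal.EntryInfo exposes SegmentID/Offset as shown earlier, and that the field types line up; applyToMemtable is a hypothetical reapply hook.

ckptSeg, ckptOff := version.LogSegment, version.LogOffset
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
    if info.SegmentID < ckptSeg ||
        (info.SegmentID == ckptSeg && info.Offset < ckptOff) {
        return nil // already materialised in SSTables; skip
    }
    return applyToMemtable(payload) // hypothetical reapply into a fresh index
})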

8. Operational Tips

  • Configure Options.SyncWrites=true for DB-level synchronous durability. Raft log/state/snapshot appends use the WAL DurabilityFsyncBatched contract directly.
  • After large flushes, forcing Rotate keeps WAL files short, reducing replay time.
  • Archived WAL segments can be copied alongside manifest files for hot-backup strategies—since the manifest contains the WAL log number, snapshots behave like RocksDB’s Checkpoints.

9. Truncation Metadata

  • raftstore/raftlog/wal_storage keeps a per-group index of [firstIndex,lastIndex] spans for each WAL record so it can map raft log indices back to the segment that stored them.
  • When a log is truncated (either via snapshot or future compaction hooks), the store-local metadata in raftstore/localmeta is updated with the index/term, segment ID (RaftLogPointer.SegmentIndex), and byte offset (RaftLogPointer.TruncatedOffset) that delimit the remaining WAL data.
  • engine/lsm/levelManager.canRemoveWalSegment blocks garbage collection whenever any raft group still references a segment through that store-local truncation metadata, preventing slow followers from losing required WAL history while letting aggressively compacted groups release older segments earlier.

For broader context, read the architecture overview and flush pipeline documents.

Memtable Design & Lifecycle

NoKV’s write path mirrors RocksDB: every write lands in the WAL and an in-memory memtable backed by a selectable in-memory index (skiplist or ART). The implementation lives in engine/lsm/memtable.go and ties directly into the concrete flush queue in engine/lsm/flush_runtime.go.


1. Structure

type memTable struct {
    lsm        *LSM
    segmentID  uint32   // WAL segment backing this memtable
    index      memIndex // skiplist or ART index (see memIndex below)
    maxVersion uint64   // highest entry version seen by this memtable
    walSize    int64    // bytes of the backing WAL segment consumed so far
}

The memtable index is an interface that can be backed by either a skiplist or ART:

type memIndex interface {
    Add(*kv.Entry)
    Search([]byte) ([]byte, kv.ValueStruct)
    NewIterator(*utils.Options) utils.Iterator
    MemSize() int64
    IncrRef()
    DecrRef()
}
  • Memtable engine – Options.MemTableEngine selects art (default) or skiplist via newMemIndex. ART is not a generic trie: it is an internal-key-only memtable index. It uses a reversible mem-comparable route key so trie ordering matches the LSM internal-key comparator; skiplist remains available as the simpler baseline alternative.
  • Arena sizing – both utils.NewSkiplist and utils.NewART use arenaSizeFor to derive arena capacity from Options.MemTableSize.
  • WAL coupling – every Set uses kv.EncodeEntry to materialise the payload to the active WAL segment before inserting into the chosen index. walSize tracks how much of the segment is consumed so flush can release it later.
  • Segment ID – LSM.NewMemtable atomically increments levels.maxFID, switches the WAL to a new segment (wal.Manager.SwitchSegment), and tags the memtable with that FID. This matches RocksDB’s logfile_number field.
  • ART specifics – ART stores prefix-compressed inner nodes (Node4/16/48/256). Each leaf keeps both the private route key used by the trie and the original canonical internal key returned to callers. The concurrency model is pure copy-on-write payload/node cloning with CAS installs: published payloads are immutable, reads stay lock-free, and writers only publish fully-built cloned payloads.

2. Lifecycle

sequenceDiagram
    participant WAL
    participant MT as MemTable
    participant Flush
    participant Manifest
    WAL->>MT: Append+Set(entry)
    MT->>Flush: freeze (walSize + incomingEstimate > limit)
    Flush->>Manifest: LogPointer + AddFile
    Manifest-->>Flush: ack
    Flush->>WAL: Release segments ≤ segmentID
  1. Active → Immutable – when mt.walSize + estimate exceeds Options.MemTableSize, the memtable is rotated and pushed onto the flush queue. The new active memtable triggers another WAL segment switch.
  2. Flush – the flush queue drains immutable memtables, builds SSTables, logs manifest edits, and releases the WAL segment ID recorded in memTable.segmentID once the SST is durably installed.
  3. Recovery – LSM.recovery scans WAL files, reopens memtables per segment (most recent becomes active), and deletes segments ≤ the manifest’s log pointer. Entries are replayed via wal.Manager.ReplaySegment into fresh indexes and the active in-memory state is rebuilt.

Badger follows the same pattern, while RocksDB often uses skiplist-backed arenas with reference counting—NoKV reuses Badger’s arena allocator for simplicity.


3. Read Semantics

  • memTable.Get looks up the chosen index and returns a borrowed, ref-counted *kv.Entry from the internal pool. The index search returns the matched internal key plus value struct, so memtable hit entries carry the concrete version key instead of the query sentinel key. Internal callers must release borrowed entries with DecrRef when done.
  • MemTable.IncrRef/DecrRef delegate to the index, allowing iterators to hold references while the flush manager processes immutable tables—mirroring RocksDB’s MemTable::Ref/Unref lifecycle.
  • WAL-backed values that exceed the value threshold are stored as pointers; the memtable stores the encoded pointer, and the transaction/iterator logic reads from the vlog on demand.
  • DB.Get returns detached entries; callers must not call DecrRef on them.
  • DB.GetInternalEntry returns borrowed entries; callers must call DecrRef exactly once.
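
A short usage sketch of that borrow/detach contract; the exact signatures of DB.GetInternalEntry and DB.Get, and the consumer helpers, are assumptions.

// Borrowed entry: release exactly once when done.
entry, err := db.GetInternalEntry(key)
if err != nil {
    return err
}
defer entry.DecrRef()
processInternal(entry) // hypothetical internal consumer

// Detached entry: caller-owned, never call DecrRef on it.
detached, err := db.Get(key)
if err != nil {
    return err
}
serve(detached) // hypothetical read-path consumer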

4. Integration with Other Subsystems

| Subsystem | Interaction |
| --- | --- |
| Distributed 2PC | kv.Apply + percolator write committed MVCC versions through the same WAL/memtable pipeline in raft mode. |
| Manifest | Flush completion logs EditLogPointer(segmentID) so restart can discard WAL files already persisted into SSTs. |
| Stats | Stats.Snapshot pulls FlushPending/Active/Queue counters via lsm.FlushMetrics, exposing how many immutables are waiting. |
| Value Log | lsm.flush emits discard stats keyed by segmentID, letting the value log GC know when entries become obsolete. |

5. Comparison

| Aspect | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Data structure | Skiplist + arena | Skiplist + arena | Skiplist or ART + arena (art default) |
| WAL linkage | logfile_number per memtable | Segment ID stored in vlog entries | segmentID on memTable, logged via manifest |
| Recovery | Memtable replays from WAL, referencing MANIFEST | Replays WAL segments | Replays WAL segments, prunes ≤ manifest log pointer |
| Flush trigger | Size/entries/time | Size-based | WAL-size budget (walSize) with explicit queue metrics |

6. Operational Notes

  • Tuning Options.MemTableSize affects WAL segment count and flush latency. Larger memtables reduce flush churn but increase crash recovery time.
  • ART currently uses noticeably more memindex arena memory than skiplist because it stores both route keys and original internal keys in leaves; in local measurements the ART memindex is roughly 2x the skiplist memindex footprint for the same key/value set.
  • Monitor NoKV.Stats.flush.* fields to catch stalled immutables—an ever-growing queue often indicates slow SST builds or manifest contention.
  • Because memtables carry WAL segment IDs, deleting WAL files manually can lead to recovery failures; always rely on the engine’s manifest-driven cleanup.

See docs/flush.md for the end-to-end flush scheduler and [docs/architecture.md](architecture.md#3-end-to-end-write-flow) for where memtables sit in the write pipeline.

MemTable Flush Pipeline

NoKV’s flush path converts immutable memtables into L0 SST files, then advances the manifest WAL checkpoint and reclaims obsolete WAL segments. The queue and timing bookkeeping live directly in engine/lsm/flush_runtime.go; SST persistence and manifest install are in engine/lsm/table_builder.go and engine/lsm/level_manager.go.


1. Responsibilities

  1. Persistence: materialize immutable memtables into SST files.
  2. Ordering: publish SST metadata to manifest only after the SST is durably installed (strict mode).
  3. Cleanup: remove WAL segments once checkpoint and raft constraints allow removal.
  4. Observability: export queue/build/release timing through flush metrics.

2. Concrete Flush Queue

flowchart LR
    Active[Active MemTable]
    Immutable[Immutable MemTable]
    FlushQ[flush queue]
    Build[Build SST]
    Install[Install SST]
    Release[Release MemTable]

    Active -->|threshold reached| Immutable --> FlushQ
    FlushQ --> Build --> Install --> Release --> Active
  • Enqueue: lsm.submitFlush pushes the immutable memtable into the concrete flush queue and records wait-start time.
  • Build: worker pulls the next task, builds the SST (levelManager.flush -> openTable -> tableBuilder.flush).
  • Install: after SST + manifest edits succeed, the worker records install timing.
  • Release: worker removes the immutable from memory, closes the memtable, records release timing, and completes the task.

3. SST Persistence Modes

Flush uses two write modes controlled by Options.ManifestSync:

  1. Fast path (ManifestSync=false)

    • Writes SST directly to final filename with O_CREATE|O_EXCL.
    • No temp file/rename step.
    • Highest throughput, weaker crash-consistency guarantees.
  2. Strict path (ManifestSync=true)

    • Writes to "<table>.tmp.<pid>.<ns>".
    • tmp.Sync() to persist SST bytes.
    • RenameNoReplace(tmp, final) installs file atomically. If unsupported by platform/filesystem, returns vfs.ErrRenameNoReplaceUnsupported.
    • SyncDir(workdir) is called before manifest edit so directory entry is durable.

This is the durability ordering used by current code.
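
A hedged sketch of that strict-path ordering, using the vfs helpers named on this page; writeSST, syncFile, and the edit values are illustrative stand-ins rather than real functions.

// Strict install: SST bytes, atomic rename, directory fsync, then manifest edit.
tmp := finalPath + ".tmp.<pid>.<ns>" // placeholder for the documented temp suffix
if err := writeSST(tmp, blocks); err != nil {
    return err
}
if err := syncFile(tmp); err != nil { // persist SST bytes (tmp.Sync())
    return err
}
if err := fs.RenameNoReplace(tmp, finalPath); err != nil {
    return err // may be vfs.ErrRenameNoReplaceUnsupported
}
if err := vfs.SyncDir(fs, workdir); err != nil { // make the directory entry durable
    return err
}
return manifest.LogEdits(addFileEdit, logPointerEdit) // publish only after install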


4. Execution Path in Code

  1. lsm.Set/lsm.SetBatch detects walSize + estimate > MemTableSize and rotates memtable.
  2. Rotated memtable is submitted to the flush queue (lsm.submitFlush).
  3. Worker executes levelManager.flush(mt):
    • iterates memtable entries,
    • builds SST via tableBuilder,
    • prepares manifest edits: EditAddFile + EditLogPointer.
  4. In strict mode, SyncDir runs before manifest.LogEdits(...).
  5. On successful manifest commit, table is added to L0 and wal.RemoveSegment runs when allowed.

5. Recovery Notes

  • Startup rebuild (levelManager.build) validates manifest SST entries against disk.
  • Missing or unreadable SSTs fail startup; normal restart does not repair manifest state by deleting referenced files.
  • Temp SST names are only used in strict mode and are created in WorkDir with suffix .tmp.<pid>.<ns> (not a dedicated tmp/ directory).

6. Metrics & CLI

flushRuntime.stats() feeds StatsSnapshot.Flush:

  • pending, queue, active
  • wait/build/release totals, counts, last, max
  • completed

Use:

nokv stats --workdir <dir>

to inspect flush backlog and latency.


7. Tests

  • engine/lsm/flush_runtime_test.go: queue lifecycle and timing counters.
  • db_test.go::TestRecoveryWALReplayRestoresData: replay still restores data after crash before flush completion.
  • db_test.go::TestRecoveryFailsOnMissingSST and db_test.go::TestRecoveryFailsOnCorruptSST: startup fails when manifest SSTs are missing or corrupt.

See also recovery.md, memtable.md, and wal.md.

Compaction & Cache Strategy

NoKV’s compaction pipeline borrows the leveled‑LSM layout from RocksDB, but layers it with a landing buffer, lightweight cache telemetry, and simple concurrency guards so the implementation stays approachable while still handling bursty workloads.


1. Overview

Compactions are orchestrated by lsm.compaction with lsm.levelManager supplying scheduling input and executing the plan. Each level owns two lists of tables:

  • tables – the canonical sorted run for the level.
  • landing – a landing buffer that temporarily holds SSTables moved from the level above when there is not yet enough work (or bandwidth) to do a full merge.

The compaction runtime periodically calls into the picker to build a list of Priority entries. The priorities consider three signals:

  1. L0 table count – loosely capped by Options.NumLevelZeroTables.
  2. Level size vs target – computed by levelTargets(), which dynamically adjusts the “base” level depending on total data volume.
  3. Landing buffer backlog – if a level’s landing shards have data, they receive elevated scores so landed tables are merged promptly.

The highest adjusted score is processed first. L0 compactions can either move tables into the landing buffer of the base level (cheap re‑parenting) or compact directly into a lower level when the overlap warrants it.

Planning now happens via Plan: LSM snapshots table metadata into TableMeta, PlanFor* selects table IDs + key ranges, and LSM resolves the plan back to *table before executing.


2. Landing Buffer

moveToLanding (see engine/lsm/compaction_executor.go) performs a metadata-only migration:

  1. Records a manifest.EditDeleteFile for the source level.
  2. Logs a new manifest.EditAddFile targeting the destination level.
  3. Removes the table from thisLevel.tables and appends it to nextLevel.landing.

This keeps write amplification low when many small L0 tables arrive at once. Reads still see the newest data because levelHandler.searchLNSST checks the landing buffer before consulting the canonical level tables.

Compaction tests (engine/lsm/compaction_test.go) assert that after calling moveToLanding the table disappears from the source level and shows up in the landing buffer.


3. Concurrency Guards

To prevent overlapping compactions:

  • State.CompareAndAdd tracks the key range of each in-flight compaction per level.
  • Attempts to register a compaction whose ranges intersect an existing one are rejected.
  • When a compaction finishes, State.Delete removes the ranges and table IDs from the guard.

This mechanism is intentionally simple—just a mutex‐protected slice—yet effective in tests (TestCompactStatusGuards) that simulate back‑to‑back registrations on the same key range.


4. Cache Telemetry

NoKV’s cache strategy has two explicit user-space caches plus direct bloom probing from decoded table indexes (engine/lsm/cache.go):

| Component | Purpose | Metrics hook |
| --- | --- | --- |
| Block cache | Ristretto cache for L0/L1 blocks. | cacheMetrics.recordBlock(level, hit) |
| OS page cache path | Deeper levels bypass user-space cache and rely on mmap + kernel page cache. | Same as above |
| Bloom filters | Embedded in pb.TableIndex and probed directly from the decoded index. | No separate cache layer |

Cache hit/miss signals are exported through StatsSnapshot.Cache (and surfaced by nokv stats / expvar), which is especially helpful when tuning landing behaviour. If L0/L1 cache misses spike, the landing buffer likely needs to be drained faster. TestCacheHotColdMetrics verifies cache hit accounting.


5. Interaction with Value Log

Compaction informs value‑log GC via discard statistics:

  1. During subcompact, every entry merged out is inspected. If it stores a ValuePtr, the amount is added to the discard map.
  2. At the end of subcompaction, the accumulated discard map is pushed through setDiscardStatsCh.
  3. valueLog receives the stats and can safely rewrite or delete vlog segments with predominantly obsolete data.

This tight coupling keeps the value log from growing indefinitely after heavy overwrite workloads.


6. Testing Checklist

Relevant tests to keep compaction healthy:

  • engine/lsm/compaction_test.go
    • TestCompactionMoveToLanding – ensures metadata migration works and the landing buffer grows.
    • TestCompactStatusGuards – checks overlap detection.
  • engine/lsm/cache_test.go
    • TestCacheHotColdMetrics – validates cache hit accounting.
  • engine/lsm/lsm_test.go
    • TestCompact / TestHitStorage – end‑to‑end verification that data remains queryable across memtable flushes and compactions.

When adding new compaction heuristics or cache behaviour, extend these tests (or introduce new ones) so the behaviour stays observable.


7. Practical Tips

  • Tune Options.LandingCompactBatchSize when landing queues build up; increasing it lets a single move cover more tables.
  • Observe NoKV.Stats.cache.* and NoKV.Stats.compaction.* via the CLI (nokv stats) to decide whether you need more compaction workers or bigger caches.
  • For workloads dominated by range scans, consider increasing Options.BlockCacheBytes if you want to keep more L0/L1 blocks in the user-space cache; cold data relies on the OS page cache.
  • Keep an eye on NoKV.Stats.value_log.gc (for example gc_runs and head_updates); if compactions are generating discard stats but the value log head doesn’t move, GC thresholds may be too conservative.

With these mechanisms, NoKV stays resilient under bursty writes while keeping the code path small and discoverable—ideal for learning or embedding. Dive into the source files referenced above for deeper implementation details.

Landing Buffer Architecture

The landing buffer is a per-level landing area for SSTables—typically promoted from L0—designed to absorb bursts, reduce overlap, and unlock parallel compaction without touching the main level tables immediately. It combines fixed sharding, adaptive scheduling, and optional LandingKeep (landing-merge) passes to keep write amplification and contention low.

flowchart LR
  L0["L0 SSTables"] -->|moveToLanding| Landing["Landing Buffer (sharded)"]
  subgraph levelN["Level N"]
    Landing -->|LandingDrain: landing-only| MainTables["Main Tables"]
    Landing -->|LandingKeep: landing-merge| Landing
  end
  Landing -.read path merge.-> ClientReads["Reads/Iterators"]

Design Highlights

  • Sharded by key prefix: landing tables are routed into fixed shards (top bits of the first byte). Sharding cuts cross-range overlap and enables safe parallel drain.
  • Snapshot-friendly reads: landing tables are read under the level RLock, and iterators hold table refs so mmap-backed data stays valid without additional snapshots.
  • Two landing paths:
    • Landing-only compaction: drain landing → main level (or next level) with optional multi-shard parallelism guarded by LandingMode.
    • Landing-merge: compact landing tables back into landing (stay in-place) to drop superseded versions before promoting, reducing downstream write amplification.
  • LandingMode enum: plans carry a LandingMode with LandingNone, LandingDrain, and LandingKeep. LandingDrain corresponds to landing-only (drain into main tables), while LandingKeep corresponds to landing-merge (compact within landing).
  • Adaptive scheduling:
    • Shard selection is driven by PickShardOrder / PickShardByBacklog using per-shard size, age, and density.
    • Shard parallelism scales with backlog score (based on shard size/target file size) bounded by LandingShardParallelism.
    • Batch size scales with shard backlog to drain faster under pressure.
    • Landing-merge triggers when backlog score exceeds LandingBacklogMergeScore (default 2.0), with dynamic lowering under extreme backlog/age.
  • Observability: expvar/stats expose LandingDrain vs LandingKeep counts, duration, and tables processed, plus landing size/value density per level/shard.
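
As a rough sketch of that per-shard decision (treating the backlog score as shard size over target file size, per the scheduling notes above; the surrounding variables are illustrative):

// Illustrative choice between the two landing paths for one shard.
score := float64(shardBytes) / float64(targetFileSize)
switch {
case score >= opts.LandingBacklogMergeScore: // default 2.0
    plan.Mode = LandingKeep // landing-merge: compact within the landing buffer
case shardBytes > 0:
    plan.Mode = LandingDrain // landing-only: drain into the main tables
default:
    plan.Mode = LandingNone // nothing to schedule for this shard
}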

Configuration

  • LandingShardParallelism: max shards to compact in parallel (default max(NumCompactors/2, 2), auto-scaled by backlog).
  • LandingCompactBatchSize: base batch size per landing compaction (auto-boosted by shard backlog).
  • LandingBacklogMergeScore: backlog score threshold to trigger LandingKeep/landing-merge (default 2.0).

Benefits

  • Lower write amplification: bursty L0 SSTables land in the landing buffer first; LandingKeep/landing-merge prunes duplicates before full compaction.
  • Reduced contention: sharding + State allow parallel landing drain with minimal overlap.
  • Predictable reads: landing is part of the read snapshot, so moving tables in/out does not change read semantics.
  • Tunable and observable: knobs for parallelism and merge aggressiveness, with per-path metrics to guide tuning.

Future Work

  • Deeper adaptive policies (IO/latency-aware), richer shard-level metrics, and more exhaustive parallel/restart testing under fault injection.

Value Log (vlog) Design

NoKV keeps the LSM tree lean by separating large values into sequential value log (vlog) files. The module is split between

  • engine/vlog/manager.go – owns the open file set, rotation, and segment lifecycle helpers.
  • engine/vlog/io.go – append/read/iterate/verify/sample IO paths.
  • vlog.go – integrates the manager with the DB write path, discard statistics, and garbage collection (GC).

The design echoes BadgerDB’s value log while remaining manifest-driven like RocksDB’s blob_db: vlog metadata (head pointer, pending deletions) is persisted inside the manifest so recovery can reconstruct the exact state without scanning the filesystem.


1. Layering (Engine View)

The value log is split into three layers so IO can stay reusable while DB policy lives in the core package:

  • DB policy layer (vlog.go, vlog_gc.go) – integrates the manager with the DB write path, persists vlog metadata in the manifest, and drives GC scheduling based on discard stats.
  • Value-log manager (engine/vlog/) – owns segment lifecycle (open/rotate/remove), encodes/decodes entries, and exposes append/read/sample APIs without touching MVCC or LSM policy.
  • File IO (engine/file/) – mmap-backed LogFile primitives (open/close/truncate, read/write, read-only remap) shared by WAL/vlog/SST. Vlog currently uses LogFile directly instead of an intermediate store abstraction.

2. Directory Layout & Naming

<workdir>/
  vlog/
    bucket-000/
      00000.vlog
      00001.vlog
    bucket-001/
      00000.vlog
      00001.vlog
    ...
  • Files are named %05d.vlog and live under workdir/vlog/bucket-XXX/ when Options.ValueLogBucketCount > 1. Manager.populate discovers existing segments at open.
  • Manager tracks the active file ID (activeID) and byte offset; Manager.Head exposes these so the manifest can checkpoint them (manifest.EditValueLogHead).
  • Files created after a crash but never linked in the manifest are removed during valueLog.reconcileManifest.
  • Production defaults keep ordinary hash bucketization enabled (ValueLogBucketCount=16). Bucket selection is plain hash routing; it no longer depends on Thermos-based hot/cold classification.

3. Record Format

The vlog uses the shared encoding helper (kv.EncodeEntryTo), so entries written to the value log and the WAL are byte-identical.

+--------+----------+------+-------------+-----------+-------+
| KeyLen | ValueLen | Meta | ExpiresAt   | Key bytes | Value |
+--------+----------+------+-------------+-----------+-------+
                                             + CRC32 (4 B)
  • Header fields are varint-encoded (kv.EntryHeader).
  • The first 20 bytes of every segment are reserved (kv.ValueLogHeaderSize) for future metadata; iteration always skips this fixed header.
  • kv.EncodeEntry and the entry iterator (kv.EntryIterator) perform the layout work, and each append finishes with a CRC32 to detect torn writes.
  • vlog.VerifyDir scans all segments with sanitizeValueLog to trim corrupted tails after crashes, mirroring RocksDB’s blob_file::Sanitize. Badger performs a similar truncation pass at startup.

4. Manager API Surface

mgr, _ := vlog.Open(vlog.Config{Dir: "...", MaxSize: 1<<29})
ptr, _ := mgr.AppendEntry(entry)
ptrs, _ := mgr.AppendEntries(entries, writeMask)
val, unlock, _ := mgr.Read(ptr)
unlock()             // release per-file lock
_ = mgr.Rewind(*ptr) // rollback partially written batch
_ = mgr.Remove(fid)  // close + delete file

Key behaviours:

  1. Append + Rotate – Manager.AppendEntry encodes and appends into the active file. The reservation path handles rotation when the active segment would exceed MaxSize; manual rotation is rare.
  2. Crash recovery – Manager.Rewind truncates the active file and removes newer files when a write batch fails mid-flight. valueLog.write uses this to guarantee idempotent WAL/value log ordering.
  3. Safe reads – Manager.Read returns an mmap-backed slice plus an unlock callback. Active segments take a per-file RWMutex, while sealed segments use a pin/unpin path to avoid long-held locks; callers that need ownership should copy the bytes before releasing the lock.
  4. Verification – VerifyDir validates entire directories (used by CLI and recovery) by parsing headers and CRCs.

Compared with RocksDB’s blob manager the surface is intentionally small—NoKV treats the manager as an append-only log with rewind semantics, while RocksDB maintains index structures inside the blob file metadata.


5. Integration with DB Writes

sequenceDiagram
    participant Commit as commitWorker
    participant Mgr as vlog.Manager
    participant WAL as wal.Manager
    participant Mem as MemTable
    Commit->>Mgr: AppendEntries(entries, writeMask)
    Mgr-->>Commit: ValuePtr list
    Commit->>WAL: Append(entries+ptrs)
    Commit->>Mem: apply to active memtable index
  1. valueLog.write builds a write mask for each batch, then delegates to Manager.AppendEntries. Entries staying in LSM (shouldWriteValueToLSM) receive zero-value pointers.
  2. Rotation is handled inside the manager when the reserved bytes would exceed MaxSize. The WAL append happens after the value log append so crash replay observes consistent pointers.
  3. Any error triggers Manager.Rewind back to the saved head pointer, removing new files and truncating partial bytes. vlog_test.go exercises both append- and rotate-failure paths.
  4. All batched writes share the same pipeline: the commit worker always writes the value log first, then applies to WAL/memtable, keeping ordering and durability consistent.
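
A hedged sketch of that ordering (value log first, then WAL, then memtable) with Rewind as the rollback; the variable names, the head-capture step, and pointer attachment are illustrative and partly elided.

// Illustrative commit ordering for one batch.
head := vlogMgr.Head() // remember where this batch starts
ptrs, err := vlogMgr.AppendEntries(entries, writeMask)
if err != nil {
    _ = vlogMgr.Rewind(head) // undo partially written vlog bytes
    return err
}
_ = ptrs // value pointers travel with the entries into WAL/LSM
if _, err := walMgr.AppendEntryBatch(wal.DurabilityFlushed, entries); err != nil {
    _ = vlogMgr.Rewind(head)
    return err
}
return lsm.SetBatch(entries) // apply to the active memtable index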

Badger follows the same ordering (value log first, then write batch). RocksDB’s blob DB instead embeds blob references into the WAL entry before the blob file write, relying on two-phase commit between WAL and blob; NoKV avoids the extra coordination by reusing a single batching loop.


6. Discard Statistics & GC

flowchart LR
  Compaction -- "obsolete ptrs" --> DiscardStats
  DiscardStats -->|"batch json"| writeCh
  valuePtr["valueLog.newValuePtr(lfDiscardStatsKey)"]
  writeCh --> valuePtr
  valueLog -- "GC trigger" --> Manager

  • lfDiscardStats aggregates per-file discard counts from LSM compaction workers (engine/lsm/compaction_executor.go, subcompact -> updateDiscardStats). Once the in-memory counter crosses discardStatsFlushThreshold, it marshals the map into JSON and writes it back through the DB pipeline under the special key !NoKV!discard.
  • valueLog.flushDiscardStats consumes those stats, ensuring they are persisted even across crashes. During recovery valueLog.populateDiscardStats replays the JSON payload to repopulate the in-memory map.
  • GC uses discardRatio = discardedBytes/totalBytes derived from Manager.Sample, which applies windowed iteration based on configurable ratios. If a file exceeds the configured threshold, valueLog.doRunGC rewrites live entries into the current head via db.batchSet (the normal commit pipeline) and then valueLog.rewrite triggers manifest delete edits through removeValueLogFile.
    • Sampling behaviour is controlled by Options.ValueLogGCSampleSizeRatio (default 0.10 of the file) and Options.ValueLogGCSampleCountRatio (default 1% of the configured entry limit). Setting either to <=0 keeps the default heuristics. Options.ValueLogGCSampleFromHead starts sampling from the beginning instead of a random window.
  • Completed deletions are logged via lsm.LogValueLogDelete so the manifest can skip them during replay. When GC rotates to a new head, valueLog.updateHead records the pointer and increments NoKV.Stats.value_log.gc.head_updates.

RocksDB’s blob GC leans on compaction iterators to discover obsolete blobs. NoKV, like Badger, leverages flush/compaction discard stats so GC does not need to rescan SSTs.


7. GC Scheduling & Parallelism

NoKV runs value-log GC with bucket-aware parallelism while protecting the LSM from overload:

  • ValueLogGCParallelism controls the maximum number of concurrent GC tasks. When set to <= 0, it auto-tunes to max(NumCompactors/2, 1) and is capped by the bucket count.
  • Each bucket has a lock-free guard, so no two GC jobs run in the same bucket at once.
  • A lightweight semaphore limits total concurrency without blocking the GC scheduler thread.

Pressure-aware throttling

Compaction backlog and score feed into the GC scheduler:

  • Reduce: when compaction backlog or max score crosses ValueLogGCReduceBacklog / ValueLogGCReduceScore, GC parallelism is halved.
  • Skip: when compaction backlog or max score crosses ValueLogGCSkipBacklog / ValueLogGCSkipScore, GC is skipped for that tick.

This keeps GC throughput high under light load but avoids compaction starvation under pressure.
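
A self-contained sketch of that policy using the thresholds above; the option struct, field grouping, and exact comparison directions are assumptions for illustration (the real knobs live in options.go).

// Hypothetical option subset for illustration only.
type gcPressureOptions struct {
    Parallelism   int     // ValueLogGCParallelism (<= 0 means auto-tune)
    NumCompactors int
    BucketCount   int     // ValueLogBucketCount
    ReduceBacklog int     // ValueLogGCReduceBacklog
    ReduceScore   float64 // ValueLogGCReduceScore
    SkipBacklog   int     // ValueLogGCSkipBacklog
    SkipScore     float64 // ValueLogGCSkipScore
}

// gcParallelism returns how many GC tasks this tick may run; 0 means skip.
func gcParallelism(o gcPressureOptions, backlog int, maxScore float64) int {
    p := o.Parallelism
    if p <= 0 { // auto-tune to max(NumCompactors/2, 1)
        p = o.NumCompactors / 2
        if p < 1 {
            p = 1
        }
    }
    if p > o.BucketCount {
        p = o.BucketCount // never more concurrent GC tasks than buckets
    }
    if backlog >= o.SkipBacklog || maxScore >= o.SkipScore {
        return 0 // skip this tick under heavy compaction pressure
    }
    if backlog >= o.ReduceBacklog || maxScore >= o.ReduceScore {
        p /= 2 // halve under moderate pressure
        if p < 1 {
            p = 1
        }
    }
    return p
}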


8. Recovery Semantics

  1. DB.Open restores the manifest and fetches the last persisted head pointer.
  2. valueLog.open launches flushDiscardStats and iterates every vlog file via valueLog.replayLog. Files marked invalid in the manifest are removed; valid ones are registered in the manager’s file map.
  3. valueLog.replayLog streams entries to validate checksums and trims torn tails; it does not reapply data into the LSM. WAL replay remains the sole source of committed state.
  4. Manager.VerifyDir trims torn records so replay never sees corrupt payloads.
  5. After validation, valueLog.populateDiscardStats rehydrates discard counters from the persisted JSON entry if present.

The flow mirrors Badger’s vlog scanning but keeps state reconstruction anchored on WAL + manifest checkpoints, similar to RocksDB’s reliance on MANIFEST to mark blob files live or obsolete.


9. Observability & CLI

  • Metrics in stats.go report segment counts, pending deletions, discard queue depth, and GC head pointer via expvar.
  • GC scheduling exposes NoKV.Stats.value_log.gc (including gc_parallelism, gc_active, gc_scheduled, gc_throttled, gc_skipped, gc_rejected) for diagnostics.
  • nokv vlog --workdir <dir> loads a manager in read-only mode and prints current head plus file status (valid, gc candidate). It invokes vlog.VerifyDir before describing segments.
  • Recovery traces controlled by RECOVERY_TRACE_METRICS log every head movement and file removal, aiding pressure testing of GC edge cases. For ad-hoc diagnostics, enable Options.ValueLogVerbose to emit replay/GC messages through the process slog logger.

10. Quick Comparison

| Capability | RocksDB BlobDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Head tracking | In MANIFEST (blob log number + offset) | Internal to vlog directory | Manifest entry via EditValueLogHead |
| GC trigger | Compaction sampling, blob garbage score | Discard stats from LSM tables | Discard stats flushed through lfDiscardStats |
| Failure recovery | Blob DB and WAL coordinate two-phase commits | Replays value log then LSM | Rewind-on-error + manifest-backed deletes |
| Read path | Separate blob cache | Direct read + checksum | Manager.Read with copy + per-file lock |

By anchoring the vlog state in the manifest and exposing rewind/verify primitives, NoKV maintains the determinism of RocksDB while keeping Badger’s simple sequential layout.


11. Further Reading

  • docs/recovery.md – failure matrix covering append crashes, GC interruptions, and manifest rewrites.
  • docs/cache.md – how vlog-backed entries interact with the block cache.
  • docs/stats.md – metric names surfaced for monitoring.

Manifest & Version Management

The manifest is NoKV’s metadata log for:

  • SST files (EditAddFile / EditDeleteFile)
  • WAL checkpoint (EditLogPointer)
  • value-log metadata (EditValueLogHead, EditDeleteValueLog, EditUpdateValueLog)

Implementation: engine/manifest/manager.go, engine/manifest/codec.go, engine/manifest/types.go.


1. Files on Disk

WorkDir/
  CURRENT
  MANIFEST-000001
  MANIFEST-000002
  • CURRENT stores the active manifest filename.
  • CURRENT is updated via CURRENT.tmp -> CURRENT rename.
  • MANIFEST-* stores append-only encoded edits.

2. In-Memory Version Model

type Version struct {
    Levels       map[int][]FileMeta
    LogSegment   uint32
    LogOffset    uint64
    ValueLogs    map[ValueLogID]ValueLogMeta
    ValueLogHead map[uint32]ValueLogMeta
}
  • Levels: per-level SST metadata.
  • LogSegment/LogOffset: WAL replay checkpoint.
  • ValueLogs + ValueLogHead: all known vlog segments and per-bucket active heads.

3. Edit Append Semantics

Manager.LogEdits(edits...) does:

  1. Encode edits to a buffer.
  2. Write encoded bytes to current manifest file.
  3. Conditionally call manifest.Sync() when:
    • Manager.syncWrites == true, and
    • at least one edit type requires sync (Add/DeleteFile, LogPointer, value-log edits).
  4. Apply edits to in-memory Version.
  5. Trigger manifest rewrite if size crosses threshold.

SetSync(bool) and SetRewriteThreshold(int64) are configured by LSM options.
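
An illustrative, simplified shape of that sequence (not the actual method in engine/manifest/manager.go; the helper names and fields are invented for the sketch):

// Illustrative append path: encode, write, conditional fsync, apply, maybe rewrite.
func (m *Manager) logEditsSketch(edits ...Edit) error {
    m.mu.Lock()
    defer m.mu.Unlock()

    buf := encodeEdits(edits)                       // 1. encode edits to a buffer
    if _, err := m.current.Write(buf); err != nil { // 2. append to the active MANIFEST file
        return err
    }
    if m.syncWrites && anyRequiresSync(edits) {     // 3. conditional Sync()
        if err := m.current.Sync(); err != nil {
            return err
        }
    }
    m.version.apply(edits)                          // 4. apply to the in-memory Version
    if m.size() > m.rewriteThreshold {              // 5. rewrite when over threshold
        return m.rewrite()
    }
    return nil
}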


4. Rewrite Flow

When rewrite threshold is exceeded (or Rewrite() is called):

  1. Create next MANIFEST-xxxxxx.
  2. Write a full snapshot of current Version.
  3. Flush writer, and Sync() the new manifest when syncWrites is enabled.
  4. Update CURRENT to point to new file.
  5. Reopen the new manifest for appends and remove old manifest file.

If rewrite fails before CURRENT update, restart continues using previous manifest.


5. Interaction with Other Modules

| Module | Manifest usage |
| --- | --- |
| engine/lsm/level_manager.go::flush | Logs EditAddFile + EditLogPointer after SST install; compaction logs add/delete edits. |
| engine/lsm/level_manager.go::build | During startup, missing/corrupt SST entries are marked stale and cleaned via EditDeleteFile. |
| wal | Replays from manifest checkpoint (LogSegment, LogOffset). |
| vlog | Persists head/update/delete metadata and uses manifest state for stale/orphan cleanup on startup. |
| raftstore | Does not own manifest state. Store-local region catalogs and raft WAL replay checkpoints live in raftstore/localmeta; runtime routing state lives in Coordinator storage. |

6. Recovery-Relevant Guarantees

  1. Manifest append is ordered by single manager mutex.
  2. WAL replay starts from manifest checkpoint.
  3. Restart replays only storage-engine metadata.
  4. CURRENT indirection protects against partial manifest rewrite publication.

7. Operational Commands

nokv manifest --workdir <dir>

Useful fields:

  • log_pointer.segment, log_pointer.offset
  • levels[*].files
  • value_log_heads
  • value_logs[*].valid

See recovery.md and flush.md for startup and flush ordering details.

VFS

The vfs package provides a small filesystem abstraction used by WAL, manifest, SST, value-log, and raftstore paths.


1. Core Interfaces

vfs.FS includes:

  • file open/create: OpenHandle, OpenFileHandle
  • path ops: MkdirAll, Remove, RemoveAll, Rename, RenameNoReplace, Stat, ReadDir, Glob
  • helpers: ReadFile, WriteFile, Truncate, Hostname

vfs.File includes:

  • read/write/seek APIs
  • Sync, Truncate, Close, Stat, Name

vfs.Ensure(fs) maps nil to OSFS.


2. Rename Semantics

Rename:

  • normal rename/move semantics.
  • target replacement behavior is platform-dependent (os.Rename behavior).

RenameNoReplace:

  • contract: fail with os.ErrExist when destination already exists.
  • on Linux, uses renameat2(..., RENAME_NOREPLACE).
  • on macOS, uses renamex_np(..., RENAME_EXCL).
  • when atomic no-replace rename is unsupported by platform/filesystem, returns vfs.ErrRenameNoReplaceUnsupported (no non-atomic fallback).

3. Directory Sync Helper

vfs.SyncDir(fs, dir) fsyncs directory metadata to persist entry updates (create/rename/remove).

This is used in strict durability paths (for example SST install before manifest publication) to guarantee directory entry persistence.


4. Fault Injection (FaultFS)

FaultFS wraps an underlying FS and can inject failures by operation/path.

  • Rule helpers: FailOnceRule, FailOnNthRule, FailAfterNthRule, FailOnceRenameRule
  • File-handle faults: write/sync/close/truncate
  • Rename fault matching supports src/dst targeting

Used to test manifest/WAL/recovery failure paths deterministically.


5. Current Implementation Set

  • OSFS: production implementation (Go os package).
  • FaultFS: failure-injection wrapper over any FS.

No in-memory FS is included yet.


6. Design Notes

  • Keep storage code decoupled from direct os.* calls.
  • Make crash/failure tests reproducible.
  • Keep API minimal and only add operations required by real storage call sites.


File Abstractions

The engine/file package encapsulates direct file-system interaction for WAL, SST, and value-log files. It provides portable mmap helpers, allocation primitives, and log file wrappers.


1. Core Types

| Type | Purpose | Key Methods |
| --- | --- | --- |
| Options | Parameter bag for opening files (FID, path, size). | Used by WAL/vlog managers. |
| MmapFile | Cross-platform mmap wrapper. | OpenMmapFile, AppendBuffer, Truncate, Sync. |
| LogFile | Value-log specific helper built on MmapFile. | Open, Write, Read, DoneWriting, Truncate, Bootstrap. |

Darwin-specific builds live alongside (mmap_darwin.go, sstable_darwin.go) ensuring the package compiles on macOS without manual tuning.


2. Mmap Management

  • OpenMmapFile opens or creates a file, optionally extending it to maxSz, then mmaps it. The returned MmapFile exposes Data []byte and the underlying *os.File handle.
  • Writes grow the map on demand: AppendBuffer checks if the write would exceed the current mapping and calls Truncate to expand (doubling up to 1 GiB increments).
  • Sync flushes dirty pages (mmap.Msync), while Delete unmaps, truncates, closes, and removes the file—used when dropping SSTs or value-log segments.

RocksDB relies on custom Env implementations for portability; NoKV keeps the logic in Go, relying on build tags for OS differences.


3. LogFile Semantics

LogFile wraps MmapFile to simplify value-log operations:

lf := &file.LogFile{}
_ = lf.Open(&file.Options{FID: 1, FileName: "00001.vlog", MaxSz: 1<<29})
var buf bytes.Buffer
payload, _ := kv.EncodeEntry(&buf, entry)
_ = lf.Write(offset, payload)
_ = lf.DoneWriting(offset + uint32(len(payload)))
  • Open mmaps the file and records current size (guarded to < 4 GiB).
  • Read validates offsets against both the mmap length and tracked size, preventing partial reads when GC or drop operations shrink the file.
  • Entry encoding uses shared helpers in kv (kv.EncodeEntry / kv.EncodeEntryTo); LogFile focuses on write/read/truncate + durability semantics.
  • DoneWriting guarantees durability for both data bytes [0, offset) and the file metadata (size).
    • Sequence: It flushes dirty pages (msync), truncates the file to offset, and performs a file-descriptor level sync (fsync) to ensure the new file size is persisted on disk before returning.
    • Contract: Success implies that after a crash, the file size will not exceed offset, and all data prior to offset is safe.
    • After syncing, it reinitializes the mmap and keeps the file open in read-write mode for potential subsequent appends (if logic allows) or prepares it for read-only consumption.
  • Rewind (via vlog.Manager.Rewind) leverages LogFile.Truncate and Init to roll back partial batches after errors.

4. SST Helpers

While SSTable builders/readers live under engine/lsm/table.go, they rely on engine/file helpers to map index/data blocks efficiently. The build tags (sstable_linux.go, sstable_darwin.go) provide OS-specific tuning for direct I/O hints or mmap flags.


5. Comparison

| Engine | Approach |
| --- | --- |
| RocksDB | C++ Env & random-access file wrappers. |
| Badger | y.File abstraction with mmap. |
| NoKV | Go-native mmap wrappers with explicit log helpers. |

By keeping all filesystem primitives in one package, NoKV ensures WAL, vlog, and SST layers share consistent behaviour (sync semantics, truncation rules) and simplifies testing (engine/file/mmap_linux_test.go).


6. Operational Notes

  • DoneWriting provides strong crash-consistency guarantees. Even on filesystems where ftruncate metadata persistence is asynchronous, the explicit post-truncate fsync ensures the file size is durable upon success.
  • Value-log and WAL segments rely on DoneWriting/Truncate to seal files; avoid manipulating files externally, or the mmap metadata may desynchronise.
  • LogFile updates cached size internally on Write/Truncate, so read bounds stay consistent during rewrite/rewind flows.
  • vfs.SyncDir is used by strict durability flows to persist directory entry changes (create/rename/remove). For example, strict SST flush calls SyncDir(workdir) before manifest publication.

For more on how these primitives plug into higher layers, see docs/wal.md and docs/vlog.md.

Cache & Bloom Filters

NoKV’s LSM tier layers a multi-level block cache with decoded index caching to accelerate lookups. Bloom filters remain embedded in SST indexes and are probed directly from pb.TableIndex. The implementation is in engine/lsm/cache.go.


1. Components

| Component | Purpose | Source |
| --- | --- | --- |
| cache.indexes | Byte-budgeted W-TinyLFU cache for decoded table indexes (fid → *pb.TableIndex). | utils/cache |
| blockCache | Ristretto-based block cache (L0/L1 only) with per-table direct slots. | engine/lsm/cache.go |
| cacheMetrics | Atomic hit/miss counters for L0/L1 blocks and indexes. | engine/lsm/cache.go#L30-L110 |

Badger exposes separate block/index cache budgets while Pebble uses a unified cache budget. NoKV keeps block and index caches explicit; bloom filters piggyback on the decoded table index already held by each live SST.


1.1 Index Cache & Handles

  • SSTable metadata stays with the table struct, while decoded protobuf indexes are stored in cache.indexes. Lookups first hit the cache before falling back to disk.
  • SST handles are reopened on demand for lower levels. L0/L1 tables keep their file descriptors pinned, while deeper levels close them once no iterator is using the table.

2. Block Cache Strategy

  • User-space block cache: L0/L1, parsed blocks, Ristretto (LFU-ish)
  • Deeper levels: OS page cache + mmap readahead
  • Options.BlockCacheBytes sets the block-cache budget in bytes. Entries are admitted with an estimated block footprint, so larger blocks naturally consume more of the budget.
  • Per-table direct slots (table.cacheSlots[idx]) give a lock-free fast path. Misses fall back to the shared Ristretto cache (approx LFU with admission).
  • Evictions clear the table slot via OnEvict; user-space cache only tracks L0/L1 blocks. Deeper levels depend on the OS page cache.
  • Access patterns: getBlock also updates hit/miss metrics for L0/L1; deeper levels bypass the cache and do not affect metrics.
flowchart LR
  Read --> CheckCache
  CheckCache -->|hit| Return
  CheckCache -->|miss| LoadFromTable["LoadFromTable (mmap + OS page cache)"]
  LoadFromTable --> InsertCache
  InsertCache --> Return

By default only L0 and L1 blocks are cached (level > 1 short-circuits), reflecting the higher re-use for top levels.


3. Bloom Filters

  • Bloom filters are stored inside pb.TableIndex and probed directly from the decoded index already held by table.idx.
  • There is no separate bloom-filter cache layer; this avoids a redundant hot-path mutex/LRU hop on every point lookup.
  • indexCache keeps the existing W-TinyLFU admission path and budgets decoded pb.TableIndex payloads using the protobuf-encoded size (proto.Size).

4. Metrics & Observability

cache.metricsSnapshot() produces:

type CacheMetrics struct {
    L0Hits, L0Misses uint64
    L1Hits, L1Misses uint64
    IndexHits, IndexMisses uint64
}

Stats.Snapshot converts these into hit rates. Monitor them alongside the block cache sizes to decide when to scale memory.
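
As a rough usage sketch (only the CacheMetrics fields above come from the source; the snapshot accessor and helper are assumptions), hit rates can be derived like this:

// hitRate mirrors what Stats.Snapshot computes from the counters above.
func hitRate(hits, misses uint64) float64 {
    total := hits + misses
    if total == 0 {
        return 0
    }
    return float64(hits) / float64(total)
}

m := snapshot.Cache // hypothetical: a CacheMetrics sample taken from Stats.Snapshot
fmt.Printf("L0=%.2f L1=%.2f index=%.2f\n",
    hitRate(m.L0Hits, m.L0Misses),
    hitRate(m.L1Hits, m.L1Misses),
    hitRate(m.IndexHits, m.IndexMisses))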


5. Thermos Integration

Thermos is no longer part of cache warmup or read-path prefetch. Cache behavior is now independent of Thermos and driven only by:

  • iterator/table prefetch settings
  • block/index cache budgets
  • normal read traffic

The only remaining Thermos integration is optional write throttling.


6. Interaction with Value Log

  • Keys stored as value pointers (large values) still populate block cache entries for the key/index block. The value payload is read directly from the vlog (valueLog.read), so block cache hit rates remain meaningful.
  • Discard stats from flushes can demote cached blocks via cache.dropBlock, ensuring obsolete SST data leaves the cache quickly.

7. Comparison

| Feature | RocksDB | BadgerDB | NoKV |
| --- | --- | --- | --- |
| Block cache policy | Configurable multiple caches | Single cache | Ristretto for L0/L1 + OS page cache for deeper levels |
| Bloom filter storage | Per table | Per table | Embedded in decoded table indexes |
| Metrics | Block cache stats via GetAggregatedIntProperty | Limited | NoKV.Stats.cache.* hit rates |

8. Operational Tips

  • If point-read false positives become expensive, tune bloom bits-per-key at SST build time rather than adding another filter cache layer.
  • Track nokv stats --json cache metrics over time; drops often indicate iterator misuse or working-set shifts.
  • Benchmark tooling accepts cache sizes in MB and converts them into these byte-budget fields before opening the engine.

More on SST layout lives in docs/manifest.md and docs/architecture.md.

Range Filter

NoKV’s range filter is a read-path pruning layer for the LSM tree. It is inspired by the paper GRF: A Global Range Filter for LSM-Trees with Shape Encoding, but it is deliberately more conservative and much simpler.

The current implementation is best described as:

  • in-memory only
  • correctness-first
  • advisory pruning, not a source of truth
  • table-level pruning plus table-internal block-range pruning

It is designed to reduce unnecessary SST and block probes on point lookups and narrow bounded scans without changing recovery or manifest semantics.


1. Goals

The range filter exists to cut down the part of the read path that answers:

  • which SSTs are even worth probing
  • which block range inside a chosen SST is even worth scanning

This is most valuable for:

  • point misses
  • point hits on non-overlapping levels with many SSTs
  • narrow bounded iterators

It is not intended to change write behavior, compaction policy, or persistence format.


2. Inspiration

The design direction comes from the SIGMOD 2024 GRF paper:

The paper’s main idea is stronger than what NoKV currently implements:

  • a true global range filter
  • shape encoding tied to LSM shape and run IDs
  • version/snapshot-aware maintenance
  • maintenance rules that assume more deterministic compaction behavior

NoKV does not implement full GRF or Shape Encoding. The current design only borrows the high-level idea:

  • do pruning before expensive per-run / per-table probes
  • keep the pruning layer global enough to matter
  • preserve correctness by falling back when uncertain

3. Current Design

3.1 Per-level table spans

Each levelHandler maintains a rangeFilter built from its current table set:

Each span records:

  • minimum base key (CF + user key)
  • maximum base key (CF + user key)
  • table pointer

For non-overlapping levels, the filter supports binary-search candidate lookup. For overlapping or small levels, it falls back to a conservative linear filter.

The implementation deliberately works on base-key semantics rather than pure user-key semantics so different column families are never collapsed into the same pruning range.
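
A simplified sketch of that per-level filter follows. The type and function names are illustrative (the real structures live with the levelHandler), but the binary-search candidate lookup is the mechanism described above for non-overlapping levels.

// tableSpan is an illustrative stand-in for one per-table span entry.
type tableSpan struct {
    minBase []byte // smallest base key (CF + user key) in the table
    maxBase []byte // largest base key in the table
    tbl     *table // hypothetical table handle
}

// candidate returns the single table on a non-overlapping level whose span may
// contain base, or nil when the whole level can be pruned. spans are kept
// sorted by maxBase, so sort.Search finds the first table that could reach base.
func candidate(spans []tableSpan, base []byte) *table {
    i := sort.Search(len(spans), func(i int) bool {
        return bytes.Compare(spans[i].maxBase, base) >= 0
    })
    if i < len(spans) && bytes.Compare(spans[i].minBase, base) <= 0 {
        return spans[i].tbl
    }
    return nil
}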

3.2 Point-read pruning

Point lookups first prune candidate tables:

If a non-overlapping level collapses to a single exact candidate, NoKV uses a thinner point-read path:

This avoids the heavier generic iterator-style path for the common “one table could contain this key” case.

3.3 Bounded iterator pruning

Bounded iterators use the same per-level filter to prune whole tables before iterator assembly:

Inside a table, NoKV then uses SST block base keys from the decoded table index to shrink the block range that the iterator needs to touch:

This is the current “table-internal” stage of the design.

3.4 Rebuild behavior

The filter is rebuilt when a level’s table set changes, for example after:

  • sort/install
  • compaction replacement
  • table deletion

The filter is not updated on every write. It is rebuilt on table-set changes under the existing level ownership rules.


4. Correctness Model

The range filter is intentionally not authoritative.

Rules:

  • if the filter is unsure, fall back
  • if the level overlaps, fall back
  • if the level is too small to amortize pruning overhead, fall back
  • never allow false-negative pruning

This is why the design is safe to deploy without changing startup, manifest, or recovery behavior.

L0 is handled conservatively:

  • it remains overlap-first
  • it does not use the non-overlapping exact-candidate fast path

That choice trades peak possible speed for simpler correctness.


5. What This Is Not

This document describes a practical NoKV optimization, not a claim of full GRF compatibility.

NoKV currently does not implement:

  • Shape Encoding
  • a persisted global range filter
  • run-ID encoding
  • snapshot/version-aware filter state
  • compaction-policy constraints required by full GRF

This is deliberate. Those features would couple the filter much more tightly to compaction scheduling, version management, and recovery semantics.


6. Observability

Range-filter behavior is exported through LSM diagnostics and top-level stats:

Current counters include:

  • PointCandidates
  • PointPruned
  • BoundedCandidates
  • BoundedPruned
  • Fallbacks

These counters are useful for deciding whether a workload is actually benefiting from pruning or mostly falling back.


7. Performance Notes

Microbenchmarks show strong gains when candidate-pruning is the dominant cost:

  • point misses
  • point hits on many-table non-overlapping levels
  • narrow bounded scans

System-level YCSB behavior is more nuanced:

  • read-heavy workloads benefit more clearly
  • mixed read/write workloads depend more on L0 shape, block loads, flush, and compaction pressure

This is expected. The range filter optimizes query planning and table/block selection. It does not remove the rest of the read path.


8. Why NoKV Stops Short of Full GRF

Full paper alignment is not the current goal.

Reasons:

  1. NoKV’s current compaction and level behavior does not naturally satisfy the paper’s stronger deterministic requirements.
  2. A persisted global filter would add metadata, rebuild, and recovery complexity.
  3. The current implementation already captures the most practical early win:
    • prune whole tables first
    • then prune block ranges inside the chosen table
  4. Current bottlenecks for read-heavy workloads are now more visible in:
    • L0 overlap handling
    • block loading
    • remaining table probe cost

For NoKV, this simpler design is a better engineering tradeoff today.


9. Source Map

Primary implementation files:

Related benchmark coverage:

Thermos

thermos is NoKV’s optional internal hotspot detector. The package lives in thermos/ and is no longer part of the default read path, LSM compaction planning, or value-log bucket routing.

Current Role

The production role is deliberately narrow:

  • detect write-hot keys when Options.ThermosEnabled is enabled
  • enforce Options.WriteHotKeyLimit via Thermos.TouchAndClamp
  • publish write-hot snapshots through StatsSnapshot.Hot.WriteKeys and StatsSnapshot.Hot.WriteRing

If ThermosEnabled=false, the DB runs without Thermos tracking on the hot path.

What Thermos Does Not Control Anymore

These integrations were intentionally removed from the default engine path:

  • read-path hot tracking and asynchronous read prefetch
  • LSM compaction scoring based on hot-key overlap
  • value-log hot/cold bucket routing
  • hot-write batch enlargement heuristics

The default value-log configuration keeps ordinary hash bucketization enabled through ValueLogBucketCount, but bucket selection no longer depends on Thermos.

Data Structure

Thermos remains a concurrent in-memory frequency tracker with:

  • sharded hash buckets
  • lock-free bucket lists for lookup/insert
  • atomic counters per node
  • optional sliding-window and rotation support

These capabilities are still useful for optional write throttling and operational diagnostics, even though they are no longer wired into unrelated subsystems.

Relevant Options

| Option | Meaning |
| --- | --- |
| ThermosEnabled | Master switch for write-hot tracking. |
| WriteHotKeyLimit | Reject writes once a single key exceeds the configured threshold. |
| ThermosBits | Bucket count (2^bits) for the tracker. |
| ThermosTopK | Number of hot keys exported in stats. |
| ThermosRotationInterval | Optional dual-ring rotation. |
| ThermosWindowSlots / ThermosWindowSlotDuration | Optional sliding-window tracking. |
| ThermosNodeCap / ThermosNodeSampleBits | Bound in-memory growth. |

Write Throttling

When both ThermosEnabled and WriteHotKeyLimit > 0 are set, NoKV records write frequency by CF + UserKey and returns utils.ErrHotKeyWriteThrottle once the limit is reached.

This path exists to protect the engine from pathological skew. It is intentionally independent from cache warming, compaction, and value-log routing.
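
A hedged usage sketch (the option wiring and Set signature are assumptions; the sentinel and option names come from this page):

opts.ThermosEnabled = true
opts.WriteHotKeyLimit = 10000 // clamp any single CF + UserKey above this write frequency

if err := db.Set(key, value); errors.Is(err, utils.ErrHotKeyWriteThrottle) {
    // pathological skew on this key: back off or shed load instead of retrying immediately
}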

Stats

Thermos contributes only write-side observability:

  • StatsSnapshot.Hot.WriteKeys
  • StatsSnapshot.Hot.WriteRing
  • StatsSnapshot.Write.HotKeyLimited

The CLI surfaces the same data under nokv stats.

Design Position

Thermos should be understood as:

  • an optional internal detector
  • an optional write throttling tool
  • not a required performance feature
  • not a default read-path or value-log optimization

That narrower scope keeps the core engine path simpler and makes Thermos easier to reason about.

Entry Lifecycle, Encoding, and Field Semantics

kv.Entry is the core record container that flows through user APIs, commit batching, WAL/value-log codecs, memtable indexes, SST blocks, and iterators.

This document explains:

  1. How key/value bytes are encoded.
  2. What Entry.Key / Entry.Value mean at each stage.
  3. Which fields are authoritative vs derived.
  4. How PopulateInternalMeta keeps internal fields consistent.
  5. Borrowed-vs-detached ownership rules.

1. Structure Overview

Source: engine/kv/entry.go, engine/kv/key.go, engine/kv/value.go

type Entry struct {
    Key          []byte
    Value        []byte
    ExpiresAt    uint64
    CF           ColumnFamily
    Meta         byte
    Version      uint64
    Offset       uint32
    Hlen         int
    ValThreshold int64
    ref          int32
}

Important interpretation:

  • Key is the canonical source of truth for internal records.
  • CF and Version are cached/derived fields for convenience.
  • Value can represent either:
    • inline value bytes, or
    • encoded ValuePtr bytes (Meta has BitValuePointer).

2. Encoding Layers

2.1 Internal Key Encoding

Source: engine/kv/key.go

InternalKey(cf, userKey, ts) layout:

  • 4-byte CF header: 0xFF 'C' 'F' <cf-byte>
  • raw user key bytes
  • 8-byte big-endian descending timestamp (MaxUint64 - ts)

Helpers:

  • SplitInternalKey(internal) -> (cf, userKey, ts, ok)
  • SplitBaseKey(base) -> (cf, userKey, ok)
  • Timestamp(key) / InternalToBaseKey(internal) / BaseKey(cf, userKey) / SameBaseKey(a, b)
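
For intuition, the layout can be written out directly. This is an illustrative re-implementation of the byte layout above, not the actual kv.InternalKey code:

func internalKey(cf byte, userKey []byte, ts uint64) []byte {
    buf := make([]byte, 0, 4+len(userKey)+8)
    buf = append(buf, 0xFF, 'C', 'F', cf) // 4-byte CF header
    buf = append(buf, userKey...)         // raw user key bytes
    var suffix [8]byte
    binary.BigEndian.PutUint64(suffix[:], math.MaxUint64-ts) // descending timestamp
    return append(buf, suffix[:]...)
}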

2.2 ValueStruct Encoding

Source: engine/kv/value.go

ValueStruct layout:

  • Meta (1B)
  • ExpiresAt (uvarint)
  • Value (raw bytes)

ValueStruct does not store Version; the version is always taken from the internal key.

2.3 Entry Record Encoding (WAL / Vlog record payload)

Source: engine/kv/entry_codec.go

Entry codec layout:

  • header (uvarints: keyLen/valueLen/meta/expiresAt)
  • key bytes
  • value bytes
  • crc32

DecodeEntryFrom now calls PopulateInternalMeta() after key decode so internal records get consistent CF/Version immediately.


3. Stage-by-Stage Meaning of Key and Value

3.1 User write (DB.Set, DB.SetBatch, DB.SetWithTTL, DB.Del, DB.DeleteRange, DB.ApplyInternalEntries)

Source: db.go

  • Set/SetBatch/SetWithTTL/Del/DeleteRange use NewInternalEntry(...):
    • Key: encoded internal key.
    • Value: user value bytes.
  • ApplyInternalEntries validates internal key, then writes back parsed CF/Version from key before entering write pipeline.

3.2 Commit worker: vlog then LSM apply

Source: db.go, vlog.go

  • Before LSM.SetBatch, large values are replaced by ValuePtr.Encode() bytes and BitValuePointer is set.
  • Small values stay inline.
  • Key remains internal key throughout.

3.3 WAL replay / vlog iteration decode

Source: engine/kv/entry_codec.go, engine/lsm/memtable.go

  • Decoded records carry internal key bytes in Key.
  • Value is record payload bytes (inline value or pointer bytes).
  • CF/Version are derived from key via PopulateInternalMeta.

3.4 Memtable index lookup

Source: engine/lsm/memtable.go, engine/index/skiplist.go, engine/index/art.go

  • memIndex.Search(...) returns (matchedInternalKey, ValueStruct).
  • memTable.Get assembles pooled Entry from this and calls PopulateInternalMeta.
  • For a miss, Key is nil and the value struct is zero (a sentinel, not a valid record).

3.5 SST / iterator decode

Source: engine/lsm/table_builder.go, engine/lsm/table.go

  • Block iterator reconstructs internal key + value struct.
  • It calls PopulateInternalMeta before exposing item entry.
  • table.Search clones key/value and re-populates internal metadata.

3.6 Internal read API (GetInternalEntry)

Source: db.go

  • loadBorrowedEntry fetches from LSM.
  • If BitValuePointer is set:
    • decode pointer from Value
    • read real bytes from vlog
    • replace Value with actual value bytes
    • clear BitValuePointer
  • Return entry with internal key still in Key, and CF/Version re-populated.

3.7 Public read API (Get, public iterator)

Source: db.go, iterator.go

  • Public APIs convert internal key to user key for external consumers.
  • Returned entry is detached copy (DB.Get) or iterator materialized object.

3.8 Runtime State Flow Diagram

flowchart TD
  A["DB.Set/SetWithTTL/Del"] --> B["kv.NewInternalEntry"]
  B --> C["DB.ApplyInternalEntries"]
  C --> D["commitWorker: vlog.write"]
  D --> E["LSM/WAL persist"]
  E --> F["DB.GetInternalEntry"]
  F --> G{"BitValuePointer?"}
  G -- yes --> H["vlog.read + clear pointer bit"]
  G -- no --> I["inline value"]
  H --> J["PopulateInternalMeta"]
  I --> J
  J --> K["DB.Get cloneEntry -> user key/value"]
  J --> L["DB.NewIterator materialize -> user key/value"]

4. Field Validity Matrix

| Context | Key | Value | CF/Version | Ownership |
| --- | --- | --- | --- | --- |
| Internal write path | Internal key | Inline value or ptr-bytes (large values) | Valid (set at build/validate time) | Borrowed/pooled |
| Decoded WAL/vlog record | Internal key | Encoded record payload value bytes | Valid after PopulateInternalMeta | Borrowed/pooled |
| GetInternalEntry return | Internal key | Real value bytes (pointer resolved) | Valid (re-populated before return) | Borrowed/pooled |
| DB.Get return | User key | Real value bytes | Valid for external semantics | Detached copy |
| Memtable miss sentinel | nil | nil | CFDefault/0 | Borrowed/pooled sentinel |

Rule of thumb:

  • Internal code should treat Key as authoritative.
  • CF/Version should be expected to match SplitInternalKey(Key) after PopulateInternalMeta.

5. PopulateInternalMeta Semantics

Source: engine/kv/entry.go

func (e *Entry) PopulateInternalMeta() bool

Behavior:

  1. Parse e.Key as internal key.
  2. If parse succeeds:
    • e.CF = parsedCF
    • e.Version = parsedTS
    • return true
  3. If parse fails:
    • reset cache fields to safe defaults (CFDefault, Version=0)
    • return false
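
Written out as a sketch (the real method lives in engine/kv/entry.go; SplitInternalKey is the helper from section 2.1), that behavior is roughly:

func (e *Entry) PopulateInternalMeta() bool {
    cf, _, ts, ok := SplitInternalKey(e.Key)
    if !ok {
        // not a valid internal key: reset cached fields to safe defaults
        e.CF, e.Version = CFDefault, 0
        return false
    }
    e.CF, e.Version = cf, ts
    return true
}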

Why it exists:

  • pooled entries can carry stale cached fields if not reset carefully;
  • parse-once helper gives a single normalization point at codec/index/read boundaries;
  • keeps key-derived metadata (CF/Version) authoritative and consistent.

6. Ownership and Refcount States

Source: engine/kv/entry.go, docs/architecture.md

Borrowed entry (internal)

Returned by:

  • DecodeEntryFrom
  • memTable.Get
  • LSM.Get internals
  • DB.GetInternalEntry

Contract:

  • caller must call DecrRef() exactly once.
  • double release panics (underflow guard).

Detached entry (public)

Returned by:

  • DB.Get (via clone)

Contract:

  • do not call DecrRef().
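
A hedged usage sketch of the two contracts (argument and return shapes are assumptions; the method names and DecrRef come from this section):

ent, err := db.GetInternalEntry(internalKey) // borrowed/pooled entry
if err != nil {
    return err
}
defer ent.DecrRef() // exactly once; a second release trips the underflow guard
// ... use ent.Key / ent.Value only while the reference is held ...

got, _ := db.Get(userKey) // detached copy: never call DecrRef on it
_ = got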

7. Contributor Rules

  1. Any new internal decode boundary should either:
    • call PopulateInternalMeta, or
    • immediately parse key with SplitInternalKey and set fields consistently.
  2. Do not reintroduce value-side version caches.
  3. Keep internal APIs internal-key-first; only external APIs should expose user keys.
  4. Preserve borrowed/detached ownership contracts in comments and tests.

Error Handling Guide

This document defines how NoKV should own, define, and propagate errors.


1. Ownership Rules

  1. Domain errors stay in domain packages.
  2. Cross-cutting runtime errors may live in utils only when shared by multiple subsystems.
  3. Command-local/business-flow errors should be unexported (errXxx) and stay in command/service packages.

Examples:

  • kv: entry codec/read-decode errors.
  • vfs: filesystem contract errors.
  • coordinator/catalog: control-plane validation/conflict errors.

2. Propagation Rules

  1. Wrap with %w when crossing package boundaries.
  2. Match via errors.Is, not string compare.
  3. Keep stable sentinel values for retryable / control-flow decisions.
  4. Add context in upper layers; do not lose original cause.
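
A minimal sketch of these rules (the reading helper is hypothetical; the ErrBadChecksum sentinel appears in the error map below):

// wrap at the package boundary and keep the original cause
func readRecord(path string) error {
    if err := decodeRecord(path); err != nil { // decodeRecord is hypothetical
        return fmt.Errorf("vlog: read record %s: %w", path, err)
    }
    return nil
}

// match by sentinel at the caller, never by string comparison
if err := readRecord(p); errors.Is(err, kv.ErrBadChecksum) {
    // corrupt tail record: let the truncation/restart logic in upper layers decide
}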

3. Naming Rules

  1. Exported sentinels use ErrXxx.
  2. Error text should be lowercase and package-scoped when useful (for example coordinator/catalog: ..., coordinator/idalloc: ..., vfs: ...).
  3. Avoid duplicate sentinels with identical semantics in different packages.

4. Current Error Map

Shared runtime sentinels

  • utils/error.go: common cross-package sentinels such as invalid request, key/value validation errors, throttling, and lifecycle guards.

Domain-specific sentinels

  • engine/kv/entry_codec.go: ErrBadChecksum, ErrPartialEntry
  • engine/vfs/vfs.go: ErrRenameNoReplaceUnsupported
  • engine/lsm/compaction.go: compaction planner/runtime domain errors
  • raftstore/peer/errors.go: peer lifecycle/state errors
  • pb/errorpb.proto: region/store routing protobuf errors (RegionError, StoreNotMatch, RegionNotFound, KeyNotInRegion, …)
  • engine/wal/errors.go: WAL encode/decode and segment errors
  • coordinator/catalog/errors.go: Coordinator metadata and range validation errors

5. Propagation in Hot Paths

  1. Embedded write path (DB.Set* -> commit worker -> LSM/WAL):
    • validation returns direct sentinel (ErrEmptyKey, ErrNilValue, ErrInvalidRequest);
    • storage boundary errors are wrapped with context and preserved via %w.
  2. Distributed command path (kv.Service -> Store.*Command -> kv.Apply):
    • region/leader/store/range failures are mapped to errorpb messages in protobuf responses;
    • execution failures return Go errors to RPC layer and are translated to gRPC status.
  3. Recovery/replay path (WAL/Vlog/Manifest):
    • partial/corrupt records return domain sentinels and are handled by truncation or restart logic in upper layers.

RaftStore Deep Dive

raftstore powers NoKV’s distributed mode by layering multi-Raft replication on top of the embedded storage engine. Its RPC surface is exposed as the NoKV gRPC service, while the command model still tracks the TinyKV/TiKV region + MVCC design. This note explains the major packages, the boot and command paths, how transport and storage interact, and the supporting tooling for observability and testing.

Read this page if you want to answer one question precisely: which package owns which responsibility once NoKV leaves standalone mode and becomes a cluster.

For the protocol line that now minimally formalizes raftstore request admission, transition execution, publish boundaries, and restart state, read docs/control_and_execution_protocols.md together with this file.

Reader Map

  • Start with section 1 if you want package ownership.
  • Jump to section 3 if you want request execution.
  • Jump to section 8 if you care about split/merge and control-plane behavior.
  • Read this together with coordinator.md and migration.md if your focus is lifecycle rather than just request flow.

1. Package Structure

| Package | Responsibility |
| --- | --- |
| store | Orchestrates peer set, command pipeline, region manager, scheduler/heartbeat loops; exposes helpers such as StartPeer, ProposeCommand, SplitRegion. |
| peer | Wraps etcd/raft RawNode, drives Ready processing (persist to WAL, send messages, apply entries), tracks snapshot resend/backlog. |
| raftlog | WALStorage/DiskStorage/MemoryStorage across all Raft groups, leveraging the NoKV WAL while tracking store-local raft replay metadata. |
| meta | Store-local durable metadata: peer catalog for restart and raft WAL replay checkpoints for replay/GC. |
| transport | gRPC transport with retry/TLS/backpressure; exposes the raft Step RPC and can host additional services (NoKV). |
| kv | NoKV RPC implementation, bridging Raft commands to MVCC operations via kv.Apply. |
| server | Config + NewNode that bind storage, Store, transport, and the shared KV/admin RPC surfaces into a reusable node primitive. |

Runtime ownership sketch

flowchart TD
    Client["Client / fsmeta gateway / CLI"]
    Client --> KV["kv.Service"]
    Client --> Coordinator["Coordinator"]

    subgraph "Node runtime"
        Server["server.Node"] --> Store["store.Store"]
        Store --> Router["router"]
        Store --> Peer["peer.Peer"]
        Store --> Scheduler["scheduler runtime"]
        Store --> Admin["admin service"]
        Store --> Meta["raftstore/localmeta"]
        Peer --> Engine["raftstore/raftlog"]
        Peer --> Apply["kv.Apply"]
    end

    Apply --> DB["NoKV DB"]

2. Boot Sequence

  1. Construct Server

    srv, _ := server.NewNode(server.Config{
        Storage: server.Storage{MVCC: db, Raft: db.RaftLog()},
        Store: store.Config{StoreID: 1},
        Raft: myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
        TransportAddr: "127.0.0.1:20160",
    })
    
    • A gRPC transport is created, the NoKV service is registered, and transport.SetHandler(store.Step) wires raft Step handling.
    • store.Store loads the local peer catalog from raftstore/localmeta to rebuild the Region catalog (router + metrics).
  2. Start local peers

    • CLI (nokv serve) loads the local peer catalog and calls Store.StartPeer for every region that includes the local store.
    • Each peer.Config carries raft parameters, the transport reference, kv.NewEntryApplier, peer storage, and Region metadata.
    • StartPeer registers the peer through the peer-set/routing layer and may bootstrap or campaign for leadership.
  3. Peer connectivity

    • transport.SetPeer(peerID, addr) defines outbound raft connections; the CLI resolves those peer endpoints from local metadata plus stable storeID -> addr mapping.
    • Additional services can reuse the same gRPC server through transport.WithServerRegistrar.

Minimal mental model

  • Server is the node wiring root.
  • Store is the runtime owner for what this node hosts.
  • Peer is one region replica’s state machine and raft runtime.
  • Meta is a local recovery mirror, not cluster truth.
  • Coordinator is control plane, not a writer of local truth.

3. Command Execution

Read (strong leader read)

  1. kv.Service.KvGet builds pb.RaftCmdRequest and invokes Store.ReadCommand.
  2. validateCommand ensures the region exists, epoch matches, and the local peer is leader; a RegionError is returned otherwise.
  3. peer.LinearizableRead obtains a safe read index, then peer.WaitApplied waits until local apply index reaches it.
  4. commandApplier (i.e. kv.Apply) runs GET/SCAN against the DB using MVCC readers to honor locks and version visibility.

Write (via Propose)

  1. Write RPCs (Prewrite/Commit/…) call Store.ProposeCommand, encoding the command and routing to the leader peer.
  2. The leader appends the encoded request to raft, replicates, and once committed the command pipeline hands data to kv.Apply, which maps Prewrite/Commit/ResolveLock to the percolator package.
  3. engine.WALStorage persists raft entries/state snapshots and updates raftstore/localmeta raft pointers. This keeps WAL GC and raft truncation aligned without polluting the storage manifest.
  4. Raft apply only accepts command-encoded payloads (RaftCmdRequest). Legacy raw KV payloads are rejected as unsupported.

Command flow diagram

sequenceDiagram
    participant C as NoKV Client
    participant SVC as kv.Service
    participant ST as store.Store
    participant PR as peer.Peer
    participant RF as raft log/replication
    participant AP as kv.Apply
    participant DB as NoKV DB

    rect rgb(237, 247, 255)
      C->>SVC: KvGet/KvScan
      SVC->>ST: ReadCommand
      ST->>PR: LinearizableRead + WaitApplied
      ST->>AP: commandApplier(req)
      AP->>DB: internal read path
      AP-->>C: read response
    end

    rect rgb(241, 253, 244)
      C->>SVC: KvPrewrite/KvCommit/...
      SVC->>ST: ProposeCommand
      ST->>RF: route + replicate
      RF->>AP: apply committed entry
      AP->>DB: percolator mutate -> ApplyInternalEntries
      AP-->>C: write response
    end

4. Transport

  • gRPC transport listens on TransportAddr, serving both raft Step RPC and NoKV RPC.
  • SetPeer updates the mapping of remote store IDs to addresses; BlockPeer can be used by tests or chaos tooling.
  • Configurable retry/backoff/timeout options mirror production requirements. Tests cover message loss, blocked peers, and partitions.

5. Storage Backend (engine)

  • WALStorage piggybacks on the embedded WAL: each Raft group writes typed entries, HardState, and snapshots into the shared log.
  • raftstore/localmeta persists the store-local raft replay pointer used by WAL GC and replay.
  • Alternative storage backends (DiskStorage, MemoryStorage) are available for tests and special scenarios.

6. NoKV RPC Integration

| RPC | Execution Path | Notes |
| --- | --- | --- |
| KvGet / KvScan | ReadCommand → LinearizableRead (ReadIndex) + WaitApplied → kv.Apply (read mode) | Leader-only strong read with Raft linearizability barrier. |
| KvPrewrite / KvCommit / KvBatchRollback / KvResolveLock / KvCheckTxnStatus | ProposeCommand → command pipeline → raft log → kv.Apply | Pipeline matches proposals with apply results; MVCC latch manager prevents write conflicts. |

The cmd/nokv serve command uses server.Node internally and prints a local peer catalog summary (key ranges, peers) so operators can verify the node’s recovery view at startup.


7. Client Interaction (raftstore/client)

  • Region-aware routing with NotLeader/EpochNotMatch retry.
  • Mutate splits mutations by region and performs two-phase commit (primary first). Put / Delete are convenience wrappers.
  • Scan transparently walks region boundaries.
  • End-to-end coverage lives in raftstore/server/node_test.go, which launches real nodes, uses the client to write and delete keys, and verifies the results.
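
A hedged usage sketch (the constructor and config fields are assumptions; Put, Delete, Scan, and the retry behavior come from this section):

cli, err := client.New(client.Config{CoordinatorAddrs: []string{"127.0.0.1:2379"}}) // hypothetical constructor
if err != nil {
    log.Fatal(err)
}
defer cli.Close()

_ = cli.Put(ctx, []byte("k1"), []byte("v1")) // convenience wrapper over Mutate (2PC, primary first)
_ = cli.Delete(ctx, []byte("k2"))

// Scan walks region boundaries transparently; NotLeader/EpochNotMatch are retried internally.
pairs, _ := cli.Scan(ctx, []byte("a"), []byte("z"), 100)
_ = pairs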

8. Control Plane & Region Operations

8.1 Topology & Routing

  • Topology is sourced from raft_config.example.json (via config.LoadFile) and reused by scripts and Docker Compose as bootstrap metadata.
  • Runtime routing is Coordinator-first: raftstore/client resolves Regions by key through GetRegionByKey and caches route entries for retries.
  • raft_config regions are treated as bootstrap/deployment metadata and are not the runtime source of truth once Coordinator is available.
  • Coordinator is the only control-plane source of truth for runtime scheduling/routing.

Why the layering matters

The design rule is:

  • RaftAdmin executes membership operations against the current leader store.
  • Coordinator decides and observes at the cluster level.
  • Store/Peer own local truth and apply.

That split is what keeps migration and scheduling from becoming a second, parallel truth path.

8.2 Split / Merge

  • Split: leaders call Store.ProposeSplit, which writes a split AdminCommand into the parent region’s raft log. On apply, Store.SplitRegion updates the parent range/epoch and starts the child peer.
  • Merge: leaders call Store.ProposeMerge, writing a merge AdminCommand. On apply, the target region range/epoch is expanded and the source peer is stopped/removed from the manifest.
  • These operations are explicit/manual and are not auto-triggered by size/traffic heuristics.

9. Observability

  • store.RegionMetrics() feeds into StatsSnapshot, making region counts and backlog visible via expvar and nokv stats.
  • nokv regions shows the local peer catalog used for store recovery: ID, range, peers, state. It is a store-local recovery view, not cluster routing authority.
  • CHAOS_TRACE_METRICS=1 go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport exercises transport metrics under faults; scripts/dev/cluster.sh spins up multi-node clusters for manual inspection.

Store internals at a glance

| Component | File | Responsibility |
| --- | --- | --- |
| Store facade | store.go | Store construction/wiring and shared component ownership (router, region manager, command pipeline, scheduler runtime). |
| Peer lifecycle | peer_lifecycle.go | Start/stop peers, router registration, lifecycle hooks, and store shutdown sequencing. |
| Command ops | command_ops.go | Region/epoch/key-range validation and read/propose request handling. |
| Admin ops | admin_ops.go | Split/merge proposal handling and applied admin command side effects. |
| Membership ops | membership_ops.go | Conf-change proposal helpers and local region metadata updates after membership changes. |
| Region catalog | region_catalog.go | Public region catalog accessors and region metadata lifecycle operations. |
| Router | router.go | Tracks active peers and dispatches requests/messages to the owning peer. |
| Command pipeline | command_pipeline.go | Assigns request IDs, records proposals, matches apply results, returns responses/errors to callers. |
| Region manager | region_manager.go | Validates state transitions, persists local peer catalog updates, updates peer metadata, publishes scheduler-visible region state. |
| Scheduler runtime | scheduler_runtime.go | Periodically publishes region/store heartbeats, enforces cooldown & burst limits, and applies scheduling actions. |

10. Region Truth and Persistence Roles

NoKV intentionally keeps three different region views, and they do not have equal authority:

  1. Store-local region truth

    • Advanced only through raft apply paths and bootstrap/restart loading.
    • Owned by region_manager.go via:
      • applyRegionMeta
      • applyRegionState
      • applyRegionRemoval
      • loadBootstrapSnapshot
    • This is the runtime source of truth for what a store currently hosts.
  2. Store-local persistent mirror

    • Owned by raftstore/localmeta.
    • Persists:
      • local region catalog entries for restart
      • local raft WAL replay checkpoints
    • This mirror exists for local recovery only. It is not cluster routing authority and must not be treated as consensus truth outside the store.
  3. Coordinator control-plane view

    • Owned by Coordinator and persisted through coordinator/rootview.
    • Built from region/store heartbeats and allocator durability checkpoints.
    • Used for:
      • route lookup
      • scheduler decisions
      • allocator durability
    • Coordinator is not allowed to overwrite local raftstore truth directly.

The resulting rule is simple:

  • raft apply/bootstrap advances local truth
  • raftstore/localmeta mirrors that truth for restart
  • Coordinator observes and schedules from heartbeats

This separation is what prevents parallel truth sources from creeping back into the design.


11. Current Boundaries and Guarantees

  • Reads served through ReadCommand are leader-strong and pass a Raft linearizability barrier (LinearizableRead + WaitApplied).
  • Mutating NoKV RPC commands are serialized through Raft log replication and apply.
  • Command payload format on apply path is strict RaftCmdRequest encoding.
  • Region metadata (range/epoch/peers) is validated before both read and write command execution.
  • The store now records explicit execution-plane admission outcomes for read, write, and topology entry points.
  • Topology transitions now record explicit local execution outcome plus publish boundary state (planned published, terminal pending, terminal published, terminal failed).
  • Terminal publish failure is retained as visible retry state rather than being silently dropped.
  • Restart posture is now exposed explicitly from local durable recovery state (raftstore/localmeta + raft replay pointers).
  • raftstore/admin now exposes execution-plane diagnostics through ExecutionStatus for last admission, topology lifecycle, and restart posture.
  • nokv execution --addr <store-admin-addr> now queries that diagnostics surface directly from the CLI, with optional --region and --transition filters on topology output.
  • store.RegionMetrics + StatsSnapshot provide runtime visibility for region count, backlog, and scheduling health.

Coordinator

Coordinator is NoKV’s control-plane service for distributed mode. It exposes a gRPC API (pb.Coordinator) and is started by:

go run ./cmd/nokv coordinator --addr 127.0.0.1:2379

1. Responsibilities

Coordinator currently owns:

  • Routing: GetRegionByKey
  • Heartbeats: StoreHeartbeat, RegionHeartbeat
  • Region removal: RemoveRegion
  • ID service: AllocID
  • TSO: Tso

Runtime clients (for example cmd/nokv-fsmeta) use Coordinator as the routing source of truth, but Coordinator is not the durable owner of cluster topology truth. Durable truth lives in meta/root.


2. Runtime Architecture

flowchart LR
    Store["nokv serve"] -->|"StoreHeartbeat / RegionHeartbeat"| Coordinator["Coordinator (gRPC)"]
    Gateway["nokv-fsmeta"] -->|"GetRegionByKey / Tso"| Coordinator
    Coordinator --> Cluster["coordinator/catalog.Cluster"]
    Cluster --> Scheduler["leader-transfer hint planner"]

Core implementation units:

  • coordinator/catalog: in-memory cluster metadata model.
  • coordinator/idalloc: monotonic ID allocator used by Coordinator.
  • coordinator/rootview: persistence abstraction (Store) backed by the metadata root.
  • coordinator/server: gRPC service + RPC validation/error mapping.
  • coordinator/client: client wrapper used by store/gateway.
  • coordinator/adapter: scheduler sink that forwards heartbeats into Coordinator.

For the next-stage protocol direction on both the control plane and the paired execution plane, see docs/control_and_execution_protocols.md.

Control-Plane Protocol Status

Coordinator now uses a minimal formal control-plane protocol v1 for its key route and transition surfaces.

Already in active use:

  • route-read Freshness
  • rooted token serving metadata
  • rooted lag exposure
  • DegradedMode
  • CatchUpState
  • TransitionID
  • publish-time lifecycle assessment on PublishRootEvent

This means Coordinator no longer exposes only best-effort implementation behavior. It now returns explicit protocol state that callers, tests, and docs can rely on.

The current protocol is intentionally minimal. It does not yet expose the full future runtime/operator model such as stalled transitions or richer catch-up actions.

Minimal Eunomia vocabulary

The rooted handoff protocol is intentionally small. Docs and operator-facing surfaces should use the same vocabulary as the implementation and the Eunomia research note:

  • Tenure — the currently active authority record
  • Legacy — the retired predecessor era plus the frontier it already consumed
  • Handover — the rooted handoff record for the current successor
  • Era — the monotonic authority era
  • Witness — the operator-visible proof bundle that explains whether the current handoff state is safe

The four guarantees discussed by the docs and runtime metrics are:

  • Primacy — at most one authority era is active
  • Inheritance — the successor must cover the predecessor’s published work
  • Silence — a sealed predecessor must not keep serving
  • Finality — a handoff must not remain permanently half-finished

The mapping to concrete implementation types is direct:

| Doc term | Implementation term |
| --- | --- |
| Tenure | Tenure |
| Legacy | Legacy |
| Handover | Handover |
| Era | Era / era |
| Witness | HandoverWitness / continuation witness fields |
| Frontiers | MandateFrontiers / frontiers / consumed_frontiers |

Do not reintroduce Lease / Seal as public aliases. They are useful informally, but keeping them in formal docs creates two names for the same rooted objects and makes Eunomia harder to explain.


3. Deployment Model

NoKV ships exactly one distributed topology plus the standalone engine shape.

standalone

  • no coordinator
  • no meta/root
  • no control-plane process
  • all truth remains inside the single storage process

This is the default local engine shape. Standalone is not a degraded control plane deployment; it simply has no control plane.

separated meta-root + coordinator

  • three independent nokv meta-root processes own durable rooted truth (replicated raft quorum, the only backend NoKV ships)
  • one or more nokv coordinator processes connect through the remote metadata-root gRPC API
  • Tenure gates singleton Coordinator duties: AllocID, Tso, and scheduler operation planning
  • route reads still come from Coordinator’s rebuildable in-memory view and expose Freshness, RootToken, CatchUpState, and DegradedMode

Keep the same logical split inside every deployment:

  • meta/root/*: durable rooted truth (replicated + gRPC service)
  • coordinator/view + coordinator/catalog: rebuildable routing/scheduling state
  • coordinator/rootview: remote view of meta-root consumed by coordinator/server
  • coordinator/server: gRPC API surface

Product assumptions:

  • exactly three meta-root replicas
  • meta-root is the only place durable rooted truth lives
  • coordinators are stateless relative to rooted truth; only the Tenure differentiates active vs standby
  • no dynamic metadata-root membership
  • no production-grade dynamic coordinator membership manager

4. Persistence (--workdir)

--workdir is required for every formal Coordinator deployment that hosts rooted truth.

separated meta-root + coordinator

Each meta-root process has its own rooted workdir and raft transport address. The coordinator process does not host rooted truth; it only connects to the remote root endpoints through --root-peer nodeID=grpc_addr.

Each meta-root workdir persists two layers of state:

  1. rooted truth state
    • root.events.wal
    • root.checkpoint.binpb
  2. replicated protocol state
    • root.raft.bin
    • contains raft hard state, raft snapshot, and retained raft entries

Each meta-root node must have an isolated workdir. Workdirs are not shared.

Persistence ownership:

  1. meta-root workdirs own durable rooted truth and replicated metadata-root raft state.
  2. coordinator runtime view is rebuildable from remote meta/root.
  3. allocator fences and Tenure are rooted events, not local coordinator files.

--coordinator-id must be a stable configured identity. It is used for Tenure ownership and operator debugging; it should not be generated randomly on each restart.

Rooted bootstrap flow

The Coordinator storage layer rebuilds its region snapshot and allocator checkpoints by replaying rooted truth:

  • region descriptor publish/tombstone events rebuild the route catalog
  • allocator fences rebuild:
    • id_current
    • ts_current

id_current and ts_current are durable allocator fences, not necessarily the last values served. With allocator window preallocation they may be ahead of the last returned ID or timestamp; restart recovery intentionally resumes after the fence and skips unused values.

Startup flow:

  1. Open rooted coordinator/rootview against the 3 meta-root --root-peer endpoints.
  2. Reconstruct a rooted Coordinator snapshot (regions + allocator fences).
  3. Compute starts as max(cli_start, fence+1).
  4. Materialize the rooted region snapshot into coordinator/catalog.Cluster.

Coordinator periodically refreshes rooted state via the meta-root tail stream and rebuilds the service-side view. This avoids allocator rollback and keeps all durable truth inside meta-root.

Region Truth Hierarchy

NoKV intentionally keeps three region views with different authority:

  • Coordinator region catalog: cluster routing truth. Clients and stores must treat Coordinator as the authoritative key-to-region source at the service boundary, but Coordinator rebuilds this view from rooted metadata truth plus heartbeats.
  • raftstore/localmeta local catalog: store-local recovery truth. It exists so one store can restart hosted peers and replay raft WAL checkpoints even if Coordinator is temporarily unavailable.
  • Store.regions runtime catalog: in-memory cache/view rebuilt from local metadata at startup and then advanced by peer lifecycle plus raft apply.

These layers are not interchangeable. Local metadata is recovery state, not cluster routing authority.


5. Config Integration

raft_config.json supports Coordinator endpoint + workdir defaults:

"coordinator": {
  "addr": "127.0.0.1:2379",
  "docker_addr": "nokv-coordinator:2379",
  "work_dir": "./artifacts/cluster/coordinator",
  "docker_work_dir": "/var/lib/nokv-coordinator"
}

Resolution rules:

  • CLI override wins.
  • Otherwise read from config by scope (host / docker).

Helpers:

  • config.ResolveCoordinatorAddr(scope)
  • config.ResolveCoordinatorWorkDir(scope)
  • nokv-config coordinator --field addr|workdir --scope host|docker

Replicated-root transport settings are currently CLI-driven, not config-file driven.


6. Routing Source Convergence

NoKV now uses Coordinator-first routing:

  • raftstore/client resolves regions with GetRegionByKey.
  • raft_config regions are bootstrap/deployment metadata.
  • Runtime route truth comes from Coordinator heartbeats + Coordinator region catalog.

This avoids dual sources drifting over time (config vs Coordinator).


7. Serve Mode Semantics

nokv serve is now Coordinator-only:

  • --coordinator-addr is required.
  • Runtime routing/scheduling control-plane state is sourced from Coordinator.

For restart and recovery, nokv serve intentionally separates runtime truth from deployment metadata:

  • hosted region/peer truth comes from raftstore/localmeta
  • raft durable progress comes from the store workdir (WAL, raft log, local metadata)
  • raft_config.json is used only to resolve static addresses (Coordinator, store listen, store transport)

This means:

  • bootstrap-time config.regions are not replayed during restart
  • runtime split/merge/peer-change results continue to come back from local recovery state
  • --store-addr is an exceptional static address override, not the normal restart path
  • --store-id must match the durable workdir identity when the workdir was already used

The recommended restart shape is therefore:

nokv serve \
  --config ./raft_config.example.json \
  --scope host \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1

serve will:

  1. load the local peer catalog from the store workdir
  2. derive the current remote peer set from local metadata
  3. use config stores only to map storeID -> addr

If static transport overrides are needed, prefer stable store identities:

nokv serve \
  --config ./raft_config.example.json \
  --scope host \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1 \
  --store-addr 2=10.0.0.12:20160

Related CLI behavior:

  • Inspect control-plane state through Coordinator APIs/metrics.
  • nokv coordinator --metrics-addr <host:port> exposes native expvar on /debug/vars.
  • nokv serve --metrics-addr <host:port> exposes store/runtime expvar on /debug/vars.

8. Service Semantics

Coordinator intentionally separates rooted truth leadership from the outer gRPC service surface.

In 3 coordinator + replicated meta:

  • all three coordinator processes may listen and serve RPC
  • only the rooted leader may commit truth writes
  • followers refresh rooted state and serve read/view traffic

In separated mode:

  • meta-root leadership determines which root endpoint accepts truth writes
  • Tenure determines which Coordinator may serve singleton duties
  • non-holder Coordinators may still serve route reads if their rooted view satisfies the caller’s freshness contract

Leader-only writes

These RPCs require rooted leadership:

  • RegionHeartbeat
  • PublishRootEvent
  • RemoveRegion
  • AllocID
  • Tso

Followers return FailedPrecondition with coordinator not leader semantics, and clients are expected to retry against another Coordinator endpoint.

In separated mode, AllocID, Tso, and scheduler operation planning also require the local Coordinator to hold Tenure.

Any-node reads

These RPCs may be served by any Coordinator node:

  • GetRegionByKey
  • StoreHeartbeat handling and store-view inspection

Follower reads are driven by a rooted watch-first tail subscription, with explicit refresh/reload as fallback into coordinator/catalog.Cluster. They are expected to be slightly stale rather than linearizable.

For GetRegionByKey, that follower-service behavior is now explicit in the protocol surface:

  • callers can request Freshness
  • responses include rooted token metadata
  • responses disclose DegradedMode and CatchUpState
  • bounded-freshness reads may be rejected if rooted lag exceeds the requested limit

Client behavior

coordinator/client accepts multiple Coordinator addresses. Write RPCs retry across Coordinator nodes and converge on the rooted leader. Read RPCs may use any available Coordinator endpoint.


9. Deployment Example

NoKV ships exactly one topology: a 3-peer replicated meta-root cluster plus one or more coordinator processes that talk to it over gRPC.

Start three metadata-root peers (peer map is identical on all three; only -addr, -workdir, -node-id, -transport-addr differ):

go run ./cmd/nokv meta-root \
  -addr 127.0.0.1:2380 \
  -workdir ./artifacts/cluster/meta-root-1 \
  -node-id 1 \
  -transport-addr 127.0.0.1:3380 \
  -peer 1=127.0.0.1:3380 \
  -peer 2=127.0.0.1:3381 \
  -peer 3=127.0.0.1:3382

Start one coordinator (add more by giving each a distinct -coordinator-id and -addr; they share the same -root-peer set):

go run ./cmd/nokv coordinator \
  -addr 127.0.0.1:2379 \
  -coordinator-id c1 \
  -root-peer 1=127.0.0.1:2380 \
  -root-peer 2=127.0.0.1:2381 \
  -root-peer 3=127.0.0.1:2382

Current product assumptions:

  • exactly three meta-root replicas
  • meta-root is the only place durable rooted truth lives
  • coordinators are stateless relative to rooted truth; only the Tenure differentiates active vs standby
  • no dynamic metadata-root membership

For local bootstrap, use:

./scripts/dev/cluster.sh --config ./raft_config.example.json

10. Comparison: TinyKV / TiKV

TinyKV (teaching stack)

  • Uses a scheduler server (tinyscheduler) as a separate process.
  • Control plane integrates embedded etcd for metadata persistence.
  • Educational architecture, minimal production hardening.

TiKV (production stack)

  • The Coordinator-equivalent component (PD) is an independent, highly available cluster.
  • PD internally uses embedded etcd (Raft) for durable metadata + leader election.
  • Rich scheduling and balancing policies, rolling updates, robust ops tooling.

NoKV Coordinator (current)

  • Standalone mode has no Coordinator and no metadata-root service.
  • Distributed mode ships exactly one control-plane deployment: separated meta-root + coordinator (see section 3).
  • meta-root is the durable truth service; the coordinator is a remote rooted view/service layer that rebuilds its service-side view from rooted truth.
  • Coordinator persistence is intentionally limited to rooted control-plane truth:
    • region descriptor publish/tombstone events
    • allocator durability (AllocID, TSO)
    • Tenure ownership for separated singleton duties
  • Coordinator is not the durable owner of a store’s local raft/region truth. Store restart truth remains in raftstore/localmeta, while Coordinator keeps routing and scheduling state rebuilt from meta/root.

11. Current Limitations / Next Steps

  • The separated meta-root + coordinator deployment is the only shipped distributed topology, and it keeps a deliberately small HA surface:
    • fixed three meta-root replicas
    • no dynamic metadata membership
    • follower convergence uses watch-first tailing with refresh/reload fallback
  • The deployment is still maturing operationally:
    • use it for control-plane research and failure-domain experiments
    • do not treat it as a hardened production path yet
    • failure/recovery E2E tests and eunomia benchmarks still need to be expanded before stronger claims are made
  • Scheduler policy is intentionally small (leader transfer focused).
  • No advanced placement constraints yet.

These are deliberate scope limits for a fast-moving experimental platform that keeps the rooted truth surface small.

Rooted Truth — meta/root

The meta/root/ tree implements NoKV’s rooted truth kernel: a typed, append-only event log whose committed tail is the single source of truth for cluster-level metadata (coordinator leases, allocator fences, region lifecycle, pending peer/range changes).

If the distributed system has a “brain”, it does not live in the coordinator. It lives here. The coordinator is a service+view on top of this log.


1. Why a separate truth layer

In a typical multi-raft system, the metadata used by the control plane (routes, TSO, leases, scheduling decisions) is either:

  1. Stored inside one of the raft groups (mixed with user data)
  2. Owned by a single coordinator node (coordinator becomes the bottleneck)
  3. Split across ad-hoc persistence files

NoKV makes it explicit: there is a small, typed metadata log with its own durability and replication shape, and everything control-plane-related goes through it. This matches the “virtual consensus” pattern (Delos-lite): the log is the truth; services above are views that can be rebuilt.

The benefits are concrete:

  • Coordinator is stateless at restart — the only persistent thing about a coordinator is its configured holder ID; everything else is rebuilt from meta/root on boot
  • The log can be swapped between local (single-node) and replicated (embedded-raft) backends without changing coordinator code
  • Authority handoff is auditable — every tenure issue / legacy seal / handover event is a committed log record with a cursor

2. Package layout

meta/
├── root/
│   ├── protocol/       # Pure protocol types (Cursor, Frontiers, Handoff, Witness, ...)
│   ├── event/          # Typed events (KindStoreJoined, KindTenure, ...)
│   ├── state/          # Compact applied state (State, Snapshot, ApplyEventToSnapshot)
│   ├── materialize/    # Helpers that build Snapshot from raw events
│   ├── storage/        # Virtual log file layout + checkpoint format
│   │   └── file/       # Actual on-disk file operations
│   ├── backend/
│   │   ├── local/      # Single-node file-backed log
│   │   └── replicated/ # Embedded raft-backed log (quorum durability)
│   └── remote/         # gRPC service + client for remote rooted access
└── wire/               # proto <-> Go conversions (Event, Snapshot, Cursor)

3. What lives in rooted state

meta/root/state/state.go defines State, the applied snapshot. Everything the control plane cares about is here:

type State struct {
    ClusterEpoch    uint64   // bumped on topology event
    MembershipEpoch uint64   // bumped on store join/leave
    LastCommitted   Cursor   // highest committed (term, index)
    IDFence         uint64   // globally fenced ID allocator floor
    TSOFence        uint64   // globally fenced TSO allocator floor
    Tenure          Tenure
    Legacy          Legacy
    Handover        Handover
}

Snapshot wraps State together with descriptors and pending peer/range changes:

type Snapshot struct {
    State               State
    Descriptors         map[uint64]descriptor.Descriptor
    PendingPeerChanges  map[uint64]PendingPeerChange
    PendingRangeChanges map[uint64]PendingRangeChange
}

Every event kind has a deterministic effect on Snapshot. See meta/root/state/snapshot_apply.go and meta/root/state/state.go:ApplyEventToState.


4. The Append protocol

The core interface any backend must satisfy:

type Backend interface {
    Snapshot() (rootstate.Snapshot, error)
    Append(ctx context.Context, events ...rootevent.Event) (rootstate.CommitInfo, error)
    FenceAllocator(ctx context.Context, kind AllocatorKind, min uint64) (uint64, error)
}

Append does five things atomically:

  1. Validate events against the current Snapshot (reject duplicate region IDs, invalid transitions, stale epochs)
  2. Assign each event a committed cursor (Term, Index)
  3. Persist the batch to the backing log
  4. Persist an updated compact Checkpoint
  5. Advance in-memory State + Descriptors + pending maps

After Append returns successfully, callers can observe the new state via Snapshot(). The CommitInfo.Cursor they get back is the globally ordered cursor for the last event in the batch.

FenceAllocator is separate because it’s an authoritative minimum — backends may promote the fence further (e.g., to account for outstanding windows) but must never return a value below min.
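A minimal sketch of driving this contract, assuming an illustrative rootevent.NewStoreJoined constructor (the real constructors live in meta/root/event) and using only the Backend, CommitInfo.Cursor, and Snapshot.State shapes described above (imports omitted, as in the other snippets on this page):

func registerStore(ctx context.Context, b Backend, storeID uint64) error {
    // Append validates against the current Snapshot, persists the batch and
    // checkpoint, and advances in-memory state before returning.
    info, err := b.Append(ctx, rootevent.NewStoreJoined(storeID))
    if err != nil {
        return err // rejected (duplicate ID, stale epoch, ...) or not durable
    }
    // info.Cursor is the globally ordered (term, index) of the last event in the batch.
    log.Printf("store %d committed at cursor %+v", storeID, info.Cursor)

    // Snapshot() now reflects everything up to that cursor.
    snap, err := b.Snapshot()
    if err != nil {
        return err
    }
    log.Printf("membership epoch is now %d", snap.State.MembershipEpoch)
    return nil
}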


5. Single backend: replicated

NoKV ships one meta-root backend: the 3-peer raft-replicated cluster. The historical single-process “local” backend has been removed.

meta/root/replicated/store.go — embedded raft library, quorum-durable commits.

  • exactly 3 replicas, one leader
  • Append proposes a raft log entry; returns after it’s committed to quorum
  • Non-leader nodes reject Append with codes.FailedPrecondition
  • Leader changes trigger IsLeader() / LeaderID() state updates that coordinator consumes
  • On-disk state per peer: root.events.wal, root.checkpoint.binpb, root.raft.bin (raft hard state + snapshot + retained entries)

6. Coordinator commands — how tenure/legacy/handover flow in

In addition to “raw” events, backends expose command APIs for control-plane-specific operations:

ApplyTenure(ctx, cmd TenureCommand) (EunomiaState, error)

ApplyHandover(ctx, cmd HandoverCommand) (EunomiaState, error)

These are validated, typed writes that internally:

  1. Validate the command against current state (e.g., Seal requires an active Tenure, Confirm requires prior Legacy)
  2. Emit the appropriate KindTenure / KindLegacy / KindHandover event
  3. Append through the normal log path
  4. Return the new EunomiaState = { Tenure, Legacy, Handover }

Command-level validation lives in meta/root/state/eunomia.go.


7. Tail subscription — how coordinator consumes

Coordinators don’t poll Snapshot() — they subscribe:

sub := rootstorage.NewTailSubscription(afterToken, waitFn)
advance, err := sub.Next(ctx, fallback)
if err != nil {
    return err
}
if advance.Action == rootstorage.TailCatchUpAction_Reload {
    // backend advanced far; reload snapshot
} else if advance.Action == rootstorage.TailCatchUpAction_Bootstrap {
    // backend advanced past our retention window; install from compact state
}
sub.Acknowledge(advance)

meta/root/storage/virtual_log.go defines:

  • TailToken — opaque position in the log
  • TailAdvance — either new events, a reload signal, or a bootstrap install
  • TailSubscription — stateful iterator that survives across reloads

This is what lets coordinator/ run as a thin service without duplicating rooted storage.


8. Recovery model

On coordinator boot:

  1. Open the replicated backend (each peer via rootreplicated.Open, or connect through coordinator/rootview for the client side)
  2. Call Snapshot() — backend replays/bootstraps internally
  3. Build a TailSubscription from the snapshot’s LastCommitted
  4. Start the tenure campaign loop, which will eventually ApplyTenure(Issue) when it’s leader

If the backend file is corrupted, the coordinator fails fast — it does not try to reconstruct rooted state from raftstore local metadata. The two are deliberately partitioned.


9. Remote access

meta/root/remote/ provides a gRPC service + client. This exists so a raftstore store can read rooted state without being colocated with the replicated backend:

  • RemoteRootService serves Snapshot, Append, WaitForTail, ObserveTail, etc.
  • RemoteRootClient implements the same rootBackend interface by calling over gRPC
  • Leader redirect is automatic: if the target returns NotLeader, client re-dials the returned leader

This is what keeps coordinator/ deployable separately from the rooted log, if you ever want to.


10. Source map

File | Responsibility
meta/root/protocol/types.go | Pure protocol types (no persistence logic)
meta/root/event/types.go | Typed event constructors
meta/root/state/state.go | State, Snapshot, ApplyEventToSnapshot
meta/root/state/eunomia.go | Tenure/Legacy/Handover validation + digest
meta/root/state/transition.go | Cross-event transition rules
meta/root/storage/virtual_log.go | Tail subscription + checkpoint primitives
meta/root/replicated/store.go | The only backend: 3-peer raft-replicated meta-root
meta/root/server/service.go, meta/root/client/client.go | gRPC service + client for meta-root
meta/wire/root.go | proto ↔ Go conversions

Related docs:

Percolator Distributed Transaction Design

This document explains NoKV’s distributed transaction path implemented by percolator/ and executed through raftstore.

The scope here is the current code path:

  • Prewrite
  • Commit
  • BatchRollback
  • ResolveLock
  • CheckTxnStatus
  • MVCC read visibility (KvGet/KvScan through percolator.Reader)

1. Where It Runs

Percolator logic is executed on the Raft apply path:

  1. Client sends NoKV RPC (KvPrewrite, KvCommit, …).
  2. raftstore/kv/service.go wraps it into a RaftCmdRequest.
  3. Store proposes command through Raft.
  4. On apply, raftstore/kv/apply.go dispatches to percolator.*.

sequenceDiagram
    participant C as raftstore/client
    participant S as kv.Service
    participant R as Raft (leader->followers)
    participant A as kv.Apply
    participant P as percolator
    participant DB as NoKV DB
    C->>S: KvPrewrite/KvCommit...
    S->>R: ProposeCommand(RaftCmdRequest)
    R->>A: Apply committed log
    A->>P: percolator.Prewrite/Commit...
    P->>DB: CFDefault/CFLock/CFWrite reads+writes
    A-->>S: RaftCmdResponse
    S-->>C: NoKV RPC response

Key files:

1.1 RPC to Percolator Function Mapping

NoKV RPC | kv.Apply branch | Percolator function
KvPrewrite | CMD_PREWRITE | Prewrite
KvCommit | CMD_COMMIT | Commit
KvBatchRollback | CMD_BATCH_ROLLBACK | BatchRollback
KvResolveLock | CMD_RESOLVE_LOCK | ResolveLock
KvCheckTxnStatus | CMD_CHECK_TXN_STATUS | CheckTxnStatus
KvGet | CMD_GET | Reader.GetLock + Reader.GetValue
KvScan | CMD_SCAN | Reader.GetLock + CFWrite iteration + GetInternalEntry

2. MVCC Data Model

NoKV uses three MVCC column families:

  • CFDefault: stores user values at start_ts
  • CFLock: stores lock metadata at fixed lockColumnTs = MaxUint64
  • CFWrite: stores commit records at commit_ts

2.1 Lock Record

percolator.Lock (encoded by EncodeLock):

  • Primary
  • Ts (start timestamp)
  • TTL
  • Kind (Put/Delete/Lock)
  • MinCommitTs

2.2 Write Record

percolator.Write (encoded by EncodeWrite):

  • Kind
  • StartTs
  • ShortValue (codec supports it; current commit path does not populate it)

3. Concurrency Control: Latches

Before mutating keys, percolator acquires striped latches:

  • latch.Manager hashes keys to stripe mutexes.
  • Stripes are deduplicated and acquired in sorted order to avoid deadlocks.
  • Guard releases in reverse order.

In raftstore/kv, latches are passed explicitly:

  • NewEntryApplier creates one latch.NewManager(512) and reuses it.
  • Apply / NewApplier accept an injected manager; nil falls back to latch.NewManager(512).

This serializes conflicting apply operations on overlapping keys in one node.
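To make the stripe discipline concrete, here is a self-contained toy version of the idea (it is not the real latch.Manager API): keys hash to a fixed pool of mutexes, stripes are deduplicated and locked in ascending order, and the returned release function unlocks in reverse.

package latchsketch

import (
	"hash/fnv"
	"sort"
	"sync"
)

type Manager struct {
	stripes []sync.Mutex
}

func NewManager(n int) *Manager {
	return &Manager{stripes: make([]sync.Mutex, n)}
}

func (m *Manager) stripeOf(key []byte) int {
	h := fnv.New32a()
	h.Write(key)
	return int(h.Sum32()) % len(m.stripes)
}

// Acquire locks every stripe covering keys and returns the matching release func.
func (m *Manager) Acquire(keys [][]byte) (release func()) {
	seen := make(map[int]struct{})
	ids := make([]int, 0, len(keys))
	for _, k := range keys {
		id := m.stripeOf(k)
		if _, dup := seen[id]; dup {
			continue // dedupe: two keys on the same stripe share one lock
		}
		seen[id] = struct{}{}
		ids = append(ids, id)
	}
	sort.Ints(ids) // one global order prevents lock-order deadlocks between callers
	for _, id := range ids {
		m.stripes[id].Lock()
	}
	return func() {
		for i := len(ids) - 1; i >= 0; i-- { // release in reverse acquisition order
			m.stripes[ids[i]].Unlock()
		}
	}
}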


4. Two-Phase Commit Flow

Client side (raftstore/client.Client.TwoPhaseCommit):

  1. Group mutations by region.
  2. Prewrite primary region.
  3. Prewrite secondary regions.
  4. Commit primary region.
  5. Commit secondary regions.

sequenceDiagram
    participant Cli as Client
    participant R1 as Region(primary)
    participant R2 as Region(secondary)
    Cli->>R1: Prewrite(primary + local muts)
    Cli->>R2: Prewrite(secondary muts)
    Cli->>R1: Commit(keys,startTs,commitTs)
    Cli->>R2: Commit(keys,startTs,commitTs)

5. Write-Side Operations

5.1 Prewrite

Prewrite runs mutation-by-mutation:

  1. Check existing lock on key:
    • if lock exists with different Ts -> KeyError.Locked
  2. Check latest committed write:
    • if commit_ts >= req.start_version -> WriteConflict
  3. Apply data intent:
    • Put: write value into CFDefault at start_ts
    • Delete/Lock: delete default value at start_ts (if exists)
  4. Write lock into CFLock at lockColumnTs

5.2 Commit

For each key:

  1. Read lock
  2. If no lock:
    • if write with same start_ts exists -> idempotent success
    • else -> abort (lock not found)
  3. If lock Ts != start_version -> KeyError.Locked
  4. commitKey:
    • if min_commit_ts > commit_version -> CommitTsExpired
    • if write with same start_ts already exists:
      • rollback write -> abort
      • write with a different commit ts -> treat as success, clean up the lock
      • same commit ts -> success
    • else write CFWrite[key@commit_ts] = {kind,start_ts}
    • remove lock from CFLock

5.3 BatchRollback

For each key:

  1. If already has write at start_ts:
    • rollback marker already exists -> success
    • non-rollback write exists -> success (already committed)
  2. Remove lock (if any)
  3. Remove default value at start_ts (if any)
  4. Write rollback marker to CFWrite at start_ts

5.4 ResolveLock

  • commit_version == 0 -> rollback matching locks
  • commit_version > 0 -> commit matching locks
  • Returns number of resolved keys

6. Transaction Status Check

CheckTxnStatus targets the primary key and decides whether txn is alive, committed, or should be rolled back.

Decision order:

  1. Read lock on primary
  2. If lock exists but lock.ts != req.lock_ts -> KeyError.Locked
  3. If lock exists and TTL expired (current_ts >= lock.ts + ttl):
    • rollback primary
    • action = TTLExpireRollback
  4. If lock exists and caller pushes timestamp:
    • min_commit_ts = max(min_commit_ts, caller_start_ts+1)
    • action = MinCommitTsPushed
  5. If no lock, check write by start_ts:
    • committed write -> return commit_version
    • rollback write -> action LockNotExistRollback
  6. If no lock and no write, and rollback_if_not_exist is true:
    • write rollback marker
    • action LockNotExistRollback

7. Read Path Semantics (MVCC Visibility)

KvGet and KvScan read through percolator.Reader:

  1. Check lock first:
    • if lock exists and read_ts >= lock.ts, return locked error
  2. Find visible write in CFWrite:
    • latest commit_ts <= read_ts
  3. Interpret write kind:
    • Delete/Rollback => not found
    • Put => read value from CFDefault at start_ts

Notes:

  • KvScan currently rejects reverse scan.
  • scanWrites uses internal iterator over CFWrite.
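The same visibility rules in a self-contained toy model (Go maps stand in for the three column families, imports omitted; the real path goes through percolator.Reader):

type WriteKind int

const (
	WritePut WriteKind = iota
	WriteDelete
	WriteRollback
)

type Write struct {
	Kind    WriteKind
	StartTs uint64
}

type Lock struct{ Ts uint64 }

type toyStore struct {
	locks    map[string]Lock              // CFLock: at most one lock per key
	writes   map[string]map[uint64]Write  // CFWrite: commit_ts -> write record
	defaults map[string]map[uint64][]byte // CFDefault: start_ts -> value
}

var ErrKeyLocked = errors.New("key is locked")

// Get follows the documented order: lock check, newest visible write, CFDefault fetch.
func (s *toyStore) Get(key string, readTs uint64) ([]byte, bool, error) {
	// 1. A lock with lock.ts <= read_ts blocks the read.
	if l, ok := s.locks[key]; ok && readTs >= l.Ts {
		return nil, false, ErrKeyLocked
	}
	// 2. The newest commit with commit_ts <= read_ts wins.
	var (
		bestTs uint64
		best   Write
		found  bool
	)
	for commitTs, w := range s.writes[key] {
		if commitTs <= readTs && (!found || commitTs > bestTs) {
			bestTs, best, found = commitTs, w, true
		}
	}
	if !found {
		return nil, false, nil
	}
	// 3. Delete / Rollback mean "not found"; Put points back to CFDefault at start_ts.
	if best.Kind != WritePut {
		return nil, false, nil
	}
	val, ok := s.defaults[key][best.StartTs]
	return val, ok, nil
}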

8. Error and Idempotency Behavior

Operation | Idempotency / Conflict behavior
Prewrite | Rejects lock conflicts and write conflicts; returns per-key KeyError list.
Commit | Idempotent for already committed keys with same start_ts; stale/missing lock may abort.
BatchRollback | Safe to repeat; rollback marker prevents duplicate side effects.
ResolveLock | Safe to retry per key set; resolves only matching start_ts locks.
CheckTxnStatus | May push min_commit_ts, rollback expired primary lock, or return committed version.

9. Current Operational Boundaries

  • Percolator execution is tied to NoKV RPC + Raft apply path, with the command shape still following the TinyKV/TiKV MVCC model.
  • Latch scope is process-local when one store shares a single latch.Manager; region correctness still comes from Raft ordering.
  • Write.ShortValue and Write.ExpiresAt are codec fields; current commit path stores primary value bytes in CFDefault and reads from there when short value is not present.

10. Validation and Tests

Primary coverage:

These tests cover 2PC happy path, lock conflicts, status checks, resolve/rollback behavior, and client region-aware retries.

FSMetadata

TL;DR

  • Topic: NoKV’s namespace metadata substrate.
  • Core objects: Mount, Inode, Dentry, SubtreeAuthority, SnapshotEpoch, QuotaFence, UsageCounter.
  • Call chain: fsmeta/client -> fsmeta/server -> fsmeta/exec -> TxnRunner -> raftstore/percolator/coordinator.
  • Code contract: wire is in pb/fsmeta/fsmeta.proto, the executor is in fsmeta/exec, and the default NoKV runtime is fsmeta/exec.OpenWithRaftstore.

1. Conclusion

fsmeta is NoKV’s native metadata service. It isn’t a FUSE frontend, it doesn’t handle object body I/O, and it doesn’t promise full POSIX. What it provides is a metadata substrate that distributed filesystems, object-storage namespaces, and AI dataset metadata can all reuse.

The value of this layer isn’t a particular choice of keys for encoding inodes and dentries. The real boundary is that common namespace operations are exposed as server-side primitives, instead of asking each upper-layer application to stitch a protocol together out of Get / Put / Scan.

2. Current API

The current v1 API is defined by pb/fsmeta/fsmeta.proto. fsmeta/server exposes gRPC; fsmeta/client provides a Go typed client.

RPC | Current semantics
Create | Atomically creates a dentry and inode; the server uses AssertionNotExist to reject duplicate creation.
Lookup | Read a dentry by (mount, parent_inode, name).
ReadDir | Scan one directory page by dentry prefix.
ReadDirPlus | Scan dentries and batch-read inode attrs under the same snapshot version.
WatchSubtree | A prefix-scoped change feed; supports ready, ack, back-pressure, and cursor replay.
SnapshotSubtree | Publishes a stable MVCC read version; subsequent ReadDir / ReadDirPlus can use it to read the snapshot.
RetireSnapshotSubtree | Proactively retire a snapshot epoch.
GetQuotaUsage | Read the persistent quota usage counter for a mount/scope.
RenameSubtree | Atomically move the root dentry of a subtree; descendants follow naturally via inode references.
Link | Create a second dentry for an existing non-directory inode and increment link count in the same transaction.
Unlink | Delete a dentry; decrement link count and delete the inode record when the last link is removed.

3. Data model

fsmeta’s key schema is in fsmeta/keys.go; value schema is in fsmeta/value.go.

Object | Storage location | Notes
Mount metadata key | EncodeMountKey | Reserved mount-level data key; mount lifecycle truth does not live here.
Inode | EncodeInodeKey(mount, inode) | File/directory attributes, including size, mode, link_count.
Dentry | EncodeDentryKey(mount, parent, name) | Mapping from parent/name to inode.
Chunk | EncodeChunkKey(mount, inode, chunk) | Schema is in place; the current fsmeta API doesn’t expose object body / chunk I/O.
Session | EncodeSessionKey(mount, session) | Schema reserved for later session/lease use.
Usage | EncodeUsageKey(mount, scope) | Quota usage counter; scope=0 means mount-wide, non-zero means a direct accounting scope.

Both keys and values carry a magic + schema version. Values use a hand-written binary layout — not JSON.

4. Execution boundary

fsmeta/exec.Executor depends on a single narrow interface:

type TxnRunner interface {
    ReserveTimestamp(ctx context.Context, count uint64) (uint64, error)
    Get(ctx context.Context, key []byte, version uint64) ([]byte, bool, error)
    BatchGet(ctx context.Context, keys [][]byte, version uint64) (map[string][]byte, error)
    Scan(ctx context.Context, startKey []byte, limit uint32, version uint64) ([]KV, error)
    Mutate(ctx context.Context, primary []byte, mutations []*kvrpcpb.Mutation, startVersion, commitVersion, lockTTL uint64) error
}

The default runtime uses OpenWithRaftstore to wire up coordinator, raftstore client, TSO, watch source, mount/quota cache, snapshot publisher, and subtree handoff publisher. Embedded users can use this entry point directly; tests and custom deployments can keep passing in their own TxnRunner.
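As a sketch of how a metadata write can be expressed against that narrow surface: the key encoders EncodeDentryKey / EncodeInodeKey come from fsmeta/keys.go as documented below, the kvrpcpb.Mutation fields follow the TinyKV/TiKV shape, and encodeDentryValue, ErrExists, and defaultLockTTL are hypothetical names. The production Create path also attaches AssertionNotExist instead of the separate read shown here.

func createEntry(ctx context.Context, txn TxnRunner, mount, parent uint64, name string, inode uint64, attrs []byte) error {
	// Reserve two timestamps in one TSO round trip, assuming the returned
	// value is the first of a contiguous window: start and commit version.
	start, err := txn.ReserveTimestamp(ctx, 2)
	if err != nil {
		return err
	}
	commit := start + 1

	dentryKey := EncodeDentryKey(mount, parent, name)
	inodeKey := EncodeInodeKey(mount, inode)

	// Illustrative duplicate-creation check (the real server uses AssertionNotExist).
	if _, exists, err := txn.Get(ctx, dentryKey, start); err != nil {
		return err
	} else if exists {
		return ErrExists
	}

	// One Percolator transaction: the dentry is the primary key, the inode rides along.
	muts := []*kvrpcpb.Mutation{
		{Op: kvrpcpb.Op_Put, Key: dentryKey, Value: encodeDentryValue(inode)},
		{Op: kvrpcpb.Op_Put, Key: inodeKey, Value: attrs},
	}
	return txn.Mutate(ctx, dentryKey, muts, start, commit, defaultLockTTL)
}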

The layering constraints are:

  • Executor does not directly know about raft region / store routing.
  • OpenWithRaftstore is NoKV’s default adapter; it owns the raftstore wiring.
  • meta/root does not store high-frequency inode/dentry data — only lifecycle / authority truth.
  • raftstore and percolator don’t understand fsmeta semantics; they only provide transactions and apply observation.

5. Native primitives

ReadDirPlus

ReadDirPlus is the most direct shape advantage today: one dentry scan plus one BatchGet of inode attrs, all read under the same snapshot version. A generic-KV baseline has to do a point lookup per dentry after the scan, producing an N+1 read pattern.

Strict semantics: if any inode is missing or fails to decode, the whole page returns an error. fsmeta does not return half-true directory pages.
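The shape in code, assuming the documented key encoders, that KV exposes a Value field holding the encoded dentry, and hypothetical decodeInodeID / ErrIncompletePage helpers (the executor's actual implementation differs in details):

func readDirPlus(ctx context.Context, txn TxnRunner, mount, dir uint64, limit uint32, version uint64) ([]KV, map[string][]byte, error) {
	// 1. One range scan over the directory's dentry prefix
	//    (using the empty name as the prefix start key is an assumption here).
	dentries, err := txn.Scan(ctx, EncodeDentryKey(mount, dir, ""), limit, version)
	if err != nil {
		return nil, nil, err
	}
	// 2. One batched inode read at the same MVCC version. The generic-KV
	//    baseline issues one extra Get per dentry instead: the N+1 pattern.
	inodeKeys := make([][]byte, 0, len(dentries))
	for _, d := range dentries {
		inodeKeys = append(inodeKeys, EncodeInodeKey(mount, decodeInodeID(d.Value)))
	}
	attrs, err := txn.BatchGet(ctx, inodeKeys, version)
	if err != nil {
		return nil, nil, err
	}
	// Strict semantics: a missing inode fails the whole page rather than
	// returning a half-true directory listing.
	if len(attrs) != len(inodeKeys) {
		return nil, nil, ErrIncompletePage
	}
	return dentries, attrs, nil
}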

WatchSubtree

WatchSubtree subscribes to an fsmeta key prefix and externally exposes a (region_id, term, index) cursor and a commit_version. Event sources include:

  • a successful CMD_COMMIT;
  • CMD_RESOLVE_LOCK with commit_version != 0.

CMD_PREWRITE, rollback, and diagnostic commands do not produce visible events.

v1 already supports:

  • ready signal;
  • back-pressure window;
  • ack;
  • per-region recent ring;
  • resume cursor replay;
  • ErrWatchCursorExpired when a cursor has expired.

SnapshotSubtree

SnapshotSubtree only publishes a read epoch — it does not copy the directory tree. The token shape is (mount, root_inode, read_version). Subsequent ReadDir / ReadDirPlus use snapshot_version to read the same MVCC view.

SnapshotEpochPublished / SnapshotEpochRetired rooted events are already in place. The data-plane MVCC GC does not yet use these epochs as a retention lower bound — that’s the next piece of work for the GC layer.

RenameSubtree

The current dentry schema references the parent directory via parent_inode_id, so a subtree rename’s physical write volume is the same as a regular rename: delete the old root dentry, write the new root dentry. Descendants don’t have to be rewritten one by one.

The extra semantics live in the authority layer:

  • publish SubtreeHandoffStarted before mutation;
  • the dentry mutation goes through Percolator 2PC;
  • publish SubtreeHandoffCompleted after mutation;
  • the runtime monitor uses WatchRootEvents to discover pending handoffs and complete them.

This design prioritizes guaranteeing that rooted authority can never get stuck in an unknown state forever. In extreme cases it may advance an empty era — but it will not leave behind an orphaned pending handoff that no one will repair.

Link / Unlink

Link is allowed only for non-directory inodes. It creates a new dentry and increments InodeRecord.LinkCount in the same transaction.

Unlink deletes one dentry and updates the inode based on link count:

  • link count > 1: decrement and write back the inode;
  • link count ≤ 1: delete the inode record.

Hard links to directories remain illegal.

Quota Fence

The quota fence is rooted truth; the usage counter is a data-plane key. The write path packs the usage counter mutation and the dentry/inode mutation into the same Percolator transaction.

This solves two problems:

  • multiple nokv-fsmeta gateways won’t each maintain a local counter and breach the limit;
  • a gateway restart won’t lose usage.

Fence changes are pushed to the fsmeta runtime via the coordinator root-event stream; on cache miss, the runtime falls back to querying the coordinator.

6. Rooted truth vs runtime view

Domain | Rooted truth | Runtime view
Mount | MountRegistered / MountRetired | fsmeta mount admission cache; a retired mount closes related watch subscriptions.
Subtree authority | SubtreeAuthorityDeclared / SubtreeHandoffStarted / SubtreeHandoffCompleted | RenameSubtree frontier, pending handoff repair.
Snapshot epoch | SnapshotEpochPublished / SnapshotEpochRetired | snapshot-version reads.
Quota fence | QuotaFenceUpdated | quota fence cache + persisted usage counter keys.
WatchSubtree | Not in meta/root | raftstore apply observer + fsmeta router.

On startup, nokv-fsmeta first pulls ListMounts / ListQuotaFences / ListSubtreeAuthorities for bootstrap, then follows subsequent changes via WatchRootEvents. MonitorInterval is the reconnect backoff after the root-event stream drops — not a steady-state polling interval.

7. Deployment

Docker Compose brings up meta-root, coordinator, raftstore, and fsmeta gateway, and registers a default mount via mount-init:

docker compose up -d

Start the fsmeta gateway directly:

go run ./cmd/nokv-fsmeta \
  --addr 127.0.0.1:8090 \
  --coordinator-addr 127.0.0.1:2390,127.0.0.1:2391,127.0.0.1:2392 \
  --metrics-addr 127.0.0.1:9400

Register a mount:

nokv mount register \
  --coordinator-addr 127.0.0.1:2390,127.0.0.1:2391,127.0.0.1:2392 \
  --mount default \
  --root-inode 1 \
  --schema-version 1

Set quota:

nokv quota set \
  --coordinator-addr 127.0.0.1:2390,127.0.0.1:2391,127.0.0.1:2392 \
  --mount default \
  --limit-bytes 10737418240 \
  --limit-inodes 10000000

8. Metrics

nokv-fsmeta --metrics-addr exposes four expvar groups:

Namespace | Meaning
nokv_fsmeta_executor | transaction retry / retry exhausted.
nokv_fsmeta_watch | subscribers, events, delivered, dropped, overflow, remote source state.
nokv_fsmeta_mount | mount cache hit/miss, admission rejects.
nokv_fsmeta_quota | fence check/reject, cache hit/miss, fence updates, usage mutations.

9. Benchmarks

The fsmeta benchmark lives in benchmark/fsmeta. The core comparison is two paths against the same NoKV cluster:

Driver | Behavior
native-fsmeta | Calls the fsmeta typed API.
generic-kv | Uses the same raftstore/percolator substrate but stitches the metadata schema on the client.

Stage 1 headline: ReadDirPlus average latency 12.0 ms vs 510.3 ms — about 42.5×. Result CSVs are in benchmark/fsmeta/results/.

The WatchSubtree evidence workload lives in the same benchmark package; watch_notify reaches sub-second p95 on a Docker Compose 3-node cluster.

10. Non-goals

  • No FUSE / NFS / SMB frontend.
  • No S3 HTTP gateway or object body I/O.
  • No write of every inode/dentry mutation into meta/root.
  • No recursive materialized snapshot — SnapshotSubtree is an MVCC read epoch.
  • No claim that data-plane MVCC GC already retains by snapshot epoch.

From Standalone to Cluster

NoKV does not treat standalone and distributed mode as two separate products glued together later. The migration path promotes an existing standalone workdir into a distributed seed, then expands that seed into a replicated multi-Raft region without swapping out the storage core.

Read this page if the question that interests you most is not “how fast is NoKV?” but “how does an existing standalone engine become a distributed system without switching data planes?”.

Why This Feature Matters

Most systems make you choose early:

  • start with an embedded engine, then migrate to a different distributed database later
  • start distributed from day one, even when the workload is still small
  • accept dump/import-style migration as an operational tax

NoKV is aiming at a different story:

  1. Start with a serious standalone engine.
  2. Keep the same workdir and the same storage layer as data grows.
  3. Promote that workdir into a distributed seed explicitly.
  4. Expand the seed into a replicated cluster through normal raftstore mechanisms.

That is the core value of this feature. The goal is not “add one more migration command”. The goal is to make standalone and distributed shapes feel like one system with one data plane.

What Is Shipping Now

The current migration design is intentionally conservative:

  • offline only
  • one-shot bootstrap
  • no dual-write cutover
  • no background auto-repair
  • no automatic split or rebalance during promotion
  • one full-range seed region for the initial upgrade

That scope is deliberate. The first version is trying to make the protocol explicit, recoverable, and testable before chasing more automation.

The Core Contract

The migration path must preserve these invariants:

  1. Standalone writes stop before migration starts.
  2. The migrated workdir must not silently reopen as a normal standalone directory.
  3. Bootstrap is the only allowed non-apply path that creates the initial region truth for the promoted directory.
  4. Engine manifest stays storage-engine metadata only.
  5. Store-local region truth stays in raftstore/localmeta.
  6. Coordinator does not create local truth during bootstrap.

Those invariants are what make the feature defensible. Without them, “standalone to cluster” collapses into ad hoc tooling.

The Minimal Operator Flow

flowchart LR
    A["Standalone workdir"] --> B["nokv migrate plan"]
    B --> C["nokv migrate init"]
    C --> D["Seeded workdir"]
    D --> E["nokv serve --coordinator-addr ..."]
    E --> F["Single-store cluster seed"]
    F --> G["nokv migrate expand"]
    G --> H["Replicated cluster"]

Happy-path CLI

# 1. Inspect a standalone workdir
go run ./cmd/nokv migrate plan --workdir ./data/store-1

# 2. Promote it into a single-store seed
go run ./cmd/nokv migrate init \
  --workdir ./data/store-1 \
  --store 1 \
  --region 1 \
  --peer 101

# 3. Start the promoted workdir in distributed mode
go run ./cmd/nokv serve \
  --workdir ./data/store-1 \
  --store-id 1 \
  --coordinator-addr 127.0.0.1:2379

# 4. Expand the seed into more replicas
go run ./cmd/nokv migrate expand \
  --addr 127.0.0.1:20170 \
  --region 1 \
  --target 2:201@127.0.0.1:20171 \
  --target 3:301@127.0.0.1:20172

Local operator wrapper

For local demos and operator workflows, the happy path is wrapped by:

  • scripts/ops/migrate-cluster.sh

That wrapper intentionally delegates state transitions to the formal migration CLI instead of inventing a second control path. It now prints stage banners, shows the local migration status after promotion, and writes final migration reports under:

  • artifacts/migration/summary.txt
  • artifacts/migration/summary.json

For local inspection and automation, the CLI also exposes:

  • nokv migrate status --workdir ...
  • nokv migrate report --workdir ...
  • nokv migrate status --workdir ... --addr <leader-admin> --region <region>
  • nokv migrate report --workdir ... --addr <leader-admin> --region <region>

When --addr is set, the report includes a cluster-aware runtime view:

  • current leader peer
  • current leader store
  • whether the local store is hosted
  • current membership size
  • membership peer list
  • applied index / term on that store

Lifecycle States

stateDiagram-v2
    [*] --> standalone
    standalone --> preparing: migrate init
    preparing --> seeded: catalog + seed snapshot + raft seed
    seeded --> cluster: first successful serve
    cluster --> cluster: expand / transfer-leader / remove-peer

Workdir modes

The promoted workdir exposes an explicit mode contract:

  • standalone
  • preparing
  • seeded
  • cluster

Minimal persisted state looks like this:

{
  "mode": "seeded",
  "store_id": 1,
  "region_id": 1,
  "peer_id": 101
}

Semantics

  • standalone
    • regular standalone engine directory
  • preparing
    • migration is in progress; ordinary standalone open must refuse service
  • seeded
    • the standalone workdir has been promoted into a valid single-store cluster seed
  • cluster
    • the workdir is now operating through the distributed runtime

That mode gate is one of the most important engineering choices in the feature. It prevents a half-migrated directory from being silently treated as a normal local database.

Local promotion checkpoint

During migrate init, NoKV now persists a lightweight local checkpoint alongside the mode file. That checkpoint records which milestone of the standalone-to-seed promotion has already completed, for example:

  • mode-preparing-written
  • local-catalog-persisted
  • seed-snapshot-exported
  • raft-seed-initialized
  • seeded-finalized

nokv migrate status and nokv migrate report surface this checkpoint together with a resume_hint, so interrupted promotion is no longer just “stuck in preparing” without context.

When nokv migrate expand, nokv migrate transfer-leader, or nokv migrate remove-peer is invoked with --workdir, the same checkpoint file is also updated with operator progress:

  • which target store/peer is currently being rolled out
  • how many targets have already completed hosting
  • which leader transfer or peer removal step just completed
  • what migration command should be retried next after an interruption

Post-step validation

Migration commands now also validate the state they just created instead of trusting only the immediate RPC return:

  • migrate init verifies the local catalog, seed snapshot manifest, and local raft pointer all agree on the promoted seed region
  • migrate expand verifies leader membership and target hosted state agree on the new peer
  • migrate transfer-leader verifies the elected leader and target runtime converge on the requested peer
  • migrate remove-peer verifies leader metadata and target runtime both stop advertising the removed peer

What Actually Happens

The migration path is easier to reason about if you separate the steps by ownership.

1. plan: preflight only

nokv migrate plan is read-only. It checks:

  • the standalone workdir is readable
  • recovery state is coherent enough to promote
  • the directory is not already seeded or clustered
  • there is no conflicting local peer catalog
  • the directory is not already poisoned by an incompatible state

No mutation happens here.

2. init: promote the workdir into a seed

nokv migrate init performs the actual promotion.

At a high level it does this:

  1. write mode = preparing
  2. persist a full-range local RegionMeta in raftstore/localmeta
  3. export one full-range SST seed snapshot from the standalone DB
  4. synthesize the initial raft durable state for a single local voter
  5. persist the local raft replay pointer
  6. write mode = seeded

3. serve: reuse the normal distributed startup path

After init, the promoted directory is not started through a special bootstrap runtime. It is opened through normal distributed startup:

  • open the same DB workdir
  • load local recovery metadata from raftstore/localmeta
  • open group-local raft durable state
  • start one local peer
  • serve distributed traffic through the regular raftstore path

That is what keeps the feature architecturally honest.

4. expand: reuse normal replication mechanisms

Once the seed is healthy, normal distributed mechanisms take over:

  1. start empty target stores
  2. call nokv migrate expand against the current leader
  3. leader exports one SST snapshot stream
  4. target imports the streamed snapshot on an empty peer
  5. leader publishes the new membership
  6. wait until the target store reports the peer as hosted

The current implementation keeps rollout explicit and sequential:

  • repeated --target <store>:<peer>[@addr]
  • explicit transfer-leader
  • explicit remove-peer
  • no automatic split or rebalance in the promotion phase

Snapshot Semantics

This feature matters because migration is not implemented as a file dump or a second storage engine.

Current snapshot path

Today NoKV’s migration path uses one SST region snapshot primitive:

  • source side exports one region-scoped external SST snapshot
  • snapshot files are bundled into a transport-safe payload
  • target side imports that payload through the external SST install path
  • internal raft snapshot transport carries the same SST payload when a peer needs snapshot-based catch-up instead of regular log replication

This is a correctness-first choice.

Why that choice is reasonable right now

It keeps the promotion and replication contract simple enough to validate:

  • region-scoped
  • explicit snapshot metadata and checksum
  • install-before-publish boundaries
  • retry and restart semantics that are testable

What this is not trying to be yet

The current path is not pretending to be:

  • zero-copy table transfer
  • SST ingest-based install
  • online rebalance for already sharded data

Those are later upgrades, not prerequisites for the first credible promotion path.

Why Full-Range Seed First

The first promotion step intentionally creates one full-range seed region:

  • StartKey = nil
  • EndKey = nil

That means the migration feature does not need to solve split and rebalance as part of the first upgrade.

This is a good tradeoff.

The first hard problem is not “how to repartition existing data automatically”. The first hard problem is “how to make one existing standalone directory become a valid distributed region with explicit lifecycle and recovery semantics”.

Once that is solid, later split and rebalance work can build on top of a stable seed region instead of being entangled with promotion itself.

Failure Semantics

The migration flow must fail loudly and remain recoverable.

During plan

  • read-only errors are returned directly

During init

If any step after preparing fails:

  • keep mode as preparing
  • reject ordinary standalone startup
  • require explicit operator action:
    • rerun init, or
    • use a future rollback or repair path

This is intentionally strict. Half-migrated state must not quietly behave like a normal standalone database.

During expand

Expansion should not be treated as “best effort copy”. It should remain observable and explicit:

  • leader publication must become visible
  • target install must become hosted
  • interruption before publish must not leave a partially hosted peer
  • restart after install must converge without ghost peers

That is why the current test matrix includes:

  • snapshot interruption before publish
  • restarted follower recovery
  • removed-peer restart
  • repeated transport flap during membership changes
  • Coordinator outage after startup

Test and Validation Story

This feature only counts if it is validated as a lifecycle, not just as a command list.

The current validation story spans:

  • raftstore/migrate/*_test.go
    • migration orchestration and timeout behavior
  • raftstore/admin/service_test.go
    • admin execution surfaces for snapshot export/install
  • raftstore/store/peer_lifecycle_test.go
    • install and publish semantics on the target side
  • raftstore/integration/migration_flow_test.go
    • standalone → seed → cluster happy path
  • raftstore/integration/restart_recovery_test.go
    • restart, dehost, and follow-up membership behavior
  • raftstore/integration/snapshot_interruption_test.go
    • install interrupted before publish
  • raftstore/integration/coordinator_degraded_test.go
    • control-plane degradation after startup

That is the right level of seriousness for the feature. Migration is one of the easiest places for a system to look good in a demo and fail under recovery.

Current Scope and Non-goals

The first version is intentionally narrow.

In scope now

  • offline standalone → seed promotion
  • one full-range seed region
  • explicit plan/init/status/expand/remove-peer/transfer-leader
  • SST region snapshot export/install
  • recovery-aware mode gating

Not in scope yet

  • online cutover
  • dual-write
  • automatic split/rebalance during promotion
  • automatic cluster-wide orchestration
  • SST-based snapshot/install as the default path

Keeping the scope narrow is part of what makes the feature credible.

What Comes Next

The next round of work should not reinvent the protocol. It should productize and sharpen it.

Near-term productization

  1. richer cluster-aware migration status once the seed has booted
  2. structured machine-readable reports that can be consumed by demos and automation
  3. stronger stage-by-stage operator output in scripts/ops/migrate-cluster.sh
  4. clearer recovery guidance for preparing, interrupted install, and partial rollout

Next engineering upgrade

The most important technical upgrade after that is:

  • SST-based snapshot/install for promotion and peer rollout

But that should be treated as an upgrade to the existing migration contract, not a replacement for the contract itself.

Long-term direction

Once standalone → full-range seed → replicated region is solid and easy to operate, later work can move on to:

  • more automated orchestration
  • region split after promotion
  • richer rebalance flows
  • larger-scale rollout behavior

The Feature in One Sentence

NoKV starts with a standalone workdir, promotes that workdir into a distributed seed, and expands it into a replicated cluster without replacing the storage core underneath.

Crash Recovery Playbook

This document describes how NoKV restores state after abnormal exit, and which tests validate each recovery contract.


1. Recovery Phases

flowchart TD
    Start[DB.Open]
    Verify[runRecoveryChecks]
    WalOpen[wal.Open]
    LSM[lsm.NewLSM]
    Manifest[manifest replay + table load]
    WALReplay[WAL replay to memtables]
    VLog[valueLog recover]
    Flush[submit immutable flush backlog]
    Stats[stats/start background loops]

    Start --> Verify --> WalOpen --> LSM --> Manifest --> WALReplay --> VLog --> Flush --> Stats

  1. Pre-flight verification: DB.runRecoveryChecks runs manifest.Verify, wal.VerifyDir, and per-bucket vlog.VerifyDir.
  2. WAL manager reopen: wal.Open reopens latest segment and rebuilds counters.
  3. Manifest replay + SST load: levelManager.build replays manifest version and opens SST files.
  4. Strict SST validation: if a manifest SST is missing or unreadable/corrupt, startup fails and manifest state is left unchanged.
  5. WAL replay: lsm.recovery replays post-checkpoint WAL records into memtables.
  6. Flush backlog restore: recovered immutable memtables are resubmitted to the flush queue.
  7. ValueLog recovery: value-log managers reconcile on-disk files with manifest metadata, trim torn tails, and drop stale/orphan segments.
  8. Runtime restart: metrics and periodic workers start again.

2. Failure Scenarios & Tests

Failure Point | Expected Recovery Behaviour | Tests
WAL tail truncated | Replay stops safely at truncated tail, preserving valid prefix records | engine/wal/manager_test.go::TestManagerReplayHandlesTruncate
Crash before memtable flush install | WAL replay restores user data not yet flushed to SST | db_test.go::TestRecoveryWALReplayRestoresData
Manifest references missing SST | Startup fails fast and the manifest entry is preserved (operator must investigate before continuing) | db_test.go::TestRecoveryFailsOnMissingSST
Manifest references corrupt/unreadable SST | Startup fails fast and the manifest entry is preserved (operator must investigate before continuing) | db_test.go::TestRecoveryFailsOnCorruptSST
ValueLog stale segment (manifest marked invalid) | Recovery deletes stale file from disk | db_test.go::TestRecoveryRemovesStaleValueLogSegment
ValueLog orphan segment (disk only) | Recovery deletes orphan file not tracked by manifest | db_test.go::TestRecoveryRemovesOrphanValueLogSegment
Manifest rewrite interrupted | Recovery keeps using CURRENT-selected manifest and data remains readable | db_test.go::TestRecoveryManifestRewriteCrash
ValueLog contains records absent from LSM/WAL | Recovery does not replay vlog as source-of-truth | db_test.go::TestRecoverySkipsValueLogReplay

3. Recovery Tooling

3.1 Targeted tests

go test ./... -run 'Recovery|ReplayHandlesTruncate'

Set RECOVERY_TRACE_METRICS=1 to emit RECOVERY_METRIC ... lines in tests.

3.2 Targeted harness command

RECOVERY_TRACE_METRICS=1 \
go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v

Outputs are saved under artifacts/recovery/.

3.3 CLI checks

  • nokv manifest --workdir <dir>: verify level files, WAL pointer, vlog metadata.
  • nokv stats --workdir <dir>: confirm flush backlog converges.
  • nokv vlog --workdir <dir>: inspect vlog segment state.

4. Operational Signals

Watch these fields during restart:

  • flush.queue_length
  • wal.segment_count
  • value_log.heads
  • value_log.segments
  • value_log.pending_deletes

If flush.queue_length remains high after replay, inspect flush worker throughput and manifest sync settings.


5. Notes on Consistency Model

  • WAL + manifest remain the authoritative recovery chain for LSM state.
  • ValueLog is reconciled/validated but is not replayed as a mutation source.
  • In strict flush mode (ManifestSync=true), SST install ordering is SST Sync -> RenameNoReplace -> SyncDir -> manifest edit.

For deeper internals, see flush.md, manifest.md, and wal.md.

Configuration & Options

NoKV exposes two configuration surfaces:

  1. Runtime options for the embedded engine (Options in options.go).
  2. Cluster topology for distributed mode (raft_config.example.json via config.LoadFile/Validate).

1. Runtime Options (Embedded Engine)

NoKV.NewDefaultOptions() returns a tuned baseline. Override fields before calling NoKV.Open(opt), which now returns (*DB, error).

Key option groups (see options.go for the full list):

  • Paths & durability
    • WorkDir, SyncWrites, ManifestSync, ManifestRewriteThreshold
  • Write pipeline
    • WriteBatchMaxCount, WriteBatchMaxSize, WriteBatchWait
    • MaxBatchCount, MaxBatchSize
    • WriteThrottleMinRate, WriteThrottleMaxRate
  • Value log
    • ValueThreshold, ValueLogFileSize, ValueLogMaxEntries
    • ValueLogGCInterval, ValueLogGCDiscardRatio
    • ValueLogGCParallelism, ValueLogGCReduceScore, ValueLogGCSkipScore
    • ValueLogGCReduceBacklog, ValueLogGCSkipBacklog
    • ValueLogGCSampleSizeRatio, ValueLogGCSampleCountRatio, ValueLogGCSampleFromHead
    • ValueLogBucketCount
  • LSM & compaction
    • MemTableSize, MemTableEngine, SSTableMaxSz, NumCompactors
    • NumLevelZeroTables, LandingCompactBatchSize, LandingBacklogMergeScore
    • CompactionValueWeight, CompactionValueAlertThreshold
  • Caches
    • BlockCacheBytes, IndexCacheBytes
  • Hot key throttling
    • WriteHotKeyLimit
    • ThermosEnabled, ThermosTopK, decay/window settings
    • ThermosNodeCap, ThermosNodeSampleBits, ThermosRotationInterval
  • WAL watchdog
    • EnableWALWatchdog, WALAutoGCInterval
    • WALAutoGCMinRemovable, WALAutoGCMaxBatch
    • WALTypedRecordWarnRatio, WALTypedRecordWarnSegments
  • Raft lag warnings (stats only)
    • RaftLagWarnSegments

Example:

opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./data"
opt.SyncWrites = true
opt.ValueThreshold = 2048
opt.WriteBatchMaxCount = 128
db, err := NoKV.Open(opt)
if err != nil {
	log.Fatalf("open failed: %v", err)
}
defer db.Close()

Notes:

  • NewDefaultOptions() populates concrete compaction/landing defaults up front. Open() resolves constructor-owned defaults once, then the DB and LSM layers consume the resolved values directly.
  • WriteBatchMaxCount, WriteBatchMaxSize, MaxBatchCount, MaxBatchSize, WriteThrottleMinRate, WriteThrottleMaxRate, and WALBufferSize now also expose concrete defaults through NewDefaultOptions(). If you construct Options manually, leaving these fields at zero lets Open() resolve the constructor defaults.
  • Batch knobs are split by owner:
    • WriteBatchMaxCount / WriteBatchMaxSize bound commit-worker request coalescing.
    • MaxBatchCount / MaxBatchSize bound internal apply/rewrite batches such as batchSet and value-log GC rewrites.
  • Write slowdown is bandwidth-driven: WriteThrottleMaxRate applies when slowdown first becomes active, and pressure lowers the target rate toward WriteThrottleMinRate as compaction debt approaches the stop threshold.
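For example, a write-heavy deployment might pin these knobs explicitly. The field names are the ones listed above; the rate units are assumed to be bytes per second here, and options.go is authoritative:

opt := NoKV.NewDefaultOptions()
opt.WriteBatchMaxCount = 256        // commit-worker request coalescing bound
opt.WriteBatchMaxSize = 8 << 20     // bytes per coalesced commit batch
opt.MaxBatchCount = 1024            // internal apply/rewrite batch bound (batchSet, vlog GC)
opt.WriteThrottleMaxRate = 64 << 20 // target rate when slowdown first becomes active
opt.WriteThrottleMinRate = 8 << 20  // floor as compaction debt approaches the stop threshold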

Load Options From TOML

For convenience, you can load engine options from a TOML file. Unspecified fields keep their defaults from NewDefaultOptions.

opt, err := NoKV.LoadOptionsFile("nokv.options.toml")
if err != nil {
    log.Fatal(err)
}
db, err := NoKV.Open(opt)
if err != nil {
	log.Fatalf("open failed: %v", err)
}
defer db.Close()

Example (TOML):

work_dir = "./data"
mem_table_engine = "art"
value_threshold = 2048
write_hot_key_limit = 128
value_log_gc_interval = "30s"

Notes:

  • Field names are case-insensitive; _ / - / . are ignored.
  • Durations accept Go-style strings (e.g. "30s", "200ms"). Numeric durations are interpreted as nanoseconds.
  • File extensions .toml and .tml are accepted.
  • JSON option files are rejected by design.
  • Unknown fields return an error so typos do not silently pass.

2. Raft Topology File

raft_config.example.json is consumed by every CLI in distributed mode.

Two-layer semantics

The file has two independent layers with different lifecycles. Confusing them is a common deployment mistake, so be explicit:

Layer | Keys | Lifecycle | Source of truth
Address directory | meta_root.peers, coordinator, stores, store_work_dir_template, max_retries | Read on every CLI invocation. Keep in sync with deployed containers/hosts. | This file is the source of truth. Nothing else knows where to dial.
Bootstrap seed | regions | Read only on first startup by scripts/ops/bootstrap.sh. Once a store has CURRENT, bootstrap skips it. | After first bootstrap, meta-root owns the runtime region topology. Inspect with nokv-config regions.

Consequence: editing regions after bootstrap is a no-op for running clusters. Editing addresses is effective on the next CLI invocation (restart / docker compose up).

Precedence

When a value can come from both CLI and config file, CLI wins. Config is a source of defaults:

--root-peer=1=host:2380   → explicit, used
(absent)                  → fall back to meta_root.peers[0].addr

Minimal shape

{
  "max_retries": 8,
  "meta_root": {
    "peers": [
      { "node_id": 1,
        "addr": "127.0.0.1:2380",         // coordinator/host audit tools dial here
        "docker_addr": "nokv-meta-root-1:2380",
        "transport_addr": "127.0.0.1:3380", // sibling meta-root peers dial here for raft
        "docker_transport_addr": "nokv-meta-root-1:2480",
        "work_dir": "./artifacts/cluster/meta-root-1",
        "docker_work_dir": "/var/lib/nokv-meta-root" },
      { "node_id": 2, "...": "..." },
      { "node_id": 3, "...": "..." }
    ]
  },
  "coordinator": {
    "addr": "127.0.0.1:2379",
    "docker_addr": "nokv-coordinator-1:2379,nokv-coordinator-2:2379,nokv-coordinator-3:2379"
  },
  "store_work_dir_template": "./artifacts/cluster/store-{id}",
  "store_docker_work_dir_template": "/var/lib/nokv/store-{id}",
  "stores": [
    { "store_id": 1,
      "listen_addr": "127.0.0.1:20170",
      "addr": "127.0.0.1:20170",
      "docker_listen_addr": "0.0.0.0:20160",
      "docker_addr": "nokv-store-1:20160" }
  ],
  "regions": [
    { "id": 1, "start_key": "", "end_key": "m",
      "epoch": { "version": 1, "conf_version": 1 },
      "peers": [{ "store_id": 1, "peer_id": 101 }],
      "leader_store_id": 1 }
  ]
}

Field notes

  • meta_root.peers: exactly 3 entries. addr is the gRPC service port (coordinators / host audit tools dial it). transport_addr is the raft transport port (sibling meta-root peers dial it for raft messages). They MUST be different ports on the same host.
  • coordinator.addr / docker_addr: may be a single endpoint or comma-separated for multi-coord HA (coord1:2379,coord2:2379,coord3:2379). Gateways and stores use this list to failover on lease-not-held errors.
  • stores[i]: addr is what other processes dial; listen_addr is what the store binds locally. Usually the same on host scope; different on docker scope (0.0.0.0:20160 vs nokv-store-1:20160).
  • Store workdir resolution (ResolveStoreWorkDir):
    1. store-scoped override (stores[i].work_dir / docker_work_dir)
    2. global template (must contain {id})
    3. empty — caller falls back to its own default
  • start_key / end_key accept plain strings, hex:<bytes>, or base64. Empty or "-" means unbounded.
  • leader_store_id is bootstrap metadata only. Runtime routing comes from coordinator (GetRegionByKey), never from this field.

CLI integration

nokv meta-root --config <file> --node-id N resolves --peer, --transport-addr, --workdir from the meta_root section. Explicit flags still override.

nokv coordinator --config <file> resolves --addr from coordinator.addr and --root-peer from meta_root.peers. This is how docker-compose keeps meta-root addresses in a single file.

Programmatic loading:

cfg, err := config.LoadFile("raft_config.example.json")
if err != nil {
    log.Fatal(err)
}
if err := cfg.Validate(); err != nil {
    log.Fatal(err)
}
peers := cfg.MetaRootServicePeers("docker") // id → gRPC addr

The same file drives the script entrypoints and topology inspection tools:

  • scripts/dev/cluster.sh --config raft_config.example.json
  • scripts/ops/serve-meta-root.sh --config ... --node-id 1
  • scripts/ops/serve-coordinator.sh --config ... --coordinator-id c1
  • nokv-config stores / nokv-config regions — query current rooted topology (not the JSON). Use these to diff against the deployment manifest after scheduler operations.

CLI (cmd/nokv) Reference

nokv provides operational visibility similar to RocksDB ldb / Badger CLI, with script-friendly JSON output.


Installation

go install ./cmd/nokv

Shared Flags

  • --workdir <path>: NoKV database directory (must contain CURRENT for manifest commands)
  • --json: JSON output (default is plain text)
  • --expvar <url>: for stats, fetch from /debug/vars
  • --no-region-metrics: for offline stats, skip attaching runtime region metrics

Subcommands

nokv stats

  • Reads StatsSnapshot either offline (--workdir) or online (--expvar)
  • JSON output is nested by domain (not flat)

Common fields:

  • entries
  • flush.pending, flush.queue_length, flush.last_wait_ms
  • compaction.backlog, compaction.max_score
  • value_log.segments, value_log.pending_deletes, value_log.gc.*
  • wal.active_segment, wal.segment_count, wal.typed_record_ratio
  • write.queue_depth, write.queue_entries, write.hot_key_limited
  • region.total, region.running, region.removing
  • hot.write_keys
  • lsm.levels, lsm.value_bytes_total
  • transport.*

Example:

nokv stats --workdir ./testdata/db --json | jq '.flush.queue_length'

nokv execution

  • Queries execution-plane diagnostics from a running raftstore admin endpoint
  • Exposes:
    • last admission decision observed by the store
    • restart-status summary (state, hosted region count, raft-group count, missing raft pointers)
    • current topology transition status as seen by the execution plane
  • Supports plain-text or --json output
  • Common flags:
    • --addr raftstore admin address (required)
    • --region optional region filter
    • --transition optional transition id filter
    • --timeout
    • --json

Example:

nokv execution --addr 127.0.0.1:20161 --json

nokv manifest

  • Reads manifest version state
  • Shows log pointer, per-level file info, and value-log metadata

nokv vlog

  • Lists value-log segments and current head per bucket
  • Useful after GC/recovery checks

nokv regions

  • Dumps the local peer catalog used for store recovery (state/range/epoch/peers)
  • Supports --json

nokv migrate plan

  • Runs a read-only preflight check for standalone -> cluster-seed migration
  • Verifies manifest, WAL, and value-log structure without repairing tails
  • Reports current mode, local catalog occupancy, blockers, and next step

nokv migrate init

  • Converts a standalone workdir into a single-store seeded cluster directory
  • Writes MODE.json, the full-range local region catalog entry, and the initial raft durable metadata
  • Exports a logical seed snapshot under RAFTSTORE_SNAPSHOTS/region-<id>
  • After init, ordinary standalone opens must reject the workdir unless the caller explicitly opts into distributed modes

nokv migrate status

  • Reads MODE.json when present and otherwise reports standalone
  • Shows current mode plus seed identifiers (store, region, peer)

nokv migrate expand

  • Sends one or more AddPeer requests to the leader store’s admin gRPC endpoint
  • Supports sequential rollout with repeated --target <store>:<peer>[@addr]
  • Optionally polls leader and target stores until each new peer is published, hosted, and has applied at least one raft index
  • Common flags:
    • --addr leader store admin address
    • --region
    • --target <store>:<peer>[@addr] (repeatable)
    • --wait overall wait timeout (0 disables waiting)
    • --poll-interval

nokv migrate remove-peer

  • Sends one RemovePeer request to the leader store’s admin gRPC endpoint
  • Optionally waits until the leader metadata drops the peer and the target store no longer reports it as hosted
  • Common flags:
    • --addr leader store admin address
    • --target-addr target store admin address for removal wait checks
    • --region, --peer
    • --wait, --poll-interval

nokv migrate transfer-leader

  • Sends one TransferLeader request to the leader store’s admin gRPC endpoint
  • Optionally waits until the target peer becomes the observed region leader
  • Common flags:
    • --addr current leader store admin address
    • --target-addr target store admin address for leader wait checks
    • --region, --peer
    • --wait, --poll-interval

nokv serve

  • Starts NoKV gRPC service backed by local raftstore
  • Requires --store-id
  • Also requires enough address metadata to reach the Coordinator and all remote raft peers:
    • either explicit flags (--workdir, --addr, --coordinator-addr, --store-addr)
    • or --config <raft_config.json> --scope host|docker
  • Common flags:
    • --addr (default 127.0.0.1:20160)
    • --config, --scope host|docker
    • --metrics-addr (optional expvar endpoint, exposes /debug/vars)
    • --store-addr storeID=address (repeatable override for remote store transport addresses)
    • --election-tick, --heartbeat-tick
    • --raft-max-msg-bytes, --raft-max-inflight
    • --raft-tick-interval, --raft-debug-log

Example:

nokv serve \
  --config ./raft_config.example.json \
  --scope host \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1

Restart semantics:

  • hosted peers come from raftstore/localmeta, not raft_config.json region lines
  • --config is used only to resolve:
    • store listen address
    • Coordinator address
    • storeID -> addr, which serve expands into remote peerID -> addr
  • --store-id must match the durable workdir identity once the workdir has been used
  • --store-addr is only an exceptional static override; it is keyed by stable storeID, not by mutable runtime peerID

nokv coordinator

  • Starts the Coordinator gRPC service. NoKV only supports the separated topology: coordinator always connects to an external 3-peer meta-root cluster via gRPC.
  • Required flags:
    • --coordinator-id (stable lease owner id)
    • --root-peer nodeID=addr (exactly 3 meta-root gRPC endpoints)
  • Common flags:
    • --addr (default 127.0.0.1:2379)
    • --lease-ttl, --lease-renew-before (default 10s / 3s)
    • --root-refresh (default 200ms)
    • --id-start, --ts-start (allocator seeds; only used when the meta-root cluster has no allocator state yet)
    • --config + --scope host|docker (resolves --addr from raft_config.json)
    • --metrics-addr (optional expvar endpoint, exposes /debug/vars)

Example:

nokv coordinator \
  --addr 127.0.0.1:2379 \
  --coordinator-id c1 \
  --root-peer 1=127.0.0.1:2380 \
  --root-peer 2=127.0.0.1:2381 \
  --root-peer 3=127.0.0.1:2382

nokv meta-root

  • Starts one peer of the 3-peer replicated metadata-root cluster. NoKV only supports the replicated topology; single-process local mode has been removed from the CLI.
  • Required flags:
    • --workdir, --node-id, --transport-addr
    • --peer nodeID=addr (repeatable, exactly 3)
  • Common flags:
    • --addr (default 127.0.0.1:2380, gRPC listen)
    • --tick-interval (default 100ms)
    • --metrics-addr (optional expvar endpoint)

Example:

nokv meta-root \
  --addr 127.0.0.1:2380 \
  --workdir ./artifacts/meta-root-1 \
  --node-id 1 \
  --transport-addr 127.0.0.1:3380 \
  --peer 1=127.0.0.1:3380 \
  --peer 2=127.0.0.1:3381 \
  --peer 3=127.0.0.1:3382

Script Helpers

  • scripts/ops/serve-meta-root.sh
    • Starts one replicated meta-root peer and forwards shutdown signals.
    • Requires --workdir, --node-id, --transport-addr, and 3 --peer values.
  • scripts/ops/serve-coordinator.sh
    • Starts one nokv coordinator against an external meta-root cluster.
    • Requires --coordinator-id and 3 --root-peer values (meta-root gRPC endpoints).
  • scripts/ops/serve-store.sh
    • Starts one nokv serve store against an existing durable workdir.
  • scripts/ops/bootstrap.sh
    • Seeds fresh store workdirs from config.regions; not a restart tool.
  • scripts/dev/cluster.sh
    • Dev bootstrap for the 333 separated layout: 3 meta-root + 1 coordinator (remote-rooted) + stores.
    • Uses fixed local root endpoints:
      • gRPC: 127.0.0.1:2380/2381/2382
      • raft transport: 127.0.0.1:3380/3381/3382

Expvar Keys

  • nokv_coordinator
    • Published by nokv coordinator --metrics-addr ...
    • Includes:
      • root_mode
      • rooted read-state summary
      • lease state
      • allocator window state

Integration Tips

  • Combine with RECOVERY_TRACE_METRICS=1 for recovery validation.
  • In CI, compare JSON snapshots to detect observability regressions.
  • Use nokv stats --expvar for online diagnostics and --workdir for offline forensics.

Cluster Demo

One-command demo of the full 333 HA topology (3 meta-root + 3 coordinator + 3 store + 1 fsmeta gateway).

One-shot startup

# Pull image + start every service + run bootstrap once
docker compose up -d
docker compose logs -f

docker compose down -v wipes the data volumes too.

Exposed ports (all bound to 127.0.0.1)

Symmetric port blocks so the three replicas of each role land on three consecutive numbers — easy to remember, easy to script against.

| Service | Port | Purpose |
| --- | --- | --- |
| Meta-root-1 gRPC | 2380 | host-side tools dial rooted state directly |
| Meta-root-2 gRPC | 2381 | |
| Meta-root-3 gRPC | 2382 | |
| Meta-root-1 expvar | 9380 | /debug/vars JSON |
| Meta-root-2 expvar | 9381 | |
| Meta-root-3 expvar | 9382 | |
| Coordinator-1 gRPC | 2390 | |
| Coordinator-2 gRPC | 2391 | |
| Coordinator-3 gRPC | 2392 | |
| Coordinator-1 expvar | 9100 | |
| Coordinator-2 expvar | 9101 | |
| Coordinator-3 expvar | 9102 | |
| Store-1 expvar | 9200 | |
| Store-2 expvar | 9201 | |
| Store-3 expvar | 9202 | |
| FSMeta gRPC | 8090 | filesystem metadata service |
| FSMeta expvar | 9400 | /debug/vars JSON |

Why are meta-root gRPC ports exposed?

Meta-root (2380/2381/2382) is exposed so host-side tools like nokv-config can query rooted state directly for debugging.

For production, don’t expose meta-root publicly. The gRPC API accepts ApplyTenure and ApplyHandover, which are lease-gated but still structurally sensitive. To opt out, delete the ports: block under meta-root-1, meta-root-2, and meta-root-3 in docker-compose.yml. The cluster keeps working because the coordinator and fsmeta dial meta-root over the Docker network, not through host ports.

The same applies to the coordinator gRPC ports (2390/2391/2392): they are convenient for host-side client experiments, but don’t expose them publicly.

Failure drills

Run them straight from the terminal:

  • Stop the active coordinator: docker stop nokv-coordinator-1. The Eunomia lease moves to a standby in 1–3 s; watch the era bump on the surviving coordinators’ /debug/vars.
  • Stop the raft leader meta-root: docker stop nokv-meta-root-1. Raft election lands on a surviving peer; the coordinator lease may churn through one era before settling (~17 s total recovery).
  • Start the stopped container: docker start nokv-coordinator-1. It rejoins quietly as a standby.

Scripts Overview

NoKV now groups shell entrypoints by role instead of keeping every helper flat under scripts/.

Layout

| Path | Role |
| --- | --- |
| scripts/dev | Local development and bootstrap helpers for running a cluster from raft_config*.json. |
| scripts/ops | Operator-style workflows that drive the formal migration CLI. |
| scripts/lib | Shared shell helpers for config lookup, workdir hygiene, and build/bootstrap rules. |
| scripts/*.sh | Tooling or benchmark entrypoints that are still intentionally top-level. |

This split is deliberate:

  • dev scripts are allowed to help with local experiments and smoke tests.
  • ops scripts must treat the migration CLI as source of truth and stay stricter.
  • lib is where shared rules live, so shell semantics do not drift across scripts.

Bootstrap & Local Launch

scripts/dev/cluster.sh

  • Purpose: build nokv and nokv-config, start the canonical 333 separated dev cluster, seed fresh store workdirs, then start stores from raft_config.json.
  • Starts:
    • three nokv meta-root processes (Truth plane; replicated is the only mode)
    • one nokv coordinator process (Service plane; always remote-rooted)
    • all configured stores (Execution plane)
  • Uses shared rules from:
    • scripts/lib/common.sh
    • scripts/lib/config.sh
    • scripts/lib/workdir.sh
  • Example:
    ./scripts/dev/cluster.sh --config ./raft_config.example.json --workdir ./artifacts/cluster
    
  • Notes:
    • --config defaults to ./raft_config.example.json
    • --workdir defaults to ./artifacts/cluster
    • uses fixed local metadata-root gRPC endpoints 127.0.0.1:2380/2381/2382
    • uses fixed local metadata-root raft transport endpoints 127.0.0.1:3380/3381/3382
    • store workdirs are rejected if they contain unexpected files
    • logs stream with [meta-root-<id>] / [coordinator] / [store-<id>] prefixes and are still written to root.log / coordinator.log / server.log
    • this is a bootstrap/dev launcher, not a restart command
    • production-style restarts should run scripts/ops/serve-meta-root.sh, scripts/ops/serve-coordinator.sh, and scripts/ops/serve-store.sh directly against the same durable workdirs

scripts/ops/bootstrap.sh

  • Purpose: seed local peer catalog metadata into a set of store directories derived from a path template.
  • Intended for:
    • Docker Compose bootstrap
    • local static-topology experiments
  • Example:
    ./scripts/ops/bootstrap.sh --config /etc/nokv/raft_config.json --path-template /data/store-{id}
    
  • Notes:
    • skips stores that already contain CURRENT
    • refuses to seed into dirty directories
    • this is bootstrap-only; it does not recover runtime topology
    • use nokv serve / scripts/ops/serve-store.sh to restart an existing store from the same workdir

scripts/ops/serve-store.sh

  • Purpose: thin wrapper around nokv serve for one store.
  • Example:
    ./scripts/ops/serve-store.sh \
      --config ./raft_config.example.json \
      --store-id 1 \
      --workdir ./artifacts/cluster/store-1 \
      --scope local
    
  • Notes:
    • resolves store listen/workdir/Coordinator defaults through nokv serve --config
    • remote peer recovery comes from raftstore/localmeta; config stores only provide storeID -> addr
    • no longer treats config.regions as restart-time topology truth
    • if a static transport override is needed, use --store-addr <store-id>=<addr> rather than a peer-id keyed mapping
    • --scope docker selects container-friendly addresses

scripts/ops/serve-meta-root.sh

  • Purpose: thin wrapper around nokv meta-root for one replicated metadata-root peer.
  • Example:
    ./scripts/ops/serve-meta-root.sh \
      --addr 127.0.0.1:2380 \
      --workdir ./artifacts/cluster/meta-root-1 \
      --node-id 1 \
      --transport-addr 127.0.0.1:3380 \
      --peer 1=127.0.0.1:3380 \
      --peer 2=127.0.0.1:3381 \
      --peer 3=127.0.0.1:3382
    
  • Notes:
    • --peer values are metadata-root raft transport addresses, not gRPC service addresses
    • --workdir, --node-id, --transport-addr, and exactly 3 --peer values are required; there is no single-process local mode
    • forwards shutdown signals to nokv meta-root

scripts/ops/serve-coordinator.sh

  • Purpose: thin wrapper around nokv coordinator for one coordinator process wired to an external 3-peer meta-root cluster.
  • Example:
    ./scripts/ops/serve-coordinator.sh \
      --addr 127.0.0.1:2379 \
      --coordinator-id c1 \
      --root-peer 1=127.0.0.1:2380 \
      --root-peer 2=127.0.0.1:2381 \
      --root-peer 3=127.0.0.1:2382
    
  • Notes:
    • --root-peer values are metadata-root gRPC service addresses, not raft transport
    • exactly 3 --root-peer values are required (mirrors the Truth-plane quorum)
    • --coordinator-id is required (stable lease owner id)
    • forwards shutdown signals to nokv coordinator

Migration Workflow

scripts/ops/migrate-cluster.sh

  • Purpose: one-shot local operator wrapper for the standalone-to-cluster migration path.
  • Drives:
    • nokv migrate plan
    • nokv migrate init
    • nokv migrate expand
    • optional transfer-leader
    • optional remove-peer
  • Example:
    ./scripts/ops/migrate-cluster.sh \
      --config ./raft_config.example.json \
      --workdir ./artifacts/standalone \
      --seed-store 1 \
      --seed-region 1 \
      --seed-peer 101 \
      --target 2:201 \
      --target 3:301 \
      --transfer-leader 201 \
      --remove-peer 101
    
  • Notes:
    • seed workdir must already contain standalone data
    • target store workdirs must be fresh
    • uses the migration CLI as the only source of truth

Shared Shell Rules

scripts/lib/common.sh

  • shared repo-root detection
  • shared build helpers for nokv and nokv-config
  • shared TCP readiness helper

scripts/lib/config.sh

  • shared nokv-config lookups for:
    • store lines
    • region lines
    • Coordinator address
    • Coordinator workdir

scripts/lib/workdir.sh

  • shared workdir hygiene rules:
    • remove stale LOCK
    • reject unexpected directory contents
    • assert a directory is fresh before seeding or bootstrap

This is where shell-level correctness rules should continue to live. New scripts should reuse these helpers instead of open-coding workdir or config parsing logic again.

Tooling & Benchmarks

| Script | Purpose |
| --- | --- |
| scripts/run_benchmarks.sh | Execute YCSB benchmarks (default engines: NoKV/Badger/Pebble, optional RocksDB). |
| scripts/build_rocksdb.sh | Build local RocksDB artifacts used by benchmark comparisons. |
| scripts/debug.sh | Wrap dlv test for focused debugging. |
| scripts/gen.sh | Format protobufs and regenerate Go bindings through Buf. |

For recovery and transport fault validation, use direct Go tests instead of shell wrappers:

RECOVERY_TRACE_METRICS=1 \
go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v

CHAOS_TRACE_METRICS=1 \
go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport

Relationship with nokv-config

  • nokv-config stores / regions / coordinator remain the structured topology source for shell scripts.
  • config.regions remain bootstrap/deployment metadata, not restart-time peer truth.
  • nokv-config catalog writes Region metadata into the local peer catalog.
  • Go tools can import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate directly (a sketch follows at the end of this section).

Maintaining a single raft_config.json still keeps development scripts, Docker Compose, and automated tests aligned. The difference now is that shell behavior is shared and explicit instead of repeated across four separate entrypoints.
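
For Go tooling, a minimal sketch of that direct import is shown below. The exact signatures of config.LoadFile and the Validate call are assumptions here; treat this as the shape of the integration rather than a verbatim API reference.

```go
// Minimal sketch (signatures assumed): load raft_config.json through the
// shared config package and validate it before using it in a Go tool.
package main

import (
	"log"

	"github.com/feichai0017/NoKV/config"
)

func main() {
	cfg, err := config.LoadFile("./raft_config.example.json")
	if err != nil {
		log.Fatalf("load config: %v", err)
	}
	if err := cfg.Validate(); err != nil {
		log.Fatalf("validate config: %v", err)
	}
	log.Println("raft_config loaded and validated")
}
```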

Stats & Observability Pipeline

NoKV exposes runtime health through:

  • StatsSnapshot (structured in-process snapshot)
  • expvar (/debug/vars)
  • nokv stats CLI (plain text or JSON)

The implementation lives in stats.go, and collection runs continuously once the DB is open.


1. Architecture

flowchart TD
    subgraph COLLECTORS["Collectors"]
        LSM["lsm.* metrics"]
        WAL["wal metrics"]
        VLOG["value log metrics"]
        HOT["thermos"]
        REGION["region metrics"]
        TRANSPORT["grpc transport metrics"]
    end
    LSM --> SNAP["Stats.Snapshot()"]
    WAL --> SNAP
    VLOG --> SNAP
    HOT --> SNAP
    REGION --> SNAP
    TRANSPORT --> SNAP
    SNAP --> EXP["Stats.collect -> expvar"]
    SNAP --> CLI["nokv stats"]

Two-layer design:

  • metrics layer: only collects counters/gauges/snapshots.
  • stats layer: aggregates cross-module data and exports.
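
The split can be pictured with a small, illustrative sketch (these are not NoKV's actual types): the metrics layer only records counters, and the stats layer aggregates them into a domain-grouped snapshot for export.

```go
// Illustrative sketch of the two-layer split (not NoKV's real types):
// the metrics layer owns counters; the stats layer builds a read-only,
// domain-grouped snapshot for expvar / CLI export.
package stats

import "sync/atomic"

// Metrics layer: one module-local collector, no cross-module knowledge.
type FlushMetrics struct {
	pending     atomic.Int64
	queueLength atomic.Int64
}

func (m *FlushMetrics) IncPending() { m.pending.Add(1) }

func (m *FlushMetrics) SetQueueLength(n int64) { m.queueLength.Store(n) }

// Stats layer: aggregates collectors into a domain-grouped snapshot.
type Snapshot struct {
	Flush struct {
		Pending     int64 `json:"pending"`
		QueueLength int64 `json:"queue_length"`
	} `json:"flush"`
}

func Collect(f *FlushMetrics) Snapshot {
	var s Snapshot
	s.Flush.Pending = f.pending.Load()
	s.Flush.QueueLength = f.queueLength.Load()
	return s
}
```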

2. Snapshot Schema

StatsSnapshot is now domain-grouped (not flat):

  • entries
  • flush.*
  • compaction.*
  • value_log.* (includes value_log.gc.*)
  • wal.*
  • raft.*
  • write.*
  • region.*
  • hot.*
  • cache.*
  • lsm.*
  • transport.*

Representative fields:

  • flush.pending, flush.queue_length, flush.last_wait_ms
  • compaction.backlog, compaction.max_score, compaction.value_weight
  • value_log.segments, value_log.pending_deletes, value_log.gc.gc_runs
  • wal.active_segment, wal.segment_count, wal.typed_record_ratio
  • raft.group_count, raft.lagging_groups, raft.max_lag_segments
  • write.queue_depth, write.avg_request_wait_ms, write.hot_key_limited
  • region.total, region.running, region.removing, region.tombstone
  • hot.write_keys, hot.write_ring
  • cache.block_l0_hit_rate, cache.bloom_hit_rate, cache.iterator_reused
  • lsm.levels, lsm.value_bytes_total

3. expvar Export

Stats.collect exports a single structured object:

  • NoKV.Stats

All domains (flush, compaction, value_log, wal, raft, write, region, hot, cache, lsm, transport) are nested under this object.

Legacy scalar compatibility keys are removed. Consumers should read fields from NoKV.Stats directly.


4. CLI & JSON

  • nokv stats --workdir <dir>: offline snapshot from local DB
  • nokv stats --expvar <host:port>: snapshot from running process /debug/vars
  • nokv stats --json: machine-readable nested JSON

Example:

{
  "entries": 1048576,
  "flush": {
    "pending": 2,
    "queue_length": 2
  },
  "value_log": {
    "segments": 6,
    "pending_deletes": 1,
    "gc": {
      "gc_runs": 12
    }
  },
  "hot": {
    "write_keys": [
      {"key": "user:123", "count": 42}
    ]
  }
}

5. Operational Guidance

  • flush.queue_length + compaction.backlog both rising: flush/compaction under-provisioned.
  • value_log.discard_queue high for long periods: check value_log.gc.* and compaction pressure.
  • write.throttle_active=true frequently: L0 pressure likely high; inspect cache.block_l0_hit_rate and compaction.
  • write.hot_key_limited increasing: hot key write throttling is active.
  • raft.lag_warning=true: at least one group exceeds lag threshold.
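
One hedged way to turn these rules into an automated check is sketched below. The field names come from the snapshot schema above, while the endpoint address and the exact JSON nesting under NoKV.Stats are assumptions based on this document.

```go
// Minimal monitoring sketch: pull NoKV.Stats from a store's expvar
// endpoint and flag the conditions listed above. The address (Store-1
// expvar from the compose demo) and the JSON layout are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type snapshot struct {
	Flush struct {
		QueueLength int64 `json:"queue_length"`
	} `json:"flush"`
	Compaction struct {
		Backlog int64 `json:"backlog"`
	} `json:"compaction"`
	Write struct {
		ThrottleActive bool `json:"throttle_active"`
	} `json:"write"`
	Raft struct {
		LagWarning bool `json:"lag_warning"`
	} `json:"raft"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:9200/debug/vars")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var vars struct {
		Stats snapshot `json:"NoKV.Stats"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
		log.Fatal(err)
	}
	s := vars.Stats
	if s.Flush.QueueLength > 0 && s.Compaction.Backlog > 0 {
		fmt.Println("flush/compaction possibly under-provisioned")
	}
	if s.Write.ThrottleActive {
		fmt.Println("write throttle active: inspect L0 and compaction pressure")
	}
	if s.Raft.LagWarning {
		fmt.Println("at least one raft group exceeds the lag threshold")
	}
}
```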

6. Comparison

| Engine | Built-in observability |
| --- | --- |
| RocksDB | Rich metrics/perf context, often needs additional tooling/parsing |
| Badger | Optional metrics integrations |
| NoKV | Native expvar + structured snapshot + CLI with offline/online modes |

Testing & Validation Matrix

This document inventories NoKV’s automated coverage and provides guidance for extending tests. It aligns module-level unit tests, integration suites, and benchmarking harnesses with the architectural features described elsewhere.


1. Quick Commands

# All unit + integration tests (uses local module caches)
GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./...

# Focused distributed transaction suite
go test ./percolator/... ./raftstore/client/... -run 'Test.*(Commit|Prewrite|TwoPhaseCommit)'

# Focused distributed migration / membership / restart suite
go test ./raftstore/integration -count=1

# Crash recovery scenarios
RECOVERY_TRACE_METRICS=1 \
go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v

# Protobuf schema hygiene
make proto-check

# gRPC transport chaos tests + watchdog metrics
CHAOS_TRACE_METRICS=1 \
go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport

# Sample Coordinator service for shared TSO / routing in distributed tests
go run ./cmd/nokv coordinator --addr 127.0.0.1:2379 --id-start 1 --ts-start 100 --workdir ./artifacts/coordinator

# Local three-node cluster (includes catalog bootstrap + Coordinator)
./scripts/dev/cluster.sh --config ./raft_config.example.json
# Tear down with Ctrl+C

# Docker-compose sandbox (3 nodes + Coordinator)
docker compose up -d
docker compose down -v

# Build RocksDB locally (installs into ./third_party/rocksdb/dist by default)
./scripts/build_rocksdb.sh
# YCSB baseline (records=1e6, ops=1e6, warmup=1e5, conc=16)
./scripts/run_benchmarks.sh
# YCSB with RocksDB (requires CGO, `benchmark_rocksdb`, and the RocksDB build above)
LD_LIBRARY_PATH="$(pwd)/third_party/rocksdb/dist/lib:${LD_LIBRARY_PATH}" \
CGO_CFLAGS="-I$(pwd)/third_party/rocksdb/dist/include" \
CGO_LDFLAGS="-L$(pwd)/third_party/rocksdb/dist/lib -lrocksdb -lz -lbz2 -lsnappy -lzstd -llz4" \
YCSB_ENGINES="nokv,badger,rocksdb" ./scripts/run_benchmarks.sh
# One-click script (auto-detect RocksDB, supports `YCSB_*` env vars to override defaults)
./scripts/run_benchmarks.sh
# Quick smoke run (smaller dataset)
NOKV_RUN_BENCHMARKS=1 YCSB_RECORDS=10000 YCSB_OPS=50000 YCSB_WARM_OPS=0 \
./scripts/run_benchmarks.sh -ycsb_workloads=A -ycsb_engines=nokv

Tip: Pin GOCACHE/GOMODCACHE in CI to keep build artefacts local and avoid permission issues.


2. Module Coverage Overview

| Module | Tests | Coverage Highlights | Gaps / Next Steps |
| --- | --- | --- | --- |
| WAL | engine/wal/manager_test.go | Segment rotation, sync semantics, replay tolerance for truncation, directory bootstrap. | Add IO fault injection, concurrent append stress. |
| LSM / Flush / Compaction | engine/lsm/lsm_test.go, engine/lsm/picker_test.go, engine/lsm/planner_test.go, engine/lsm/compaction_test.go, engine/lsm/flush_runtime_test.go | Memtable correctness, iterator merging, flush pipeline metrics, compaction scheduling. | Extend backpressure assertions and workload-shape coverage. |
| Manifest | engine/manifest/manager_test.go, engine/lsm/manifest_test.go | CURRENT swap safety, rewrite crash handling, vlog metadata persistence. | Simulate partial edit corruption, column family extensions. |
| ValueLog | engine/vlog/manager_test.go, engine/vlog/io_test.go, vlog_test.go | ValuePtr encoding/decoding, GC rewrite/rewind, concurrent iterator safety. | Long-running GC, discard-ratio edge cases. |
| Percolator / Distributed Txn | percolator/*_test.go, raftstore/client/client_test.go, stats_test.go | Prewrite/Commit/ResolveLock flows, 2PC retries, timestamp-driven MVCC behaviour, metrics accounting. | Mixed multi-region fuzzing with lock TTL and leader churn. |
| DB Integration | db_test.go, db_bench_test.go | End-to-end writes, recovery, and throttle behaviour. | Combine ValueLog GC + compaction stress, multi-DB interference. |
| CLI & Stats | cmd/nokv/main_test.go, stats_test.go | Golden JSON output, stats snapshot correctness, hot key ranking. | CLI error handling, expvar HTTP integration tests. |
| Scripts & Tooling | cmd/nokv-config/main_test.go, cmd/nokv/serve_test.go | nokv-config JSON/simple formats, catalog bootstrap CLI, serve bootstrap behavior. | Add direct shell-script golden tests (currently not present) and failure-path diagnostics for cluster.sh. |
| Distributed Migration & Membership | raftstore/integration/*_test.go, raftstore/migrate/*_test.go, raftstore/admin/service_test.go | Standalone -> seeded -> cluster flow, snapshot install, add/remove peer, leader transfer, restart/dehost recovery, Coordinator outage after startup, quorum-loss context propagation, multi-region 2PC deadline propagation, repeated link flap during membership changes, partitioned follower catch-up, and snapshot-install interruption before publish. | Keep expanding publish-boundary coverage and larger fault matrices around runtime/transport interleavings. |
| Benchmark | benchmark/ycsb/ycsb_test.go, benchmark/ycsb/ycsb_runner.go | YCSB throughput/latency comparisons across engines (A-F) with detailed percentile + operation mix reporting. | Automate multi-node deployments and add longer-running, multi-GB stability baselines. |

3. System Scenarios

| Scenario | Coverage | Focus |
| --- | --- | --- |
| Crash recovery | db_test.go | WAL replay, fail-fast on missing/corrupt SST (manifest preserved for investigation), vlog GC restart, manifest rewrite safety. |
| WAL pointer desync | raftstore/raftlog/wal_storage_test.go::TestWALStorageDetectsTruncatedSegment | Detects store-local raft pointer offsets beyond truncated WAL tails to avoid silent corruption. |
| Distributed transaction contention | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, percolator/*_test.go | Lock conflicts, retries, and 2PC sequencing under region routing. |
| Value separation + GC | engine/vlog/manager_test.go, db_test.go::TestRecoveryRemovesStaleValueLogSegment | GC correctness, manifest integration, iterator stability. |
| Iterator consistency | engine/lsm/iterator_test.go | Snapshot visibility, merging iterators across levels and memtables. |
| Throttling / backpressure | engine/lsm/compaction_test.go, db_test.go::TestWriteThrottle | L0 backlog triggers, flush queue growth, metrics observation. |
| Distributed NoKV client | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, raftstore/transport/grpc_transport_test.go::TestGRPCTransportManualTicksDriveElection | Region-aware routing, NotLeader retries, manual tick-driven elections, cross-region 2PC sequencing. |
| Migration & membership orchestration | raftstore/integration/migration_flow_test.go, raftstore/integration/restart_recovery_test.go, raftstore/integration/coordinator_degraded_test.go, raftstore/integration/snapshot_interruption_test.go, raftstore/integration/context_propagation_test.go, raftstore/integration/transport_chaos_test.go | Seed bootstrap, multi-peer rollout, leader transfer, peer removal, restarted follower recovery, removed-peer dehost after restart, Coordinator outage after startup, quorum-loss read/write timeouts, split-region 2PC deadline propagation, repeated link flap during membership changes, partitioned follower catch-up, transfer-leader retry after partition recovery, and snapshot-install interruption before publish. |
| Performance regression | benchmark package | Compare NoKV vs Badger/Pebble by default (RocksDB optional), produce human-readable reports under benchmark/benchmark_results. |

4. Observability in Tests

  • RECOVERY_METRIC logs – produced when RECOVERY_TRACE_METRICS=1; helpful when triaging targeted recovery suites and CI failures.
  • TRANSPORT_METRIC logs – emitted by transport chaos tests when CHAOS_TRACE_METRICS=1, capturing gRPC watchdog counters during network partitions and retries.
  • Stats snapshots – stats_test.go verifies JSON structure so CLI output remains backwards compatible.
  • Benchmark artefacts – stored under benchmark/results/ for shared suites and under suite-local result directories where applicable, for example benchmark/results/ycsb/.

5. Extending Coverage

  1. Property-based testing – integrate testing/quick or third-party generators to randomise distributed 2PC sequences (prewrite/commit/rollback ordering); a minimal skeleton is sketched after this list.
  2. Stress harness – add a Go-based stress driver to run mixed read/write workloads for hours, capturing metrics akin to RocksDB’s db_stress tool.
  3. Distributed readiness – strengthen raftstore fault-injection and long-run tests (leader transfer, transport chaos, snapshot catch-up) with reproducible CI artifacts.
  4. CLI smoke tests – simulate corrupted directories to ensure CLI emits actionable errors.
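
The property-based direction in item 1 could start from a skeleton like the one below. It drives a toy prewrite/commit/rollback model, not NoKV's transaction code, and is only meant to show the testing/quick harness shape.

```go
// Illustrative property-test skeleton: random op sequences against a toy
// prewrite/commit/rollback model, asserting that a key is never locked
// and committed at the same time. Not NoKV's actual transaction code.
package txnprop_test

import (
	"math/rand"
	"testing"
	"testing/quick"
)

type toyTxn struct {
	locked    bool
	committed bool
}

func (t *toyTxn) prewrite() {
	if !t.committed {
		t.locked = true
	}
}

func (t *toyTxn) commit() {
	if t.locked {
		t.committed = true
		t.locked = false
	}
}

func (t *toyTxn) rollback() {
	t.locked = false
}

func TestRandomTxnSequencesKeepInvariant(t *testing.T) {
	prop := func(seed int64, steps uint8) bool {
		rng := rand.New(rand.NewSource(seed))
		txn := &toyTxn{}
		for i := 0; i < int(steps); i++ {
			switch rng.Intn(3) {
			case 0:
				txn.prewrite()
			case 1:
				txn.commit()
			case 2:
				txn.rollback()
			}
			if txn.locked && txn.committed {
				return false // invariant violated
			}
		}
		return true
	}
	if err := quick.Check(prop, nil); err != nil {
		t.Fatal(err)
	}
}
```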

6. Distributed Test Layers

  • Protocol unit tests: package-local tests under raftstore/peer, raftstore/store, raftstore/admin, raftstore/snapshot, and raftstore/migrate validate one protocol surface at a time.
  • Node-local integration tests: store/admin tests verify snapshot install, membership application, and region runtime publication without booting a full cluster.
  • Multi-node deterministic data-plane integration tests: raftstore/integration uses raftstore/testcluster to boot real stores, wire transports, and drive migration/member flows against live runtimes.
  • Multi-node deterministic control-plane integration tests: coordinator/integration/*_test.go uses coordinator/testcluster to boot 3 coordinator + replicated meta, exercise rooted watch/reload propagation, follower write rejection, allocator-fence/remove-region propagation, and control-plane read staleness without mixing those cases into store/data-plane tests.
  • Restart and recovery suites: raftstore/integration/restart_recovery_test.go covers restarted followers, removed-peer dehost persistence, and leader restart with subsequent membership changes.
  • Control-plane degradation and publish-boundary tests: raftstore/integration/coordinator_degraded_test.go and raftstore/integration/snapshot_interruption_test.go cover live Coordinator outage after startup and failpoint-driven snapshot interruption before peer publication.

When adding new distributed tests:

  • use raftstore/testcluster for store/data-plane behavior
  • use coordinator/testcluster for control-plane / replicated-root behavior
  • avoid embedding ad-hoc cluster bootstrap helpers into feature-specific test files

7. Distributed Fault Matrix

| Fault Class | Current Coverage | Primary Tests | Notes |
| --- | --- | --- | --- |
| Snapshot export/install failure | Covered | raftstore/migrate/expand_test.go, raftstore/store/peer_lifecycle_test.go, raftstore/admin/service_test.go | Covers leader export failure, target install failure, and corrupt payload rejection without partially hosted peers. |
| Membership wait timeouts | Covered | raftstore/migrate/expand_test.go, raftstore/migrate/remove_peer_test.go, raftstore/migrate/transfer_leader_test.go | Verifies timeout surfaces when leader metadata does not publish, target never hosts, peer removal never converges, or leader transfer stalls. |
| Follower restart after snapshot install | Covered | raftstore/integration/restart_recovery_test.go::TestExpandedPeerRestartPreservesRegionAndData | Ensures installed peer persists region metadata and data after restart. |
| Removed peer restart | Covered | raftstore/integration/restart_recovery_test.go::TestRemovedPeerRestartDoesNotRehost | Ensures dehosted peers do not come back after restart. |
| Leader restart with follow-up membership change | Covered | raftstore/integration/restart_recovery_test.go::TestLeaderRestartStillAllowsMembershipChanges | Exercises leadership churn before a later remove-peer operation. |
| Control-plane degraded / Coordinator unavailable | Covered | coordinator/adapter/scheduler_client_test.go, raftstore/store/command_ops_test.go::TestStoreProposeCommandSurvivesSchedulerUnavailable, raftstore/integration/coordinator_degraded_test.go::TestClusterSurvivesCoordinatorUnavailableAfterStartup | Covers both local degraded scheduler semantics and live multi-node Coordinator outage after route cache warmup; new cold-route misses still fail with RouteUnavailable as expected. |
| Scheduler queue overflow / dropped operations | Covered | raftstore/store/scheduler_runtime_test.go::TestStoreSchedulerStatusTracksQueueDrop | Validates local degraded status and dropped operation accounting. |
| Snapshot install interrupted before publish | Covered | raftstore/integration/snapshot_interruption_test.go::TestExpandSnapshotInstallInterruptedBeforePublish, raftstore/store/peer_lifecycle_test.go::TestStoreInstallRegionSnapshotRejectsCorruptPayload | Uses failpoint injection to verify target install aborts without leaving a hosted peer or polluted region metadata, then retries cleanly after restart. |
| Request cancel / deadline propagation | Covered | raftstore/client/client_test.go::TestClientGetHonorsCanceledContextDuringRouteLookup, raftstore/client/client_test.go::TestClientGetHonorsCanceledContextDuringRPC, raftstore/client/client_test.go::TestClientPutHonorsCanceledContextDuringRouteLookup, raftstore/client/client_test.go::TestClientPutHonorsCanceledContextDuringRPC, raftstore/client/client_test.go::TestClientTwoPhaseCommitHonorsCanceledContextDuringMultiRegionRouteLookup, raftstore/client/client_test.go::TestClientTwoPhaseCommitHonorsCanceledContextDuringMultiRegionRPC, raftstore/client/client_test.go::TestClientResolveLocksHonorsCanceledContextDuringMultiRegionRPC, raftstore/integration/context_propagation_test.go::TestClientReadWriteHonorContextUnderQuorumLoss, raftstore/integration/context_propagation_test.go::TestClientTwoPhaseCommitHonorsContextAcrossSplitRegionsUnderPartialQuorumLoss | Verifies read/write paths plus multi-region 2PC and resolve-lock flows preserve caller cancellation/deadlines through route lookup, RPC, and live split-region quorum loss instead of collapsing to generic retry exhaustion. |
| Transport partition / interleave recovery | Covered | raftstore/transport/grpc_transport_test.go::TestGRPCTransportHandlesPartition, raftstore/transport/grpc_transport_test.go::TestGRPCTransportFailpointBeforeSendRPCRecoversAfterClear, raftstore/peer/peer_test.go::TestPeerFailpointAfterReadyAdvanceBeforeSendRecoversOnLaterTicks, raftstore/integration/transport_chaos_test.go::TestPartitionedFollowerCatchesUpAfterRecovery, raftstore/integration/transport_chaos_test.go::TestTransferLeaderRecoversAfterPartitionedTargetReturns, raftstore/integration/transport_chaos_test.go::TestRepeatedLinkFlapConvergesDuringMembershipChanges | Covers low-level gRPC link blocking, send-boundary failpoints, Ready advance/send publication gaps, repeated link flaps during membership operations, and live cluster recovery after follower isolation/restart plus transfer-leader timeout/retry under transport partitions. |
| Split/merge restart safety | Covered | raftstore/store/store_test.go::TestStoreRestartPreservesSplitMergeLocalMeta, raftstore/integration/split_merge_recovery_test.go::TestSplitMergeRestartSafetyAcrossStores | Covers store-local recovery plus live multi-store split -> restart -> merge -> restart flow after making split/merge admin replay idempotent across restart. |

Next fault-matrix additions should focus on:

  • more publish-boundary failpoints around snapshot install and migration init
  • deeper transport/interleave chaos beyond partition + recovery, especially more concurrent membership combinations and repeated multi-link flaps

Keep this matrix updated when adding new modules or scenarios so documentation and automation remain aligned.