# NoKV Architecture Overview
NoKV delivers a hybrid storage engine that can operate as a standalone embedded KV store or as a distributed NoKV service. The distributed RPC surface follows a TinyKV/TiKV-style region + MVCC design, but the service identity and deployment model are NoKV’s own. This document captures the key building blocks, how they interact, and the execution flow from client to disk.
Read this page if you want the shortest route from “what is NoKV” to “which package owns which part of the system”.
This architecture is also meant to support NoKV as a maintainable and extensible distributed storage research platform. The point is not only to describe how the current system runs, but to make the package boundaries, lifecycle ownership, and experiment surfaces explicit enough that new storage-engine, metadata, control-plane, and distributed-runtime ideas can be added without rebuilding the repository around each new topic.
At a high level, the codebase is organized around four long-lived layers:
- Root facade and runtime surface – the top-level `DB` APIs and thin system entrypoints.
- Single-node engine substrate – `engine/*` owns WAL, LSM, manifest, value log, file, and VFS mechanics.
- Distributed execution and control plane – `raftstore/*`, `meta/*`, and `coordinator/*` host replicated execution, rooted metadata, and cluster control logic.
- Experiment and evidence layer – `benchmark/*`, scripts, and docs keep evaluation and design claims attached to the implementation.
## Reader Map
- If you care about the embedded engine, focus on sections 2 and 5.
- If you care about distributed runtime ownership, focus on sections 3, 4, and 5.
- If you care about migration and recovery, read this page together with `migration.md` and `recovery.md`.
## 1. High-Level Layout
```
┌──────────────────────────┐   NoKV gRPC   ┌──────────────────────────┐
│ raftstore Service        │◀─────────────▶│ raftstore/client         │
└───────────┬──────────────┘               │ (Get / Scan / Mutate)    │
            │                              └──────────────────────────┘
            │ ReadCommand / ProposeCommand
            ▼
┌──────────────────────────┐
│ store.Store / peer.Peer  │ ← multi-Raft region lifecycle
│ ├ Local peer catalog     │
│ ├ Router / region catalog│
│ └ transport (gRPC)       │
└───────────┬──────────────┘
            │ Apply via kv.Apply
            ▼
┌──────────────────────────┐
│ kv.Apply + percolator    │
│ ├ Get / Scan             │
│ ├ Prewrite / Commit      │
│ └ Latch manager          │
└───────────┬──────────────┘
            │
            ▼
┌──────────────────────────┐
│ Embedded NoKV core       │
│ ├ WAL Manager            │
│ ├ MemTable / Flush       │
│ ├ ValueLog + GC          │
│ └ Manifest / Stats       │
└──────────────────────────┘
```
- Embedded mode uses `NoKV.Open` directly: WAL→MemTable→SST durability, ValueLog separation, non-transactional APIs with internal version ordering, and rich stats.
- Distributed mode layers `raftstore` on top: multi-Raft regions reuse the same WAL, keep store-local recovery metadata separate from storage manifest state, expose metrics, and serve NoKV RPCs.
- Control-plane split: `raft_config` provides bootstrap topology; Coordinator provides runtime routing/TSO/control-plane state in cluster mode.
- Clients obtain leader-aware routing, automatic NotLeader/EpochNotMatch retries, and two-phase commit helpers.
**Same system, two shapes**

```mermaid
flowchart LR
App["App / CLI / fsmeta client"]
App --> Embedded["Embedded NoKV DB"]
App --> RPC["NoKV RPC / raftstore/client"]
subgraph "Standalone shape"
Embedded --> Core["WAL + LSM + VLog + MVCC"]
end
subgraph "Distributed shape"
RPC --> Server["server.Node"]
Server --> Store["store.Store"]
Store --> Peer["peer.Peer"]
Peer --> Core
Store --> Coordinator["Coordinator"]
end
Embedded -. migrate init / seed .-> Store
```
**Detailed Runtime Paths**

For function-level call chains with sequence diagrams (embedded write/read, iterator scan, distributed read/write via Raft apply), see `docs/runtime.md`.
## 2. Embedded Engine

**Code entry points**

If you want to inspect the embedded side first, start here:
```go
package main

import (
	"fmt"

	NoKV "path/to/NoKV" // replace with the repository's module path
)

func main() {
	opt := NoKV.NewDefaultOptions()
	opt.WorkDir = "./workdir"
	db, err := NoKV.Open(opt)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	_ = db.Set([]byte("hello"), []byte("world"))
	entry, _ := db.Get([]byte("hello"))
	fmt.Println(string(entry.Value))
}
```
Then read:

- `db.go`
- `engine/lsm/`
- `engine/wal/`
- `vlog.go`
### 2.1 WAL & MemTable

- `wal.Manager` appends `[len|type|payload|crc]` records (typed WAL), rotates segments, and replays logs on crash (see the encoding sketch after this list).
- `MemTable` accumulates writes until full, then enters the flush queue; the concrete flush runtime runs `Enqueue → Build → Install → Release`, logs edits, and releases WAL segments.
- Writes are handled by a single commit worker that performs the value-log append first, then the WAL/memtable apply, keeping durability ordering simple and consistent.
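To make the `[len|type|payload|crc]` framing concrete, here is a minimal encoding/decoding sketch. The field widths, record-type byte, and IEEE CRC32 are illustrative assumptions, not `wal.Manager`'s actual wire format:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// encodeRecord frames a typed WAL record as [len|type|payload|crc].
// Field widths (4-byte len, 1-byte type, 4-byte CRC) are illustrative.
func encodeRecord(recType byte, payload []byte) []byte {
	buf := make([]byte, 0, 4+1+len(payload)+4)
	buf = binary.LittleEndian.AppendUint32(buf, uint32(len(payload)))
	buf = append(buf, recType)
	buf = append(buf, payload...)
	// CRC covers type+payload so a torn tail is detected on replay.
	crc := crc32.ChecksumIEEE(buf[4:])
	return binary.LittleEndian.AppendUint32(buf, crc)
}

// decodeRecord reverses encodeRecord, rejecting truncated or corrupt tails.
func decodeRecord(buf []byte) (recType byte, payload []byte, err error) {
	if len(buf) < 9 {
		return 0, nil, errors.New("short record")
	}
	n := binary.LittleEndian.Uint32(buf[:4])
	if uint32(len(buf)) < 9+n {
		return 0, nil, errors.New("truncated record")
	}
	body := buf[4 : 5+n] // type byte + payload
	crc := binary.LittleEndian.Uint32(buf[5+n : 9+n])
	if crc32.ChecksumIEEE(body) != crc {
		return 0, nil, errors.New("crc mismatch")
	}
	return body[0], bytes.Clone(body[1:]), nil
}

func main() {
	rec := encodeRecord(1, []byte("set hello=world"))
	t, p, err := decodeRecord(rec)
	fmt.Println(t, string(p), err)
}
```

Because the CRC trails the payload, replay can stop cleanly at the first record whose checksum fails, which is exactly the torn-write case a crash leaves behind.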
### 2.2 ValueLog

- Large values are written to the ValueLog before the WAL append; the resulting `ValuePtr` is stored in the WAL/LSM so replay can recover it (a pointer-encoding sketch follows).
- `vlog.Manager` tracks the active head and uses flush discard stats to trigger GC; the manifest records new heads and removed segments.
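To show what the indirection buys, here is a minimal sketch of a value pointer stored in the LSM in place of a large value. The field names and fixed 12-byte layout are assumptions for illustration, not NoKV's actual `ValuePtr` encoding:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ValuePtr locates a value inside a value-log segment. The LSM stores this
// small fixed-size pointer instead of the large value itself.
type ValuePtr struct {
	Fid    uint32 // value-log segment file ID
	Offset uint32 // byte offset of the record inside the segment
	Len    uint32 // length of the stored value record
}

// Encode packs the pointer into a fixed 12-byte representation.
func (p ValuePtr) Encode() []byte {
	var b [12]byte
	binary.LittleEndian.PutUint32(b[0:4], p.Fid)
	binary.LittleEndian.PutUint32(b[4:8], p.Offset)
	binary.LittleEndian.PutUint32(b[8:12], p.Len)
	return b[:]
}

// DecodeValuePtr reverses Encode.
func DecodeValuePtr(b []byte) ValuePtr {
	return ValuePtr{
		Fid:    binary.LittleEndian.Uint32(b[0:4]),
		Offset: binary.LittleEndian.Uint32(b[4:8]),
		Len:    binary.LittleEndian.Uint32(b[8:12]),
	}
}

func main() {
	p := ValuePtr{Fid: 7, Offset: 4096, Len: 1 << 20}
	fmt.Printf("%+v\n", DecodeValuePtr(p.Encode()))
}
```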
### 2.3 Manifest

- `manifest.Manager` stores only storage-engine metadata: SST metadata, WAL checkpoints, and ValueLog metadata. Store-local raft replay pointers live in `raftstore/localmeta`.
- `CURRENT` provides crash-safe pointer updates for storage-engine metadata (see the pointer-swap sketch below). Region descriptors are no longer stored in the storage manifest.
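The `CURRENT` file follows the classic write-then-rename pointer-swap pattern. The sketch below shows that sequence under assumed file names; it is not NoKV's exact implementation (a production version would also fsync the directory after the rename):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setCurrent atomically points CURRENT at a freshly written manifest file.
// A crash before the rename leaves the old CURRENT (and old manifest) intact.
func setCurrent(dir, manifestName string) error {
	tmp := filepath.Join(dir, "CURRENT.tmp")
	// 1. Write the new pointer to a temp file and fsync it.
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.WriteString(manifestName + "\n"); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// 2. Atomically replace CURRENT; rename is atomic on POSIX filesystems.
	return os.Rename(tmp, filepath.Join(dir, "CURRENT"))
}

func main() {
	dir, _ := os.MkdirTemp("", "manifest")
	defer os.RemoveAll(dir)
	if err := setCurrent(dir, "MANIFEST-000042"); err != nil {
		panic(err)
	}
	cur, _ := os.ReadFile(filepath.Join(dir, "CURRENT"))
	fmt.Print(string(cur))
}
```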
### 2.4 LSM Compaction & Landing Buffer

- `lsm.compaction` drives compaction cycles; `lsm.levelManager` supplies table metadata and executes the plan.
- Planning is split inside `lsm`: `PlanFor*` selects table IDs + key ranges, then the LSM resolves the IDs back to tables and runs the merge.
- `lsm.State` guards overlapping key ranges and tracks in-flight table IDs (a range-guard sketch follows the diagram below).
- Landing shard selection is policy-driven in `lsm` (`PickShardOrder` / `PickShardByBacklog`) while the landing buffer remains in `lsm`.
```mermaid
flowchart TD
Manager["lsm.compaction"] --> LSM["lsm.levelManager"]
LSM -->|TableMeta snapshot| Planner["PlanFor*"]
Planner --> Plan["lsm.Plan (fid+range)"]
Plan -->|resolvePlanLocked| Exec["LSM executor"]
Exec --> State["lsm.State guard"]
Exec --> Build["subcompact/build SST"]
Build --> Manifest["manifest edits"]
L0["L0 tables"] -->|moveToLanding| Landing["landing buffer shards"]
Landing -->|LandingDrain: landing-only| Main["Main tables"]
Landing -->|LandingKeep: landing-merge| Landing
```
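The guard role of `lsm.State` reduces to admitting a plan only when its key range overlaps no in-flight compaction. A minimal sketch of that idea, with invented names (`rangeGuard`, `tryAdmit`) rather than NoKV's real API:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// keyRange is a half-open [start, end) span of user keys.
type keyRange struct{ start, end []byte }

func (a keyRange) overlaps(b keyRange) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// rangeGuard admits at most one in-flight compaction per overlapping key
// range, mirroring the guard role the text assigns to lsm.State.
type rangeGuard struct {
	mu       sync.Mutex
	inFlight []keyRange
}

// tryAdmit registers r unless it overlaps an in-flight compaction.
func (g *rangeGuard) tryAdmit(r keyRange) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	for _, cur := range g.inFlight {
		if cur.overlaps(r) {
			return false // another compaction owns part of this range
		}
	}
	g.inFlight = append(g.inFlight, r)
	return true
}

// done releases r after the compaction installs its results.
func (g *rangeGuard) done(r keyRange) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for i, cur := range g.inFlight {
		if bytes.Equal(cur.start, r.start) && bytes.Equal(cur.end, r.end) {
			g.inFlight = append(g.inFlight[:i], g.inFlight[i+1:]...)
			return
		}
	}
}

func main() {
	g := &rangeGuard{}
	a := keyRange{[]byte("a"), []byte("m")}
	fmt.Println(g.tryAdmit(a))                                  // true
	fmt.Println(g.tryAdmit(keyRange{[]byte("k"), []byte("z")})) // false: overlaps [a,m)
	g.done(a)
	fmt.Println(g.tryAdmit(keyRange{[]byte("k"), []byte("z")})) // true after release
}
```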
### 2.5 Distributed Transaction Path

- `percolator` implements Prewrite/Commit/ResolveLock/CheckTxnStatus; `kv.Apply` dispatches raft commands to these helpers.
- MVCC timestamps come from the distributed client/Coordinator TSO flow, not from an embedded standalone transaction API.
- Watermarks (`utils.WaterMark`) are used in durability/visibility coordination; they have no background goroutine and advance via mutex + atomics (see the sketch below).
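To illustrate the goroutine-free watermark design, here is a minimal sketch that tracks the highest timestamp below which all begun work is done, advancing under a mutex with a lock-free atomic read path. Names and structure are illustrative, not the actual `utils.WaterMark` API, and the sketch assumes `Begin` is called in ascending timestamp order:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// waterMark tracks the highest ts such that every ts' <= ts has completed.
// Begin/Done serialize on a mutex; DoneUntil is a lock-free atomic read.
type waterMark struct {
	mu        sync.Mutex
	pending   map[uint64]int // ts -> outstanding Begin count
	order     []uint64       // begun timestamps, ascending (assumes ordered Begin)
	doneUntil atomic.Uint64
}

func newWaterMark() *waterMark {
	return &waterMark{pending: map[uint64]int{}}
}

func (w *waterMark) Begin(ts uint64) {
	w.mu.Lock()
	if w.pending[ts] == 0 {
		w.order = append(w.order, ts)
	}
	w.pending[ts]++
	w.mu.Unlock()
}

// Done marks ts complete and advances the watermark across the fully
// completed prefix of begun timestamps.
func (w *waterMark) Done(ts uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[ts]--
	for len(w.order) > 0 && w.pending[w.order[0]] == 0 {
		delete(w.pending, w.order[0])
		w.doneUntil.Store(w.order[0])
		w.order = w.order[1:]
	}
}

// DoneUntil reports the watermark without taking the mutex.
func (w *waterMark) DoneUntil() uint64 { return w.doneUntil.Load() }

func main() {
	w := newWaterMark()
	w.Begin(1)
	w.Begin(2)
	w.Begin(3)
	w.Done(2)
	fmt.Println(w.DoneUntil()) // 0: ts=1 still pending
	w.Done(1)
	fmt.Println(w.DoneUntil()) // 2: prefix {1,2} complete
	w.Done(3)
	fmt.Println(w.DoneUntil()) // 3
}
```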
### 2.6 Write Pipeline & Backpressure

- Writes enqueue into a commit queue inside `db.go`, where requests are coalesced into batches before a commit worker drains them (see the batching sketch after this list).
- The commit worker always writes the value log first (when needed), then applies WAL/LSM updates; `SyncWrites` adds a WAL fsync step.
- Batch sizing adapts to backlog through `WriteBatchMaxCount`, `WriteBatchMaxSize`, and `WriteBatchWait`.
- Backpressure is enforced in two places: LSM throttling toggles `db.blockWrites` when L0 backlog grows, and Thermos can reject hot keys via `WriteHotKeyLimit`.
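A minimal sketch of count/size/wait batch coalescing, using invented names (`request`, `commitWorker`) rather than NoKV's real types: the worker takes one request, then greedily drains more until a count, byte, or time budget is exhausted:

```go
package main

import (
	"fmt"
	"time"
)

type request struct{ payload []byte }

// commitWorker drains reqs into batches bounded by maxCount, maxBytes, and
// maxWait, mirroring the WriteBatchMaxCount/MaxSize/Wait knobs above.
func commitWorker(reqs <-chan *request, commit func([]*request)) {
	const (
		maxCount = 64
		maxBytes = 1 << 20
		maxWait  = time.Millisecond
	)
	for first := range reqs {
		batch := []*request{first}
		size := len(first.payload)
		deadline := time.NewTimer(maxWait)
	fill:
		for len(batch) < maxCount && size < maxBytes {
			select {
			case r, ok := <-reqs:
				if !ok {
					break fill
				}
				batch = append(batch, r)
				size += len(r.payload)
			case <-deadline.C:
				break fill // wait budget spent; commit what we have
			}
		}
		deadline.Stop()
		commit(batch)
	}
}

func main() {
	reqs := make(chan *request, 8)
	for i := 0; i < 5; i++ {
		reqs <- &request{payload: []byte("v")}
	}
	close(reqs)
	commitWorker(reqs, func(b []*request) {
		fmt.Println("commit batch of", len(b))
	})
}
```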
### 2.7 Ref-Count Lifecycle Contracts
NoKV uses fail-fast reference counting for internal pooled/owned objects. `DecrRef` underflow is treated as a lifecycle bug and panics (a minimal sketch follows the table).
| Object | Owned by | Borrowed by | Release rule |
|---|---|---|---|
| `kv.Entry` (pooled) | internal write/read pipelines | codec iterator, memtable/lsm internal reads, request batches | Must call `DecrRef` exactly once per borrow. |
| `kv.Entry` (detached public result) | caller | none | Returned by `DB.Get`; must not call `DecrRef`. |
| `kv.Entry` (borrowed internal result) | caller | yes (`DecrRef`) | Returned by `DB.GetInternalEntry`; caller must release exactly once. |
| `request` | commit queue/worker | waiter path (`Wait`) | `IncrRef` on enqueue; `Wait` does one `DecrRef`; zero returns the request to the pool and releases entries. |
| `table` | level/main+landing lists, block cache | table iterators, prefetch workers | Removed tables are decremented once after the manifest + in-memory swap; zero deletes the SST. |
| Skiplist / ART index | memtable | iterators | Iterator creation increments the index ref; iterator `Close` decrements; double-close is idempotent. |
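The fail-fast contract fits in a few lines; the sketch below is a generic illustration of panic-on-underflow reference counting, not NoKV's actual implementation:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// refCounted implements the fail-fast lifecycle described above: a DecrRef
// below zero is an ownership-accounting bug, so it panics instead of
// silently corrupting pooled state.
type refCounted struct {
	refs    atomic.Int32
	release func() // runs exactly once, when the count reaches zero
}

func newRefCounted(release func()) *refCounted {
	rc := &refCounted{release: release}
	rc.refs.Store(1) // creator holds the initial reference
	return rc
}

func (rc *refCounted) IncrRef() { rc.refs.Add(1) }

func (rc *refCounted) DecrRef() {
	switch n := rc.refs.Add(-1); {
	case n == 0:
		rc.release()
	case n < 0:
		panic("DecrRef underflow: lifecycle bug")
	}
}

func main() {
	rc := newRefCounted(func() { fmt.Println("released") })
	rc.IncrRef() // borrow
	rc.DecrRef() // borrower done
	rc.DecrRef() // owner done -> "released"
}
```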
## 3. Replication Layer (raftstore)

**Code entry points**

If you want to inspect the distributed side first, start here:
```go
// Snippet: assumes `db` is an open NoKV DB (see section 2) and that the
// raftstore/server, raftstore/store, and raft packages are imported as
// server, store, and myraft respectively.
srv, err := server.NewNode(server.Config{
	Storage:       server.Storage{MVCC: db, Raft: db.RaftLog()},
	Store:         store.Config{StoreID: 1},
	Raft:          myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
	TransportAddr: "127.0.0.1:20160",
})
if err != nil {
	panic(err)
}
defer srv.Close()
```
Then read:

- `raftstore/server/node.go`
- `raftstore/store/store.go`
- `raftstore/peer/peer.go`
- `raftstore/raftlog/wal_storage.go`
- `raftstore/localmeta/store.go`
| Package | Responsibility |
|---|---|
| `store` | Region catalog/runtime root, router, RegionMetrics, scheduler + command runtimes, helpers such as `StartPeer` / `SplitRegion`. |
| `peer` | Wraps etcd/raft `RawNode`, handles the Ready pipeline, snapshot resend queue, backlog instrumentation. |
| `raftlog` | `WALStorage`/`DiskStorage`/`MemoryStorage`, reusing the DB's WAL while keeping store-local raft replay metadata in sync. |
| `transport` | gRPC transport for Raft Step messages, connection management, retries/blocks/TLS. Also acts as the host for NoKV RPC. |
| `kv` | NoKV RPC handler plus `kv.Apply`, bridging Raft commands to MVCC logic. |
| `server` | `Config` + `NewNode` combine DB, Store, transport, and the NoKV service into a reusable node instance. |
### 3.1 Bootstrap Sequence

- `server.NewNode` wires the DB, store configuration (StoreID, hooks, scheduler), Raft config, and transport address. It registers the NoKV RPC service on the shared gRPC server and sets `transport.SetHandler(store.Step)`.
- The CLI (`nokv serve`) or the application enumerates the local peer catalog and calls `Store.StartPeer` for every Region containing the local store:
  - `peer.Config` includes Raft params, transport, `kv.NewEntryApplier`, peer storage, and Region metadata.
  - Router registration, regionManager bookkeeping, optional `Peer.Bootstrap` with the initial peer list, leader campaign.
- Peers from other stores can be configured through `transport.SetPeer(peerID, addr)` (raft peer ID). In cluster mode, runtime routing/control-plane decisions come from Coordinator.
### 3.2 Command Paths

- ReadCommand (`KvGet`/`KvScan`): validate the Region & leader, execute a Raft ReadIndex (`LinearizableRead`) and `WaitApplied`, then run `commandApplier` (i.e. `kv.Apply` in read mode) to fetch data from the DB. This yields leader-strong reads with an explicit Raft linearizability barrier; a sketch of the barrier follows below.
- ProposeCommand (write): encode the request, push it through the Router to the leader peer, replicate via Raft, and apply in `kv.Apply`, which maps to MVCC operations.
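In outline, the read barrier looks like the sketch below. The `raftNode` and `applyWaiter` interfaces are invented stand-ins for the peer and apply machinery named above, not NoKV's real types:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// raftNode and applyWaiter are invented stand-ins for the peer's Raft handle
// and apply loop; they are not NoKV's real interfaces.
type raftNode interface {
	ReadIndex(ctx context.Context) (uint64, error) // Raft ReadIndex barrier
	IsLeader() bool
}

type applyWaiter interface {
	WaitApplied(ctx context.Context, index uint64) error
}

var errNotLeader = errors.New("not leader")

// linearizableRead validates leadership, obtains a ReadIndex, waits until the
// local apply index catches up, then serves the read from local state.
func linearizableRead(ctx context.Context, rn raftNode, aw applyWaiter,
	read func() ([]byte, error)) ([]byte, error) {
	if !rn.IsLeader() {
		return nil, errNotLeader // surfaces as NotLeader; the client retries
	}
	idx, err := rn.ReadIndex(ctx)
	if err != nil {
		return nil, err
	}
	if err := aw.WaitApplied(ctx, idx); err != nil {
		return nil, err
	}
	return read()
}

// fakeNode satisfies both interfaces for a runnable demonstration.
type fakeNode struct{ applied uint64 }

func (f *fakeNode) ReadIndex(context.Context) (uint64, error) { return f.applied, nil }
func (f *fakeNode) IsLeader() bool                            { return true }
func (f *fakeNode) WaitApplied(_ context.Context, index uint64) error {
	if f.applied < index {
		return errors.New("apply lagging")
	}
	return nil
}

func main() {
	n := &fakeNode{applied: 42}
	v, err := linearizableRead(context.Background(), n, n,
		func() ([]byte, error) { return []byte("value@42"), nil })
	fmt.Println(string(v), err)
}
```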
### 3.3 Transport

- The gRPC server handles Step RPCs and NoKV RPCs on the same endpoint; peers are registered via `SetPeer`.
- Retry policies (`WithRetry`) and TLS credentials are configurable. Tests cover partitions, blocked peers, and slow followers.
## 4. NoKV Service

`raftstore/kv/service.go` exposes `pb.NoKV` RPCs:
| RPC | Execution | Result |
|---|---|---|
| `KvGet` | `store.ReadCommand` → `kv.Apply` GET | `pb.GetResponse` / RegionError |
| `KvScan` | `store.ReadCommand` → `kv.Apply` SCAN | `pb.ScanResponse` / RegionError |
| `KvPrewrite` | `store.ProposeCommand` → `percolator.Prewrite` | `pb.PrewriteResponse` |
| `KvCommit` | `store.ProposeCommand` → `percolator.Commit` | `pb.CommitResponse` |
| `KvResolveLock` | `percolator.ResolveLock` | `pb.ResolveLockResponse` |
| `KvCheckTxnStatus` | `percolator.CheckTxnStatus` | `pb.CheckTxnStatusResponse` |
`nokv serve` is the CLI entry point: it opens the DB, constructs `server.Node`, registers peers, starts local Raft peers, and prints a local peer catalog summary (Regions, key ranges, peers). `scripts/dev/cluster.sh` builds the CLI, seeds local peer catalogs, and launches the separated layout (3 meta-root peers + 1 coordinator + all configured stores) on localhost, handling cleanup on Ctrl+C.
The RPC request/response shape is intentionally close to TinyKV/TiKV so the MVCC and region semantics remain familiar, but the service name exposed on the wire is `pb.NoKV`.
## 5. Client Workflow

`raftstore/client` offers a leader-aware client with retry logic and convenient helpers:

- Initialization: provide a Coordinator-backed `RegionResolver` (`GetRegionByKey`) and `StoreResolver` (`GetStore`) so runtime routing and store discovery are Coordinator-driven.
- Reads: `Get` and `Scan` pick the leader store for a key range, issue NoKV RPCs, and retry on NotLeader/EpochNotMatch.
- Writes: `Mutate` bundles operations per region and drives Prewrite/Commit (primary first, secondaries after); `Put` and `Delete` are convenience wrappers using the same 2PC path. A client-side 2PC sketch follows this list.
- Timestamps: clients must supply `startVersion`/`commitVersion`. For distributed demos, use Coordinator (`nokv coordinator`) to obtain globally increasing values before calling `TwoPhaseCommit`.
- Bootstrap helpers: `scripts/dev/cluster.sh --config raft_config.example.json` builds the binaries, seeds local peer catalogs via `nokv-config catalog`, launches the 3 meta-root peers + coordinator, and starts the stores declared in the config.
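To make the primary-first ordering concrete, here is a schematic of the client-side 2PC flow. The `txnClient` interface and helper names are invented for illustration; they only mirror the `Mutate` semantics described above:

```go
package main

import (
	"context"
	"fmt"
)

// txnClient is an invented stand-in for the per-region RPC helpers; it is
// not raftstore/client's real API.
type txnClient interface {
	Prewrite(ctx context.Context, keys [][]byte, primary []byte, startTS uint64) error
	Commit(ctx context.Context, keys [][]byte, startTS, commitTS uint64) error
}

// twoPhaseCommit prewrites every key, then commits the primary before the
// secondaries: once the primary commit is durable, the transaction is
// decided, and secondaries can be resolved even after a client crash.
func twoPhaseCommit(ctx context.Context, c txnClient,
	primary []byte, secondaries [][]byte, startTS, commitTS uint64) error {
	all := append([][]byte{primary}, secondaries...)
	if err := c.Prewrite(ctx, all, primary, startTS); err != nil {
		return err // leftover locks are cleaned via ResolveLock/CheckTxnStatus
	}
	if err := c.Commit(ctx, [][]byte{primary}, startTS, commitTS); err != nil {
		return err
	}
	return c.Commit(ctx, secondaries, startTS, commitTS)
}

type fakeTxn struct{}

func (fakeTxn) Prewrite(_ context.Context, keys [][]byte, primary []byte, startTS uint64) error {
	fmt.Printf("prewrite %d keys, primary=%s, startTS=%d\n", len(keys), primary, startTS)
	return nil
}

func (fakeTxn) Commit(_ context.Context, keys [][]byte, startTS, commitTS uint64) error {
	fmt.Printf("commit %d keys @%d\n", len(keys), commitTS)
	return nil
}

func main() {
	_ = twoPhaseCommit(context.Background(), fakeTxn{},
		[]byte("alfa"), [][]byte{[]byte("mike")}, 10, 11)
}
```

Committing the primary first is what makes recovery decidable: `CheckTxnStatus` on the primary lock tells any resolver whether the whole transaction committed or rolled back.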
**Example (two regions)**

- Regions `[a,m)` and `[m,+∞)`, each led by a different store.
- `Mutate(ctx, primary="alfa", mutations, startTs, commitTs, ttl)` prewrites and commits across the relevant regions.
- `Get`/`Scan` retry automatically if the leader changes.
- See `raftstore/server/node_test.go` for a full end-to-end example using real `server.Node` instances.
## 6. Failure Handling

- Manifest edits capture only storage metadata, WAL checkpoints, and ValueLog pointers. Store-local region recovery state and raft replay pointers are loaded from `raftstore/localmeta`.
- WAL replay reconstructs memtables and Raft groups; ValueLog recovery trims partial records.
- `Stats.StartStats` resumes metrics sampling immediately after restart, making it easy to verify recovery correctness via `nokv stats`.
## 7. Observability & Tooling

- `StatsSnapshot` publishes flush/compaction/WAL/VLog/raft/region/hot/cache metrics. `nokv stats` and the expvar endpoint expose the same data.
- `nokv regions` inspects the local peer catalog. `nokv serve` advertises Region samples on startup (ID, key range, peers) for quick verification.
- Scheduler/control-plane state can be inspected via Coordinator APIs/metrics.
- Scripts:
  - `scripts/dev/cluster.sh` – launch a multi-node NoKV cluster locally.
  - `RECOVERY_TRACE_METRICS=1 go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v` – crash-recovery validation.
  - `CHAOS_TRACE_METRICS=1 go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport` – inject network faults and observe transport metrics.
## 8. When to Use NoKV

- Embedded: call `NoKV.Open` and use the local non-transactional DB APIs.
- Distributed: deploy `nokv serve` nodes and use `raftstore/client` (or any NoKV gRPC client) to perform reads, scans, and 2PC writes.
- Observability-first: inspection via CLI or expvar is built in; Region, WAL, Flush, and Raft metrics are accessible without extra instrumentation.
See also `docs/raftstore.md` for deeper internals, `docs/coordinator.md` for control-plane details, and `docs/testing.md` for coverage details.