# NoKV Architecture Overview
NoKV delivers a hybrid storage engine that can operate as a standalone embedded KV store or as a distributed NoKV service. The distributed RPC surface follows a TinyKV/TiKV-style region + MVCC design, but the service identity and deployment model are NoKV’s own. This document captures the key building blocks, how they interact, and the execution flow from client to disk.
## 1. High-Level Layout
```
┌─────────────────────────┐      NoKV gRPC      ┌─────────────────────────┐
│   raftstore Service     │◀───────────────────▶│    raftstore/client     │
└───────────┬─────────────┘                     │  (Get / Scan / Mutate)  │
            │                                   └─────────────────────────┘
            │ ReadCommand / ProposeCommand
            ▼
┌─────────────────────────┐
│ store.Store / peer.Peer │ ← multi-Raft region lifecycle
│  ├ Manifest snapshot    │
│  ├ Router / RegionHooks │
│  └ transport (gRPC)     │
└───────────┬─────────────┘
            │ Apply via kv.Apply
            ▼
┌─────────────────────────┐
│  kv.Apply + percolator  │
│  ├ Get / Scan           │
│  ├ Prewrite / Commit    │
│  └ Latch manager        │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│   Embedded NoKV core    │
│  ├ WAL Manager          │
│  ├ MemTable / Flush     │
│  ├ ValueLog + GC        │
│  └ Manifest / Stats     │
└─────────────────────────┘
```
- Embedded mode uses `NoKV.Open` directly: WAL→MemTable→SST durability, ValueLog separation, MVCC semantics, rich stats.
- Distributed mode layers `raftstore` on top: multi-Raft regions reuse the same WAL/Manifest, expose metrics, and serve NoKV RPCs.
- Control plane split: `raft_config` provides bootstrap topology; PD provides runtime routing/TSO/control-plane state in cluster mode.
- Clients obtain leader-aware routing, automatic NotLeader/EpochNotMatch retries, and two-phase commit helpers.
### Detailed Runtime Paths

For function-level call chains with sequence diagrams (embedded write/read, iterator scan, distributed read/write via Raft apply), see `docs/runtime.md`.
## 2. Embedded Engine
### 2.1 WAL & MemTable
- `wal.Manager` appends `[len|type|payload|crc]` records (typed WAL), rotates segments, and replays logs on crash.
- `MemTable` accumulates writes until full, then enters the flush queue; `flush.Manager` runs `Prepare → Build → Install → Release`, logs edits, and releases WAL segments.
- Writes are handled by a single commit worker that performs the value-log append first, then the WAL/memtable apply, keeping durability ordering simple and consistent.
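The `[len|type|payload|crc]` framing can be sketched as a small encode/decode pair. This is an illustrative layout only (the field widths and the CRC-32 polynomial are assumptions, not `wal.Manager`'s actual wire format):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// encodeRecord frames a typed WAL record as [len(4) | type(1) | payload | crc(4)].
// Field widths here are illustrative; NoKV's real layout may differ.
func encodeRecord(typ byte, payload []byte) []byte {
	buf := make([]byte, 4+1+len(payload)+4)
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
	buf[4] = typ
	copy(buf[5:], payload)
	// The CRC covers type+payload so a torn or corrupt tail is caught on replay.
	crc := crc32.ChecksumIEEE(buf[4 : 5+len(payload)])
	binary.LittleEndian.PutUint32(buf[5+len(payload):], crc)
	return buf
}

// decodeRecord validates one frame and returns (type, payload).
func decodeRecord(buf []byte) (byte, []byte, error) {
	if len(buf) < 9 {
		return 0, nil, errors.New("short record")
	}
	n := binary.LittleEndian.Uint32(buf[0:4])
	if len(buf) < int(9+n) {
		return 0, nil, errors.New("truncated record")
	}
	typ, payload := buf[4], buf[5:5+n]
	want := binary.LittleEndian.Uint32(buf[5+n : 9+n])
	if crc32.ChecksumIEEE(buf[4:5+n]) != want {
		return 0, nil, errors.New("crc mismatch")
	}
	return typ, payload, nil
}

func main() {
	rec := encodeRecord(1, []byte("set k=v"))
	typ, payload, err := decodeRecord(rec)
	fmt.Println(typ, string(payload), err)
}
```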
### 2.2 ValueLog
- Large values are written to the ValueLog before the WAL append; the resulting `ValuePtr` is stored in the WAL/LSM so replay can recover.
- `vlog.Manager` tracks the active head and uses flush discard stats to trigger GC; the manifest records new heads and removed segments.
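A `ValuePtr` is essentially (segment file ID, offset, length). A minimal sketch with assumed field widths follows; the real type lives in NoKV's value-log code and may be laid out differently:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ValuePtr locates a large value inside a value-log segment.
// The 4/8/4-byte layout is an assumption for illustration.
type ValuePtr struct {
	Fid    uint32 // value-log segment file ID
	Offset uint64 // byte offset within the segment
	Len    uint32 // length of the stored value
}

// Encode serializes the pointer so it can be embedded in WAL/LSM entries.
func (p ValuePtr) Encode() []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], p.Fid)
	binary.BigEndian.PutUint64(buf[4:12], p.Offset)
	binary.BigEndian.PutUint32(buf[12:16], p.Len)
	return buf
}

// DecodeValuePtr is the inverse of Encode.
func DecodeValuePtr(buf []byte) ValuePtr {
	return ValuePtr{
		Fid:    binary.BigEndian.Uint32(buf[0:4]),
		Offset: binary.BigEndian.Uint64(buf[4:12]),
		Len:    binary.BigEndian.Uint32(buf[12:16]),
	}
}

func main() {
	p := ValuePtr{Fid: 3, Offset: 4096, Len: 1024}
	fmt.Printf("%+v\n", DecodeValuePtr(p.Encode()))
}
```

The write ordering described above matters here: because the value is appended to the ValueLog before the pointer reaches the WAL, any pointer seen during replay refers to bytes that are already durable.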
### 2.3 Manifest
- `manifest.Manager` stores SST metadata, WAL checkpoints, ValueLog metadata, and (importantly) the Region descriptors used by raftstore.
- `CURRENT` provides crash-safe pointer updates; Region state is replicated through manifest edits.
### 2.4 LSM Compaction & Ingest Buffer
- `compact.Manager` drives compaction cycles; `lsm.levelManager` supplies table metadata and executes the plan.
- Planning is split: `compact.PlanFor*` selects table IDs + key ranges, then the LSM resolves the IDs back to tables and runs the merge. `compact.State` guards overlapping key ranges and tracks in-flight table IDs.
- Ingest shard selection is policy-driven in `compact` (`PickShardOrder`/`PickShardByBacklog`) while the ingest buffer remains in `lsm`.
```mermaid
flowchart TD
    Manager["compact.Manager"] --> LSM["lsm.levelManager"]
    LSM -->|TableMeta snapshot| Planner["compact.PlanFor*"]
    Planner --> Plan["compact.Plan (fid+range)"]
    Plan -->|resolvePlanLocked| Exec["LSM executor"]
    Exec --> State["compact.State guard"]
    Exec --> Build["subcompact/build SST"]
    Build --> Manifest["manifest edits"]
    L0["L0 tables"] -->|moveToIngest| Ingest["ingest buffer shards"]
    Ingest -->|IngestDrain: ingest-only| Main["Main tables"]
    Ingest -->|IngestKeep: ingest-merge| Ingest
```
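The `compact.State` guard reduces to an interval-disjointness check over key ranges: a compaction is admitted only if its range does not overlap any in-flight one. A toy sketch of that idea (the types and method names here are invented for illustration):

```go
package main

import "fmt"

// keyRange is [Start, End) over byte-comparable keys.
type keyRange struct{ Start, End string }

// overlaps reports whether two half-open ranges intersect.
func (a keyRange) overlaps(b keyRange) bool {
	return a.Start < b.End && b.Start < a.End
}

// state tracks key ranges owned by in-flight compactions.
type state struct{ inflight []keyRange }

// tryAcquire admits a compaction only if its range is disjoint from
// every running compaction, mirroring the guard described above.
func (s *state) tryAcquire(r keyRange) bool {
	for _, cur := range s.inflight {
		if cur.overlaps(r) {
			return false
		}
	}
	s.inflight = append(s.inflight, r)
	return true
}

func main() {
	var s state
	fmt.Println(s.tryAcquire(keyRange{"a", "m"})) // true
	fmt.Println(s.tryAcquire(keyRange{"k", "z"})) // false: overlaps [a,m)
	fmt.Println(s.tryAcquire(keyRange{"m", "z"})) // true: disjoint
}
```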
### 2.5 Distributed Transaction Path
- `percolator` implements Prewrite/Commit/ResolveLock/CheckTxnStatus; `kv.Apply` dispatches Raft commands to these helpers.
- MVCC timestamps come from the distributed client/PD TSO flow, not from an embedded standalone transaction API.
- Watermarks (`utils.WaterMark`) are used in durability/visibility coordination; they have no background goroutine and advance via mutex + atomics.
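A watermark of this shape can be sketched as a mutex-guarded pending set that atomically publishes a `DoneUntil` value inline, with no background goroutine. This is a simplified illustration, not the actual `utils.WaterMark` code (it assumes timestamps begin in increasing order):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// waterMark tracks the highest timestamp up to which all work is done.
type waterMark struct {
	mu        sync.Mutex
	pending   map[uint64]int // ts -> outstanding count
	order     []uint64       // begun timestamps, ascending
	doneUntil atomic.Uint64  // readers load this without the mutex
}

func newWaterMark() *waterMark {
	return &waterMark{pending: map[uint64]int{}}
}

func (w *waterMark) Begin(ts uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pending[ts] == 0 {
		w.order = append(w.order, ts) // assumes ts arrives in ascending order
	}
	w.pending[ts]++
}

func (w *waterMark) Done(ts uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[ts]--
	// Advance doneUntil past every fully-completed prefix, inline.
	for len(w.order) > 0 && w.pending[w.order[0]] == 0 {
		w.doneUntil.Store(w.order[0])
		delete(w.pending, w.order[0])
		w.order = w.order[1:]
	}
}

func (w *waterMark) DoneUntil() uint64 { return w.doneUntil.Load() }

func main() {
	w := newWaterMark()
	w.Begin(1)
	w.Begin(2)
	w.Done(2)                  // ts=1 still pending, so visibility holds
	fmt.Println(w.DoneUntil()) // 0
	w.Done(1)
	fmt.Println(w.DoneUntil()) // 2
}
```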
### 2.6 Write Pipeline & Backpressure
- Writes enqueue into a commit queue (`db_write.go`) where requests are coalesced into batches before a commit worker drains them.
- The commit worker always writes the value log first (when needed), then applies WAL/LSM updates; `SyncWrites` adds a WAL fsync step.
- Batch sizing adapts to backlog (`WriteBatchMaxCount`/`Size`, `WriteBatchWait`), and hot-key pressure can expand batch limits temporarily to drain spikes.
- Backpressure is enforced in two places: LSM throttling toggles `db.blockWrites` when the L0 backlog grows, and HotRing can reject hot keys via `WriteHotKeyLimit`.
### 2.7 Ref-Count Lifecycle Contracts
NoKV uses fail-fast reference counting for internal pooled/owned objects. `DecrRef` underflow is treated as a lifecycle bug and panics.
| Object | Owned by | Borrowed by | Release rule |
|---|---|---|---|
| `kv.Entry` (pooled) | internal write/read pipelines | codec iterator, memtable/lsm internal reads, request batches | Must call `DecrRef` exactly once per borrow. |
| `kv.Entry` (detached public result) | caller | none | Returned by `DB.Get`; must not call `DecrRef`. |
| `kv.Entry` (borrowed internal result) | caller | yes (`DecrRef`) | Returned by `DB.GetInternalEntry`; caller must release exactly once. |
| `request` | commit queue/worker | waiter path (`Wait`) | `IncrRef` on enqueue; `Wait` does one `DecrRef`; zero returns the request to the pool and releases entries. |
| `table` | level/main+ingest lists, block cache | table iterators, prefetch workers | Removed tables are decremented once after the manifest+in-memory swap; zero deletes the SST. |
| Skiplist / ART index | memtable | iterators | Iterator creation increments the index ref; iterator `Close` decrements; double-close is idempotent. |
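The fail-fast contract above can be sketched as an atomic counter that panics on underflow; NoKV's pooled objects carry more state, but the rule is the same:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ref implements fail-fast counting: a DecrRef below zero is a
// lifecycle bug (a double release), so it panics instead of limping on.
type ref struct{ count atomic.Int32 }

func (r *ref) IncrRef() { r.count.Add(1) }

// DecrRef reports true when the last reference was released,
// the point at which a pooled object would be recycled.
func (r *ref) DecrRef() bool {
	n := r.count.Add(-1)
	if n < 0 {
		panic("DecrRef underflow: object released twice")
	}
	return n == 0
}

func main() {
	var r ref
	r.IncrRef()              // owner
	r.IncrRef()              // borrower
	fmt.Println(r.DecrRef()) // false: borrower released
	fmt.Println(r.DecrRef()) // true: last ref gone, recycle
}
```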
## 3. Replication Layer (raftstore)
| Package | Responsibility |
|---|---|
| `store` | Region catalog, router, RegionMetrics, Region hooks, manifest integration, helpers such as `StartPeer` / `SplitRegion`. |
| `peer` | Wraps etcd/raft `RawNode`; handles the Ready pipeline, snapshot resend queue, and backlog instrumentation. |
| `engine` | `WALStorage`/`DiskStorage`/`MemoryStorage`, reusing the DB's WAL while keeping manifest metadata in sync. |
| `transport` | gRPC transport for Raft Step messages, connection management, retries/blocks/TLS. Also acts as the host for the NoKV RPC. |
| `kv` | NoKV RPC handler plus `kv.Apply`, bridging Raft commands to MVCC logic. |
| `server` | `ServerConfig` + `New` combine DB, Store, transport, and the NoKV service into a reusable node instance. |
### 3.1 Bootstrap Sequence
- `raftstore.NewServer` wires the DB, store configuration (StoreID, hooks, scheduler), Raft config, and transport address. It registers the NoKV RPC on the shared gRPC server and sets `transport.SetHandler(store.Step)`.
- The CLI (`nokv serve`) or the application enumerates `Manifest.RegionSnapshot()` and calls `Store.StartPeer` for every Region containing the local store; `peer.Config` includes Raft params, the transport, `kv.NewEntryApplier`, WAL/Manifest handles, and Region metadata.
- Router registration, regionManager bookkeeping, an optional `Peer.Bootstrap` with the initial peer list, and the leader campaign follow.
- Peers from other stores can be configured through `transport.SetPeer(storeID, addr)`. In cluster mode, runtime routing/control-plane decisions come from PD.
### 3.2 Command Paths
- ReadCommand (`KvGet`/`KvScan`): validate the Region and leader, execute a Raft ReadIndex (LinearizableRead) plus `WaitApplied`, then run `commandApplier` (i.e. `kv.Apply` in read mode) to fetch data from the DB. This yields leader-strong reads with an explicit Raft linearizability barrier.
- ProposeCommand (write): encode the request, push it through the Router to the leader peer, replicate via Raft, and apply it in `kv.Apply`, which maps it to MVCC operations.
### 3.3 Transport
- The gRPC server handles Step RPCs and NoKV RPCs on the same endpoint; peers are registered via `SetPeer`.
- Retry policies (`WithRetry`) and TLS credentials are configurable. Tests cover partitions, blocked peers, and slow followers.
## 4. NoKV Service
`raftstore/kv/service.go` exposes the `pb.NoKV` RPCs:
| RPC | Execution | Result |
|---|---|---|
| `KvGet` | `store.ReadCommand` → `kv.Apply` GET | `pb.GetResponse` / RegionError |
| `KvScan` | `store.ReadCommand` → `kv.Apply` SCAN | `pb.ScanResponse` / RegionError |
| `KvPrewrite` | `store.ProposeCommand` → `percolator.Prewrite` | `pb.PrewriteResponse` |
| `KvCommit` | `store.ProposeCommand` → `percolator.Commit` | `pb.CommitResponse` |
| `KvResolveLock` | `percolator.ResolveLock` | `pb.ResolveLockResponse` |
| `KvCheckTxnStatus` | `percolator.CheckTxnStatus` | `pb.CheckTxnStatusResponse` |
`nokv serve` is the CLI entry point: it opens the DB, constructs `raftstore.Server`, registers peers, starts the local Raft peers, and prints a manifest summary (Regions, key ranges, peers). `scripts/run_local_cluster.sh` builds the CLI, writes a minimal region manifest, launches multiple `nokv serve` processes on localhost, and handles cleanup on Ctrl+C.
The RPC request/response shape is intentionally close to TinyKV/TiKV so the MVCC and region semantics remain familiar, but the service name exposed on the wire is `pb.NoKV`.
## 5. Client Workflow
`raftstore/client` offers a leader-aware client with retry logic and convenient helpers:
- Initialization: provide `[]StoreEndpoint` + a `RegionResolver` (`GetRegionByKey`) so runtime routing is PD-driven.
- Reads: `Get` and `Scan` pick the leader store for a key range, issue NoKV RPCs, and retry on NotLeader/EpochNotMatch.
- Writes: `Mutate` bundles operations per region and drives Prewrite/Commit (primary first, secondaries after); `Put` and `Delete` are convenience wrappers over the same 2PC path.
- Timestamps: clients must supply `startVersion`/`commitVersion`. For distributed demos, use PD-lite (`nokv pd`) to obtain globally increasing values before calling `TwoPhaseCommit`.
- Bootstrap helpers: `scripts/run_local_cluster.sh --config raft_config.example.json` builds the binaries, seeds manifests via `nokv-config manifest`, launches PD-lite, and starts the stores declared in the config.
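The primary-first ordering behind `Mutate` can be sketched over an in-memory lock table. This is a Percolator-style toy with invented types and no conflict checks, TTLs, timestamps, or failure handling; it only shows the ordering:

```go
package main

import "fmt"

type mutation struct{ key, value string }

// store is a toy stand-in for a region's MVCC state.
type store struct {
	locks map[string]string // key -> primary key holding the lock
	data  map[string]string
}

// orderPrimaryFirst moves the primary mutation to the front.
func orderPrimaryFirst(primary string, muts []mutation) []mutation {
	out := make([]mutation, 0, len(muts))
	for _, m := range muts {
		if m.key == primary {
			out = append(out, m)
		}
	}
	for _, m := range muts {
		if m.key != primary {
			out = append(out, m)
		}
	}
	return out
}

// twoPhaseCommit mirrors the client ordering: prewrite the primary,
// then the secondaries; commit the primary (the commit point), then
// the rest.
func twoPhaseCommit(s *store, primary string, muts []mutation) error {
	// Phase 1: prewrite locks everything, primary first.
	for _, m := range orderPrimaryFirst(primary, muts) {
		if _, locked := s.locks[m.key]; locked {
			return fmt.Errorf("key %q is locked", m.key)
		}
		s.locks[m.key] = primary
	}
	// Phase 2: commit, primary first. Once the primary's lock is
	// replaced by a write, the transaction is logically committed.
	for _, m := range orderPrimaryFirst(primary, muts) {
		s.data[m.key] = m.value
		delete(s.locks, m.key)
	}
	return nil
}

func main() {
	s := &store{locks: map[string]string{}, data: map[string]string{}}
	err := twoPhaseCommit(s, "alfa", []mutation{{"alfa", "1"}, {"zulu", "2"}})
	fmt.Println(err, s.data["alfa"], s.data["zulu"]) // <nil> 1 2
}
```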
### Example (two regions)
- Regions `[a,m)` and `[m,+∞)`, each led by a different store.
- `Mutate(ctx, primary="alfa", mutations, startTs, commitTs, ttl)` prewrites and commits across the relevant regions.
- `Get`/`Scan` retry automatically if the leader changes.
- See `raftstore/server/server_test.go` for a full end-to-end example using real `raftstore.Server` instances.
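Routing a key in this two-region setup reduces to a range lookup, which is the `RegionResolver`'s job. A sketch with invented types (the real descriptors come from the manifest/PD):

```go
package main

import "fmt"

// region describes a key range [Start, End); End == "" means +∞.
type region struct {
	ID         uint64
	Start, End string
}

// regionForKey returns the region whose [Start, End) contains key,
// the way a RegionResolver (GetRegionByKey) would.
func regionForKey(regions []region, key string) (region, bool) {
	for _, r := range regions {
		if key >= r.Start && (r.End == "" || key < r.End) {
			return r, true
		}
	}
	return region{}, false
}

func main() {
	regions := []region{
		{ID: 1, Start: "a", End: "m"}, // [a, m)
		{ID: 2, Start: "m", End: ""},  // [m, +∞)
	}
	r, _ := regionForKey(regions, "alfa")
	fmt.Println(r.ID) // 1
	r, _ = regionForKey(regions, "zulu")
	fmt.Println(r.ID) // 2
}
```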
## 6. Failure Handling
- Manifest edits capture Region metadata, WAL checkpoints, and ValueLog pointers; restart simply reads `CURRENT` and replays the edits.
- WAL replay reconstructs memtables and Raft groups; ValueLog recovery trims partial records.
- `Stats.StartStats` resumes metrics sampling immediately after restart, making it easy to verify recovery correctness via `nokv stats`.
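Trimming partial records follows one rule: replay accepts frames until it hits one that is short or fails its checksum, then discards the tail. A sketch over an illustrative `[len|type|payload|crc]` layout (field widths are assumptions, not NoKV's actual format):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// frame builds one [len(4)|type(1)|payload|crc(4)] record.
func frame(typ byte, payload []byte) []byte {
	buf := make([]byte, 9+len(payload))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
	buf[4] = typ
	copy(buf[5:], payload)
	binary.LittleEndian.PutUint32(buf[5+len(payload):],
		crc32.ChecksumIEEE(buf[4:5+len(payload)]))
	return buf
}

// replay walks the log and returns the payloads of every intact
// record, stopping at the first torn or corrupt tail — the essence
// of trimming partial records after a crash.
func replay(log []byte) [][]byte {
	var out [][]byte
	for len(log) >= 9 {
		n := binary.LittleEndian.Uint32(log[0:4])
		end := 9 + int(n)
		if len(log) < end {
			break // truncated final record: trim it
		}
		body := log[4 : 5+n]
		want := binary.LittleEndian.Uint32(log[5+n : end])
		if crc32.ChecksumIEEE(body) != want {
			break // corrupt tail: trim it
		}
		out = append(out, append([]byte(nil), log[5:5+n]...))
		log = log[end:]
	}
	return out
}

func main() {
	log := append(frame(1, []byte("a")), frame(1, []byte("b"))...)
	log = append(log, frame(1, []byte("torn"))[:7]...) // crash mid-write
	fmt.Println(len(replay(log)))                      // 2: torn record discarded
}
```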
## 7. Observability & Tooling
- `StatsSnapshot` publishes flush/compaction/WAL/VLog/raft/region/hot/cache metrics; `nokv stats` and the expvar endpoint expose the same data.
- `nokv regions` inspects Manifest-backed Region metadata.
- `nokv serve` prints Region samples on startup (ID, key range, peers) for quick verification.
- Scheduler/control-plane state can be inspected via PD APIs/metrics.
- Scripts:
  - `scripts/run_local_cluster.sh` launches a multi-node NoKV cluster locally.
  - `scripts/recovery_scenarios.sh` is a crash-recovery test harness.
  - `scripts/transport_chaos.sh` injects network faults and observes transport metrics.
## 8. When to Use NoKV
- Embedded: call `NoKV.Open` and use the local non-transactional DB APIs.
- Distributed: deploy `nokv serve` nodes and use `raftstore/client` (or any NoKV gRPC client) for reads, scans, and 2PC writes.
- Observability-first: inspection via CLI or expvar is built in; Region, WAL, Flush, and Raft metrics are accessible without extra instrumentation.
See also `docs/raftstore.md` for deeper internals, `docs/pd.md` for control-plane details, and `docs/testing.md` for test coverage.