# RaftStore Deep Dive
raftstore powers NoKV’s distributed mode by layering multi-Raft replication on top of the embedded storage engine. Its RPC surface is exposed as the NoKV gRPC service, while the command model still tracks the TinyKV/TiKV region + MVCC design. This note explains the major packages, the boot and command paths, how transport and storage interact, and the supporting tooling for observability and testing.
## 1. Package Structure

| Package | Responsibility |
|---|---|
| `store` | Orchestrates the peer set, command pipeline, region manager, and scheduler/heartbeat loops; exposes helpers such as `StartPeer`, `ProposeCommand`, `SplitRegion`. |
| `peer` | Wraps the etcd/raft `RawNode`, drives Ready processing (persist to WAL, send messages, apply entries), tracks snapshot resend/backlog. |
| `engine` | `WALStorage`/`DiskStorage`/`MemoryStorage` across all Raft groups, leveraging the NoKV WAL while keeping manifest metadata in sync. |
| `transport` | gRPC transport with retry/TLS/backpressure; exposes the raft Step RPC and can host additional services (NoKV). |
| `kv` | NoKV RPC implementation, bridging Raft commands to MVCC operations via `kv.Apply`. |
| `server` | `ServerConfig` + `New`, which bind the DB, Store, transport, and NoKV server into a reusable node primitive. |
## 2. Boot Sequence

- **Construct Server**

  ```go
  srv, _ := raftstore.NewServer(raftstore.ServerConfig{
      DB:            db,
      Store:         raftstore.StoreConfig{StoreID: 1},
      Raft:          myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
      TransportAddr: "127.0.0.1:20160",
  })
  ```

  - A gRPC transport is created, the NoKV service is registered, and `transport.SetHandler(store.Step)` wires raft Step handling.
  - `store.Store` loads `manifest.RegionSnapshot()` to rebuild the Region catalog (router + metrics).

- **Start local peers**

  - The CLI (`nokv serve`) iterates the manifest snapshot and calls `Store.StartPeer` for every region that includes the local store.
  - Each `peer.Config` carries raft parameters, the transport reference, `kv.NewEntryApplier`, WAL/manifest handles, and Region metadata.
  - `StartPeer` registers the peer through the peer-set/routing layer and may bootstrap or campaign for leadership.

- **Peer connectivity**

  - `transport.SetPeer(storeID, addr)` defines outbound raft connections; the CLI exposes it via `--peer storeID=addr`.
  - Additional services can reuse the same gRPC server through `transport.WithServerRegistrar`.
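The `--peer storeID=addr` wiring can be sketched as a tiny flag parser. `parsePeers` is a hypothetical helper for illustration only; the actual CLI flag handling may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePeers turns repeated "--peer storeID=addr" values into the
// storeID -> address map that transport.SetPeer-style wiring consumes.
// Hypothetical helper, not the real CLI code.
func parsePeers(flags []string) (map[uint64]string, error) {
	peers := make(map[uint64]string)
	for _, f := range flags {
		id, addr, ok := strings.Cut(f, "=")
		if !ok {
			return nil, fmt.Errorf("malformed --peer value %q, want storeID=addr", f)
		}
		storeID, err := strconv.ParseUint(id, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad store ID in %q: %w", f, err)
		}
		peers[storeID] = addr
	}
	return peers, nil
}

func main() {
	peers, err := parsePeers([]string{"1=127.0.0.1:20160", "2=127.0.0.1:20161"})
	if err != nil {
		panic(err)
	}
	fmt.Println(peers[2]) // address registered for store 2
}
```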
## 3. Command Execution

### Read (strong leader read)

- `kv.Service.KvGet` builds a `pb.RaftCmdRequest` and invokes `Store.ReadCommand`.
- `validateCommand` ensures the region exists, the epoch matches, and the local peer is leader; a RegionError is returned otherwise.
- `peer.LinearizableRead` obtains a safe read index, then `peer.WaitApplied` waits until the local apply index reaches it.
- `commandApplier` (i.e. `kv.Apply`) runs GET/SCAN against the DB using MVCC readers to honor locks and version visibility.
### Write (via Propose)

- Write RPCs (Prewrite/Commit/…) call `Store.ProposeCommand`, encoding the command and routing it to the leader peer.
- The leader appends the encoded request to raft and replicates it; once committed, the command pipeline hands the data to `kv.Apply`, which maps Prewrite/Commit/ResolveLock to the `percolator` package.
- `engine.WALStorage` persists raft entries/state snapshots and updates manifest raft pointers, keeping WAL GC and raft truncation aligned.
- The raft apply path only accepts command-encoded payloads (`RaftCmdRequest`). Legacy raw KV payloads are rejected as unsupported.
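The proposal/apply matching performed by the command pipeline can be sketched as an ID-to-channel map: each proposal gets a request ID carried through the raft log, and the apply loop completes it by ID. `pipeline` here is a simplified, single-threaded stand-in for the real command pipeline.

```go
package main

import "fmt"

// pipeline matches proposals with apply results by request ID.
type pipeline struct {
	nextID  uint64
	pending map[uint64]chan string
}

func newPipeline() *pipeline {
	return &pipeline{pending: make(map[uint64]chan string)}
}

// Propose registers a response slot and returns the ID carried in the raft entry.
func (p *pipeline) Propose() (uint64, <-chan string) {
	p.nextID++
	ch := make(chan string, 1)
	p.pending[p.nextID] = ch
	return p.nextID, ch
}

// Applied is invoked by the apply loop once the entry with this ID commits and applies.
func (p *pipeline) Applied(id uint64, resp string) {
	if ch, ok := p.pending[id]; ok {
		ch <- resp
		delete(p.pending, id)
	}
}

func main() {
	p := newPipeline()
	id, ch := p.Propose()
	// ...entry replicates through raft, then the apply loop reports back:
	p.Applied(id, "commit ok")
	fmt.Println(<-ch)
}
```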
### Command flow diagram

```mermaid
sequenceDiagram
    participant C as NoKV Client
    participant SVC as kv.Service
    participant ST as store.Store
    participant PR as peer.Peer
    participant RF as raft log/replication
    participant AP as kv.Apply
    participant DB as NoKV DB
    rect rgb(237, 247, 255)
    C->>SVC: KvGet/KvScan
    SVC->>ST: ReadCommand
    ST->>PR: LinearizableRead + WaitApplied
    ST->>AP: commandApplier(req)
    AP->>DB: internal read path
    AP-->>C: read response
    end
    rect rgb(241, 253, 244)
    C->>SVC: KvPrewrite/KvCommit/...
    SVC->>ST: ProposeCommand
    ST->>RF: route + replicate
    RF->>AP: apply committed entry
    AP->>DB: percolator mutate -> ApplyInternalEntries
    AP-->>C: write response
    end
```
## 4. Transport

- The gRPC transport listens on `TransportAddr`, serving both the raft Step RPC and the NoKV RPC.
- `SetPeer` updates the mapping of remote store IDs to addresses; `BlockPeer` can be used by tests or chaos tooling.
- Configurable retry/backoff/timeout options mirror production requirements; tests cover message loss, blocked peers, and partitions.
## 5. Storage Backend (engine)

- `WALStorage` piggybacks on the embedded WAL: each Raft group writes typed entries, HardState, and snapshots into the shared log.
- `LogRaftPointer` and `LogRaftTruncate` edit manifest metadata so WAL GC knows how far it can compact per group.
- Alternative storage backends (`DiskStorage`, `MemoryStorage`) are available for tests and special scenarios.
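Why per-group raft pointers matter for a shared WAL can be shown with a minimal calculation: GC may only advance to the minimum offset any group still needs. `safeGCPoint` is a hypothetical helper, not the actual manifest code.

```go
package main

import "fmt"

// safeGCPoint returns the lowest WAL offset still required by any Raft group:
// the shared log can only be compacted up to this point.
func safeGCPoint(groupPointers map[uint64]uint64) uint64 {
	first := true
	var min uint64
	for _, off := range groupPointers {
		if first || off < min {
			min, first = off, false
		}
	}
	return min
}

func main() {
	// Region 1 has truncated up to offset 900, region 2 only to 400:
	// the shared WAL must keep everything from offset 400 onward.
	fmt.Println(safeGCPoint(map[uint64]uint64{1: 900, 2: 400, 3: 650}))
}
```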
## 6. NoKV RPC Integration

| RPC | Execution Path | Notes |
|---|---|---|
| KvGet / KvScan | ReadCommand → LinearizableRead (ReadIndex) + WaitApplied → kv.Apply (read mode) | Leader-only strong read with a Raft linearizability barrier. |
| KvPrewrite / KvCommit / KvBatchRollback / KvResolveLock / KvCheckTxnStatus | ProposeCommand → command pipeline → raft log → kv.Apply | The pipeline matches proposals with apply results; the MVCC latch manager prevents write conflicts. |

The `cmd/nokv serve` command uses `raftstore.Server` internally and prints a manifest summary (key ranges, peers) so operators can verify the node's view at startup.
## 7. Client Interaction (raftstore/client)

- Region-aware routing with NotLeader/EpochNotMatch retry.
- `Mutate` splits mutations by region and performs two-phase commit (primary first); `Put`/`Delete` are convenience wrappers; `Scan` transparently walks region boundaries.
- End-to-end coverage lives in `raftstore/server/server_test.go`, which launches real servers, uses the client to write and delete keys, and verifies the results.
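The region-splitting step of a `Mutate`-style batch can be sketched as bucketing keys against sorted region start keys. `region`, `regionFor`, and `groupByRegion` are illustrative shapes, not the client's actual types.

```go
package main

import (
	"fmt"
	"sort"
)

// region models a key range by its inclusive start key; the end is implied
// by the next region's start (the last region is unbounded).
type region struct {
	id    uint64
	start string
}

// regionFor finds the region owning key: the last region whose start <= key.
// Assumes regions are sorted by start key and the first start is "".
func regionFor(key string, regions []region) uint64 {
	i := sort.Search(len(regions), func(i int) bool { return regions[i].start > key }) - 1
	return regions[i].id
}

// groupByRegion buckets a batch of keys by owning region, the first step
// before issuing per-region Prewrite requests.
func groupByRegion(keys []string, regions []region) map[uint64][]string {
	groups := make(map[uint64][]string)
	for _, k := range keys {
		id := regionFor(k, regions)
		groups[id] = append(groups[id], k)
	}
	return groups
}

func main() {
	regions := []region{{1, ""}, {2, "m"}}
	groups := groupByRegion([]string{"apple", "melon", "zebra"}, regions)
	primary := "apple" // primary lock's region commits first in 2PC
	fmt.Println("primary region:", regionFor(primary, regions), "region groups:", len(groups))
}
```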
## 8. Control Plane & Region Operations

### 8.1 Topology & Routing

- Topology is sourced from `raft_config.example.json` (via `config.LoadFile`) and reused by scripts, Docker Compose, and the Redis gateway.
- Runtime routing is PD-first: `raftstore/client` resolves Regions by key through `GetRegionByKey` and caches route entries for retries.
- `raft_config` regions are treated as bootstrap/deployment metadata and are not the runtime source of truth once PD is available.
- PD is the only control-plane source of truth for runtime scheduling/routing.
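The PD-first lookup-and-cache pattern can be sketched as follows. `routeCache` and its exact-key cache are deliberate simplifications: the real client caches key ranges and resolves through PD's `GetRegionByKey`.

```go
package main

import "fmt"

// route is a cached answer: which region owns the key and who leads it.
type route struct {
	regionID uint64
	leader   string
}

// routeCache tries the local cache first and falls back to a PD-style
// lookup on miss; invalidate drops a stale entry after a RegionError.
type routeCache struct {
	byKey  map[string]route
	pd     func(key string) route // stands in for PD's GetRegionByKey
	misses int
}

func (c *routeCache) lookup(key string) route {
	if r, ok := c.byKey[key]; ok {
		return r
	}
	c.misses++
	r := c.pd(key)
	c.byKey[key] = r
	return r
}

// invalidate is called after a NotLeader/EpochNotMatch RegionError so the
// next retry refreshes the route from PD.
func (c *routeCache) invalidate(key string) { delete(c.byKey, key) }

func main() {
	c := &routeCache{
		byKey: make(map[string]route),
		pd:    func(key string) route { return route{regionID: 7, leader: "store-1"} },
	}
	c.lookup("a")     // miss -> PD
	c.lookup("a")     // cache hit
	c.invalidate("a") // e.g. after EpochNotMatch
	c.lookup("a")     // refreshed from PD
	fmt.Println("PD calls:", c.misses)
}
```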
### 8.2 Split / Merge

- Split: leaders call `Store.ProposeSplit`, which writes a split `AdminCommand` into the parent region's raft log. On apply, `Store.SplitRegion` updates the parent range/epoch and starts the child peer.
- Merge: leaders call `Store.ProposeMerge`, writing a merge `AdminCommand`. On apply, the target region's range/epoch is expanded and the source peer is stopped and removed from the manifest.
- These operations are explicit/manual and are not auto-triggered by size/traffic heuristics.
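The epoch bookkeeping on split can be sketched as a pure function: the parent keeps the left half with a bumped epoch version and a child region is created for the right half. `regionMeta`/`applySplit` are hypothetical types, and the exact version assigned to the child is an assumption of this sketch.

```go
package main

import "fmt"

// regionMeta is a minimal region descriptor: key range plus epoch version,
// which must be bumped whenever the range changes.
type regionMeta struct {
	id         uint64
	start, end string
	version    uint64
}

// applySplit models the admin-command side effect of a split: the parent
// becomes [start, splitKey) and the child covers [splitKey, end), both with
// a bumped epoch so stale clients get EpochNotMatch and refresh routes.
func applySplit(parent regionMeta, splitKey string, childID uint64) (regionMeta, regionMeta) {
	child := regionMeta{id: childID, start: splitKey, end: parent.end, version: parent.version + 1}
	parent.end = splitKey
	parent.version++
	return parent, child
}

func main() {
	parent := regionMeta{id: 1, start: "a", end: "z", version: 3}
	p, c := applySplit(parent, "m", 2)
	fmt.Printf("parent [%s,%s) v%d, child [%s,%s) v%d\n",
		p.start, p.end, p.version, c.start, c.end, c.version)
}
```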
## 9. Observability

- `store.RegionMetrics()` feeds into `StatsSnapshot`, making region counts and backlog visible via expvar and `nokv stats`.
- `nokv regions` shows manifest-backed regions: ID, range, peers, state.
- `scripts/transport_chaos.sh` exercises transport metrics under faults; `scripts/run_local_cluster.sh` spins up multi-node clusters for manual inspection.
### Store internals at a glance
| Component | File | Responsibility |
|---|---|---|
| Store facade | store.go | Store construction/wiring and shared component ownership (router, region manager, command pipeline, scheduler runtime). |
| Peer lifecycle | peer_lifecycle.go | Start/stop peers, router registration, lifecycle hooks, and store shutdown sequencing. |
| Command service | command_service.go | Region/epoch/key-range validation and read/propose request handling. |
| Admin service | admin_service.go | Split/merge proposal handling and applied admin command side effects. |
| Membership service | membership_service.go | Conf-change proposal helpers and manifest metadata updates after membership changes. |
| Region catalog | region_catalog.go | Public region catalog accessors and region metadata lifecycle operations. |
| Scheduler runtime | scheduler_runtime.go | Scheduler snapshot generation, store stats, operation application, and apply-entry dispatch. |
| Peer set | peer_set.go | Tracks active peers and exposes thread-safe lookups/iteration snapshots. |
| Command pipeline | command_pipeline.go | Assigns request IDs, records proposals, matches apply results, returns responses/errors to callers. |
| Region manager | region_manager.go | Validates state transitions, writes manifest edits, updates peer metadata, triggers region hooks. |
| Operation scheduler | operation_scheduler.go | Buffers planner output, enforces cooldown & burst limits, dispatches leader transfers or other operations. |
| Heartbeat loop | heartbeat_loop.go | Periodically publishes region/store heartbeats and, when the sink implements planner capability, drains scheduling actions. |
## 10. Current Boundaries and Guarantees

- Reads served through `ReadCommand` are leader-strong and pass a Raft linearizability barrier (`LinearizableRead` + `WaitApplied`).
- Mutating NoKV RPC commands are serialized through Raft log replication and apply.
- The command payload format on the apply path is strict `RaftCmdRequest` encoding.
- Region metadata (range/epoch/peers) is validated before both read and write command execution.
- `store.RegionMetrics` + `StatsSnapshot` provide runtime visibility into region count, backlog, and scheduling health.
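The validation guarantees above can be sketched as a single check function. `checkCommand` and its error values are hypothetical, standing in for the RegionError results that `validateCommand` produces.

```go
package main

import (
	"errors"
	"fmt"
)

// regionInfo is a minimal stand-in for the metadata validated before
// each read or write command.
type regionInfo struct {
	start, end string // key range; end == "" means unbounded
	version    uint64 // epoch version
	isLeader   bool
}

var (
	errKeyNotInRegion = errors.New("key not in region")
	errEpochNotMatch  = errors.New("epoch not match")
	errNotLeader      = errors.New("not leader")
)

// checkCommand enforces the three guarantees: key in range, epoch match,
// and local leadership.
func checkCommand(r regionInfo, key string, reqVersion uint64) error {
	if key < r.start || (r.end != "" && key >= r.end) {
		return errKeyNotInRegion
	}
	if reqVersion != r.version {
		return errEpochNotMatch // client must refresh its route and retry
	}
	if !r.isLeader {
		return errNotLeader
	}
	return nil
}

func main() {
	r := regionInfo{start: "a", end: "m", version: 2, isLeader: true}
	fmt.Println(checkCommand(r, "b", 2)) // passes all three checks
	fmt.Println(checkCommand(r, "b", 1)) // stale epoch from the client
}
```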