Coordinator
Coordinator is NoKV’s control-plane service for distributed mode.
It exposes a gRPC API (pb.Coordinator) and is started by:
go run ./cmd/nokv coordinator --addr 127.0.0.1:2379
1. Responsibilities
Coordinator currently owns:
- Routing: GetRegionByKey
- Heartbeats: StoreHeartbeat, RegionHeartbeat
- Region removal: RemoveRegion
- ID service: AllocID
- TSO: Tso
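As a caller-side sketch, driving the two most common RPCs looks like the following. The pb message and field names are assumed for illustration and may differ from the generated API:

```go
// Sketch only: pb message/field names are assumed, not the generated API.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "nokv/pb" // placeholder import path
)

func main() {
	conn, err := grpc.NewClient("127.0.0.1:2379",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	c := pb.NewCoordinatorClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Routing: which region currently owns this key?
	route, err := c.GetRegionByKey(ctx, &pb.GetRegionByKeyRequest{Key: []byte("user/42")})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("region: %v", route)

	// ID service: allocate a cluster-unique ID.
	id, err := c.AllocID(ctx, &pb.AllocIDRequest{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("allocated id: %v", id)
}
```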
Runtime clients (for example cmd/nokv-fsmeta) use Coordinator as the
routing source of truth, but Coordinator is not the durable owner of cluster topology
truth. Durable truth lives in meta/root.
2. Runtime Architecture
flowchart LR
Store["nokv serve"] -->|"StoreHeartbeat / RegionHeartbeat"| Coordinator["Coordinator (gRPC)"]
Gateway["nokv-fsmeta"] -->|"GetRegionByKey / Tso"| Coordinator
Coordinator --> Cluster["coordinator/catalog.Cluster"]
Cluster --> Scheduler["leader-transfer hint planner"]
Core implementation units:
- coordinator/catalog: in-memory cluster metadata model.
- coordinator/idalloc: monotonic ID allocator used by Coordinator.
- coordinator/rootview: persistence abstraction (Store) backed by the metadata root.
- coordinator/server: gRPC service + RPC validation/error mapping.
- coordinator/client: client wrapper used by store/gateway.
- coordinator/adapter: scheduler sink that forwards heartbeats into Coordinator.
For the next-stage protocol direction on both the control plane and the paired
execution plane, see docs/control_and_execution_protocols.md.
Control-Plane Protocol Status
Coordinator now uses a minimal formal control-plane protocol v1 for its key route and transition surfaces.
Already in active use:
- route-read Freshness
- rooted token serving metadata
- rooted lag exposure: DegradedMode, CatchUpState, TransitionID
- publish-time lifecycle assessment on PublishRootEvent
This means Coordinator no longer exposes only best-effort implementation behavior. It now returns explicit protocol state that callers, tests, and docs can rely on.
The current protocol is intentionally minimal. It does not yet expose the full future runtime/operator model such as stalled transitions or richer catch-up actions.
Minimal Eunomia vocabulary
The rooted handoff protocol is intentionally small. Docs and operator-facing surfaces should use the same vocabulary as the implementation and the Eunomia research note:
- Tenure: the currently active authority record
- Legacy: the retired predecessor era plus the frontier it already consumed
- Handover: the rooted handoff record for the current successor
- Era: the monotonic authority era
- Witness: the operator-visible proof bundle that explains whether the current handoff state is safe
The four guarantees discussed by the docs and runtime metrics are:
- Primacy: at most one authority era is active
- Inheritance: the successor must cover the predecessor’s published work
- Silence: a sealed predecessor must not keep serving
- Finality: a handoff must not remain permanently half-finished
The mapping to concrete implementation types is direct:
| Doc term | Implementation term |
|---|---|
| Tenure | Tenure |
| Legacy | Legacy |
| Handover | Handover |
| Era | Era / era |
| Witness | HandoverWitness / continuation witness fields |
| Frontiers | MandateFrontiers / frontiers / consumed_frontiers |
Do not reintroduce Lease / Seal as public aliases. They are useful
informally, but keeping them in formal docs creates two names for the same
rooted objects and makes Eunomia harder to explain.
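For orientation, the vocabulary can be pictured as Go shapes like the following. This is an illustrative sketch only; the real coordinator types and field layouts may differ:

```go
// Illustrative shapes for the Eunomia vocabulary; not the real NoKV types.
type Era uint64 // monotonic authority era

type Tenure struct {
	Era    Era    // the currently active era
	Holder string // coordinator-id holding singleton duties
}

type Legacy struct {
	Era               Era      // retired predecessor era
	ConsumedFrontiers []uint64 // frontier the predecessor already consumed
}

type Handover struct {
	FromEra Era
	ToEra   Era
	Witness HandoverWitness // rooted handoff record carries its proof bundle
}

// HandoverWitness mirrors the four guarantees above.
type HandoverWitness struct {
	PrimacyOK     bool // at most one authority era is active
	InheritanceOK bool // successor covers the predecessor's published work
	SilenceOK     bool // sealed predecessor is no longer serving
	FinalityOK    bool // handoff is not permanently half-finished
}
```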
3. Deployment Model
NoKV ships exactly one distributed topology plus the standalone engine shape.
standalone
- no coordinator
- no meta/root
- no control-plane process
- all truth remains inside the single storage process
This is the default local engine shape. Standalone is not a degraded control plane deployment; it simply has no control plane.
separated meta-root + coordinator
- three independent nokv meta-root processes own durable rooted truth (replicated raft quorum, the only backend NoKV ships)
- one or more nokv coordinator processes connect through the remote metadata-root gRPC API
- Tenure gates singleton Coordinator duties: AllocID, Tso, and scheduler operation planning
- route reads still come from Coordinator’s rebuildable in-memory view and expose Freshness, RootToken, CatchUpState, and DegradedMode
Keep the same logical split inside every deployment:
- meta/root/*: durable rooted truth (replicated + gRPC service)
- coordinator/view + coordinator/catalog: rebuildable routing/scheduling state
- coordinator/rootview: remote view of meta-root consumed by coordinator/server
- coordinator/server: gRPC API surface
Product assumptions:
- exactly three meta-root replicas
- meta-root is the only place durable rooted truth lives
- coordinators are stateless relative to rooted truth; only Tenure differentiates active vs standby
- no dynamic metadata-root membership
- no production-grade dynamic coordinator membership manager
4. Persistence (--workdir)
--workdir is required for every formal Coordinator deployment that hosts rooted truth.
separated meta-root + coordinator
Each meta-root process has its own rooted workdir and raft transport address.
The coordinator process does not host rooted truth; it only connects to the
remote root endpoints through --root-peer nodeID=grpc_addr.
Each meta-root workdir persists two layers of state:
- rooted truth state: root.events.wal, root.checkpoint.binpb
- replicated protocol state: root.raft.bin (raft hard state, raft snapshot, and retained raft entries)
Each meta-root node must have an isolated workdir. Workdirs are not shared.
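As a quick operational sanity check, the expected per-node layout can be asserted like this. The file names come from the list above, but the helper itself is hypothetical, not a shipped CLI:

```go
// Hypothetical check: assert one meta-root workdir has the expected files.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func checkMetaRootWorkdir(dir string) error {
	for _, name := range []string{
		"root.events.wal",       // rooted truth: event log
		"root.checkpoint.binpb", // rooted truth: checkpoint
		"root.raft.bin",         // replicated protocol state
	} {
		if _, err := os.Stat(filepath.Join(dir, name)); err != nil {
			return fmt.Errorf("workdir %s is missing %s: %w", dir, name, err)
		}
	}
	return nil
}

func main() {
	if err := checkMetaRootWorkdir("./artifacts/cluster/meta-root-1"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("meta-root workdir layout looks complete")
}
```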
Persistence ownership:
- meta-root workdirs own durable rooted truth and replicated metadata-root raft state.
- coordinator runtime view is rebuildable from remote meta/root.
- allocator fences and Tenure are rooted events, not local coordinator files.
--coordinator-id must be a stable configured identity. It is used for Tenure
ownership and operator debugging; it should not be generated randomly on each
restart.
Rooted bootstrap flow
The Coordinator storage layer rebuilds its region snapshot and allocator checkpoints by replaying rooted truth:
- region descriptor publish/tombstone events rebuild the route catalog
- allocator fences rebuild: id_current, ts_current
id_current and ts_current are durable allocator fences, not necessarily the
last values served. With allocator window preallocation they may be ahead of the
last returned ID or timestamp; restart recovery intentionally resumes after the
fence and skips unused values.
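A minimal sketch of that fence-ahead behavior, assuming a window-preallocating allocator (the names and window size below are illustrative):

```go
// Illustrative window allocator: the durable fence can run ahead of the
// last ID actually served, so restart resumes at fence+1 and deliberately
// skips the unused remainder of the window.
const window = 1000

type idAllocator struct {
	next  uint64 // next ID to serve (in memory only)
	fence uint64 // highest ID covered by a durable rooted fence event
}

func (a *idAllocator) alloc(persistFence func(uint64) error) (uint64, error) {
	if a.next > a.fence {
		// Window exhausted: publish a new fence as a rooted event
		// before serving beyond the old one.
		if err := persistFence(a.fence + window); err != nil {
			return 0, err
		}
		a.fence += window
	}
	id := a.next
	a.next++
	return id, nil
}

// After restart, recovery replays rooted truth and resumes after the fence.
func resumeAfterFence(cliStart, fence uint64) uint64 {
	if fence+1 > cliStart {
		return fence + 1
	}
	return cliStart // max(cli_start, fence+1)
}
```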
Startup flow:
- Open rooted coordinator/rootview against the 3 meta-root --root-peer endpoints.
- Reconstruct a rooted Coordinator snapshot (regions + allocator fences).
- Compute starts as max(cli_start, fence+1).
- Materialize the rooted region snapshot into coordinator/catalog.Cluster.
Coordinator periodically refreshes rooted state via the meta-root tail stream and rebuilds the service-side view. This avoids allocator rollback and keeps all durable truth inside meta-root.
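That refresh cycle can be sketched as a watch-first loop with a reload fallback. The RootView/Cluster method names below are assumed for illustration, not the actual coordinator/rootview surface:

```go
// Illustrative watch-first tail loop; Tail/Snapshot/Apply/Reload are
// assumed names, not the real coordinator/rootview API.
package rootviewsketch

import (
	"context"
	"time"
)

type Event struct{ Token uint64 }
type Snapshot struct{ Token uint64 }

type RootView interface {
	Tail(ctx context.Context, from uint64) ([]Event, error) // watch-first
	Snapshot(ctx context.Context) (Snapshot, error)         // reload fallback
}

type Cluster struct{ token uint64 }

func (c *Cluster) RootToken() uint64 { return c.token }
func (c *Cluster) Reload(s Snapshot) { c.token = s.Token }
func (c *Cluster) Apply(e Event)     { c.token = e.Token }

func refreshLoop(ctx context.Context, rv RootView, c *Cluster) {
	for ctx.Err() == nil {
		events, err := rv.Tail(ctx, c.RootToken())
		if err != nil {
			// Tail broke or fell behind: fall back to a full reload.
			snap, serr := rv.Snapshot(ctx)
			if serr != nil {
				time.Sleep(time.Second) // back off, then retry
				continue
			}
			c.Reload(snap)
			continue
		}
		for _, ev := range events {
			c.Apply(ev) // advance the rebuildable service-side view
		}
	}
}
```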
Region Truth Hierarchy
NoKV intentionally keeps three region views with different authority:
- Coordinator region catalog: cluster routing truth. Clients and stores must treat Coordinator as the authoritative key-to-region source at the service boundary, but Coordinator rebuilds this view from rooted metadata truth plus heartbeats.
- raftstore/localmeta local catalog: store-local recovery truth. It exists so one store can restart hosted peers and replay raft WAL checkpoints even if Coordinator is temporarily unavailable.
- Store.regions runtime catalog: in-memory cache/view rebuilt from local metadata at startup and then advanced by peer lifecycle plus raft apply.
These layers are not interchangeable. Local metadata is recovery state, not cluster routing authority.
5. Config Integration
raft_config.json supports Coordinator endpoint + workdir defaults:
"coordinator": {
"addr": "127.0.0.1:2379",
"docker_addr": "nokv-coordinator:2379",
"work_dir": "./artifacts/cluster/coordinator",
"docker_work_dir": "/var/lib/nokv-coordinator"
}
Resolution rules:
- CLI override wins.
- Otherwise read from config by scope (host/docker).
Helpers:
- config.ResolveCoordinatorAddr(scope)
- config.ResolveCoordinatorWorkDir(scope)
- nokv-config coordinator --field addr|workdir --scope host|docker
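For example, a command wrapper applying the "CLI override wins" rule might look like this; config.ResolveCoordinatorAddr is named above, but its exact signature here is assumed:

```go
// Sketch of "CLI override wins, otherwise config by scope"; the helper's
// (string, error) signature is assumed for illustration.
func coordinatorAddr(cliAddr, scope string) (string, error) {
	if cliAddr != "" {
		return cliAddr, nil // CLI override wins
	}
	return config.ResolveCoordinatorAddr(scope) // scope is "host" or "docker"
}
```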
Replicated-root transport settings are currently CLI-driven, not config-file driven.
6. Routing Source Convergence
NoKV now uses Coordinator-first routing:
- raftstore/client resolves regions with GetRegionByKey.
- raft_config regions are bootstrap/deployment metadata.
- Runtime route truth comes from Coordinator heartbeats + Coordinator region catalog.
This avoids dual sources drifting over time (config vs Coordinator).
7. Serve Mode Semantics
nokv serve is now Coordinator-only:
- --coordinator-addr is required.
- Runtime routing/scheduling control-plane state is sourced from Coordinator.
For restart and recovery, nokv serve intentionally separates runtime truth from
deployment metadata:
- hosted region/peer truth comes from raftstore/localmeta
- raft durable progress comes from the store workdir (WAL, raft log, local metadata)
- raft_config.json is used only to resolve static addresses (Coordinator, store listen, store transport)
This means:
- bootstrap-time config.regions are not replayed during restart
- runtime split/merge/peer-change results continue to come back from local recovery state
- --store-addr is an exceptional static address override, not the normal restart path
- --store-id must match the durable workdir identity when the workdir was already used
The recommended restart shape is therefore:
nokv serve \
--config ./raft_config.example.json \
--scope host \
--store-id 1 \
--workdir ./artifacts/cluster/store-1
serve will:
- load the local peer catalog from the store workdir
- derive the current remote peer set from local metadata
- use config stores only to map storeID -> addr
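The last step can be pictured as a pure lookup (field names are illustrative; the point is that config supplies addresses, not membership):

```go
// Sketch: local metadata decides WHICH peers exist; config only answers
// WHERE each store listens. Field and method names are illustrative.
addrByStore := map[uint64]string{}
for _, s := range cfg.Stores {
	addrByStore[s.StoreID] = s.Addr
}
for _, peer := range localMeta.RemotePeers() { // membership from local metadata
	transport.Dial(peer.StoreID, addrByStore[peer.StoreID])
}
```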
If static transport overrides are needed, prefer stable store identities:
nokv serve \
--config ./raft_config.example.json \
--scope host \
--store-id 1 \
--workdir ./artifacts/cluster/store-1 \
--store-addr 2=10.0.0.12:20160
Related CLI behavior:
- Inspect control-plane state through Coordinator APIs/metrics.
- nokv coordinator --metrics-addr <host:port> exposes native expvar on /debug/vars.
- nokv serve --metrics-addr <host:port> exposes store/runtime expvar on /debug/vars.
8. Service Semantics
Coordinator intentionally separates rooted truth leadership from the outer gRPC
service surface.
In 3 coordinator + replicated meta:
- all three coordinator processes may listen and serve RPC
- only the rooted leader may commit truth writes
- followers refresh rooted state and serve read/view traffic
In separated mode:
- meta-root leadership determines which root endpoint accepts truth writes
- Tenure determines which Coordinator may serve singleton duties
- non-holder Coordinators may still serve route reads if their rooted view satisfies the caller’s freshness contract
Leader-only writes
These RPCs require rooted leadership:
- RegionHeartbeat
- PublishRootEvent
- RemoveRegion
- AllocID
- Tso
Followers return FailedPrecondition with coordinator not leader semantics, and
clients are expected to retry against another Coordinator endpoint.
In separated mode, AllocID, Tso, and scheduler operation planning also
require the local Coordinator to hold Tenure.
Any-node reads
These RPCs may be served by any Coordinator node:
- GetRegionByKey
- StoreHeartbeat handling and store-view inspection
Follower reads are driven by a rooted watch-first tail subscription, with
explicit refresh/reload as fallback into coordinator/catalog.Cluster. They are
expected to be briefly stale rather than linearizable.
For GetRegionByKey, that follower-service behavior is now explicit in the
protocol surface:
- callers can request Freshness
- responses include rooted token metadata
- responses disclose DegradedMode and CatchUpState
- bounded-freshness reads may be rejected if rooted lag exceeds the requested limit
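A caller-side sketch of a bounded-freshness read follows; Freshness, RootToken, DegradedMode, and CatchUpState come from the protocol surface above, while the exact request/response field shapes are assumed:

```go
// Sketch only: request/response field shapes are assumed, not generated API.
resp, err := c.GetRegionByKey(ctx, &pb.GetRegionByKeyRequest{
	Key:          []byte("user/42"),
	Freshness:    pb.Freshness_BOUNDED, // ask for a bounded-staleness read
	MaxLagMillis: 500,                  // reject if rooted lag exceeds this
})
if err != nil {
	// A bounded-freshness read may be rejected when this node's rooted
	// view lags too far; retry against another Coordinator endpoint.
	return err
}
log.Printf("region=%v rootToken=%v degraded=%v catchUp=%v",
	resp.Region, resp.RootToken, resp.DegradedMode, resp.CatchUpState)
```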
Client behavior
coordinator/client accepts multiple Coordinator addresses. Write RPCs retry across Coordinator nodes and
converge on the rooted leader. Read RPCs may use any available Coordinator endpoint.
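The write-side behavior can be sketched with standard gRPC status inspection; the endpoint iteration below is illustrative, not the actual coordinator/client implementation:

```go
// Sketch of leader-chasing writes: on FailedPrecondition ("coordinator
// not leader"), move on to the next Coordinator endpoint.
import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func allocIDAnyLeader(ctx context.Context, clients []pb.CoordinatorClient) (uint64, error) {
	var lastErr error
	for _, c := range clients {
		resp, err := c.AllocID(ctx, &pb.AllocIDRequest{})
		if err == nil {
			return resp.Id, nil
		}
		if status.Code(err) == codes.FailedPrecondition {
			lastErr = err // not the leader: try the next endpoint
			continue
		}
		return 0, err // non-leadership errors surface immediately
	}
	return 0, lastErr
}
```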
9. Deployment Example
NoKV ships exactly one topology: a 3-peer replicated meta-root cluster plus one or more coordinator processes that talk to it over gRPC.
Start three metadata-root peers (peer map is identical on all three; only
-addr, -workdir, -node-id, -transport-addr differ):
go run ./cmd/nokv meta-root \
-addr 127.0.0.1:2380 \
-workdir ./artifacts/cluster/meta-root-1 \
-node-id 1 \
-transport-addr 127.0.0.1:3380 \
-peer 1=127.0.0.1:3380 \
-peer 2=127.0.0.1:3381 \
-peer 3=127.0.0.1:3382
Start one coordinator (add more by giving each a distinct -coordinator-id
and -addr; they share the same -root-peer set):
go run ./cmd/nokv coordinator \
-addr 127.0.0.1:2379 \
-coordinator-id c1 \
-root-peer 1=127.0.0.1:2380 \
-root-peer 2=127.0.0.1:2381 \
-root-peer 3=127.0.0.1:2382
Current product assumptions:
- exactly three meta-root replicas
- meta-root is the only place durable rooted truth lives
- coordinators are stateless relative to rooted truth; only Tenure differentiates active vs standby
- no dynamic metadata-root membership
For local bootstrap, use:
./scripts/dev/cluster.sh --config ./raft_config.example.json
10. Comparison: TinyKV / TiKV
TinyKV (teaching stack)
- Uses a scheduler server (tinyscheduler) as a separate process.
- Control plane integrates embedded etcd for metadata persistence.
- Educational architecture, minimal production hardening.
TiKV (production stack)
- Coordinator is an independent, highly available cluster.
- Coordinator internally uses etcd Raft for durable metadata + leader election.
- Rich scheduling and balancing policies, rolling updates, robust ops tooling.
NoKV Coordinator (current)
- Standalone mode has no Coordinator and no metadata-root service.
- Distributed mode has three control-plane deployments: single coordinator + local meta, 3 coordinator + replicated meta, and separated meta-root + remote coordinator.
- In co-located deployments, each coordinator process hosts a same-process rooted backend and rebuilds its service-side view from rooted truth.
- In separated deployment, meta-root is the durable truth service and coordinator is a remote rooted view/service layer.
- Coordinator persistence is intentionally limited to rooted control-plane truth:
  - region descriptor publish/tombstone events
  - allocator durability (AllocID, TSO)
  - Tenure ownership for separated singleton duties
- Coordinator is not the durable owner of a store’s local raft/region truth. Store restart truth remains in raftstore/localmeta, while Coordinator keeps routing and scheduling state rebuilt from meta/root.
11. Current Limitations / Next Steps
- single coordinator + local meta remains the simpler and more mature deployment.
- 3 coordinator + replicated meta is now a formal product mode, but still has a deliberately small HA surface:
  - fixed three replicas
  - no dynamic metadata membership
  - follower convergence uses watch-first tailing with refresh/reload fallback
- separated meta-root + remote coordinator is implemented but experimental:
  - use it for control-plane research and failure-domain experiments
  - do not treat it as the default production path yet
  - failure/recovery E2E tests and eunomia benchmarks still need to be expanded before stronger claims are made
- Scheduler policy is intentionally small (leader transfer focused).
- No advanced placement constraints yet.
These are deliberate scope limits for a fast-moving experimental platform that keeps the rooted truth surface small.