Overview
🔥 Why NoKV?
NoKV is designed for modern hardware and distributed workloads. It combines the best of academic research (WiscKey, W-TinyLFU) with industrial-grade engineering (Raft, Percolator).
🏎️ Extreme Performance
A lock-light commit queue and batched WAL writes deliver write throughput that saturates NVMe SSDs.
🧠 Smart Caching
A built-in W-TinyLFU Block Cache (via Ristretto) and a HotRing implementation deliver up to 99% cache hit rates and adapt to skewed access patterns.
🌐 Distributed Consistency
Multi-Raft replication for high availability. Percolator model for cross-row ACID transactions. Snapshot Isolation by default.
🔌 Redis Compatible
Drop-in replacement for Redis. Supports the RESP protocol so you can use your existing tools and client libraries.
📊 Performance Benchmark
Latest full baseline (generated on 2026-02-23 with default make bench profile: records=1M, ops=1M, conc=16, value_size=256, workloads A-G, engines NoKV/Badger/Pebble):
| Workload | NoKV (ops/s) | Badger (ops/s) | Pebble (ops/s) |
|---|---|---|---|
| YCSB-A | 847,660 | 396,314 | 1,282,218 |
| YCSB-B | 1,742,820 | 716,151 | 1,941,330 |
| YCSB-C | 2,070,856 | 826,766 | 847,764 |
| YCSB-D | 1,754,955 | 842,637 | 2,509,809 |
| YCSB-E | 205,489 | 41,508 | 554,557 |
| YCSB-F | 715,946 | 326,343 | 1,123,473 |
| YCSB-G | 413,521 | 399,405 | 583,584 |
Raw report: benchmark_results_20260223_195951.txt
🏗️ Architecture
graph TD
Client["Client / Redis"] -->|RESP Protocol| Gateway["Redis Gateway"]
Gateway -->|RaftCmd| RaftStore
subgraph "RaftStore (Distributed Layer)"
RaftStore -->|Propose| RaftLog["Raft Log (WAL)"]
RaftLog -->|Consensus| Apply["Apply Worker"]
end
subgraph "Storage Engine (LSM)"
Apply -->|Batch Set| MemTable
MemTable -->|Flush| SSTable["SSTables (L0-L6)"]
SSTable -->|Compact| SSTable
Apply -->|Large Value| VLog["Value Log"]
end
subgraph "Cache Layer"
BlockCache["Block Cache (Ristretto)"] -.-> SSTable
IndexCache["Index Cache (W-TinyLFU)"] -.-> SSTable
end
Getting Started
This guide gets you from zero to a running NoKV cluster (or an embedded DB) in a few minutes.
Prerequisites
- Go 1.26+
- Git
- (Optional) Docker + Docker Compose for containerized runs
Option A: Local Cluster (recommended for dev)
This launches a 3-node Raft cluster plus a PD-lite service.
./scripts/run_local_cluster.sh --config ./raft_config.example.json
Start the Redis-compatible gateway in another shell:
go run ./cmd/nokv-redis --addr 127.0.0.1:6380 --raft-config raft_config.example.json
Quick smoke test:
redis-cli -p 6380 ping
Inspect stats:
go run ./cmd/nokv stats --workdir ./artifacts/cluster/store-1
Option B: Docker Compose
This runs the cluster and gateway in containers.
docker compose up --build
Tear down:
docker compose down -v
Embedded Usage (single-process)
Use NoKV as a library when you do not need raftstore.
package main
import (
"fmt"
"log"
NoKV "github.com/feichai0017/NoKV"
)
func main() {
opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./workdir-demo"
db := NoKV.Open(opt)
defer db.Close()
key := []byte("hello")
if err := db.Set(key, []byte("world")); err != nil {
log.Fatalf("set failed: %v", err)
}
entry, err := db.Get(key)
if err != nil {
log.Fatalf("get failed: %v", err)
}
fmt.Printf("value=%s\n", entry.Value)
}
Note:
- `DB.Get` returns detached entries (do not call `DecrRef`).
- `DB.GetInternalEntry` returns borrowed entries; callers must call `DecrRef` exactly once.
- `DB.SetWithTTL` accepts a `time.Duration` (relative TTL).
- `DB.Set`/`DB.SetWithTTL` reject `nil` values; use `DB.Del` for deletes.
- `DB.NewIterator` exposes user-facing entries, while `DB.NewInternalIterator` scans raw internal keys (`cf + user_key + ts`).
Benchmarks
Micro benchmarks:
go test -bench=. -run=^$ ./...
YCSB (default: NoKV + Badger + Pebble, workloads A-G):
make bench
Override defaults with env vars:
YCSB_RECORDS=1000000 YCSB_OPS=1000000 YCSB_CONC=8 make bench
Latest full baseline (2026-02-23):
| Workload | NoKV (ops/s) | Badger (ops/s) | Pebble (ops/s) |
|---|---|---|---|
| YCSB-A | 847,660 | 396,314 | 1,282,218 |
| YCSB-B | 1,742,820 | 716,151 | 1,941,330 |
| YCSB-C | 2,070,856 | 826,766 | 847,764 |
| YCSB-D | 1,754,955 | 842,637 | 2,509,809 |
| YCSB-E | 205,489 | 41,508 | 554,557 |
| YCSB-F | 715,946 | 326,343 | 1,123,473 |
| YCSB-G | 413,521 | 399,405 | 583,584 |
Cleanup
If a local run crashes or you want a clean slate:
make clean
Troubleshooting
- WAL replay errors after crash: wipe the workdir and restart the cluster.
- Port conflicts: adjust addresses in `raft_config.example.json`.
- Slow startup: reduce `YCSB_RECORDS` or `YCSB_OPS` when benchmarking locally.
NoKV Architecture Overview
NoKV delivers a hybrid storage engine that can operate as a standalone embedded KV store or as a distributed NoKV service. The distributed RPC surface follows a TinyKV/TiKV-style region + MVCC design, but the service identity and deployment model are NoKV’s own. This document captures the key building blocks, how they interact, and the execution flow from client to disk.
1. High-Level Layout
┌─────────────────────────┐ NoKV gRPC ┌─────────────────────────┐
│ raftstore Service │◀──────────────▶ │ raftstore/client │
└───────────┬─────────────┘ │ (Get / Scan / Mutate) │
│ └─────────────────────────┘
│ ReadCommand / ProposeCommand
▼
┌─────────────────────────┐
│ store.Store / peer.Peer │ ← multi-Raft region lifecycle
│ ├ Manifest snapshot │
│ ├ Router / RegionHooks │
│ └ transport (gRPC) │
└───────────┬─────────────┘
│ Apply via kv.Apply
▼
┌─────────────────────────┐
│ kv.Apply + percolator │
│ ├ Get / Scan │
│ ├ Prewrite / Commit │
│ └ Latch manager │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Embedded NoKV core │
│ ├ WAL Manager │
│ ├ MemTable / Flush │
│ ├ ValueLog + GC │
│ └ Manifest / Stats │
└─────────────────────────┘
- Embedded mode uses `NoKV.Open` directly: WAL→MemTable→SST durability, ValueLog separation, MVCC semantics, rich stats.
- Distributed mode layers `raftstore` on top: multi-Raft regions reuse the same WAL/Manifest, expose metrics, and serve NoKV RPCs.
- Control plane split: `raft_config` provides bootstrap topology; PD provides runtime routing/TSO/control-plane state in cluster mode.
- Clients obtain leader-aware routing, automatic NotLeader/EpochNotMatch retries, and two-phase commit helpers.
Detailed Runtime Paths
For function-level call chains with sequence diagrams (embedded write/read,
iterator scan, distributed read/write via Raft apply), see
docs/runtime.md.
2. Embedded Engine
2.1 WAL & MemTable
- `wal.Manager` appends `[len|type|payload|crc]` records (typed WAL), rotates segments, and replays logs on crash.
- `MemTable` accumulates writes until full, then enters the flush queue; `flush.Manager` runs `Prepare → Build → Install → Release`, logs edits, and releases WAL segments.
- Writes are handled by a single commit worker that performs value-log append first, then WAL/memtable apply, keeping durability ordering simple and consistent.
2.2 ValueLog
- Large values are written to the ValueLog before the WAL append; the resulting `ValuePtr` is stored in WAL/LSM so replay can recover.
- `vlog.Manager` tracks the active head and uses flush discard stats to trigger GC; manifest records new heads and removed segments.
2.3 Manifest
- `manifest.Manager` stores SST metadata, WAL checkpoints, ValueLog metadata, and (importantly) the Region descriptors used by raftstore.
- `CURRENT` provides crash-safe pointer updates; Region state is replicated through manifest edits.
2.4 LSM Compaction & Ingest Buffer
- `compact.Manager` drives compaction cycles; `lsm.levelManager` supplies table metadata and executes the plan.
- Planning is split: `compact.PlanFor*` selects table IDs + key ranges, then LSM resolves the IDs back to tables and runs the merge.
- `compact.State` guards overlapping key ranges and tracks in-flight table IDs.
- Ingest shard selection is policy-driven in `compact` (`PickShardOrder`/`PickShardByBacklog`) while the ingest buffer remains in `lsm`.
flowchart TD
    Manager["compact.Manager"] --> LSM["lsm.levelManager"]
    LSM -->|TableMeta snapshot| Planner["compact.PlanFor*"]
    Planner --> Plan["compact.Plan (fid+range)"]
    Plan -->|resolvePlanLocked| Exec["LSM executor"]
    Exec --> State["compact.State guard"]
    Exec --> Build["subcompact/build SST"]
    Build --> Manifest["manifest edits"]
    L0["L0 tables"] -->|moveToIngest| Ingest["ingest buffer shards"]
    Ingest -->|IngestDrain: ingest-only| Main["Main tables"]
    Ingest -->|IngestKeep: ingest-merge| Ingest
2.5 Distributed Transaction Path
- `percolator` implements Prewrite/Commit/ResolveLock/CheckTxnStatus; `kv.Apply` dispatches raft commands to these helpers.
- MVCC timestamps come from the distributed client/PD TSO flow, not from an embedded standalone transaction API.
- Watermarks (`utils.WaterMark`) are used in durability/visibility coordination; they have no background goroutine and advance via mutex + atomics.
2.6 Write Pipeline & Backpressure
- Writes enqueue into a commit queue (`db_write.go`) where requests are coalesced into batches before a commit worker drains them.
- The commit worker always writes the value log first (when needed), then applies WAL/LSM updates; `SyncWrites` adds a WAL fsync step.
- Batch sizing adapts to backlog (`WriteBatchMaxCount`/`WriteBatchMaxSize`, `WriteBatchWait`), and hot-key pressure can expand batch limits temporarily to drain spikes.
- Backpressure is enforced in two places: LSM throttling toggles `db.blockWrites` when the L0 backlog grows, and HotRing can reject hot keys via `WriteHotKeyLimit`.
2.7 Ref-Count Lifecycle Contracts
NoKV uses fail-fast reference counting for internal pooled/owned objects. DecrRef underflow is treated as a lifecycle bug and panics.
| Object | Owned by | Borrowed by | Release rule |
|---|---|---|---|
| `kv.Entry` (pooled) | internal write/read pipelines | codec iterator, memtable/lsm internal reads, request batches | Must call `DecrRef` exactly once per borrow. |
| `kv.Entry` (detached public result) | caller | none | Returned by `DB.Get`; must not call `DecrRef`. |
| `kv.Entry` (borrowed internal result) | caller | yes (`DecrRef`) | Returned by `DB.GetInternalEntry`; caller must release exactly once. |
| `request` | commit queue/worker | waiter path (`Wait`) | `IncrRef` on enqueue; `Wait` does one `DecrRef`; zero returns the request to the pool and releases entries. |
| `table` | level/main+ingest lists, block cache | table iterators, prefetch workers | Removed tables are decremented once after the manifest+in-memory swap; zero deletes the SST. |
| Skiplist / ART index | memtable | iterators | Iterator creation increments the index ref; iterator `Close` decrements; double-close is idempotent. |
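The fail-fast contract can be illustrated with a minimal refcount sketch. The `entry` type here is a toy stand-in, not NoKV's `kv.Entry`; what it demonstrates is the documented rule that a `DecrRef` underflow panics instead of being silently tolerated.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// entry mimics a pooled object guarded by fail-fast reference counting.
type entry struct {
	ref      int32
	released bool
}

func (e *entry) IncrRef() { atomic.AddInt32(&e.ref, 1) }

// DecrRef panics on underflow: releasing more times than borrowed is a
// lifecycle bug, not a recoverable condition.
func (e *entry) DecrRef() {
	n := atomic.AddInt32(&e.ref, -1)
	if n < 0 {
		panic("entry: DecrRef underflow (double release)")
	}
	if n == 0 {
		e.released = true // zero: return to pool / free resources
	}
}

func main() {
	e := &entry{}
	e.IncrRef() // owner
	e.IncrRef() // borrower
	e.DecrRef() // borrower releases exactly once
	e.DecrRef() // owner releases; refcount hits zero
	fmt.Println(e.released)
}
```

Panicking here is a deliberate trade-off: a leaked or double-released pooled object corrupts later reuse, so surfacing the bug at the faulty call site is cheaper than debugging the corruption downstream.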
3. Replication Layer (raftstore)
| Package | Responsibility |
|---|---|
| `store` | Region catalog, router, RegionMetrics, Region hooks, manifest integration, helpers such as `StartPeer` / `SplitRegion`. |
| `peer` | Wraps etcd/raft `RawNode`, handles the Ready pipeline, snapshot resend queue, backlog instrumentation. |
| `engine` | `WALStorage`/`DiskStorage`/`MemoryStorage`, reusing the DB's WAL while keeping manifest metadata in sync. |
| `transport` | gRPC transport for Raft Step messages, connection management, retries/blocks/TLS. Also acts as the host for NoKV RPC. |
| `kv` | NoKV RPC handler plus `kv.Apply`, bridging Raft commands to MVCC logic. |
| `server` | `ServerConfig` + `New` combine DB, Store, transport, and the NoKV service into a reusable node instance. |
3.1 Bootstrap Sequence
- `raftstore.NewServer` wires the DB, store configuration (StoreID, hooks, scheduler), Raft config, and transport address. It registers NoKV RPC on the shared gRPC server and sets `transport.SetHandler(store.Step)`.
- The CLI (`nokv serve`) or application enumerates `Manifest.RegionSnapshot()` and calls `Store.StartPeer` for every Region containing the local store:
  - `peer.Config` includes Raft params, transport, `kv.NewEntryApplier`, WAL/Manifest handles, Region metadata.
  - Router registration, regionManager bookkeeping, optional `Peer.Bootstrap` with initial peer list, leader campaign.
- Peers from other stores can be configured through `transport.SetPeer(storeID, addr)`. In cluster mode, runtime routing/control-plane decisions come from PD.
3.2 Command Paths
- ReadCommand (`KvGet`/`KvScan`): validate Region & leader, execute Raft ReadIndex (`LinearizableRead`) and `WaitApplied`, then run `commandApplier` (i.e. `kv.Apply` in read mode) to fetch data from the DB. This yields leader-strong reads with an explicit Raft linearizability barrier.
- ProposeCommand (write): encode the request, push it through the Router to the leader peer, replicate via Raft, and apply in `kv.Apply`, which maps to MVCC operations.
3.3 Transport
- The gRPC server handles Step RPCs and NoKV RPCs on the same endpoint; peers are registered via `SetPeer`.
- Retry policies (`WithRetry`) and TLS credentials are configurable. Tests cover partitions, blocked peers, and slow followers.
4. NoKV Service
raftstore/kv/service.go exposes pb.NoKV RPCs:
| RPC | Execution | Result |
|---|---|---|
| `KvGet` | `store.ReadCommand` → `kv.Apply` GET | `pb.GetResponse` / RegionError |
| `KvScan` | `store.ReadCommand` → `kv.Apply` SCAN | `pb.ScanResponse` / RegionError |
| `KvPrewrite` | `store.ProposeCommand` → `percolator.Prewrite` | `pb.PrewriteResponse` |
| `KvCommit` | `store.ProposeCommand` → `percolator.Commit` | `pb.CommitResponse` |
| `KvResolveLock` | `percolator.ResolveLock` | `pb.ResolveLockResponse` |
| `KvCheckTxnStatus` | `percolator.CheckTxnStatus` | `pb.CheckTxnStatusResponse` |
`nokv serve` is the CLI entry point: it opens the DB, constructs `raftstore.Server`, registers peers, starts the local Raft peers, and displays a manifest summary (Regions, key ranges, peers). `scripts/run_local_cluster.sh` builds the CLI, writes a minimal region manifest, launches multiple `nokv serve` processes on localhost, and handles cleanup on Ctrl+C.
The RPC request/response shape is intentionally close to TinyKV/TiKV so the MVCC and region semantics remain familiar, but the service name exposed on the wire is pb.NoKV.
5. Client Workflow
raftstore/client offers a leader-aware client with retry logic and convenient helpers:
- Initialization: provide `[]StoreEndpoint` + a `RegionResolver` (`GetRegionByKey`) so runtime routing is PD-driven.
- Reads: `Get` and `Scan` pick the leader store for a key range, issue NoKV RPCs, and retry on NotLeader/EpochNotMatch.
- Writes: `Mutate` bundles operations per region and drives Prewrite/Commit (primary first, secondaries after); `Put` and `Delete` are convenience wrappers using the same 2PC path.
- Timestamps: clients must supply `startVersion`/`commitVersion`. For distributed demos, use PD-lite (`nokv pd`) to obtain globally increasing values before calling `TwoPhaseCommit`.
- Bootstrap helpers: `scripts/run_local_cluster.sh --config raft_config.example.json` builds the binaries, seeds manifests via `nokv-config manifest`, launches PD-lite, and starts the stores declared in the config.
Example (two regions)
- Regions `[a,m)` and `[m,+∞)`, each led by a different store.
- `Mutate(ctx, primary="alfa", mutations, startTs, commitTs, ttl)` prewrites and commits across the relevant regions.
- `Get`/`Scan` retry automatically if the leader changes.
- See `raftstore/server/server_test.go` for a full end-to-end example using real `raftstore.Server` instances.
6. Failure Handling
- Manifest edits capture Region metadata, WAL checkpoints, and ValueLog pointers. Restart simply reads `CURRENT` and replays edits.
- WAL replay reconstructs memtables and Raft groups; ValueLog recovery trims partial records.
- `Stats.StartStats` resumes metrics sampling immediately after restart, making it easy to verify recovery correctness via `nokv stats`.
7. Observability & Tooling
- `StatsSnapshot` publishes flush/compaction/WAL/VLog/raft/region/hot/cache metrics. `nokv stats` and the expvar endpoint expose the same data.
- `nokv regions` inspects Manifest-backed Region metadata.
- `nokv serve` advertises Region samples on startup (ID, key range, peers) for quick verification.
- Inspect scheduler/control-plane state via PD APIs/metrics.
- Scripts:
  - `scripts/run_local_cluster.sh` – launch a multi-node NoKV cluster locally.
  - `scripts/recovery_scenarios.sh` – crash-recovery test harness.
  - `scripts/transport_chaos.sh` – inject network faults and observe transport metrics.
8. When to Use NoKV
- Embedded: call `NoKV.Open` and use the local non-transactional DB APIs.
- Distributed: deploy `nokv serve` nodes, use `raftstore/client` (or any NoKV gRPC client) to perform reads, scans, and 2PC writes.
- Observability-first: inspection via CLI or expvar is built-in; Region, WAL, Flush, and Raft metrics are accessible without extra instrumentation.
See also docs/raftstore.md for deeper internals, docs/pd.md for control-plane details, and docs/testing.md for coverage details.
Runtime Call Chains (Current)
This document focuses on the current execution paths in NoKV and maps API calls to concrete functions in the codebase.
It intentionally describes only what is running today.
1. API Surface Snapshot
| Mode | Read APIs | Write APIs | Txn APIs |
|---|---|---|---|
| Embedded (`NoKV.DB`) | `Get`, `NewIterator`, `NewInternalIterator` | `Set`, `SetWithTTL`, `Del`, `ApplyInternalEntries` | N/A (no standalone local txn API) |
| Distributed (`raftstore/kv`) | `KvGet`, `KvBatchGet`, `KvScan` | N/A (no direct write) | `KvPrewrite`, `KvCommit`, `KvBatchRollback`, `KvResolveLock`, `KvCheckTxnStatus` |
Core entry points:
- Embedded DB: `db.go`, `db_write.go`, `iterator.go`
- Distributed RPC: `raftstore/kv/service.go`
- Raft read/propose bridge: `raftstore/store/command_service.go`
- MVCC logic: `percolator/txn.go`, `percolator/reader.go`
2. Embedded Write Path (Set / SetWithTTL / Del)
2.1 Function-Level Chain
- `DB.Set`/`DB.SetWithTTL`/`DB.Del` creates an internal-key entry via `kv.NewInternalEntry`.
- `DB.ApplyInternalEntries` validates each internal key via `kv.SplitInternalKey`, then calls `batchSet`.
- `batchSet` enqueues the request (`sendToWriteCh` -> commit queue).
- `commitWorker` drains a batch:
  - `vlog.write(requests)` writes large values first and produces `ValuePtr`.
  - `applyRequests` -> `writeToLSM` -> `lsm.SetBatch`.
- `lsm.SetBatch` writes one atomic batch:
  - `memTable.setBatch`
  - `wal.AppendEntryBatch`
  - mem index insert.
2.2 Sequence Diagram
sequenceDiagram
participant U as User API
participant DB as DB.Set/SetWithTTL/Del
participant Q as commitQueue
participant W as commitWorker
participant V as vlog.write
participant L as lsm.SetBatch
participant M as memTable.setBatch
participant WAL as wal.AppendEntryBatch
U->>DB: Set/SetWithTTL/Del
DB->>DB: NewInternalEntry + ApplyInternalEntries
DB->>Q: sendToWriteCh / enqueueCommitRequest
Q->>W: nextCommitBatch
W->>V: write(requests)
V-->>W: ValuePtr for large values
W->>L: writeToLSM(entries)
L->>M: setBatch(entries)
M->>WAL: AppendEntryBatch(entries)
M-->>L: index.Add(...)
L-->>W: success
3. Embedded Read Path (Get / GetInternalEntry)
3.1 Function-Level Chain
- `DB.Get` builds `InternalKey(CFDefault, userKey, nonTxnMaxVersion)`.
- `loadBorrowedEntry` calls `lsm.Get` for the newest visible internal record.
- If the value is a pointer (`BitValuePointer`), the real bytes are read via `vlog.read` and the pointer bit is cleared.
- `PopulateInternalMeta` ensures the `CF`/`Version` cache matches the internal key.
- `DB.Get` returns a detached public entry via `cloneEntry` (user key + copied value).
- `DB.GetInternalEntry` returns a borrowed internal entry (caller must `DecrRef`).
3.2 Sequence Diagram
sequenceDiagram
participant U as User API
participant DB as DB.Get/GetInternalEntry
participant LSM as lsm.Get
participant VLOG as vlog.read
U->>DB: Get(userKey)
DB->>DB: InternalKey(CFDefault,userKey,nonTxnMaxVersion)
DB->>LSM: Get(internalKey)
LSM-->>DB: pooled Entry (internal key)
alt BitValuePointer set
DB->>VLOG: read(ValuePtr)
VLOG-->>DB: raw value bytes
end
DB->>DB: PopulateInternalMeta
alt Get (public)
DB->>DB: cloneEntry -> user key/value copy
DB-->>U: detached Entry
else GetInternalEntry
DB-->>U: borrowed internal Entry (DecrRef required)
end
4. Iterator Paths
4.1 Public Iterator (DB.NewIterator)
- Build the merged internal iterator: `lsm.NewIterators` + `lsm.NewMergeIterator`.
- `Seek` converts the user key to an internal seek key (CFDefault + nonTxnMaxVersion).
- `populate`/`materialize`:
  - parse the internal key (`kv.SplitInternalKey`)
  - apply bounds on the user key
  - optionally resolve the vlog pointer
  - expose the user-key item.
4.2 Internal Iterator (DB.NewInternalIterator)
- Directly returns a merged iterator over internal keys.
- No user-key rewrite; the caller handles `kv.SplitInternalKey`.
flowchart TD
A["DB.NewIterator"] --> B["lsm.NewMergeIterator(internal keys)"]
B --> C["Seek(userKey -> InternalKey)"]
C --> D["populate/materialize"]
D --> E["SplitInternalKey + bounds check"]
E --> F{"BitValuePointer?"}
F -- no --> G["Expose inline value"]
F -- yes --> H["vlog.read(ValuePtr)"]
H --> G
G --> I["Item.Entry uses user key"]
5. Distributed Read Path (KvGet / KvBatchGet / KvScan)
5.1 Function-Level Chain
- `raftstore/kv.Service` builds a `RaftCmdRequest` from the NoKV RPC.
- `Store.ReadCommand`:
  - `validateCommand` (region/epoch/leader/key-range)
  - `peer.LinearizableRead`
  - `peer.WaitApplied`
  - `commandApplier(req)` (injected as `kv.Apply`).
- `kv.Apply` executes:
  - `handleGet` -> `percolator.Reader.GetLock` + `GetValue`
  - `handleScan` -> iterate CFWrite, resolve visible versions.
5.2 Sequence Diagram
sequenceDiagram
participant C as NoKV Client
participant SVC as kv.Service
participant ST as Store.ReadCommand
participant P as peer.LinearizableRead
participant AP as kv.Apply
participant R as percolator.Reader
participant DB as DB
C->>SVC: KvGet/KvBatchGet/KvScan
SVC->>ST: ReadCommand(RaftCmdRequest)
ST->>ST: validateCommand
ST->>P: LinearizableRead + WaitApplied
ST->>AP: commandApplier(req)
AP->>R: GetLock/GetValue or scan CFWrite
R->>DB: GetInternalEntry/NewInternalIterator
AP-->>SVC: RaftCmdResponse
SVC-->>C: NoKV response
6. Distributed Write Path (2PC via Raft Apply)
6.1 Function-Level Chain
- The client (`raftstore/client`) runs `Mutate`/`TwoPhaseCommit` by region.
- The RPC layer (`kv.Service`) sends write commands through `Store.ProposeCommand`.
- Raft replication commits log entries; the apply path invokes `kv.Apply`.
- `kv.Apply` dispatches to `percolator.Prewrite`/`Commit`/`BatchRollback`/`ResolveLock`/`CheckTxnStatus`.
- Percolator mutators call `applyVersionedOps`:
  - build entries via `kv.NewInternalEntry`
  - call `db.ApplyInternalEntries`
  - release refs (`DecrRef`).
- Storage then follows the same embedded write pipeline (vlog -> LSM/WAL).
6.2 Sequence Diagram
sequenceDiagram
participant CL as raftstore/client
participant SVC as kv.Service
participant ST as Store.ProposeCommand
participant RF as Raft replicate/apply
participant AP as kv.Apply
participant TXN as percolator.txn
participant DB as DB.ApplyInternalEntries
CL->>SVC: KvPrewrite/KvCommit/...
SVC->>ST: ProposeCommand
ST->>RF: route to leader + replicate
RF->>AP: apply committed command
AP->>TXN: Prewrite/Commit/Resolve...
TXN->>DB: applyVersionedOps -> ApplyInternalEntries
DB-->>TXN: write result
AP-->>SVC: RaftCmdResponse
SVC-->>CL: NoKV response
7. Entry Ownership and Refcount Rules
| Source | Returned entry type | Key form | Caller action |
|---|---|---|---|
| `DB.GetInternalEntry` | Borrowed pooled | Internal key | Must call `DecrRef()` once |
| `DB.Get` | Detached copy | User key | Must not call `DecrRef()` |
| `percolator.applyVersionedOps` temporary entries | Borrowed pooled | Internal key | Always `DecrRef()` after `ApplyInternalEntries` |
| `LSM.Get` / memtable reads | Borrowed pooled | Internal key | Upstream owner must release |
8. Key/Value Shape by Stage
| Stage | Entry.Key | Entry.Value | Notes |
|---|---|---|---|
| User write before queue | Internal key (CF + user key + ts) | Raw user bytes | Built by NewInternalEntry |
| After vlog step | Internal key | Inline value or ValuePtr.Encode() | Pointer marked by BitValuePointer |
| LSM/WAL stored form | Internal key | Encoded value payload | Used by replay/flush/compaction |
| `GetInternalEntry` output | Internal key | Raw value bytes (pointer resolved) | Internal caller view |
| `Get` / public iterator output | User key | Raw value bytes | External caller view |
Error Handling Guide
This document defines how NoKV should own, define, and propagate errors.
1. Ownership Rules
- Domain errors stay in domain packages.
- Cross-cutting runtime errors may live in `utils` only when shared by multiple subsystems.
- Command-local/business-flow errors should be unexported (`errXxx`) and stay in command/service packages.
Examples:
- `kv`: entry codec/read-decode errors.
- `vfs`: filesystem contract errors.
- `pd/core`: control-plane validation/conflict errors.
2. Propagation Rules
- Wrap with `%w` when crossing package boundaries.
- Match via `errors.Is`, not string comparison.
- Keep stable sentinel values for retryable / control-flow decisions.
- Add context in upper layers; do not lose original cause.
3. Naming Rules
- Exported sentinels use `ErrXxx`.
- Error text should be lowercase and package-scoped when useful (for example `pd/core: ...`, `vfs: ...`).
- Avoid duplicate sentinels with identical semantics in different packages.
4. Current Error Map
Shared runtime sentinels
- `utils/error.go`: common cross-package sentinels such as invalid request, key/value validation errors, throttling, and lifecycle guards.
Domain-specific sentinels
- `kv/entry_codec.go`: `ErrBadChecksum`, `ErrPartialEntry`
- `vfs/vfs.go`: `ErrRenameNoReplaceUnsupported`
- `lsm/compact/errors.go`: compaction planner/runtime domain errors
- `raftstore/peer/errors.go`: peer lifecycle/state errors
- `pb/errorpb.proto`: region/store routing protobuf errors (`RegionError`, `StoreNotMatch`, `RegionNotFound`, `KeyNotInRegion`, …)
- `wal/errors.go`: WAL encode/decode and segment errors
- `pd/core/errors.go`: PD metadata and range validation errors
5. Propagation in Hot Paths
- Embedded write path (`DB.Set*` -> commit worker -> LSM/WAL):
  - validation returns a direct sentinel (`ErrEmptyKey`, `ErrNilValue`, `ErrInvalidRequest`);
  - storage boundary errors are wrapped with context and preserved via `%w`.
- Distributed command path (`kv.Service` -> `Store.*Command` -> `kv.Apply`):
  - region/leader/store/range failures are mapped to `errorpb` messages in protobuf responses;
  - execution failures return Go errors to the RPC layer and are translated to gRPC status.
- Recovery/replay path (WAL/Vlog/Manifest):
  - partial/corrupt records return domain sentinels and are handled by truncation or restart logic in upper layers.
Entry Lifecycle, Encoding, and Field Semantics
kv.Entry is the core record container that flows through user APIs, commit batching,
WAL/value-log codecs, memtable indexes, SST blocks, and iterators.
This document explains:
- How key/value bytes are encoded.
- What `Entry.Key`/`Entry.Value` mean at each stage.
- Which fields are authoritative vs derived.
- How `PopulateInternalMeta` keeps internal fields consistent.
- Borrowed-vs-detached ownership rules.
1. Structure Overview
Source: kv/entry.go, kv/key.go, kv/value.go
type Entry struct {
Key []byte
Value []byte
ExpiresAt uint64
CF ColumnFamily
Meta byte
Version uint64
Offset uint32
Hlen int
ValThreshold int64
ref int32
}
Important interpretation:
- `Key` is the canonical source of truth for internal records.
- `CF` and `Version` are cached/derived fields for convenience.
- `Value` can represent either:
  - inline value bytes, or
  - encoded `ValuePtr` bytes (`Meta` has `BitValuePointer`).
2. Encoding Layers
2.1 Internal Key Encoding
Source: kv/key.go
InternalKey(cf, userKey, ts) layout:
- 4-byte CF header: `0xFF 'C' 'F' <cf-byte>`
- raw user key bytes
- 8-byte big-endian descending timestamp (`MaxUint64 - ts`)
Helpers:
- `SplitInternalKey(internal) -> (cf, userKey, ts, ok)`
- `Timestamp(key)` / `StripTimestamp(key)` / `SameKey(a, b)`
2.2 ValueStruct Encoding
Source: kv/value.go
ValueStruct layout:
- `Meta` (1B)
- `ExpiresAt` (uvarint)
- `Value` (raw bytes)
ValueStruct does not store Version; version is always taken from internal key.
2.3 Entry Record Encoding (WAL / Vlog record payload)
Source: kv/entry_codec.go
Entry codec layout:
- header (uvarints: keyLen/valueLen/meta/expiresAt)
- key bytes
- value bytes
- crc32
`DecodeEntryFrom` now calls `PopulateInternalMeta()` after key decode, so internal
records get a consistent `CF`/`Version` immediately.
3. Stage-by-Stage Meaning of Key and Value
3.1 User write (DB.Set, DB.SetWithTTL, DB.ApplyInternalEntries)
Source: db.go
- `Set`/`SetWithTTL` use `NewInternalEntry(...)`:
  - `Key`: encoded internal key.
  - `Value`: user value bytes.
- `ApplyInternalEntries` validates the internal key, then writes back the parsed `CF`/`Version` from the key before entering the write pipeline.
3.2 Commit worker: vlog then LSM apply
Source: db_write.go, vlog.go
- Before `LSM.SetBatch`, large values are replaced by `ValuePtr.Encode()` bytes and `BitValuePointer` is set.
- Small values stay inline.
- `Key` remains the internal key throughout.
3.3 WAL replay / vlog iteration decode
Source: kv/entry_codec.go, lsm/memtable.go
- Decoded records carry internal key bytes in `Key`.
- `Value` is the record payload bytes (inline value or pointer bytes).
- `CF`/`Version` are derived from the key via `PopulateInternalMeta`.
3.4 Memtable index lookup
Source: lsm/memtable.go, utils/skiplist.go, utils/art.go
- `memIndex.Search(...)` returns `(matchedInternalKey, ValueStruct)`.
- `memTable.Get` assembles a pooled `Entry` from this and calls `PopulateInternalMeta`.
- On a miss, `Key=nil` and a zero value struct (a sentinel, not a valid record).
3.5 SST / iterator decode
Source: lsm/builder.go, lsm/table.go
- Block iterator reconstructs internal key + value struct.
- It calls `PopulateInternalMeta` before exposing the item entry.
- `table.Search` clones key/value and re-populates internal metadata.
3.6 Internal read API (GetInternalEntry)
Source: db.go
- `loadBorrowedEntry` fetches from the LSM.
- If `BitValuePointer` is set:
  - decode the pointer from `Value`
  - read the real bytes from vlog
  - replace `Value` with the actual value bytes
  - clear `BitValuePointer`
- Return the entry with the internal key still in `Key`, and `CF`/`Version` re-populated.
3.7 Public read API (Get, public iterator)
Source: db.go, iterator.go
- Public APIs convert internal key to user key for external consumers.
- The returned entry is a detached copy (`DB.Get`) or an iterator-materialized object.
3.8 Runtime State Flow Diagram
flowchart TD
A["DB.Set/SetWithTTL/Del"] --> B["kv.NewInternalEntry"]
B --> C["DB.ApplyInternalEntries"]
C --> D["commitWorker: vlog.write"]
D --> E["LSM/WAL persist"]
E --> F["DB.GetInternalEntry"]
F --> G{"BitValuePointer?"}
G -- yes --> H["vlog.read + clear pointer bit"]
G -- no --> I["inline value"]
H --> J["PopulateInternalMeta"]
I --> J
J --> K["DB.Get cloneEntry -> user key/value"]
J --> L["DB.NewIterator materialize -> user key/value"]
4. Field Validity Matrix
| Context | Key | Value | CF/Version | Ownership |
|---|---|---|---|---|
| Internal write path | Internal key | Inline value or ptr-bytes (large values) | Valid (set at build/validate time) | Borrowed/pooled |
| Decoded WAL/vlog record | Internal key | Encoded record payload value bytes | Valid after PopulateInternalMeta | Borrowed/pooled |
| `GetInternalEntry` return | Internal key | Real value bytes (pointer resolved) | Valid (re-populated before return) | Borrowed/pooled |
| `DB.Get` return | User key | Real value bytes | Valid for external semantics | Detached copy |
| Memtable miss sentinel | nil | nil | CFDefault/0 | Borrowed/pooled sentinel |
Rule of thumb:
- Internal code should treat `Key` as authoritative.
- `CF`/`Version` should be expected to match `SplitInternalKey(Key)` after `PopulateInternalMeta`.
5. PopulateInternalMeta Semantics
Source: kv/entry.go
func (e *Entry) PopulateInternalMeta() bool
Behavior:
- Parse `e.Key` as an internal key.
- If parsing succeeds:
  - `e.CF = parsedCF`
  - `e.Version = parsedTS`
  - return `true`
- If parsing fails:
  - reset cache fields to safe defaults (`CFDefault`, `Version=0`)
  - return `false`
Why it exists:
- pooled entries can carry stale cached fields if not reset carefully;
- parse-once helper gives a single normalization point at codec/index/read boundaries;
- it keeps key-derived metadata (`CF`/`Version`) authoritative and consistent.
6. Ownership and Refcount States
Source: kv/entry.go, docs/architecture.md
Borrowed entry (internal)
Returned by:
- `DecodeEntryFrom`
- `memTable.Get`
- `LSM.Get` internals
- `DB.GetInternalEntry`
Contract:
- caller must call `DecrRef()` exactly once.
- double release panics (underflow guard).
Detached entry (public)
Returned by:
- `DB.Get` (via clone)
Contract:
- do not call `DecrRef()`.
7. Contributor Rules
- Any new internal decode boundary should either:
  - call `PopulateInternalMeta`, or
  - immediately parse the key with `SplitInternalKey` and set the fields consistently.
- Do not reintroduce value-side version caches.
- Keep internal APIs internal-key-first; only external APIs should expose user keys.
- Preserve borrowed/detached ownership contracts in comments and tests.
Configuration & Options
NoKV exposes two configuration surfaces:
- Runtime options for the embedded engine (`Options` in `options.go`).
- Cluster topology for distributed mode (`raft_config.example.json` via `config.LoadFile`/`Validate`).
1. Runtime Options (Embedded Engine)
NoKV.NewDefaultOptions() returns a tuned baseline. Override fields before
calling NoKV.Open(opt).
Key option groups (see options.go for the full list):
- Paths & durability: `WorkDir`, `SyncWrites`, `ManifestSync`, `ManifestRewriteThreshold`
- Write pipeline: `WriteBatchMaxCount`, `WriteBatchMaxSize`, `WriteBatchWait`
- Value log: `ValueThreshold`, `ValueLogFileSize`, `ValueLogMaxEntries`, `ValueLogGCInterval`, `ValueLogGCDiscardRatio`, `ValueLogGCParallelism`, `ValueLogGCReduceScore`, `ValueLogGCSkipScore`, `ValueLogGCReduceBacklog`, `ValueLogGCSkipBacklog`, `ValueLogGCSampleSizeRatio`, `ValueLogGCSampleCountRatio`, `ValueLogGCSampleFromHead`, `ValueLogBucketCount`, `ValueLogHotBucketCount`, `ValueLogHotKeyThreshold`
- LSM & compaction: `MemTableSize`, `MemTableEngine`, `SSTableMaxSz`, `NumCompactors`, `NumLevelZeroTables`, `IngestCompactBatchSize`, `IngestBacklogMergeScore`, `CompactionValueWeight`, `CompactionValueAlertThreshold`
- Caches: `BlockCacheSize`, `BloomCacheSize`
- Hot key throttling: `WriteHotKeyLimit`, `HotWriteBurstThreshold`, `HotWriteBatchMultiplier`, `HotRingEnabled`, `HotRingTopK`, decay/window settings, `HotRingNodeCap`, `HotRingNodeSampleBits`, `HotRingRotationInterval`, `ValueLogHotRingOverride` + `ValueLogHotRing*` overrides
- WAL watchdog: `EnableWALWatchdog`, `WALAutoGCInterval`, `WALAutoGCMinRemovable`, `WALAutoGCMaxBatch`, `WALTypedRecordWarnRatio`, `WALTypedRecordWarnSegments`
- Raft lag warnings (stats only): `RaftLagWarnSegments`
Example:

```go
opt := NoKV.NewDefaultOptions()
opt.WorkDir = "./data"
opt.SyncWrites = true
opt.ValueThreshold = 1024
opt.WriteBatchMaxCount = 128
db := NoKV.Open(opt)
defer db.Close()
```
Load Options From TOML
For convenience, you can load engine options from a TOML file. Unspecified
fields keep their defaults from NewDefaultOptions.
```go
opt, err := NoKV.LoadOptionsFile("nokv.options.toml")
if err != nil {
	log.Fatal(err)
}
db := NoKV.Open(opt)
defer db.Close()
```
Example (TOML):

```toml
work_dir = "./data"
mem_table_engine = "art"
value_threshold = 1024
write_hot_key_limit = 128
value_log_gc_interval = "30s"
```
Notes:
- Field names are case-insensitive; `_`/`-`/`.` are ignored.
- Durations accept Go-style strings (e.g. `"30s"`, `"200ms"`). Numeric durations are interpreted as nanoseconds.
- File extensions `.toml` and `.tml` are accepted.
- JSON option files are rejected by design.
- Unknown fields return an error so typos do not silently pass.
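The separator-and-case-insensitive matching can be pictured with a small normalizer. `normalizeField` is an illustrative name for this sketch, not the actual helper in the codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeField sketches how "case-insensitive, ignoring _/-/." matching can
// work: lowercase the name and drop the separator characters, then compare.
func normalizeField(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if r == '_' || r == '-' || r == '.' {
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	// Both spellings resolve to the same Options field.
	fmt.Println(normalizeField("value_log_gc_interval") == normalizeField("ValueLogGCInterval"))
}
```

Combined with the unknown-field error, this lets TOML keys stay idiomatic (`snake_case`) while still mapping unambiguously onto Go field names.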
2. Raft Topology File
raft_config.example.json is the single source of truth for distributed
topology. It is consumed by scripts, cmd/nokv-redis, and the config package.
Precedence rule: when a value can be provided by both CLI flags and config file, CLI flags take precedence; config acts as startup defaults.
Minimal shape:
```json
{
  "max_retries": 8,
  "pd": {
    "addr": "127.0.0.1:2379",
    "docker_addr": "nokv-pd:2379",
    "work_dir": "./artifacts/cluster/pd",
    "docker_work_dir": "/var/lib/nokv-pd"
  },
  "stores": [
    { "store_id": 1, "listen_addr": "127.0.0.1:20170", "addr": "127.0.0.1:20170" }
  ],
  "regions": [
    {
      "id": 1,
      "start_key": "-",
      "end_key": "-",
      "epoch": { "version": 1, "conf_version": 1 },
      "peers": [{ "store_id": 1, "peer_id": 101 }],
      "leader_store_id": 1
    }
  ]
}
```
Notes:
- `start_key`/`end_key` accept plain strings, `hex:<bytes>`, or base64. Use `"-"` or empty for unbounded ranges.
- `stores` define both host and docker addresses for local runs vs containers.
- `pd.addr` is the default PD endpoint for host scope; `pd.docker_addr` is used when tools run in docker scope.
- `pd.work_dir`/`pd.docker_work_dir` are optional PD persistence directories used by bootstrap tooling and `nokv pd --config ...` when `--workdir` is not set explicitly.
- `leader_store_id` is optional bootstrap metadata. Runtime routing in cluster mode is resolved through PD (`GetRegionByKey`), not static leader hints.
Programmatic loading:

```go
cfg, _ := config.LoadFile("raft_config.example.json")
if err := cfg.Validate(); err != nil { /* handle */ }
```
Related tools:
- `scripts/run_local_cluster.sh --config raft_config.example.json`
- `go run ./cmd/nokv-redis --raft-config raft_config.example.json`
CLI (cmd/nokv) Reference
nokv provides operational visibility similar to RocksDB ldb / Badger CLI, with script-friendly JSON output.
Installation
```shell
go install ./cmd/nokv
```
Shared Flags
- `--workdir <path>`: NoKV database directory (must contain `CURRENT` for manifest commands)
- `--json`: JSON output (default is plain text)
- `--expvar <url>`: for `stats`, fetch from `/debug/vars`
- `--no-region-metrics`: for offline `stats`, skip attaching runtime region metrics
Subcommands
nokv stats
- Reads `StatsSnapshot` either offline (`--workdir`) or online (`--expvar`)
- JSON output is nested by domain (not flat)
Common fields:
- `entries`
- `flush.pending`, `flush.queue_length`, `flush.last_wait_ms`
- `compaction.backlog`, `compaction.max_score`
- `value_log.segments`, `value_log.pending_deletes`, `value_log.gc.*`
- `wal.active_segment`, `wal.segment_count`, `wal.typed_record_ratio`
- `write.queue_depth`, `write.queue_entries`, `write.hot_key_limited`
- `region.total`, `region.running`, `region.removing`
- `hot.read_keys`, `hot.write_keys`
- `lsm.levels`, `lsm.value_bytes_total`
- `transport.*`, `redis.*`
Example:

```shell
nokv stats --workdir ./testdata/db --json | jq '.flush.queue_length'
```
nokv manifest
- Reads manifest version state
- Shows log pointer, per-level file info, and value-log metadata
nokv vlog
- Lists value-log segments and current head per bucket
- Useful after GC/recovery checks
nokv regions
- Dumps manifest-backed region catalog (state/range/epoch/peers)
- Supports `--json`
nokv serve
- Starts NoKV gRPC service backed by local `raftstore`
- Requires `--workdir`, `--store-id`, and `--pd-addr`
- Common flags:
  - `--addr` (default `127.0.0.1:20160`)
  - `--peer storeID=address` (repeatable)
  - `--election-tick`, `--heartbeat-tick`
  - `--raft-max-msg-bytes`, `--raft-max-inflight`
  - `--raft-tick-interval`, `--raft-debug-log`
Example:

```shell
nokv serve \
  --workdir ./artifacts/cluster/store-1 \
  --store-id 1 \
  --addr 127.0.0.1:20170 \
  --pd-addr 127.0.0.1:2379 \
  --peer 2=127.0.0.1:20171 \
  --peer 3=127.0.0.1:20172
```
Integration Tips
- Combine with `RECOVERY_TRACE_METRICS=1` for recovery validation.
- In CI, compare JSON snapshots to detect observability regressions.
- Use `nokv stats --expvar` for online diagnostics and `--workdir` for offline forensics.
Memtable Design & Lifecycle
NoKV’s write path mirrors RocksDB: every write lands in the WAL and an in-memory memtable backed by a selectable in-memory index (skiplist or ART). The implementation lives in lsm/memtable.go and ties directly into the flush manager (lsm/flush).
1. Structure
```go
type memTable struct {
	lsm        *LSM
	segmentID  uint32 // WAL segment backing this memtable
	index      memIndex
	maxVersion uint64
	walSize    int64
}
```
The memtable index is an interface that can be backed by either a skiplist or ART:
```go
type memIndex interface {
	Add(*kv.Entry)
	Search([]byte) ([]byte, kv.ValueStruct)
	NewIterator(*utils.Options) utils.Iterator
	MemSize() int64
	IncrRef()
	DecrRef()
}
```
- Memtable engine – `Options.MemTableEngine` selects `art` (default) or `skiplist` via `newMemIndex`. ART is not a generic trie: it is an internal-key-only memtable index. It uses a reversible mem-comparable route key so trie ordering matches the LSM internal-key comparator; `skiplist` remains available as the simpler baseline alternative.
- Arena sizing – both `utils.NewSkiplist` and `utils.NewART` use `arenaSizeFor` to derive arena capacity from `Options.MemTableSize`.
- WAL coupling – every `Set` uses `kv.EncodeEntry` to materialise the payload to the active WAL segment before inserting into the chosen index. `walSize` tracks how much of the segment is consumed so flush can release it later.
- Segment ID – `LSM.NewMemtable` atomically increments `levels.maxFID`, switches the WAL to a new segment (`wal.Manager.SwitchSegment`), and tags the memtable with that FID. This matches RocksDB’s `logfile_number` field.
- ART specifics – ART stores prefix-compressed inner nodes (Node4/16/48/256). Each leaf keeps both the private route key used by the trie and the original canonical internal key returned to callers. The main concurrency model is still copy-on-write payload/node cloning with CAS installs; the only retained writer-side OLC-lite fast path is a narrow in-place `replaceChild` update. Reads stay lock-free and do not run full version validation.
2. Lifecycle
```mermaid
sequenceDiagram
    participant WAL
    participant MT as MemTable
    participant Flush
    participant Manifest
    WAL->>MT: Append+Set(entry)
    MT->>Flush: freeze (walSize + incomingEstimate > limit)
    Flush->>Manifest: LogPointer + AddFile
    Manifest-->>Flush: ack
    Flush->>WAL: Release segments ≤ segmentID
```
- Active → Immutable – when `mt.walSize + estimate` exceeds `Options.MemTableSize`, the memtable is rotated and pushed onto the flush queue. The new active memtable triggers another WAL segment switch.
- Flush – the flush manager drains immutable memtables, builds SSTables, logs manifest edits, and releases the WAL segment ID recorded in `memTable.segmentID` once the SST is durably installed.
- Recovery – `LSM.recovery` scans WAL files, reopens memtables per segment (most recent becomes active), and deletes segments ≤ the manifest’s log pointer. Entries are replayed via `wal.Manager.ReplaySegment` into fresh indexes and the active in-memory state is rebuilt.
Badger follows the same pattern, while RocksDB often uses skiplist-backed arenas with reference counting—NoKV reuses Badger’s arena allocator for simplicity.
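The Active → Immutable hand-off can be sketched as follows. Field and method names are simplified from `lsm/memtable.go`; the real rotation also calls `wal.Manager.SwitchSegment` and submits the frozen table to the flush manager.

```go
package main

import "fmt"

// Simplified rotation sketch; real code also switches the WAL segment and
// enqueues the frozen memtable for flushing.
type memTable struct {
	segmentID uint32
	walSize   int64
}

type lsmState struct {
	active       *memTable
	immutables   []*memTable // waiting on the flush queue
	maxFID       uint32
	memTableSize int64 // Options.MemTableSize budget
}

// maybeRotate freezes the active memtable when the incoming write would exceed
// the WAL-size budget, mirroring "walSize + estimate > MemTableSize".
func (l *lsmState) maybeRotate(estimate int64) bool {
	if l.active.walSize+estimate <= l.memTableSize {
		return false
	}
	l.immutables = append(l.immutables, l.active)
	l.maxFID++ // the new WAL segment FID tags the new memtable
	l.active = &memTable{segmentID: l.maxFID}
	return true
}

func main() {
	l := &lsmState{active: &memTable{segmentID: 1, walSize: 60}, maxFID: 1, memTableSize: 64}
	fmt.Println(l.maybeRotate(8), len(l.immutables), l.active.segmentID)
}
```

Because the budget is the WAL bytes consumed rather than index size, the frozen memtable maps one-to-one onto a WAL segment that can be released as a unit after flush.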
3. Read Semantics
- `memTable.Get` looks up the chosen index and returns a borrowed, ref-counted `*kv.Entry` from the internal pool. The index search returns the matched internal key plus value struct, so memtable hit entries carry the concrete version key instead of the query sentinel key. Internal callers must release borrowed entries with `DecrRef` when done.
- `MemTable.IncrRef`/`DecrRef` delegate to the index, allowing iterators to hold references while the flush manager processes immutable tables—mirroring RocksDB’s `MemTable::Ref`/`Unref` lifecycle.
- WAL-backed values that exceed the value threshold are stored as pointers; the memtable stores the encoded pointer, and the transaction/iterator logic reads from the vlog on demand.
- `DB.Get` returns detached entries; callers must not call `DecrRef` on them. `DB.GetInternalEntry` returns borrowed entries; callers must call `DecrRef` exactly once.
4. Integration with Other Subsystems
| Subsystem | Interaction |
|---|---|
| Distributed 2PC | kv.Apply + percolator write committed MVCC versions through the same WAL/memtable pipeline in raft mode. |
| Manifest | Flush completion logs EditLogPointer(segmentID) so restart can discard WAL files already persisted into SSTs. |
| Stats | Stats.Snapshot pulls FlushPending/Active/Queue counters via lsm.FlushMetrics, exposing how many immutables are waiting. |
| Value Log | lsm.flush emits discard stats keyed by segmentID, letting the value log GC know when entries become obsolete. |
5. Comparison
| Aspect | RocksDB | BadgerDB | NoKV |
|---|---|---|---|
| Data structure | Skiplist + arena | Skiplist + arena | Skiplist or ART + arena (art default) |
| WAL linkage | logfile_number per memtable | Segment ID stored in vlog entries | segmentID on memTable, logged via manifest |
| Recovery | Memtable replays from WAL, referencing MANIFEST | Replays WAL segments | Replays WAL segments, prunes ≤ manifest log pointer |
| Flush trigger | Size/entries/time | Size-based | WAL-size budget (walSize) with explicit queue metrics |
6. Operational Notes
- Tuning `Options.MemTableSize` affects WAL segment count and flush latency. Larger memtables reduce flush churn but increase crash recovery time.
- ART currently uses noticeably more memindex arena memory than skiplist because it stores both route keys and original internal keys in leaves; in local measurements the ART memindex is roughly 2x the skiplist memindex footprint for the same key/value set.
- Monitor `NoKV.Stats.flush.*` fields to catch stalled immutables—an ever-growing queue often indicates slow SST builds or manifest contention.
- Because memtables carry WAL segment IDs, deleting WAL files manually can lead to recovery failures; always rely on the engine’s manifest-driven cleanup.
See docs/flush.md for the end-to-end flush scheduler and [docs/architecture.md](architecture.md#3-end-to-end-write-flow) for where memtables sit in the write pipeline.
MemTable Flush Pipeline
NoKV’s flush path converts immutable memtables into L0 SST files, then advances the manifest WAL checkpoint and reclaims obsolete WAL segments. The task scheduler is in lsm/flush; SST persistence and manifest install are in lsm/builder.go and lsm/levels.go.
1. Responsibilities
- Persistence: materialize immutable memtables into SST files.
- Ordering: publish SST metadata to manifest only after the SST is durably installed (strict mode).
- Cleanup: remove WAL segments once checkpoint and raft constraints allow removal.
- Observability: export queue/build/release timing through flush metrics.
2. Stage Machine
```mermaid
flowchart LR
    Active[Active MemTable]
    Immutable[Immutable MemTable]
    FlushQ[flush.Manager queue]
    Build[StageBuild]
    Install[StageInstall]
    Release[StageRelease]
    Active -->|threshold reached| Immutable --> FlushQ
    FlushQ --> Build --> Install --> Release --> Active
```
- StagePrepare: `flush.Manager.Submit` enqueues the task and records wait-start time.
- StageBuild: worker pulls the task via `flush.Manager.Next`, builds the SST (`levelManager.flush` -> `openTable` -> `tableBuilder.flush`).
- StageInstall: after SST + manifest edits succeed, worker marks install complete (`flush.Manager.Update(..., StageInstall, ...)`).
- StageRelease: worker removes the immutable from the in-memory list, closes the memtable, records release metrics, and completes the task.
3. SST Persistence Modes
Flush uses two write modes controlled by `Options.ManifestSync`:
- Fast path (`ManifestSync=false`)
  - Writes the SST directly to its final filename with `O_CREATE|O_EXCL`.
  - No temp file/rename step.
  - Highest throughput, weaker crash-consistency guarantees.
- Strict path (`ManifestSync=true`)
  - Writes to `<table>.tmp.<pid>.<ns>`.
  - `tmp.Sync()` to persist SST bytes.
  - `RenameNoReplace(tmp, final)` installs the file atomically. If unsupported by the platform/filesystem, it returns `vfs.ErrRenameNoReplaceUnsupported`.
  - `SyncDir(workdir)` is called before the manifest edit so the directory entry is durable.
This is the durability ordering used by the current code.
4. Execution Path in Code
1. `lsm.Set`/`lsm.SetBatch` detects `walSize + estimate > MemTableSize` and rotates the memtable.
2. The rotated memtable is submitted to `flush.Manager` (`lsm.submitFlush`).
3. Worker executes `levelManager.flush(mt)`:
   - iterates memtable entries,
   - builds the SST via `tableBuilder`,
   - prepares manifest edits: `EditAddFile` + `EditLogPointer`.
4. In strict mode, `SyncDir` runs before `manifest.LogEdits(...)`.
5. On successful manifest commit, the table is added to L0 and `wal.RemoveSegment` runs when allowed.
5. Recovery Notes
- Startup rebuild (`levelManager.build`) validates manifest SST entries against disk.
- Missing or unreadable SSTs are treated as stale and removed from the manifest via `EditDeleteFile`, allowing startup to continue.
- Temp SST names are only used in strict mode and are created in `WorkDir` with suffix `.tmp.<pid>.<ns>` (not a dedicated `tmp/` directory).
6. Metrics & CLI
`flush.Manager.Stats()` feeds `StatsSnapshot.Flush`:
- `pending`, `queue`, `active`
- wait/build/release totals, counts, last, max
- `completed`
Use:

```shell
nokv stats --workdir <dir>
```

to inspect flush backlog and latency.
7. Related Tests
- `lsm/flush/manager_test.go`: queue/stage transitions and timing counters.
- `db_test.go::TestRecoveryWALReplayRestoresData`: replay still restores data after crash before flush completion.
- `db_test.go::TestRecoveryCleansMissingSSTFromManifest` and `db_test.go::TestRecoveryCleansCorruptSSTFromManifest`: stale manifest SST cleanup on startup.
See also recovery.md, memtable.md, and wal.md.
Compaction & Cache Strategy
NoKV’s compaction pipeline borrows the leveled‑LSM layout from RocksDB, but layers it with an ingest buffer, lightweight cache telemetry, and simple concurrency guards so the implementation stays approachable while still handling bursty workloads.
1. Overview
Compactions are orchestrated by compact.Manager with lsm.levelManager implementing the executor hooks. Each level owns two lists of tables:
- `tables` – the canonical sorted run for the level.
- `ingest` – a staging buffer that temporarily holds SSTables moved from the level above when there is not yet enough work (or bandwidth) to do a full merge.
The compaction manager periodically calls into the executor to build a list of compact.Priority entries. The priorities consider three signals:
- L0 table count – loosely capped by `Options.NumLevelZeroTables`.
- Level size vs target – computed by `levelTargets()`, which dynamically adjusts the “base” level depending on total data volume.
- Ingest buffer backlog – if a level’s `ingest` shards have data, they receive elevated scores so staged tables are merged promptly.
The highest adjusted score is processed first. L0 compactions can either move tables into the ingest buffer of the base level (cheap re‑parenting) or compact directly into a lower level when the overlap warrants it.
Planning now happens via compact.Plan: LSM snapshots table metadata into compact.TableMeta, compact.PlanFor* selects table IDs + key ranges, and LSM resolves the plan back to *table before executing.
2. Ingest Buffer
moveToIngest (see lsm/executor.go) performs a metadata-only migration:
- Records a `manifest.EditDeleteFile` for the source level.
- Logs a new `manifest.EditAddFile` targeting the destination level.
- Removes the table from `thisLevel.tables` and appends it to `nextLevel.ingest`.
This keeps write amplification low when many small L0 tables arrive at once. Reads still see the newest data because levelHandler.searchIngestSST checks ingest before consulting tables.
Compaction tests (lsm/compaction_test.go) assert that after calling moveToIngest the table disappears from the source level and shows up in the ingest buffer.
3. Concurrency Guards
To prevent overlapping compactions:
- `compact.State.CompareAndAdd` tracks the key range of each in-flight compaction per level.
- Attempts to register a compaction whose ranges intersect an existing one are rejected.
- When a compaction finishes, `compact.State.Delete` removes the ranges and table IDs from the guard.
This mechanism is intentionally simple—just a mutex‐protected slice—yet effective in tests (TestCompactStatusGuards) that simulate back‑to‑back registrations on the same key range.
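The range-overlap check at the heart of `CompareAndAdd` can be sketched as a single-level guard without the mutex (the real guard is mutex-protected and also tracks table IDs):

```go
package main

import (
	"bytes"
	"fmt"
)

// keyRange is a closed interval of keys.
type keyRange struct {
	left, right []byte
}

// overlaps reports whether two closed intervals intersect.
func (a keyRange) overlaps(b keyRange) bool {
	return bytes.Compare(a.left, b.right) <= 0 && bytes.Compare(b.left, a.right) <= 0
}

type compactState struct {
	inflight []keyRange
}

// compareAndAdd registers a compaction unless its range intersects one in flight.
func (s *compactState) compareAndAdd(r keyRange) bool {
	for _, cur := range s.inflight {
		if cur.overlaps(r) {
			return false
		}
	}
	s.inflight = append(s.inflight, r)
	return true
}

func main() {
	s := &compactState{}
	fmt.Println(s.compareAndAdd(keyRange{[]byte("a"), []byte("m")})) // registered
	fmt.Println(s.compareAndAdd(keyRange{[]byte("k"), []byte("z")})) // rejected: overlaps [a,m]
	fmt.Println(s.compareAndAdd(keyRange{[]byte("n"), []byte("z")})) // registered: disjoint
}
```

Linear scan over a small in-flight list is deliberately chosen over an interval tree: the number of concurrent compactions is bounded by `NumCompactors`, so the O(n) check is negligible.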
4. Cache Telemetry
NoKV’s cache is split into three parts (lsm/cache.go):
| Component | Purpose | Metrics hook |
|---|---|---|
| Block cache | Ristretto cache for L0/L1 blocks. | cacheMetrics.recordBlock(level, hit) |
| OS page cache path | Deeper levels bypass user-space cache and rely on mmap + kernel page cache. | Same as above |
| Bloom cache | Stores decoded bloom filters to reduce disk touches. | recordBloom(hit) |
Cache hit/miss signals are exported through StatsSnapshot.Cache (and surfaced by nokv stats / expvar), which is especially helpful when tuning ingest behaviour—if L0/L1 cache misses spike, the ingest buffer likely needs to be drained faster. TestCacheHotColdMetrics verifies cache hit accounting.
5. Interaction with Value Log
Compaction informs value‑log GC via discard statistics:
- During `subcompact`, every entry merged out is inspected. If it stores a `ValuePtr`, the amount is added to the discard map.
- At the end of subcompaction, the accumulated discard map is pushed through `setDiscardStatsCh`.
- `valueLog` receives the stats and can safely rewrite or delete vlog segments with predominantly obsolete data.
This tight coupling keeps the value log from growing indefinitely after heavy overwrite workloads.
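The discard flow can be sketched with a map keyed by segment ID. The type and helper names here are illustrative, as is the candidate check, which mirrors the spirit of `ValueLogGCDiscardRatio` rather than the exact GC scoring code.

```go
package main

import "fmt"

// discardStats accumulates obsolete bytes per vlog segment during subcompaction;
// the real flow pushes this map through setDiscardStatsCh at the end of the pass.
type discardStats map[uint32]int64 // vlog segment ID -> obsolete bytes

func (d discardStats) note(segID uint32, size int64) { d[segID] += size }

// gcCandidate reports whether a segment has enough dead bytes to be worth
// rewriting, in the spirit of ValueLogGCDiscardRatio.
func gcCandidate(d discardStats, segID uint32, segSize int64, discardRatio float64) bool {
	return float64(d[segID]) >= discardRatio*float64(segSize)
}

func main() {
	d := discardStats{}
	// entries merged out by compaction whose ValuePtr pointed at segment 3
	d.note(3, 40<<20)
	d.note(3, 30<<20)
	fmt.Println(gcCandidate(d, 3, 128<<20, 0.5)) // 70 MiB dead of 128 MiB >= 50%
}
```

Because the map is built as a by-product of work compaction already does, the value log gets accurate liveness data without a separate scan.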
6. Testing Checklist
Relevant tests to keep compaction healthy:
- `lsm/compaction_test.go`
  - `TestCompactionMoveToIngest` – ensures metadata migration works and the ingest buffer grows.
  - `TestCompactStatusGuards` – checks overlap detection.
- `lsm/cache_test.go`
  - `TestCacheHotColdMetrics` – validates cache hit accounting.
- `lsm/lsm_test.go`
  - `TestCompact`/`TestHitStorage` – end-to-end verification that data remains queryable across memtable flushes and compactions.
When adding new compaction heuristics or cache behaviour, extend these tests (or introduce new ones) so the behaviour stays observable.
7. Practical Tips
- Tune `Options.IngestCompactBatchSize` when ingest queues build up; increasing it lets a single move cover more tables.
- Observe `NoKV.Stats.cache.*` and `NoKV.Stats.compaction.*` via the CLI (`nokv stats`) to decide whether you need more compaction workers or bigger caches.
- For workloads dominated by range scans, consider increasing `Options.BlockCacheSize` if you want to keep more L0/L1 blocks in the user-space cache; cold data relies on the OS page cache.
- Keep an eye on `NoKV.Stats.value_log.gc` (for example `gc_runs` and `head_updates`); if compactions are generating discard stats but the value log head doesn’t move, GC thresholds may be too conservative.
With these mechanisms, NoKV stays resilient under bursty writes while keeping the code path small and discoverable—ideal for learning or embedding. Dive into the source files referenced above for deeper implementation details.
Ingest Buffer Architecture
The ingest buffer is a per-level staging area for SSTables—typically promoted from L0—designed to absorb bursts, reduce overlap, and unlock parallel compaction without touching the main level tables immediately. It combines fixed sharding, adaptive scheduling, and optional IngestKeep (ingest-merge) passes to keep write amplification and contention low.
```mermaid
flowchart LR
    L0["L0 SSTables"] -->|moveToIngest| Ingest["Ingest Buffer (sharded)"]
    subgraph levelN["Level N"]
        Ingest -->|IngestDrain: ingest-only| MainTables["Main Tables"]
        Ingest -->|IngestKeep: ingest-merge| Ingest
    end
    Ingest -.read path merge.-> ClientReads["Reads/Iterators"]
```
Design Highlights
- Sharded by key prefix: ingest tables are routed into fixed shards (top bits of the first byte). Sharding cuts cross-range overlap and enables safe parallel drain.
- Snapshot-friendly reads: ingest tables are read under the level `RLock`, and iterators hold table refs so mmap-backed data stays valid without additional snapshots.
- Two ingest paths:
  - Ingest-only compaction: drain ingest → main level (or next level) with optional multi-shard parallelism guarded by `compact.State`.
  - Ingest-merge: compact ingest tables back into ingest (stay in-place) to drop superseded versions before promoting, reducing downstream write amplification.
- IngestMode enum: plans carry an `IngestMode` with `IngestNone`, `IngestDrain`, and `IngestKeep`. `IngestDrain` corresponds to ingest-only (drain into main tables), while `IngestKeep` corresponds to ingest-merge (compact within ingest).
- Adaptive scheduling:
  - Shard selection is driven by `compact.PickShardOrder`/`compact.PickShardByBacklog` using per-shard size, age, and density.
  - Shard parallelism scales with backlog score (based on shard size/target file size) bounded by `IngestShardParallelism`.
  - Batch size scales with shard backlog to drain faster under pressure.
  - Ingest-merge triggers when backlog score exceeds `IngestBacklogMergeScore` (default 2.0), with dynamic lowering under extreme backlog/age.
- Observability: expvar/stats expose `IngestDrain` vs `IngestKeep` counts, duration, and tables processed, plus ingest size/value density per level/shard.
Configuration
- `IngestShardParallelism`: max shards to compact in parallel (default `max(NumCompactors/2, 2)`, auto-scaled by backlog).
- `IngestCompactBatchSize`: base batch size per ingest compaction (auto-boosted by shard backlog).
- `IngestBacklogMergeScore`: backlog score threshold to trigger `IngestKeep`/ingest-merge (default 2.0).
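The backlog-score trigger can be sketched as a ratio check. The function names and the fixed target file size are illustrative; the real planner also weighs shard age and value density, and dynamically lowers the threshold under extreme backlog.

```go
package main

import "fmt"

// backlogScore sketches the heuristic: shard size relative to the target file
// size, so a score of 2.0 means roughly two files' worth of staged data.
func backlogScore(shardSize, targetFileSize int64) float64 {
	if targetFileSize <= 0 {
		return 0
	}
	return float64(shardSize) / float64(targetFileSize)
}

// shouldIngestMerge fires IngestKeep when the score exceeds
// IngestBacklogMergeScore (default 2.0).
func shouldIngestMerge(shardSize, targetFileSize int64, mergeScore float64) bool {
	return backlogScore(shardSize, targetFileSize) > mergeScore
}

func main() {
	const target = 64 << 20 // 64 MiB target file size (illustrative)
	fmt.Println(shouldIngestMerge(96<<20, target, 2.0))  // 1.5x backlog: keep draining
	fmt.Println(shouldIngestMerge(160<<20, target, 2.0)) // 2.5x backlog: trigger IngestKeep
}
```

Keeping the trigger a simple ratio makes it easy to reason about against the exported per-shard size metrics.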
Benefits
- Lower write amplification: bursty L0 SSTables land in ingest first; `IngestKeep`/ingest-merge prunes duplicates before full compaction.
- Reduced contention: sharding + `compact.State` allow parallel ingest drain with minimal overlap.
- Predictable reads: ingest is part of the read snapshot, so moving tables in/out does not change read semantics.
- Tunable and observable: knobs for parallelism and merge aggressiveness, with per-path metrics to guide tuning.
Future Work
- Deeper adaptive policies (IO/latency-aware), richer shard-level metrics, and more exhaustive parallel/restart testing under fault injection.
WAL Subsystem
NoKV’s write-ahead log mirrors RocksDB’s durability model and is implemented as a compact Go module similar to Badger’s journal. WAL appends happen alongside memtable writes (via lsm.Set), while values that are routed to the value log are written before the WAL so that replay always sees durable value pointers.
1. File Layout & Naming
- Location: `${Options.WorkDir}/wal/`.
- Naming pattern: `%05d.wal` (e.g. `00001.wal`).
- Rotation threshold: configurable via `wal.Config.SegmentSize` (defaults to 64 MiB, minimum 64 KiB).
- Verification: `wal.VerifyDir` ensures the directory exists prior to `DB.Open`.
On open, wal.Manager scans the directory, sorts segment IDs, and resumes the highest ID—exactly how RocksDB re-opens its MANIFEST and WAL sequence files.
2. Record Format
```
uint32 length   (big-endian, includes type + payload)
uint8  type
[]byte payload
uint32 checksum (CRC32 Castagnoli over type + payload)
```
- Checksums use `kv.CastagnoliCrcTable`, the same polynomial used by RocksDB (Castagnoli). Record encoding/decoding lives in `wal/record.go`.
- The type byte allows mixing LSM mutations with raft log/state/snapshot records in the same WAL segment.
- Appends are buffered by `bufio.Writer` so batches become single system calls.
- Replay stops cleanly at truncated tails; tests simulate torn writes by truncating the final bytes and verifying replay remains idempotent (`wal/manager_test.go::TestManagerReplayHandlesTruncate`).
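The record layout above can be exercised with a small round-trip sketch. `encodeRecord`/`decodeRecord` mirror the framing in `wal/record.go` but are illustrative names, not the actual helpers.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeRecord frames one WAL record:
// uint32 length | uint8 type | payload | uint32 CRC32-C over type+payload.
func encodeRecord(typ byte, payload []byte) []byte {
	n := 1 + len(payload)
	buf := make([]byte, 4+n+4)
	binary.BigEndian.PutUint32(buf[:4], uint32(n))
	buf[4] = typ
	copy(buf[5:], payload)
	binary.BigEndian.PutUint32(buf[4+n:], crc32.Checksum(buf[4:4+n], castagnoli))
	return buf
}

// decodeRecord returns (type, payload, ok); ok=false on a truncated or corrupt
// tail, which is exactly where replay stops cleanly.
func decodeRecord(buf []byte) (byte, []byte, bool) {
	if len(buf) < 4 {
		return 0, nil, false
	}
	n := int(binary.BigEndian.Uint32(buf[:4]))
	if n < 1 || len(buf) < 4+n+4 {
		return 0, nil, false // torn write: not enough bytes for body + checksum
	}
	body := buf[4 : 4+n]
	if crc32.Checksum(body, castagnoli) != binary.BigEndian.Uint32(buf[4+n:4+n+4]) {
		return 0, nil, false // checksum mismatch: stop replay here
	}
	return body[0], body[1:], true
}

func main() {
	rec := encodeRecord(1, []byte("raft payload"))
	typ, payload, ok := decodeRecord(rec)
	fmt.Println(ok, typ, string(payload))
	_, _, ok = decodeRecord(rec[:len(rec)-2]) // simulate a truncated tail
	fmt.Println(ok)
}
```

Both failure modes (short buffer, bad CRC) collapse into the same `ok=false` signal, which is what makes replay after a torn write idempotent.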
3. Public API (Go)
```go
mgr, _ := wal.Open(wal.Config{Dir: path})
info, _ := mgr.AppendEntry(entry)
batchInfo, _ := mgr.AppendEntryBatch(entries)
typedInfos, _ := mgr.AppendRecords(wal.Record{
	Type:    wal.RecordTypeRaftEntry,
	Payload: raftPayload,
})
_ = mgr.Sync()
_ = mgr.Rotate()
_ = mgr.Replay(func(info wal.EntryInfo, payload []byte) error {
	// reapply to memtable
	return nil
})
```
Key behaviours:
- `AppendEntry`/`AppendEntryBatch`/`AppendRecords` automatically call `ensureCapacity` to decide when to rotate; they return `EntryInfo{SegmentID, Offset, Length}` so upper layers can keep manifest/raft pointers.
- `Sync` flushes the active file (used for `Options.SyncWrites`).
- `Rotate` forces a new segment (used after flush/compaction checkpoints similar to RocksDB’s `LogFileManager::SwitchLog`).
- `Replay` iterates segments in numeric order, forwarding each payload to the callback. Errors abort replay so recovery can surface corruption early.
- Metrics (`wal.Manager.Metrics`) reveal the active segment ID, total segments, and number of removed segments—these feed directly into `StatsSnapshot` and `nokv stats` output.
Compared with Badger: Badger keeps a single vlog for both data and durability. NoKV splits WAL (durability) from vlog (value separation), matching RocksDB’s separation of WAL and blob files.
4. Integration Points
| Call Site | Purpose |
|---|---|
| lsm.memTable.set | Encodes each entry (kv.EncodeEntry) and appends to WAL before inserting into the active memtable index (ART by default, skiplist when explicitly selected). |
| DB.commitWorker | Commit worker applies batched writes via writeToLSM, which calls lsm.SetBatch and appends one WAL entry-batch record per request batch. |
| DB.Set / DB.SetWithTTL / DB.Del / DB.ApplyInternalEntries | User/internal writes all flow through the same commit queue and eventually reach lsm.SetBatch + WAL append. |
| lsm/levels.go::flush | Persists WAL checkpoint via manifest.LogEdits(EditAddFile, EditLogPointer) during flush install. |
| lsm/levels.go::flush + lsm/levelManager.canRemoveWalSegment | Removes obsolete WAL segments after checkpoint/raft constraints are satisfied. |
| db.runRecoveryChecks | Ensures WAL directory invariants before manifest replay, similar to Badger’s directory bootstrap. |
5. Metrics & Observability
`Stats.collect` reads the manager metrics and exposes them as:
- `NoKV.Stats.wal.active_segment`
- `NoKV.Stats.wal.segment_count`
- `NoKV.Stats.wal.segments_removed`
The CLI command nokv stats --workdir <dir> prints these alongside backlog, making WAL health visible without manual inspection. In high-throughput scenarios the active segment ID mirrors RocksDB’s LOG number growth.
6. WAL Watchdog (Auto GC)
The WAL watchdog runs inside the DB process to keep WAL backlog in check and surface warnings when raft-typed records dominate the log. It:
- Samples WAL metrics + per-segment metrics and combines them with `manifest.RaftPointerSnapshot()` to compute removable segments.
- Removes up to `WALAutoGCMaxBatch` segments when at least `WALAutoGCMinRemovable` are eligible.
- Exposes counters (`wal.auto_gc_runs`/`removed`/`last_unix`) and warning state (`wal.typed_record_ratio`/`warning`/`reason`) through `StatsSnapshot.WAL`.
Relevant options (see options.go for defaults):
- `EnableWALWatchdog`
- `WALAutoGCInterval`
- `WALAutoGCMinRemovable`
- `WALAutoGCMaxBatch`
- `WALTypedRecordWarnRatio`
- `WALTypedRecordWarnSegments`
7. Recovery Walkthrough
1. `wal.Open` reopens the highest segment, leaving the file pointer at the end (`switchSegmentLocked`).
2. `manifest.Manager` supplies the WAL checkpoint (segment + offset) while building the version. Replay skips entries up to this checkpoint, ensuring we only reapply writes not yet materialised in SSTables.
3. `wal.Manager.Replay` (invoked by the LSM recovery path) rebuilds memtables from entries newer than the manifest checkpoint. Value-log recovery only validates/truncates segments and does not reapply data.
4. If the final record is partially written, the CRC mismatch stops replay and the segment is truncated during recovery tests, mimicking RocksDB’s tolerant behaviour.
8. Operational Tips
- Configure `Options.SyncWrites=true` for synchronous durability (default async, similar to RocksDB’s default).
- After large flushes, forcing `Rotate` keeps WAL files short, reducing replay time.
- Archived WAL segments can be copied alongside manifest files for hot-backup strategies—since the manifest contains the WAL log number, snapshots behave like RocksDB’s `Checkpoints`.
9. Truncation Metadata
- `raftstore/engine/wal_storage` keeps a per-group index of `[firstIndex, lastIndex]` spans for each WAL record so it can map raft log indices back to the segment that stored them.
- When a log is truncated (either via snapshot or future compaction hooks), the manifest is updated via `LogRaftTruncate` with the index/term, segment ID (`RaftLogPointer.SegmentIndex`), and byte offset (`RaftLogPointer.TruncatedOffset`) that delimit the remaining WAL data.
- `lsm/levelManager.canRemoveWalSegment` now blocks garbage collection whenever any raft group still references a segment through its truncation metadata, preventing slow followers from losing required WAL history while letting aggressively compacted groups release older segments earlier.
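The removal gate can be sketched as a predicate over segment IDs. The signature and inputs are heavily simplified from `lsm/levelManager.canRemoveWalSegment`; in particular, each raft group's reference is reduced here to the earliest segment it still needs.

```go
package main

import "fmt"

// canRemoveSegment sketch: a WAL segment is reclaimable only when the flush
// checkpoint has moved past it AND no raft group's truncation metadata still
// references it.
func canRemoveSegment(seg, checkpointSeg uint32, raftPointerSegs []uint32) bool {
	if seg >= checkpointSeg { // flush has not yet persisted past this segment
		return false
	}
	for _, p := range raftPointerSegs {
		if seg >= p { // a slow follower may still replay from this segment
			return false
		}
	}
	return true
}

func main() {
	raft := []uint32{5, 9} // earliest segment each raft group still references
	fmt.Println(canRemoveSegment(3, 7, raft)) // below checkpoint and all raft pointers
	fmt.Println(canRemoveSegment(6, 7, raft)) // blocked: the group at segment 5 still needs it
}
```

The conjunction is the key property: either constraint alone (flush checkpoint or raft pointers) is insufficient, because LSM durability and raft replication advance independently.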
For broader context, read the architecture overview and flush pipeline documents.
Value Log (vlog) Design
NoKV keeps the LSM tree lean by separating large values into sequential value log (vlog) files. The module is split between:
- `vlog/manager.go` – owns the open file set, rotation, and segment lifecycle helpers.
- `vlog/io.go` – append/read/iterate/verify/sample IO paths.
- `vlog.go` – integrates the manager with the DB write path, discard statistics, and garbage collection (GC).
The design echoes BadgerDB’s value log while remaining manifest-driven like RocksDB’s blob_db: vlog metadata (head pointer, pending deletions) is persisted inside the manifest so recovery can reconstruct the exact state without scanning the filesystem.
1. Layering (Engine View)
The value log is split into three layers so IO can stay reusable while DB policy lives in the core package:
- DB policy layer (`vlog.go`, `vlog_gc.go`) – integrates the manager with the DB write path, persists vlog metadata in the manifest, and drives GC scheduling based on discard stats.
- Value-log manager (`vlog/`) – owns segment lifecycle (open/rotate/remove), encodes/decodes entries, and exposes append/read/sample APIs without touching MVCC or LSM policy.
- File IO (`file/`) – mmap-backed `LogFile` primitives (open/close/truncate, read/write, read-only remap) shared by WAL/vlog/SST. Vlog currently uses `LogFile` directly instead of an intermediate store abstraction.
2. Directory Layout & Naming
<workdir>/
vlog/
bucket-000/
00000.vlog
00001.vlog
bucket-001/
00000.vlog
00001.vlog
...
- Files are named `%05d.vlog` and live under `workdir/vlog/bucket-XXX/` when `Options.ValueLogBucketCount > 1`. `Manager.populate` discovers existing segments at open.
- `Manager` tracks the active file ID (`activeID`) and byte offset; `Manager.Head` exposes these so the manifest can checkpoint them (`manifest.EditValueLogHead`).
- Files created after a crash but never linked in the manifest are removed during `valueLog.reconcileManifest`.
- When HotRing routing is enabled (`ValueLogHotBucketCount` + `ValueLogHotKeyThreshold`), buckets are split into hot vs cold ranges to isolate update-heavy keys from GC pressure.
3. Record Format
The vlog uses the shared encoding helper (kv.EncodeEntryTo), so entries written to the value log and the WAL are byte-identical.
+--------+----------+------+-------------+-----------+-------+
| KeyLen | ValueLen | Meta | ExpiresAt | Key bytes | Value |
+--------+----------+------+-------------+-----------+-------+
+ CRC32 (4 B)
- Header fields are varint-encoded (`kv.EntryHeader`).
- The first 20 bytes of every segment are reserved (`kv.ValueLogHeaderSize`) for future metadata; iteration always skips this fixed header.
- `kv.EncodeEntry` and the entry iterator (`kv.EntryIterator`) perform the layout work, and each append finishes with a CRC32 to detect torn writes.
- `vlog.VerifyDir` scans all segments with `sanitizeValueLog` to trim corrupted tails after crashes, mirroring RocksDB's `blob_file::Sanitize`. Badger performs a similar truncation pass at startup.
4. Manager API Surface
mgr, _ := vlog.Open(vlog.Config{Dir: "...", MaxSize: 1<<29})
ptr, _ := mgr.AppendEntry(entry)
ptrs, _ := mgr.AppendEntries(entries, writeMask)
val, unlock, _ := mgr.Read(ptr)
unlock() // release per-file lock
_ = mgr.Rewind(*ptr) // rollback partially written batch
_ = mgr.Remove(fid) // close + delete file
Key behaviours:
- Append + Rotate – `Manager.AppendEntry` encodes and appends into the active file. The reservation path handles rotation when the active segment would exceed `MaxSize`; manual rotation is rare.
- Crash recovery – `Manager.Rewind` truncates the active file and removes newer files when a write batch fails mid-flight. `valueLog.write` uses this to guarantee idempotent WAL/value-log ordering.
- Safe reads – `Manager.Read` returns an mmap-backed slice plus an unlock callback. Active segments take a per-file `RWMutex`, while sealed segments use a pin/unpin path to avoid long-held locks; callers that need ownership should copy the bytes before releasing the lock.
- Verification – `VerifyDir` validates entire directories (used by the CLI and recovery) by parsing headers and CRCs.
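The safe-read contract can be sketched as a copy-before-unlock helper; `read` here stands in for a closure over `Manager.Read` and a specific value pointer:

```go
// readOwned copies an mmap-backed value before releasing the per-file
// lock/pin, so the caller owns bytes that remain valid after unlock.
func readOwned(read func() (val []byte, unlock func(), err error)) ([]byte, error) {
	val, unlock, err := read()
	if err != nil {
		return nil, err
	}
	defer unlock() // release the per-file lock or pin promptly
	owned := make([]byte, len(val))
	copy(owned, val) // safe to keep after unlock; the mmap may be remapped
	return owned, nil
}
```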
Compared with RocksDB’s blob manager the surface is intentionally small—NoKV treats the manager as an append-only log with rewind semantics, while RocksDB maintains index structures inside the blob file metadata.
5. Integration with DB Writes
sequenceDiagram
participant Commit as commitWorker
participant Mgr as vlog.Manager
participant WAL as wal.Manager
participant Mem as MemTable
Commit->>Mgr: AppendEntries(entries, writeMask)
Mgr-->>Commit: ValuePtr list
Commit->>WAL: Append(entries+ptrs)
Commit->>Mem: apply to active memtable index
- `valueLog.write` builds a write mask for each batch, then delegates to `Manager.AppendEntries`. Entries staying in the LSM (`shouldWriteValueToLSM`) receive zero-value pointers.
- Rotation is handled inside the manager when the reserved bytes would exceed `MaxSize`. The WAL append happens after the value log append so crash replay observes consistent pointers.
- Any error triggers `Manager.Rewind` back to the saved head pointer, removing new files and truncating partial bytes. `vlog_test.go` exercises both append- and rotate-failure paths.
- All batched writes share the same pipeline: the commit worker always writes the value log first, then applies to WAL/memtable, keeping ordering and durability consistent.
Badger follows the same ordering (value log first, then write batch). RocksDB’s blob DB instead embeds blob references into the WAL entry before the blob file write, relying on two-phase commit between WAL and blob; NoKV avoids the extra coordination by reusing a single batching loop.
6. Discard Statistics & GC
flowchart LR Compaction -- "obsolete ptrs" --> DiscardStats DiscardStats -->|"batch json"| writeCh valuePtr["valueLog.newValuePtr(lfDiscardStatsKey)"] writeCh --> valuePtr valueLog -- "GC trigger" --> Manager
- `lfDiscardStats` aggregates per-file discard counts from LSM compaction workers (`lsm/executor.go`, `subcompact` -> `updateDiscardStats`). Once the in-memory counter crosses `discardStatsFlushThreshold`, it marshals the map into JSON and writes it back through the DB pipeline under the special key `!NoKV!discard`. `valueLog.flushDiscardStats` consumes those stats, ensuring they are persisted even across crashes. During recovery `valueLog.populateDiscardStats` replays the JSON payload to repopulate the in-memory map.
- GC uses `discardRatio = discardedBytes/totalBytes` derived from `Manager.Sample`, which applies windowed iteration based on configurable ratios. If a file exceeds the configured threshold, `valueLog.doRunGC` rewrites live entries into the current head via `db.batchSet` (the normal commit pipeline) and then `valueLog.rewrite` triggers manifest delete edits through `removeValueLogFile`.
- Sampling behaviour is controlled by `Options.ValueLogGCSampleSizeRatio` (default 0.10 of the file) and `Options.ValueLogGCSampleCountRatio` (default 1% of the configured entry limit). Setting either to `<= 0` keeps the default heuristics. `Options.ValueLogGCSampleFromHead` starts sampling from the beginning instead of a random window.
- Completed deletions are logged via `lsm.LogValueLogDelete` so the manifest can skip them during replay. When GC rotates to a new head, `valueLog.updateHead` records the pointer and increments `NoKV.Stats.value_log.gc.head_updates`.
RocksDB’s blob GC leans on compaction iterators to discover obsolete blobs. NoKV, like Badger, leverages flush/compaction discard stats so GC does not need to rescan SSTs.
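The trigger condition can be sketched in a few lines; `shouldGC` is an illustrative helper, not the actual NoKV API:

```go
// shouldGC marks a segment as a GC candidate when the sampled
// discardedBytes/totalBytes ratio crosses the configured threshold.
func shouldGC(discardedBytes, totalBytes uint64, discardRatio float64) bool {
	if totalBytes == 0 {
		return false // nothing sampled yet
	}
	return float64(discardedBytes)/float64(totalBytes) >= discardRatio
}
```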
7. GC Scheduling & Parallelism
NoKV runs value-log GC with bucket-aware parallelism while protecting the LSM from overload:
- `ValueLogGCParallelism` controls the maximum number of concurrent GC tasks. When set to `<= 0`, it auto-tunes to `max(NumCompactors/2, 1)` and is capped by the bucket count.
- Each bucket has a lock-free guard, so no two GC jobs run in the same bucket at once.
- A lightweight semaphore limits total concurrency without blocking the GC scheduler thread.
Pressure-aware throttling
Compaction backlog and score feed into the GC scheduler:
- Reduce: when the compaction backlog or max score crosses `ValueLogGCReduceBacklog` / `ValueLogGCReduceScore`, GC parallelism is halved.
- Skip: when the compaction backlog or max score crosses `ValueLogGCSkipBacklog` / `ValueLogGCSkipScore`, GC is skipped for that tick.
This keeps GC throughput high under light load but avoids compaction starvation under pressure.
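A minimal sketch of the skip/reduce policy, assuming positive thresholds; the `gcPressure` struct and `gcParallelism` function are illustrative, not NoKV's internal types:

```go
// gcPressure carries the compaction signals and the configured thresholds.
type gcPressure struct {
	Backlog, SkipBacklog, ReduceBacklog int
	Score, SkipScore, ReduceScore       float64
}

// gcParallelism returns 0 (skip this tick) above the skip thresholds,
// halves parallelism above the reduce thresholds, and otherwise
// keeps the configured base level.
func gcParallelism(base int, p gcPressure) int {
	if p.Backlog >= p.SkipBacklog || p.Score >= p.SkipScore {
		return 0 // skip GC this tick
	}
	if p.Backlog >= p.ReduceBacklog || p.Score >= p.ReduceScore {
		if half := base / 2; half >= 1 {
			return half
		}
		return 1
	}
	return base
}
```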
8. Recovery Semantics
- `DB.Open` restores the manifest and fetches the last persisted head pointer.
- `valueLog.open` launches `flushDiscardStats` and iterates every vlog file via `valueLog.replayLog`. Files marked invalid in the manifest are removed; valid ones are registered in the manager's file map.
- `valueLog.replayLog` streams entries to validate checksums and trims torn tails; it does not reapply data into the LSM. WAL replay remains the sole source of committed state.
- `Manager.VerifyDir` trims torn records so replay never sees corrupt payloads.
- After validation, `valueLog.populateDiscardStats` rehydrates discard counters from the persisted JSON entry if present.
The flow mirrors Badger’s vlog scanning but keeps state reconstruction anchored on WAL + manifest checkpoints, similar to RocksDB’s reliance on MANIFEST to mark blob files live or obsolete.
9. Observability & CLI
- Metrics in `stats.go` report segment counts, pending deletions, discard queue depth, and the GC head pointer via `expvar`.
- GC scheduling exposes `NoKV.Stats.value_log.gc` (including `gc_parallelism`, `gc_active`, `gc_scheduled`, `gc_throttled`, `gc_skipped`, `gc_rejected`) for diagnostics.
- `nokv vlog --workdir <dir>` loads a manager in read-only mode and prints the current head plus file status (valid, gc candidate). It invokes `vlog.VerifyDir` before describing segments.
- Recovery traces controlled by `RECOVERY_TRACE_METRICS` log every head movement and file removal, aiding pressure testing of GC edge cases. For ad-hoc diagnostics, enable `Options.ValueLogVerbose` to emit replay/GC messages to stdout.
10. Quick Comparison
| Capability | RocksDB BlobDB | BadgerDB | NoKV |
|---|---|---|---|
| Head tracking | In MANIFEST (blob log number + offset) | Internal to vlog directory | Manifest entry via EditValueLogHead |
| GC trigger | Compaction sampling, blob garbage score | Discard stats from LSM tables | Discard stats flushed through lfDiscardStats |
| Failure recovery | Blob DB and WAL coordinate two-phase commits | Replays value log then LSM | Rewind-on-error + manifest-backed deletes |
| Read path | Separate blob cache | Direct read + checksum | Manager.Read with copy + per-file lock |
By anchoring the vlog state in the manifest and exposing rewind/verify primitives, NoKV maintains the determinism of RocksDB while keeping Badger’s simple sequential layout.
11. Further Reading
- `docs/recovery.md` – failure matrix covering append crashes, GC interruptions, and manifest rewrites.
- `docs/cache.md` – how vlog-backed entries interact with the block cache.
- `docs/stats.md` – metric names surfaced for monitoring.
Manifest & Version Management
The manifest is NoKV’s metadata log for:
- SST files (`EditAddFile` / `EditDeleteFile`)
- WAL checkpoint (`EditLogPointer`)
- value-log metadata (`EditValueLogHead`, `EditDeleteValueLog`, `EditUpdateValueLog`)
- raft and region metadata (`EditRaftPointer`, `EditRegion`)
Implementation: manifest/manager.go, manifest/codec.go, manifest/types.go.
1. Files on Disk
WorkDir/
CURRENT
MANIFEST-000001
MANIFEST-000002
- `CURRENT` stores the active manifest filename. `CURRENT` is updated via a `CURRENT.tmp -> CURRENT` rename.
- `MANIFEST-*` stores append-only encoded edits.
2. In-Memory Version Model
type Version struct {
Levels map[int][]FileMeta
LogSegment uint32
LogOffset uint64
ValueLogs map[ValueLogID]ValueLogMeta
ValueLogHead map[uint32]ValueLogMeta
RaftPointers map[uint64]RaftLogPointer
Regions map[uint64]RegionMeta
}
- `Levels`: per-level SST metadata.
- `LogSegment` / `LogOffset`: WAL replay checkpoint.
- `ValueLogs` + `ValueLogHead`: all known vlog segments and per-bucket active heads.
- `RaftPointers` / `Regions`: raftstore progress + region metadata.
3. Edit Append Semantics
Manager.LogEdits(edits...) does:
1. Encode the edits to a buffer.
2. Write the encoded bytes to the current manifest file.
3. Conditionally call `manifest.Sync()` when:
   - `Manager.syncWrites == true`, and
   - at least one edit type requires sync (`Add`/`DeleteFile`, `LogPointer`, value-log edits).
4. Apply the edits to the in-memory `Version`.
5. Trigger a manifest rewrite if the file size crosses the threshold.
SetSync(bool) and SetRewriteThreshold(int64) are configured by LSM options.
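The conditional sync decision can be sketched as below. The `editKind` constants are hypothetical names for illustration; the real edit types and the exact set that forces a sync live in `manifest/types.go`:

```go
// editKind enumerates illustrative manifest edit categories.
type editKind int

const (
	editAddFile editKind = iota
	editDeleteFile
	editLogPointer
	editValueLogHead
	editRegion // metadata-only in this sketch; no forced sync
)

// needsSync returns true only when sync writes are enabled AND at least
// one edit in the batch belongs to a recovery-critical category.
func needsSync(syncWrites bool, edits []editKind) bool {
	if !syncWrites {
		return false
	}
	for _, e := range edits {
		switch e {
		case editAddFile, editDeleteFile, editLogPointer, editValueLogHead:
			return true
		}
	}
	return false
}
```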
4. Rewrite Flow
When rewrite threshold is exceeded (or Rewrite() is called):
1. Create the next `MANIFEST-xxxxxx`.
2. Write a full snapshot of the current `Version`.
3. Flush the writer, and `Sync()` the new manifest when `syncWrites` is enabled.
4. Update `CURRENT` to point to the new file.
5. Reopen the new manifest for appends and remove the old manifest file.
If rewrite fails before CURRENT update, restart continues using previous manifest.
5. Interaction with Other Modules
| Module | Manifest usage |
|---|---|
| `lsm/levels.go::flush` | Logs `EditAddFile` + `EditLogPointer` after SST install; compaction logs add/delete edits. |
| `lsm/levels.go::build` | During startup, missing/corrupt SST entries are marked stale and cleaned via `EditDeleteFile`. |
| `wal` | Replays from the manifest checkpoint (`LogSegment`, `LogOffset`). |
| `vlog` | Persists head/update/delete metadata and uses manifest state for stale/orphan cleanup on startup. |
| `raftstore` | Persists raft pointers and region metadata through manifest edits. |
6. Recovery-Relevant Guarantees
- Manifest append is ordered by single manager mutex.
- WAL replay starts from manifest checkpoint.
- Stale manifest SST entries are self-healed on startup (delete edit appended).
- `CURRENT` indirection protects against partial manifest rewrite publication.
7. Operational Commands
nokv manifest --workdir <dir>
Useful fields:
- `log_pointer.segment`, `log_pointer.offset`
- `levels[*].files`
- `value_log_heads`
- `value_logs[*].valid`
See recovery.md and flush.md for startup and flush ordering details.
VFS
The vfs package provides a small filesystem abstraction used by WAL, manifest, SST, value-log, and raftstore paths.
1. Core Interfaces
vfs.FS includes:
- file open/create: `OpenHandle`, `OpenFileHandle`
- path ops: `MkdirAll`, `Remove`, `RemoveAll`, `Rename`, `RenameNoReplace`, `Stat`, `ReadDir`, `Glob`
- helpers: `ReadFile`, `WriteFile`, `Truncate`, `Hostname`
vfs.File includes:
- read/write/seek APIs
- `Sync`, `Truncate`, `Close`, `Stat`, `Name`
vfs.Ensure(fs) maps nil to OSFS.
2. Rename Semantics
Rename:
- normal rename/move semantics.
- target replacement behavior is platform-dependent (`os.Rename` behavior).
RenameNoReplace:
- contract: fail with `os.ErrExist` when the destination already exists.
- on Linux, uses `renameat2(..., RENAME_NOREPLACE)`.
- on macOS, uses `renamex_np(..., RENAME_EXCL)`.
- when an atomic no-replace rename is unsupported by the platform/filesystem, returns `vfs.ErrRenameNoReplaceUnsupported` (no non-atomic fallback).
3. Directory Sync Helper
vfs.SyncDir(fs, dir) fsyncs directory metadata to persist entry updates (create/rename/remove).
This is used in strict durability paths (for example SST install before manifest publication) to guarantee directory entry persistence.
4. Fault Injection (FaultFS)
FaultFS wraps an underlying FS and can inject failures by operation/path.
- Rule helpers: `FailOnceRule`, `FailOnNthRule`, `FailAfterNthRule`, `FailOnceRenameRule`
- File-handle faults: write/sync/close/truncate
- Rename fault matching supports `src`/`dst` targeting
Used to test manifest/WAL/recovery failure paths deterministically.
5. Current Implementation Set
- `OSFS`: production implementation (Go `os` package).
- `FaultFS`: failure-injection wrapper over any `FS`.
No in-memory FS is included yet.
6. Design Notes
- Keep storage code decoupled from direct `os.*` calls.
- Make crash/failure tests reproducible.
- Keep the API minimal and only add operations required by real storage call sites.
References:
- Pebble VFS: https://pkg.go.dev/github.com/cockroachdb/pebble/vfs
- Pebble errorfs: https://pkg.go.dev/github.com/cockroachdb/pebble/vfs/errorfs
File Abstractions
The file package encapsulates direct file-system interaction for WAL, SST, and value-log files. It provides portable mmap helpers, allocation primitives, and log file wrappers.
1. Core Types
| Type | Purpose | Key Methods |
|---|---|---|
| `Options` | Parameter bag for opening files (FID, path, size). | Used by WAL/vlog managers. |
| `MmapFile` | Cross-platform mmap wrapper. | `OpenMmapFile`, `AppendBuffer`, `Truncate`, `Sync`. |
| `LogFile` | Value-log specific helper built on `MmapFile`. | `Open`, `Write`, `Read`, `DoneWriting`, `Truncate`, `Bootstrap`. |
Darwin-specific builds live alongside (mmap_darwin.go, sstable_darwin.go) ensuring the package compiles on macOS without manual tuning.
2. Mmap Management
- `OpenMmapFile` opens or creates a file, optionally extending it to `maxSz`, then mmaps it. The returned `MmapFile` exposes `Data []byte` and the underlying `*os.File` handle.
- Writes grow the map on demand: `AppendBuffer` checks if the write would exceed the current mapping and calls `Truncate` to expand (doubling up to 1 GiB increments).
- `Sync` flushes dirty pages (`mmap.Msync`), while `Delete` unmaps, truncates, closes, and removes the file; used when dropping SSTs or value-log segments.
RocksDB relies on custom Env implementations for portability; NoKV keeps the logic in Go, relying on build tags for OS differences.
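The doubling growth policy can be sketched as a pure size calculation; `growTo` is an illustrative helper, assuming increments cap at 1 GiB as described above:

```go
// growTo doubles the mapping size until it covers need, capping each
// growth step at 1 GiB so very large files grow linearly past that point.
func growTo(cur, need int64) int64 {
	const maxStep = 1 << 30 // 1 GiB cap per increment
	for cur < need {
		step := cur
		if step > maxStep {
			step = maxStep
		}
		if step == 0 {
			step = 1 // bootstrap an empty mapping
		}
		cur += step
	}
	return cur
}
```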
3. LogFile Semantics
LogFile wraps MmapFile to simplify value-log operations:
lf := &file.LogFile{}
_ = lf.Open(&file.Options{FID: 1, FileName: "00001.vlog", MaxSz: 1<<29})
var buf bytes.Buffer
payload, _ := kv.EncodeEntry(&buf, entry)
_ = lf.Write(offset, payload)
_ = lf.DoneWriting(offset + uint32(len(payload)))
- `Open` mmaps the file and records the current size (guarded to `< 4 GiB`).
- `Read` validates offsets against both the mmap length and the tracked size, preventing partial reads when GC or drop operations shrink the file.
- Entry encoding uses shared helpers in `kv` (`kv.EncodeEntry` / `kv.EncodeEntryTo`); `LogFile` focuses on write/read/truncate + durability semantics.
- `DoneWriting` guarantees durability for both the data bytes `[0, offset)` and the file metadata (size).
  - Sequence: it flushes dirty pages (`msync`), truncates the file to `offset`, and performs a file-descriptor-level sync (`fsync`) to ensure the new file size is persisted on disk before returning.
  - Contract: success implies that after a crash, the file size will not exceed `offset`, and all data prior to `offset` is safe.
  - After syncing, it reinitializes the mmap and keeps the file open in read-write mode for potential subsequent appends (if logic allows) or prepares it for read-only consumption.
- `Rewind` (via `vlog.Manager.Rewind`) leverages `LogFile.Truncate` and `Init` to roll back partial batches after errors.
4. SST Helpers
While SSTable builders/readers live under lsm/table.go, they rely on file helpers to map index/data blocks efficiently. The build tags (sstable_linux.go, sstable_darwin.go) provide OS-specific tuning for direct I/O hints or mmap flags.
5. Comparison
| Engine | Approach |
|---|---|
| RocksDB | C++ Env & random-access file wrappers. |
| Badger | y.File abstraction with mmap. |
| NoKV | Go-native mmap wrappers with explicit log helpers. |
By keeping all filesystem primitives in one package, NoKV ensures WAL, vlog, and SST layers share consistent behaviour (sync semantics, truncation rules) and simplifies testing (file/mmap_linux_test.go).
6. Operational Notes
DoneWritingprovides strong crash-consistency guarantees. Even on filesystems whereftruncatemetadata persistence is asynchronous, the explicit post-truncatefsyncensures the file size is durable upon success.- Value-log and WAL segments rely on
DoneWriting/Truncateto seal files; avoid manipulating files externally or mmap metadata may desynchronise. LogFileupdates cached size internally onWrite/Truncate, so read bounds stay consistent during rewrite/rewind flows.vfs.SyncDiris used by strict durability flows to persist directory entry changes (create/rename/remove). For example, strict SST flush callsSyncDir(workdir)before manifest publication.
For more on how these primitives plug into higher layers, see docs/wal.md and docs/vlog.md.
Cache & Bloom Filters
NoKV’s LSM tier layers a multi-level block cache with bloom filter caching to accelerate lookups. The implementation is in lsm/cache.go.
1. Components
| Component | Purpose | Source |
|---|---|---|
| `cache.indexs` | Table index cache (fid → `*pb.TableIndex`) reused across reopen. | `utils/cache` |
| `blockCache` | Ristretto-based block cache (L0/L1 only) with per-table direct slots. | `lsm/cache.go` |
| `bloomCache` | LRU cache of bloom filter bitsets per SST. | `lsm/cache.go` |
| `cacheMetrics` | Atomic hit/miss counters for L0/L1 blocks and blooms. | `lsm/cache.go#L30-L110` |
Badger uses a similar block cache split (Pinner/Cache) while RocksDB exposes block cache(s) via the BlockBasedTableOptions. NoKV keeps it Go-native and GC-friendly.
1.1 Index Cache & Handles
- SSTable metadata stays with the `table` struct, while decoded protobuf indexes are stored in `cache.indexs`. Lookups first hit the cache before falling back to disk.
- SST handles are reopened on demand for lower levels. L0/L1 tables keep their file descriptors pinned, while deeper levels close them once no iterator is using the table.
2. Block Cache Strategy
- User-space block cache (L0/L1, parsed blocks, Ristretto LFU-ish).
- Deeper levels rely on the OS page cache + mmap readahead.
- `Options.BlockCacheSize` sets the capacity in blocks (cost=1 per block). Entries keep parsed blocks (data slice + offsets/baseKey/checksum), so hits avoid re-parsing.
- Per-table direct slots (`table.cacheSlots[idx]`) give a lock-free fast path. Misses fall back to the shared Ristretto cache (approximately LFU with admission).
- Evictions clear the table slot via `OnEvict`; the user-space cache only tracks L0/L1 blocks. Deeper levels depend on the OS page cache.
- Access patterns: `getBlock` also updates hit/miss metrics for L0/L1; deeper levels bypass the cache and do not affect metrics.
flowchart LR Read --> CheckCache CheckCache -->|hit| Return CheckCache -->|miss| LoadFromTable["LoadFromTable (mmap + OS page cache)"] LoadFromTable --> InsertCache InsertCache --> Return
By default only L0 and L1 blocks are cached (level > 1 short-circuits), reflecting the higher re-use for top levels.
3. Bloom Cache
- `bloomCache` stores the raw filter bitset (`utils.Filter`) per table ID. Entries are deep-copied (`SafeCopy`) to avoid sharing memory with mmaps.
- The cache policy is LRU.
- Capacity is controlled by `Options.BloomCacheSize`.
- Bloom hits/misses are recorded via `cacheMetrics.recordBloom`, feeding into `StatsSnapshot.Cache.BloomHitRate`.
4. Metrics & Observability
cache.metricsSnapshot() produces:
type CacheMetrics struct {
L0Hits, L0Misses uint64
L1Hits, L1Misses uint64
BloomHits, BloomMisses uint64
IndexHits, IndexMisses uint64
}
Stats.Snapshot converts these into hit rates. Monitor them alongside the block cache sizes to decide when to scale memory.
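The hit-rate conversion is straightforward; a minimal sketch (illustrative, not the actual `Stats.Snapshot` code):

```go
// hitRate converts raw hit/miss counters into a ratio in [0, 1],
// returning 0 when no accesses have been recorded yet.
func hitRate(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}
```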
5. HotRing Integration
- Hot detection: HotRing counts on read/write paths and triggers targeted prefetch for hot keys.
- Cache warmup: prefetch loads target blocks into the normal L0/L1 block cache path.
- Compaction coupling: HotRing top-k feeds compaction scoring; levels/ingest shards covering hot ranges get higher scores to trim overlap sooner.
- Tuning: Hot thresholds come from HotRing options (window/decay configurable).
6. Interaction with Value Log
- Keys stored as value pointers (large values) still populate block cache entries for the key/index block. The value payload is read directly from the vlog (`valueLog.read`), so block cache hit rates remain meaningful.
- Discard stats from flushes can demote cached blocks via `cache.dropBlock`, ensuring obsolete SST data leaves the cache quickly.
7. Comparison
| Feature | RocksDB | BadgerDB | NoKV |
|---|---|---|---|
| Block cache policy | Configurable multiple caches | Single cache | Ristretto for L0/L1 + OS page cache for deeper levels |
| Bloom cache | Enabled per table, no explicit cache | Optional | Dedicated LRU storing filters |
| Metrics | Block cache stats via GetAggregatedIntProperty | Limited | NoKV.Stats.cache.* hit rates |
8. Operational Tips
- If the bloom hit rate falls below ~60%, consider increasing bits-per-key or the bloom cache size.
- Track `nokv stats --json` cache metrics over time; drops often indicate iterator misuse or working-set shifts.
More on SST layout lives in docs/manifest.md and docs/architecture.md.
HotRing – Hot Key Tracking
hotring is NoKV’s hot-key tracker. It samples read/write access frequency per key and exposes the hottest entries to the stats subsystem and CLI. The implementation lives in hotring/ inside this repository.
1. Motivation
- Cache hints – `DB.prefetchLoop` (see `db.go`) consumes hot keys to schedule asynchronous reads into the block cache.
- Operational insight – `StatsSnapshot.Hot.ReadKeys` and `nokv stats --json` surface the hottest keys, aiding debugging of traffic hotspots.
- Throttling – `HotRing.TouchAndClamp` enables simple rate caps: once a key crosses a threshold, callers can back off or log alerts.
Compared with RocksDB (which exposes block access stats via perf_context) and Badger (which lacks built-in hot-key reporting), NoKV offers a lightweight but concurrent-friendly tracker out of the box.
2. Data Structure
HotRing
buckets[] -> per-bucket lock-free linked list (Node)
hashFn -> hash(key) -> uint32
hashMask -> selects bucket (power of two size)
- Each bucket stores a sorted linked list of `Node` ordered by `(tag, key)`, where `tag` is derived from the upper bits of the hash. Head pointers are `atomic.Pointer[Node]`, so readers walk the list without taking locks; writers use CAS to splice nodes while preserving order.
- `defaultTableBits = 12` → 4096 buckets by default (`NewHotRing`). The mask ensures cheap modulo operations.
- Nodes keep a `count` (`int32`) updated atomically and a `next` pointer stored via `unsafe.Pointer`. Sliding-window state is guarded by a tiny per-node spin lock instead of a process-wide mutex.
flowchart LR Key(key) -->|hash| Bucket["buckets[index] (atomic head)"] Bucket --> Node1 Node1 --> Node2 Node2 --> Node3 Node3 --> Nil[(nil)]
3. Core Operations
| Method | Behaviour | Notes |
|---|---|---|
Touch | Insert or increment key’s counter. | CAS-splices a new node if missing, then increments (window-aware when enabled). |
Frequency | Read-only counter lookup. | Lock-free lookup; uses sliding-window totals when configured. |
TouchAndClamp | Increment unless count >= limit, returning (count, limited). | Throttling follows sliding-window totals so hot bursts clamp quickly. |
TopN | Snapshot hottest keys sorted by count desc. | Walks buckets without locks, then sorts a copy. |
KeysAbove | Return all keys with counters ≥ threshold. | Handy for targeted throttling or debugging hot shards. |
Bucket ordering is preserved by findOrInsert, which CASes either the bucket head or the predecessor’s next pointer to splice new nodes. Reads never take locks; only per-node sliding-window updates spin briefly to avoid data races.
4. Integration Points
- DB reads – `DB.Get*` and iterators call `db.recordRead`, which invokes `HotRing.Touch` on a read-only ring for every successful lookup.
- Write throttling & hot batching – writes are tracked by a write-only ring. When `Options.WriteHotKeyLimit > 0`, writes use `TouchAndClamp` to enforce throttling; when throttling is disabled but `HotWriteBurstThreshold > 0`, writes still `Touch` so hot batching can trigger.
- Stats – `StatsSnapshot.Hot.ReadKeys` and `StatsSnapshot.Hot.WriteKeys` publish read/write hot keys. `expvar` exposes these under `NoKV.Stats.hot.read_keys` and `NoKV.Stats.hot.write_keys`.
- Caching – hot reads trigger asynchronous prefetch into the normal L0/L1 block cache path.
- Value log routing – a dedicated HotRing instance powers vlog hot/cold bucket routing. It tracks write hotness only (no read signal) to avoid polluting bucket selection. Hot keys are routed to hot buckets (`ValueLogHotBucketCount`) once `ValueLogHotKeyThreshold` is reached; cold keys go to the cold range.
5. Comparisons
| Engine | Approach |
|---|---|
| RocksDB | External – TRACE / perf_context requires manual sampling. |
| Badger | None built-in. |
| NoKV | In-process ring with expvar/CLI export and throttling helpers. |
The HotRing emphasises simplicity: lock-free bucket lists with atomic counters (plus optional per-node window tracking), avoiding sketches while staying light enough for hundreds of thousands of hot keys.
6. Operational Tips
- `Options.HotRingTopK` controls how many keys show up in stats; the default is 16. Increase it when investigating workloads with broad hot sets.
- Combine `TouchAndClamp` with request middleware to detect abusive tenants: when `limited` is true, log the key and latency impact.
- Resetting the ring is as simple as instantiating a new `HotRing`; useful for benchmarks that require clean counters between phases.
For end-to-end examples see docs/stats.md and the CLI walkthrough in docs/cli.md.
6.1 Default Configuration
Global HotRing defaults (NewDefaultOptions):
| Option | Default value | Notes |
|---|---|---|
HotRingEnabled | true | Master switch for DB hot tracking. |
HotRingBits | 12 | 4096 buckets. |
HotRingTopK | 16 | Top-K hot keys for stats/CLI. |
HotRingDecayInterval | 0 | Decay disabled by default. |
HotRingDecayShift | 0 | Decay disabled by default. |
HotRingWindowSlots | 8 | Sliding window enabled. |
HotRingWindowSlotDuration | 250ms | ~2s window. |
HotRingRotationInterval | 30m | Dual-ring rotation enabled. |
HotRingNodeCap | 250,000 | Strict cap per ring. |
HotRingNodeSampleBits | 0 | Strict cap (no sampling). |
Value-log override defaults (ValueLogHotRing*):
| Option | Default value | Notes |
|---|---|---|
ValueLogHotRingOverride | true | Use dedicated VLog settings. |
ValueLogHotRingBits | 12 | 4096 buckets. |
ValueLogHotRingRotationInterval | 10m | Faster rotation for write-hotness. |
ValueLogHotRingNodeCap | 200,000 | Strict cap per ring. |
ValueLogHotRingNodeSampleBits | 0 | Strict cap (no sampling). |
ValueLogHotRingDecayInterval | 0 | Decay disabled (window handles recency). |
ValueLogHotRingDecayShift | 0 | Decay disabled. |
ValueLogHotRingWindowSlots | 6 | ~600ms window. |
ValueLogHotRingWindowSlotDuration | 100ms | Shorter write-hotness window. |
When ValueLogHotRingOverride=false, the value-log ring inherits the global HotRing
settings. When override is enabled, zeros disable features (except bits=0,
which falls back to the ring default).
7. Write-Path Throttling
Options.WriteHotKeyLimit wires the write-only HotRing into the write path. When set to a positive integer, user writes (DB.Set, DB.SetWithTTL) and internal writes (DB.ApplyInternalEntries) invoke HotRing.TouchAndClamp with the limit. Once a key (optionally scoped by column family via cfHotKey) reaches the limit, the write is rejected with utils.ErrHotKeyWriteThrottle. If throttling is disabled but HotWriteBurstThreshold > 0, the write ring still tracks frequency to enable hot write batching. This keeps pathological tenants or hot shards from overwhelming a single Raft group without adding heavyweight rate-limiters to the client stack.
Operational hints:
- `StatsSnapshot.Write.HotKeyLimited` and the CLI line `Write.HotKeyThrottled` expose how many writes have been rejected since the process started.
- Applications should surface `utils.ErrHotKeyWriteThrottle` to callers (e.g. HTTP 429) so clients can back off.
- Prefetching continues to run independently; only writes are rejected. Reads still register hotness so the cache layer knows what to prefetch.
- Set the limit conservatively (e.g. a few dozen) and pair it with richer `HotRing` analytics (top-K stats, expvar export) to identify outliers before tuning.
8. Time-Based Decay & Sliding Window
HotRing now exposes two complementary controls so “old” hotspots fade away automatically:
- Periodic decay (`Options.HotRingDecayInterval` + `HotRingDecayShift`) – every `interval` the global counters are right-shifted (`count >>= shift`). This keeps `TopN` and stats output focused on recent traffic even if writes stop abruptly.
- Sliding window (`Options.HotRingWindowSlots` + `HotRingWindowSlotDuration`) – per-key windows split time into `slots`, each lasting `slotDuration`. `Touch` only accumulates inside the current slot; once the window slides past, the stale contribution is dropped. `TouchAndClamp` and `Frequency` use the sliding-window total, so write throttling reflects short-term pressure instead of lifetime counts.
Disable either mechanism by setting the interval/durations to zero. Typical starting points:
| Option | Default value | Effect |
|---|---|---|
HotRingDecayInterval | 0 | Decay disabled by default. |
HotRingDecayShift | 0 | Decay disabled by default. |
HotRingWindowSlots | 8 | Keep ~8 buckets of recency data. |
HotRingWindowSlotDuration | 250ms | Roughly 2s window for throttling. |
With both enabled, the decay loop keeps background stats tidy while the sliding window powers precise, short-term throttling logic.
Note: in NoKV, configuration normalization treats the sliding window as higher priority. If a window is enabled, decay is automatically disabled to avoid redundant background work.
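The decay pass itself is just a right-shift over the counters; a minimal sketch (illustrative, not the hotring package's internal loop):

```go
// decayCounters right-shifts every counter so stale hotspots fade
// geometrically: shift=1 halves each count per interval, so an idle
// key approaches zero after a few intervals.
func decayCounters(counts []int32, shift uint) {
	for i := range counts {
		counts[i] >>= shift
	}
}
```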
9. Bounding Growth (Node Cap & Rotation)
HotRing does not automatically evict keys. To keep memory predictable in high-cardinality workloads, use a node cap (with optional sampling) and/or ring rotation.
Node cap + sampling
- `Options.HotRingNodeCap` sets a per-ring upper bound on tracked keys.
- `Options.HotRingNodeSampleBits` controls stable sampling once the cap is hit:
  - `0` = strict cap (no new keys after the cap).
  - `N` = allow roughly `1/2^N` of new keys (soft cap).
- When `HotRingNodeCap = 0`, sampling is disabled.
Dual-ring rotation
`Options.HotRingRotationInterval` enables dual-ring rotation:
- the active ring receives new touches
- the warm ring keeps the previous generation to avoid sudden drops
- merge semantics:
  - `Frequency`/`TouchAndClamp` → `max(active, warm)`
  - `TopN`/`KeysAbove` → `sum(active, warm)`
Memory note: rotation keeps two rings, so the upper bound is roughly
2 × HotRingNodeCap. If you have a fixed budget, halve the per-ring cap.
Suggested starting points:
| Option | Effect |
|---|---|
| HotRingNodeCap | Hard cap per ring (0 disables). |
| HotRingNodeSampleBits | Soft cap sampling rate. |
| HotRingRotationInterval | Rotation period (0 disables). |
10. Value Log Overrides
NoKV maintains a value-log HotRing dedicated to hot/cold routing. By default this override is enabled so the write-only ring can use faster rotation and a shorter window. You can disable it to inherit the global HotRing config:
- `Options.ValueLogHotRingOverride = false` (inherit global settings)
- Or keep it enabled and tune the `ValueLogHotRing*` fields explicitly.
When override is enabled, the value-log ring uses the override values verbatim;
zeros disable a feature (for example, rotation). If override is disabled, it
inherits the global HotRing* configuration.
Percolator Distributed Transaction Design
This document explains NoKV’s distributed transaction path implemented by percolator/ and executed through raftstore.
The scope here is the current code path:
- `Prewrite`
- `Commit`
- `BatchRollback`
- `ResolveLock`
- `CheckTxnStatus`
- MVCC read visibility (`KvGet`/`KvScan` through `percolator.Reader`)
1. Where It Runs
Percolator logic is executed on the Raft apply path:
1. The client sends a NoKV RPC (`KvPrewrite`, `KvCommit`, …).
2. `raftstore/kv/service.go` wraps it into a `RaftCmdRequest`.
3. The store proposes the command through Raft.
4. On apply, `raftstore/kv/apply.go` dispatches to `percolator.*`.
sequenceDiagram
participant C as raftstore/client
participant S as kv.Service
participant R as Raft (leader->followers)
participant A as kv.Apply
participant P as percolator
participant DB as NoKV DB
C->>S: KvPrewrite/KvCommit...
S->>R: ProposeCommand(RaftCmdRequest)
R->>A: Apply committed log
A->>P: percolator.Prewrite/Commit...
P->>DB: CFDefault/CFLock/CFWrite reads+writes
A-->>S: RaftCmdResponse
S-->>C: NoKV RPC response
Key files:
- `percolator/txn.go`
- `percolator/reader.go`
- `percolator/codec.go`
- `percolator/latch/latch.go`
- `raftstore/kv/apply.go`
- `raftstore/client/client.go`
1.1 RPC to Percolator Function Mapping
| NoKV RPC | kv.Apply branch | Percolator function |
|---|---|---|
| KvPrewrite | CMD_PREWRITE | Prewrite |
| KvCommit | CMD_COMMIT | Commit |
| KvBatchRollback | CMD_BATCH_ROLLBACK | BatchRollback |
| KvResolveLock | CMD_RESOLVE_LOCK | ResolveLock |
| KvCheckTxnStatus | CMD_CHECK_TXN_STATUS | CheckTxnStatus |
| KvGet | CMD_GET | Reader.GetLock + Reader.GetValue |
| KvScan | CMD_SCAN | Reader.GetLock + CFWrite iteration + GetInternalEntry |
2. MVCC Data Model
NoKV uses three MVCC column families:
- `CFDefault`: stores user values at `start_ts`
- `CFLock`: stores lock metadata at the fixed `lockColumnTs = MaxUint64`
- `CFWrite`: stores commit records at `commit_ts`
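A common way to realize this layout (the actual encoding lives in `percolator/codec.go`; the functions below are an illustrative sketch, not NoKV's codec) is to suffix the user key with an inverted big-endian timestamp, so newer versions of a key sort first under bytewise ordering:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeTsKey appends an inverted big-endian timestamp to the user key.
// Inverting the bits makes a larger ts produce smaller bytes, so the
// newest version of a key sorts first in an ascending iterator.
func encodeTsKey(userKey []byte, ts uint64) []byte {
	out := make([]byte, 0, len(userKey)+8)
	out = append(out, userKey...)
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], ^ts)
	return append(out, buf[:]...)
}

// decodeTs recovers the timestamp from the last 8 bytes.
func decodeTs(encoded []byte) uint64 {
	return ^binary.BigEndian.Uint64(encoded[len(encoded)-8:])
}

func main() {
	k := encodeTsKey([]byte("row1"), 42)
	fmt.Println(decodeTs(k)) // 42
}
```

With this scheme, `CFDefault` entries use `start_ts` as the suffix, `CFWrite` entries use `commit_ts`, and `CFLock` uses the fixed `MaxUint64`, so the lock for a key always sorts at a known position.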
2.1 Lock Record
percolator.Lock (encoded by EncodeLock):
- `PrimaryTs` (start timestamp)
- `TTL`
- `Kind` (Put/Delete/Lock)
- `MinCommitTs`
2.2 Write Record
percolator.Write (encoded by EncodeWrite):
- `Kind`
- `StartTs`
- `ShortValue` (the codec supports it; the current commit path does not populate it)
3. Concurrency Control: Latches
Before mutating keys, percolator acquires striped latches:
- `latch.Manager` hashes keys to stripe mutexes.
- Stripes are deduplicated and acquired in sorted order to avoid deadlocks.
- The guard releases them in reverse order.
In raftstore/kv, latches are passed explicitly:
- `NewEntryApplier` creates one `latch.NewManager(512)` and reuses it.
- `Apply`/`NewApplier` accept an injected manager; `nil` falls back to `latch.NewManager(512)`.
This serializes conflicting apply operations on overlapping keys in one node.
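The striped-latch scheme above can be sketched in a few lines. This is a minimal illustration under assumed names, not the actual `latch.Manager` API:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

// latchManager hashes keys to a fixed set of stripe mutexes.
type latchManager struct {
	stripes []sync.Mutex
}

func newLatchManager(n int) *latchManager {
	return &latchManager{stripes: make([]sync.Mutex, n)}
}

func (m *latchManager) stripeOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(m.stripes)))
}

// Acquire deduplicates the stripes for keys, locks them in sorted
// order (a total order prevents lock-order inversion between two
// callers touching overlapping key sets), and returns a release func
// that unlocks in reverse order.
func (m *latchManager) Acquire(keys []string) (release func()) {
	seen := map[int]bool{}
	var ids []int
	for _, k := range keys {
		if id := m.stripeOf(k); !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	for _, id := range ids {
		m.stripes[id].Lock()
	}
	return func() {
		for i := len(ids) - 1; i >= 0; i-- {
			m.stripes[ids[i]].Unlock()
		}
	}
}

func main() {
	m := newLatchManager(512)
	release := m.Acquire([]string{"a", "b", "a"})
	fmt.Println("latched")
	release()
}
```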
4. Two-Phase Commit Flow
Client side (raftstore/client.Client.TwoPhaseCommit):
- Group mutations by region.
- Prewrite primary region.
- Prewrite secondary regions.
- Commit primary region.
- Commit secondary regions.
sequenceDiagram
participant Cli as Client
participant R1 as Region(primary)
participant R2 as Region(secondary)
Cli->>R1: Prewrite(primary + local muts)
Cli->>R2: Prewrite(secondary muts)
Cli->>R1: Commit(keys,startTs,commitTs)
Cli->>R2: Commit(keys,startTs,commitTs)
5. Write-Side Operations
5.1 Prewrite
Prewrite runs mutation-by-mutation:
- Check the existing lock on the key:
  - if a lock exists with a different `Ts` -> `KeyError.Locked`
- Check the latest committed write:
  - if `commit_ts >= req.start_version` -> `WriteConflict`
- Apply the data intent:
  - `Put`: write the value into `CFDefault` at `start_ts`
  - `Delete`/`Lock`: delete the default value at `start_ts` (if it exists)
- Write the lock into `CFLock` at `lockColumnTs`
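The two conflict checks above condense into a small decision function. This is an illustrative sketch (names and the zero-means-absent encoding are assumptions, not NoKV's API):

```go
package main

import "fmt"

type prewriteError string

const (
	errLocked        prewriteError = "Locked"
	errWriteConflict prewriteError = "WriteConflict"
	prewriteOK       prewriteError = ""
)

// checkPrewrite condenses the per-mutation checks: lockTs==0 means "no
// lock"; latestCommitTs==0 means "no committed write". A lock held by
// the same start_ts (a retried prewrite) is not a conflict.
func checkPrewrite(startTs, lockTs, latestCommitTs uint64) prewriteError {
	if lockTs != 0 && lockTs != startTs {
		return errLocked // another txn holds the lock
	}
	if latestCommitTs >= startTs {
		return errWriteConflict // a commit landed after our snapshot
	}
	return prewriteOK // safe to install intent + lock
}

func main() {
	fmt.Println(checkPrewrite(10, 7, 0) == errLocked)         // true
	fmt.Println(checkPrewrite(10, 0, 12) == errWriteConflict) // true
	fmt.Println(checkPrewrite(10, 0, 5) == prewriteOK)        // true
}
```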
5.2 Commit
For each key:
- Read the lock.
- If there is no lock:
  - if a write with the same `start_ts` exists -> idempotent success
  - else -> abort (`lock not found`)
- If the lock's `Ts != start_version` -> `KeyError.Locked`
- `commitKey`:
  - if `min_commit_ts > commit_version` -> `CommitTsExpired`
  - if a write with the same `start_ts` already exists:
    - rollback write -> abort
    - write with a different commit ts -> treat as success, clean the lock
    - same commit ts -> success
  - else write `CFWrite[key@commit_ts] = {kind, start_ts}`
  - remove the lock from `CFLock`
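A simplified view of the per-key commit decision above (constant and function names are illustrative, and some sub-cases, such as the different-commit-ts success path, are folded into the main outcomes):

```go
package main

import "fmt"

type commitOutcome string

const (
	commitApply      commitOutcome = "write-record+unlock" // normal path
	commitIdempotent commitOutcome = "already-committed"    // retried commit
	commitAbort      commitOutcome = "abort"
	commitLockedErr  commitOutcome = "locked-by-other"
	commitTsExpired  commitOutcome = "commit-ts-expired"
)

// decideCommit condenses the decision tree: missing-lock handling
// first, then lock ownership, then the min_commit_ts check, then
// rollback detection.
func decideCommit(startTs, commitTs, lockTs, minCommitTs uint64,
	hasLock, writeAtStartTs, writeIsRollback bool) commitOutcome {
	if !hasLock {
		if writeAtStartTs && !writeIsRollback {
			return commitIdempotent // commit already landed earlier
		}
		return commitAbort // lock not found
	}
	if lockTs != startTs {
		return commitLockedErr // lock belongs to another txn
	}
	if minCommitTs > commitTs {
		return commitTsExpired // caller must fetch a newer commit ts
	}
	if writeAtStartTs && writeIsRollback {
		return commitAbort // txn was rolled back meanwhile
	}
	return commitApply
}

func main() {
	fmt.Println(decideCommit(5, 9, 5, 0, true, false, false)) // write-record+unlock
	fmt.Println(decideCommit(5, 9, 0, 0, false, true, false)) // already-committed
}
```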
5.3 BatchRollback
For each key:
- If a write already exists at `start_ts`:
  - rollback marker already exists -> success
  - non-rollback write exists -> success (already committed)
- Remove the lock (if any).
- Remove the default value at `start_ts` (if any).
- Write a rollback marker to `CFWrite` at `start_ts`.
5.4 ResolveLock
- `commit_version == 0` -> rollback matching locks
- `commit_version > 0` -> commit matching locks
- Returns the number of resolved keys
6. Transaction Status Check
CheckTxnStatus targets the primary key and decides whether txn is alive, committed, or should be rolled back.
Decision order:
- Read the lock on the primary.
- If a lock exists but `lock.ts != req.lock_ts` -> `KeyError.Locked`
- If a lock exists and its TTL expired (`current_ts >= lock.ts + ttl`):
  - rollback the primary
  - action = `TTLExpireRollback`
- If a lock exists and the caller pushes the timestamp:
  - `min_commit_ts = max(min_commit_ts, caller_start_ts+1)`
  - action = `MinCommitTsPushed`
- If there is no lock, check the write by `start_ts`:
  - committed write -> return `commit_version`
  - rollback write -> action = `LockNotExistRollback`
- If there is no lock and no write, and `rollback_if_not_exist` is true:
  - write a rollback marker
  - action = `LockNotExistRollback`
7. Read Path Semantics (MVCC Visibility)
KvGet and KvScan read through percolator.Reader:
- Check the lock first:
  - if a lock exists and `read_ts >= lock.ts`, return a locked error
- Find the visible write in `CFWrite`:
  - the latest `commit_ts <= read_ts`
- Interpret the write kind:
  - `Delete`/`Rollback` => not found
  - `Put` => read the value from `CFDefault` at `start_ts`
Notes:
- `KvScan` currently rejects reverse scans.
- `scanWrites` uses an internal iterator over `CFWrite`.
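The visibility rules above can be expressed as a small pure function. This is an illustrative sketch over assumed record shapes, not `percolator.Reader` itself:

```go
package main

import "fmt"

// writeRec mirrors a CFWrite record (illustrative fields).
type writeRec struct {
	commitTs uint64
	startTs  uint64
	kind     string // "Put", "Delete", "Rollback"
}

// visibleValueTs applies the read rules: fail on a lock the reader
// must observe (read_ts >= lock.ts), pick the newest write with
// commitTs <= readTs, and only Put kinds resolve to a value, which
// lives in CFDefault at the write's startTs.
func visibleValueTs(readTs, lockTs uint64, hasLock bool, writes []writeRec) (valueTs uint64, found bool, lockedErr bool) {
	if hasLock && readTs >= lockTs {
		return 0, false, true // reader must wait for / resolve the lock
	}
	var best *writeRec
	for i := range writes {
		w := &writes[i]
		if w.commitTs <= readTs && (best == nil || w.commitTs > best.commitTs) {
			best = w
		}
	}
	if best == nil || best.kind != "Put" {
		return 0, false, false // no version, or Delete/Rollback => not found
	}
	return best.startTs, true, false
}

func main() {
	ws := []writeRec{
		{commitTs: 8, startTs: 5, kind: "Put"},
		{commitTs: 12, startTs: 10, kind: "Delete"},
	}
	ts, ok, _ := visibleValueTs(9, 0, false, ws)
	fmt.Println(ts, ok) // 5 true: a read at ts 9 sees the Put committed at 8
}
```

The same reader at `read_ts = 13` would see the Delete committed at 12 and report "not found", which is exactly the snapshot-isolation behavior described above.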
8. Error and Idempotency Behavior
| Operation | Idempotency/Conflict behavior |
|---|---|
| Prewrite | Rejects lock conflicts and write conflicts; returns per-key KeyError list. |
| Commit | Idempotent for already committed keys with same start_ts; stale/missing lock may abort. |
| BatchRollback | Safe to repeat; rollback marker prevents duplicate side effects. |
| ResolveLock | Safe to retry per key set; resolves only matching start_ts locks. |
| CheckTxnStatus | May push min_commit_ts, rollback expired primary lock, or return committed version. |
9. Current Operational Boundaries
- Percolator execution is tied to the NoKV RPC + Raft apply path, with the command shape still following the TinyKV/TiKV MVCC model.
- Latch scope is process-local when one store shares a single `latch.Manager`; region correctness still comes from Raft ordering.
- `Write.ShortValue` and `Write.ExpiresAt` are codec fields; the current commit path stores the primary value bytes in `CFDefault` and reads from there when a short value is not present.
10. Validation and Tests
Primary coverage:
- `percolator/txn_test.go`
- `raftstore/kv/service_test.go`
- `raftstore/client/client_test.go`
- `raftstore/server/server_test.go`
These tests cover 2PC happy path, lock conflicts, status checks, resolve/rollback behavior, and client region-aware retries.
RaftStore Deep Dive
raftstore powers NoKV’s distributed mode by layering multi-Raft replication on top of the embedded storage engine. Its RPC surface is exposed as the NoKV gRPC service, while the command model still tracks the TinyKV/TiKV region + MVCC design. This note explains the major packages, the boot and command paths, how transport and storage interact, and the supporting tooling for observability and testing.
1. Package Structure
| Package | Responsibility |
|---|---|
| store | Orchestrates peer set, command pipeline, region manager, scheduler/heartbeat loops; exposes helpers such as StartPeer, ProposeCommand, SplitRegion. |
| peer | Wraps etcd/raft RawNode, drives Ready processing (persist to WAL, send messages, apply entries), tracks snapshot resend/backlog. |
| engine | WALStorage/DiskStorage/MemoryStorage across all Raft groups, leveraging the NoKV WAL while keeping manifest metadata in sync. |
| transport | gRPC transport with retry/TLS/backpressure; exposes the raft Step RPC and can host additional services (NoKV). |
| kv | NoKV RPC implementation, bridging Raft commands to MVCC operations via kv.Apply. |
| server | ServerConfig + New that bind DB, Store, transport, and NoKV server into a reusable node primitive. |
2. Boot Sequence
1. Construct the server:

   srv, _ := raftstore.NewServer(raftstore.ServerConfig{
       DB:            db,
       Store:         raftstore.StoreConfig{StoreID: 1},
       Raft:          myraft.Config{ElectionTick: 10, HeartbeatTick: 2, PreVote: true},
       TransportAddr: "127.0.0.1:20160",
   })

   - A gRPC transport is created, the NoKV service is registered, and `transport.SetHandler(store.Step)` wires raft Step handling.
   - `store.Store` loads `manifest.RegionSnapshot()` to rebuild the Region catalog (router + metrics).
2. Start local peers:
   - The CLI (`nokv serve`) iterates the manifest snapshot and calls `Store.StartPeer` for every region that includes the local store.
   - Each `peer.Config` carries raft parameters, the transport reference, `kv.NewEntryApplier`, WAL/manifest handles, and Region metadata.
   - `StartPeer` registers the peer through the peer-set/routing layer and may bootstrap or campaign for leadership.
3. Peer connectivity:
   - `transport.SetPeer(storeID, addr)` defines outbound raft connections; the CLI exposes it via `--peer storeID=addr`.
   - Additional services can reuse the same gRPC server through `transport.WithServerRegistrar`.
3. Command Execution
Read (strong leader read)
- `kv.Service.KvGet` builds a `pb.RaftCmdRequest` and invokes `Store.ReadCommand`.
- `validateCommand` ensures the region exists, the epoch matches, and the local peer is the leader; a RegionError is returned otherwise.
- `peer.LinearizableRead` obtains a safe read index, then `peer.WaitApplied` waits until the local apply index reaches it.
- `commandApplier` (i.e. `kv.Apply`) runs GET/SCAN against the DB using MVCC readers to honor locks and version visibility.
Write (via Propose)
- Write RPCs (Prewrite/Commit/…) call `Store.ProposeCommand`, encoding the command and routing it to the leader peer.
- The leader appends the encoded request to the raft log and replicates it; once committed, the command pipeline hands data to `kv.Apply`, which maps Prewrite/Commit/ResolveLock to the `percolator` package.
- `engine.WALStorage` persists raft entries/state snapshots and updates manifest raft pointers, keeping WAL GC and raft truncation aligned.
- Raft apply only accepts command-encoded payloads (`RaftCmdRequest`). Legacy raw KV payloads are rejected as unsupported.
Command flow diagram
sequenceDiagram
participant C as NoKV Client
participant SVC as kv.Service
participant ST as store.Store
participant PR as peer.Peer
participant RF as raft log/replication
participant AP as kv.Apply
participant DB as NoKV DB
rect rgb(237, 247, 255)
C->>SVC: KvGet/KvScan
SVC->>ST: ReadCommand
ST->>PR: LinearizableRead + WaitApplied
ST->>AP: commandApplier(req)
AP->>DB: internal read path
AP-->>C: read response
end
rect rgb(241, 253, 244)
C->>SVC: KvPrewrite/KvCommit/...
SVC->>ST: ProposeCommand
ST->>RF: route + replicate
RF->>AP: apply committed entry
AP->>DB: percolator mutate -> ApplyInternalEntries
AP-->>C: write response
end
4. Transport
- The gRPC transport listens on `TransportAddr`, serving both the raft Step RPC and the NoKV RPC.
- `SetPeer` updates the mapping of remote store IDs to addresses; `BlockPeer` can be used by tests or chaos tooling.
- Configurable retry/backoff/timeout options mirror production requirements. Tests cover message loss, blocked peers, and partitions.
5. Storage Backend (engine)
- `WALStorage` piggybacks on the embedded WAL: each Raft group writes typed entries, HardState, and snapshots into the shared log.
- `LogRaftPointer` and `LogRaftTruncate` edit manifest metadata so WAL GC knows how far it can compact per group.
- Alternative storage backends (`DiskStorage`, `MemoryStorage`) are available for tests and special scenarios.
6. NoKV RPC Integration
| RPC | Execution Path | Notes |
|---|---|---|
| KvGet / KvScan | ReadCommand → LinearizableRead(ReadIndex) + WaitApplied → kv.Apply (read mode) | Leader-only strong read with Raft linearizability barrier. |
| KvPrewrite / KvCommit / KvBatchRollback / KvResolveLock / KvCheckTxnStatus | ProposeCommand → command pipeline → raft log → kv.Apply | Pipeline matches proposals with apply results; MVCC latch manager prevents write conflicts. |
The cmd/nokv serve command uses raftstore.Server internally and prints a manifest summary (key ranges, peers) so operators can verify the node’s view at startup.
7. Client Interaction (raftstore/client)
- Region-aware routing with NotLeader/EpochNotMatch retry.
- `Mutate` splits mutations by region and performs two-phase commit (primary first).
- `Put`/`Delete` are convenience wrappers.
- `Scan` transparently walks region boundaries.
- End-to-end coverage lives in `raftstore/server/server_test.go`, which launches real servers, uses the client to write and delete keys, and verifies the results.
8. Control Plane & Region Operations
8.1 Topology & Routing
- Topology is sourced from `raft_config.example.json` (via `config.LoadFile`) and reused by scripts, Docker Compose, and the Redis gateway.
- Runtime routing is PD-first: `raftstore/client` resolves Regions by key through `GetRegionByKey` and caches route entries for retries.
- `raft_config` regions are treated as bootstrap/deployment metadata and are not the runtime source of truth once PD is available.
- PD is the only control-plane source of truth for runtime scheduling/routing.
8.2 Split / Merge
- Split: leaders call `Store.ProposeSplit`, which writes a split `AdminCommand` into the parent region's raft log. On apply, `Store.SplitRegion` updates the parent range/epoch and starts the child peer.
- Merge: leaders call `Store.ProposeMerge`, writing a merge `AdminCommand`. On apply, the target region's range/epoch is expanded and the source peer is stopped and removed from the manifest.
- These operations are explicit/manual and are not auto-triggered by size/traffic heuristics.
9. Observability
- `store.RegionMetrics()` feeds into `StatsSnapshot`, making region counts and backlog visible via expvar and `nokv stats`.
- `nokv regions` shows manifest-backed regions: ID, range, peers, state.
- `scripts/transport_chaos.sh` exercises transport metrics under faults; `scripts/run_local_cluster.sh` spins up multi-node clusters for manual inspection.
Store internals at a glance
| Component | File | Responsibility |
|---|---|---|
| Store facade | store.go | Store construction/wiring and shared component ownership (router, region manager, command pipeline, scheduler runtime). |
| Peer lifecycle | peer_lifecycle.go | Start/stop peers, router registration, lifecycle hooks, and store shutdown sequencing. |
| Command service | command_service.go | Region/epoch/key-range validation and read/propose request handling. |
| Admin service | admin_service.go | Split/merge proposal handling and applied admin command side effects. |
| Membership service | membership_service.go | Conf-change proposal helpers and manifest metadata updates after membership changes. |
| Region catalog | region_catalog.go | Public region catalog accessors and region metadata lifecycle operations. |
| Scheduler runtime | scheduler_runtime.go | Scheduler snapshot generation, store stats, operation application, and apply-entry dispatch. |
| Peer set | peer_set.go | Tracks active peers and exposes thread-safe lookups/iteration snapshots. |
| Command pipeline | command_pipeline.go | Assigns request IDs, records proposals, matches apply results, returns responses/errors to callers. |
| Region manager | region_manager.go | Validates state transitions, writes manifest edits, updates peer metadata, triggers region hooks. |
| Operation scheduler | operation_scheduler.go | Buffers planner output, enforces cooldown & burst limits, dispatches leader transfers or other operations. |
| Heartbeat loop | heartbeat_loop.go | Periodically publishes region/store heartbeats and, when the sink implements planner capability, drains scheduling actions. |
10. Current Boundaries and Guarantees
- Reads served through `ReadCommand` are leader-strong and pass a Raft linearizability barrier (`LinearizableRead` + `WaitApplied`).
- Mutating NoKV RPC commands are serialized through Raft log replication and apply.
- The command payload format on the apply path is strict `RaftCmdRequest` encoding.
- Region metadata (range/epoch/peers) is validated before both read and write command execution.
- `store.RegionMetrics` + `StatsSnapshot` provide runtime visibility into region count, backlog, and scheduling health.
PD-lite
PD-lite is NoKV’s control-plane service for distributed mode.
It exposes a gRPC API (pb.PD) and is started by:
go run ./cmd/nokv pd --addr 127.0.0.1:2379
1. Responsibilities
PD-lite currently owns:
- Routing: `GetRegionByKey`
- Heartbeats: `StoreHeartbeat`, `RegionHeartbeat`
- Region removal: `RemoveRegion`
- ID service: `AllocID`
- TSO: `Tso`
Runtime clients (for example the `cmd/nokv-redis` raft backend) use PD as the routing source of truth.
2. Runtime Architecture
flowchart LR
Store["nokv serve"] -->|"StoreHeartbeat / RegionHeartbeat"| PD["PD-lite (gRPC)"]
Gateway["nokv-redis (raft mode)"] -->|"GetRegionByKey / Tso"| PD
PD --> Cluster["pd/core.Cluster"]
Cluster --> Scheduler["leader-transfer hint planner"]
Core implementation units:
- `pd/core`: in-memory cluster metadata model + allocators.
- `pd/storage`: persistence abstraction (`Store`) with a local manifest+state implementation.
- `pd/server`: gRPC service + RPC validation/error mapping.
- `pd/client`: client wrapper used by the store/gateway.
- `pd/adapter`: scheduler sink that forwards heartbeats into PD.
3. Persistence (--workdir)
When --workdir is provided, PD-lite persists control-plane state:
- Region catalog via manifest edits.
- Allocator checkpoints via `PD_STATE.json`: `id_current`, `ts_current`.
Startup flow:
- Open `pd/storage` with `--workdir`.
- Load the snapshot (`regions` + allocator counters).
- Compute allocator starts as `max(cli_start, checkpoint+1)`.
- Replay the region snapshot into `pd/core.Cluster`.
This avoids allocator rollback after restart and keeps route metadata stable.
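The `max(cli_start, checkpoint+1)` rule is small but load-bearing: it is what prevents ID/TSO rollback. A sketch (the function name is illustrative):

```go
package main

import "fmt"

// allocStart applies the restart rule above: never hand out an ID or
// timestamp at or below the persisted checkpoint, even if the CLI
// asks to start lower. checkpoint+1 is the first value guaranteed
// never to have been issued before the restart.
func allocStart(cliStart, checkpoint uint64) uint64 {
	if next := checkpoint + 1; next > cliStart {
		return next
	}
	return cliStart
}

func main() {
	fmt.Println(allocStart(100, 250)) // 251: the checkpoint wins
	fmt.Println(allocStart(100, 10))  // 100: the CLI start wins
}
```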
4. Config Integration
raft_config.json supports PD endpoint + workdir defaults:
"pd": {
"addr": "127.0.0.1:2379",
"docker_addr": "nokv-pd:2379",
"work_dir": "./artifacts/cluster/pd",
"docker_work_dir": "/var/lib/nokv-pd"
}
Resolution rules:
- A CLI override wins.
- Otherwise, read from the config by scope (`host`/`docker`).
Helpers:
- `config.ResolvePDAddr(scope)`
- `config.ResolvePDWorkDir(scope)`
- `nokv-config pd --field addr|workdir --scope host|docker`
5. Routing Source Convergence
NoKV now uses PD-first routing:
- `raftstore/client` resolves regions with `GetRegionByKey`.
- `raft_config` regions are bootstrap/deployment metadata.
- Runtime route truth comes from PD heartbeats + the PD region catalog.
This avoids dual sources drifting over time (config vs PD).
6. Serve Mode Semantics
nokv serve is now PD-only:
- `--pd-addr` is required.
- Runtime routing/scheduling control-plane state is sourced from PD.
Related CLI behavior:
- Inspect control-plane state through PD APIs/metrics.
7. Comparison: TinyKV / TiKV
TinyKV (teaching stack)
- Uses a scheduler server (`tinyscheduler`) as a separate process.
- Control plane integrates embedded etcd for metadata persistence.
- Educational architecture, minimal production hardening.
TiKV (production stack)
- PD is an independent, highly available cluster.
- PD internally uses etcd Raft for durable metadata + leader election.
- Rich scheduling and balancing policies, rolling updates, robust ops tooling.
NoKV PD-lite (current)
- Single PD-lite process with optional local persistence (`--workdir`).
- API shape intentionally aligned with a PD-style control plane so migration to stronger HA semantics is incremental.
8. Current Limitations / Next Steps
- No multi-PD quorum and no automatic PD failover.
- Scheduler policy is intentionally small (leader transfer focused).
- No advanced placement constraints yet.
These are deliberate scope limits for a fast-moving experimental platform.
Crash Recovery Playbook
This document describes how NoKV restores state after abnormal exit, and which tests validate each recovery contract.
1. Recovery Phases
flowchart TD
Start[DB.Open]
Verify[runRecoveryChecks]
WalOpen[wal.Open]
LSM[lsm.NewLSM]
Manifest[manifest replay + table load]
WALReplay[WAL replay to memtables]
VLog[valueLog recover]
Flush[submit immutable flush backlog]
Stats[stats/start background loops]
Start --> Verify --> WalOpen --> LSM --> Manifest --> WALReplay --> VLog --> Flush --> Stats
- Pre-flight verification: `DB.runRecoveryChecks` runs `manifest.Verify`, `wal.VerifyDir`, and per-bucket `vlog.VerifyDir`.
- WAL manager reopen: `wal.Open` reopens the latest segment and rebuilds counters.
- Manifest replay + SST load: `levelManager.build` replays the manifest version and opens SST files.
- Stale SST cleanup: if a manifest SST is missing or unreadable/corrupt, it is marked stale and removed from the manifest (`EditDeleteFile`) so startup can continue.
- WAL replay: `lsm.recovery` replays post-checkpoint WAL records into memtables.
- Flush backlog restore: recovered immutable memtables are resubmitted to `flush.Manager`.
- ValueLog recovery: value-log managers reconcile on-disk files with manifest metadata, trim torn tails, and drop stale/orphan segments.
- Runtime restart: metrics and periodic workers start again.
2. Failure Scenarios & Tests
| Failure Point | Expected Recovery Behaviour | Tests |
|---|---|---|
| WAL tail truncated | Replay stops safely at truncated tail, preserving valid prefix records | wal/manager_test.go::TestManagerReplayHandlesTruncate |
| Crash before memtable flush install | WAL replay restores user data not yet flushed to SST | db_test.go::TestRecoveryWALReplayRestoresData |
| Manifest references missing SST | Startup removes stale manifest entry and continues | db_test.go::TestRecoveryCleansMissingSSTFromManifest |
| Manifest references corrupt/unreadable SST | Startup removes stale entry and continues | db_test.go::TestRecoveryCleansCorruptSSTFromManifest |
| ValueLog stale segment (manifest marked invalid) | Recovery deletes stale file from disk | db_test.go::TestRecoveryRemovesStaleValueLogSegment |
| ValueLog orphan segment (disk only) | Recovery deletes orphan file not tracked by manifest | db_test.go::TestRecoveryRemovesOrphanValueLogSegment |
| Manifest rewrite interrupted | Recovery keeps using CURRENT-selected manifest and data remains readable | db_test.go::TestRecoveryManifestRewriteCrash |
| ValueLog contains records absent from LSM/WAL | Recovery does not replay vlog as source-of-truth | db_test.go::TestRecoverySkipsValueLogReplay |
3. Recovery Tooling
3.1 Targeted tests
go test ./... -run 'Recovery|ReplayHandlesTruncate'
Set RECOVERY_TRACE_METRICS=1 to emit RECOVERY_METRIC ... lines in tests.
3.2 Script harness
RECOVERY_TRACE_METRICS=1 ./scripts/recovery_scenarios.sh
Outputs are saved under artifacts/recovery/.
3.3 CLI checks
- `nokv manifest --workdir <dir>`: verify level files, WAL pointer, vlog metadata.
- `nokv stats --workdir <dir>`: confirm the flush backlog converges.
- `nokv vlog --workdir <dir>`: inspect vlog segment state.
4. Operational Signals
Watch these fields during restart:
- `flush.queue_length`
- `wal.segment_count`
- `value_log.heads`
- `value_log.segments`
- `value_log.pending_deletes`
If flush.queue_length remains high after replay, inspect flush worker throughput and manifest sync settings.
5. Notes on Consistency Model
- WAL + manifest remain the authoritative recovery chain for LSM state.
- ValueLog is reconciled/validated but is not replayed as a mutation source.
- In strict flush mode (`ManifestSync=true`), the SST install ordering is `SST Sync -> RenameNoReplace -> SyncDir -> manifest edit`.
For deeper internals, see flush.md, manifest.md, and wal.md.
Stats & Observability Pipeline
NoKV exposes runtime health through:
- `StatsSnapshot` (structured in-process snapshot)
- expvar (`/debug/vars`)
- the `nokv stats` CLI (plain text or JSON)
The implementation lives in stats.go, and collection runs continuously once DB is open.
1. Architecture
flowchart TD
subgraph COLLECTORS["Collectors"]
LSM["lsm.* metrics"]
WAL["wal metrics"]
VLOG["value log metrics"]
HOT["hotring"]
REGION["region metrics"]
TRANSPORT["grpc transport metrics"]
REDIS["redis gateway metrics"]
end
LSM --> SNAP["Stats.Snapshot()"]
WAL --> SNAP
VLOG --> SNAP
HOT --> SNAP
REGION --> SNAP
TRANSPORT --> SNAP
REDIS --> SNAP
SNAP --> EXP["Stats.collect -> expvar"]
SNAP --> CLI["nokv stats"]
Two-layer design:
- `metrics` layer: only collects counters/gauges/snapshots.
- `stats` layer: aggregates cross-module data and exports it.
2. Snapshot Schema
StatsSnapshot is now domain-grouped (not flat):
- `entries`
- `flush.*`
- `compaction.*`
- `value_log.*` (includes `value_log.gc.*`)
- `wal.*`
- `raft.*`
- `write.*`
- `region.*`
- `hot.*`
- `cache.*`
- `lsm.*`
- `transport.*`
- `redis.*`
Representative fields:
- `flush.pending`, `flush.queue_length`, `flush.last_wait_ms`
- `compaction.backlog`, `compaction.max_score`, `compaction.value_weight`
- `value_log.segments`, `value_log.pending_deletes`, `value_log.gc.gc_runs`
- `wal.active_segment`, `wal.segment_count`, `wal.typed_record_ratio`
- `raft.group_count`, `raft.lagging_groups`, `raft.max_lag_segments`
- `write.queue_depth`, `write.avg_request_wait_ms`, `write.hot_key_limited`
- `region.total`, `region.running`, `region.removing`, `region.tombstone`
- `hot.read_keys`, `hot.write_keys`, `hot.read_ring`, `hot.write_ring`
- `cache.block_l0_hit_rate`, `cache.bloom_hit_rate`, `cache.iterator_reused`
- `lsm.levels`, `lsm.value_bytes_total`
3. expvar Export
Stats.collect exports a single structured object:
NoKV.Stats
All domains (flush, compaction, value_log, wal, raft, write, region, hot, cache, lsm, transport, redis) are nested under this object.
Legacy scalar compatibility keys are removed. Consumers should read fields from NoKV.Stats directly.
4. CLI & JSON
- `nokv stats --workdir <dir>`: offline snapshot from a local DB
- `nokv stats --expvar <host:port>`: snapshot from a running process's `/debug/vars`
- `nokv stats --json`: machine-readable nested JSON
Example:
{
"entries": 1048576,
"flush": {
"pending": 2,
"queue_length": 2
},
"value_log": {
"segments": 6,
"pending_deletes": 1,
"gc": {
"gc_runs": 12
}
},
"hot": {
"read_keys": [
{"key": "user:123", "count": 42}
]
}
}
5. Operational Guidance
- `flush.queue_length` + `compaction.backlog` both rising: flush/compaction is under-provisioned.
- `value_log.discard_queue` high for long periods: check `value_log.gc.*` and compaction pressure.
- `write.throttle_active=true` frequently: L0 pressure is likely high; inspect `cache.block_l0_hit_rate` and compaction.
- `write.hot_key_limited` increasing: hot-key write throttling is active.
- `raft.lag_warning=true`: at least one group exceeds the lag threshold.
6. Comparison
| Engine | Built-in observability |
|---|---|
| RocksDB | Rich metrics/perf context, often needs additional tooling/parsing |
| Badger | Optional metrics integrations |
| NoKV | Native expvar + structured snapshot + CLI with offline/online modes |
Testing & Validation Matrix
This document inventories NoKV’s automated coverage and provides guidance for extending tests. It aligns module-level unit tests, integration suites, and benchmarking harnesses with the architectural features described elsewhere.
1. Quick Commands
# All unit + integration tests (uses local module caches)
GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./...
# Focused distributed transaction suite
go test ./percolator/... ./raftstore/client/... -run 'Test.*(Commit|Prewrite|TwoPhaseCommit)'
# Crash recovery scenarios
RECOVERY_TRACE_METRICS=1 ./scripts/recovery_scenarios.sh
# Protobuf schema hygiene
make proto-check
# gRPC transport chaos tests + watchdog metrics
CHAOS_TRACE_METRICS=1 ./scripts/transport_chaos.sh
# Sample PD-lite service for shared TSO / routing in distributed tests
go run ./cmd/nokv pd --addr 127.0.0.1:2379 --id-start 1 --ts-start 100 --workdir ./artifacts/pd
# Local three-node cluster (includes manifest bootstrap + PD-lite)
./scripts/run_local_cluster.sh --config ./raft_config.example.json
# Tear down with Ctrl+C
# Docker-compose sandbox (3 nodes + PD-lite)
docker compose up --build
docker compose down -v
# Build RocksDB locally (installs into ./third_party/rocksdb/dist by default)
./scripts/build_rocksdb.sh
# YCSB baseline (records=1e6, ops=1e6, warmup=1e5, conc=16)
./scripts/run_benchmarks.sh
# YCSB with RocksDB (requires CGO, `benchmark_rocksdb`, and the RocksDB build above)
LD_LIBRARY_PATH="$(pwd)/third_party/rocksdb/dist/lib:${LD_LIBRARY_PATH}" \
CGO_CFLAGS="-I$(pwd)/third_party/rocksdb/dist/include" \
CGO_LDFLAGS="-L$(pwd)/third_party/rocksdb/dist/lib -lrocksdb -lz -lbz2 -lsnappy -lzstd -llz4" \
YCSB_ENGINES="nokv,badger,rocksdb" ./scripts/run_benchmarks.sh
# One-click script (auto-detect RocksDB, supports `YCSB_*` env vars to override defaults)
./scripts/run_benchmarks.sh
# Quick smoke run (smaller dataset)
NOKV_RUN_BENCHMARKS=1 YCSB_RECORDS=10000 YCSB_OPS=50000 YCSB_WARM_OPS=0 \
./scripts/run_benchmarks.sh -ycsb_workloads=A -ycsb_engines=nokv
Tip: pin `GOCACHE`/`GOMODCACHE` in CI to keep build artefacts local and avoid permission issues.
2. Module Coverage Overview
| Module | Tests | Coverage Highlights | Gaps / Next Steps |
|---|---|---|---|
| WAL | wal/manager_test.go | Segment rotation, sync semantics, replay tolerance for truncation, directory bootstrap. | Add IO fault injection, concurrent append stress. |
| LSM / Flush / Compaction | lsm/lsm_test.go, lsm/compaction_test.go, lsm/compact/*_test.go, lsm/flush/manager_test.go | Memtable correctness, iterator merging, flush pipeline metrics, compaction scheduling. | Extend backpressure assertions, test cache hot/cold split. |
| Manifest | manifest/manager_test.go, lsm/manifest_test.go | CURRENT swap safety, rewrite crash handling, vlog metadata persistence. | Simulate partial edit corruption, column family extensions. |
| ValueLog | vlog/manager_test.go, vlog/io_test.go, vlog_test.go | ValuePtr encoding/decoding, GC rewrite/rewind, concurrent iterator safety. | Long-running GC, discard-ratio edge cases. |
| Percolator / Distributed Txn | percolator/*_test.go, raftstore/client/client_test.go, stats_test.go | Prewrite/Commit/ResolveLock flows, 2PC retries, timestamp-driven MVCC behaviour, metrics accounting. | Mixed multi-region fuzzing with lock TTL and leader churn. |
| DB Integration | db_test.go, db_write_bench_test.go | End-to-end writes, recovery, and throttle behaviour. | Combine ValueLog GC + compaction stress, multi-DB interference. |
| CLI & Stats | cmd/nokv/main_test.go, stats_test.go | Golden JSON output, stats snapshot correctness, hot key ranking. | CLI error handling, expvar HTTP integration tests. |
| Redis Gateway | cmd/nokv-redis/backend_embedded_test.go, cmd/nokv-redis/server_test.go, cmd/nokv-redis/backend_raft_test.go | Embedded backend semantics (NX/XX, TTL, counters), RESP parser, raft backend config wiring, and PD-backed routing/TSO discovery. | End-to-end multi-region CRUD with raft backend, TTL lock cleanup under failures. |
| Scripts & Tooling | cmd/nokv-config/main_test.go, cmd/nokv/serve_test.go | nokv-config JSON/simple formats, manifest logging CLI, serve bootstrap behavior. | Add direct shell-script golden tests (currently not present) and failure-path diagnostics for run_local_cluster.sh. |
| Benchmark | benchmark/ycsb_test.go, benchmark/ycsb_runner.go | YCSB throughput/latency comparisons across engines (A-G) with detailed percentile + operation mix reporting. | Automate multi-node deployments and add longer-running, multi-GB stability baselines. |
3. System Scenarios
| Scenario | Coverage | Focus |
|---|---|---|
| Crash recovery | db_test.go, scripts/recovery_scenarios.sh | WAL replay, missing SST cleanup, vlog GC restart, manifest rewrite safety. |
| WAL pointer desync | raftstore/engine/wal_storage_test.go::TestWALStorageDetectsTruncatedSegment | Detects manifest pointer offsets beyond truncated WAL tails to avoid silent corruption. |
| Distributed transaction contention | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, percolator/*_test.go | Lock conflicts, retries, and 2PC sequencing under region routing. |
| Value separation + GC | vlog/manager_test.go, db_test.go::TestRecoveryRemovesStaleValueLogSegment | GC correctness, manifest integration, iterator stability. |
| Iterator consistency | lsm/iterator_test.go | Snapshot visibility, merging iterators across levels and memtables. |
| Throttling / backpressure | lsm/compaction_test.go, db_test.go::TestWriteThrottle | L0 backlog triggers, flush queue growth, metrics observation. |
| Distributed NoKV client | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, raftstore/transport/grpc_transport_test.go::TestGRPCTransportManualTicksDriveElection | Region-aware routing, NotLeader retries, manual tick-driven elections, cross-region 2PC sequencing. |
| Performance regression | benchmark package | Compare NoKV vs Badger/Pebble by default (RocksDB optional), produce human-readable reports under benchmark/benchmark_results. |
4. Observability in Tests
- `RECOVERY_METRIC` logs – produced when `RECOVERY_TRACE_METRICS=1`; consumed by the recovery script and helpful when triaging CI failures.
- `TRANSPORT_METRIC` logs – emitted by `scripts/transport_chaos.sh` when `CHAOS_TRACE_METRICS=1`, capturing gRPC watchdog counters during network partitions and retries.
- Stats snapshots – `stats_test.go` verifies JSON structure so CLI output remains backwards compatible.
- Benchmark artefacts – stored under `benchmark/benchmark_results/*.txt` for historical comparison. Aligns with README instructions.
5. Extending Coverage
- Property-based testing – integrate `testing/quick` or third-party generators to randomise distributed 2PC sequences (prewrite/commit/rollback ordering).
- Stress harness – add a Go-based stress driver to run mixed read/write workloads for hours, capturing metrics akin to RocksDB’s `db_stress` tool.
- Distributed readiness – strengthen raftstore fault-injection and long-run tests (leader transfer, transport chaos, snapshot catch-up) with reproducible CI artifacts.
- CLI smoke tests – simulate corrupted directories to ensure the CLI emits actionable errors.
Keep this matrix updated when adding new modules or scenarios so documentation and automation remain aligned.
Scripts Overview
NoKV ships a small collection of helper scripts to streamline local experimentation, demos, diagnostics, and automation. This page summarises what each script does, how to use it, and which shared configuration it consumes.
Cluster helpers
scripts/run_local_cluster.sh
- Purpose – builds `nokv` and `nokv-config`, reads `raft_config.json`, seeds manifests, starts PD-lite, and starts the NoKV nodes. If a store directory already contains a manifest (CURRENT), the seeding step is skipped so previously bootstrapped data is reused.
- Usage
./scripts/run_local_cluster.sh --config ./raft_config.example.json --workdir ./artifacts/cluster
--config defaults to the repository’s raft_config.example.json; --workdir chooses the data root (./artifacts/cluster by default). For every entry under stores the script creates store-<id> and calls nokv-config manifest, then launches nokv pd and the store processes. PD state is persisted under pd.work_dir (or <workdir>/pd when config omits it), so region routing metadata survives restarts. By default, store processes resolve PD from config; --pd-listen override is only forwarded when explicitly set. The script runs in the foreground—press Ctrl+C to stop all spawned processes.
When --pd-listen is omitted, the script reads pd.addr from config and falls back to 127.0.0.1:2379.
❗️ Shutdown / restart note — To avoid WAL/manifest mismatches, stop the script with `Ctrl+C` and wait for child processes to exit. If you crash the process or the host, clean the workdir (`rm -rf ./artifacts/cluster`) before starting again; otherwise the replay step may panic when it encounters truncated WAL segments.
scripts/bootstrap_from_config.sh
- Purpose – manifest-only bootstrap, typically used in Docker Compose before the nodes start. Stores that already hold a manifest are detected and skipped.
- Usage
./scripts/bootstrap_from_config.sh --config /etc/nokv/raft_config.json --path-template /data/store-{id}
The script iterates over every store in the config and writes Region metadata via `nokv-config manifest` into the provided path template.
scripts/serve_from_config.sh
- Purpose – translates `raft_config.json` into a `nokv serve` command, avoiding manual `--peer` lists. It resolves peer IDs from the region metadata and maps every peer (other than the local store) to its advertised address so that gRPC transport works out of the box.
- Usage
./scripts/serve_from_config.sh \
  --config ./raft_config.json \
  --store-id 1 \
  --workdir ./artifacts/cluster/store-1 \
  --scope local   # use --scope docker inside containers
`--scope` decides whether to use the local addresses or the container-friendly ones. The script also resolves PD from `config.pd` unless `--pd-addr` explicitly overrides it. It assembles all peer mappings (excluding the local store) and execs `nokv serve`.
Diagnostics & benchmarking
| Script | Purpose |
|---|---|
scripts/recovery_scenarios.sh | Runs crash-recovery scenarios across WAL/manifest/vlog. Set RECOVERY_TRACE_METRICS=1 to collect metrics under artifacts/recovery/. |
scripts/transport_chaos.sh | Injects disconnects/blocks/delay into the raftstore transport to observe behaviour under faulty networks. |
scripts/run_benchmarks.sh | Executes YCSB benchmarks (default engines: NoKV/Badger/Pebble, workloads A-G; optional RocksDB via build tags). |
scripts/debug.sh | Convenience wrapper around dlv test for targeted debugging. |
scripts/gen.sh | Generates protobuf Go bindings through Buf with pinned remote plugin versions. |
Other helpers
cmd/nokv pd
PD-lite service used by local scripts and compose for:
- routing (
GetRegionByKey) - ID allocation (
AllocID) - timestamp allocation (
Tso)
Example:
go run ./cmd/nokv pd --addr 127.0.0.1:2379 --id-start 1 --ts-start 100 --workdir ./artifacts/pd
Relationship with nokv-config
- `nokv-config stores/regions/pd` provide structured views over `raft_config.json`, making it easy for scripts and CI to query the topology.
- `nokv-config manifest` writes Region metadata into manifests and replaces the historical `manifestctl` binary.
- `cmd/nokv-redis` reads the same config and uses `config.pd` by default in raft mode (`--pd-addr` remains an override).
- Go tools or custom scripts can import `github.com/feichai0017/NoKV/config` and call `config.LoadFile` / `Validate` to consume the same `raft_config.json`, avoiding divergent schemas.
Maintaining a single raft_config.json keeps local scripts, Docker Compose, Redis gateway, and automated tests aligned.
Redis Gateway
cmd/nokv-redis exposes NoKV through a RESP-compatible endpoint and can operate in two modes:
| Mode | Description | Key flags |
|---|---|---|
Embedded (embedded) | Opens a local *NoKV.DB work directory. Commands (SET, SET NX/XX, EX/PX/EXAT/PXAT, MSET, INCR/DECR, DEL, MGET, EXISTS, …) run through regular DB APIs (Get/Set/SetWithTTL/Del) with backend-side synchronization for read-modify-write operations. | --workdir <dir> |
Raft (raft) | Routes requests through raftstore/client and a NoKV cluster. Writes execute via TwoPhaseCommit; TTL is persisted directly in entry expires_at metadata (same write path as value updates). Routing and TSO allocation are provided by PD-lite over gRPC (PD is runtime route source; config regions are bootstrap metadata). | --raft-config <file>--pd-addr host:port (optional override; defaults to config.pd) |
When both CLI and config provide the same setting, CLI wins.
Usage examples
Embedded backend
go run ./cmd/nokv-redis \
--addr 127.0.0.1:6380 \
--workdir ./work_redis \
--metrics-addr 127.0.0.1:9100 # optional expvar endpoint
Validate with redis-cli -p 6380 ping. Metrics are exposed at http://127.0.0.1:9100/debug/vars under NoKV.Stats.redis.
Raft backend
1. Start NoKV and PD-lite using the helper script or Docker Compose. Both consume `raft_config.example.json`, initialise manifests for each store, and launch `nokv pd` automatically:
   ./scripts/run_local_cluster.sh
   # or: docker compose up --build
2. Run the gateway:
   go run ./cmd/nokv-redis \
     --addr 127.0.0.1:6380 \
     --raft-config raft_config.example.json
Supported commands
- String operations: `GET`, `SET`, `SET NX/XX`, `EX/PX/EXAT/PXAT`, `DEL`, `MGET`, `MSET`, `EXISTS`
- Integer operations: `INCR`, `DECR`, `INCRBY`, `DECRBY`
- Utility: `PING`, `ECHO`, `QUIT`
TTL option semantics:
- `EX`/`PX` are relative TTLs; `EXAT`/`PXAT` are absolute expire timestamps.
- The current engine expiry resolution is seconds, so sub-second TTL intent is rounded/coarsened to second granularity.
In both modes write commands are atomic. The Raft backend batches multi-key updates (MSET, DEL, …) into a single TwoPhaseCommit, matching the embedded semantics. Reads use direct DB.Get locally and leader reads with TTL checks remotely.
Configuration file
raft_config.example.json is shared by scripts/run_local_cluster.sh, Docker Compose, and the Redis gateway. Important fields:
- `stores` – store ID, gRPC address, and optional container listen/advertise addresses
- `regions` – region ID, start/end keys (use `hex:<bytes>` for binary data), epoch, peer list, leader store ID
- `max_retries` – maximum retries for region errors in the distributed client
- `pd` – PD-lite endpoint(s) and optional persistence dirs:
  - `addr` / `docker_addr` for endpoint resolution by scope
  - `work_dir` / `docker_work_dir` for PD state persistence defaults
Use nokv-config to inspect or validate the configuration:
nokv-config stores --config raft_config.json
nokv-config regions --config raft_config.json --format json | jq '.[] | {id:.id, peers:.peers}'
For Go tooling, import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate to reuse the same schema and defaults across CLIs, scripts, and applications.
Metrics
With --metrics-addr enabled the gateway publishes Redis metrics as part of NoKV.Stats on /debug/vars, for example:
{
"NoKV.Stats": {
"redis": {
"commands_total": 128,
"errors_total": 0,
"connections_active": 1,
"connections_accepted": 4,
"commands_per_operation": {
"PING": 4,
"SET": 32,
"GET": 64,
"MGET": 8,
"DEL": 10,
"INCR": 10
}
}
}
}
These counters are part of the process-wide expvar output and can be scraped alongside the rest of NoKV’s metrics.
Notes
Use this folder to capture per-debug or per-investigation notes. Keep entries short, factual, and easy to skim.
Add a new note
- Create a new file in `docs/notes/` named `YYYY-MM-DD-short-title.md`.
- Add it to `docs/SUMMARY.md` under Notes.
- Use the template below to keep entries consistent.
Template
Context
Symptom
Repro
Investigation
Root cause
Fix
Follow-ups
2026-01-16 mmap choice
This note compares the major file I/O models and explains the reasoning and trade-offs behind NoKV's different I/O choices for different components (SSTable, WAL, VLog).
1. The Four I/O Models
On Linux/Unix, a storage-engine designer usually faces four options. Understanding their trade-offs is a prerequisite for sound architectural decisions.
| Aspect | Standard I/O (read/write) | Memory mapping (mmap) | Direct I/O (O_DIRECT) | Async I/O (io_uring) |
|---|---|---|---|---|
| Mechanism | System calls; data is copied between kernel and user buffers | Virtual-memory mapping; pages load on page fault; zero copy | Bypasses the page cache; DMA straight into user memory | Requests are submitted to a queue and completed asynchronously by the kernel with near-zero syscall overhead |
| Strengths | Simple and universal; automatic page-cache readahead | Very low read latency (memory-like access); simple code | Full control over memory and flushing; no GC interference | Very high throughput; low CPU usage |
| Pain points | Copy overhead (CPU copies); frequent context switches | Unpredictable (page-fault stalls, TLB shootdowns); large files pollute the cache | Complex (self-managed buffer pool, alignment constraints) | Very complex (a completely different programming model) |
| Best for | Log appends (WAL) | Read-only indexes, small random reads (SSTable) | Databases that manage their own cache (MySQL, ScyllaDB) | Ultra-high-concurrency network/disk I/O |
2. NoKV's Choices: Horses for Courses
NoKV does not run one I/O model everywhere; it picks the best fit for each component's access pattern.
2.1 SSTable: a Firm Commitment to mmap
SSTables are the LSM tree's data files: immutable and randomly read.
- Pain point: with a standard pread, every Get(key) issues a system call. At 100k QPS, the context-switch overhead is substantial.
- How mmap helps:
  - Zero copy: data maps directly into user space — slice = data[offset:len], no memcpy.
  - Zero syscalls: if hot data is resident in physical memory, a read is a pure memory access with nanosecond latency.
  - The OS manages the cache: the page cache tracks hot pages, so no hand-rolled LRU cache is needed.
2.2 WAL: Back to Standard os.File + bufio
The WAL (write-ahead log) is append-only and durability-sensitive.
- mmap's pain points:
  - Awkward file growth: mmap needs an up-front ftruncate to reserve space, and a full file must be remapped — clumsy for an append workload.
  - Uncontrolled flushing: msync exists, but when the OS writes dirty pages back is unpredictable. For a WAL that must fsync strictly, standard I/O is more controllable.
- NoKV's choice: standard I/O with bufio.Writer. bufio provides user-space buffering that reduces write syscalls, and the fsync semantics stay clear, ensuring no data is lost.
2.3 ValueLog: the Current Compromise (mmap + madvise)
The ValueLog is also written sequentially, but must serve random reads (lookups after KV separation).
- Status quo: NoKV currently uses mmap for the VLog as well.
- Write control: although writes go through mmap, the code explicitly calls madvise(MADV_DONTNEED).
  - On DoneWriting (rotation when a file fills up) and SetReadOnly, the kernel is told these pages are no longer needed.
  - Goal: proactively release the page cache held by the VLog's freshly written dirty pages so they do not evict the SSTables' hot data (indexes, filters).
- Durability: msync is called only when SyncWrites: true; otherwise the OS's background writeback is relied upon.
3. Read/Write Interaction Diagram
The diagram below shows where each I/O model sits in NoKV's read and write flows:
flowchart TD
subgraph "Write Path"
Mem[MemTable]
WAL["WAL (Standard IO)"]
Flush["Flush/Compact"]
end
subgraph "Persistence"
SST["SSTable (mmap)"]
VLog["ValueLog (mmap)"]
end
Write["Set(k, v)"] --> Mem
Write --> WAL
Mem -->|Full| Flush
Flush -->|"Small Values"| SST
Flush -->|"Large Values"| VLog
subgraph "Read Path"
Get["Get(k)"]
LSM["LSM Search"]
Get --> LSM
LSM -->|"1. Index Lookups"| SST
SST -->|"2. Zero Copy Read"| Kernel["Page Cache"]
LSM -->|"3. ValuePtr Found"| VLog
VLog -->|"4. Random Read"| Kernel
end
style WAL fill:#f9f,stroke:#333,stroke-width:2px
style SST fill:#bfb,stroke:#333,stroke-width:2px
style VLog fill:#bfb,stroke:#333,stroke-width:2px
4. Summary
NoKV's I/O strategy is "separate reads from writes; stability first":
- Read-heavy (SST): mmap, to squeeze memory bandwidth and cut CPU overhead.
- Write-sensitive (WAL): standard I/O, for data safety and append performance.
- Large-volume (VLog): mmap + madvise, keeping the convenience of slice reads while actively managing cache pollution.
Understanding these trade-offs is key to mastering low-level storage-engine performance tuning.
2026-01-16 hotring design
This note records the design inspiration, architectural role, core implementation, and future plans of NoKV's hotring module — a textbook case of distilling an academic paper into an industrial-grade "hotspot detector".
The current implementation has been merged into the NoKV repository under the hotring/ package.
1. Design Inspiration: Keep the Spirit, Drop the Form
Source: HotRing: A Hotspot-Aware In-Memory Key-Value Store (FAST ’20)
1.1 The Problem the Paper Solves
In a traditional chained hash index, if a chain is long and the hot entry sits near its tail, every access to that hot key walks through cold data first, causing severe CPU cache misses and long-tail latency. HotRing replaces the chain with a ring and lets the head pointer track the hot node, achieving $O(1)$ hot-key access.
1.2 NoKV's Engineering Adaptation
NoKV does not adopt the paper's structure as its primary index (the primary index is the LSM tree). Instead it extracts the core idea — hotspot awareness — into a lightweight, out-of-band hotspot statistics module.
- Differences:
  - Role: the paper's structure is a data-bearing index; NoKV's is a bookkeeping counter.
  - Structure: the paper uses a ring list with a smart head pointer; NoKV uses sharded hashing + ordered lists + sliding windows.
  - Core value: at millions of QPS, identify the system's hotspots precisely at very low cost (lock-free lists), feeding cache optimisation and rate limiting.
2. Core Architecture: Feedback-Driven Design
NoKV's HotRing is not just a statistics tool; it is the brain of the whole system's self-tuning loop.
2.1 Architecture Overview
graph TD
Client[Client Request] --> DB[DB Layer]
subgraph "HotRing Subsystem (The Brain)"
Tracker[Hot Key Tracker]
Window[Sliding Window]
Decay[Decay Loop]
end
subgraph "Execution Layer"
LSM[LSM Tree]
Cache[Block Cache]
Compaction[Compaction Picker]
Limiter[Write Limiter]
end
DB -->|"1. Touch(key)"| Tracker
Tracker -->|"2. Update Counters"| Window
Decay -.->|"3. Age Out"| Window
Tracker -.->|"4. TopN Report"| Compaction
Compaction -->|"5. Hot Score"| LSM
2.2 Key Interaction Flows
- Probe:
  - Read path: every Get hit calls Touch.
  - Write path: TouchAndClamp is called only when write throttling (WriteHotKeyLimit) or burst detection is enabled.
- Compute: HotRing computes real-time QPS internally with a sliding-window algorithm.
- Feedback:
  - Compaction scoring: lsm/compact consults HotRing.TopN when choosing which level to compact. A level containing many hot keys is compacted first (hot-overlap score), reducing read amplification on hot data.
  - Cache prefetch: the DB layer triggers prefetch logic from the TopN results. HotRing does not control the cache directly, but its hot-key list is a key input to the prefetch policy.
  - Write throttling: for keys written too frequently, TouchAndClamp triggers throttling protection.
3. Implementation Details
3.1 Concurrency Control: Lock-Free plus Spin-Lock
To sustain high concurrency, HotRing uses a hybrid strategy:
- Main lists (buckets & list): nodes are inserted with lock-free CAS operations.
  - Ordered list: nodes are sorted by (Tag, Key), so a failed lookup can terminate early.
- Window counters: window rolling and array updates are multi-step operations, so they are protected by a lightweight spin-lock.
  - node.lockWindow(): CAS(&lock, 0, 1).
- Decay: the background decay goroutine holds decayMu, but the actual counter decay uses atomic operations.
3.2 Statistics: Sliding Windows and Decay
How do we tell "historical" hotspots from "bursty" ones?
- Sliding window:
  - Time is divided into slots (e.g. 8 slots of 250ms each).
  - Touch writes to the slot selected by Timestamp % Slots.
  - Effect: accurately reflects heat over roughly the last 2 seconds; stale data expires automatically.
- Decay:
  - A background goroutine periodically shifts every counter right by one bit (count >> 1).
  - Effect: simulates a heat "half-life" so hotspots that stop being accessed cool down gradually.
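The window-plus-decay scheme above can be condensed into a toy counter: eight 250ms slots accumulate touches, and a decay pass halves every slot. The `slotCounter` type is a didactic sketch (no wrap-around clearing or locking), not the hotring package's implementation.

```go
package main

import "fmt"

// slotCounter is a toy sliding-window heat counter: 8 slots of 250ms
// cover roughly the last 2 seconds, and Decay halves every slot to
// model a heat half-life.
type slotCounter struct {
	slots [8]uint32
}

const slotMillis = 250

// Touch credits the slot that owns the given timestamp (in ms).
func (c *slotCounter) Touch(tsMillis int64) {
	c.slots[(tsMillis/slotMillis)%8]++
}

// Sum returns the total heat across the window.
func (c *slotCounter) Sum() uint32 {
	var s uint32
	for _, v := range c.slots {
		s += v
	}
	return s
}

// Decay shifts every counter right by one bit (count >> 1), so an
// untouched key's heat drops geometrically instead of vanishing at
// a window edge.
func (c *slotCounter) Decay() {
	for i := range c.slots {
		c.slots[i] >>= 1
	}
}

func main() {
	var c slotCounter
	for ts := int64(0); ts < 2000; ts += 100 { // 20 touches over 2s
		c.Touch(ts)
	}
	fmt.Println(c.Sum()) // 20
	c.Decay()
	fmt.Println(c.Sum()) // 8 (each slot of 2 or 3 halves to 1)
}
```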
3.3 Key Differences from the Paper / Classic Algorithms (Engineering Changes)
| Aspect | Paper / classic algorithm | NoKV HotRing |
|---|---|---|
| Goal | An index or a strict frequency estimator | A system-level hotspot feedback signal |
| Data structure | Ring list / sketch | Hash buckets + ordered lists |
| Error control | Explicit error bounds | Engineering-acceptable range |
| Concurrency | Complex locks or global structures | Lock-free + lightweight spin-locks |
| Time dimension | Cumulative counts | Sliding window + decay |
Bottom line: NoKV's HotRing puts "engineering usability" first, not "mathematical optimality".
4. Practical Applications
4.1 Observability
Operators can inspect live hotspots through the CLI and instantly see who is hammering the database.
# Inspect with the stats command
$ go run cmd/nokv/main.go stats --workdir ./work_test
...
Hot Keys:
key: user:1001, count: 52000
key: config:global, count: 12000
4.2 Caching and Performance
- VIP cache tier (hot tier): the LSM cache maintains a small Clock-Pro cache (hot tier). It is not an absolute exemption (entries can still be evicted by hotter data), but it protects hot blocks better than a plain LRU.
- Hot-first compaction: with HotRing's feedback, the system proactively merges the overlapping SSTables that hold hot data, shortening hot keys' lookup paths as far as possible.
5. Future Work
The current HotRing foundation enables more advanced features:
- Write absorption:
  - For extremely write-hot keys (such as counters), aggregate, say, 100 updates in memory into one VLog write, sharply reducing LSM write amplification.
- Dynamic data migration:
  - In a distributed deployment, when a Region develops a hotspot, automatically trigger a Region split or migrate the hot key to a dedicated node.
6. Summary
NoKV's hotring exemplifies "academic inspiration plus engineering pragmatism". Rather than chasing the theoretically elegant ring index, it captures the core value — hotspot awareness — solves the hardest monitoring blind spots with a hybrid concurrency design (lock-free + spin-lock), and feeds the results back into compaction scheduling.
2026-02-01 compaction and ingest
This note dissects the cooperative design of NoKV's compaction mechanism and its Ingest Buffer. This is NoKV's main weapon against the LSM tree's classic write-stall problem, and a key piece of its industrial-grade stability.
1. Design Philosophy: No Write Stalls
In an LSM tree, data flushes from the MemTable into L0. Because L0 SSTables have overlapping key ranges, once the L0 file count reaches its limit (e.g. 15), an L0 -> L1 compaction must run.
- Traditional pain point: an L0 -> L1 compaction reads the L0 files plus every overlapping L1 file, merge-sorts them, and rewrites the result. That costs heavy IO and CPU and takes a long time.
- Consequence: if the write rate exceeds the L0 -> L1 compaction rate, L0 fills up and the engine is forced into a write stall (throttling or even stopping writes), causing severe performance jitter.
NoKV's philosophy:
"Accept first, tidy later." When L0 is congested, do not block writers waiting for a long merge; hand the L0 files down to the next level for temporary safekeeping and sort them out when there is spare capacity.
1.1 Reference Papers and Industrial Baselines
The following papers/systems are the main reference points for NoKV's compaction and ingest-buffer design (grouped by theme):
- LSM design and tuning theory:
  - Monkey (SIGMOD 2017) — global tuning; the trade-off model between Bloom filters and merge policy.
  - Dostoevsky (SIGMOD 2018) — lazy leveling / low-merge-cost level policies.
- Write stalls and stability:
  - bLSM (SIGMOD 2012) — emphasises stable write throughput and tail latency.
  - Performance Stability in LSM-based Storage Systems — analyses the causes of compaction jitter and write stalls.
- Engineering practice:
  - RocksDB Compaction (official docs) — leveled/tiered/universal and L0 handling strategies.
  - PebblesDB — fragmentation/sharding ideas to lower write amplification.
  - Co-KV — systems research that treats compaction as the core bottleneck.
2. Core Component: the Ingest Buffer
To realise this philosophy, NoKV gives every level (L1+) a special structure: the Ingest Buffer.
2.1 Structure (lsm/ingest.go)
It is not a simple queue but a sharded container:
type ingestBuffer struct {
    shards []ingestShard // 4 shards by default
}
type ingestShard struct {
    tables []*table      // SSTables parked here
    ranges []tableRange  // key-range index for those tables
}
- Sharding: parked tables are assigned to shards by key prefix.
- Parallelism: this lets multiple background compactor threads drain backlog for different key ranges in parallel.
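Prefix-based shard routing can be sketched in a few lines: hash a short key prefix so tables covering nearby key ranges land in the same shard and can be merged together. The 4-byte prefix length and the FNV hash are assumptions for illustration, not lsm/ingest.go's exact policy.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor routes a key to one of n ingest shards by hashing its prefix.
// Keys that share a prefix (and hence likely overlap in range) map to the
// same shard, preserving range locality for later merges. Illustrative only.
func shardFor(key []byte, n int) int {
	prefix := key
	if len(prefix) > 4 {
		prefix = prefix[:4] // assumed prefix length for the sketch
	}
	h := fnv.New32a()
	h.Write(prefix)
	return int(h.Sum32()) % n
}

func main() {
	// Keys sharing the "user" prefix stay in one shard.
	same := shardFor([]byte("user:1001"), 4) == shardFor([]byte("user:2002"), 4)
	fmt.Println(same) // true
	fmt.Println(shardFor([]byte("user:1001"), 4), shardFor([]byte("conf:gc"), 4))
}
```

Because each shard owns disjoint prefixes, two compactors draining different shards never touch the same key range.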
3. Interaction Logic: Firefighting and Debt Repayment
NoKV's compaction runs on a "fast/slow dual track".
3.1 Fast Path: L0 Overflow Offloading
This is the "firefighting" mechanism against write stalls.
- Trigger: too many L0 files.
- Action (moveToIngest):
  - No data is merged.
  - The L0 SSTables are simply removed from the L0 list.
  - They are added to L1's Ingest Buffer.
- Cost: a pure metadata operation, completed in microseconds.
- Result: L0 empties instantly and the write stall clears. L1 temporarily holds these unsorted files.
graph TD
subgraph Before_L0_Congested["Before: L0 Congested"]
L0["L0: 15 SSTables (Full)"]
L1["L1: Sorted SSTables"]
end
subgraph Action_Offload_Fast["Action: Offload (Fast)"]
Move["Move to Ingest"]
end
subgraph After_L0_Empty["After: L0 Empty"]
L0_New["L0: Empty"]
L1_New["L1: Sorted SSTables"]
L1_Ingest["L1 Ingest Buffer: 15 Unsorted Tables"]
end
L0 --> Move --> L1_Ingest
3.2 Slow Path: Background Asynchronous Merge
This is the "debt repayment" mechanism that guarantees the storage structure eventually becomes ordered.
- Trigger: a compactor notices a level's Ingest Buffer backlog is severe (score > 1).
- Mode selection (IngestMode):
  - IngestDrain: merge the ingest shard into the main tables and drain the buffer completely.
  - IngestKeep: merge the shard, but if downstream pressure is also high, keep the output in the Ingest Buffer (a staged result) to avoid a cascade of write amplification.
- Action (fillTablesIngestShard):
  - Pick the most backlogged shard.
  - Lock that shard plus the L1 main tables overlapping its key range.
  - Run a standard merge sort.
  - Produce new main tables and clear the shard.
4. The Read-Path Trade-off
This design is fundamentally "space for time" and "reads for writes": a little read performance is sacrificed for extreme write stability.
Lookup flow (Get):
- Check the MemTable.
- Check L0.
- Check L1:
  - First the L1 Ingest Buffer: it holds the newest data just offloaded from L0.
    - A binary search within the shard is required (tables inside the buffer may overlap).
  - Then the L1 main tables: standard sorted data, fast to search.
- Check L2…
5. Cooperative Design: Value-Aware Compaction
Beyond absorbing write jitter, compaction is also responsible for reclaiming VLog space.
- Pain point: with KV separation, a delete in the LSM only writes a tombstone; the old value still occupies disk in the VLog.
- Approach:
  - Value density: the compaction picker computes TotalValueBytes / TotalSizeBytes for each level.
  - Discard stats: VLog GC relies on dedicated discard stats, but compaction must drop pointers to dead values by rewriting SSTables.
  - Policy: compaction prefers levels with anomalous value density (or large amounts of stale data), proactively triggering pointer cleanup.
6. Summary
NoKV's compaction and Ingest Buffer design resolves a knot of conflicting engineering goals:
| Problem | Traditional approach | NoKV approach | Benefit |
|---|---|---|---|
| L0 congestion | Block writes, force a merge | L0 -> Ingest Buffer (fast offload) | Zero write stalls |
| Merge hiccups | One big single-threaded merge | Sharding + subcompaction | Parallelism that exploits multicore and SSDs |
| VLog bloat | Passive waiting | Value-aware scoring | Proactive, faster space reclamation |
This is a mature industrial design: it cares not only about "fitting the data in" but about "writing steadily" and "deleting reliably".
7. Key Differences from the Original Papers (What We Changed)
7.1 vs bLSM / Performance Stability
| Paper's claim | Paper's focus | NoKV change | Practical effect |
|---|---|---|---|
| Write stalls stem mainly from L0 congestion + slow compaction | Stable throughput | Ingest Buffer + fast offload | Write stalls nearly disappear |
| Background work must be paced evenly | Tail latency | Sharding + parallel compaction + dynamic scheduling | Jitter is pushed into the background |
7.2 vs Monkey / Dostoevsky
| Paper's claim | Paper's focus | NoKV change | Practical effect |
|---|---|---|---|
| LSM parameters need a global read/write/space trade-off | Theoretical model | Ingest buffer as an engineering buffer layer | More stable tuning in practice |
| Lazy leveling lowers merge cost | Less write amplification | IngestKeep/Drain modes | Lower latency under hotspots |
7.3 vs RocksDB / PebblesDB
| System | Original design | NoKV change | Note |
|---|---|---|---|
| RocksDB | L0 → leveled, universal as an option | Per-level ingest buffers | Better suited to burst workloads |
| PebblesDB | Fragmented LSM | Prefix-based shards | Preserves range locality |
7.4 Engineering Points Beyond the Paper Prototypes
- Sharded parallelism: sharding by key prefix lets ingest and compaction run in parallel without clobbering each other.
- IngestKeep / IngestDrain: splits "fast bleeding control" and "slow debt repayment" into two paths.
- Value-aware compaction: coordinates with VLog discard stats to purge dead pointers as soon as possible.
- Backlog/score-based scheduling: handle the most urgent shard first instead of picking at random.
In short: the papers solve for the "theoretical optimum"; NoKV optimises for "engineering stability and operability".
2026-02-05 vlog design and GC (WiscKey + HashKV, productionised)
These notes consolidate NoKV's ValueLog (vlog) design, its GC mechanism, and the recent parallelisation and hot/cold-routing optimisations into one complete document. They weave together two threads — WiscKey (KV separation) and HashKV (hash partitioning / hot-cold separation) — with current implementation details and parameter policy.
One-Page Summary (TL;DR)
Core idea: the LSM stores only Key+ValuePtr while large values are appended sequentially to the vlog; multiple buckets + hot/cold routing localise hot updates, and parallel GC + pressure control keep GC overhead within a stable, bounded range.
| Design point | Borrowed from | NoKV implementation | Direct benefit |
|---|---|---|---|
| KV separation | WiscKey | vlog + ValuePtr | Smaller LSM, more sequential writes |
| Hash partitioning | HashKV | ValueLogBucketCount | Garbage localisation |
| Hot/cold routing | HashKV | HotRing routing | Hotspots don't pollute cold data |
| Parallel GC | Engineering | ValueLogGCParallelism | Higher cleanup throughput |
| Pressure control | Engineering | reduce/skip thresholds | No resource fights with compaction |
1. What We Borrowed from the Papers
1.1 WiscKey
- KV separation: the LSM stores Key + ValuePtr; large values go to the vlog.
- Sequential writes: log appends give stable latency.
- GC is mandatory: old values can only be reclaimed by relocation plus deletion.
1.2 HashKV
- Hash partitioning: the ValueLog is bucketed so a key's historical versions cluster together.
- Hot/cold separation: hot updates touch local buckets; cold data stays stable.
- Light GC: hot buckets are reclaimed frequently, cold buckets rarely.
1.3 References (titles)
- WiscKey: Separating Keys from Values in SSD-conscious Storage
- HashKV: Enabling Efficient Updates in KV Storage via Hashing
2. Design Goals (an Engineering View)
- A minimal write path: mostly sequential appends; no complex index structures.
- GC must not disturb the main path: parallel but regulated; never fight compaction for IO.
- Localise hot updates: confine garbage to hot buckets as far as possible.
- Observable and tunable: make tuning a visible engineering exercise.
2.1 Constraints and Assumptions
- Crash recovery must be reliable: the vlog head and deletion state must be recoverable.
- Write amplification outranks read amplification: we prefer cheaper writes; the read path can tolerate one extra hop.
- GC may yield: GC is background maintenance and must not crush compaction.
3. Architecture Overview (Layered Model)
flowchart TD
subgraph DB["DB Policy Layer"]
VlogGo["vlog.go / vlog_gc.go<br/>write routing + GC scheduling"]
end
subgraph Mgr["ValueLog Manager"]
MgrGo["vlog/manager.go<br/>segments/rotation/IO"]
end
subgraph IO["IO Layer"]
File["file/ (mmap)<br/>LogFile"]
end
DB --> Mgr --> IO
4. Directory Layout and Bucket Structure
<workdir>/
vlog/
bucket-000/
00000.vlog
00001.vlog
bucket-001/
00000.vlog
00001.vlog
...
- Bucketing is enabled when ValueLogBucketCount > 1.
- A ValuePtr now carries Bucket/Fid/Offset/Len, so the LSM can locate values precisely.
4.1 Record Format and ValuePtr Layout
vlog record format (identical to the WAL):
+--------+----------+------+-------------+-----------+-------+
| KeyLen | ValueLen | Meta | ExpiresAt | Key bytes | Value |
+--------+----------+------+-------------+-----------+-------+
+ CRC32 (4B)
ValuePtr layout:
+------+--------+-----+--------+
| Len | Offset | Fid | Bucket |
+------+--------+-----+--------+
| 4B | 4B | 4B | 4B |
This guarantees that the LSM index only needs to hold a ValuePtr to resolve the exact bucket + file + offset.
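The 16-byte layout above round-trips through four fixed-width integers. A minimal sketch of the encode/decode pair follows; the field order matches the diagram, while the little-endian byte order is this sketch's assumption, not necessarily NoKV's actual wire format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ValuePtr mirrors the 16-byte layout in the diagram:
// Len | Offset | Fid | Bucket, four 4-byte unsigned integers.
type ValuePtr struct {
	Len, Offset, Fid, Bucket uint32
}

// Encode serialises the pointer into its fixed 16-byte form.
func (p ValuePtr) Encode() []byte {
	buf := make([]byte, 16)
	binary.LittleEndian.PutUint32(buf[0:4], p.Len)
	binary.LittleEndian.PutUint32(buf[4:8], p.Offset)
	binary.LittleEndian.PutUint32(buf[8:12], p.Fid)
	binary.LittleEndian.PutUint32(buf[12:16], p.Bucket)
	return buf
}

// DecodeValuePtr is the inverse of Encode.
func DecodeValuePtr(buf []byte) ValuePtr {
	return ValuePtr{
		Len:    binary.LittleEndian.Uint32(buf[0:4]),
		Offset: binary.LittleEndian.Uint32(buf[4:8]),
		Fid:    binary.LittleEndian.Uint32(buf[8:12]),
		Bucket: binary.LittleEndian.Uint32(buf[12:16]),
	}
}

func main() {
	p := ValuePtr{Len: 4096, Offset: 8192, Fid: 7, Bucket: 2}
	round := DecodeValuePtr(p.Encode())
	fmt.Println(round == p, len(p.Encode())) // true 16
}
```

The fixed width is what lets the LSM embed pointers inline and compare or copy them without parsing.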
4.2 Manifest and Recovery (a NoKV-Specific Engineering Point)
Unlike the paper prototypes, NoKV writes the vlog head and deletion events into the manifest:
flowchart LR
A["vlog append"] --> B["update head"]
B --> C["manifest edit"]
C --> D["crash recovery"]
D --> E["rebuild vlog state"]
This means recovery does not depend on a full directory scan, avoiding accidentally deleting or reopening segments.
5. Write Path (Sequential Append)
sequenceDiagram
participant C as commitWorker
participant V as vlog.Manager
participant W as WAL
participant M as MemTable
C->>V: AppendEntries(entries)
V-->>C: ValuePtr list
C->>W: Append(entries+ptrs)
C->>M: Apply to memtable
Key guarantee: vlog writes land before the WAL append, so crash recovery never observes dangling pointers.
6. Read Path (Pointer Dereference)
flowchart LR
K["Get(key)"] --> LSM["LSM index lookup"]
LSM -->|inline value| V["return directly"]
LSM -->|ValuePtr| P["locate bucket/fid/offset"]
P --> R["vlog read (mmap)"]
R --> V
The read path pays one extra vlog lookup, in exchange for a smaller LSM and more sequential writes.
6. Hot/Cold Routing (HotRing-Driven)
Heat statistics only watch the write path (write hotspots), so read hotspots do not pollute them:
flowchart TD
E["Entry write"] --> H["HotRing Touch"]
H -->|hot| B1["hot buckets 0..H-1"]
H -->|cold| B2["cold buckets H..N-1"]
B1 --> V["vlog append"]
B2 --> V
Default configuration (tunable):
- ValueLogBucketCount = 16
- ValueLogHotBucketCount = 4
- ValueLogHotKeyThreshold = 8
7. GC Mechanism (Sampling + Rewrite)
sequenceDiagram
participant GC as GC Thread
participant Stats as Discard Stats
participant Old as Old Segment
participant LSM as LSM
participant New as Active Segment
GC->>Stats: pick candidate files
GC->>Old: Sample 10%
GC->>LSM: check whether pointers still reference the old values
alt discard ratio above threshold
loop scan old file
GC->>Old: Read Entry
GC->>LSM: Double Check
alt still live
GC->>New: Rewrite
end
end
GC->>Old: delete old file
else discard ratio too low
GC-->>Stats: skip
end
8. Parallel GC + Pressure Control (the Core Engineering)
8.1 Parallel Scheduling
- ValueLogGCParallelism controls concurrency (automatic by default).
- Per-bucket exclusion: the same bucket is never GC'd concurrently (lock-free CAS).
- A global semaphore caps the number of simultaneous GCs.
8.2 Pressure Control
When compaction pressure is high, GC automatically downgrades or skips:
flowchart LR
A["Compaction Stats"] --> B{"pressure assessment"}
B -->|low| C["parallel GC"]
B -->|medium| D["halve parallelism"]
B -->|high| E["skip this round"]
Threshold parameters:
- ValueLogGCReduceScore / ValueLogGCReduceBacklog
- ValueLogGCSkipScore / ValueLogGCSkipBacklog
8.3 Key Differences from the Paper Implementations
WiscKey vs NoKV
| Dimension | WiscKey | NoKV |
|---|---|---|
| vlog metadata | the prototype does not emphasise a manifest | the manifest records head/deletions |
| GC trigger | scanning and stale ratio | LSM discard stats |
| GC parallelism | not emphasised | multi-bucket parallelism + pressure control |
| Hotspot handling | no explicit hot/cold | HotRing-driven hot/cold buckets |
HashKV vs NoKV
| Dimension | HashKV | NoKV |
|---|---|---|
| Partitioning | hash partitions | hash buckets + hot/cold routing |
| Goal | lower update amplification | lower GC jitter + write amp |
| GC scheduling | per partition | per-bucket parallelism + compaction pressure control |
Conclusion: NoKV keeps the papers' core ideas but hardens recovery consistency, scheduling policy, and observability for production.
9. Observability and Tuning Levers
Key metrics (expvar):
- NoKV.ValueLog.GcParallelism
- NoKV.ValueLog.GcActive
- NoKV.ValueLog.GcScheduled
- NoKV.ValueLog.GcThrottled
- NoKV.ValueLog.GcSkipped
- NoKV.ValueLog.GcRejected
Quick tuning advice:
- Light load: raise ValueLogGCParallelism.
- Heavy load: lower ReduceScore or ReduceBacklog so GC backs off sooner.
10. Costs and Limits
- Too many buckets → file fragmentation and higher head-tracking cost.
- Hot buckets too small → frequent rotation and higher write amplification.
- Too much GC parallelism → possible IO contention with compaction.
11. Wrap-Up
NoKV's vlog design is a textbook "WiscKey + HashKV + engineered scheduling" combination:
- The write path stays sequential with stable latency.
- Multiple buckets + hot/cold routing localise garbage.
- Parallel GC + pressure control balance stability and throughput.
This takes the vlog from "usable" to "operable and scalable".
NoKV's In-Memory Core: Arena Linear Allocation and Adaptive Indexing (ART vs SkipList)
In a high-performance storage engine, memory management directly determines the throughput ceiling and latency stability. NoKV's MemTable layer combines Arena linear allocation with heavily optimised index structures to deliver very fast reads and writes with zero GC pressure. This note dissects the core architecture and engineering trade-offs of this layer.
1. Arena: a "Pointer-Free" Off-Heap Memory Pool
NoKV's Arena (utils/arena.go) is the physical foundation for all in-memory indexes. Through append-only allocation and offset-based addressing, it strips masses of small-object allocations out of the Go runtime heap.
1.1 Physical Layout and Allocation Strategy
An Arena is not a pile of scattered blocks but one contiguous []byte.
- Bump allocation: allocation costs a single atomic add.
- Alignment: to keep atomic operations (such as 64-bit version updates) safe, the Arena enforces alignment:
// AllocAligned guarantees the returned offset is a multiple of align
func (a *Arena) AllocAligned(size, align int) uint32 {
    // compute the alignment padding
    padding := (align - (int(a.n) % align)) % align
    // atomically bump the allocation cursor
    offset := a.Allocate(uint32(size + padding))
    // return the aligned start offset
    return offset + uint32(padding)
}
Why is alignment mandatory? On a modern 64-bit CPU, if an 8-byte uint64 straddles a cache line, the hardware cannot guarantee that writes to it are atomic. Arena alignment makes all metadata updates (SkipList tower links, ART version numbers) physically safe under concurrency.
1.2 The Art of Addressing: uint32 Offsets vs Pointers
Storing millions of node pointers on the Go heap makes GC scans very slow (STW latency spikes). All of NoKV's index nodes reference each other with uint32 offsets instead.
- GC friendliness: offsets are invisible to the GC. During a scan, the GC only needs to see the Arena's one huge []byte slice header instead of recursing through tens of thousands of small objects.
- Memory savings: on 64-bit systems a uint32 (4B) is half the size of a native pointer (8B), nearly doubling the index's effective payload ratio.
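The two ideas above — bump allocation with padding and offset-based addressing — fit in a self-contained sketch. This `arena` type is a minimal illustration of the technique, not utils/arena.go itself; reserving offset 0 as a nil sentinel is this sketch's assumption.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// arena is a minimal bump allocator over a single []byte. Nodes are
// addressed by uint32 offsets into buf rather than pointers, so the GC
// only ever sees one slice header.
type arena struct {
	n   atomic.Uint32 // allocation cursor
	buf []byte
}

func newArena(size uint32) *arena {
	a := &arena{buf: make([]byte, size)}
	a.n.Store(1) // offset 0 is reserved as the nil sentinel
	return a
}

// allocAligned bumps the cursor, padding so the returned offset is a
// multiple of align — keeping 8-byte fields safe for atomic access.
func (a *arena) allocAligned(size, align uint32) uint32 {
	for {
		cur := a.n.Load()
		pad := (align - cur%align) % align
		if cur+pad+size > uint32(len(a.buf)) {
			panic("arena full")
		}
		if a.n.CompareAndSwap(cur, cur+pad+size) {
			return cur + pad // CAS keeps concurrent allocations disjoint
		}
	}
}

func main() {
	a := newArena(1 << 20)
	off := a.allocAligned(24, 8)
	fmt.Println(off%8 == 0) // true: aligned for atomic uint64 access
	copy(a.buf[off:off+5], "hello")
	fmt.Println(string(a.buf[off : off+5])) // hello
}
```

The CAS loop (instead of a plain atomic add) also makes the "arena full" check race-free, at the cost of a retry under contention.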
2. SkipList: the Classic Concurrent-Write Baseline
The SkipList (utils/skiplist.go) is known for its simple implementation and robust concurrency; it is NoKV's baseline index.
2.1 Design
- Multi-level tower index: randomised heights (MaxHeight=20, probability P=0.25) give average $O(\log N)$ lookups.
- Lock-free concurrency: nodes are inserted per level via atomic.CompareAndSwapUint32.
2.2 Insertion Protocol (Add)
SkipList insertion is not a simple Lock -> Insert; it is a multi-stage atomic installation:
- Node creation: preallocate the node in the Arena and set the key/value offsets.
- Find splice: walk down from the top level, recording each level's predecessor (prev) and successor (next).
- Per-level atomic linking: CAS-install from level 0 upward. If a CAS fails (a concurrent insert happened), locally re-find the splice and retry until installed.
func (s *Skiplist) Add(entry *kv.Entry) {
    // 1. preallocate the node with a randomised height
    nodeOffset := s.newNode(entry.Key, entry.Value, height)
    // 2. link level by level with local CAS
    for i := 0; i < height; i++ {
        for {
            prev, next := s.findSpliceForLevel(entry.Key, i)
            // point the new node's next at the found successor
            s.setNextOffset(nodeOffset, i, next)
            // atomically swing the predecessor's next to the new node
            if s.casNextOffset(prev, i, next, nodeOffset) {
                break
            }
        }
    }
}
3. ART: an Adaptive Radix Tree for Peak Performance
ART (Adaptive Radix Tree, utils/art.go) is NoKV's default index, designed to balance modern CPU cache behaviour with memory efficiency.
3.1 Adaptive Node Layout
ART resizes its nodes dynamically by child count, balancing space utilisation against lookup speed:
- Node4 / Node16: linear scans to find children; suits paths with few branches.
- Node48: an indirect index table (256 bytes); very space-efficient.
- Node256: direct array addressing for the fastest $O(k)$ lookups.
3.2 The Ordering Problem: Comparable Route-Key Encoding
A radix tree natively compares raw bytes and cannot express the LSM's composite order (UserKey ascending + Version descending). NoKV solves this with a careful encoding:
- Encoding formula: RouteKey = EncodeComparable(UserKey) + BigEndian(Timestamp).
- Design value: this makes the ART tree's depth-first traversal order exactly equivalent to the LSM's Key + Version ordering, so Seek and range scans work naturally.
3.3 Concurrency Model: COW + OLC-lite
- Fully lock-free reads: COW (copy-on-write) guarantees that readers observe a consistent, immutable node snapshot.
- OLC-lite in-place updates: for point modifications (updating the value of an existing key), ART keeps a fast path that atomically swaps the ValuePtr without copying nodes, greatly reducing allocation overhead.
4. MemTable Atomic Reservation Protocol (Workflow)
When a user calls Set, the LSM kernel coordinates as follows:
sequenceDiagram
participant U as User
participant LSM as LSM Kernel
participant MT as Active MemTable
participant FM as Flush Manager
U->>LSM: Set(entry)
LSM->>LSM: Acquire RLock (Concurrent writers entry)
LSM->>MT: tryReserve(estimate_size)
alt reservation succeeds (Fast Path)
MT->>MT: atomically advance reservedSize
MT->>MT: write Index and WAL concurrently
LSM->>LSM: Release RLock
else out of space (Slow Path - Rotation)
LSM->>LSM: Upgrade to WLock (Stop all writers)
LSM->>LSM: rotateLocked (Seal current MT -> Move to Immutables)
LSM->>FM: submitFlush(oldMemTable)
LSM->>LSM: Create New Active MemTable & WAL Segment
LSM->>LSM: Release WLock
LSM->>U: Retry Set (Go to Fast Path)
end
5. Tuning Parameters and Trade-offs
When choosing the index engine in Options, consider the workload:
| Dimension | SkipList | ART (Default) |
|---|---|---|
| Point-read latency (Get) | $O(\log N)$, more cache misses | $O(k)$, excellent cache locality |
| Range positioning (Seek) | Medium | Very fast (prefix-compression advantage) |
| Memory footprint | Very low (offsets only) | Higher (currently ~2x, including interior nodes) |
| Maintainability | Excellent | Moderate (node upgrade/downgrade logic is complex) |
Summary: NoKV's memory design follows the principle "space is state; concurrency is atomicity". The Arena locks down physical stability while the adaptive index provides a flexible performance ceiling. Together they make NoKV highly resilient under single-node hotspot workloads.
NoKV's Write Pipeline: from an MPSC Metronome to Adaptive Coalescing
A high-performance engine's write path must tick like a metronome. NoKV's write pipeline is more than a concurrent queue: it is an asynchronous coalescing system with adaptive feedback and segment-accurate consistency guarantees. This note explains how NoKV sustains peak write throughput and low tail latency under heavy concurrency.
1. Architecture: an MPSC Coalescing Pipeline
Rather than letting every user goroutine contend for the underlying disk lock or WAL mutex, NoKV uses an MPSC (Multi-Producer, Single-Consumer) asynchronous coalescing model.
1.1 Why MPSC?
In an LSM engine, WAL writes must be strictly ordered. If 1000 user goroutines call write() concurrently, kernel-side context switches and file-lock contention crush the system instantly.
The MPSC model funnels thousands of foreground writers into one background commitWorker, converting random small writes into large sequential disk IO.
1.2 Core Component: commitQueue
commitQueue (db_write.go) is not a plain channel; it is a cooperative system with three parts:
- RingBuffer: a lock-free circular queue based on atomic sequence numbers, for very fast handoff.
- Spaces channel: a "ticket" system that bounds the queue, providing natural backpressure.
- Items channel: a wakeup signal telling the background worker new data has arrived, so it never spins and wastes CPU.
2. Adaptive Coalescing
nextCommitBatch is the soul of the pipeline. Rather than packing fixed-size batches, it adjusts its "throughput mode" to the system load.
2.1 Backlog-Driven Dynamic Limits
The system monitors queue backlog (queueLen) in real time. When backlog grows, the worker automatically "bulks up":
// adjust the batch limits dynamically
backlog := int(cq.queueLen.Load())
if backlog > limitCount {
    // if the queue is backed up, scale the batch proportionally, up to 4x
    factor := min(max(backlog/limitCount, 1), 4)
    limitCount = min(limitCount * factor, hardLimit)
    limitSize *= int64(factor)
}
Design value: this exploits batching economies of scale. The heavier the pressure, the higher the coalescing, and the lower the amortised cost per disk IO — throughput "grows against" load.
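The scaling rule above can be distilled into a pure function for clarity. This is a sketch of the nextCommitBatch logic, not the db_write.go code verbatim; the behaviour below the base limit and the cap values follow the excerpt.

```go
package main

import "fmt"

// adaptiveLimit scales the batch-count limit by the backlog ratio:
// no backlog keeps the base limit; heavier backlog multiplies it by
// backlog/limit, capped at 4x and at a hard ceiling.
func adaptiveLimit(backlog, limitCount, hardLimit int) int {
	if backlog <= limitCount {
		return limitCount
	}
	factor := backlog / limitCount
	if factor > 4 {
		factor = 4 // never grow more than 4x in one step
	}
	limitCount *= factor
	if limitCount > hardLimit {
		limitCount = hardLimit
	}
	return limitCount
}

func main() {
	fmt.Println(adaptiveLimit(32, 64, 512))  // 64: no backlog, base limit
	fmt.Println(adaptiveLimit(128, 64, 512)) // 128: 2x scale-up
	fmt.Println(adaptiveLimit(640, 64, 512)) // 256: factor capped at 4
}
```

Because the factor is derived from the queue itself, the batch size shrinks back to the base limit automatically as soon as the backlog drains.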
2.2 Hot-Aware Multiplier
If a batch contains write-hot keys identified by hotTracker, the system applies HotWriteBatchMultiplier.
- Rationale: multiple writes to a hot key can often be merged. A larger batch gives hot requests more chances to be "folded" in memory before reaching the WAL, greatly easing physical disk bandwidth pressure.
2.3 Coalesce Wait
When the queue is momentarily empty, the worker does not commit immediately; it waits briefly for WriteBatchWait (default 200µs). This tiny pause is the delicate balance point between low latency and high throughput: it lets small bursts of traffic share a single fsync.
3. Core Flow: a Request's Journey to Durability
graph TD
A[User db.Set/Apply] --> B{L0 Throttle Check?}
B -- Blocked --> A
B -- Pass --> C[Encode Internal Key]
C --> D[Push to RingBuffer]
D --> E[Wait for Request.WaitGroup]
subgraph "commitWorker (Background)"
F[nextCommitBatch: Collect Requests] --> G[vlog.write: Value Separation]
G --> H[wal.Append: Durability]
H --> I[lsm.SetBatch: Index Apply]
I --> J{SyncWrites Enabled?}
J -- Yes --> K[wal.Sync: Flush to Disk]
J -- No --> L[Skip Sync]
K --> M[Signal All WaitGroups]
L --> M
end
M --> N[User returns Success/Err]
4. Robustness: Segmented Error Attribution and Rollback
The worst failure mode of a coalescing system is collective punishment. If a batch of 100 requests hits a full disk at request 50, what happens to the remaining 50?
NoKV implements precise failure-path tracking:
- Per-request execution: applyRequests stops at the first error and returns the failedAt index.
- Error isolation:

```go
// finishCommitRequests distributes the results
for i, cr := range batch.reqs {
	if i < failedAt {
		cr.req.Err = nil // earlier requests were already durably written
	} else {
		cr.req.Err = actualErr // everything from the failure point on is marked failed
	}
	cr.req.wg.Done()
}
```

- VLog rollback: if a write fails partway through, the VLog performs a Rewind, moving the file head pointer back to the last known safe point so no incomplete dirty data is left behind.
5. Parameter Derivation and Tuning
| Parameter | Default | Tuning logic |
|---|---|---|
| WriteBatchMaxCount | 64 | Coalescing depth. Raising it increases throughput but stretches P99 latency. |
| WriteBatchWait | 200µs | Coalescing wait. Keep it for sync-write workloads; set it to 0 otherwise. |
| SyncWrites | false | Whether every batch calls fsync. true drops throughput by an order of magnitude but guarantees strong durability. |
| HotWriteMultiplier | 2 | Hot-key coalescing multiplier. For heavily skewed workloads, consider 4. |
Summary: by "sacrificing" a tiny enqueue delay, NoKV's write path buys extremely stable sequential disk-write bandwidth. This funnel design, combined with the adaptive tuning algorithm, is what lets NoKV absorb the bursty write tides of a distributed environment.
NoKV VFS Abstraction: A Storage Contract Across OS Boundaries, with Deterministic Fault Simulation
NoKV's VFS (Virtual File System) layer does not exist to add complexity; it is the cornerstone of the engine's deterministic reliability. By fully decoupling storage semantics from operating-system details, NoKV gains cross-platform atomicity guarantees and extremely aggressive fault-simulation testing.
1. Why Does an Industrial-Grade Storage Engine Need a VFS?
Building a storage engine directly on the native os package causes three fatal problems:
- Missing atomic semantics: an LSM engine relies on the atomicity of Rename to update the Manifest, but whether Rename may overwrite an existing file, and whether it is atomic, varies wildly across operating systems.
- Untestable black box: how do you verify behavior when power is lost in the middle of a Sync? How do you simulate bad sectors without pulling a disk?
- Limited extensibility: if you later want to plug in a distributed file system (HDFS/S3) or a pure in-memory mode, the lack of a VFS means rewriting the entire storage core.
NoKV abstracts all I/O behavior into a single contract via the vfs.FS and vfs.File interfaces (in vfs/vfs.go).
2. FaultFS: A Precision Fault Scalpel
vfs/faultfs.go is NoKV's proudest reliability-testing tool. It wraps the standard file system with a decorator pattern, letting test cases inject extreme faults programmatically.
2.1 Fault Policies (FaultPolicy)
Developers can script remarkably intricate failure scenarios and watch the engine heal itself:
- FailOnce: fail a single operation on a specific file (simulating transient I/O jitter).
- FailOnNth: fail on the Nth operation (e.g. the 100th Write). This is crucial for verifying the idempotence of crash recovery.
- FailOnOp: fail only on heavyweight operations that change file-system metadata, such as Sync or Truncate.
2.2 How It Works

```go
// faultFile's WriteAt implementation
func (f *faultFile) WriteAt(p []byte, off int64) (int, error) {
	// 1. Before the real I/O, run the pre-check against the fault policy
	if err := f.fs.before(OpWriteAt, f.name); err != nil {
		return 0, err // return the simulated fault
	}
	// 2. Perform the real OS call
	return f.File.WriteAt(p, off)
}
```

With this mechanism, NoKV's test suite has successfully simulated:
- The disk filling up halfway through a Manifest write.
- An SSTable whose Sync fails after the file has already been created.
- A permission error striking the WAL in the middle of a rollback.
3. Smoothing Cross-Platform Semantics: The Atomic-Rename Protocol
The lifeline of an LSM engine is the atomic replacement of the Manifest. NoKV wraps this carefully per platform at the VFS layer.
3.1 RenameNoReplace on Linux
On Linux, NoKV uses the unix.RENAME_NOREPLACE flag.
- Design value: if the target Manifest file already exists, the rename fails atomically instead of overwriting it. This categorically prevents stale metadata from clobbering new metadata after an abnormal process restart.
3.2 Emulation on Darwin
Since macOS does not natively support RENAME_NOREPLACE, NoKV emulates the behavior in the VFS layer using getattrlist plus atomic checks, so developers on a Mac exercise exactly the same logical paths as a Linux production deployment.
4. Performance: The PRead/PWrite Concurrency Contract
The VFS is not only about reliability; it also unlocks a lock-free concurrent-read mode.
- PRead semantics (ReadAt): vfs.File mandates a ReadAt implementation.
- Contention-free reads: in the ValueLog read path, many query goroutines can issue ReadAt concurrently on the same file descriptor. Because ReadAt carries its own offset and is atomic, it needs no handle-level mutex, unlike the Seek + Read pattern.
- Result: for large-value reads on a multi-core machine, NoKV's throughput scales almost linearly with the number of CPU cores.
5. Comparison with Other Storage Engines
| Feature | NoKV | Pebble (CockroachDB) | RocksDB |
|---|---|---|---|
| VFS core architecture | Lean interfaces + decorator injection | Deeply integrated (errorfs) | Heavyweight Env/FileSystem abstraction |
| Fault-injection depth | Strong (per-path/per-op counting) | Very strong (many counting policies) | Moderate (relies on Env hooks) |
| Concurrent-read contract | Mandatory PRead/PWrite | Heavily optimized PRead | OS-dependent |
| Cross-platform atomicity | Smooths Linux/Darwin differences | Guaranteed via the Go runtime | Plugin-specific implementations |
Summary: the VFS is not code overhead; it is a commitment to owning every line of disk I/O. Through the VFS, NoKV collapses complex low-level system calls and unpredictable hardware faults into a predictable, testable, provable deterministic model.