Testing & Validation Matrix

This document inventories NoKV’s automated coverage and provides guidance for extending tests. It aligns module-level unit tests, integration suites, and benchmarking harnesses with the architectural features described elsewhere.


1. Quick Commands

# All unit + integration tests (uses local module caches)
GOCACHE=$PWD/.gocache GOMODCACHE=$PWD/.gomodcache go test ./...

# Focused distributed transaction suite
go test ./percolator/... ./raftstore/client/... -run 'Test.*(Commit|Prewrite|TwoPhaseCommit)'

# Focused distributed migration / membership / restart suite
go test ./raftstore/integration -count=1

# Crash recovery scenarios
RECOVERY_TRACE_METRICS=1 \
go test ./... -run 'TestRecovery(RemovesStaleValueLogSegment|FailsOnMissingSST|FailsOnCorruptSST|ManifestRewriteCrash|SlowFollowerSnapshotBacklog|SnapshotExportRoundTrip|WALReplayRestoresData)' -count=1 -v

# Protobuf schema hygiene
make proto-check

# gRPC transport chaos tests + watchdog metrics
CHAOS_TRACE_METRICS=1 \
go test -run 'TestGRPCTransport(HandlesPartition|MetricsWatchdog|MetricsBlockedPeers)' -count=1 -v ./raftstore/transport

# Sample Coordinator service for shared TSO / routing in distributed tests
go run ./cmd/nokv coordinator --addr 127.0.0.1:2379 --id-start 1 --ts-start 100 --workdir ./artifacts/coordinator

# Local three-node cluster (includes catalog bootstrap + Coordinator)
./scripts/dev/cluster.sh --config ./raft_config.example.json
# Tear down with Ctrl+C

# Docker-compose sandbox (3 nodes + Coordinator)
docker compose up -d
docker compose down -v

# Build RocksDB locally (installs into ./third_party/rocksdb/dist by default)
./scripts/build_rocksdb.sh

# YCSB baseline (records=1e6, ops=1e6, warmup=1e5, conc=16)
./scripts/run_benchmarks.sh

# YCSB with RocksDB (requires CGO, `benchmark_rocksdb`, and the RocksDB build above)
LD_LIBRARY_PATH="$(pwd)/third_party/rocksdb/dist/lib:${LD_LIBRARY_PATH}" \
CGO_CFLAGS="-I$(pwd)/third_party/rocksdb/dist/include" \
CGO_LDFLAGS="-L$(pwd)/third_party/rocksdb/dist/lib -lrocksdb -lz -lbz2 -lsnappy -lzstd -llz4" \
YCSB_ENGINES="nokv,badger,rocksdb" ./scripts/run_benchmarks.sh

# One-click script (auto-detect RocksDB, supports `YCSB_*` env vars to override defaults)
./scripts/run_benchmarks.sh

# Quick smoke run (smaller dataset)
NOKV_RUN_BENCHMARKS=1 YCSB_RECORDS=10000 YCSB_OPS=50000 YCSB_WARM_OPS=0 \
./scripts/run_benchmarks.sh -ycsb_workloads=A -ycsb_engines=nokv

Tip: Pin GOCACHE/GOMODCACHE in CI to keep build artefacts local and avoid permission issues.


2. Module Coverage Overview

| Module | Tests | Coverage Highlights | Gaps / Next Steps |
| --- | --- | --- | --- |
| WAL | engine/wal/manager_test.go | Segment rotation, sync semantics, replay tolerance for truncation, directory bootstrap. | Add IO fault injection, concurrent append stress. |
| LSM / Flush / Compaction | engine/lsm/lsm_test.go, engine/lsm/picker_test.go, engine/lsm/planner_test.go, engine/lsm/compaction_test.go, engine/lsm/flush_runtime_test.go | Memtable correctness, iterator merging, flush pipeline metrics, compaction scheduling. | Extend backpressure assertions and workload-shape coverage. |
| Manifest | engine/manifest/manager_test.go, engine/lsm/manifest_test.go | CURRENT swap safety, rewrite crash handling, vlog metadata persistence. | Simulate partial edit corruption, column family extensions. |
| ValueLog | engine/vlog/manager_test.go, engine/vlog/io_test.go, vlog_test.go | ValuePtr encoding/decoding, GC rewrite/rewind, concurrent iterator safety. | Long-running GC, discard-ratio edge cases. |
| Percolator / Distributed Txn | percolator/*_test.go, raftstore/client/client_test.go, stats_test.go | Prewrite/Commit/ResolveLock flows, 2PC retries, timestamp-driven MVCC behaviour, metrics accounting. | Mixed multi-region fuzzing with lock TTL and leader churn. |
| DB Integration | db_test.go, db_bench_test.go | End-to-end writes, recovery, and throttle behaviour. | Combine ValueLog GC + compaction stress, multi-DB interference. |
| CLI & Stats | cmd/nokv/main_test.go, stats_test.go | Golden JSON output, stats snapshot correctness, hot key ranking. | CLI error handling, expvar HTTP integration tests. |
| Scripts & Tooling | cmd/nokv-config/main_test.go, cmd/nokv/serve_test.go | nokv-config JSON/simple formats, catalog bootstrap CLI, serve bootstrap behavior. | Add direct shell-script golden tests (currently not present) and failure-path diagnostics for cluster.sh. |
| Distributed Migration & Membership | raftstore/integration/*_test.go, raftstore/migrate/*_test.go, raftstore/admin/service_test.go | Standalone -> seeded -> cluster flow, snapshot install, add/remove peer, leader transfer, restart/dehost recovery, Coordinator outage after startup, quorum-loss context propagation, multi-region 2PC deadline propagation, repeated link flap during membership changes, partitioned follower catch-up, and snapshot-install interruption before publish. | Keep expanding publish-boundary coverage and larger fault matrices around runtime/transport interleavings. |
| Benchmark | benchmark/ycsb/ycsb_test.go, benchmark/ycsb/ycsb_runner.go | YCSB throughput/latency comparisons across engines (A-F) with detailed percentile + operation mix reporting. | Automate multi-node deployments and add longer-running, multi-GB stability baselines. |

3. System Scenarios

| Scenario | Coverage | Focus |
| --- | --- | --- |
| Crash recovery | db_test.go | WAL replay, fail-fast on missing/corrupt SST (manifest preserved for investigation), vlog GC restart, manifest rewrite safety. |
| WAL pointer desync | raftstore/raftlog/wal_storage_test.go::TestWALStorageDetectsTruncatedSegment | Detects store-local raft pointer offsets beyond truncated WAL tails to avoid silent corruption. |
| Distributed transaction contention | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, percolator/*_test.go | Lock conflicts, retries, and 2PC sequencing under region routing. |
| Value separation + GC | engine/vlog/manager_test.go, db_test.go::TestRecoveryRemovesStaleValueLogSegment | GC correctness, manifest integration, iterator stability. |
| Iterator consistency | engine/lsm/iterator_test.go | Snapshot visibility, merging iterators across levels and memtables. |
| Throttling / backpressure | engine/lsm/compaction_test.go, db_test.go::TestWriteThrottle | L0 backlog triggers, flush queue growth, metrics observation. |
| Distributed NoKV client | raftstore/client/client_test.go::TestClientTwoPhaseCommitAndGet, raftstore/transport/grpc_transport_test.go::TestGRPCTransportManualTicksDriveElection | Region-aware routing, NotLeader retries, manual tick-driven elections, cross-region 2PC sequencing. |
| Migration & membership orchestration | raftstore/integration/migration_flow_test.go, raftstore/integration/restart_recovery_test.go, raftstore/integration/coordinator_degraded_test.go, raftstore/integration/snapshot_interruption_test.go, raftstore/integration/context_propagation_test.go, raftstore/integration/transport_chaos_test.go | Seed bootstrap, multi-peer rollout, leader transfer, peer removal, restarted follower recovery, removed-peer dehost after restart, Coordinator outage after startup, quorum-loss read/write timeouts, split-region 2PC deadline propagation, repeated link flap during membership changes, partitioned follower catch-up, transfer-leader retry after partition recovery, and snapshot-install interruption before publish. |
| Performance regression | benchmark package | Compare NoKV vs Badger/Pebble by default (RocksDB optional), produce human-readable reports under benchmark/benchmark_results. |

4. Observability in Tests

  • RECOVERY_METRIC logs – produced when RECOVERY_TRACE_METRICS=1; helpful when triaging targeted recovery suites and CI failures.
  • TRANSPORT_METRIC logs – emitted by transport chaos tests when CHAOS_TRACE_METRICS=1, capturing gRPC watchdog counters during network partitions and retries.
  • Stats snapshots – stats_test.go verifies JSON structure so CLI output remains backwards compatible.
  • Benchmark artefacts – stored under benchmark/results/ for shared suites and under suite-local result directories where applicable, for example benchmark/results/ycsb/.

5. Extending Coverage

  1. Property-based testing – integrate testing/quick or third-party generators to randomise distributed 2PC sequences (prewrite/commit/rollback ordering).
  2. Stress harness – add a Go-based stress driver to run mixed read/write workloads for hours, capturing metrics akin to RocksDB’s db_stress tool.
  3. Distributed readiness – strengthen raftstore fault-injection and long-run tests (leader transfer, transport chaos, snapshot catch-up) with reproducible CI artifacts.
  4. CLI smoke tests – simulate corrupted directories to ensure CLI emits actionable errors.

6. Distributed Test Layers

  • Protocol unit tests: package-local tests under raftstore/peer, raftstore/store, raftstore/admin, raftstore/snapshot, and raftstore/migrate validate one protocol surface at a time.
  • Node-local integration tests: store/admin tests verify snapshot install, membership application, and region runtime publication without booting a full cluster.
  • Multi-node deterministic data-plane integration tests: raftstore/integration uses raftstore/testcluster to boot real stores, wire transports, and drive migration/member flows against live runtimes.
  • Multi-node deterministic control-plane integration tests: coordinator/integration/*_test.go uses coordinator/testcluster to boot three coordinators with replicated meta. These tests exercise rooted watch/reload propagation, follower write rejection, allocator-fence/remove-region propagation, and control-plane read staleness, and keep those cases out of the store/data-plane tests.
  • Restart and recovery suites: raftstore/integration/restart_recovery_test.go covers restarted followers, removed-peer dehost persistence, and leader restart with subsequent membership changes.
  • Control-plane degradation and publish-boundary tests: raftstore/integration/coordinator_degraded_test.go and raftstore/integration/snapshot_interruption_test.go cover live Coordinator outage after startup and failpoint-driven snapshot interruption before peer publication.

When adding new distributed tests:

  • use raftstore/testcluster for store/data-plane behavior
  • use coordinator/testcluster for control-plane / replicated-root behavior
  • avoid embedding ad-hoc cluster bootstrap helpers into feature-specific test files

7. Distributed Fault Matrix

| Fault Class | Current Coverage | Primary Tests | Notes |
| --- | --- | --- | --- |
| Snapshot export/install failure | Covered | raftstore/migrate/expand_test.go, raftstore/store/peer_lifecycle_test.go, raftstore/admin/service_test.go | Covers leader export failure, target install failure, and corrupt payload rejection without partially hosted peers. |
| Membership wait timeouts | Covered | raftstore/migrate/expand_test.go, raftstore/migrate/remove_peer_test.go, raftstore/migrate/transfer_leader_test.go | Verifies timeout surfaces when leader metadata does not publish, target never hosts, peer removal never converges, or leader transfer stalls. |
| Follower restart after snapshot install | Covered | raftstore/integration/restart_recovery_test.go::TestExpandedPeerRestartPreservesRegionAndData | Ensures installed peer persists region metadata and data after restart. |
| Removed peer restart | Covered | raftstore/integration/restart_recovery_test.go::TestRemovedPeerRestartDoesNotRehost | Ensures dehosted peers do not come back after restart. |
| Leader restart with follow-up membership change | Covered | raftstore/integration/restart_recovery_test.go::TestLeaderRestartStillAllowsMembershipChanges | Exercises leadership churn before a later remove-peer operation. |
| Control-plane degraded / Coordinator unavailable | Covered | coordinator/adapter/scheduler_client_test.go, raftstore/store/command_ops_test.go::TestStoreProposeCommandSurvivesSchedulerUnavailable, raftstore/integration/coordinator_degraded_test.go::TestClusterSurvivesCoordinatorUnavailableAfterStartup | Covers both local degraded scheduler semantics and live multi-node Coordinator outage after route cache warmup; new cold-route misses still fail with RouteUnavailable as expected. |
| Scheduler queue overflow / dropped operations | Covered | raftstore/store/scheduler_runtime_test.go::TestStoreSchedulerStatusTracksQueueDrop | Validates local degraded status and dropped operation accounting. |
| Snapshot install interrupted before publish | Covered | raftstore/integration/snapshot_interruption_test.go::TestExpandSnapshotInstallInterruptedBeforePublish, raftstore/store/peer_lifecycle_test.go::TestStoreInstallRegionSnapshotRejectsCorruptPayload | Uses failpoint injection to verify target install aborts without leaving a hosted peer or polluted region metadata, then retries cleanly after restart. |
| Request cancel / deadline propagation | Covered | raftstore/client/client_test.go::TestClientGetHonorsCanceledContextDuringRouteLookup, raftstore/client/client_test.go::TestClientGetHonorsCanceledContextDuringRPC, raftstore/client/client_test.go::TestClientPutHonorsCanceledContextDuringRouteLookup, raftstore/client/client_test.go::TestClientPutHonorsCanceledContextDuringRPC, raftstore/client/client_test.go::TestClientTwoPhaseCommitHonorsCanceledContextDuringMultiRegionRouteLookup, raftstore/client/client_test.go::TestClientTwoPhaseCommitHonorsCanceledContextDuringMultiRegionRPC, raftstore/client/client_test.go::TestClientResolveLocksHonorsCanceledContextDuringMultiRegionRPC, raftstore/integration/context_propagation_test.go::TestClientReadWriteHonorContextUnderQuorumLoss, raftstore/integration/context_propagation_test.go::TestClientTwoPhaseCommitHonorsContextAcrossSplitRegionsUnderPartialQuorumLoss | Verifies read/write paths plus multi-region 2PC and resolve-lock flows preserve caller cancellation/deadlines through route lookup, RPC, and live split-region quorum loss instead of collapsing to generic retry exhaustion. |
| Transport partition / interleave recovery | Covered | raftstore/transport/grpc_transport_test.go::TestGRPCTransportHandlesPartition, raftstore/transport/grpc_transport_test.go::TestGRPCTransportFailpointBeforeSendRPCRecoversAfterClear, raftstore/peer/peer_test.go::TestPeerFailpointAfterReadyAdvanceBeforeSendRecoversOnLaterTicks, raftstore/integration/transport_chaos_test.go::TestPartitionedFollowerCatchesUpAfterRecovery, raftstore/integration/transport_chaos_test.go::TestTransferLeaderRecoversAfterPartitionedTargetReturns, raftstore/integration/transport_chaos_test.go::TestRepeatedLinkFlapConvergesDuringMembershipChanges | Covers low-level gRPC link blocking, send-boundary failpoints, Ready advance/send publication gaps, repeated link flaps during membership operations, and live cluster recovery after follower isolation/restart plus transfer-leader timeout/retry under transport partitions. |
| Split/merge restart safety | Covered | raftstore/store/store_test.go::TestStoreRestartPreservesSplitMergeLocalMeta, raftstore/integration/split_merge_recovery_test.go::TestSplitMergeRestartSafetyAcrossStores | Covers store-local recovery plus live multi-store split -> restart -> merge -> restart flow after making split/merge admin replay idempotent across restart. |

Next fault-matrix additions should focus on:

  • more publish-boundary failpoints around snapshot install and migration init
  • deeper transport/interleave chaos beyond partition + recovery, especially more concurrent membership combinations and repeated multi-link flaps

Keep this matrix updated when adding new modules or scenarios so documentation and automation remain aligned.