Percolator Distributed Transaction Design
This document explains NoKV’s distributed transaction path implemented by percolator/ and executed through raftstore.
The scope here is the current code path:
- Prewrite
- Commit
- BatchRollback
- ResolveLock
- CheckTxnStatus
- MVCC read visibility (KvGet/KvScan through percolator.Reader)
1. Where It Runs
Percolator logic is executed on the Raft apply path:
- Client sends NoKV RPC (KvPrewrite, KvCommit, …).
- raftstore/kv/service.go wraps it into a RaftCmdRequest.
- Store proposes command through Raft.
- On apply, raftstore/kv/apply.go dispatches to percolator.*.
sequenceDiagram
participant C as raftstore/client
participant S as kv.Service
participant R as Raft (leader->followers)
participant A as kv.Apply
participant P as percolator
participant DB as NoKV DB
C->>S: KvPrewrite/KvCommit...
S->>R: ProposeCommand(RaftCmdRequest)
R->>A: Apply committed log
A->>P: percolator.Prewrite/Commit...
P->>DB: CFDefault/CFLock/CFWrite reads+writes
A-->>S: RaftCmdResponse
S-->>C: NoKV RPC response
Key files:
- percolator/txn.go
- percolator/reader.go
- percolator/codec.go
- percolator/latch/latch.go
- raftstore/kv/apply.go
- raftstore/client/client.go
1.1 RPC to Percolator Function Mapping
| NoKV RPC | kv.Apply branch | Percolator function |
|---|---|---|
| KvPrewrite | CMD_PREWRITE | Prewrite |
| KvCommit | CMD_COMMIT | Commit |
| KvBatchRollback | CMD_BATCH_ROLLBACK | BatchRollback |
| KvResolveLock | CMD_RESOLVE_LOCK | ResolveLock |
| KvCheckTxnStatus | CMD_CHECK_TXN_STATUS | CheckTxnStatus |
| KvGet | CMD_GET | Reader.GetLock + Reader.GetValue |
| KvScan | CMD_SCAN | Reader.GetLock + CFWrite iteration + GetInternalEntry |
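The mapping table can be read as a single switch on the command type. The following is a minimal sketch only: `CmdType`, its constants, and `handlerFor` are hypothetical names invented for illustration, and handlers are returned as strings because the real function signatures are not shown in this document.

```go
package main

import "fmt"

// CmdType is a hypothetical stand-in for the command enum carried in a RaftCmdRequest.
type CmdType int

const (
	CmdPrewrite CmdType = iota
	CmdCommit
	CmdBatchRollback
	CmdResolveLock
	CmdCheckTxnStatus
	CmdGet
	CmdScan
)

// handlerFor names the percolator entry point for a command, following the table.
func handlerFor(c CmdType) (string, error) {
	switch c {
	case CmdPrewrite:
		return "percolator.Prewrite", nil
	case CmdCommit:
		return "percolator.Commit", nil
	case CmdBatchRollback:
		return "percolator.BatchRollback", nil
	case CmdResolveLock:
		return "percolator.ResolveLock", nil
	case CmdCheckTxnStatus:
		return "percolator.CheckTxnStatus", nil
	case CmdGet, CmdScan:
		// Both read commands go through percolator.Reader.
		return "percolator.Reader", nil
	default:
		return "", fmt.Errorf("unsupported command type %d", int(c))
	}
}

func main() {
	for _, c := range []CmdType{CmdPrewrite, CmdScan, CmdType(99)} {
		name, err := handlerFor(c)
		fmt.Println(name, err)
	}
}
```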
2. MVCC Data Model
NoKV uses three MVCC column families:
- CFDefault: stores user values at start_ts
- CFLock: stores lock metadata at fixed lockColumnTs = MaxUint64
- CFWrite: stores commit records at commit_ts
2.1 Lock Record
percolator.Lock (encoded by EncodeLock):
- Primary
- Ts (start timestamp)
- TTL
- Kind (Put/Delete/Lock)
- MinCommitTs
2.2 Write Record
percolator.Write (encoded by EncodeWrite):
- Kind
- StartTs
- ShortValue (codec supports it; current commit path does not populate it)
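To make the layout concrete, here is a small in-memory model of the three column families and the state a committed Put leaves behind. It is a sketch under assumptions: the Go field types, the `Kind` enum, the `versioned` key, the `Store` maps, and `committedPut` are all hypothetical, and the real on-disk encoding (EncodeLock/EncodeWrite) is not reproduced.

```go
package main

import (
	"fmt"
	"math"
)

// Kind is a hypothetical enum covering the lock and write kinds named above.
type Kind int

const (
	KindPut Kind = iota
	KindDelete
	KindLock
	KindRollback
)

// Lock and Write mirror the field lists above; the concrete Go types are guesses.
type Lock struct {
	Primary     []byte
	Ts          uint64
	TTL         uint64
	Kind        Kind
	MinCommitTs uint64
}

type Write struct {
	Kind       Kind
	StartTs    uint64
	ShortValue []byte
}

// lockColumnTs is the fixed version used for every CFLock entry.
const lockColumnTs uint64 = math.MaxUint64

// versioned identifies one (user key, timestamp) cell inside a column family.
type versioned struct {
	Key string
	Ts  uint64
}

// Store models the three column families as in-memory maps.
type Store struct {
	Default map[versioned][]byte
	Lock    map[versioned]Lock
	Write   map[versioned]Write
}

// committedPut returns the state after a Put prewritten at startTs and
// committed at commitTs: value in CFDefault at startTs, commit record in
// CFWrite at commitTs pointing back to startTs, and no remaining lock.
func committedPut(key string, value []byte, startTs, commitTs uint64) Store {
	s := Store{Default: map[versioned][]byte{}, Lock: map[versioned]Lock{}, Write: map[versioned]Write{}}
	// Prewrite: store the value and the lock.
	s.Default[versioned{key, startTs}] = value
	s.Lock[versioned{key, lockColumnTs}] = Lock{Primary: []byte(key), Ts: startTs, Kind: KindPut}
	// Commit: store the commit record, then remove the lock.
	s.Write[versioned{key, commitTs}] = Write{Kind: KindPut, StartTs: startTs}
	delete(s.Lock, versioned{key, lockColumnTs})
	return s
}

func main() {
	s := committedPut("k", []byte("v"), 10, 20)
	w := s.Write[versioned{"k", 20}]
	fmt.Printf("start_ts=%d value=%s locks=%d\n", w.StartTs, s.Default[versioned{"k", w.StartTs}], len(s.Lock))
}
```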
3. Concurrency Control: Latches
Before mutating keys, percolator acquires striped latches:
- latch.Manager hashes keys to stripe mutexes.
- Stripes are deduplicated and acquired in sorted order to avoid deadlocks.
- Guard releases in reverse order.
In raftstore/kv, latches are passed explicitly:
- NewEntryApplier creates one latch.NewManager(512) and reuses it.
- Apply/NewApplier accept an injected manager; nil falls back to latch.NewManager(512).
This serializes conflicting apply operations on overlapping keys in one node.
4. Two-Phase Commit Flow
Client side (raftstore/client.Client.TwoPhaseCommit):
- Group mutations by region.
- Prewrite primary region.
- Prewrite secondary regions.
- Commit primary region.
- Commit secondary regions.
sequenceDiagram
participant Cli as Client
participant R1 as Region(primary)
participant R2 as Region(secondary)
Cli->>R1: Prewrite(primary + local muts)
Cli->>R2: Prewrite(secondary muts)
Cli->>R1: Commit(keys,startTs,commitTs)
Cli->>R2: Commit(keys,startTs,commitTs)
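The ordering above can be expressed as a small planning function. This sketch only computes the order of requests and sends nothing; `Mutation`, `planTwoPhaseCommit`, and `locateByFirstByte` are hypothetical names, and it assumes the primary key is one of the mutations.

```go
package main

import (
	"fmt"
	"sort"
)

// Mutation is a hypothetical key/value change.
type Mutation struct{ Key, Value string }

// locateByFirstByte is a toy region lookup: keys below "m" live in region 1.
func locateByFirstByte(key string) uint64 {
	if key != "" && key[0] < 'm' {
		return 1
	}
	return 2
}

// planTwoPhaseCommit groups mutations by region and returns the request order:
// prewrite primary region, prewrite secondaries, commit primary, commit secondaries.
func planTwoPhaseCommit(primary string, muts []Mutation, locate func(string) uint64) []string {
	groups := make(map[uint64][]Mutation)
	for _, m := range muts {
		r := locate(m.Key)
		groups[r] = append(groups[r], m)
	}
	primaryRegion := locate(primary)
	secondaries := make([]uint64, 0, len(groups))
	for r := range groups {
		if r != primaryRegion {
			secondaries = append(secondaries, r)
		}
	}
	// Sort secondaries only to make the plan deterministic.
	sort.Slice(secondaries, func(i, j int) bool { return secondaries[i] < secondaries[j] })
	order := append([]uint64{primaryRegion}, secondaries...)

	steps := make([]string, 0, 2*len(order))
	for _, phase := range []string{"prewrite", "commit"} {
		for _, r := range order {
			steps = append(steps, fmt.Sprintf("%s region %d (%d keys)", phase, r, len(groups[r])))
		}
	}
	return steps
}

func main() {
	muts := []Mutation{{"a", "1"}, {"z", "2"}, {"b", "3"}}
	for _, s := range planTwoPhaseCommit("a", muts, locateByFirstByte) {
		fmt.Println(s)
	}
}
```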
5. Write-Side Operations
5.1 Prewrite
Prewrite runs mutation-by-mutation:
- Check existing lock on key:
  - if lock exists with different Ts -> KeyError.Locked
- Check latest committed write:
  - if commit_ts >= req.start_version -> WriteConflict
- Apply data intent:
  - Put: write value into CFDefault at start_ts
  - Delete/Lock: delete default value at start_ts (if exists)
- Write lock into CFLock at lockColumnTs
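The two conflict checks that gate a prewrite can be sketched as a pure function over the key's current state. The `keyState` type, the error values, and `checkPrewrite` are hypothetical; the real code returns KeyError entries rather than Go errors.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errLocked        = errors.New("key locked by another transaction")
	errWriteConflict = errors.New("write conflict")
)

// keyState is a hypothetical snapshot of what Prewrite reads for one key.
type keyState struct {
	LockTs         *uint64 // start_ts of the existing lock, nil if unlocked
	LatestCommitTs *uint64 // commit_ts of the newest CFWrite record, nil if none
}

// checkPrewrite applies the lock check first, then the write-conflict check.
func checkPrewrite(st keyState, startTs uint64) error {
	if st.LockTs != nil && *st.LockTs != startTs {
		return errLocked
	}
	if st.LatestCommitTs != nil && *st.LatestCommitTs >= startTs {
		return errWriteConflict
	}
	return nil
}

func main() {
	other, newer := uint64(7), uint64(15)
	fmt.Println(checkPrewrite(keyState{LockTs: &other}, 10))
	fmt.Println(checkPrewrite(keyState{LatestCommitTs: &newer}, 10))
	fmt.Println(checkPrewrite(keyState{}, 10))
}
```

A lock with the same Ts passes the first check, which is what lets a retried prewrite from the same transaction succeed.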
5.2 Commit
For each key:
- Read lock
- If no lock:
  - if write with same start_ts exists -> idempotent success
  - else -> abort (lock not found)
- If lock Ts != start_version -> KeyError.Locked
- commitKey:
  - if min_commit_ts > commit_version -> CommitTsExpired
  - if write with same start_ts already exists:
    - rollback write -> abort
    - write with different commit ts -> treat success, clean lock
    - same commit ts -> success
  - else write CFWrite[key@commit_ts] = {kind,start_ts}
  - remove lock from CFLock
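The per-key commit decision can be written as a pure function. This is a sketch transcribed from the list above, not the actual commitKey: the types, error values, and returned action strings are hypothetical. The no-lock branch follows the list literally and does not distinguish the kind of the existing write, since the list does not say how a rollback record is handled there.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errLockNotFound    = errors.New("lock not found")
	errLocked          = errors.New("key locked by another transaction")
	errCommitTsExpired = errors.New("commit ts expired")
	errRolledBack      = errors.New("transaction already rolled back")
)

type lockRec struct{ Ts, MinCommitTs uint64 }

// writeRec is the CFWrite record whose StartTs equals the committing
// transaction's start_ts, if one exists.
type writeRec struct {
	Rollback bool
	CommitTs uint64
}

// commitDecision returns the action for one key:
// "write" (store the commit record and remove the lock),
// "clean-lock" (record already exists, only remove the lock), or
// "noop" (already committed and no lock left).
func commitDecision(lock *lockRec, existing *writeRec, startTs, commitTs uint64) (string, error) {
	if lock == nil {
		if existing != nil {
			return "noop", nil // idempotent success
		}
		return "", errLockNotFound
	}
	if lock.Ts != startTs {
		return "", errLocked
	}
	if lock.MinCommitTs > commitTs {
		return "", errCommitTsExpired
	}
	if existing != nil {
		if existing.Rollback {
			return "", errRolledBack
		}
		// Already committed (same or different commit ts): just drop the lock.
		return "clean-lock", nil
	}
	return "write", nil
}

func main() {
	fmt.Println(commitDecision(&lockRec{Ts: 10}, nil, 10, 20))
	fmt.Println(commitDecision(nil, nil, 10, 20))
	fmt.Println(commitDecision(&lockRec{Ts: 10, MinCommitTs: 25}, nil, 10, 20))
}
```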
5.3 BatchRollback
For each key:
- If already has write at start_ts:
  - rollback marker already exists -> success
  - non-rollback write exists -> success (already committed)
- Remove lock (if any)
- Remove default value at start_ts (if any)
- Write rollback marker to CFWrite at start_ts
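A compact way to see why BatchRollback is safe to repeat is to model one key's state and apply the steps to it. The `keyState` struct and `batchRollbackKey` are hypothetical simplifications; the write kind is a plain string here.

```go
package main

import "fmt"

// keyState is a hypothetical per-key view for one transaction's start_ts.
type keyState struct {
	HasLock    bool
	HasDefault bool   // value present in CFDefault at start_ts
	WriteKind  string // "" if no write for start_ts, else "put", "delete", "rollback"
}

// batchRollbackKey returns the new state and whether anything changed.
func batchRollbackKey(st keyState) (keyState, bool) {
	if st.WriteKind != "" {
		// Either a rollback marker or a commit already exists: nothing to do.
		return st, false
	}
	st.HasLock = false
	st.HasDefault = false
	st.WriteKind = "rollback"
	return st, true
}

func main() {
	st, changed := batchRollbackKey(keyState{HasLock: true, HasDefault: true})
	fmt.Println(st, changed)
	st, changed = batchRollbackKey(st) // second call is a no-op
	fmt.Println(st, changed)
}
```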
5.4 ResolveLock
- commit_version == 0 -> rollback matching locks
- commit_version > 0 -> commit matching locks
- Returns number of resolved keys
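As a rough illustration, the sketch below resolves locks held in a map from key to lock start_ts. `resolveLocks` and the map representation are hypothetical; the actual commit and rollback work per key is reduced to removing the lock and reporting which action was chosen.

```go
package main

import "fmt"

// resolveLocks removes every lock whose start_ts matches startTs and reports
// how many keys were resolved and which action commit_version selected.
func resolveLocks(locks map[string]uint64, startTs, commitVersion uint64) (int, string) {
	action := "commit"
	if commitVersion == 0 {
		action = "rollback"
	}
	resolved := 0
	for key, ts := range locks {
		if ts != startTs {
			continue // lock belongs to another transaction
		}
		delete(locks, key) // deleting during range is safe in Go
		resolved++
	}
	return resolved, action
}

func main() {
	locks := map[string]uint64{"a": 10, "b": 10, "c": 11}
	n, action := resolveLocks(locks, 10, 0)
	fmt.Println(n, action, len(locks))
}
```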
6. Transaction Status Check
CheckTxnStatus targets the primary key and decides whether the transaction is still alive, already committed, or should be rolled back.
Decision order:
- Read lock on primary
- If lock exists but lock.ts != req.lock_ts -> KeyError.Locked
- If lock exists and TTL expired (current_ts >= lock.ts + ttl):
  - rollback primary
  - action = TTLExpireRollback
- If lock exists and caller pushes timestamp:
  - min_commit_ts = max(min_commit_ts, caller_start_ts+1)
  - action = MinCommitTsPushed
- If no lock, check write by start_ts:
  - committed write -> return commit_version
  - rollback write -> action LockNotExistRollback
- If no lock and no write, and rollback_if_not_exist is true:
  - write rollback marker
  - action LockNotExistRollback
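The decision order can be sketched as one function. The types and `checkTxnStatus` below are hypothetical. Two points are assumptions rather than documented behavior: a non-zero `CallerStartTs` is taken to mean "caller pushes timestamp", and the case of no lock, no write, and rollback_if_not_exist being false simply reports no action.

```go
package main

import (
	"errors"
	"fmt"
)

type Action int

const (
	NoAction Action = iota
	TTLExpireRollback
	MinCommitTsPushed
	LockNotExistRollback
)

var errLocked = errors.New("primary locked by another transaction")

type lockRec struct{ Ts, TTL, MinCommitTs uint64 }

type writeRec struct {
	Rollback bool
	CommitTs uint64
}

type request struct {
	LockTs, CurrentTs, CallerStartTs uint64
	RollbackIfNotExist               bool
}

type result struct {
	Action         Action
	CommitVersion  uint64 // set when the transaction is already committed
	NewMinCommitTs uint64 // set when the lock is still alive
}

func checkTxnStatus(lock *lockRec, write *writeRec, req request) (result, error) {
	if lock != nil {
		if lock.Ts != req.LockTs {
			return result{}, errLocked
		}
		if req.CurrentTs >= lock.Ts+lock.TTL {
			return result{Action: TTLExpireRollback}, nil
		}
		if req.CallerStartTs != 0 { // assumption: non-zero means "push"
			pushed := lock.MinCommitTs
			if req.CallerStartTs+1 > pushed {
				pushed = req.CallerStartTs + 1
			}
			return result{Action: MinCommitTsPushed, NewMinCommitTs: pushed}, nil
		}
		return result{NewMinCommitTs: lock.MinCommitTs}, nil
	}
	if write != nil {
		if write.Rollback {
			return result{Action: LockNotExistRollback}, nil
		}
		return result{CommitVersion: write.CommitTs}, nil
	}
	if req.RollbackIfNotExist {
		return result{Action: LockNotExistRollback}, nil
	}
	return result{}, nil // assumption: unspecified case reports no action
}

func main() {
	r, err := checkTxnStatus(&lockRec{Ts: 10, TTL: 5}, nil, request{LockTs: 10, CurrentTs: 20})
	fmt.Println(r.Action == TTLExpireRollback, err)
}
```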
7. Read Path Semantics (MVCC Visibility)
KvGet and KvScan read through percolator.Reader:
- Check lock first:
  - if lock exists and read_ts >= lock.ts, return locked error
- Find visible write in CFWrite:
  - latest commit_ts <= read_ts
- Interpret write kind:
  - Delete/Rollback => not found
  - Put => read value from CFDefault at start_ts
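The three visibility steps can be sketched over plain slices and maps. This is an illustration, not percolator.Reader: `readAt`, the types, and the in-memory representation are hypothetical, and only the Put, Delete, and Rollback kinds named in the list are modeled.

```go
package main

import (
	"errors"
	"fmt"
)

type kind int

const (
	kindPut kind = iota
	kindDelete
	kindRollback
)

var errLocked = errors.New("key is locked")

// writeRec is one CFWrite record for the key being read.
type writeRec struct {
	CommitTs, StartTs uint64
	Kind              kind
}

// readAt returns the value visible at readTs.
// lockTs is nil when the key has no lock; defaults maps start_ts to the
// value stored in CFDefault.
func readAt(lockTs *uint64, writes []writeRec, defaults map[uint64]string, readTs uint64) (string, bool, error) {
	// Step 1: a lock at or below read_ts blocks the read.
	if lockTs != nil && readTs >= *lockTs {
		return "", false, errLocked
	}
	// Step 2: pick the write with the largest commit_ts <= read_ts.
	var visible *writeRec
	for i := range writes {
		w := &writes[i]
		if w.CommitTs <= readTs && (visible == nil || w.CommitTs > visible.CommitTs) {
			visible = w
		}
	}
	// Step 3: only a Put yields a value, fetched from CFDefault at start_ts.
	if visible == nil || visible.Kind != kindPut {
		return "", false, nil
	}
	v, ok := defaults[visible.StartTs]
	return v, ok, nil
}

func main() {
	writes := []writeRec{{CommitTs: 20, StartTs: 10, Kind: kindPut}, {CommitTs: 40, StartTs: 30, Kind: kindDelete}}
	defaults := map[uint64]string{10: "v1"}
	fmt.Println(readAt(nil, writes, defaults, 25))
	fmt.Println(readAt(nil, writes, defaults, 45))
}
```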
Notes:
- KvScan currently rejects reverse scan.
- scanWrites uses internal iterator over CFWrite.
8. Error and Idempotency Behavior
| Operation | Idempotency/Conflict behavior |
|---|---|
| Prewrite | Rejects lock conflicts and write conflicts; returns per-key KeyError list. |
| Commit | Idempotent for already committed keys with same start_ts; stale/missing lock may abort. |
| BatchRollback | Safe to repeat; rollback marker prevents duplicate side effects. |
| ResolveLock | Safe to retry per key set; resolves only matching start_ts locks. |
| CheckTxnStatus | May push min_commit_ts, rollback expired primary lock, or return committed version. |
9. Current Operational Boundaries
- Percolator execution is tied to NoKV RPC + Raft apply path, with the command shape still following the TinyKV/TiKV MVCC model.
- Latch scope is process-local when one store shares a single latch.Manager; region correctness still comes from Raft ordering.
- Write.ShortValue and Write.ExpiresAt are codec fields; current commit path stores primary value bytes in CFDefault and reads from there when short value is not present.
10. Validation and Tests
Primary coverage:
- percolator/txn_test.go
- raftstore/kv/service_test.go
- raftstore/client/client_test.go
- raftstore/server/server_test.go
These tests cover 2PC happy path, lock conflicts, status checks, resolve/rollback behavior, and client region-aware retries.