# Write-Ahead Logging
This note dives into the WAL subsystem that powers QuillSQL’s recovery story. It explains how frames are generated, buffered, flushed, and replayed, and why logical heap/index records complement traditional physical FPWs.
## 1. Component map
| Layer | Location | Responsibility |
|---|---|---|
| WalManager | src/recovery/wal/mod.rs | Assigns LSNs, encodes WalRecordPayload, enqueues frames, and drives WalStorage::flush. |
| WalBuffer | src/recovery/wal/buffer.rs | Lock-free ring buffer (ConcurrentRingBuffer) that tracks pending frame count, bytes, and highest end LSN. |
| WalStorage | src/recovery/wal/storage.rs | Maintains WAL directory/segments, builds WalPages, and dispatches write/fsync tickets to a WalSink (default: DiskSchedulerWalSink). |
| WalWriterRuntime | src/recovery/wal/writer.rs | Background thread that periodically calls WalManager::flush(None). |
| ControlFileManager | src/recovery/control_file.rs | Persists durable_lsn, max_assigned_lsn, checkpoint metadata, and redo start for fast crash recovery. |
| Resource managers | src/recovery/resource_manager.rs, storage/*/heap_recovery.rs, storage/*/index_recovery.rs | Decode payloads per ResourceManagerId and execute redo/undo logic for heap/index/page operations. |
End-to-end DML flow:

1. `ExecutionContext` mutates tuples/index entries via `TableHeap` / `BPlusTreeIndex`.
2. Operators invoke `WalManager::append_record_with` or `log_page_update` after data-page changes succeed.
3. Frames enter the `WalBuffer`. Once thresholds are met, `flush()` drains frames into `WalStorage`, which schedules asynchronous writes/fsyncs.
4. During commit, `TransactionManager` waits on `WalManager::wait_for_durable(lsn)` to guarantee WAL persistence before releasing locks.
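The append/flush/commit handshake above can be sketched in a few lines. Everything below (`Wal`, `append`, `flush`, `is_durable`) is an illustrative stand-in for `WalManager`'s real API, not QuillSQL code:

```rust
// Hypothetical, simplified stand-in for WalManager's append/flush/commit path.
use std::sync::Mutex;

#[derive(Default)]
struct WalState {
    next_lsn: u64,                // next LSN to hand out
    durable_lsn: u64,             // highest LSN known to be on disk
    pending: Vec<(u64, Vec<u8>)>, // buffered (lsn, frame) pairs
}

struct Wal(Mutex<WalState>);

impl Wal {
    fn new() -> Self {
        Wal(Mutex::new(WalState::default()))
    }

    /// Assign the next LSN and enqueue the frame (append analogue).
    fn append(&self, payload: &[u8]) -> u64 {
        let mut s = self.0.lock().unwrap();
        s.next_lsn += 1;
        let lsn = s.next_lsn;
        s.pending.push((lsn, payload.to_vec()));
        lsn
    }

    /// Drain buffered frames "to disk" and advance durable_lsn (flush analogue).
    fn flush(&self) {
        let mut s = self.0.lock().unwrap();
        let last = s.pending.last().map(|&(lsn, _)| lsn);
        if let Some(last) = last {
            s.durable_lsn = last;
        }
        s.pending.clear();
    }

    /// Commit may only release locks once its LSN is durable.
    fn is_durable(&self, lsn: u64) -> bool {
        self.0.lock().unwrap().durable_lsn >= lsn
    }
}

fn main() {
    let wal = Wal::new();
    let lsn = wal.append(b"heap insert");
    assert!(!wal.is_durable(lsn)); // commit must wait here
    wal.flush();
    assert!(wal.is_durable(lsn)); // now safe to release locks
}
```

The key property is the last assertion: a transaction's locks are held until its commit record's LSN is covered by `durable_lsn`.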
## 2. Frame & record taxonomy
`wal_record.rs` defines the canonical envelope. `encode_frame(lsn, prev_lsn, payload)` serializes a record, while `WalAppendContext` reports the LSN range back to the caller.
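A hypothetical envelope of this shape can be round-tripped as below. The field order and widths (`[lsn u64 | prev_lsn u64 | len u32 | payload]`, little-endian) are assumptions for illustration; the actual `wal_record.rs` layout may differ, e.g. by adding a checksum:

```rust
// Hypothetical frame layout: [lsn u64 | prev_lsn u64 | len u32 | payload bytes],
// all little-endian. The real wal_record.rs layout may differ (e.g. add a CRC).
fn encode_frame(lsn: u64, prev_lsn: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(20 + payload.len());
    buf.extend_from_slice(&lsn.to_le_bytes());
    buf.extend_from_slice(&prev_lsn.to_le_bytes());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    buf
}

/// Inverse of encode_frame: returns (lsn, prev_lsn, payload).
fn decode_frame(buf: &[u8]) -> (u64, u64, &[u8]) {
    let lsn = u64::from_le_bytes(buf[0..8].try_into().unwrap());
    let prev_lsn = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    let len = u32::from_le_bytes(buf[16..20].try_into().unwrap()) as usize;
    (lsn, prev_lsn, &buf[20..20 + len])
}

fn main() {
    let frame = encode_frame(42, 17, b"tuple bytes");
    assert_eq!(decode_frame(&frame), (42, 17, &b"tuple bytes"[..]));
}
```

`prev_lsn` is what lets undo walk a transaction's records backwards without scanning the whole log.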
| ResourceManagerId | Payload | Purpose |
|---|---|---|
| Page | PageWritePayload | Full-page image (FPW) on first touch; also used when logical redo is not available. |
| Heap | HeapRecordPayload::{Insert,Delete} | Logical tuple records carrying relation id, page/slot, TupleMetaRepr, and tuple bytes (delete payloads always carry the before-image). |
| Index | IndexRecordPayload::{LeafInsert,LeafDelete,Leaf/Internal Split/Merge/Redistribute,Parent*,Root*} | Logical B+Tree leaf mutations plus structural changes (split/merge/redistribute, parent updates, root install/adopt/reset). |
| Transaction | TransactionPayload | BEGIN / COMMIT / ABORT markers that seed the undo phase's per-txn chains. |
| Checkpoint | CheckpointPayload | Captures ATT/DPT snapshots plus the redo start. |
| Clr | ClrPayload | Compensation log records (CLRs) documenting each undo step and its undo_next_lsn. |
`ResourceManagerId` determines how `RedoExecutor` / `UndoExecutor` route frames:
Page → physical redo only; Heap/Index → logical redo/undo using storage helpers;
Transaction/CLR/Checkpoint → interpreted by the analysis/undo phases.
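The routing rule reduces to a plain `match`. The variants mirror the table above, but `RedoAction` and `route` are invented names for illustration, not QuillSQL's actual types:

```rust
// Sketch of frame routing during recovery. ResourceManagerId mirrors the table
// above; RedoAction and route are invented names for illustration only.
#[derive(Debug, PartialEq)]
enum ResourceManagerId { Page, Heap, Index, Transaction, Checkpoint, Clr }

#[derive(Debug, PartialEq)]
enum RedoAction {
    PhysicalPageWrite, // replay the full-page image / delta
    LogicalReplay,     // re-run the tuple or leaf mutation via storage helpers
    AnalysisOnly,      // consumed by the analysis/undo phases, no page change
}

fn route(rm: &ResourceManagerId) -> RedoAction {
    match rm {
        ResourceManagerId::Page => RedoAction::PhysicalPageWrite,
        ResourceManagerId::Heap | ResourceManagerId::Index => RedoAction::LogicalReplay,
        ResourceManagerId::Transaction
        | ResourceManagerId::Checkpoint
        | ResourceManagerId::Clr => RedoAction::AnalysisOnly,
    }
}

fn main() {
    assert_eq!(route(&ResourceManagerId::Page), RedoAction::PhysicalPageWrite);
    assert_eq!(route(&ResourceManagerId::Heap), RedoAction::LogicalReplay);
    assert_eq!(route(&ResourceManagerId::Clr), RedoAction::AnalysisOnly);
}
```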
## 3. Heap WAL: MVCC-aware logical logging
- Emission points – `TableHeap::{insert,update,delete}_tuple` call `append_heap_record` (see `src/storage/heap/mvcc_heap.rs`) before the data page is overwritten. Each record stores the relation identifier, page/slot, tuple metadata, and tuple bytes (delete payloads include the exact before-image).
- Encoding – `src/storage/heap/wal_codec.rs` serializes `TupleMeta` (insert/delete txn id, command id, MVCC chain pointers) plus tuple bytes in a compact layout.
- Redo – `HeapResourceManager` (`src/storage/heap/heap_recovery.rs`) decodes the payload and applies it at the slot level (insert/overwrite) with an LSN check; it repacks the slotted page layout to keep offsets consistent and sets the page LSN to the frame LSN.
- Undo – Always restores the stored before-image (metadata + tuple bytes) for loser transactions, removing the optional/branchy path.
- Interaction with FPW – Heap logical redo handles tuple-level replay; FPWs are only used on first touch or when no logical log exists for a page.
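The two rules above — LSN-guarded redo and unconditional before-image undo — can be sketched with a simplified page stand-in (`PageStub` and its slot map are not the real slotted page):

```rust
// Simplified stand-ins: PageStub is not the real slotted page, and the helpers
// below only illustrate the LSN guard and the unconditional before-image undo.
use std::collections::HashMap;

struct PageStub {
    page_lsn: u64,
    slots: HashMap<u16, Vec<u8>>, // slot id -> tuple bytes
}

/// Redo applies the logged tuple only when the page is older than the frame.
fn redo_heap_insert(page: &mut PageStub, frame_lsn: u64, slot: u16, tuple: &[u8]) -> bool {
    if page.page_lsn >= frame_lsn {
        return false; // page already reflects this change: idempotent skip
    }
    page.slots.insert(slot, tuple.to_vec());
    page.page_lsn = frame_lsn; // stamp the page with the frame LSN
    true
}

/// Undo always restores the logged before-image for a loser transaction.
fn undo_heap_delete(page: &mut PageStub, slot: u16, before_image: &[u8]) {
    page.slots.insert(slot, before_image.to_vec());
}

fn main() {
    let mut page = PageStub { page_lsn: 5, slots: HashMap::new() };
    assert!(redo_heap_insert(&mut page, 9, 0, b"v1"));  // applied
    assert!(!redo_heap_insert(&mut page, 9, 0, b"v1")); // replayed frame skipped
    undo_heap_delete(&mut page, 0, b"v0");
    assert_eq!(page.slots[&0], b"v0".to_vec());
}
```

The guard makes redo idempotent: replaying the same frame twice (e.g. after a crash during recovery) is a no-op.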
## 4. Index WAL: logical B+Tree leaf operations
- Emission points – `BPlusTreeIndex` logs every leaf insert/delete and all structural changes (split/merge/redistribute, parent updates, root install/adopt/reset) via `append_index_record` (`src/storage/index/btree_index.rs`), using `src/storage/index/wal_codec.rs` to encode the key schema, key bytes, page ids, and txns.
- Redo (`src/storage/index/index_recovery.rs`) steps:
  1. Decode the key with `TupleCodec` using the stored schema.
  2. Fetch the target page through the buffer pool (preferred) or `DiskScheduler`.
  3. Skip if `page_lsn >= frame_lsn`; otherwise apply the logical mutation/structural rebuild and bump the page version.
  4. Write the updated page back with the frame LSN (including the header root pointer for root install/adopt/reset).
- Undo – Inverts leaf inserts/deletes for loser transactions. Structural records are redo-only (idempotent via LSN/version checks).
- Benefits – Heap and index WAL are decoupled; logical leaf updates and structured payloads avoid full-page writes while remaining crash-safe for splits/merges/root changes.
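Under the same skip rule, a leaf-level redo/undo pair might look like the sketch below. `Leaf` is a simplified sorted-vector stand-in for a real B+Tree leaf page; the real pages, key codec, and version counter are more involved:

```rust
// Leaf is a simplified sorted-vector stand-in for a B+Tree leaf page; the real
// index pages, key codec, and version counter are more involved.
struct Leaf {
    page_lsn: u64,
    keys: Vec<(i64, u64)>, // (key, rid), kept sorted by key
}

/// Logical redo of a leaf insert, guarded by the page_lsn >= frame_lsn skip rule.
fn redo_leaf_insert(leaf: &mut Leaf, frame_lsn: u64, key: i64, rid: u64) -> bool {
    if leaf.page_lsn >= frame_lsn {
        return false; // page already contains this mutation
    }
    match leaf.keys.binary_search_by_key(&key, |&(k, _)| k) {
        Ok(pos) => leaf.keys[pos] = (key, rid), // overwrite existing entry
        Err(pos) => leaf.keys.insert(pos, (key, rid)),
    }
    leaf.page_lsn = frame_lsn;
    true
}

/// Undo of a loser transaction's leaf insert: remove the key again.
fn undo_leaf_insert(leaf: &mut Leaf, key: i64) {
    if let Ok(pos) = leaf.keys.binary_search_by_key(&key, |&(k, _)| k) {
        leaf.keys.remove(pos);
    }
}

fn main() {
    let mut leaf = Leaf { page_lsn: 0, keys: vec![(10, 1), (30, 2)] };
    assert!(redo_leaf_insert(&mut leaf, 5, 20, 3));
    assert_eq!(leaf.keys, vec![(10, 1), (20, 3), (30, 2)]);
    assert!(!redo_leaf_insert(&mut leaf, 5, 20, 3)); // same frame: skipped
    undo_leaf_insert(&mut leaf, 20);
    assert_eq!(leaf.keys, vec![(10, 1), (30, 2)]);
}
```

Because redo re-derives the sorted position from the key rather than trusting a logged offset, it stays correct even if the leaf was compacted between logging and replay.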
## 5. Buffering & flush strategy
- Thresholds – `max_buffer_records` (from `WalOptions::buffer_capacity`), `flush_coalesce_bytes`, and one WAL page (4 KiB) trigger batched flushes.
- Flush mechanics – `WalManager::flush_with_mode` drains frames up to a target LSN, asks `WalStorage::append_records` to write them, then waits on all `WalFlushTicket`s. `flush_until` forces durability before commit or after undo.
- Checkpoints – `log_checkpoint` forces a flush, records `checkpoint_redo_start` in the control file, and clears the "touched pages" set so new FPWs fire only once per checkpoint interval.
- Background writer – `WalWriterRuntime` runs when `WalOptions::writer_interval_ms` is non-zero, smoothing out flush pressure even when foreground transactions are light.
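The trigger condition is a simple disjunction over the three thresholds. The struct below is a hypothetical condensation of the options named above, with illustrative values:

```rust
// Hypothetical condensation of the flush trigger: any tripped threshold flushes.
struct FlushThresholds {
    max_buffer_records: usize,   // from WalOptions::buffer_capacity
    flush_coalesce_bytes: usize, // coalescing target for batched writes
    wal_page_bytes: usize,       // one 4 KiB WAL page
}

fn should_flush(t: &FlushThresholds, pending_records: usize, pending_bytes: usize) -> bool {
    pending_records >= t.max_buffer_records
        || pending_bytes >= t.flush_coalesce_bytes
        || pending_bytes >= t.wal_page_bytes
}

fn main() {
    let t = FlushThresholds {
        max_buffer_records: 64,
        flush_coalesce_bytes: 1 << 20,
        wal_page_bytes: 4096,
    };
    assert!(!should_flush(&t, 1, 100)); // nothing tripped yet
    assert!(should_flush(&t, 64, 100)); // record-count threshold
    assert!(should_flush(&t, 1, 4096)); // a full WAL page is pending
}
```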
## 6. Relation to ARIES
- Analysis – `AnalysisPass` parses the latest checkpoint, reconstructs the ATT/DPT by scanning the tail of the log, and chooses a redo start (`min(dpt.rec_lsn)`).
- Redo (repeat history) – `RedoExecutor` iterates from `start_lsn`, invoking the appropriate resource manager for each frame. The Page RM checks the page LSN before applying an FPW/delta; Heap/Index RMs use logical payloads to rebuild tuples or leaf entries.
- Undo – `UndoExecutor` chains loser transactions backwards, calling each resource manager's undo method and emitting CLRs with `undo_next_lsn`. If recovery crashes mid-undo, the next run resumes at the recorded `undo_next_lsn`.
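Two of these decisions fit in a few lines each: picking the redo start from the dirty page table, and walking a loser transaction's `undo_next_lsn` chain. Both functions below are illustrative, not QuillSQL's actual code:

```rust
// Illustrative helpers, not QuillSQL's actual code: redo starts at the oldest
// dirty page's rec_lsn, and undo follows each loser's undo_next_lsn chain.
use std::collections::HashMap;

/// DPT maps page id -> rec_lsn (first LSN that dirtied the page).
fn redo_start(dpt: &HashMap<u64, u64>, checkpoint_lsn: u64) -> u64 {
    dpt.values().copied().min().unwrap_or(checkpoint_lsn)
}

/// links maps an LSN to the LSN to undo next (0 marks the chain end, i.e. BEGIN).
fn undo_chain(links: &HashMap<u64, u64>, mut lsn: u64) -> Vec<u64> {
    let mut order = Vec::new();
    while lsn != 0 {
        order.push(lsn);
        lsn = *links.get(&lsn).unwrap_or(&0); // resume point if undo crashes here
    }
    order
}

fn main() {
    let dpt = HashMap::from([(1u64, 10u64), (2, 7)]);
    assert_eq!(redo_start(&dpt, 20), 7);
    assert_eq!(redo_start(&HashMap::new(), 20), 20); // empty DPT: start at checkpoint

    let links = HashMap::from([(30u64, 20u64), (20, 10), (10, 0)]);
    assert_eq!(undo_chain(&links, 30), vec![30, 20, 10]);
}
```

Because each CLR records the next LSN to undo, a crash during undo never repeats already-compensated work — the chain simply resumes where it left off.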
## 7. Tuning & troubleshooting
- Configuration – `WalOptions` inside `DatabaseOptions` exposes `segment_size`, `sync_on_flush`, `writer_interval_ms`, `synchronous_commit`, `retain_segments`, etc.
- Introspection – `WalManager::pending_records()` dumps in-memory frames for debugging; `background::BackgroundWorkers::snapshot()` reports WAL writer/checkpoint worker metadata. Enabling `RUST_LOG=trace` reveals FPW vs delta decisions and flush cadence.
- Crash testing – Insert a forced `std::process::exit(1)` after specific DMLs, then restart and inspect `RecoverySummary` (redo count + loser transactions) to ensure heap/index WAL cover the intended cases.
With logical heap/index records plus FPWs as a safety net, QuillSQL’s WAL stays teaching-friendly while mirroring production-grade recoverability. When introducing new components (e.g., custom indexes or vacuum steps), define a payload + resource manager and they will automatically participate in ARIES.