Recovery / WAL Module
src/recovery/ guarantees the Durability part of ACID. Through WAL, checkpoints, and the
ARIES algorithm, QuillSQL can recover to a consistent state after crashes.
Responsibilities
- Generate WAL records for every change and assign monotonically increasing LSNs.
- Manage WAL buffers, synchronous flushes, and the optional background WAL writer.
- Run the ARIES Analysis → Redo → Undo sequence during startup.
- Maintain the control file so the latest checkpoint and log truncation point are known.
Directory Layout
| Path | Description | Key Types |
|---|---|---|
recovery/wal/ | Log writer, buffer, storage, background writer runtime. | WalManager, WalBuffer, WalStorage |
recovery/wal_record.rs | Serialized record variants exposed to subsystems. | WalRecordPayload, ResourceManagerId |
recovery/resource_manager.rs | Registry that maps payloads to redo/undo handlers. | ResourceManager, RedoContext, UndoContext |
recovery/recovery_manager.rs | ARIES driver. | RecoveryManager, RecoverySummary |
recovery/control_file.rs | Persistent metadata (checkpoint info). | ControlFileManager |
WalManager & I/O pipeline
- Buffering –
WalManageruses a lock-freeWalBufferring.append_record_withattaches an LSN, encodes the frame, pushes it into the ring, and auto-flushes when either record count or byte thresholds (max_buffer_records,flush_coalesce_bytes, or a full WAL page) are met. Writers never block on Vec reallocation and thus scale with concurrent transactions. - Physical logging –
log_page_updateemits 4 KiBPageWriteFPWs on first touch (tracked viaDashSet<PageId>) or when logical redo is unavailable. Payloads carry the previouspage_lsnso redo can enforce the WAL rule. - Segmentation & flush –
WalStorageappends frames to rolling segments under the configured WAL directory, queues async write/fsync tickets via theDiskSchedulersink, and recycles sealed segments after checkpoints.flush()drains buffered records up to an optional target LSN, waits for outstanding tickets, updatesdurable_lsn, and optionally persists the control file snapshot. - Background writer –
start_background_flushspawnsWalWriterRuntime, which periodically callsflush(None);WalWriterHandlelives in the background worker registry so shutdown is coordinated with the db lifecycle. - Checkpoints –
log_checkpointwrites aCheckpointframe containing ATT/DPT, forces a flush, clears the “first-touch” set (so new FPWs are generated as needed), and updatescheckpoint_redo_startin the control file. Recovery uses this redo start to avoid rescanning the whole log. - Readers & durability waits –
WalManager::readeriterates frames straight from the WAL directory.wait_for_durableis the gate that synchronous commits call after emitting a commit record; it reusesflush_with_modeand condition variables to block untildurable_lsn >= target.
Record types & resource managers
All log frames share the envelope defined in wal_record.rs and are routed by
ResourceManagerId:
- Page (
ResourceManagerId::Page) – purely physicalPageWritepayloads. The page resource manager checkspage_lsn(buffer pool if available) before rewriting the page on redo. No undo is required because these are low-level physical changes. - Heap (
ResourceManagerId::Heap) – logical payloads include relation id, page/slot, tuple metadata, and tuple bytes.HeapResourceManagerapplies inserts/overwrites at the slot level with LSN checks and repacks slotted pages. Undo uses the stored before-image (metadata + bytes) for loser transactions. - Index (
ResourceManagerId::Index) – logs leaf inserts/deletes plus structural changes (split/merge/redistribute, parent updates, root install/adopt/reset). Redo mutates leaves or rebuilds structural pages with LSN/version guards; undo mirrors leaf inserts/deletes only. Heap and index WAL remain independent. - Transaction / CLR / Checkpoint –
TransactionPayload(BEGIN/COMMIT/ABORT) enables the undo index to link per-transaction log records.ClrPayloaddocuments each undo step so the recovery process can survive crashes mid-undo. Checkpoints snapshot ATT + DPT for the analysis phase.
ensure_default_resource_managers_registered wires Page, Heap, and Index resource
managers together so both redo (RedoExecutor) and undo (UndoExecutor) can dispatch
records uniformly.
Heap / Index WAL design
- Heap – Execution operators call
ExecutionContext::insert_tuple_with_indexes(and friends), which ultimately invokeMvccHeapto write WAL before dirtying the page. Each tuple carriesTupleMetaRepr(insert/delete txn + CID, MVCC chain pointers) plus encoded column bytes. During redo,HeapResourceManagerapplies inserts or overwrites at the slot level with LSN checks and repacks the slotted page before writing it back viaTableHeap::recovery_view. Undo leverages the stored before-image to restore bytes or toggle delete flags; vacuum code can later reclaim dead slots oncesafe_xminpasses them. - Index – B+Tree leaf operations log the logical key/value change, plus structural records for split/merge/redistribute/parent updates and root install/adopt/reset. Redo loads or rebuilds pages (buffer pool or disk), checks page LSN/version to avoid reapplying newer changes, and performs the requested mutation. Undo mirrors leaf-level inserts/deletes only. Heap and index WAL stay independent.
- Page images – For heap/index structural changes that rewrite large sections (e.g.,
new heap page, B+Tree splits), page guards still emit FPWs through
WalManager::log_page_update. Logical heap/index WAL ensures redo still works even when FPWs are skipped after the first touch.
ARIES Recovery
- Analysis – read the latest checkpoint, rebuild the Dirty Page Table (DPT) and Active Transaction Table (ATT).
- Redo – scan forward from the checkpoint LSN, reapplying operations when the DPT indicates a page may be out of date.
- Undo – roll back transactions still in ATT, writing Compensation Log Records (CLR) so the process is idempotent even if another crash happens mid-undo.
RecoverySummary reports how many records were redone and which transactions require
manual attention—great for classroom demonstrations.
Interactions
- TransactionManager – emits
Begin,Commit,Abortrecords and supplies undo information. - BufferManager – links to WAL via
set_wal_manager; checkpoints rely on the buffer’s dirty page metadata. - Background workers – WAL writer and checkpoint worker live in
backgroundand use handles exposed byWalManager.
Teaching Ideas
- Simulate crashes (e.g., panic right after
wal_manager.flush(None)) and inspect log output on restart. - Add a new WAL record type (like
CreateIndex) to see howRecoveryManagermust be extended. - Compare physical vs logical redo costs to discuss ARIES trade-offs.
Recovery Lab Playbook (CMU 15-445 style)
- Mini ARIES – Disable the background WAL writer, perform a few INSERT/UPDATE
operations, crash the process mid-transaction, and single-step through
analysis_pass.rsto observe ATT/DPT reconstruction. - Logical vs physical redo – Comment out
WalManager::log_page_updatefor heap pages and re-run the experiment. Recovery still succeeds thanks to logical heap payloads; re-enable FPWs to contrast log volume. - Index crash test – Inject a panic after
BPlusTreeIndex::insertlogs a leaf insert. On restart, watch the undo phase remove the stray key before replay finishes. - Group commit tuning – Play with
WalOptions::flush_coalesce_bytes,writer_interval_ms, andsynchronous_committo demonstrate how throughput vs latency tradeoffs mirror the 15-445 checkpoints/commit labs.
Further reading: ARIES details