Keyboard shortcuts

Press or to navigate between chapters

Press ? to show this help

Press Esc to hide this help

Recovery / WAL Module

src/recovery/ guarantees the Durability part of ACID. Through WAL, checkpoints, and the ARIES algorithm, QuillSQL can recover to a consistent state after crashes.


Responsibilities

  • Generate WAL records for every change and assign monotonically increasing LSNs.
  • Manage WAL buffers, synchronous flushes, and the optional background WAL writer.
  • Run the ARIES Analysis → Redo → Undo sequence during startup.
  • Maintain the control file so the latest checkpoint and log truncation point are known.

Directory Layout

PathDescriptionKey Types
recovery/wal/Log writer, buffer, storage, background writer runtime.WalManager, WalBuffer, WalStorage
recovery/wal_record.rsSerialized record variants exposed to subsystems.WalRecordPayload, ResourceManagerId
recovery/resource_manager.rsRegistry that maps payloads to redo/undo handlers.ResourceManager, RedoContext, UndoContext
recovery/recovery_manager.rsARIES driver.RecoveryManager, RecoverySummary
recovery/control_file.rsPersistent metadata (checkpoint info).ControlFileManager

WalManager & I/O pipeline

  • BufferingWalManager uses a lock-free WalBuffer ring. append_record_with attaches an LSN, encodes the frame, pushes it into the ring, and auto-flushes when either record count or byte thresholds (max_buffer_records, flush_coalesce_bytes, or a full WAL page) are met. Writers never block on Vec reallocation and thus scale with concurrent transactions.
  • Physical logginglog_page_update emits 4 KiB PageWrite FPWs on first touch (tracked via DashSet<PageId>) or when logical redo is unavailable. Payloads carry the previous page_lsn so redo can enforce the WAL rule.
  • Segmentation & flushWalStorage appends frames to rolling segments under the configured WAL directory, queues async write/fsync tickets via the DiskScheduler sink, and recycles sealed segments after checkpoints. flush() drains buffered records up to an optional target LSN, waits for outstanding tickets, updates durable_lsn, and optionally persists the control file snapshot.
  • Background writerstart_background_flush spawns WalWriterRuntime, which periodically calls flush(None); WalWriterHandle lives in the background worker registry so shutdown is coordinated with the db lifecycle.
  • Checkpointslog_checkpoint writes a Checkpoint frame containing ATT/DPT, forces a flush, clears the “first-touch” set (so new FPWs are generated as needed), and updates checkpoint_redo_start in the control file. Recovery uses this redo start to avoid rescanning the whole log.
  • Readers & durability waitsWalManager::reader iterates frames straight from the WAL directory. wait_for_durable is the gate that synchronous commits call after emitting a commit record; it reuses flush_with_mode and condition variables to block until durable_lsn >= target.

Record types & resource managers

All log frames share the envelope defined in wal_record.rs and are routed by ResourceManagerId:

  • Page (ResourceManagerId::Page) – purely physical PageWrite payloads. The page resource manager checks page_lsn (buffer pool if available) before rewriting the page on redo. No undo is required because these are low-level physical changes.
  • Heap (ResourceManagerId::Heap) – logical payloads include relation id, page/slot, tuple metadata, and tuple bytes. HeapResourceManager applies inserts/overwrites at the slot level with LSN checks and repacks slotted pages. Undo uses the stored before-image (metadata + bytes) for loser transactions.
  • Index (ResourceManagerId::Index) – logs leaf inserts/deletes plus structural changes (split/merge/redistribute, parent updates, root install/adopt/reset). Redo mutates leaves or rebuilds structural pages with LSN/version guards; undo mirrors leaf inserts/deletes only. Heap and index WAL remain independent.
  • Transaction / CLR / CheckpointTransactionPayload (BEGIN/COMMIT/ABORT) enables the undo index to link per-transaction log records. ClrPayload documents each undo step so the recovery process can survive crashes mid-undo. Checkpoints snapshot ATT + DPT for the analysis phase.

ensure_default_resource_managers_registered wires Page, Heap, and Index resource managers together so both redo (RedoExecutor) and undo (UndoExecutor) can dispatch records uniformly.

Heap / Index WAL design

  • Heap – Execution operators call ExecutionContext::insert_tuple_with_indexes (and friends), which ultimately invoke MvccHeap to write WAL before dirtying the page. Each tuple carries TupleMetaRepr (insert/delete txn + CID, MVCC chain pointers) plus encoded column bytes. During redo, HeapResourceManager applies inserts or overwrites at the slot level with LSN checks and repacks the slotted page before writing it back via TableHeap::recovery_view. Undo leverages the stored before-image to restore bytes or toggle delete flags; vacuum code can later reclaim dead slots once safe_xmin passes them.
  • Index – B+Tree leaf operations log the logical key/value change, plus structural records for split/merge/redistribute/parent updates and root install/adopt/reset. Redo loads or rebuilds pages (buffer pool or disk), checks page LSN/version to avoid reapplying newer changes, and performs the requested mutation. Undo mirrors leaf-level inserts/deletes only. Heap and index WAL stay independent.
  • Page images – For heap/index structural changes that rewrite large sections (e.g., new heap page, B+Tree splits), page guards still emit FPWs through WalManager::log_page_update. Logical heap/index WAL ensures redo still works even when FPWs are skipped after the first touch.

ARIES Recovery

  1. Analysis – read the latest checkpoint, rebuild the Dirty Page Table (DPT) and Active Transaction Table (ATT).
  2. Redo – scan forward from the checkpoint LSN, reapplying operations when the DPT indicates a page may be out of date.
  3. Undo – roll back transactions still in ATT, writing Compensation Log Records (CLR) so the process is idempotent even if another crash happens mid-undo.

RecoverySummary reports how many records were redone and which transactions require manual attention—great for classroom demonstrations.


Interactions

  • TransactionManager – emits Begin, Commit, Abort records and supplies undo information.
  • BufferManager – links to WAL via set_wal_manager; checkpoints rely on the buffer’s dirty page metadata.
  • Background workers – WAL writer and checkpoint worker live in background and use handles exposed by WalManager.

Teaching Ideas

  • Simulate crashes (e.g., panic right after wal_manager.flush(None)) and inspect log output on restart.
  • Add a new WAL record type (like CreateIndex) to see how RecoveryManager must be extended.
  • Compare physical vs logical redo costs to discuss ARIES trade-offs.

Recovery Lab Playbook (CMU 15-445 style)

  1. Mini ARIES – Disable the background WAL writer, perform a few INSERT/UPDATE operations, crash the process mid-transaction, and single-step through analysis_pass.rs to observe ATT/DPT reconstruction.
  2. Logical vs physical redo – Comment out WalManager::log_page_update for heap pages and re-run the experiment. Recovery still succeeds thanks to logical heap payloads; re-enable FPWs to contrast log volume.
  3. Index crash test – Inject a panic after BPlusTreeIndex::insert logs a leaf insert. On restart, watch the undo phase remove the stray key before replay finishes.
  4. Group commit tuning – Play with WalOptions::flush_coalesce_bytes, writer_interval_ms, and synchronous_commit to demonstrate how throughput vs latency tradeoffs mirror the 15-445 checkpoints/commit labs.

Further reading: ARIES details