
QuillSQL Internals

Welcome to the technical documentation for QuillSQL.

This book provides a deep dive into the internal architecture and implementation details of the database. It is intended for developers, contributors, and anyone interested in understanding how a relational database is built from the ground up, referencing concepts from classic database courses like CMU 15-445.



QuillSQL Architecture

This chapter gives a tour of the major subsystems, the flow of a SQL statement, and the contract between execution, transactions, and storage. It uses Mermaid diagrams so you can render the visuals directly inside mdBook or any compatible viewer.


1. End-to-End Pipeline

flowchart LR
    subgraph Frontend
        CLI["bin/client"] --> Parser["sql::parser"]
        HTTP["bin/server"] --> Parser
    end
    Parser --> LPlan["plan::LogicalPlanner"]
    LPlan --> Optimizer["optimizer::LogicalOptimizer"]
    Optimizer --> PPlan["plan::PhysicalPlanner"]
    PPlan --> Exec["execution::ExecutionEngine (Volcano)"]

    subgraph Txn["transaction::*"]
        Session["session::SessionContext"]
        TM["TransactionManager"]
        LM["LockManager"]
        Session --> TM --> LM
    end

    Exec <-->|snapshot, locks| Txn
    Exec --> Binding

    subgraph Storage["storage::* & buffer::*"]
        Binding["storage::engine::TableBinding"]
        Heap["storage::table_heap::TableHeap"]
        MVCC["storage::heap::MvccHeap"]
        Index["storage::index::BPlusTree"]
        Buffer["buffer::BufferManager"]
        Disk["storage::disk_scheduler (io_uring)"]
        WAL["recovery::WalManager"]
        Binding --> Heap
        Binding --> Index
        Heap <--> Buffer
        Index <--> Buffer
        Buffer <--> Disk
        WAL --> Disk
    end

    Background["background::workers\n(checkpoint, WAL writer, MVCC vacuum)"]
    Background --> Buffer
    Background --> WAL
    Background --> TM

Key takeaways

  • The frontend (CLI/HTTP) only knows how to parse SQL and drive the planning stages; all shared state lives below.
  • The ExecutionEngine drives a Volcano iterator. Each physical operator calls into the transaction runtime for visibility checks and locking, but touches storage exclusively through a TableBinding. This makes the executor easy to reason about in a classroom setting.
  • Buffering, WAL, and disk I/O are centralized so that durability/ordering guarantees stay in one module.

2. Transactions, MVCC, and the Executor

sequenceDiagram
    autonumber
    participant Session
    participant TM as TransactionManager
    participant Runtime as TxnRuntime
    participant Exec
    participant Binding
    participant Heap as TableHeap

    Session->>TM: begin_txn(isolation, access_mode)
    TM-->>Session: Transaction handle
    Exec->>Runtime: TxnRuntime::new(&TM, &mut txn)
    Runtime-->>Exec: {snapshot, command_id}

    loop per tuple
        Exec->>Binding: scan.next()
        Binding->>Heap: TableIterator::next()
        Heap-->>Binding: (rid, meta, tuple)
        Binding-->>Exec: (rid, meta, tuple)
        Exec->>Runtime: is_visible(meta)?
        Runtime-->>Exec: yes/no
        alt visible & needs lock
            Exec->>Runtime: lock_row(table, rid, mode)
            Runtime-->>Exec: guard
        end
        alt mutation
            Exec->>Binding: prepare_row_for_write(...)
            Binding->>Heap: mvcc.full_tuple(...)
            Heap-->>Binding: (current_meta, tuple)
            Binding->>TM: push_undo + append WAL(HeapRecord)
        end
    end

    Session->>TM: commit(txn)
    TM->>WAL: Transaction{Commit}
    TM->>TM: flush_until(lsn) if synchronous
    TM-->>Session: release locks, clear snapshot

Snapshot policy

  • Read Uncommitted / Read Committed: every command obtains a fresh snapshot, so repeated statements can see committed updates.
  • Repeatable Read / Serializable: cache the first snapshot on the Transaction handle; subsequent commands reuse it for consistent reads. The lock manager releases S-locks at the end of each command for RR, but Serializable keeps them until commit (sketched below).
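
The policy above fits in a few lines of code. Below is a minimal sketch, assuming simplified Transaction, Snapshot, and IsolationLevel types rather than QuillSQL's actual definitions:

```rust
#[derive(Clone, Copy)]
enum IsolationLevel {
    ReadUncommitted,
    ReadCommitted,
    RepeatableRead,
    Serializable,
}

#[derive(Clone)]
struct Snapshot {
    xmin: u64,
    xmax: u64,
    active_txns: Vec<u64>,
}

struct Transaction {
    isolation: IsolationLevel,
    cached_snapshot: Option<Snapshot>,
}

impl Transaction {
    /// Pick the snapshot for the next command: RU/RC always take a fresh one,
    /// while RR/Serializable reuse the snapshot captured by the first command.
    fn snapshot_for_command(&mut self, take_fresh: impl Fn() -> Snapshot) -> Snapshot {
        match self.isolation {
            IsolationLevel::ReadUncommitted | IsolationLevel::ReadCommitted => take_fresh(),
            IsolationLevel::RepeatableRead | IsolationLevel::Serializable => {
                self.cached_snapshot.get_or_insert_with(take_fresh).clone()
            }
        }
    }
}
```

Lock release timing is handled separately by the lock manager, as noted above; only the snapshot choice is modeled here.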

Undo & WAL

  • Each mutation produces a logical HeapRecordPayload or IndexRecordPayload. The heap payload already carries redo (new bytes) and undo (old bytes), so recovery can replay forward or backward without re-reading heap pages.
  • On abort, the manager walks the undo stack, emits CLRs, and re-applies the inverse heap/index operations.

3. Storage Layering

graph TD
    subgraph Planner View
        Catalog
    end
    Catalog -->|Arc<TableHeap>| TableBinding
    TableBinding --> Heap["TableHeap"]
    TableBinding --> Mvcc["MvccHeap"]
    TableBinding --> Index["BPlusTreeIndex"]
    Heap -->|pages| BufferMgr
    Index -->|pages| BufferMgr
    BufferMgr --> DiskMgr
    BufferMgr --> WalMgr
| Layer | Responsibility | Notes |
|---|---|---|
| TableHeap | Physical slotted pages, tuple encoding, WAL page image helpers | Exposes insert/update/delete that emit heap-specific WAL payloads before dirtying frames. |
| MvccHeap | Version chain management, delete-marking, undo metadata | Creates new versions, rewires forward/back pointers, and delegates actual I/O to TableHeap. |
| TableBinding | Safe façade for the executor | Provides scan, index_scan, insert, delete, update, and prepare_row_for_write, always pairing heap/index changes so operators stay small. |
| BufferManager + DiskScheduler | Unified cache + async I/O | Uses LRU-K (+ optional TinyLFU admission) and io_uring to keep the hot set resident. |
| WalManager | ARIES-compatible log | Supports logical heap/index records, page write/delta fallback, CLRs, checkpoints, and background flush threads. |

4. TableBinding Contract

classDiagram
    class TableBinding {
        +scan() -> TupleStream
        +index_scan(name, IndexScanRequest) -> TupleStream
        +insert(&mut TxnContext, &Tuple)
        +delete(&mut TxnContext, RecordId, TupleMeta, Tuple)
        +update(&mut TxnContext, RecordId, Tuple, TupleMeta, Tuple) -> RecordId
        +prepare_row_for_write(&mut TxnContext, RecordId, &TupleMeta)
        +table_heap() -> Arc<TableHeap>
    }

    class PhysicalOperator {
        +init(&mut ExecutionContext)
        +next(&mut ExecutionContext) -> Option<Tuple>
    }

    PhysicalOperator --> TableBinding : uses

This binding hides every MVCC/WAL detail from the operators (a usage sketch follows the list):

  • No more ad-hoc catalog.table_indexes() calls.
  • No direct references to MvccHeap or TableHeap.
  • Executor code reads like pseudo SQL: “lock table”, “scan binding”, “update tuple”.
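
To make that contract concrete, here is a minimal sketch of a delete-style operator driving such a binding. The trait below is illustrative only; its names and signatures are simplified stand-ins for QuillSQL's real TableBinding API (the class diagram above shows the actual method list):

```rust
// Illustrative trait mirroring the shape of the binding contract above.
// Names and signatures are simplified, not QuillSQL's real API.
trait Binding {
    fn scan(&self) -> Box<dyn Iterator<Item = (u64 /* rid */, bool /* deleted */, Vec<u8> /* tuple */)>>;
    fn prepare_row_for_write(&mut self, rid: u64) -> Result<(), String>;
    fn delete(&mut self, rid: u64) -> Result<(), String>;
}

/// A delete operator stays tiny: iterate, re-validate each row, delete it.
/// All MVCC, WAL, and index bookkeeping happens behind the binding.
fn delete_all(binding: &mut dyn Binding) -> Result<usize, String> {
    let rids: Vec<u64> = binding
        .scan()
        .filter(|(_, deleted, _)| !deleted)
        .map(|(rid, _, _)| rid)
        .collect();
    let mut count = 0;
    for rid in rids {
        binding.prepare_row_for_write(rid)?; // lock + latest-version check
        binding.delete(rid)?;                // heap + index + undo + WAL
        count += 1;
    }
    Ok(count)
}
```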

5. WAL & Recovery Details

flowchart LR
    subgraph WAL Record Types
        HI["HeapInsert"] --> redo
        HD["HeapDelete (with before-image)"] --> redo & undo
        LI["IndexLeaf{Insert,Delete}"]
        IS["IndexSplit/Merge/Redistribute/Parent*"]
        IR["IndexRoot Install/Adopt/Reset"]
        PI["PageWrite (FPW)"]
        CI["Checkpoint"]
        CL["CLR"]
    end
    Exec -->|Heap/Index payloads| WalMgr
    BufferMgr -->|first-touch FPW| WalMgr
    WalMgr --> DiskScheduler --> LogFiles["log files"]
    Recovery -->|analysis/redo/undo| BufferMgr
  • Logical logging first: heap/index mutations emit redo + undo at the logical level. This keeps the WAL stream compact and human-readable for teaching.
  • Physical fallbacks: metadata-heavy pages (meta, freelist, header) still leverage PageWrite FPWs to guarantee a consistent base image, especially on the first page flush after a crash.
  • Restart: RecoveryManager performs the classical ARIES sequence. It uses the dirty_page_table and active_txn_table captured by checkpoints to limit redo and undo work.

6. Background Workers

| Worker | Purpose | Configuration |
|---|---|---|
| WAL writer | Periodically flushes WAL buffers, coalesces adjacent writes | WalOptions::writer_interval_ms, buffer_capacity |
| Checkpoint | Records ATT + DPT, gently pushes dirty buffers | WalOptions::checkpoint_interval_ms |
| Buffer writer | Flushes frames when the dirty set grows too large | background::BufferWriterConfig |
| MVCC vacuum | Reclaims obsolete tuple versions based on safe_xmin | MvccVacuumConfig |
| Index vacuum | Cleans up deleted index entries using B-link traversal | IndexVacuumConfig |

Workers register with background::BackgroundWorkers, so the database can stop and join them cleanly on shutdown. Each worker publishes metrics (intervals, batches processed) for observability.
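
As a rough illustration of that register/stop pattern, here is a minimal sketch using plain std threads and invented names; the real workers wrap tokio tasks and register with background::BackgroundWorkers, so treat this as a shape, not the implementation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Minimal stand-in for a background worker registry: spawn periodic
/// workers, then stop and join them all on shutdown.
struct Workers {
    stop: Arc<AtomicBool>,
    handles: Vec<JoinHandle<()>>,
}

impl Workers {
    fn new() -> Self {
        Self { stop: Arc::new(AtomicBool::new(false)), handles: Vec::new() }
    }

    fn spawn_periodic(&mut self, interval: Duration, mut tick: impl FnMut() + Send + 'static) {
        let stop = Arc::clone(&self.stop);
        self.handles.push(thread::spawn(move || {
            while !stop.load(Ordering::Relaxed) {
                tick();              // e.g. flush WAL, take a checkpoint, vacuum
                thread::sleep(interval);
            }
        }));
    }

    fn shutdown_all(mut self) {
        self.stop.store(true, Ordering::Relaxed);
        for h in self.handles.drain(..) {
            let _ = h.join();
        }
    }
}
```

The real registry also publishes the per-worker metrics mentioned above; this sketch only shows the spawn/stop lifecycle.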


7. Example: Repeatable Read UPDATE

sequenceDiagram
    participant T1 as Txn 1 (RR)
    participant T2 as Txn 2 (RC)
    participant Heap as TableHeap

    T1->>Heap: SELECT (snapshot S0)
    Heap-->>T1: version V0
    T2->>Heap: UPDATE -> insert V1, delete V0
    Heap-->>T2: ack
    T2->>T2: COMMIT (WAL + locks release)
    T1->>Heap: SELECT again
    Heap-->>T1: V0 still visible via snapshot S0
    T1->>T1: COMMIT
    Note over T1,Heap: Vacuum later removes V0 when safe_xmin advances

This timeline demonstrates:

  • Snapshots shield Repeatable Read statements from concurrent writers even if row-level locks are released early.
  • The MVCC chain (V1.prev_version = V0, V0.next_version = V1) lets future readers reach the newest committed version while the vacuum worker reclaims obsolete ones lazily.

8. Observability & Configuration Cheat Sheet

  • Logging: RUST_LOG=trace surfaces lock acquisitions, MVCC vacuums, and io_uring completions.
  • Runtime knobs: WalOptions (segment size, sync policy), BufferPoolConfig (capacity, TinyLFU toggle), MvccVacuumConfig, and IndexVacuumConfig can all be adjusted via DatabaseOptions.
  • Metrics: background::BackgroundWorkers::snapshot() reports worker health; WAL exposes current LSN and flush LSN; the buffer manager can dump the dirty page table for diagnostics.

With these layers in place, QuillSQL remains faithful to production-grade engines (MVCC + WAL + buffer pool) while keeping its code and documentation approachable for coursework and research prototypes.

Module Overview

This page gives a teaching-friendly tour of every QuillSQL subsystem. Each section explains where the code lives (src/<module>), the most important types, and the design decisions that make the module fit into the whole system. Read it together with architecture.md for the big-picture data flow.


1. SQL Front-End (src/sql)

| Submodule | Responsibilities | Key Types / Functions |
|---|---|---|
| lexer.rs, parser.rs | Translate raw SQL into an AST using sqlparser plus Quill-specific extensions. | parse_sql(&str) -> Vec<Statement> |
| ast/mod.rs | Normalises identifiers, handles multi-part names, attaches spans for diagnostics. | NormalizedIdent, ObjectNameExt |
| plan/lowering.rs | Bridges AST → planner structs for DDL extras not in upstream sqlparser. | CreateIndexSpec, ColumnDefExt |

Implementation notes:

  • We intentionally keep the AST “SQL-shaped”. No premature desugaring occurs in the parser, which keeps the step-by-step teaching narrative simple: SQL text → AST → logical plan.
  • Error messages bubble up with span information, so labs around parser extensions can show students exactly which token misbehaved.
  • Suggested exercises: extend the parser with ALTER TABLE or window functions, then observe how the new AST nodes flow into the planner layer.

2. Logical Planning (src/plan)

| Component | Description |
|---|---|
| LogicalPlanner | Converts AST nodes into strongly typed LogicalPlan variants. Responsible for type checking, alias resolution, and scope handling. |
| PlannerContext | Surface for catalog lookups while planning. |
| PhysicalPlanner | Lowers optimized logical plans into physical Volcano operators (PhysicalPlan). |

Highlights:

  • Every LogicalPlan variant stores its child plans in Arc, so the tree is cheap to clone for optimizer passes or debugging prints.
  • Planner enforces column binding rules: scope stacks keep track of aliases, CTEs, and correlation so students can see how real compilers resolve identifiers.
  • DDL nodes capture TableReference + Schema objects. Later phases use them to skip repeated catalog lookups (helpful for labs about metadata caching).

3. Optimizer (src/optimizer)

| Piece | Responsibility |
|---|---|
| LogicalOptimizer | Applies a pipeline of rule-based rewrites (predicate pushdown, projection pruning, constant folding). |
| rules | Individual OptimizerRule implementations. |

Teaching hooks:

  • Each rule implements OptimizerRule::rewrite(&LogicalPlan) -> Option<LogicalPlan>. The return type makes it obvious whether a rewrite fired, so labs can instrument hit counts or add tracing.
  • Because rules are pure functions, students can safely reorder or A/B test heuristics in isolation (e.g., “what if we only push predicates below joins when the selectivity estimate exceeds X?”).
  • Example lab: add constant folding or join commutation, run the sqllogictest suite, and compare plan dumps to see new shapes.

4. Execution Engine (src/execution)

| Element | Details |
|---|---|
| PhysicalPlan | Enum covering all Volcano operators. Each variant implements VolcanoExecutor. |
| VolcanoExecutor | init(&mut ExecutionContext) and next(&mut ExecutionContext) pair define the iterator model. |
| ExecutionContext | Supplies catalog access, expression evaluation, and (most importantly) a pluggable StorageEngine. |

Design notes:

  • Operators stay declarative. They describe what to scan or modify and delegate the how to storage handles. For example, PhysicalSeqScan now requests a TupleStream via ExecutionContext::table_stream, so it never touches TableHeap internals.
  • Helper methods such as eval_predicate, insert_tuple_with_indexes, or prepare_row_for_write encapsulate MVCC/locking rules so physical plans remain short.
  • ExecutionContext caches TxnContext, making it straightforward to teach isolation semantics: examine TxnContext::lock_table and read_visible_tuple to see when locks are taken or released.
  • Suggested lab: implement a new physical operator (e.g., hash join) by wiring two child PhysicalPlans without ever touching storage-specific code. This highlights the payoff of the handle abstraction.

5. Transaction Runtime (src/transaction)

| Type | Purpose |
|---|---|
| TransactionManager | Creates/commits/aborts transactions, manages undo chains, and coordinates WAL durability. |
| TxnContext | Per-command wrapper passed to execution. Provides MVCC snapshot (TransactionSnapshot), lock helpers, and command ids. |
| LockManager | Multi-granularity 2PL with deadlock detection. |

Deeper dive:

  • Transaction stores a cached TransactionSnapshot plus its undo records. Students can inspect Transaction::set_snapshot to see how Repeatable Read/Serializable keep a stable view.
  • TxnRuntime pairs a transaction with a command id. Every SQL statement increments the command id so MVCC can distinguish between tuples created earlier in the same transaction vs. the current statement—great for explaining “recent writes are invisible during UPDATE scanning”.
  • LockManager exposes functions like lock_table / lock_row. Internally it keeps a wait-for graph and victim selection policy, which makes deadlock detection tangible for concurrency lectures.
  • Undo entries (UndoRecord) store heap + index data. When an abort occurs the engine generates CLRs, demonstrating ARIES-style logical undo.

6. Storage Engine & Handles (src/storage)

| Component | Highlights |
|---|---|
| engine.rs | Defines StorageEngine, TableHandle, IndexHandle, TupleStream, and IndexScanRequest. |
| table_heap | Slotted-page heap with MVCC metadata (TupleMeta, forward/back pointers). |
| index | B+Tree (B-link) implementation with iterators and codecs. |

Topics to emphasise:

  • Handle abstraction: Execution asks the engine for a TableHandle, then calls full_scan() to receive a TupleStream. The default engine simply wraps TableHeap/BPlusTreeIndex, but students can plug in their own engines for research.
  • TupleStream: Minimal interface returning (RecordId, TupleMeta, Tuple) triples. Operators layer MVCC visibility on top, while the stream hides buffering details.
  • Scan extensions: Consider extending full_scan() to accept projection/batch hints— a natural exercise for showing how execution and storage negotiate capabilities.
  • table_heap demonstrates MVCC version chains (forward/back pointers) and the slotted page layout. Encourage students to trace MvccHeap::update alongside WAL entries to see how new versions link together.
  • index/btree_index.rs implements a B-link tree with separate codecs. Scanning via TreeIndexIterator shows how to perform lock-free range scans using sibling pointers— perfect for advanced systems lectures.

7. Buffer Manager & Disk I/O (src/buffer, src/storage/disk_*)

| Module | Description |
|---|---|
| buffer::BufferManager | Page table + LRU-K/TinyLFU replacer, dirty tracking, guard types for safe borrowing. |
| storage::disk_scheduler | io_uring-powered async I/O. Foreground threads enqueue read/write/fsync commands and await completions. |
| storage::disk_manager | Thin wrapper for file layout, page enlargement, and durability fences. |

Extra details:

  • Page guards come in three flavours: read, write, and upgradeable. Each implements Drop semantics that release latches automatically, reinforcing RAII patterns.
  • Replacement policy combines LRU-K history with TinyLFU admission. Labs can toggle the feature flag to measure hit-rate differences under sqllogictest or custom workloads.
  • DiskScheduler uses lock-free queues plus dedicated worker tasks. A teaching exercise is to run with RUST_LOG=debug and observe how read/write/fsync commands are batched.

8. Recovery & WAL (src/recovery)

| Item | Notes |
|---|---|
| WalManager | Allocates LSNs, buffers log records, drives background WAL writer, and integrates with checkpoints. |
| RecoveryManager | Implements ARIES analysis/redo/undo. Uses ControlFileManager snapshots to seed restart. |
| wal_record.rs | Defines logical (HeapInsert, IndexDelete, index structure/root records) and physical (PageWrite FPWs) records. |

Teaching hooks:

  • WAL and data share the disk scheduler. Students can trace one UPDATE from log append, to buffer dirtying, to redo/undo via the exact same RecordId.
  • Recovery exports statistics (redo count, loser transactions) so labs can check their WAL experiments automatically.
  • recovery/analysis.rs shows how Dirty Page Table and Active Transaction Table are reconstructed—ideal for demystifying ARIES’ first phase.
  • Students can implement “logical replay only” or “page image replay” modes by toggling the commit record types, then verify behaviour using the provided transaction tests.

9. Background Services (src/background)

| Worker | What it does |
|---|---|
| WAL writer | Flushes WAL buffer at configured cadence. |
| Checkpoint | Captures the Dirty Page Table + Active Transaction Table. |
| Buffer writer | Flushes dirty frames to keep checkpoints short. |
| MVCC vacuum | Reclaims tuple versions older than safe_xmin. |

More context:

  • Workers share a BackgroundWorkers registry so the database can spawn/stop them as a group (handy for tests). The registry exposes shutdown_all() which unit tests call to ensure a clean exit.
  • Config structs (IndexVacuumConfig, MvccVacuumConfig, etc.) live in src/config and are exposed through DatabaseOptions for easy fiddling in labs.
  • WAL writer and checkpoint worker simply wrap closures around tokio::task::JoinHandle. This design keeps async runtime code out of the core engine, making it simpler for students to trace background effects.
  • Exercise idea: tweak MvccVacuumConfig::batch_limit and observe how many tuple versions stay behind by querying hidden statistics tables.

10. Configuration & Session Layer (src/database, src/session, src/config)

| Type | Role |
|---|---|
| Database | Boots disk/WAL/buffer components, runs recovery, wires background workers, and executes SQL strings. |
| SessionContext | Tracks per-connection defaults (autocommit, isolation) and holds the active transaction handle. |
| config::* | Central place for WAL, buffer pool, vacuum, and HTTP/CLI tuning knobs. |

Extra pointers:

  • Front-ends (bin/client, bin/server) both embed a Database, proving that the core library can serve multiple UIs without change.
  • DatabaseOptions show how to construct dev/test setups (temporary files, alternate WAL directories) in a few lines.
  • session::SessionContext demonstrates auto-commit semantics: it lazily starts a transaction and uses TransactionScope to interpret SET TRANSACTION statements.
  • Configuration structs derive Clone + Debug, making them easy to print in labs or feed from environment variables (HTTP server uses QUILL_DB_FILE, PORT, etc.).

11. Testing & Documentation (tests/, docs/)

| Area | Purpose |
|---|---|
| tests/sql_example/*.slt | sqllogictest suites for SQL coverage. |
| tests/transaction_tests.rs | Unit tests for MVCC invariants, locking, and visibility. |
| docs/ | This mdBook. Every module adds its own deep-dive chapter, making it straightforward for students to jump from code to guided explanations. |

Testing strategy:

  • Developers can run cargo test -q for fast feedback. Long-running suites can be wrapped with timeout as suggested in AGENTS.md.
  • Example-driven docs (like this page) mirror the repository layout, so onboarding students can find code and tests with minimal guesswork.
  • Encourage students to add sqllogictest cases alongside code changes. Because each case lives in tests/sql_example, git diffs double as documentation.
  • For modules with heavy concurrency (lock manager, WAL), pair unit tests with tracing: the CI logs become walkthroughs for tricky paths.

12. Suggested Reading Order for Learners

  1. Introduction ➜ Architecture – get the 10,000ft view.
  2. Module Overview (this page) – map names to directories.
  3. Execution + Storage chapters – understand TupleStreams, handles, and MVCC.
  4. Recovery + Transaction – see how WAL and MVCC interlock.
  5. Buffer, Index, Background – dive into advanced systems topics once the basics stick.

Treat this document as a living index: update it whenever a subsystem gains new entry points (e.g., asynchronous scans, new index types) so future contributors always know where to look next.

Contributor’s Guide

Welcome, and thank you for your interest in contributing to QuillSQL! Whether you’re fixing a bug, adding a new feature, or improving the documentation, this guide will help you get started.

1. Getting Started: Your Development Environment

Prerequisites

  • Rust: QuillSQL is written in Rust. If you don’t have it yet, install it via rustup. This will provide you with rustc (the compiler) and cargo (the package manager and build tool).
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    
  • Build Essentials: Ensure you have a C++ compiler like gcc or clang installed, which is a common dependency for some Rust crates.

Setup

  1. Fork the Repository: Start by forking the main QuillSQL repository to your own GitHub account.

  2. Clone Your Fork: Clone your forked repository to your local machine.

    git clone https://github.com/feichai0017/QuillSQL.git
    cd QuillSQL
    
  3. Build the Project: Compile the entire project to ensure everything is set up correctly.

    cargo build
    

2. Development Workflow

Running Tests

Before and after making any changes, it’s crucial to run the test suite to ensure you haven’t broken anything.

  • Run all unit and integration tests:

    cargo test
    
  • Run the benchmark suite:

    cargo bench
    

Code Style and Quality

We adhere to the standard Rust coding style and use tools to enforce it.

  • Formatting: Before committing, please format your code with rustfmt.

    cargo fmt --all
    
  • Linting: We use clippy to catch common mistakes and improve code quality. Please ensure clippy passes without warnings.

    cargo clippy --all-targets -- -D warnings
    

Submitting Your Contribution

  1. Create a New Branch: Create a descriptive branch name for your feature or bugfix.

    git checkout -b my-awesome-feature
    
  2. Make Your Changes: Write your code. Add new tests to cover your changes. Ensure all existing tests still pass.

  3. Format and Lint: Run cargo fmt and cargo clippy as described above.

  4. Commit Your Work: Write a clear and concise commit message.

    git add .
    git commit -m "feat: Add support for window functions"
    
  5. Push to Your Fork: Push your branch to your fork on GitHub.

    git push -u origin my-awesome-feature
    
  6. Open a Pull Request: Go to the original QuillSQL repository on GitHub. You should see a prompt to open a Pull Request from your new branch. Fill out the PR template with a description of your changes.

3. Working on the Documentation

The documentation is built using mdbook. To preview your changes locally, you’ll need to install it and the mermaid plugin.

  1. Install mdbook and mdbook-mermaid:

    cargo install mdbook
    cargo install mdbook-mermaid
    
  2. Serve the Book Locally: Run the following command from the root of the project.

    mdbook serve docs
    

    This will build the book and start a local web server. You can open your browser to http://localhost:3000 to see the live-previewed documentation.

SQL Front-End

The SQL front-end lives in src/sql/. It turns raw UTF-8 query text into the abstract syntax trees (ASTs) consumed by planning, while layering Quill-specific name handling and diagnostics on top of sqlparser.


Responsibilities

  • Parse SQL text into sqlparser::ast::Statement values.
  • Record precise spans so error messages can highlight the exact byte range.
  • Normalise identifiers (case folding, quoted names, multi-part paths).
  • Provide helper traits so the logical planner can lower AST nodes without duplicating syntax checks.

Directory Layout

| Path | Purpose | Key Types |
|---|---|---|
| lexer.rs | Token helpers that preserve offsets. | Token, TokenExt |
| parser.rs | Single entry point used across the codebase. | parse_sql, SqlInput |
| ast/mod.rs | Planner-facing helpers. | NormalizedIdent, ObjectNameExt |
| error.rs | Span-aware parser errors. | SqlError, SqlSpan |

Parsing Pipeline

  1. Lexing – wrap sqlparser’s lexer so every token keeps start/end offsets.
  2. AST generation – invoke sqlparser to produce standard Statement structs (a minimal example of driving sqlparser directly is sketched after this list).
  3. Normalisation – convert identifiers into NormalizedIdent, deal with schema qualifiers, and build pieces of TableReference.
  4. Planner bridge – traits like ColumnRefExt expose methods such as relation() or column() so LogicalPlanner can treat different SQL syntaxes uniformly.
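
For orientation, step 2 looks roughly like this when driving the upstream sqlparser crate directly; QuillSQL's own parse_sql wraps it with span tracking and Quill-specific extensions, so treat the snippet as an approximation rather than the in-tree code:

```rust
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = "SELECT id FROM users WHERE age > 30";
    // Parser::parse_sql returns Vec<Statement>; each Statement is the raw AST
    // that the logical planner later lowers into a LogicalPlan.
    let statements = Parser::parse_sql(&GenericDialect {}, sql).expect("valid SQL");
    for stmt in &statements {
        println!("{stmt:#?}");
    }
}
```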

Interactions

  • Logical planner consumes the AST directly and relies on helper traits from this module to convert identifiers into catalog references.
  • Database / Session catch SqlError values, so both CLI and HTTP front-ends show consistent caret diagnostics.
  • Tests (tests/sql_example/*.slt, tests/sql_parser.rs) assert on parser output and error strings to keep teaching feedback stable.

Implementation Notes

  • SqlSpan stores byte offsets, which makes it trivial to slice the original SQL and render highlighted errors.
  • Extended statements (e.g., EXPLAIN, BEGIN TRANSACTION) show how to layer Quill-specific syntax without forking sqlparser entirely.
  • We avoid desugaring at this stage so students can trace SQL → AST → logical plan step by step.

Teaching Ideas

  • Add a new statement (CREATE VIEW, ALTER TABLE ...) and follow the AST through the pipeline.
  • Improve error hints (“Did you forget FROM?”) to see how better diagnostics aid users.
  • Write fuzz tests that round-trip SQL → AST → SQL to discuss parser determinism.

Catalog Module

src/catalog/ acts as QuillSQL’s data dictionary. It tracks schema/table/index metadata, statistics, and the mapping between logical names and physical storage objects such as TableHeap and BPlusTreeIndex. Every layer—planner, execution, background workers—uses the catalog to discover structure.


Responsibilities

  • Persist definitions for schemas, tables, columns, indexes, and constraints.
  • Map logical TableReferences to physical handles (heap files, index roots, file ids).
  • Store table statistics (row counts, histograms) that drive ANALYZE and optimization.
  • Manage the DDL lifecycle: creation and deletion update the in-memory registry and the on-disk metadata pages.

Directory Layout

| Path | Description | Key Types |
|---|---|---|
| mod.rs | Public API surface. | Catalog, TableHandleRef |
| schema.rs | Schema objects and table references. | Schema, Column, TableReference |
| registry/ | Thread-safe registry for heaps (MVCC vacuum). | TableRegistry |
| statistics.rs | ANALYZE output and helpers. | TableStatistics |
| loader.rs | Boot-time metadata loader. | load_catalog_data |

Core Concepts

TableReference

Unified identifier (database, schema, table). Logical planner, execution, and transaction code all use it when requesting handles from the catalog.

Registries

TableRegistry maps internal IDs to Arc<TableHeap> plus logical names. It is used by the MVCC vacuum worker to iterate user tables without poking directly into catalog data.

Schema & Column

Schema stores column definitions (type, default, nullability). Execution uses it when materialising tuples; the planner uses it to check expression types. Schema::project helps physical operators build projected outputs.

TableStatistics

ANALYZE writes row counts and histograms into the catalog. Optimizer rules and planner heuristics can consult these stats when deciding whether to push filters or pick indexes. Each column tracks null/non-null counts, min/max values, and a sample-based distinct estimate, enabling DuckDB-style selectivity heuristics (1/distinct, uniform ranges).
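
A hedged sketch of that kind of estimator follows; the field names are assumptions for illustration, not the actual TableStatistics layout:

```rust
/// Illustrative per-column statistics; the real TableStatistics layout differs.
struct ColumnStats {
    distinct_estimate: u64,
    min: f64,
    max: f64,
}

impl ColumnStats {
    /// Selectivity of `col = literal`: assume rows spread evenly over the
    /// distinct values (the 1/distinct heuristic).
    fn eq_selectivity(&self) -> f64 {
        if self.distinct_estimate == 0 {
            0.0
        } else {
            1.0 / self.distinct_estimate as f64
        }
    }

    /// Selectivity of `col < literal`: assume a uniform distribution
    /// between min and max.
    fn lt_selectivity(&self, literal: f64) -> f64 {
        if self.max <= self.min {
            return 1.0;
        }
        ((literal - self.min) / (self.max - self.min)).clamp(0.0, 1.0)
    }
}
```

Multiplying such per-predicate selectivities by the table row count yields the cardinality estimates a lab-grade cost heuristic needs.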


Interactions

  • SQL / Planner – DDL planning calls Catalog::create_table / create_index; name binding relies on Schema.
  • Execution – ExecutionContext::table_handle and index_handle fetch physical handles through the catalog, so scans never hard-code heap locations.
  • Background workers – MVCC and index vacuum iterate the registries via Arc clones.
  • Recovery – load_catalog_data rebuilds the in-memory catalog from control files and metadata pages during startup.

Teaching Ideas

  • Extend the schema system with hidden or computed columns and teach the catalog to store the extra metadata.
  • Add histogram bins to TableStatistics and demonstrate how a simple cost heuristic can choose better plans.
  • Turn on RUST_LOG=catalog=debug to observe how DDL mutates the registries.

Expression & Scalar Evaluation

The expression subsystem (src/expression/) powers column computations, predicates, and UPDATE assignments. It keeps expression trees approachable while demonstrating how they are evaluated during execution.


Responsibilities

  • Store planner-produced expression trees (Expr) in a serializable, traversable enum.
  • Bind column references, constants, and built-in functions.
  • Evaluate expressions against Tuples at runtime, yielding ScalarValue.
  • Provide type inference and casting so arithmetic/comparison operators remain well-typed.

Directory Layout

| Path | Description | Key Types |
|---|---|---|
| mod.rs | Public API and core enum. | Expr, ExprTrait |
| scalar.rs | Runtime scalar representation + conversions. | ScalarValue, DataType |
| binder.rs | Helpers for the planner/SQL binder. | BoundExpr |

Concepts

Expr Enum

Expresses column refs, literals, comparisons, logical ops, arithmetic, and function invocations. Each variant implements ExprTrait::evaluate(&self, tuple) and returns a ScalarValue.

ScalarValue

Unified runtime value across types (int, bigint, bool, decimal, varchar, …). Includes cast_to(DataType) so results can be coerced to the target column type before writes.

Type Inference

Planner code invokes Expr::data_type(schema) to predict result types. Execution then casts when needed—e.g., UPDATE t SET a = b + 1 uses the column’s declared type for a.
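
A toy version of that evaluate-then-cast pipeline is shown below, with deliberately tiny expression and scalar enums; QuillSQL's real Expr and ScalarValue carry many more variants and a separate cast_to step:

```rust
#[derive(Clone, Debug, PartialEq)]
enum Scalar {
    Int64(i64),
    Bool(bool),
    Null,
}

enum Expr {
    /// Column reference by position in the tuple.
    Column(usize),
    Literal(Scalar),
    /// left + right (integers only in this toy version).
    Add(Box<Expr>, Box<Expr>),
    /// left > right
    Gt(Box<Expr>, Box<Expr>),
}

impl Expr {
    fn evaluate(&self, tuple: &[Scalar]) -> Scalar {
        match self {
            Expr::Column(i) => tuple[*i].clone(),
            Expr::Literal(v) => v.clone(),
            Expr::Add(l, r) => match (l.evaluate(tuple), r.evaluate(tuple)) {
                (Scalar::Int64(a), Scalar::Int64(b)) => Scalar::Int64(a + b),
                _ => Scalar::Null, // NULL propagates through arithmetic
            },
            Expr::Gt(l, r) => match (l.evaluate(tuple), r.evaluate(tuple)) {
                (Scalar::Int64(a), Scalar::Int64(b)) => Scalar::Bool(a > b),
                _ => Scalar::Null,
            },
        }
    }
}

fn main() {
    // tuple: (id = 1, age = 40); predicate: age > 30 + 5
    let tuple = vec![Scalar::Int64(1), Scalar::Int64(40)];
    let pred = Expr::Gt(
        Box::new(Expr::Column(1)),
        Box::new(Expr::Add(
            Box::new(Expr::Literal(Scalar::Int64(30))),
            Box::new(Expr::Literal(Scalar::Int64(5))),
        )),
    );
    assert_eq!(pred.evaluate(&tuple), Scalar::Bool(true));
}
```

A constant-folding rule (see the optimizer chapters) would pre-compute the 30 + 5 subtree before execution ever evaluates it.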


Interactions

  • Planner – builds Expr trees with bound columns; execution reuses them verbatim.
  • ExecutionContext – exposes eval_expr and eval_predicate, wrapping expression evaluation plus boolean coercion (NULL becomes false for predicates).
  • Optimizer – rules like constant folding traverse Expr trees and reuse ScalarValue arithmetic helpers.

Teaching Ideas

  • Add a simple built-in function (length(expr)) to follow the pipeline from parsing to evaluation.
  • Implement short-circuiting or full three-valued boolean logic and validate with sqllogictest.
  • Instrument Expr::evaluate with tracing to visualise expression evaluation inside physical operators.

Query Planner Module

src/plan/ bridges parsed SQL and executable operators. It converts the AST into a logical plan, applies rewrites (via the optimizer), and finally emits a physical plan (PhysicalPlan) that the Volcano engine can run.


Responsibilities

  1. LogicalPlanner – walks the AST, binds table/column names using PlannerContext, performs type checking, and builds a LogicalPlan tree.
  2. PlannerContext – exposes catalog lookups plus scope information for CTEs, subqueries, and aliases.
  3. PhysicalPlanner – lowers an optimized LogicalPlan into a tree of Volcano operators.

Directory Layout

| Path | Description | Key Types |
|---|---|---|
| logical_plan.rs | Logical algebra nodes. | LogicalPlan, LogicalExpr, JoinType |
| logical_planner.rs | AST → logical transformation. | LogicalPlanner |
| physical_plan.rs | PhysicalPlan enum definition. | PhysicalPlan, Physical* structs |
| physical_planner.rs | Logical → physical lowering. | PhysicalPlanner |
| planner_context.rs | Catalog/scope abstraction. | PlannerContext |

Workflow

  1. Name binding – LogicalPlanner resolves table + column references, creates TableReferences, and validates schemas via the catalog.
  2. Logical tree – each SQL clause becomes a logical node (FROM → SeqScan, WHERE → Filter, GROUP BY → Aggregate, etc.).
  3. Physical selection – PhysicalPlanner picks concrete algorithms (sequential scan, index scan, nested-loop join, sort, limit …). Because every physical node implements VolcanoExecutor, the execution engine can pull tuples immediately.

Interactions

  • SQL front-end – provides the AST; helper traits (NormalizedIdent, etc.) keep name resolution consistent.
  • Catalog – PlannerContext relies on it to confirm table/index existence and fetch schemas.
  • Optimizer – operates purely on LogicalPlan; the planner must emit clean, traversable trees so rules can fire.
  • Execution – physical nodes carry TableReference, SchemaRef, and hints that the execution engine passes to the storage layer.

Teaching Ideas

  • Implement a new logical operator (e.g., LogicalDistinct) and add the corresponding physical operator to trace the full lifecycle.
  • Experiment with early projection inside the logical plan and observe its impact on downstream operators.
  • Use pretty_format_logical_plan/physical_plan dumps to visualise rewrites before and after optimizer passes.

Further reading: The Lifecycle of a Query

The Lifecycle of a Query

When you submit a SQL query to a database, it doesn’t just magically produce a result. The database undertakes a sophisticated, multi-stage process to translate the declarative SQL statement (which describes what data you want) into an imperative, efficient execution plan (which describes how to get that data). This entire process is the responsibility of the Query Planner.

In QuillSQL, this process follows a classic, compiler-like pipeline, which is a cornerstone of modern database architecture as taught in courses like CMU 15-445.

The journey from a SQL string to an executable plan involves several transformations:

SQL String -> AST (Abstract Syntax Tree) -> Logical Plan -> Optimized Logical Plan -> Physical Plan

Let’s break down each stage.

Stage 1: Parsing (SQL -> AST)

The first step is purely syntactic. The raw SQL query string is fed into a parser. QuillSQL uses the excellent sqlparser crate for this. The parser checks if the SQL conforms to valid grammar and, if so, converts it into an Abstract Syntax Tree (AST).

An AST is a direct tree representation of the SQL query’s structure. For example, SELECT id FROM users WHERE age > 30 would be parsed into a tree structure with nodes representing the SELECT clause, the table users, the WHERE clause, and the predicate age > 30.

Stage 2: Logical Planning (AST -> Logical Plan)

Next, the LogicalPlanner (plan/logical_planner.rs) walks the AST and converts it into a Logical Plan.

A Logical Plan is a tree of relational algebra operators. It describes the query in terms of high-level data operations, completely independent of how the data is stored or which algorithms will be used. It defines what to do, not how to do it.

Key logical operators in QuillSQL (plan/logical_plan/mod.rs) include:

  • TableScan(users): Represents reading the entire users table.
  • Filter(predicate: age > 30): Represents filtering rows based on a condition.
  • Projection(columns: [id]): Represents selecting specific columns.
  • Join: Represents joining two data sources.
  • Aggregate: Represents a GROUP BY operation.
  • Sort: Represents an ORDER BY operation.

For our example query, the initial logical plan might look like this:

Projection(columns=[id])
  └── Filter(predicate=age > 30)
        └── TableScan(users)

Stage 3: Logical Optimization

Before executing the plan, we have a crucial opportunity to make it more efficient. The LogicalOptimizer (optimizer/logical_optimizer.rs) takes the logical plan and applies a series of transformation rules to produce a new, equivalent logical plan that is expected to be faster.

QuillSQL uses a simple but effective rule-based optimizer. A classic example of such a rule is Predicate Pushdown. Consider this query:

SELECT name FROM (SELECT * FROM users JOIN cities ON users.city_id = cities.id) WHERE users.age > 30;

A naive logical plan would first perform a full JOIN between users and cities and then filter the massive result. Predicate pushdown is a rule that would rewrite the plan to apply the age > 30 filter before the join:

Before Optimization:

Filter(users.age > 30)
  └── Join(users.city_id = cities.id)
        ├── TableScan(users)
        └── TableScan(cities)

After Optimization (Predicate Pushdown):

Join(users.city_id = cities.id)
  ├── Filter(users.age > 30)
  │     └── TableScan(users)
  └── TableScan(cities)

By filtering early, we dramatically reduce the number of rows that need to be processed by the expensive Join operator. QuillSQL implements similar rules, such as PushDownLimit, which pushes LIMIT clauses down the tree to reduce the amount of data processed.

Stage 4: Physical Planning (Logical Plan -> Physical Plan)

Finally, the PhysicalPlanner (plan/physical_planner.rs) converts the optimized logical plan into a Physical Plan.

A Physical Plan describes exactly how the query will be executed. It maps each logical operator to a concrete algorithm or implementation.

  • A LogicalPlan::TableScan becomes a PhysicalSeqScan (a sequential scan of the table heap).
  • A LogicalPlan::Filter becomes a PhysicalFilter, which implements the filtering logic.
  • A LogicalPlan::Join becomes a PhysicalNestedLoopJoin. This is where the database commits to a specific join algorithm. A more advanced database might have multiple options (e.g., PhysicalHashJoin, PhysicalSortMergeJoin) and would use a cost model to choose the best one. QuillSQL currently implements Nested Loop Join.

Each node in the physical plan tree is an executor that the Execution Engine can run. This final plan is what gets executed to produce the query result.

Conclusion

This layered approach—from syntax to a logical representation, then to an optimized logical representation, and finally to a concrete physical execution plan—is fundamental to database design. It provides a clear separation of concerns and, most importantly, creates a dedicated optimization stage, which is the key to achieving high performance on a wide variety of SQL queries.


For Study & Discussion

  1. Logical vs. Physical: Why is the separation between logical and physical plans so important? What would be the disadvantages of a simpler system that converted the AST directly into a physical plan?

  2. Join Algorithms: QuillSQL currently only implements NestedLoopJoin. What are two other common join algorithms? Describe how they work and in what scenarios they would be more performant than a nested loop join.

  3. Programming Exercise (Advanced): Implement a PhysicalHashJoin operator. This is a significant undertaking that involves: a. Creating a PhysicalHashJoin struct that implements the VolcanoExecutor trait. b. In the init() phase, it should consume the entire “build” side (typically the smaller, right-hand table) and build an in-memory hash table from its rows. c. In the next() phase, it should read from the “probe” side (the left-hand table) one row at a time, probe the hash table for matches, and emit the joined tuples. d. Modify the PhysicalPlanner to choose PhysicalHashJoin instead of PhysicalNestedLoopJoin for equi-joins.

  4. Programming Exercise: Add support for the UNION ALL operator. This would involve: a. Adding a Union variant to the LogicalPlan enum. b. Updating the LogicalPlanner to recognize the UNION syntax in the AST and create a LogicalPlan::Union node. c. Creating a PhysicalUnion executor that pulls tuples from its first child until it’s exhausted, and then pulls tuples from its second child.

Optimizer Module

src/optimizer/ contains a lightweight, teaching-friendly rule engine. It rewrites LogicalPlan trees into cheaper equivalents without requiring a full cost-based framework.


Responsibilities

  • Define the OptimizerRule trait (“match → rewrite”).
  • Ship built-in rules such as predicate pushdown, projection pruning, and limit pushdown.
  • Provide a pipeline (LogicalOptimizer) that repeatedly applies rules until reaching a fixpoint, while remaining extensible for future cost models.

Directory Layout

| Path | Description | Key Types |
|---|---|---|
| mod.rs | Optimizer entry point. | LogicalOptimizer |
| rule.rs | Trait + shared helpers. | OptimizerRule |
| rules/* | Concrete rewrites. | PushDownFilter, PushDownLimit, … |

How It Works

  1. LogicalOptimizer::optimize(plan) iterates through the registered rule list.
  2. Each rule implements fn apply(&LogicalPlan) -> Option<LogicalPlan>. Returning Some means the rule fired; the pipeline restarts to reach a fixpoint (a skeleton rule is sketched after this list).
  3. Rules are pure functions, which keeps them easy to unit test and reason about.
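
Here is a skeleton of such a rule, sketched with a placeholder plan type and an illustrative trait; the method and variant names are assumptions, and the in-tree OptimizerRule trait may spell them differently:

```rust
/// Placeholder logical plan, just for the sketch.
#[derive(Clone, Debug)]
enum Plan {
    Scan { table: String },
    Limit { n: usize, input: Box<Plan> },
}

/// Illustrative rule trait: return Some(rewritten) if the rule fired.
trait Rule {
    fn apply(&self, plan: &Plan) -> Option<Plan>;
}

/// Merge two stacked limits into the more restrictive one.
struct MergeLimit;

impl Rule for MergeLimit {
    fn apply(&self, plan: &Plan) -> Option<Plan> {
        if let Plan::Limit { n: outer, input } = plan {
            if let Plan::Limit { n: inner, input: child } = input.as_ref() {
                return Some(Plan::Limit {
                    n: (*outer).min(*inner),
                    input: child.clone(),
                });
            }
        }
        None
    }
}

fn main() {
    let plan = Plan::Limit {
        n: 10,
        input: Box::new(Plan::Limit {
            n: 5,
            input: Box::new(Plan::Scan { table: "users".into() }),
        }),
    };
    println!("{:?}", MergeLimit.apply(&plan)); // Some(Limit { n: 5, .. })
}
```

Because the rule is a pure function from plan to plan, it can be unit tested with two hand-built plan values, which is exactly what makes instrumenting hit counts or A/B testing heuristics easy.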

Examples:

  • PushDownFilter moves filters below scans/joins to reduce input size sooner.
  • PushDownLimit applies LIMIT before expensive joins/sorts when safe.
  • PruneProjection removes unused columns so execution/storage decode less data.

Extending With Statistics

The optimizer intentionally remains heuristics-only, and the physical planner sticks to simple sequential scans. For coursework, students can still read TableStatistics from the catalog to prototype their own cardinality estimates or cost heuristics (e.g., to experiment with when to prefer an index scan), but no estimator ships in-tree.


Interactions

  • LogicalPlan – the optimizer only sees logical nodes; physical/storage layers remain untouched.
  • Catalog / Statistics – current rules are heuristic, but TableStatistics remains available for students who want to prototype their own cost-based decisions.
  • Execution – leaner logical plans translate into simpler physical plans (e.g., predicate pushdown allows PhysicalSeqScan to discard rows earlier).

Teaching Ideas

  • Implement a new rule (join reordering, constant folding) and use RUST_LOG=trace to compare plan dumps before/after.
  • Discuss pipeline ordering—swap rule order and observe different outcomes.
  • Prototype a tiny cost estimator using row counts from TableStatistics to decide on index scans vs sequential scans.

Further reading: Rule-Based Optimization

Rule-Based Optimization

After the LogicalPlanner creates an initial LogicalPlan, it’s passed to the LogicalOptimizer. The initial plan is a direct, syntactically correct translation of the SQL query, but it’s often not the most efficient way to execute it. The optimizer’s job is to transform this plan into an equivalent, but more performant, logical plan.

The Optimizer in QuillSQL

QuillSQL implements a Rule-Based Optimizer. This is a common and powerful approach where the optimizer is equipped with a set of predefined transformation rules. It repeatedly applies these rules to the logical plan tree until no more rules can be applied, or a maximum number of passes is reached.

The main components are:

  • LogicalOptimizer (optimizer/logical_optimizer.rs): The main driver. It holds a list of rules and contains the logic to recursively walk the plan tree and apply them.
  • LogicalOptimizerRule Trait: An interface that every optimization rule must implement. Its core method is try_optimize, which takes a plan node and attempts to return a rewritten, optimized version of that node.

Deep Dive: The PushDownLimit Rule

One of the most classic and effective optimizations is “pushing down” operations as far as possible towards the data source. Let’s examine the PushDownLimit rule (optimizer/rule/push_down_limit.rs) to see this in action.

Consider the following query:

SELECT * FROM users ORDER BY signup_date LIMIT 10;

The Naive Plan

A naive logical plan for this query would be:

Limit(10)
  └── Sort(by: signup_date)
        └── TableScan(users)

If executed directly, this plan would:

  1. Scan the entire users table.
  2. Sort the entire table by signup_date.
  3. Finally, discard all but the first 10 rows.

This is incredibly inefficient, especially for a large table, as it involves a massive, memory-intensive sort operation.

The Optimization Rule

The PushDownLimit rule is designed to recognize this specific pattern: a Limit operator directly on top of a Sort operator.

When the optimizer applies this rule, the try_optimize method matches on the Limit node. It inspects its child and sees that it’s a Sort node. The rule then knows it can apply its logic.

The Rewritten Plan

The rule rewrites the plan tree by “pushing” the limit information into the Sort node itself:

Limit(10)
  └── Sort(by: signup_date, limit: 10)
        └── TableScan(users)

Notice the new limit: 10 property on the Sort node. This seemingly small change has a huge performance impact. When the PhysicalSort operator is created from this logical node, it now knows that it only needs to find the top 10 rows. Instead of performing a full sort, it can use a much more efficient algorithm, like a heap sort (using a min-heap of size 10), to find the top 10 rows in a single pass over the data.

This optimization avoids sorting the entire table, dramatically reducing both CPU and memory consumption.
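
To make that claim concrete, here is a self-contained top-k pass with a bounded heap, independent of QuillSQL's PhysicalSort. It is shown for the largest-k case, which uses a min-heap of size k; the smallest-k case mirrors it with a max-heap:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keep the k largest keys seen so far using a min-heap of size k
/// (std's BinaryHeap is a max-heap, so Reverse turns it into one).
/// One pass over the input, O(n log k) instead of a full sort.
fn top_k_largest(keys: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k + 1);
    for key in keys {
        heap.push(Reverse(key));
        if heap.len() > k {
            heap.pop(); // evict the smallest of the current candidates
        }
    }
    let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    out.sort_unstable_by(|a, b| b.cmp(a)); // descending
    out
}

fn main() {
    let signup_days = [42, 7, 99, 3, 15, 88, 1, 63, 27, 5, 70];
    assert_eq!(top_k_largest(signup_days, 3), vec![99, 88, 70]);
}
```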

Other Rules

QuillSQL implements other simple but effective rules:

  • EliminateLimit: Removes a LIMIT clause if it provides no value (e.g., LIMIT NULL).
  • MergeLimit: If two LIMIT clauses are stacked on top of each other (which can happen after other rule transformations), this rule merges them into a single, more restrictive LIMIT.

Conclusion

While QuillSQL’s optimizer is currently rule-based and relatively simple, it demonstrates the fundamental principles of query optimization. By separating the logical representation of a query from its physical execution and applying equivalence-preserving transformations, a database can achieve massive performance gains. More advanced systems build on this with a Cost-Based Optimizer, which uses table statistics to estimate the “cost” of different physical plans (e.g., choosing between a NestedLoopJoin and a HashJoin) and pick the cheapest one.


For Study & Discussion

  1. Rule Ordering: The LogicalOptimizer applies its list of rules in a fixed order for a set number of passes. Can the order in which rules are applied affect the final, optimized plan? Can one rule’s transformation enable another rule to be applied in a subsequent pass?

  2. Cost-Based vs. Rule-Based: What is the primary limitation of a purely rule-based optimizer? When would a rule-based optimizer make a poor decision that a cost-based optimizer (with accurate statistics) would get right? (Hint: consider join algorithm selection).

  3. Programming Exercise: Implement the classic Predicate Pushdown rule. Your rule should look for a Filter operator whose child is a Join. If the filter’s predicate only uses columns from one side of the join, the rule should push the Filter node down to that side of the join, below the Join node. This is one of the most effective optimizations in any database.

  4. Programming Exercise: Implement a Constant Folding rule. This rule would traverse expression trees and pre-compute constant expressions. For example:

    • An expression WHERE age = 10 + 5 would be rewritten to WHERE age = 15.
    • An expression WHERE 1 = 1 would be evaluated to true, and a smart optimizer could then potentially eliminate the WHERE clause entirely.

Execution Engine

src/execution/ drives PhysicalPlan trees using the Volcano (iterator) model. Every operator pulls tuples from its children, coordinating closely with transactions, storage, and expression evaluation.


Core Components

| Component | Role |
|---|---|
| PhysicalPlan | Enum covering all physical operators; each implements VolcanoExecutor. |
| ExecutionContext | Shared context carrying the catalog, TxnContext, storage engine, and expression helpers. |
| TupleStream | Unified scan interface returned by table/index handles. |

Execution Flow

  1. ExecutionEngine::execute calls init on the root plan (and recursively on children).
  2. The engine loops calling next, with parents pulling tuples from children.
  3. ExecutionContext supplies transaction snapshots, lock helpers, and expression evaluation per call.
  4. Once next returns None, the accumulated results are returned to the caller (CLI, HTTP API, or tests); the driver loop is sketched below.
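
The driver loop itself is tiny. Here is a minimal sketch with stand-in types; the real engine threads an ExecutionContext through every init/next call:

```rust
/// Minimal stand-ins for the sketch; QuillSQL's real types carry more state.
type Tuple = Vec<String>;

trait Operator {
    fn init(&mut self);
    fn next(&mut self) -> Option<Tuple>;
}

/// Pull-based driver: init the root once, then pull until exhausted.
fn execute(root: &mut dyn Operator) -> Vec<Tuple> {
    root.init();
    let mut results = Vec::new();
    while let Some(tuple) = root.next() {
        results.push(tuple);
    }
    results
}
```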

Operator Examples

  • PhysicalSeqScan – acquires a table_stream from the storage engine, uses ScanPrefetch for batching, and relies on TxnContext::read_visible_tuple for MVCC.
  • PhysicalIndexScan – uses index_stream, tracks invisible_hits, and notifies the catalog when garbage accumulates.
  • PhysicalUpdate/PhysicalDelete – call prepare_row_for_write to re-validate locks and the latest tuple before invoking apply_update/delete.
  • PhysicalNestedLoopJoin – showcases the parent/child pull loop and acts as a baseline for more advanced joins.

Interactions

  • StorageEngine – all data access goes through handles/streams, keeping execution storage-agnostic.
  • Transaction – TxnContext enforces locking, snapshots, and undo logging; operators never talk to LockManager directly.
  • Expression – ExecutionContext::eval_expr / eval_predicate evaluate expressions built by the planner.
  • Optimizer/Planner – execution honours the plan as-is; all structural choices happen upstream.

Teaching Ideas

  • Implement a new operator (e.g., PhysicalMergeJoin) to see how ExecutionContext support generalises.
  • Add adaptive prefetching inside PhysicalSeqScan to explore iterator hints.
  • Enable RUST_LOG=execution=trace to watch the init/next call sequence during a query.

Further reading: The Volcano Execution Model

The Volcano Execution Model

Once the Query Planner has produced an optimized PhysicalPlan, it’s the job of the Execution Engine to run it and produce results. The execution engine is the component that brings the plan to life, interacting with the transaction manager and storage layer to process data.

QuillSQL uses the classic Volcano Model, also known as the Iterator Model. This is a pull-based execution model where each physical operator in the plan tree acts as an iterator that the parent operator can “pull” rows from.

1. The VolcanoExecutor Trait

At the heart of the execution model is the VolcanoExecutor trait (execution/mod.rs). Every physical operator implements this simple trait:

```rust
pub trait VolcanoExecutor {
    fn init(&self, context: &mut ExecutionContext) -> QuillSQLResult<()>;
    fn next(&self, context: &mut ExecutionContext) -> QuillSQLResult<Option<Tuple>>;
    fn output_schema(&self) -> SchemaRef;
}
```

  • init(): This method is called once at the beginning of execution. It allows an operator to set up its initial state (e.g., a SeqScan operator would initialize its table iterator here).
  • next(): This is the core method. When called, the operator produces its next output tuple. It returns Some(tuple) if it has a row, or None if it has exhausted its data source. The top-level ExecutionEngine simply calls next() on the root of the plan tree in a loop until it receives None.

2. The ExecutionContext

Notice that both init() and next() take a mutable ExecutionContext. This object is the “world” in which the query runs. It is passed down the operator tree and exposes:

  • Catalog + StorageEngine: Operators call context.table(&table_ref) to obtain a TableBinding. The binding encapsulates heap/index access (scan, insert, delete, update, prepare-row-for-write) so operators never touch TableHeap or MvccHeap directly.
  • TxnContext: Provides the current transaction, MVCC snapshot, and helper methods for row/table locks and visibility checks.
  • Expression helpers: eval_expr / eval_predicate evaluate AST expressions against a tuple without leaking ScalarValue plumbing into operators.

Thanks to the binding abstraction, the operator code only expresses “what” should happen (scan/update/delete) while the binding implements “how” (MVCC chain navigation, undo logging, index maintenance).

3. Anatomy of Physical Operators

Data flows up the tree from the leaves (scans) to the root. Let’s see how it works by examining a few key operators.

Leaf Operator: PhysicalSeqScan

A sequential scan sits at the leaf of a plan tree and streams tuples from a table binding.

  • init():
    1. Acquires an IntentionShared lock on the table via TxnContext.
    2. Requests a TableBinding (context.table(&table_ref)) and calls binding.scan(...) to obtain a TupleStream. The binding hides the actual TableIterator implementation and any MVCC plumbing.
  • next():
    1. Pops the next (rid, meta, tuple) from its prefetch buffer (which in turn pulls from the binding’s stream).
    2. Calls TxnContext::read_visible_tuple to perform the MVCC visibility check and acquire the required shared row lock.
    3. Returns the tuple if visible; otherwise, keeps looping.

Unary Operator: PhysicalFilter

A filter has one child operator (its input). It implements a WHERE clause.

  • next(): Its logic is a simple, tight loop:
    1. It calls self.input.next() to get a tuple from its child.
    2. If the child returns None, the filter is also exhausted and returns None.
    3. If it receives a tuple, it evaluates its predicate expression (e.g., age > 30) against the tuple.
    4. If the predicate evaluates to true, it returns the tuple. Otherwise, it loops back to step 1.

Binary Operator: PhysicalNestedLoopJoin

A join has two children: a left (outer) and a right (inner).

  • next(): It implements the classic nested loop join algorithm (sketched in code after the steps below):
    1. Fetch one tuple from the outer (left) child and hold onto it.
    2. Enter a loop: fetch tuples one by one from the inner (right) child.
    3. For each inner tuple, combine it with the held outer tuple and evaluate the join condition. If it matches, return the combined tuple.
    4. When the inner child is exhausted, rewind it by calling self.right_input.init() again.
    5. Go back to step 1 to fetch the next tuple from the outer child.
    6. Repeat until the outer child is also exhausted.
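
The same algorithm as a free-standing sketch over in-memory slices rather than QuillSQL's operator structs, so "rewinding" the inner child is simply restarting the inner loop:

```rust
/// Nested loop join over in-memory rows: for every outer row,
/// scan the whole inner input and emit matching pairs.
fn nested_loop_join<L: Clone, R: Clone>(
    outer: &[L],
    inner: &[R],
    matches: impl Fn(&L, &R) -> bool,
) -> Vec<(L, R)> {
    let mut out = Vec::new();
    for l in outer {
        // "Rewinding" the inner child is just restarting this scan.
        for r in inner {
            if matches(l, r) {
                out.push((l.clone(), r.clone()));
            }
        }
    }
    out
}

fn main() {
    let users = [(1, "Alice"), (2, "Bob")];   // (city_id, name)
    let cities = [(1, "Oslo"), (3, "Lima")];  // (id, city)
    let joined = nested_loop_join(&users, &cities, |u, c| u.0 == c.0);
    assert_eq!(joined, vec![((1, "Alice"), (1, "Oslo"))]);
}
```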

4. Putting It All Together

Consider the query SELECT name FROM users WHERE age > 30. The physical plan is Projection -> Filter -> SeqScan.

  1. The ExecutionEngine calls next() on the Projection operator.
  2. The Projection operator needs a tuple, so it calls next() on its child, Filter.
  3. The Filter operator needs a tuple, so it calls next() on its child, SeqScan.
  4. The SeqScan operator asks its TableBinding for the next tuple, and after the MVCC check finds a visible tuple for a user with age = 25.
  5. SeqScan returns this tuple up to Filter.
  6. Filter evaluates age > 30 on the tuple. It’s false, so it loops, calling SeqScan.next() again.
  7. SeqScan pulls another visible tuple (this time age = 40, name = 'Alice') through the binding.
  8. SeqScan returns this tuple up to Filter.
  9. Filter evaluates age > 30. It’s true! It returns the tuple for Alice up to Projection.
  10. Projection takes the full tuple for Alice, creates a new tuple containing only the name column ('Alice'), and returns this new tuple as the result.

This process repeats, with tuples flowing up the tree one at a time, until the SeqScan operator runs out of pages and returns None, which then propagates up the tree, signaling the end of execution.


For Study & Discussion

  1. Push vs. Pull Models: The Volcano model is a “pull-based” model. An alternative is a “push-based” model, where operators push their results to their parents as soon as they are ready. What are the potential advantages and disadvantages of each model, particularly concerning cache efficiency and control flow?

  2. Blocking vs. Non-Blocking Operators: Some operators, like PhysicalFilter, can produce their first output row as soon as they receive their first input row. These are non-blocking. Other operators, like PhysicalSort, must consume their entire input before they can produce even a single row of output. These are blocking. What is the impact of blocking operators on query latency and memory usage?

  3. Programming Exercise: The current PhysicalNestedLoopJoin is simple but can be inefficient as it re-scans the entire inner table for every outer row. Implement a PhysicalBlockNestedLoopJoin operator. This version would read a block (a small batch) of tuples from the outer table into an in-memory buffer, and then iterate through the inner table once for that entire block. This can significantly reduce the number of times the inner table needs to be scanned.

  4. Programming Exercise: Implement the PhysicalLimit operator. Its next() method should: a. Keep an internal counter. b. If the counter is less than the offset, pull and discard tuples from its child. c. If the counter is between offset and offset + limit, pull a tuple from its child and return it. d. Once the limit is reached, it should stop pulling from its child and return None for all subsequent calls. This is important for efficiency, as it stops the execution of the entire sub-tree below it.
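
One possible solution sketch for this exercise, again on the toy Operator trait from the scan sketch; treat it as a starting point rather than the reference implementation.

// Hypothetical limit/offset operator built on the toy trait defined earlier.
struct Limit {
    input: Box<dyn Operator>,
    offset: usize,
    limit: usize,
    skipped: usize, // tuples discarded to satisfy the offset
    emitted: usize, // tuples returned so far
}

impl Operator for Limit {
    fn init(&mut self) {
        self.input.init();
        self.skipped = 0;
        self.emitted = 0;
    }

    fn next(&mut self) -> Option<Tuple> {
        // Once the limit is reached, stop pulling from the child entirely.
        if self.emitted >= self.limit {
            return None;
        }
        // Discard tuples until the offset is satisfied.
        while self.skipped < self.offset {
            self.input.next()?;
            self.skipped += 1;
        }
        let tuple = self.input.next()?;
        self.emitted += 1;
        Some(tuple)
    }
}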

Transaction Module

src/transaction/ enforces the Atomicity and Isolation parts of ACID. It combines MVCC with strict two-phase locking so reads and writes can proceed concurrently without violating correctness.


Main Components

Type | Role
TransactionManager | Creates/commits/aborts transactions, assigns txn & command ids, coordinates WAL.
Transaction | Stores state, held locks, undo chain, and cached snapshot.
TxnContext / TxnRuntime | Execution-time wrapper exposing MVCC + locking helpers.
LockManager | Multi-granularity locking (IS/IX/S/SIX/X) with deadlock detection.
TransactionSnapshot | Tracks xmin/xmax/active_txns for visibility checks.

Workflow

  1. SessionContext calls TransactionManager::begin to create a transaction.
  2. Each SQL statement builds a TxnRuntime, yielding a fresh command id and snapshot.
  3. Operators call TxnContext::lock_table/lock_row to obey strict 2PL.
  4. TableHandle::insert/delete/update records undo, acquires locks, and emits WAL via TxnContext.
  5. Commit: write a Commit record → flush depending on synchronous_commit → release locks.
  6. Abort: walk the undo list, write CLRs, restore heap/index state, release locks.

MVCC Details

  • TupleMeta stores inserting/deleting txn ids and command ids. read_visible_tuple checks snapshots and, if needed, rewinds to the latest visible version.
  • Isolation levels:
    • Read Uncommitted – minimal snapshot caching.
    • Read Committed – refresh snapshot each command to avoid dirty reads.
    • Repeatable Read / Serializable – capture the snapshot once; RR releases shared locks at statement end, while Serializable holds them until commit to avoid phantoms.
  • UPDATE skips versions created by the same (txn_id, command_id) to avoid looping back over freshly inserted tuples.

Locking

  • Multi-granularity hierarchy: table-level IS/IX/S/SIX/X plus row-level S/X.
  • Deadlock detection: LockManager maintains a wait-for graph and periodically chooses a victim (usually the longest waiter).
  • Release policy: exclusive/intent locks stay until commit; RR drops shared row locks at statement end, Serializable waits until commit.

Interactions

  • ExecutionContext – all helpers (lock acquisition, visibility checks, undo logging) are exposed here, so physical operators never touch LockManager directly.
  • StorageEngine – handles call TxnContext before mutating heaps/indexes; MVCC metadata lives in TupleMeta. Deletes and updates now push the affected index keys into the undo chain so heap/index WAL stay in lockstep.
  • Recovery – Begin/Commit/Abort records emitted here drive ARIES undo/redo.
  • Background – MVCC vacuum reads TransactionManager::oldest_active_txn() to compute safe_xmin.

Teaching Ideas

  • Change DatabaseOptions::default_isolation_level and compare SELECT behaviour under RC vs RR.
  • Write a unit test that deadlocks two transactions and watch LockManager pick a victim.
  • Implement statement-level snapshot refresh or Serializable Snapshot Isolation (SSI) as an advanced exercise.

Lab Walkthrough (à la CMU 15-445)

  1. Warm-up – Start two sessions, run BEGIN; SELECT ...; under RC vs RR, and trace which snapshot TxnRuntime installs by logging txn.current_command_id().
  2. MVCC visibility – Extend the transaction_tests.rs suite with a scenario where txn1 updates a row while txn2 reads it. Instrument TupleMeta printing so students see how (insert_txn_id, delete_txn_id) change as versions are linked.
  3. Undo tracing – Force an abort after a multi-index UPDATE. Watch the undo stack entries unfold: Insert removes the new version + index entries, Delete restores the old version + keys. Map each step to the WAL records that are written.
  4. Crash drill – Add panic!() right after TransactionManager::commit is called but before locks are released. Reboot, run recovery, and inspect the loser list; students can connect the dots between undo actions, CLRs, and ARIES theory.

Further reading: MVCC and 2PL

MVCC and 2PL

Of the four ACID properties, Isolation is often the most complex to implement. It ensures that concurrent transactions do not interfere with each other, making it appear as if each transaction is executing sequentially, one after another. Without proper isolation, a database would suffer from concurrency-related anomalies like dirty reads, non-repeatable reads, and phantom reads.

Databases typically use two main strategies for concurrency control:

  1. Pessimistic Concurrency Control: Assumes conflicts are likely and prevents them from happening by using locks. The most common protocol is Two-Phase Locking (2PL).
  2. Optimistic Concurrency Control: Assumes conflicts are rare. Transactions proceed without locking, and the database validates at commit time that no conflicts occurred. A popular variant is Multi-Version Concurrency Control (MVCC).

QuillSQL, like many modern relational databases (e.g., PostgreSQL, Oracle), implements a powerful hybrid model that combines MVCC with 2PL.

1. MVCC: Reading without Blocking

The core idea of MVCC is “writers don’t block readers, and readers don’t block writers.” This is achieved by never overwriting data in-place. When a row is updated, the database creates a new version of that row, preserving the old one.

Version Chains and TupleMeta

As discussed in the Storage Engine chapter, every tuple on disk has associated metadata (TupleMeta) that includes:

  • insert_txn_id: The ID of the transaction that created this version.
  • delete_txn_id: The ID of the transaction that marked this version as deleted.
  • prev_version / next_version: Pointers (RIDs) that form a linked list of versions for a single logical row, called the version chain.

Transaction Snapshots

When a transaction begins, it asks the TransactionManager for a snapshot of the database state. This snapshot, defined in transaction/mvcc.rs, contains three key pieces of information:

  • xmin: The oldest active transaction ID at the time of the snapshot. Any transaction with an ID less than xmin is guaranteed to be either committed or aborted.
  • xmax: The next transaction ID to be assigned. Any transaction with an ID greater than or equal to xmax was not yet started when the snapshot was taken.
  • active_txns: A list of all other transaction IDs that were active when the snapshot was taken.

The Visibility Check

When a transaction scans the database, for every tuple version it encounters, it performs a visibility check using its snapshot. The logic in MvccSnapshot::is_visible determines if the version should be “seen” by the current transaction. In simplified terms, a tuple version is visible if:

  1. Its insert_txn_id belongs to a transaction that was already committed before our snapshot was taken.
  2. AND its delete_txn_id is either not set, OR it belongs to a transaction that was not yet committed when our snapshot was taken.

This mechanism elegantly solves several concurrency problems. Since a reader transaction only ever sees a consistent snapshot of the database, it is completely immune to changes being made by other concurrent writer transactions. This prevents dirty reads and non-repeatable reads.
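
A simplified sketch of such a check, assuming a snapshot shaped like the one described above. It approximates the rules rather than reproducing the exact MvccSnapshot::is_visible code (for instance, aborted transactions and the “my own changes” case are glossed over).

// Illustrative snapshot; the real struct carries more state.
struct Snapshot {
    xmin: u64,             // oldest active txn id at snapshot time
    xmax: u64,             // next txn id to be assigned at snapshot time
    active_txns: Vec<u64>, // other txns active at snapshot time
}

impl Snapshot {
    /// A transaction "looks committed" to this snapshot if it finished
    /// before the snapshot was taken (aborts are ignored for brevity).
    fn looks_committed(&self, txn_id: u64) -> bool {
        txn_id < self.xmin
            || (txn_id < self.xmax && !self.active_txns.contains(&txn_id))
    }

    /// Simplified visibility rule for a single tuple version.
    fn is_visible(&self, insert_txn: u64, delete_txn: Option<u64>) -> bool {
        // Rule 1: the inserting transaction must look committed to us.
        if !self.looks_committed(insert_txn) {
            return false;
        }
        // Rule 2: the version must not have been deleted by a transaction
        // that also looks committed to us.
        match delete_txn {
            None => true,
            Some(d) => !self.looks_committed(d),
        }
    }
}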

2. Two-Phase Locking (2PL): Preventing Write-Write Conflicts

While MVCC is excellent for read-write conflicts, it does not, by itself, prevent two transactions from trying to modify the same logical row at the same time (a write-write conflict). This is where locking comes in.

QuillSQL implements a strict Two-Phase Locking protocol via the LockManager (transaction/lock_manager.rs).

The Two Phases

  1. Growing Phase: The transaction can acquire new locks as it executes.
  2. Shrinking Phase: Once the transaction releases its first lock, it cannot acquire any new locks.

In practice, QuillSQL uses Strict 2PL, where the shrinking phase is delayed until the transaction commits or aborts. All locks are held until the very end.

Hierarchical Locking (Intention Locks)

To be efficient, the LockManager uses a multi-granularity, hierarchical locking scheme. Before a transaction can take a fine-grained lock on a row (a Shared or Exclusive lock), it must first acquire a coarser-grained intention lock on the table.

  • IntentionShared (IS): Signals the intent to read rows from the table.
  • IntentionExclusive (IX): Signals the intent to modify rows in the table.

Intention locks let a table-level lock request detect conflicts with row-level activity cheaply. For example, a statement that needs an exclusive (X) lock on the entire users table will block if another transaction already holds an IX lock on users, without the LockManager having to inspect every individual row lock.

Deadlock Detection

When two or more transactions are waiting for each other to release locks in a circular chain, a deadlock occurs. The LockManager detects this by building a waits-for graph. When a transaction T1 has to wait for a lock held by T2, an edge T1 -> T2 is added to the graph. If adding an edge creates a cycle, a deadlock is detected, and one of the transactions (the victim) is immediately aborted to break the cycle.
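
A compact sketch of the cycle check on such a waits-for graph, independent of the actual LockManager internals; victim selection and edge maintenance are left out.

use std::collections::{HashMap, HashSet};

/// Edge t1 -> t2 means "transaction t1 waits for a lock held by t2".
type WaitsFor = HashMap<u64, Vec<u64>>;

/// Depth-first search for a cycle reachable from `start`.
fn has_cycle_from(graph: &WaitsFor, start: u64) -> bool {
    fn dfs(graph: &WaitsFor, node: u64, on_path: &mut HashSet<u64>, done: &mut HashSet<u64>) -> bool {
        if on_path.contains(&node) {
            return true; // returned to a node on the current path: cycle found
        }
        if done.contains(&node) {
            return false; // already explored, no cycle through here
        }
        on_path.insert(node);
        for &next in graph.get(&node).map(Vec::as_slice).unwrap_or(&[]) {
            if dfs(graph, next, on_path, done) {
                return true;
            }
        }
        on_path.remove(&node);
        done.insert(node);
        false
    }
    dfs(graph, start, &mut HashSet::new(), &mut HashSet::new())
}

If adding an edge would create such a cycle, the manager picks one transaction on the cycle as the victim and aborts it, which removes that transaction's outgoing edges and breaks the cycle.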

3. The Hybrid Model: MVCC + 2PL in Action

In QuillSQL, these two mechanisms work in concert:

  • A reader (SELECT) transaction acquires an MVCC snapshot. It uses this snapshot to determine visibility. It only needs to acquire Shared (S) locks on the rows it reads to prevent other transactions from modifying them, thus ensuring repeatable reads in higher isolation levels.

  • A writer (UPDATE, DELETE) transaction must acquire an Exclusive (X) lock on the specific row it intends to modify. Once the lock is granted, it knows no other writer can interfere. It can then safely create a new tuple version as part of the MVCC protocol.

This hybrid approach provides the best of both worlds: reads are fast and non-blocking, while write-write conflicts are safely prevented by the locking protocol.


For Study & Discussion

  1. Isolation Levels: QuillSQL supports multiple SQL isolation levels. How does the behavior of MVCC snapshots and 2PL change between ReadCommitted and RepeatableRead? In ReadCommitted, a transaction gets a new snapshot for every statement, whereas in RepeatableRead, it uses the same snapshot for the entire transaction. What concurrency anomalies does this difference prevent?

  2. Phantom Reads: Even with MVCC and row-level 2PL, a RepeatableRead transaction can suffer from phantom reads. Imagine T1 runs SELECT COUNT(*) FROM users WHERE age > 30. Then, T2 inserts a new user with age = 40 and commits. If T1 runs its SELECT query again, it will see a new “phantom” row that wasn’t there before. How can a database prevent this to achieve the Serializable isolation level? (Hint: research predicate locking and index-range locking).

  3. Deadlock Handling: QuillSQL detects deadlocks by building a waits-for graph and aborting a transaction. What is an alternative strategy for handling deadlocks? For example, what are the pros and cons of using lock timeouts instead of cycle detection?

  4. Programming Exercise (Advanced): A full implementation of Serializable isolation often requires index-range locking to prevent phantoms. Extend the LockManager to support locking a range of keys within a B+Tree index. This would require a new lock type and a way to check for overlapping ranges. When a SELECT ... WHERE age > 30 query runs, it would place a shared lock on the (30, +∞) range in the index on the age column, preventing any other transaction from inserting a new user with an age in that range.

Storage Engine

The storage engine persists relational data, covering heap files, indexes, page formats, and the handles exposed to execution. Understanding this layer is key to reasoning about performance, MVCC, and recovery.


Responsibilities

  • Manage TableHeap insert/delete/update paths and their MVCC metadata.
  • Maintain indexes (see the Index module for details).
  • Expose the StorageEngine trait so execution can fetch TableHandle / IndexHandle instances per table.
  • Provide TupleStream so sequential and index scans share a unified interface.

Directory Layout

Path | Purpose | Key Types
engine.rs | Default engine plus handle definitions. | StorageEngine, TableHandle, TupleStream
table_heap/ | Heap storage + MVCC logic. | TableHeap, MvccHeap
index/ | B+Tree implementation. | BPlusTreeIndex
page/ | Page, RID, tuple metadata. | Page, RecordId, TupleMeta
tuple/ | Row encoding and projection helpers. | Tuple
disk_manager.rs | File layout and page I/O. | DiskManager
disk_scheduler.rs | io_uring-backed async scheduler. | DiskScheduler

Core Abstractions

StorageEngine Trait

pub trait StorageEngine {
    fn table(&self, catalog: &Catalog, table: &TableReference)
        -> QuillSQLResult<Arc<dyn TableHandle>>;
    fn indexes(&self, catalog: &Catalog, table: &TableReference)
        -> QuillSQLResult<Vec<Arc<dyn IndexHandle>>>;
}

The default implementation wraps the row-oriented heap + B+Tree combo, but the trait is ready for column stores, remote storage, or async engines.

TableHandle

Offers full_scan(), insert, delete, update, and prepare_row_for_write. MVCC, undo, and locking concerns live here so execution operators only describe intent. Every delete/update now receives the table’s index handles so HeapTableHandle can delete or re-insert keys in tandem with heap tuples—exactly the behaviour CMU 15-445’s buffer/heap projects walk you through.

TupleStream

Minimal iterator that returns (RecordId, TupleMeta, Tuple) triples. Index scans use IndexScanRequest to describe ranges.


Interactions

  • Execution – ExecutionContext::table_stream / index_stream delegate to handles.
  • Transaction – Handle methods call into TxnContext to acquire locks, record undo, and emit WAL.
  • Buffer Manager – TableHeap/BPlusTreeIndex access pages through the shared buffer pool.
  • Recovery – Heap/index mutations generate WAL records (HeapInsert, HeapDelete, IndexInsert, …) that ARIES replays.
  • Background – MVCC vacuum and index cleanup obtain handles and iterate tuples via the same abstractions as foreground scans.

Teaching Ideas

  • Implement a toy columnar handle to show how the execution engine can stay agnostic to storage layout.
  • Extend the TableHandle::full_scan / TableIterator plumbing to accept projection hints so students can experiment with column pruning.
  • Enable RUST_LOG=storage::table_heap=trace and trace MVCC version chains as updates occur.
  • Follow the CMU 15-445 Lab 2 flow: instrument TableBinding::delete to print every (RID, key) pair, run an UPDATE with multiple indexes, and confirm the WAL stream contains the matching HeapInsert/HeapDelete + IndexLeafInsert/IndexLeafDelete entries.

Further reading: Disk I/O, Page & Tuple Layout, Table Heap & MVCC

Disk I/O — Scheduler, io_uring Data Pages & WAL Runtime

1. Architecture

  • Request Path: foreground components enqueue DiskRequest objects via DiskScheduler::{schedule_read, schedule_write, …}. A dispatcher thread drains the global channel and distributes work round-robin to N io_uring workers. Each worker owns its own ring and file-descriptor cache, so once a request is forwarded, execution proceeds entirely off the foreground thread.
  • Stable APIs: schedule_read(page_id), schedule_write(page_id, Bytes), schedule_read_pages(Vec<PageId>), schedule_allocate(), schedule_deallocate(page_id) — every call returns a channel the caller can block on or poll.
  • Batch Reads: ReadPages fans out per-page SQEs while a shared BatchState tracks completions. Even if the kernel completes I/O out of order, the caller receives a Vec<BytesMut> that preserves the original page order.

2. WAL Runtime (buffered I/O)

  • Dedicated WAL runtime threads handle sequential WAL appends/reads using buffered I/O. They now keep a per-thread cache of open segment files, eliminating repeated open()/close() on every log record.
  • Worker count defaults to max(1, available_parallelism / 2) but is tunable through IOSchedulerConfig.
  • Optional sync on a request triggers sync_data / fdatasync so WalManager can honour synchronous commit or checkpoint barriers. Data pages stay on the io_uring dataplane; WAL always uses buffered writes.

3. io_uring Backend (Linux)

  • Each worker owns an IoUring with configurable queue_depth, optional SQPOLL idle timeout, and a pool of registered fixed buffers sized to PAGE_SIZE. Workers submit SQEs asynchronously and drain CQEs in small batches to keep the ring warm.
  • Read batching relies on shared BatchState instances (Rc<RefCell<_>>) so multi-page callers see ordered results without blocking the kernel on serialization.
  • Writes keep their payload alive until completion; if a fixed buffer slot is available we reuse it, otherwise we fall back to heap buffers. A companion WriteState tracks an optional fdatasync so the caller still observes exactly one Result<()> once all CQEs land.
  • Errors (short read/write, errno) are normalised into QuillSQLError values that flow back on the original channel.

4. Configuration

  • config::IOSchedulerConfig controls:
    • workers: number of io_uring workers (default = available parallelism).
    • wal_workers: WAL runtime threads (default workers / 2).
    • iouring_queue_depth, iouring_fixed_buffers, iouring_sqpoll_idle_ms.
    • fsync_on_write: whether data-page writes also issue fdatasync (WAL sync is managed separately by WalManager).

5. Concurrency & Safety

  • Worker-local file descriptors plus positional I/O remove shared mutable state on the hot path. The new per-worker handle cache further reduces syscall overhead.
  • Shutdown sequence: enqueue Shutdown, dispatcher forwards it to every worker, each worker drains outstanding SQEs/CQEs, and finally dispatcher + workers are joined.
  • BufferPool and TableHeap integrate via the same scheduler channels; inflight guards prevent duplicate page fetches even when multiple scans touch adjacent pages.

6. Performance Notes

  • Random page access benefits from fewer syscalls and deeper outstanding queue depth than the blocking fallback.
  • Only the io_uring backend currently ships (Linux x86_64). A portable fallback remains future work.
  • For large sequential scans, rely on the buffer pool’s sequential access pattern or add a custom iterator on top of ReadPages if you want to experiment with direct I/O.

7. Future Work

  • Queue-depth aware scheduling and CQE bulk harvesting.
  • Optional group commit (aggregate writes, single fsync) behind configuration.
  • Metrics hooks (queue depth, submit/complete throughput, latency percentiles, error codes).
  • Cross-platform fallback backend and richer prioritisation/throttling policies.
  • Control-plane knobs for throttling individual background workers.

Page and Tuple Layout

1. The Page: The Atomic Unit of I/O

A database file is not treated as one continuous stream of data. Instead, it is broken down into fixed-size blocks called pages. A page is the atomic unit of transfer between the disk and the in-memory Buffer Pool. Whenever the database needs to read a piece of data (like a single row), it must load the entire page containing that data into memory.

In QuillSQL, the page size is a constant defined at quill-sql/src/buffer/mod.rs:

  • PAGE_SIZE: 4096 bytes (4 KB)

This fixed-size approach simplifies buffer management and allows for efficient, aligned I/O operations, especially when using Direct I/O to bypass the OS cache.

2. TablePage: The Slotted Page Layout

While a page is a generic 4KB block of bytes, pages that store actual table data are structured in a specific way. QuillSQL uses a classic Slotted Page layout, which is a core concept in database implementation (as taught in CMU 15-445).

A TablePage is organized into three main parts:

<------------------------------ 4KB ------------------------------>
+----------------+-----------------+-----------------+--------------+
|  Page Header   |   Slot Array    |      Free       |   Tuple      |
| (grows ->)     |   (grows ->)    |      Space      |     Data     |
|                |                 |                 | (<- grows)   |
+----------------+-----------------+-----------------+--------------+
  1. Page Header (TablePageHeader): Located at the beginning of the page. It contains metadata about the page itself.
  2. Slot Array (tuple_infos): An array of TupleInfo structs that grows from after the header. Each entry in this array acts as a “pointer” or “directory entry” for a tuple on the page.
  3. Tuple Data: The actual raw data of the tuples is stored starting from the end of the page, growing backwards towards the middle.

This design has a key advantage: it decouples a tuple’s logical identifier from its physical location on the page.

The RecordId (RID)

A specific tuple is uniquely identified by a RecordId (RID). The RID is a stable pointer composed of two parts:

  • page_id: The ID of the page where the tuple resides.
  • slot_num: The index into the Slot Array on that page.

So, RID = (page_id, slot_num).

When the database needs to delete a tuple or if a variable-length tuple is updated and grows in size, the tuple’s data might need to be moved within the page (for compaction). In a slotted page design, we only need to update the offset in the corresponding slot array entry. The tuple’s RID (page_id, slot_num) remains unchanged. This prevents a cascade of updates to all secondary indexes that might be pointing to that tuple.

TablePageHeader and Slot (TupleInfo)

Let’s look at the physical layout, as defined in storage/codec/table_page.rs:

  • TablePageHeader:

    • lsn: The Log Sequence Number of the last change made to this page, crucial for WAL recovery.
    • next_page_id: The ID of the next page in this table, forming a linked list of pages.
    • num_tuples: The number of active tuples on the page.
    • num_deleted_tuples: The number of “dead” or deleted tuples.
    • tuple_infos: The slot array itself.
  • TupleInfo (A single slot in the array):

    • offset: The byte offset from the beginning of the page where the tuple’s data begins.
    • size: The size of the tuple’s data in bytes.
    • meta: A nested TupleMeta struct containing critical information for concurrency control.

For Study & Discussion

  1. Layout Trade-offs: What is the main benefit of having the tuple data grow from the end of the page backwards, while the header and slot array grow from the beginning forwards? What happens when they meet?

  2. Record ID Stability: Why is it so important that a tuple’s RecordId does not change even if the tuple’s physical data is moved within the page? What would break if the RID was just a direct byte offset?

  3. Large Objects: The current design assumes a tuple fits entirely on one page. How would you modify this page layout to support tuples that are larger than 4KB (e.g., a long blog post stored in a VARCHAR column)? Research how systems like PostgreSQL handle this with their “TOAST” (The Oversized-Attribute Storage Technique) mechanism.

  4. Programming Exercise: Implement a defragment() method for the TablePage. After several insertions and deletions, the free space on a page can become fragmented into small, unusable chunks. This method should reorganize the page by moving the existing tuples to be contiguous, creating a single, large block of free space. Remember to update the offset in each TupleInfo slot after moving its corresponding tuple data!
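
A minimal sketch of the compaction idea for that exercise, on a toy in-memory page; the slot and page shapes are simplified stand-ins, not the real TablePage codec.

const PAGE_SIZE: usize = 4096;

struct Slot {
    offset: usize, // where the tuple bytes start within the page
    size: usize,   // length of the tuple bytes
    deleted: bool,
}

struct ToyPage {
    data: [u8; PAGE_SIZE],
    slots: Vec<Slot>,
}

impl ToyPage {
    /// Move all live tuples so they are contiguous at the end of the page,
    /// updating each slot's offset as its bytes move.
    fn defragment(&mut self) {
        let mut write_end = PAGE_SIZE; // tuple data grows backwards from the page end
        // Process slots in descending offset order so a move never clobbers
        // bytes that still need to be copied.
        let mut order: Vec<usize> = (0..self.slots.len()).collect();
        order.sort_by_key(|&i| std::cmp::Reverse(self.slots[i].offset));
        for i in order {
            if self.slots[i].deleted {
                continue; // dead tuples simply stop occupying space
            }
            let (old, size) = (self.slots[i].offset, self.slots[i].size);
            let new = write_end - size;
            self.data.copy_within(old..old + size, new); // memmove-style copy
            self.slots[i].offset = new;
            write_end = new;
        }
        // Everything between the slot array and `write_end` is now one free block.
    }
}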

Table Heap and MVCC

The TableHeap (storage/table_heap.rs) is the component responsible for managing the collection of pages that belong to a single table. While the TablePage defines the layout within a single page, the TableHeap provides the high-level API for inserting, updating, and deleting tuples, making it central to the implementation of MVCC.

TableBinding & StorageEngine

From the executor’s perspective, heap access is routed through storage::engine::TableBinding. A binding is produced by the StorageEngine (consulting the catalog) and carries:

  • An Arc<TableHeap> for physical page access.
  • An MvccHeap wrapper that maintains version chains.
  • All indexes defined on the table.

The binding exposes ergonomic methods such as scan, index_scan, insert, update, delete, and prepare_row_for_write. Each method internally uses the TableHeap + MvccHeap pair, appends the appropriate logical WAL record (HeapInsert/HeapDelete), and keeps indexes in sync. This keeps physical operators extremely small and makes it easy to swap out the storage engine for experiments.

Tuple Serialization

A logical Tuple struct in memory must be converted into a byte array to be stored on a page. This process is handled by storage/codec/tuple.rs.

The serialized format of a tuple consists of two parts:

  1. Null Bitmap: A compact bitmap at the beginning of the tuple data. Each bit corresponds to a column in the schema; if the bit is 1, the column’s value is NULL. This avoids storing any data for null fields.
  2. Attribute Data: The actual data for all non-null columns, serialized one after another.
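
A sketch of this two-part encoding for a row of nullable 64-bit integers; the real codec is schema-driven and handles variable-length types as well.

/// Encode a row as: [null bitmap bytes][data for the non-null columns].
fn encode_row(values: &[Option<i64>]) -> Vec<u8> {
    let bitmap_len = (values.len() + 7) / 8; // one bit per column, rounded up
    let mut out = vec![0u8; bitmap_len];
    for (i, v) in values.iter().enumerate() {
        if v.is_none() {
            out[i / 8] |= 1 << (i % 8); // set bit => column is NULL, no data stored
        }
    }
    for v in values.iter().flatten() {
        out.extend_from_slice(&v.to_le_bytes()); // only non-null columns carry bytes
    }
    out
}

// encode_row(&[Some(1), None, Some(3)]) yields one bitmap byte (0b0000_0010)
// followed by 16 bytes of attribute data for the two non-null columns.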

TupleMeta: The Heart of MVCC

The most important part of a tuple’s on-disk metadata is the TupleMeta struct, which is stored directly within the TupleInfo slot in the page header. This is the heart of QuillSQL’s Multi-Version Concurrency Control (MVCC) implementation.

  • insert_txn_id: The ID of the transaction that created this version of the tuple.
  • delete_txn_id: The ID of the transaction that “deleted” this version. (A value of 0 or INVALID means it’s not deleted).
  • is_deleted: A boolean flag indicating the tuple version is considered deleted.
  • prev_version: Option<RecordId>: A link (RID) to the previous version of this logical row.
  • next_version: Option<RecordId>: A link (RID) to the next version of this logical row.

These fields allow QuillSQL to maintain a version chain for each logical row. When a row is updated, the TableHeap’s mvcc_update method is called, which, instead of overwriting the data, performs the following steps:

  1. Creates a new tuple version with the new data by calling insert_tuple.
  2. Sets the prev_version pointer of this new version to the RID of the old version.
  3. Updates the next_version pointer of the old version to point to the new version.
  4. Sets the delete_txn_id on the old version to mark it as “dead”.

A transaction can then traverse this chain and use the transaction IDs in the TupleMeta to determine which version of a row is visible to it based on its own transaction ID and isolation level, thus achieving transaction isolation without long-held read locks.
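
A toy sketch of those four steps on an in-memory version store; slot indexes stand in for RIDs, and WAL/buffer-pool coordination is omitted.

// Illustrative version store; the real heap works on slotted pages.
struct Version {
    data: Vec<u8>,
    insert_txn_id: u64,
    delete_txn_id: Option<u64>,
    prev_version: Option<usize>,
    next_version: Option<usize>,
}

struct ToyHeap {
    versions: Vec<Version>,
}

impl ToyHeap {
    /// MVCC-style update: never overwrite in place, always link a new version.
    fn mvcc_update(&mut self, old_rid: usize, new_data: Vec<u8>, txn_id: u64) -> usize {
        // 1. Create a new version carrying the new data.
        let new_rid = self.versions.len();
        self.versions.push(Version {
            data: new_data,
            insert_txn_id: txn_id,
            delete_txn_id: None,
            prev_version: Some(old_rid), // 2. link back to the old version
            next_version: None,
        });
        // 3. Link the old version forward to the new one.
        self.versions[old_rid].next_version = Some(new_rid);
        // 4. Mark the old version as deleted by this transaction.
        self.versions[old_rid].delete_txn_id = Some(txn_id);
        new_rid
    }
}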

Logical Logging (Redo + Undo)

Every heap mutation emits a logical WAL record before the page frame is dirtied:

Operation | Payload | Redo | Undo
Insert | HeapInsertPayload | Re-insert tuple bytes into the slot | Remove the tuple (implicit via delete logic)
Delete | HeapDeletePayload (new tombstone + old tuple) | Mark slot deleted and update version chain | Recreate the prior tuple image

This makes crash recovery straightforward: redo simply replays the payload, and undo uses the “old” portion of the payload even if the page is currently inconsistent. Physical PageWrite/PageDelta records are still used for metadata-heavy pages (meta, freelist, etc.) to guarantee a consistent starting point, but ordinary table DML lives entirely in these logical heap records.


For Study & Discussion

  1. Version Chain Traversal: Imagine a transaction with ID T10 needs to read a row. It finds a version of the row that was inserted by T5 (committed) but deleted by T12 (in-progress). Should T10 see this version? What if the deleter was T8 (committed)? What if the deleter was T10 itself? Walk through the visibility check logic.

  2. Garbage Collection: The MVCC model creates many “dead” tuple versions that are no longer visible to any active or future transaction. What is the long-term problem with leaving these dead versions in the database? This problem is typically solved by a process called vacuuming or garbage collection. When is it safe to physically remove a dead tuple version?

  3. Programming Exercise (Advanced): Implement a basic VACUUM function for a TableHeap. This function should scan the table and identify dead tuple versions that are no longer visible to any currently active transaction. Once identified, it should physically remove them from their pages. This is a challenging exercise that touches transaction management, storage, and concurrency.

Buffer Manager

The buffer manager (src/buffer/) implements QuillSQL’s shared buffer pool, bridging the speed gap between RAM and disk. It lets storage/execution read and write pages safely while coordinating with WAL and asynchronous I/O.


Responsibilities

  • Maintain a fixed-size set of page frames caching TableHeap and B+Tree pages.
  • Expose RAII-style guards (pin/unpin) that enforce safe concurrent access.
  • Keep the page table, replacement policy, dirty-page tracking, and WAL coordination in sync.
  • Submit async I/O through DiskScheduler.

Directory Layout

Path | Description | Key Types
buffer_manager.rs | Core buffer pool. | BufferManager, BufferFrame
page.rs | Guard types and pin/unpin logic. | ReadPageGuard, WritePageGuard
replacer.rs | LRU-K + TinyLFU replacement. | Replacer
metrics.rs | Optional instrumentation hooks. | BufferMetrics

Key Mechanisms

Guard Model

  • ReadPageGuard, WritePageGuard, and UpgradeableGuard ensure only compatible access modes coexist on a page.
  • Guards drop automatically to release pins; paired with Rust’s borrow checker, they make latch semantics tangible.

Replacement Policy

  • LRU-K tracks the last K touches to protect hot pages from scan pollution.
  • TinyLFU decides whether a new page should enter the cache, offering probabilistic admission against noisy workloads.

WAL Coordination

  • Before flushing a dirty page, the buffer checks page_lsn and asks WalManager to flush up to that LSN (write-ahead rule).
  • set_wal_manager wires the buffer to WAL so checkpoints can inspect the oldest dirty LSN.

Disk Scheduler

  • All physical reads/writes go through the DiskScheduler's schedule_* entry points, sharing worker threads with WAL and demonstrating the benefits of a unified I/O layer.

Interactions

  • Storage engine – TableHeap and BPlusTreeIndex access pages exclusively through the buffer manager.
  • Recovery – checkpoints consult the buffer’s dirty page table to build the ARIES DPT.
  • Background writer – periodically walks dirty_frames to flush pages in the background.

Teaching Ideas

  • Disable TinyLFU via feature flag, rerun sqllogictest, and compare hit rates.
  • Swap the replacement policy with CLOCK to experiment with cache algorithms.
  • Enable RUST_LOG=buffer=debug and trace the pin/unpin lifecycle of hot pages.

Further reading: Page & Page Guards, The Buffer Pool

Page & Page Guards

Before the Buffer Manager can hand out a reference to a page in memory, it must ensure that the page won’t be evicted while it’s being used by another thread. This is accomplished by pinning.

Pinning

Pinning simply means incrementing a “pin count” associated with the page’s frame in the buffer pool. A frame with a pin count greater than zero is forbidden from being chosen as a victim by the page replacer.

  • When a thread wants to use a page, it must first pin it.
  • When the thread is finished with the page, it must unpin it (decrementing the count).

Manually managing pin counts is tedious and error-prone. Forgetting to unpin a page leads to a memory leak, as the frame can never be evicted. To solve this, QuillSQL uses a common and powerful C++ and Rust pattern: Resource Acquisition Is Initialization (RAII).

ReadPageGuard and WritePageGuard

Instead of returning a raw pointer to the page memory, the BufferManager’s fetch_page_* methods return a guard object: ReadPageGuard or WritePageGuard.

These guards are responsible for the lifetime of the pin and the lock on the page:

  1. Acquisition: When a PageGuard is created, its constructor acquires the appropriate lock (RwLock) on the page’s frame and increments the frame’s pin count.

    • ReadPageGuard takes a read lock, allowing multiple concurrent readers.
    • WritePageGuard takes an exclusive write lock.
  2. Usage: The calling code uses the guard object to access the page’s data. The guard provides safe, locked access to the underlying byte array.

  3. Release: When the guard variable goes out of scope (e.g., at the end of a function), its drop() method is automatically called by the Rust compiler. This drop() implementation handles all the cleanup:

    • It decrements the pin count.
    • It releases the lock on the frame.
    • If it’s a WritePageGuard and the data was modified, it informs the BufferManager that the page is now dirty.

This RAII pattern makes using the buffer pool much safer and more ergonomic, as it makes it impossible to forget to unpin a page or release a lock.
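
A stripped-down sketch of the pattern: a guard that pins a frame on construction and unpins it in Drop. The frame type is a toy stand-in; the real guards additionally manage the frame RwLock and the dirty flag.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct Frame {
    pin_count: AtomicUsize,
    // ... page bytes, RwLock, dirty flag, etc. in the real buffer pool
}

struct PinGuard {
    frame: Arc<Frame>,
}

impl PinGuard {
    fn new(frame: Arc<Frame>) -> Self {
        frame.pin_count.fetch_add(1, Ordering::SeqCst); // pin on acquisition
        PinGuard { frame }
    }
}

impl Drop for PinGuard {
    fn drop(&mut self) {
        // Unpin automatically when the guard leaves scope, so a caller
        // cannot forget to release the frame for eviction.
        self.frame.pin_count.fetch_sub(1, Ordering::SeqCst);
    }
}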

The Buffer Pool: Architecture and Lifecycle

1. Core Components & Architecture

QuillSQL’s buffer management is split into two main structs, reflecting a separation of concerns:

  • BufferPool (buffer/buffer_pool.rs): A low-level, “dumb” container. It owns the actual memory arena for pages and provides basic mapping from a PageId to a memory location (FrameId).
  • BufferManager (buffer/buffer_manager.rs): The high-level, “smart” coordinator. It contains the BufferPool and implements all the logic for fetching pages, choosing which pages to evict, and interacting with other database components like the transaction and recovery managers.

This architecture is centered around three key data structures:

  1. Frames (The Arena): The BufferPool pre-allocates a large, contiguous block of memory on the heap. This block is divided into a fixed number of smaller chunks called frames. Each frame is exactly PAGE_SIZE (4KB) and can hold the contents of one disk page.

  2. Page Table (page_table): A hash map (specifically, a concurrent DashMap) that maps a logical PageId to the FrameId where it currently resides in memory. This provides fast O(1) lookups to check if a page is already in the buffer pool.

  3. Replacer (LRUKReplacer): When a requested page is not in memory and the buffer pool is full, one of the existing pages must be evicted to make room. The Replacer is the component that implements the page replacement policy and decides which page is the best candidate for eviction. QuillSQL uses an LRU-K replacement policy, a sophisticated variant of the classic Least Recently Used (LRU) algorithm.

2. The Lifecycle of a Page Request

When another part of the database (e.g., the execution engine) needs to access a page, it calls buffer_manager.fetch_page_read(page_id) or fetch_page_write(page_id). This initiates a critical sequence of events.

The flow is as follows:

  1. Request: An executor requests Page P.

  2. Lookup: The BufferManager consults the PageTable.

  3. Case 1: Cache Hit

    • The PageTable contains an entry for P, mapping it to FrameId F.
    • The BufferManager increments the pin count for frame F.
    • It informs the LRUKReplacer that the page has been accessed (updating its priority).
    • It returns a PageGuard wrapping a reference to the memory in frame F.
  4. Case 2: Cache Miss

    • The PageTable has no entry for P.
    • The BufferManager must bring the page from disk. It asks the Replacer to choose a victim frame F_v to evict.
    • If the victim frame F_v is dirty: The BufferManager first writes the contents of F_v back to disk via the DiskScheduler. This is essential for data durability.
    • The PageTable entry for the old page residing in F_v is removed.
    • The BufferManager issues a read request to the DiskScheduler to load page P’s data from disk into frame F_v.
    • The PageTable is updated with the new mapping: P -> F_v.
    • The process then continues like a cache hit: frame F_v is pinned and a PageGuard is returned.
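
A condensed sketch of this hit/miss decision, with toy maps standing in for the page table, replacer, and disk scheduler; pinning, latching, and error handling are omitted.

use std::collections::HashMap;

type PageId = u64;
type FrameId = usize;

struct ToyBufferPool {
    page_table: HashMap<PageId, FrameId>,
    frames: Vec<Vec<u8>>, // one page-sized buffer per frame
    dirty: Vec<bool>,
}

impl ToyBufferPool {
    fn fetch_page(&mut self, pid: PageId) -> FrameId {
        // Cache hit: the page is already resident.
        if let Some(&fid) = self.page_table.get(&pid) {
            return fid; // the real code also pins the frame and updates the replacer
        }
        // Cache miss: pick a victim, write it back if dirty, then load the page.
        let victim = self.pick_victim();
        if self.dirty[victim] {
            self.write_back(victim); // WAL must already be flushed up to the page LSN
        }
        self.page_table.retain(|_, &mut f| f != victim); // unmap the evicted page
        self.frames[victim] = self.read_from_disk(pid);
        self.page_table.insert(pid, victim);
        self.dirty[victim] = false;
        victim
    }

    // Placeholders for the replacer and DiskScheduler interactions.
    fn pick_victim(&self) -> FrameId { 0 }
    fn write_back(&mut self, _fid: FrameId) {}
    fn read_from_disk(&self, _pid: PageId) -> Vec<u8> { vec![0; 4096] }
}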

3. Concurrency

The buffer pool is a shared resource accessed by many concurrent threads. QuillSQL uses a combination of locking strategies:

  • Page Table: A DashMap is used for the page table, which is highly optimized for concurrent reads and writes.
  • Frame-Level Locks: Each frame in the pool has its own RwLock. A ReadPageGuard acquires a read lock on the frame, allowing multiple threads to read the same page concurrently. A WritePageGuard acquires an exclusive write lock, ensuring that only one thread can modify a page at a time.
  • Replacer/Free List: These shared structures are protected by a Mutex.

4. Integration with WAL and Recovery

The Buffer Manager is a key player in the ARIES recovery protocol.

  • Dirty Pages: When a WritePageGuard is dropped, if it has modified the page, it marks the page as dirty. The BufferManager maintains a dirty_page_table that tracks all dirty pages and the Log Sequence Number (LSN) of the first WAL record that caused the page to become dirty.
  • Forced WAL (Write-Ahead Logging): Before a dirty page can be written back to disk (either during eviction or a checkpoint), the BufferManager must ensure that all WAL records up to that page’s current LSN have been flushed to durable storage. This is the fundamental WAL rule and is enforced by the flush_page method, which calls wal_manager.flush(lsn) before writing the page to disk.

For Study & Discussion

  1. Replacement Policies: QuillSQL uses LRU-K. What are the potential advantages of LRU-K over a simple LRU policy? What kind of workload would benefit most? Conversely, what are the trade-offs of using Clock/Second-Chance instead?

  2. The “Double Caching” Problem: We mentioned that using Direct I/O helps avoid double caching. If Direct I/O is disabled, how does the database’s buffer pool interact with the operating system’s file system cache? Why can this lead to suboptimal performance?

  3. Programming Exercise: Implement a ClockReplacer that adheres to the Replacer trait. Modify the BufferManager to use your new replacer instead of LRUKReplacer. Run the benchmark suite and compare the performance. Does it change? (A starting-point sketch follows this list.)

  4. Programming Exercise: Add metrics to the BufferManager to track the buffer pool hit rate (i.e., num_hits / (num_hits + num_misses)). You could expose this via a new SHOW BUFFER_STATS; SQL command.
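
For the ClockReplacer exercise above, one possible starting point; the shape of the real Replacer trait may differ, so treat this as a sketch of the algorithm only.

struct ClockReplacer {
    ref_bit: Vec<bool>, // "second chance" bit per frame
    pinned: Vec<bool>,  // pinned frames are never evictable
    hand: usize,
}

impl ClockReplacer {
    fn record_access(&mut self, frame: usize) {
        self.ref_bit[frame] = true; // accessed recently: give it a second chance
    }

    /// Sweep the clock hand: clear set reference bits and evict the first
    /// unpinned frame whose bit is already cleared.
    fn evict(&mut self) -> Option<usize> {
        for _ in 0..2 * self.ref_bit.len() {
            let f = self.hand;
            self.hand = (self.hand + 1) % self.ref_bit.len();
            if self.pinned[f] {
                continue;
            }
            if self.ref_bit[f] {
                self.ref_bit[f] = false;
            } else {
                return Some(f);
            }
        }
        None // every frame is pinned (or the replacer is empty)
    }
}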

Index Module

Indexes live in src/storage/index/. QuillSQL currently ships a B+Tree (B-link variant) that is exposed to execution via IndexHandle. Indexes allow point lookups and range scans in O(log n), dramatically reducing the need for full table scans.


Responsibilities

  • Maintain an ordered key → RecordId mapping per indexed table.
  • Support point probes, range scans, insert/update/delete maintenance.
  • Cooperate with MVCC: entries reference heap tuples while visibility checks remain in execution/transaction code.
  • Provide IndexHandle::range_scan, returning a TupleStream so physical operators don’t need to know tree internals.

Directory Layout

Path | Purpose | Key Types
btree_index.rs | Core B+Tree, page formats, insert/delete logic. | BPlusTreeIndex
btree_iterator.rs | Range-scan iterator with sibling traversal. | TreeIndexIterator
btree_codec.rs | Page encode/decode utilities. | BPlusTreeLeafPageCodec

Key Concepts

B-link Sibling Pointers

Each leaf stores a pointer to its right sibling. Iterators use this to keep scanning even if a concurrent split occurs, avoiding restarts from the root and enabling latch-free range scans.

Latch Crabbing

Insert/delete operations climb the tree with shared latches and upgrade only when necessary (e.g., right before splitting), reducing contention.

Range Scan → TupleStream

IndexHandle::range_scan wraps TreeIndexIterator and automatically fetches heap tuples, returning (rid, meta, tuple) triples. Execution remains storage-agnostic.

Inline Maintenance

Index inserts/updates/deletes modify the tree immediately and emit logical WAL for redo. There is no deferred “index vacuum”; once a heap tuple is deleted its index entry is removed in the same transaction.


Interactions

  • Catalog – stores Arc<BPlusTreeIndex> instances alongside table metadata so execution can fetch handles directly.
  • Execution – PhysicalIndexScan uses ExecutionContext::index_stream; DML operators call insert_tuple_with_indexes so heap writes and index maintenance stay in sync.
  • Transaction/MVCC – heaps store transaction metadata; indexes just reference RIDs, so MVCC visibility is enforced when tuples are materialised.
  • Recovery – WAL contains IndexInsert/IndexDelete records to replay structural changes after crashes.

Teaching Ideas

  • Build a covering index example to show how avoiding heap lookups improves latency.
  • Instrument TreeIndexIterator to visualise sibling traversal during range scans.
  • Compare SeqScan vs IndexScan on selective predicates to highlight indexing benefits.

Further reading: B+Tree internals

B+ Tree Index — Architecture and Concurrency

1. Architecture Overview

1.1 Node and Page Structure

  • Node Types: Internal and Leaf.
    • Internal Nodes: Store separator keys and pointers to child pages. The first pointer is a “sentinel” that points to the subtree for all keys less than the first key in the node.
    • Leaf Nodes: Store (key, RecordId) pairs in sorted order. Leaves are linked together in a singly linked list (next_page_id) to allow for efficient range scans.
  • Header Page: A fixed page (header_page_id) that acts as the entry point to the tree. It stores the root_page_id, allowing for atomic updates to the tree root when it splits or shrinks.
  • Page Layout:
    • Internal Page: {Header, High Key, Array<(Key, ChildPageId)>}. The High Key is part of the B-link optimization.
    • Leaf Page: {Header, Array<(Key, RID)>}.

The tree uses B-link pointers (next_page_id) on all levels (both internal and leaf nodes). This creates a “side-link” to the right sibling. This is crucial for concurrency, as it allows readers to recover from transient inconsistent states caused by concurrent page splits by “chasing” the link to the right sibling.

              +------------------+
              |   Root (Int)     |
              +------------------+
             /         |         \
   +--------+      +--------+      +--------+
   | Int P1 |----->| Int P2 |----->| Int P3 |  (Internal B-link pointers)
   +--------+      +--------+      +--------+
   /   |   \      /   |   \      /   |   \
+----+ +----+  +----+ +----+  +----+ +----+
| L1 |-| L2 |->| L3 |-| L4 |->| L5 |-| L6 | (Leaf chain / B-links)
+----+ +----+  +----+ +----+  +----+ +----+

2. Concurrency Control

The B+Tree employs a sophisticated, lock-free read path and a high-concurrency write path using latch crabbing.

2.1 Read Path: Optimistic, Lock-Free Traversal

  • Readers traverse the tree from the root without taking any locks.
  • On each page, a version number is read before and after processing the page’s contents. If the version changes, it indicates a concurrent modification, and the read operation restarts from the root.
  • If a reader is traversing an internal node and the search key is greater than or equal to the node’s high_key, it knows a split has occurred. Instead of restarting, it can use the next_page_id B-link to “chase” to the right sibling and continue the search, minimizing restarts.

2.2 Write Path: Latch Crabbing

Writers (insert/delete) use a technique called latch crabbing (or lock coupling) to ensure safe concurrent modifications.

  • Process: A writer acquires a write latch on a parent node before fetching and latching a child node. Once the child is latched, the writer checks if the child is “safe” for the operation (i.e., not full for an insert, not at minimum size for a delete).
    • If the child is safe, the latch on the parent (and all other ancestors) is released.
    • If the child is unsafe, the parent latch is held, as the child might need to split or merge, which would require modifying the parent.
  • Context Struct: This process is managed by a Context struct that holds the stack of write guards (write_set) for the current traversal path. Releasing ancestor latches is as simple as clearing this stack.
Latch Crabbing on Insert:
1. Latch(Root)
2. Descend to Child C1. Latch(C1).
3. Check if C1 is safe (not full).
   IF SAFE:
     ReleaseLatch(Root). Path is now just {C1}.
   IF UNSAFE (full):
     Keep Latch(Root). Path is {Root, C1}.
4. Descend from C1 to C2. Latch(C2).
5. Check if C2 is safe... and so on.

2.3 Deadlock Avoidance

When modifying siblings (during merge or redistribution), deadlocks are possible if two threads try to acquire latches on the same two pages in opposite orders. This is prevented by enforcing a strict PageId-ordered locking protocol. When two sibling pages must be latched, the page with the lower PageId is always latched first.

3. Key Algorithms & Features

  • Parent-Guided Redirection: During an insert or delete, after a writer has descended to a leaf, it re-validates its position using the parent (if a latch is still held). If a concurrent split has moved the target key range to a different sibling, the writer can jump directly to the correct page instead of traversing the leaf chain, preventing race conditions.
  • Iterator: The iterator performs a forward scan by following the leaf chain (next_page_id). It uses a lightweight form of OLC, checking the leaf page version to detect concurrent modifications and restart if necessary to ensure it doesn’t miss keys.
    • Sequential Scan Optimization (RingBuffer): For large range scans, the iterator switches to a “synchronous batch fetch + local ring buffer” mode. It fills a small RingBuffer<BytesMut> with consecutive leaf pages (by following next_page_id) and then decodes KVs locally without holding page guards for long. This reduces buffer pool pollution and syscall/lock overhead.
    • Two Iteration Modes:
      • Guard Mode: Keep a ReadPageGuard and decode per step; prefetch next leaf best-effort.
      • Bytes Mode: After switching, decode from BytesMut buffers in the local ring; when a leaf is exhausted, pop next bytes from the ring or refill by following the chain.
  • Prefix Compression: Keys in internal nodes are prefix-compressed to save space. Each key is stored as (lcp, suffix_len, suffix). This reduces the size of internal pages, increasing the tree’s fanout and reducing its height, which improves cache performance and reduces I/O. (A small encode/decode sketch follows this list.)
  • Split/Merge Safety:
    • Split: When a node splits, the new right sibling is written first. Then, the B-link pointer and separator key are published atomically by updating the left sibling and parent. This ensures readers can always navigate the structure correctly.
    • Merge/Redistribute: When a node is underfull, the implementation first tries to borrow an entry from a sibling (redistribute). If both siblings are at minimum size, it merges with a sibling. All these operations carefully maintain the B-link chain and parent pointers.
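
A small sketch of the (lcp, suffix) idea described under Prefix Compression above, assuming each key is compressed against the previous key in the node (the on-page codec and the exact reference key may differ).

/// Compress `key` against `prev`, returning (shared prefix length, suffix bytes).
fn compress(prev: &[u8], key: &[u8]) -> (usize, Vec<u8>) {
    let lcp = prev.iter().zip(key).take_while(|(a, b)| a == b).count();
    (lcp, key[lcp..].to_vec())
}

/// Reconstruct the full key from the previous key plus (lcp, suffix).
fn decompress(prev: &[u8], lcp: usize, suffix: &[u8]) -> Vec<u8> {
    let mut key = prev[..lcp].to_vec();
    key.extend_from_slice(suffix);
    key
}

// compress(b"applesauce", b"applet") == (5, b"t".to_vec()): only the distinct
// suffix is stored, which shrinks internal pages and increases fanout.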

4. Benchmarks & Performance

4.1 Example: Range Scan Benchmark

This benchmark measures the efficiency of the leaf-chain traversal, which is critical for SELECT queries with WHERE clauses on indexed columns. It benefits from iterator prefetching and prefix compression.

// Pseudo-code for a range scan benchmark; create_tuple_from_key and
// create_rid_from_key are placeholder helpers, not real QuillSQL APIs.
use std::sync::Arc;
use std::time::Instant;

fn benchmark_range_scan(index: Arc<BPlusTreeIndex>, num_keys: i64, num_passes: usize) {
    // 1. Populate the index with sequential keys
    for key in 1..=num_keys {
        let tuple = create_tuple_from_key(key, index.key_schema.clone());
        index.insert(&tuple, create_rid_from_key(key)).unwrap();
    }

    // 2. Run the benchmark
    let start = Instant::now();
    let mut count = 0;
    for _ in 0..num_passes {
        // Create an iterator over the full key range
        let mut iter = TreeIndexIterator::new(index.clone(), ..);
        while let Some(_) = iter.next().unwrap() {
            count += 1;
        }
    }
    let elapsed = start.elapsed();
    let total_items_scanned = num_keys as usize * num_passes;
    let items_per_sec = total_items_scanned as f64 / elapsed.as_secs_f64();

    println!(
        "Range Scan: Scanned {} items ({} counted) in {:?}. Throughput: {:.2} items/sec",
        total_items_scanned, count, elapsed, items_per_sec
    );
}

4.2 Performance Notes

  • Hot Reads: Performance on hot-spot reads depends on keeping upper levels and hot leaves resident in the buffer pool. Warm up the cache for read-heavy benchmarks. Protect hot pages from pollution by enabling TinyLFU admission.
  • Large Range Scans: Prefer table SeqScan with ring buffer (bypass) when scanning most of the table. For index scans over very large ranges, consider future Bitmap Heap Scan rather than bypassing the pool for leaves.

4.3 Configuration

  • config::BTreeConfig controls iterator behavior:
    • seq_batch_enable (bool): enable batch mode with local ring buffer.
    • seq_window (usize): number of leaf pages to prefill into the ring per refill.
    • prefetch_enable/prefetch_window: guard-mode prefetch hints to the buffer pool. Defaults are conservative; increase seq_window for long scans to reduce I/O hops.

5. Future Work

  • Stronger OLC with bounded retries and telemetry.
  • CSB+-like internal layout and columnar key prefixing.
  • NUMA-aware partitioning and router.
  • WAL/MVCC for crash recovery and snapshot isolation.

Recovery / WAL Module

src/recovery/ guarantees the Durability part of ACID. Through WAL, checkpoints, and the ARIES algorithm, QuillSQL can recover to a consistent state after crashes.


Responsibilities

  • Generate WAL records for every change and assign monotonically increasing LSNs.
  • Manage WAL buffers, synchronous flushes, and the optional background WAL writer.
  • Run the ARIES Analysis → Redo → Undo sequence during startup.
  • Maintain the control file so the latest checkpoint and log truncation point are known.

Directory Layout

Path | Description | Key Types
recovery/wal/ | Log writer, buffer, storage, background writer runtime. | WalManager, WalBuffer, WalStorage
recovery/wal_record.rs | Serialized record variants exposed to subsystems. | WalRecordPayload, ResourceManagerId
recovery/resource_manager.rs | Registry that maps payloads to redo/undo handlers. | ResourceManager, RedoContext, UndoContext
recovery/recovery_manager.rs | ARIES driver. | RecoveryManager, RecoverySummary
recovery/control_file.rs | Persistent metadata (checkpoint info). | ControlFileManager

WalManager & I/O pipeline

  • Buffering – WalManager uses a lock-free WalBuffer ring. append_record_with attaches an LSN, encodes the frame, pushes it into the ring, and auto-flushes when either record count or byte thresholds (max_buffer_records, flush_coalesce_bytes, or a full WAL page) are met. Writers never block on Vec reallocation and thus scale with concurrent transactions.
  • Physical logging – log_page_update emits 4 KiB PageWrite FPWs on first touch (tracked via DashSet<PageId>) or when logical redo is unavailable. Payloads carry the previous page_lsn so redo can enforce the WAL rule.
  • Segmentation & flush – WalStorage appends frames to rolling segments under the configured WAL directory, queues async write/fsync tickets via the DiskScheduler sink, and recycles sealed segments after checkpoints. flush() drains buffered records up to an optional target LSN, waits for outstanding tickets, updates durable_lsn, and optionally persists the control file snapshot.
  • Background writer – start_background_flush spawns WalWriterRuntime, which periodically calls flush(None); WalWriterHandle lives in the background worker registry so shutdown is coordinated with the db lifecycle.
  • Checkpoints – log_checkpoint writes a Checkpoint frame containing ATT/DPT, forces a flush, clears the “first-touch” set (so new FPWs are generated as needed), and updates checkpoint_redo_start in the control file. Recovery uses this redo start to avoid rescanning the whole log.
  • Readers & durability waits – WalManager::reader iterates frames straight from the WAL directory. wait_for_durable is the gate that synchronous commits call after emitting a commit record; it reuses flush_with_mode and condition variables to block until durable_lsn >= target.

Record types & resource managers

All log frames share the envelope defined in wal_record.rs and are routed by ResourceManagerId:

  • Page (ResourceManagerId::Page) – purely physical PageWrite payloads. The page resource manager checks page_lsn (buffer pool if available) before rewriting the page on redo. No undo is required because these are low-level physical changes.
  • Heap (ResourceManagerId::Heap) – logical payloads include relation id, page/slot, tuple metadata, and tuple bytes. HeapResourceManager applies inserts/overwrites at the slot level with LSN checks and repacks slotted pages. Undo uses the stored before-image (metadata + bytes) for loser transactions.
  • Index (ResourceManagerId::Index) – logs leaf inserts/deletes plus structural changes (split/merge/redistribute, parent updates, root install/adopt/reset). Redo mutates leaves or rebuilds structural pages with LSN/version guards; undo mirrors leaf inserts/deletes only. Heap and index WAL remain independent.
  • Transaction / CLR / Checkpoint – TransactionPayload (BEGIN/COMMIT/ABORT) enables the undo index to link per-transaction log records. ClrPayload documents each undo step so the recovery process can survive crashes mid-undo. Checkpoints snapshot ATT + DPT for the analysis phase.

ensure_default_resource_managers_registered wires Page, Heap, and Index resource managers together so both redo (RedoExecutor) and undo (UndoExecutor) can dispatch records uniformly.

Heap / Index WAL design

  • Heap – Execution operators call ExecutionContext::insert_tuple_with_indexes (and friends), which ultimately invoke MvccHeap to write WAL before dirtying the page. Each tuple carries TupleMetaRepr (insert/delete txn + CID, MVCC chain pointers) plus encoded column bytes. During redo, HeapResourceManager applies inserts or overwrites at the slot level with LSN checks and repacks the slotted page before writing it back via TableHeap::recovery_view. Undo leverages the stored before-image to restore bytes or toggle delete flags; vacuum code can later reclaim dead slots once safe_xmin passes them.
  • Index – B+Tree leaf operations log the logical key/value change, plus structural records for split/merge/redistribute/parent updates and root install/adopt/reset. Redo loads or rebuilds pages (buffer pool or disk), checks page LSN/version to avoid reapplying newer changes, and performs the requested mutation. Undo mirrors leaf-level inserts/deletes only. Heap and index WAL stay independent.
  • Page images – For heap/index structural changes that rewrite large sections (e.g., new heap page, B+Tree splits), page guards still emit FPWs through WalManager::log_page_update. Logical heap/index WAL ensures redo still works even when FPWs are skipped after the first touch.

ARIES Recovery

  1. Analysis – read the latest checkpoint, rebuild the Dirty Page Table (DPT) and Active Transaction Table (ATT).
  2. Redo – scan forward from the checkpoint LSN, reapplying operations when the DPT indicates a page may be out of date.
  3. Undo – roll back transactions still in ATT, writing Compensation Log Records (CLR) so the process is idempotent even if another crash happens mid-undo.

RecoverySummary reports how many records were redone and which transactions require manual attention—great for classroom demonstrations.
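
A schematic sketch of the three passes as one driver over decoded records; the record shape is a placeholder, and the real RecoveryManager starts redo at the checkpoint's redo LSN rather than at the beginning of the log.

use std::collections::{HashMap, HashSet};

// Placeholder record shape for illustration only.
enum Rec {
    Begin { txn: u64 },
    Update { txn: u64, page: u64, lsn: u64 },
    Commit { txn: u64 },
    Abort { txn: u64 },
}

fn recover(log: &[Rec]) {
    // 1. Analysis: rebuild the active transaction table and dirty page table.
    let mut att: HashSet<u64> = HashSet::new();
    let mut dpt: HashMap<u64, u64> = HashMap::new(); // page -> rec_lsn
    for rec in log {
        match rec {
            Rec::Begin { txn } => { att.insert(*txn); }
            Rec::Update { txn, page, lsn } => {
                att.insert(*txn);
                dpt.entry(*page).or_insert(*lsn); // remember the first dirtying LSN
            }
            Rec::Commit { txn } | Rec::Abort { txn } => { att.remove(txn); }
        }
    }
    // 2. Redo: replay updates whose page may be stale according to the DPT.
    for rec in log {
        if let Rec::Update { page, lsn, .. } = rec {
            if dpt.get(page).map_or(false, |rec_lsn| lsn >= rec_lsn) {
                // apply the redo action against the buffer pool / disk here
            }
        }
    }
    // 3. Undo: roll back losers still in the ATT, newest record first, writing CLRs.
    for rec in log.iter().rev() {
        if let Rec::Update { txn, .. } = rec {
            if att.contains(txn) {
                // apply the undo action and append a compensation log record here
            }
        }
    }
}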


Interactions

  • TransactionManager – emits Begin, Commit, Abort records and supplies undo information.
  • BufferManager – links to WAL via set_wal_manager; checkpoints rely on the buffer’s dirty page metadata.
  • Background workers – WAL writer and checkpoint worker live in background and use handles exposed by WalManager.

Teaching Ideas

  • Simulate crashes (e.g., panic right after wal_manager.flush(None)) and inspect log output on restart.
  • Add a new WAL record type (like CreateIndex) to see how RecoveryManager must be extended.
  • Compare physical vs logical redo costs to discuss ARIES trade-offs.

Recovery Lab Playbook (CMU 15-445 style)

  1. Mini ARIES – Disable the background WAL writer, perform a few INSERT/UPDATE operations, crash the process mid-transaction, and single-step through analysis_pass.rs to observe ATT/DPT reconstruction.
  2. Logical vs physical redo – Comment out WalManager::log_page_update for heap pages and re-run the experiment. Recovery still succeeds thanks to logical heap payloads; re-enable FPWs to contrast log volume.
  3. Index crash test – Inject a panic after BPlusTreeIndex::insert logs a leaf insert. On restart, watch the undo phase remove the stray key before replay finishes.
  4. Group commit tuning – Play with WalOptions::flush_coalesce_bytes, writer_interval_ms, and synchronous_commit to demonstrate how throughput vs latency tradeoffs mirror the 15-445 checkpoints/commit labs.

Further reading: ARIES details

The ARIES Recovery Protocol

Of the four ACID properties, Durability is the one that guarantees that once a transaction is committed, its changes will survive any subsequent system failure. In a disk-based database, this is achieved through a careful and robust recovery protocol. QuillSQL implements a recovery strategy inspired by the well-known ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) protocol, centered around a Write-Ahead Log (WAL).

1. The Write-Ahead Logging (WAL) Principle

The fundamental rule of WAL is simple but powerful:

Before a modified data page is ever written from memory back to disk, the log record describing that modification must first be written to durable storage (the log file).

This allows the database to perform most of its data modifications in memory on the Buffer Pool without needing to synchronously write every changed page to disk, which would be extremely slow. In case of a crash, the database can use the log to reconstruct the state of the system and ensure all committed changes are present and all uncommitted changes are undone.
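The rule itself reduces to one comparison at page-flush time: never let a page reach disk while the log is still behind that page's LSN. The sketch below illustrates the check with made-up WalState and Page types (the actual fsync is elided).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Tracks how far the log has been forced to durable storage.
struct WalState {
    durable_lsn: AtomicU64,
}

impl WalState {
    // Force the log up to `lsn` (the fsync is elided in this sketch).
    fn flush_until(&self, lsn: u64) {
        let cur = self.durable_lsn.load(Ordering::Acquire);
        if lsn > cur {
            self.durable_lsn.store(lsn, Ordering::Release);
        }
    }
}

struct Page {
    id: u32,
    page_lsn: u64, // LSN of the last record that touched this page
    dirty: bool,
}

// WAL rule: the log must be durable at least up to page_lsn before the page
// itself is allowed to reach disk.
fn flush_page(wal: &WalState, page: &mut Page) {
    if wal.durable_lsn.load(Ordering::Acquire) < page.page_lsn {
        wal.flush_until(page.page_lsn);
    }
    // ... write the page bytes to disk here ...
    page.dirty = false;
    println!("flushed page {} (page_lsn={})", page.id, page.page_lsn);
}

fn main() {
    let wal = WalState { durable_lsn: AtomicU64::new(10) };
    let mut page = Page { id: 3, page_lsn: 25, dirty: true };
    flush_page(&wal, &mut page);
}
```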

Log Records (WalRecord)

The WAL is a sequential, append-only file composed of log records. Each record is assigned a unique, monotonically increasing Log Sequence Number (LSN). A log record in QuillSQL (src/recovery/wal_record.rs) generally contains:

  • LSN: The address of the log record.
  • Transaction ID: The ID of the transaction that generated this record.
  • Payload: The actual content of the log record, which varies by type.

QuillSQL uses several types of log records:

  • Transaction: Marks the BEGIN, COMMIT, or ABORT of a transaction.
  • PageWrite: Physical records describing a change to a page. PageWrite contains a full image of the page (used for first-touch or when logical redo is unavailable), a full-page-write (FPW) style safeguard.
  • HeapInsert / HeapDelete: Logical tuple-level records. Each delete carries the full before-image so undo can always restore bytes/metadata.
  • Index*: Logical B+Tree records covering leaf inserts/deletes and structural changes (split/merge/redistribute, parent updates, root install/adopt/reset).
  • Checkpoint: A special record that marks a point of partial durability, allowing the log to be truncated.
  • CLR (Compensation Log Record): A special record written during recovery to describe an undo action. CLRs are redo-only and are never undone themselves.
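Put together, a log record is a small envelope (LSN, back pointer, transaction id) around a typed payload. The enum below is a simplified sketch of that shape; the variants and fields are condensed for illustration, not the exact definitions in wal_record.rs.

```rust
// Simplified envelope: every record carries its own LSN, a back pointer to the
// previous record of the same transaction, and a typed payload.
struct WalRecord {
    lsn: u64,
    prev_lsn: Option<u64>,
    txn_id: u64,
    payload: Payload,
}

enum Payload {
    Begin,
    Commit,
    Abort,
    PageWrite { page_id: u32, image: Vec<u8> },             // FPW safety net
    HeapInsert { page_id: u32, slot: u16, tuple: Vec<u8> },
    HeapDelete { page_id: u32, slot: u16, before_image: Vec<u8> },
    Checkpoint { active_txns: Vec<u64>, dirty_pages: Vec<(u32, u64)> },
    Clr { undo_next_lsn: Option<u64> },
}

fn main() {
    let rec = WalRecord {
        lsn: 200,
        prev_lsn: Some(180),
        txn_id: 7,
        payload: Payload::HeapDelete { page_id: 4, slot: 2, before_image: vec![0xAB; 32] },
    };
    if let Payload::HeapDelete { before_image, .. } = &rec.payload {
        println!("delete at lsn {} keeps a {}-byte before-image", rec.lsn, before_image.len());
    }
}
```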

2. The ARIES Recovery Protocol

On database startup, the RecoveryManager (recovery/recovery_manager.rs) is invoked to replay() the WAL. This process follows the three phases of ARIES.

Phase 1: Analysis

The goal of the Analysis phase (recovery/analysis.rs) is to figure out the state of the database at the exact moment of the crash.

  1. It starts by finding the last successful Checkpoint record in the WAL.
  2. It then scans the log forward from that checkpoint to the end of the log.
  3. During this scan, it builds two critical data structures:
    • Active Transaction Table (ATT): A list of all transactions that have a BEGIN record but no corresponding COMMIT or ABORT record. These are the potential “loser” transactions that will need to be undone.
    • Dirty Page Table (DPT): A list of all pages that were modified in the buffer pool but might not have been written to disk before the crash. For each dirty page, it records the LSN of the first log record that made it dirty (this is called the recLSN).

At the end of this phase, the RecoveryManager knows exactly which transactions to roll back and the earliest point in the log from which it needs to start re-applying changes.
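The forward scan itself is a small state machine over the log: begin/commit/abort records maintain the ATT, and the first change to each page pins that page's recLSN in the DPT. The sketch below uses toy record types to show the bookkeeping.

```rust
use std::collections::{HashMap, HashSet};

// Minimal log shape for the sketch.
enum Rec {
    Begin { txn: u64 },
    Commit { txn: u64 },
    Abort { txn: u64 },
    PageChange { txn: u64, page_id: u32 },
}

struct Frame {
    lsn: u64,
    rec: Rec,
}

// Scan forward from the checkpoint, producing the ATT and DPT.
fn analysis(frames: &[Frame]) -> (HashSet<u64>, HashMap<u32, u64>) {
    let mut att: HashSet<u64> = HashSet::new();
    let mut dpt: HashMap<u32, u64> = HashMap::new(); // page_id -> recLSN
    for f in frames {
        match &f.rec {
            Rec::Begin { txn } => { att.insert(*txn); }
            Rec::Commit { txn } | Rec::Abort { txn } => { att.remove(txn); }
            Rec::PageChange { txn, page_id } => {
                att.insert(*txn);                     // a change implies the txn is active
                dpt.entry(*page_id).or_insert(f.lsn); // first dirtying LSN wins
            }
        }
    }
    (att, dpt)
}

fn main() {
    let log = vec![
        Frame { lsn: 10, rec: Rec::Begin { txn: 1 } },
        Frame { lsn: 11, rec: Rec::PageChange { txn: 1, page_id: 3 } },
        Frame { lsn: 12, rec: Rec::Commit { txn: 1 } },
        Frame { lsn: 13, rec: Rec::Begin { txn: 2 } },
        Frame { lsn: 14, rec: Rec::PageChange { txn: 2, page_id: 3 } },
    ];
    let (att, dpt) = analysis(&log);
    println!("losers: {att:?}, dpt: {dpt:?}"); // txn 2 is a loser; page 3 recLSN = 11
}
```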

Phase 2: Redo (Repeating History)

The goal of the Redo phase (recovery/redo.rs) is to restore the database to its exact state at the moment of the crash, including all changes from both committed and uncommitted (loser) transactions.

  1. The Redo phase finds the smallest recLSN from the Dirty Page Table built during Analysis. This LSN is the starting point for the redo scan.
  2. It scans the log forward from this starting point.
  3. For every log record it encounters that describes a change to a page (e.g., PageWrite, Heap*, Index*), it re-applies the change. This is idempotent: if the change is already present on the page (because it was successfully flushed to disk before the crash), re-applying it does no harm.

At the end of this phase, the database state on disk is identical to how it was in memory right before the crash.
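The idempotence hinges on the page LSN check: a record is only reapplied if the page has not already absorbed it. A minimal sketch, with a single byte write standing in for the real heap/index payloads:

```rust
// Redo is idempotent because each page remembers the LSN of the last change
// already applied to it; a record older than that is skipped.
struct Page {
    page_lsn: u64,
    data: Vec<u8>,
}

struct Frame {
    lsn: u64,
    page_id: usize,
    offset: usize,
    byte: u8,
}

fn redo(pages: &mut [Page], frames: &[Frame], start_lsn: u64) -> usize {
    let mut redone = 0;
    for f in frames.iter().filter(|f| f.lsn >= start_lsn) {
        let page = &mut pages[f.page_id];
        if page.page_lsn >= f.lsn {
            continue; // change already made it to disk before the crash
        }
        page.data[f.offset] = f.byte;
        page.page_lsn = f.lsn; // remember that this LSN is now reflected
        redone += 1;
    }
    redone
}

fn main() {
    let mut pages = vec![Page { page_lsn: 15, data: vec![0; 8] }];
    let frames = vec![
        Frame { lsn: 12, page_id: 0, offset: 0, byte: 1 }, // skipped: already applied
        Frame { lsn: 20, page_id: 0, offset: 1, byte: 2 }, // replayed
    ];
    println!("redone {} record(s)", redo(&mut pages, &frames, 10));
}
```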

Phase 3: Undo (Rolling Back Losers)

The final phase (recovery/undo.rs) is responsible for rolling back all the “loser” transactions identified during Analysis.

  1. The UndoExecutor takes the list of loser transactions.
  2. For each loser transaction, it scans the WAL backwards, following the chain of log records for that transaction.
  3. For each operation record (like HeapInsert, HeapDelete), it performs the logical inverse operation:
    • To undo an Insert, it performs a Delete.
    • To undo a Delete, it restores the deleted data.
  4. Crucially, for every undo action it performs, it writes a Compensation Log Record (CLR) to the WAL. The CLR contains information about the undo action and, importantly, a pointer to the next log record that needs to be undone for that transaction.

This use of CLRs makes the recovery process itself crash-proof. If the system crashes during the undo phase, the next time recovery runs, it will see the CLRs. It will simply continue the undo process from where it left off by following the pointers in the CLRs, without ever having to undo an undo action.
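The backward walk and the CLR pointer chain can be sketched in a few lines. The record shapes below are simplified stand-ins; the important part is the control flow: invert the operation, emit a CLR carrying undo_next_lsn, and never undo a CLR.

```rust
// Undo walks a loser's records backwards; each undo step emits a CLR whose
// undo_next_lsn points at the next record still to be undone, so a crash
// mid-undo simply resumes from the last CLR.
enum Rec {
    HeapInsert { prev_lsn: Option<u64> },
    HeapDelete { prev_lsn: Option<u64>, before_image: Vec<u8> },
    Clr { undo_next_lsn: Option<u64> },
}

fn undo_txn(log: &std::collections::HashMap<u64, Rec>, mut cursor: Option<u64>, out: &mut Vec<Rec>) {
    while let Some(lsn) = cursor {
        match &log[&lsn] {
            Rec::HeapInsert { prev_lsn } => {
                // Logical inverse: delete the inserted tuple, then log a CLR.
                out.push(Rec::Clr { undo_next_lsn: *prev_lsn });
                cursor = *prev_lsn;
            }
            Rec::HeapDelete { prev_lsn, before_image } => {
                // Logical inverse: restore the before-image, then log a CLR.
                let _restore = before_image.clone();
                out.push(Rec::Clr { undo_next_lsn: *prev_lsn });
                cursor = *prev_lsn;
            }
            Rec::Clr { undo_next_lsn } => {
                // A CLR is never undone; just follow its pointer.
                cursor = *undo_next_lsn;
            }
        }
    }
}

fn main() {
    let mut log = std::collections::HashMap::new();
    log.insert(10, Rec::HeapInsert { prev_lsn: None });
    log.insert(20, Rec::HeapDelete { prev_lsn: Some(10), before_image: vec![1, 2, 3] });
    let mut clrs = Vec::new();
    undo_txn(&log, Some(20), &mut clrs); // start at the txn's last LSN
    println!("emitted {} CLR(s)", clrs.len());
}
```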

3. Checkpoints

If the log were allowed to grow forever, recovery would become slower and slower. Checkpoints are a background process that periodically creates a point of partial durability.

A checkpoint does the following:

  1. Temporarily stops new transactions from starting.
  2. Writes a BEGIN_CHECKPOINT record to the WAL, containing the current ATT and DPT.
  3. Flushes all dirty pages from the buffer pool to disk.
  4. Writes an END_CHECKPOINT record to the WAL.

Once a checkpoint is complete, the RecoveryManager knows that all changes described in log records before the BEGIN_CHECKPOINT record are safely on disk. Therefore, those older parts of the WAL file are no longer needed for recovery and can be truncated or recycled, keeping the recovery process efficient.
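The truncation decision is mechanical once the checkpoint's redo start is known: any WAL segment whose records all precede it can be recycled. A small sketch with an illustrative segment layout (not QuillSQL's on-disk format):

```rust
// Keep a segment if it may still contain records at or after `redo_start`;
// everything older is safe to recycle.
struct Segment {
    first_lsn: u64,
    last_lsn: u64,
}

fn recyclable(segments: &[Segment], redo_start: u64) -> Vec<usize> {
    segments
        .iter()
        .enumerate()
        .filter(|(_, s)| s.last_lsn < redo_start)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let segments = vec![
        Segment { first_lsn: 0, last_lsn: 99 },
        Segment { first_lsn: 100, last_lsn: 199 },
        Segment { first_lsn: 200, last_lsn: 299 },
    ];
    // Suppose the latest checkpoint recorded a redo start of LSN 150.
    println!("segments safe to recycle: {:?}", recyclable(&segments, 150));
}
```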


For Study & Discussion

  1. Repeating History: The Redo phase re-applies changes from all transactions, including those that will be undone later (the “losers”). This seems wasteful. Why is this “repeating history” approach a core principle of ARIES? What would go wrong if we tried to only redo changes from committed transactions?

  2. Compensation Log Records (CLRs): What specific problem would occur if the system crashed during the Undo phase and we didn’t write CLRs? How does a CLR’s “redo-only” nature make the recovery process idempotent and robust against repeated crashes?

  3. Checkpointing Frequency: What are the performance trade-offs between checkpointing very frequently versus very infrequently? Consider both normal runtime performance and the time it takes to recover after a crash.

  4. Programming Exercise: Add a new WAL record type. For example, a HeapReclaim record that logs the physical removal of a dead tuple by a vacuum process. To do this, you would need to:
    a. Add a variant to the HeapRecordPayload enum in wal_record.rs.
    b. Update the codec functions to handle its serialization.
    c. Decide what information the HeapReclaim record needs to contain.
    d. Implement the redo logic for this new record in the appropriate ResourceManager.
    What should happen when you redo a HeapReclaim record?

Write-Ahead Logging

This note dives into the WAL subsystem that powers QuillSQL’s recovery story. It explains how frames are generated, buffered, flushed, and replayed, and why logical heap/index records complement traditional physical FPWs.


1. Component map

| Layer | Location | Responsibility |
|---|---|---|
| WalManager | src/recovery/wal/mod.rs | Assigns LSNs, encodes WalRecordPayload, enqueues frames, and drives WalStorage::flush. |
| WalBuffer | src/recovery/wal/buffer.rs | Lock-free ring buffer (ConcurrentRingBuffer) that tracks pending frame count, bytes, and highest end LSN. |
| WalStorage | src/recovery/wal/storage.rs | Maintains WAL directory/segments, builds WalPages, and dispatches write/fsync tickets to a WalSink (default: DiskSchedulerWalSink). |
| WalWriterRuntime | src/recovery/wal/writer.rs | Background thread that periodically calls WalManager::flush(None). |
| ControlFileManager | src/recovery/control_file.rs | Persists durable_lsn, max_assigned_lsn, checkpoint metadata, and redo start for fast crash recovery. |
| Resource managers | src/recovery/resource_manager.rs, storage/*/heap_recovery.rs, storage/*/index_recovery.rs | Decode payloads per ResourceManagerId and execute redo/undo logic for heap/index/page operations. |

End-to-end DML flow:

  1. ExecutionContext mutates tuples/index entries via TableHeap / BPlusTreeIndex.
  2. Operators invoke WalManager::append_record_with or log_page_update after data-page changes succeed.
  3. Frames enter WalBuffer. Once thresholds are met, flush() drains frames into WalStorage, which schedules asynchronous writes/fsyncs.
  4. During commit, TransactionManager waits on WalManager::wait_for_durable(lsn) to guarantee WAL persistence before releasing locks.
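Step 4 is essentially a condition-variable wait on the durable LSN, which is also what enables group commit: many committers can piggyback on a single fsync. The sketch below shows that handshake with illustrative names (Durability, mark_durable), not the actual WalManager API.

```rust
use std::sync::{Condvar, Mutex};

// Group commit in miniature: committers block until the background flusher
// has pushed durable_lsn past their commit record.
struct Durability {
    durable_lsn: Mutex<u64>,
    advanced: Condvar,
}

impl Durability {
    fn wait_for_durable(&self, lsn: u64) {
        let mut guard = self.durable_lsn.lock().unwrap();
        while *guard < lsn {
            guard = self.advanced.wait(guard).unwrap();
        }
    }

    // Called by the WAL writer after a successful fsync.
    fn mark_durable(&self, lsn: u64) {
        let mut guard = self.durable_lsn.lock().unwrap();
        if lsn > *guard {
            *guard = lsn;
            self.advanced.notify_all();
        }
    }
}

fn main() {
    let d = Durability { durable_lsn: Mutex::new(0), advanced: Condvar::new() };
    d.mark_durable(120);     // background flush completed through LSN 120
    d.wait_for_durable(100); // commit record at LSN 100 is already durable
    println!("commit may release locks");
}
```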

2. Frame & record taxonomy

wal_record.rs defines the canonical envelope. encode_frame(lsn, prev_lsn, payload) serializes a record, while WalAppendContext reports the LSN range back to the caller.

| ResourceManagerId | Payload | Purpose |
|---|---|---|
| Page | PageWritePayload | Full-page image (FPW) on first-touch; also used when logical redo is not available. |
| Heap | HeapRecordPayload::{Insert,Delete} | Logical tuple records carrying relation id, page/slot, TupleMetaRepr, and tuple bytes (delete payloads always carry the before-image). |
| Index | IndexRecordPayload::{LeafInsert,LeafDelete,Leaf/Internal Split/Merge/Redistribute,Parent*,Root*} | Logical B+Tree leaf mutations plus structural changes (split/merge/redistribute, parent updates, root install/adopt/reset). |
| Transaction | TransactionPayload | BEGIN / COMMIT / ABORT markers that seed Undo’s per-txn chains. |
| Checkpoint | CheckpointPayload | Captures ATT/DPT snapshots plus redo start. |
| Clr | ClrPayload | Compensation Log Records documenting each undo step and its undo_next_lsn. |

ResourceManagerId determines how RedoExecutor / UndoExecutor route frames: Page → physical redo only; Heap/Index → logical redo/undo using storage helpers; Transaction/CLR/Checkpoint → interpreted by the analysis/undo phases.


3. Heap WAL: MVCC-aware logical logging

  • Emission points – TableHeap::{insert,update,delete}_tuple call append_heap_record (see src/storage/heap/mvcc_heap.rs) before the data page is overwritten. Each record stores the relation identifier, page/slot, tuple metadata, and tuple bytes (delete payloads include the exact before-image).
  • Encoding – src/storage/heap/wal_codec.rs serializes TupleMeta (insert/delete txn id, command id, MVCC chain pointers) plus tuple bytes in a compact layout (a toy codec sketch follows this list).
  • Redo – HeapResourceManager (src/storage/heap/heap_recovery.rs) decodes the payload and applies it at the slot level (insert/overwrite) with an LSN check; it repacks the slotted page layout to keep offsets consistent and sets the page LSN to the frame LSN.
  • Undo – Always restores the stored before-image (metadata + tuple bytes) for loser transactions, removing the optional/branchy path.
  • Interaction with FPW – Heap logical redo handles tuple-level replay; FPWs are only used on first-touch or when no logical log exists for a page.
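To see roughly what such a tuple-level payload looks like on the wire, here is a toy round-trip codec for a TupleMeta-like struct plus tuple bytes. The field set and layout are illustrative only; the real encoding in src/storage/heap/wal_codec.rs differs in detail.

```rust
// A simplified stand-in for TupleMetaRepr: MVCC metadata followed by the raw
// tuple bytes, encoded little-endian.
#[derive(Debug, PartialEq)]
struct TupleMeta {
    insert_txn: u64,
    delete_txn: u64,
    command_id: u32,
}

fn encode(meta: &TupleMeta, tuple: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(24 + tuple.len());
    out.extend_from_slice(&meta.insert_txn.to_le_bytes());
    out.extend_from_slice(&meta.delete_txn.to_le_bytes());
    out.extend_from_slice(&meta.command_id.to_le_bytes());
    out.extend_from_slice(&(tuple.len() as u32).to_le_bytes());
    out.extend_from_slice(tuple);
    out
}

fn decode(buf: &[u8]) -> (TupleMeta, Vec<u8>) {
    let meta = TupleMeta {
        insert_txn: u64::from_le_bytes(buf[0..8].try_into().unwrap()),
        delete_txn: u64::from_le_bytes(buf[8..16].try_into().unwrap()),
        command_id: u32::from_le_bytes(buf[16..20].try_into().unwrap()),
    };
    let len = u32::from_le_bytes(buf[20..24].try_into().unwrap()) as usize;
    (meta, buf[24..24 + len].to_vec())
}

fn main() {
    let meta = TupleMeta { insert_txn: 7, delete_txn: 0, command_id: 1 };
    let bytes = encode(&meta, b"hello");
    let (decoded, tuple) = decode(&bytes);
    assert_eq!(decoded, meta);
    assert_eq!(tuple, b"hello".to_vec());
    println!("round-trip ok ({} bytes on the wire)", bytes.len());
}
```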

4. Index WAL: logical B+Tree leaf operations

  • Emission points – BPlusTreeIndex logs every leaf insert/delete and all structural changes (split/merge/redistribute, parent updates, root install/adopt/reset) via append_index_record (src/storage/index/btree_index.rs), using src/storage/index/wal_codec.rs to encode key schema, key bytes, page ids, and txns.
  • Redo (src/storage/index/index_recovery.rs) steps:
    1. Decode the key with TupleCodec using the stored schema.
    2. Fetch the target page through the buffer pool (preferred) or DiskScheduler.
    3. Skip if page_lsn >= frame_lsn; otherwise apply the logical mutation/structural rebuild and bump the page version.
    4. Write the updated page back with the frame LSN (including header root pointer for root install/adopt/reset).
  • Undo – Inverts leaf inserts/deletes for loser transactions. Structural records are redo-only (idempotent via LSN/version checks).
  • Benefits – Heap and index WAL are decoupled; logical leaf updates and structured payloads avoid full-page writes while remaining crash-safe for splits/merges/root changes.

5. Buffering & flush strategy

  • Thresholds – max_buffer_records (from WalOptions::buffer_capacity), flush_coalesce_bytes, and one WAL page (4 KiB) trigger batched flushes (see the sketch after this list).
  • Flush mechanics – WalManager::flush_with_mode drains frames up to a target LSN, asks WalStorage::append_records to write them, then waits on all WalFlushTickets. flush_until forces durability before commit or after undo.
  • Checkpoints – log_checkpoint forces a flush, records checkpoint_redo_start in the control file, and clears the “touched pages” set so new FPWs fire only once per checkpoint interval.
  • Background writer – WalWriterRuntime runs when WalOptions::writer_interval_ms is non-zero, smoothing out flush pressure even when foreground transactions are light.
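The threshold logic amounts to a handful of comparisons over the buffer's counters, as in this sketch (constant names and values are illustrative, loosely mirroring the WalOptions knobs above):

```rust
// A flush fires when any threshold is crossed: pending record count, pending
// bytes past the coalesce limit, or at least one full WAL page.
struct WalBufferStats {
    pending_records: usize,
    pending_bytes: usize,
}

const MAX_BUFFER_RECORDS: usize = 1024;       // ~ WalOptions::buffer_capacity
const FLUSH_COALESCE_BYTES: usize = 64 * 1024;
const WAL_PAGE_BYTES: usize = 4 * 1024;

fn should_flush(stats: &WalBufferStats) -> bool {
    stats.pending_records >= MAX_BUFFER_RECORDS
        || stats.pending_bytes >= FLUSH_COALESCE_BYTES
        || stats.pending_bytes >= WAL_PAGE_BYTES
}

fn main() {
    let stats = WalBufferStats { pending_records: 3, pending_bytes: 5000 };
    // 5000 bytes already exceeds one 4 KiB WAL page, so a batched flush fires.
    println!("flush now? {}", should_flush(&stats));
}
```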

6. Relation to ARIES

  1. Analysis – AnalysisPass parses the latest checkpoint, reconstructs ATT/DPT by scanning the tail of the log, and chooses a redo start (min(dpt.rec_lsn)).
  2. Redo (repeat history) – RedoExecutor iterates from start_lsn, invoking the appropriate resource manager for each frame. The Page RM checks pageLSN before applying FPW/delta; Heap/Index RMs use logical payloads to rebuild tuples or leaf entries.
  3. Undo – UndoExecutor chains loser transactions backwards, calling each resource manager’s undo method and emitting CLRs with undo_next_lsn. If recovery crashes mid-undo, the next run resumes at the recorded undo_next_lsn.

7. Tuning & troubleshooting

  • Configuration – WalOptions inside DatabaseOptions exposes segment_size, sync_on_flush, writer_interval_ms, synchronous_commit, retain_segments, etc.
  • Introspection – WalManager::pending_records() dumps in-memory frames for debugging; background::BackgroundWorkers::snapshot() reports WAL writer/checkpoint worker metadata. Enabling RUST_LOG=trace reveals FPW vs delta decisions and flush cadence.
  • Crash testing – Insert a forced std::process::exit(1) after specific DMLs, then restart and inspect RecoverySummary (redo count + loser transactions) to ensure heap/index WAL cover the intended cases.

With logical heap/index records plus FPWs as a safety net, QuillSQL’s WAL stays teaching-friendly while mirroring production-grade recoverability. When introducing new components (e.g., custom indexes or vacuum steps), define a payload + resource manager and they will automatically participate in ARIES.

Background Services

src/background/ hosts the asynchronous workers that keep a database healthy: WAL writers, checkpoints, buffer flushers, and MVCC vacuum. A central registry makes it easy to start/stop workers together—ideal for teaching how background maintenance supports foreground queries.


Responsibilities

  • Start workers according to configuration (WalOptions, MvccVacuumConfig, etc.).
  • Define lightweight traits (CheckpointWal, BufferMaintenance, TxnSnapshotOps) so workers can run without pulling in an async runtime.
  • Provide BackgroundWorkers, a registry that tracks WorkerHandles and shuts them down when Database drops.

Built-in Workers

| Worker | Trigger | Behavior |
|---|---|---|
| WAL writer | wal_writer_interval_ms | Calls WalManager::background_flush to durably write log buffers. |
| Checkpoint | checkpoint_interval_ms | Captures dirty page / active txn tables and emits Checkpoint records to bound recovery. |
| Buffer writer | bg_writer_interval | Flushes dirty frames to reduce checkpoint pressure. |
| MVCC vacuum | MvccVacuumConfig | Removes obsolete tuple versions once safe_xmin advances. |

Every worker registers itself with BackgroundWorkers; shutdown_all() ensures threads exit cleanly during tests or process teardown.
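A periodic worker with a cooperative shutdown flag is the core pattern; the sketch below shows it with std threads only. WorkerHandle and spawn_periodic are illustrative names, not the actual background API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

// A minimal worker handle in the spirit of BackgroundWorkers: a shutdown flag
// plus the join handle, so shutdown_all-style teardown can stop threads cleanly.
struct WorkerHandle {
    stop: Arc<AtomicBool>,
    thread: JoinHandle<()>,
}

fn spawn_periodic(name: &'static str, interval: Duration, tick: impl Fn() + Send + 'static) -> WorkerHandle {
    let stop = Arc::new(AtomicBool::new(false));
    let stop_clone = Arc::clone(&stop);
    let thread = thread::spawn(move || {
        while !stop_clone.load(Ordering::Relaxed) {
            tick(); // e.g. flush WAL buffers or take a checkpoint
            thread::sleep(interval);
        }
        println!("{name} exiting");
    });
    WorkerHandle { stop, thread }
}

fn main() {
    let worker = spawn_periodic("wal-writer", Duration::from_millis(10), || {
        // flush pending WAL frames here
    });
    thread::sleep(Duration::from_millis(35));
    worker.stop.store(true, Ordering::Relaxed);
    worker.thread.join().unwrap();
}
```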


Interactions

  • WalManager – WAL writer and checkpoint workers operate on Arc<dyn CheckpointWal>.
  • BufferManager – background flushers inspect dirty frames and help checkpoints capture consistent snapshots.
  • TransactionManager – MVCC vacuum queries TxnSnapshotOps for safe_xmin.

Teaching Ideas

  • Tune MvccVacuumConfig::batch_limit and chart how quickly old tuple versions disappear.
  • Disable a worker in tests to show why unflushed WAL or missing checkpoints lengthen recovery.
  • Enable RUST_LOG=background=info to trace how these tasks complement foreground load.

Configuration & Runtime Options

src/config/ centralizes tunables used by DatabaseOptions, the CLI/HTTP front-ends, and background workers. Keeping knobs in one place makes it easy to demonstrate how WAL, buffering, or vacuum behavior changes under different settings.


Key Types

| Type | Description |
|---|---|
| DatabaseOptions | Top-level options when constructing a database (WAL config, default isolation, etc.). |
| WalOptions | WAL directory, segment size, flush strategy, writer interval, sync mode. |
| IndexVacuumConfig / MvccVacuumConfig | Background worker intervals (buffer writer, MVCC vacuum). |
| BufferPoolConfig | Optional overrides for pool size, TinyLFU, and replacement policy details. |

Usage

  • CLI/HTTP front-ends parse env vars or config files into DatabaseOptions and pass them to Database::new_*.
  • During bootstrap_storage, the database wires these options into WalManager, DiskScheduler, and BackgroundWorkers.
  • Workers and execution components receive Arc references to the relevant configs so they can adapt at runtime without global state.
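In practice that means building an options struct once and handing it to the database constructor. The structs below are illustrative shapes only; field names and defaults in src/config/ may differ.

```rust
// Illustrative shapes only: the real DatabaseOptions/WalOptions live in
// src/config/ and may use different fields and defaults.
#[derive(Debug, Clone)]
struct WalOptions {
    segment_size: usize,
    writer_interval_ms: u64,
    flush_coalesce_bytes: usize,
    synchronous_commit: bool,
}

#[derive(Debug, Clone)]
struct DatabaseOptions {
    wal: WalOptions,
    buffer_pool_pages: usize,
}

impl Default for DatabaseOptions {
    fn default() -> Self {
        Self {
            wal: WalOptions {
                segment_size: 16 * 1024 * 1024,
                writer_interval_ms: 10,
                flush_coalesce_bytes: 64 * 1024,
                synchronous_commit: true,
            },
            buffer_pool_pages: 1024,
        }
    }
}

fn main() {
    // Trade a little durability latency for throughput in a lab setting.
    let mut opts = DatabaseOptions::default();
    opts.wal.synchronous_commit = false;
    opts.wal.writer_interval_ms = 5;
    println!("{opts:?}");
}
```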

Teaching Ideas

  • Toggle WalOptions::synchronous_commit to discuss commit latency vs durability.
  • Shrink the buffer pool to highlight eviction behavior under different replacement policies.
  • Adjust MvccVacuumConfig intervals and measure how vacuum frequency affects foreground write throughput.

Front-Ends (CLI / HTTP)

The bin/ directory contains the user-facing entry points. Both binaries embed the same Database type, so they demonstrate how the core engine can power different UIs.

| Binary | Purpose |
|---|---|
| client.rs | Interactive CLI (REPL) that reads SQL, executes it, and prints tabular output. |
| server.rs | HTTP + JSON API for integration tests or web UIs. |

CLI (bin/client.rs)

  • Uses rustyline to provide history, multi-line editing, and familiar shell shortcuts.
  • Each command calls database.run(sql) and formats the resulting Vec<Tuple>.
  • Supports meta commands (e.g., .tables) that expose catalog metadata—great for teaching how logical objects map to physical handles.
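Stripped of rustyline, the loop is just read-evaluate-print. In this sketch run_sql stands in for database.run(sql), and the .quit meta command is made up for illustration.

```rust
use std::io::{self, BufRead, Write};

// A stripped-down REPL loop in the spirit of bin/client.rs; the real client
// layers rustyline history/editing and catalog meta commands on top.
fn run_sql(sql: &str) -> Vec<Vec<String>> {
    vec![vec![format!("echo: {sql}")]] // placeholder result set
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut stdout = io::stdout();
    loop {
        write!(stdout, "quill> ")?;
        stdout.flush()?;
        let mut line = String::new();
        if stdin.lock().read_line(&mut line)? == 0 {
            break; // EOF (Ctrl-D)
        }
        let sql = line.trim();
        if sql.is_empty() {
            continue;
        }
        if sql == ".quit" || sql.eq_ignore_ascii_case("exit") {
            break;
        }
        for row in run_sql(sql) {
            println!("{}", row.join(" | "));
        }
    }
    Ok(())
}
```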

HTTP (bin/server.rs)

  • Built with axum/hyper (depending on the current Cargo.toml), exposing endpoints such as:
    • POST /query – run arbitrary SQL and return rows or an error payload.
    • Health/metrics endpoints—which you can extend in labs to surface background worker status or buffer metrics.
  • Configuration comes from QUILL_DB_FILE, QUILL_HTTP_ADDR, PORT, etc., mirroring how production services inject settings.

Teaching Ideas

  • Extend the CLI with \describe table to practice catalog lookups.
  • Add transaction endpoints (BEGIN/COMMIT) to the HTTP server to demonstrate how SessionContext scopes transactions per connection.
  • Combine CLI interaction with RUST_LOG tracing to walk through the entire query lifecycle.

Testing & Documentation

QuillSQL is intended for teaching, so the repo invests heavily in examples and automated verification. The tests/ tree and this mdBook work together to illustrate every module.


Test Suite

| Location | Purpose |
|---|---|
| tests/sql_example/*.slt | sqllogictest suites covering DDL, DML, transactions, and indexes. |
| tests/transaction_tests.rs | Rust unit tests that stress MVCC visibility, lock conflicts, and isolation semantics. |
| tests/storage_* | Component tests for heap/index/buffer internals—perfect references for lab exercises. |

Common commands:

cargo test -q
# focused run
cargo test tests::transaction_tests::repeatable_read_sees_consistent_snapshot_after_update -- --nocapture

For long-running suites, wrap with timeout to guard against hangs.


Documentation (mdBook)

  • The docs/ directory is an mdBook; run mdbook serve docs to browse locally.
  • Each module, including this page, has a dedicated chapter so instructors can teach subsystem by subsystem.
  • Anchor chapters such as architecture.md, transactions.md, and wal.md walk through end-to-end flows and subsystem deep dives.

Teaching Ideas

  • Require sqllogictest additions alongside code changes to reinforce “tests as docs”.
  • Use the mdBook site during lectures to connect diagrams with source files.
  • Assign “doc walk-through” tasks where students extend diagrams or add experiment instructions to existing chapters.