QuillSQL Internals
Welcome to the technical documentation for QuillSQL.
This book provides a deep dive into the internal architecture and implementation details of the database. It is intended for developers, contributors, and anyone interested in understanding how a relational database is built from the ground up, referencing concepts from classic database courses like CMU 15-445.
Table of Contents
- Overall Architecture: A high-level overview of the entire system.
- Core Modules
  - Buffer Manager: The in-memory page cache.
  - Storage Engine: How data is physically stored.
  - Indexes: The B+Tree implementation.
  - Recovery Manager (WAL): Crash recovery and the ARIES protocol.
  - Transaction Manager: Concurrency control with MVCC and 2PL.
  - Query Plan: The journey from SQL to an executable plan.
  - Query Optimizer: Rule-based plan transformations.
  - Execution Engine: The Volcano (iterator) execution model.
QuillSQL Architecture
This chapter gives a tour of the major subsystems, the flow of a SQL statement, and the contract between execution, transactions, and storage. It uses Mermaid diagrams so you can render the visuals directly inside mdBook or any compatible viewer.
1. End-to-End Pipeline
```mermaid
flowchart LR
    subgraph Frontend
        CLI["bin/client"] --> Parser["sql::parser"]
        HTTP["bin/server"] --> Parser
    end
    Parser --> LPlan["plan::LogicalPlanner"]
    LPlan --> Optimizer["optimizer::LogicalOptimizer"]
    Optimizer --> PPlan["plan::PhysicalPlanner"]
    PPlan --> Exec["execution::ExecutionEngine (Volcano)"]
    subgraph Txn["transaction::*"]
        Session["session::SessionContext"]
        TM["TransactionManager"]
        LM["LockManager"]
        Session --> TM --> LM
    end
    Exec <-->|snapshot, locks| Txn
    Exec --> Binding
    subgraph Storage["storage::* & buffer::*"]
        Binding["storage::engine::TableBinding"]
        Heap["storage::table_heap::TableHeap"]
        MVCC["storage::heap::MvccHeap"]
        Index["storage::index::BPlusTree"]
        Buffer["buffer::BufferManager"]
        Disk["storage::disk_scheduler (io_uring)"]
        WAL["recovery::WalManager"]
        Binding --> Heap
        Binding --> Index
        Heap <--> Buffer
        Index <--> Buffer
        Buffer <--> Disk
        WAL --> Disk
    end
    Background["background::workers\n(checkpoint, WAL writer, MVCC vacuum)"] --> Buffer & WAL & TM
```
Key takeaways
- The frontend (CLI/HTTP) only knows how to parse SQL and drive the planning stages; all shared state lives below.
- The `ExecutionEngine` drives a Volcano iterator. Each physical operator calls into the transaction runtime for visibility checks and locking, but touches storage exclusively through a `TableBinding`. This makes the executor easy to reason about in a classroom setting.
- Buffering, WAL, and disk I/O are centralized so that durability/ordering guarantees stay in one module.
2. Transactions, MVCC, and the Executor
```mermaid
sequenceDiagram
    autonumber
    participant Session
    participant TM as TransactionManager
    participant Runtime as TxnRuntime
    participant Exec
    participant Binding
    participant Heap as TableHeap
    Session->>TM: begin_txn(isolation, access_mode)
    TM-->>Session: Transaction handle
    Exec->>Runtime: TxnRuntime::new(&TM, &mut txn)
    Runtime-->>Exec: {snapshot, command_id}
    loop per tuple
        Exec->>Binding: scan.next()
        Binding->>Heap: TableIterator::next()
        Heap-->>Binding: (rid, meta, tuple)
        Binding-->>Exec: (rid, meta, tuple)
        Exec->>Runtime: is_visible(meta)?
        Runtime-->>Exec: yes/no
        alt visible & needs lock
            Exec->>Runtime: lock_row(table, rid, mode)
            Runtime-->>Exec: guard
        end
        alt mutation
            Exec->>Binding: prepare_row_for_write(...)
            Binding->>Heap: mvcc.full_tuple(...)
            Heap-->>Binding: (current_meta, tuple)
            Binding->>TM: push_undo + append WAL(HeapRecord)
        end
    end
    Session->>TM: commit(txn)
    TM->>WAL: Transaction{Commit}
    TM->>TM: flush_until(lsn) if synchronous
    TM-->>Session: release locks, clear snapshot
```
Snapshot policy
- Read Uncommitted / Read Committed: every command obtains a fresh snapshot, so repeated statements can see committed updates.
- Repeatable Read / Serializable: cache the first snapshot on the `Transaction` handle; subsequent commands reuse it for consistent reads. The lock manager releases S-locks at the end of each command for RR, but Serializable keeps them until commit. The visibility check behind these rules is sketched below.
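The following is a minimal sketch of that check, assuming a snapshot of the form `(xmin, xmax, active_txns)` and a `TupleMeta` that records inserter/deleter txn ids; the real QuillSQL structs also carry command ids and consult the reader's own transaction id.

```rust
// Simplified stand-ins for TupleMeta / TransactionSnapshot; field names are assumptions.
#[derive(Clone, Copy)]
struct TupleMeta {
    insert_txn_id: u64,
    delete_txn_id: Option<u64>,
}

struct TransactionSnapshot {
    xmin: u64,             // every txn id below this is already finished
    xmax: u64,             // txn ids at or above this had not started yet
    active_txns: Vec<u64>, // in-progress txns at snapshot time
}

impl TransactionSnapshot {
    /// A txn's effects are visible if it committed before the snapshot was taken.
    fn committed_before(&self, txn_id: u64) -> bool {
        if txn_id < self.xmin {
            return true; // finished before the oldest active txn
        }
        txn_id < self.xmax && !self.active_txns.contains(&txn_id)
    }
}

/// A tuple version is visible when its inserter committed before the snapshot
/// and its deleter (if any) did not.
fn is_visible(snapshot: &TransactionSnapshot, meta: &TupleMeta) -> bool {
    if !snapshot.committed_before(meta.insert_txn_id) {
        return false;
    }
    match meta.delete_txn_id {
        Some(deleter) => !snapshot.committed_before(deleter),
        None => true,
    }
}

fn main() {
    let snap = TransactionSnapshot { xmin: 5, xmax: 10, active_txns: vec![7] };
    let live = TupleMeta { insert_txn_id: 6, delete_txn_id: None };
    let deleted = TupleMeta { insert_txn_id: 6, delete_txn_id: Some(8) };
    let uncommitted = TupleMeta { insert_txn_id: 7, delete_txn_id: None };
    assert!(is_visible(&snap, &live));
    assert!(!is_visible(&snap, &deleted));
    assert!(!is_visible(&snap, &uncommitted)); // txn 7 was still active
}
```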
Undo & WAL
- Each mutation produces a logical `HeapRecordPayload` or `IndexRecordPayload`. The heap payload already carries redo (new bytes) and undo (old bytes), so recovery can replay forward or backward without re-reading heap pages.
- On abort, the manager walks the undo stack, emits CLRs, and re-applies the inverse heap/index operations.
3. Storage Layering
```mermaid
graph TD
    subgraph Planner View
        Catalog
    end
    Catalog -->|Arc<TableHeap>| TableBinding
    TableBinding --> Heap["TableHeap"]
    TableBinding --> Mvcc["MvccHeap"]
    TableBinding --> Index["BPlusTreeIndex"]
    Heap -->|pages| BufferMgr
    Index -->|pages| BufferMgr
    BufferMgr --> DiskMgr
    BufferMgr --> WalMgr
```
| Layer | Responsibility | Notes |
|---|---|---|
| TableHeap | Physical slotted pages, tuple encoding, WAL page image helpers | Exposes insert/update/delete that emit heap-specific WAL payloads before dirtying frames. |
| MvccHeap | Version chain management, delete-marking, undo metadata | Creates new versions, rewires forward/back pointers, and delegates actual I/O to TableHeap. |
| TableBinding | Safe façade for the executor | Provides scan, index_scan, insert, delete, update, and prepare_row_for_write, always pairing heap/index changes so operators stay small. |
| BufferManager + DiskScheduler | Unified cache + async I/O | Uses LRU-K (+ optional TinyLFU admission) and io_uring to keep the hot set resident. |
| WalManager | ARIES-compatible log | Supports logical heap/index records, page write/delta fallback, CLRs, checkpoints, and background flush threads. |
4. TableBinding Contract
```mermaid
classDiagram
    class TableBinding {
        +scan() -> TupleStream
        +index_scan(name, IndexScanRequest) -> TupleStream
        +insert(&mut TxnContext, &Tuple)
        +delete(&mut TxnContext, RecordId, TupleMeta, Tuple)
        +update(&mut TxnContext, RecordId, Tuple, TupleMeta, Tuple) -> RecordId
        +prepare_row_for_write(&mut TxnContext, RecordId, &TupleMeta)
        +table_heap() -> Arc<TableHeap>
    }
    class PhysicalOperator {
        +init(&mut ExecutionContext)
        +next(&mut ExecutionContext) -> Option<Tuple>
    }
    PhysicalOperator --> TableBinding : uses
```
This binding hides every MVCC/WAL detail from the operators (see the sketch below):
- No more ad-hoc `catalog.table_indexes()` calls.
- No direct references to `MvccHeap` or `TableHeap`.
- Executor code reads like pseudo SQL: “lock table”, “scan binding”, “update tuple”.
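To make the contract concrete, here is a sketch of an operator body that only talks to a binding. The trait shapes below are simplified stand-ins for `TableBinding`/`TupleStream`; the real signatures thread a `TxnContext` and `TupleMeta` through every call and return `QuillSQLResult`.

```rust
// Simplified stand-ins; the real types live in storage::engine.
struct Tuple;
struct RecordId(u64);

trait TupleStream {
    fn next(&mut self) -> Option<(RecordId, Tuple)>;
}

trait TableBinding {
    fn scan(&self) -> Box<dyn TupleStream>;
    fn delete(&mut self, rid: RecordId, old: Tuple) -> Result<(), String>;
}

/// The operator only states *what* to do; the binding decides *how*
/// (MVCC chain navigation, undo logging, index maintenance).
fn delete_all(binding: &mut dyn TableBinding) -> Result<usize, String> {
    // Collect the victims first, then apply the mutations through the binding.
    let mut stream = binding.scan();
    let mut victims = Vec::new();
    while let Some((rid, tuple)) = stream.next() {
        victims.push((rid, tuple));
    }
    drop(stream);
    let mut deleted = 0;
    for (rid, tuple) in victims {
        binding.delete(rid, tuple)?;
        deleted += 1;
    }
    Ok(deleted)
}
```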
5. WAL & Recovery Details
```mermaid
flowchart LR
    subgraph WAL Record Types
        HI["HeapInsert"] --> redo
        HD["HeapDelete (with before-image)"] --> redo & undo
        LI["IndexLeaf{Insert,Delete}"]
        IS["IndexSplit/Merge/Redistribute/Parent*"]
        IR["IndexRoot Install/Adopt/Reset"]
        PI["PageWrite (FPW)"]
        CI["Checkpoint"]
        CL["CLR"]
    end
    Exec -->|Heap/Index payloads| WalMgr
    BufferMgr -->|first-touch FPW| WalMgr
    WalMgr --> DiskScheduler --> Log["log files"]
    Recovery -->|analysis/redo/undo| BufferMgr
```
- Logical logging first: heap/index mutations emit redo + undo at the logical level. This keeps the WAL stream compact and human-readable for teaching.
- Physical fallbacks: metadata-heavy pages (meta, freelist, header) still leverage PageWrite FPWs to guarantee a consistent base image, especially on the first page flush after a crash.
- Restart: `RecoveryManager` performs the classical ARIES sequence. It uses the `dirty_page_table` and `active_txn_table` captured by checkpoints to limit redo and undo work.
6. Background Workers
| Worker | Purpose | Configuration |
|---|---|---|
| WAL writer | Periodically flushes WAL buffers, coalesces adjacent writes | WalOptions::writer_interval_ms, buffer_capacity |
| Checkpoint | Records ATT + DPT, gently pushes dirty buffers | WalOptions::checkpoint_interval_ms |
| Buffer writer | Flushes frames when the dirty set grows too large | background::BufferWriterConfig |
| MVCC vacuum | Reclaims obsolete tuple versions based on safe_xmin | MvccVacuumConfig |
| Index vacuum | Cleans up deleted index entries using B-link traversal | IndexVacuumConfig |
Workers register with background::BackgroundWorkers, so the database can stop and join them cleanly on shutdown. Each worker publishes metrics (intervals, batches processed) for observability.
7. Example: Repeatable Read UPDATE
```mermaid
sequenceDiagram
    participant T1 as Txn 1 (RR)
    participant T2 as Txn 2 (RC)
    participant Heap as TableHeap
    T1->>Heap: SELECT (snapshot S0)
    Heap-->>T1: version V0
    T2->>Heap: UPDATE -> insert V1, delete V0
    Heap-->>T2: ack
    T2->>T2: COMMIT (WAL + locks release)
    T1->>Heap: SELECT again
    Heap-->>T1: V0 still visible via snapshot S0
    T1->>T1: COMMIT
    Note over T1,Heap: Vacuum later removes V0 when safe_xmin advances
```
This timeline demonstrates:
- Snapshots shield Repeatable Read statements from concurrent writers even if row-level locks are released early.
- The MVCC chain (`V1.prev_version = V0`, `V0.next_version = V1`) lets future readers reach the newest committed version while the vacuum worker reclaims obsolete ones lazily.
8. Observability & Configuration Cheat Sheet
- Logging: `RUST_LOG=trace` surfaces lock acquisitions, MVCC vacuums, and io_uring completions.
- Runtime knobs: `WalOptions` (segment size, sync policy), `BufferPoolConfig` (capacity, TinyLFU toggle), `MvccVacuumConfig`, and `IndexVacuumConfig` can all be adjusted via `DatabaseOptions`.
- Metrics: `background::BackgroundWorkers::snapshot()` reports worker health; WAL exposes current LSN and flush LSN; the buffer manager can dump the dirty page table for diagnostics.
With these layers in place, QuillSQL remains faithful to production-grade engines (MVCC + WAL + buffer pool) while keeping its code and documentation approachable for coursework and research prototypes.
Module Overview
This page gives a teaching-friendly tour of every QuillSQL subsystem. Each section
explains where the code lives (src/<module>), the most important types, and the
design decisions that make the module fit into the whole system. Read it together
with architecture.md for the big-picture data flow.
1. SQL Front-End (src/sql)
| Submodule | Responsibilities | Key Types / Functions |
|---|---|---|
| lexer.rs, parser.rs | Translate raw SQL into an AST using sqlparser plus Quill-specific extensions. | parse_sql(&str) -> Vec<Statement> |
| ast/mod.rs | Normalises identifiers, handles multi-part names, attaches spans for diagnostics. | NormalizedIdent, ObjectNameExt |
| plan/lowering.rs | Bridges AST → planner structs for DDL extras not in upstream sqlparser. | CreateIndexSpec, ColumnDefExt |
Implementation notes:
- We intentionally keep the AST “SQL-shaped”. No premature desugaring occurs in the parser, which keeps the step-by-step teaching narrative simple: SQL text → AST → logical plan.
- Error messages bubble up with span information, so labs around parser extensions can show students exactly which token misbehaved.
- Suggested exercises: extend the parser with `ALTER TABLE` or window functions, then observe how the new AST nodes flow into the planner layer.
2. Logical Planning (src/plan)
| Component | Description |
|---|---|
| LogicalPlanner | Converts AST nodes into strongly typed LogicalPlan variants. Responsible for type checking, alias resolution, and scope handling. |
| PlannerContext | Surface for catalog lookups while planning. |
| PhysicalPlanner | Lowers optimized logical plans into physical Volcano operators (PhysicalPlan). |
Highlights:
- Every `LogicalPlan` variant stores its child plans in `Arc`, so the tree is cheap to clone for optimizer passes or debugging prints.
- Planner enforces column binding rules: scope stacks keep track of aliases, CTEs, and correlation so students can see how real compilers resolve identifiers.
- DDL nodes capture `TableReference` + `Schema` objects. Later phases use them to skip repeated catalog lookups (helpful for labs about metadata caching).
3. Optimizer (src/optimizer)
| Piece | Responsibility |
|---|---|
| LogicalOptimizer | Applies a pipeline of rule-based rewrites (predicate pushdown, projection pruning, constant folding). |
| rules | Individual OptimizerRule implementations. |
Teaching hooks:
- Each rule implements `OptimizerRule::rewrite(&LogicalPlan) -> Option<LogicalPlan>`. The return type makes it obvious whether a rewrite fired, so labs can instrument hit counts or add tracing (see the sketch after this list).
- Because rules are pure functions, students can safely reorder or A/B test heuristics in isolation (e.g., “what if we only push predicates below joins when the selectivity estimate exceeds X?”).
- Example lab: add constant folding or join commutation, run the sqllogictest suite, and compare plan dumps to see new shapes.
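As a toy illustration of that Option-returning contract (not the real rule set), here is a miniature plan enum and a `MergeLimit`-style rewrite; QuillSQL's `LogicalPlan` and rule trait are richer, but the firing/fixpoint logic has the same shape.

```rust
// Toy plan and rule; names mirror the doc's MergeLimit rule but the types are stand-ins.
#[derive(Clone, Debug, PartialEq)]
enum Plan {
    Scan(String),
    Limit(usize, Box<Plan>),
}

trait Rule {
    /// Return Some(rewritten) when the rule fires, None when it does not apply.
    fn rewrite(&self, plan: &Plan) -> Option<Plan>;
}

/// Merge Limit(a, Limit(b, child)) into Limit(min(a, b), child).
struct MergeLimit;

impl Rule for MergeLimit {
    fn rewrite(&self, plan: &Plan) -> Option<Plan> {
        if let Plan::Limit(outer, inner) = plan {
            if let Plan::Limit(innermost, child) = inner.as_ref() {
                return Some(Plan::Limit((*outer).min(*innermost), child.clone()));
            }
        }
        None
    }
}

fn main() {
    let plan = Plan::Limit(10, Box::new(Plan::Limit(3, Box::new(Plan::Scan("users".into())))));
    let rewritten = MergeLimit.rewrite(&plan).expect("rule should fire");
    assert_eq!(rewritten, Plan::Limit(3, Box::new(Plan::Scan("users".into()))));
    assert!(MergeLimit.rewrite(&rewritten).is_none()); // nothing left to do: fixpoint
}
```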
4. Execution Engine (src/execution)
| Element | Details |
|---|---|
| PhysicalPlan | Enum covering all Volcano operators. Each variant implements VolcanoExecutor. |
| VolcanoExecutor | init(&mut ExecutionContext) and next(&mut ExecutionContext) pair define the iterator model. |
| ExecutionContext | Supplies catalog access, expression evaluation, and (most importantly) a pluggable StorageEngine. |
Design notes:
- Operators stay declarative. They describe what to scan or modify and delegate the how to storage handles. For example, `PhysicalSeqScan` now requests a `TupleStream` via `ExecutionContext::table_stream`, so it never touches `TableHeap` internals.
- Helper methods such as `eval_predicate`, `insert_tuple_with_indexes`, or `prepare_row_for_write` encapsulate MVCC/locking rules so physical plans remain short.
- ExecutionContext caches `TxnContext`, making it straightforward to teach isolation semantics: examine `TxnContext::lock_table` and `read_visible_tuple` to see when locks are taken or released.
- Suggested lab: implement a new physical operator (e.g., hash join) by wiring two child `PhysicalPlan`s without ever touching storage-specific code. This highlights the payoff of the handle abstraction.
5. Transaction Runtime (src/transaction)
| Type | Purpose |
|---|---|
| TransactionManager | Creates/commits/aborts transactions, manages undo chains, and coordinates WAL durability. |
| TxnContext | Per-command wrapper passed to execution. Provides MVCC snapshot (TransactionSnapshot), lock helpers, and command ids. |
| LockManager | Multi-granularity 2PL with deadlock detection. |
Deeper dive:
- `Transaction` stores a cached `TransactionSnapshot` plus its undo records. Students can inspect `Transaction::set_snapshot` to see how Repeatable Read/Serializable keep a stable view.
- `TxnRuntime` pairs a transaction with a command id. Every SQL statement increments the command id so MVCC can distinguish between tuples created earlier in the same transaction vs. the current statement—great for explaining “recent writes are invisible during UPDATE scanning”.
- `LockManager` exposes functions like `lock_table`/`lock_row`. Internally it keeps a wait-for graph and victim selection policy, which makes deadlock detection tangible for concurrency lectures.
- Undo entries (`UndoRecord`) store heap + index data. When an abort occurs the engine generates CLRs, demonstrating ARIES-style logical undo.
6. Storage Engine & Handles (src/storage)
| Component | Highlights |
|---|---|
| engine.rs | Defines StorageEngine, TableHandle, IndexHandle, TupleStream, and IndexScanRequest. |
| table_heap | Slotted-page heap with MVCC metadata (TupleMeta, forward/back pointers). |
| index | B+Tree (B-link) implementation with iterators and codecs. |
Key ideas to emphasise:
- Handle abstraction: Execution asks the engine for a `TableHandle`, then calls `full_scan()` to receive a `TupleStream`. The default engine simply wraps `TableHeap`/`BPlusTreeIndex`, but students can plug in their own engines for research.
- TupleStream: Minimal interface returning `(RecordId, TupleMeta, Tuple)` triples. Operators layer MVCC visibility on top, while the stream hides buffering details.
- Scan extensions: Consider extending `full_scan()` to accept projection/batch hints—a natural exercise for showing how execution and storage negotiate capabilities.
- `table_heap` demonstrates MVCC version chains (forward/back pointers) and the slotted page layout. Encourage students to trace `MvccHeap::update` alongside WAL entries to see how new versions link together.
- `index/btree_index.rs` implements a B-link tree with separate codecs. Scanning via `TreeIndexIterator` shows how to perform lock-free range scans using sibling pointers—perfect for advanced systems lectures.
7. Buffer Manager & Disk I/O (src/buffer, src/storage/disk_*)
| Module | Description |
|---|---|
| buffer::BufferManager | Page table + LRU-K/TinyLFU replacer, dirty tracking, guard types for safe borrowing. |
| storage::disk_scheduler | io_uring-powered async I/O. Foreground threads enqueue read/write/fsync commands and await completions. |
| storage::disk_manager | Thin wrapper for file layout, page enlargement, and durability fences. |
Extra details:
- Page guards come in three flavours: read, write, and upgradeable. Each implements `Drop` semantics that release latches automatically, reinforcing RAII patterns.
- Replacement policy combines LRU-K history with TinyLFU admission. Labs can toggle the feature flag to measure hit-rate differences under sqllogictest or custom workloads.
- DiskScheduler uses lock-free queues plus dedicated worker tasks. A teaching exercise is to run with `RUST_LOG=debug` and observe how read/write/fsync commands are batched.
8. Recovery & WAL (src/recovery)
| Item | Notes |
|---|---|
| WalManager | Allocates LSNs, buffers log records, drives background WAL writer, and integrates with checkpoints. |
| RecoveryManager | Implements ARIES analysis/redo/undo. Uses ControlFileManager snapshots to seed restart. |
| wal_record.rs | Defines logical (HeapInsert, IndexDelete, index structure/root records) and physical (PageWrite FPWs) records. |
Teaching hook:
- WAL and data share the disk scheduler. Students can trace one UPDATE from log append, to buffer dirtying, to redo/undo via the exact same `RecordId`.
- Recovery exports statistics (redo count, loser transactions) so labs can check their WAL experiments automatically.
- `recovery/analysis.rs` shows how the Dirty Page Table and Active Transaction Table are reconstructed—ideal for demystifying ARIES’ first phase.
- Students can implement “logical replay only” or “page image replay” modes by toggling the commit record types, then verify behaviour using the provided transaction tests.
9. Background Services (src/background)
| Worker | What it does |
|---|---|
| WAL writer | Flushes WAL buffer at configured cadence. |
| Checkpoint | Captures the Dirty Page Table + Active Transaction Table. |
| Buffer writer | Flushes dirty frames to keep checkpoints short. |
| MVCC vacuum | Reclaims tuple versions older than safe_xmin. |
More context:
- Workers share a `BackgroundWorkers` registry so the database can spawn/stop them as a group (handy for tests). The registry exposes `shutdown_all()`, which unit tests call to ensure a clean exit.
- Config structs (`IndexVacuumConfig`, `MvccVacuumConfig`, etc.) live in `src/config` and are exposed through `DatabaseOptions` for easy fiddling in labs.
- WAL writer and checkpoint worker simply wrap closures around `tokio::task::JoinHandle`. This design keeps async runtime code out of the core engine, making it simpler for students to trace background effects.
- Exercise idea: tweak `MvccVacuumConfig::batch_limit` and observe how many tuple versions stay behind by querying hidden statistics tables.
10. Configuration & Session Layer (src/database, src/session, src/config)
| Type | Role |
|---|---|
| Database | Boots disk/WAL/buffer components, runs recovery, wires background workers, and executes SQL strings. |
| SessionContext | Tracks per-connection defaults (autocommit, isolation) and holds the active transaction handle. |
| config::* | Central place for WAL, buffer pool, vacuum, and HTTP/CLI tuning knobs. |
Extra pointers:
- Front-ends (`bin/client`, `bin/server`) both embed a `Database`, proving that the core library can serve multiple UIs without change.
- `DatabaseOptions` show how to construct dev/test setups (temporary files, alternate WAL directories) in a few lines.
- `session::SessionContext` demonstrates auto-commit semantics: it lazily starts a transaction and uses `TransactionScope` to interpret `SET TRANSACTION` statements.
- Configuration structs derive `Clone + Debug`, making them easy to print in labs or feed from environment variables (the HTTP server uses `QUILL_DB_FILE`, `PORT`, etc.).
11. Testing & Documentation (tests/, docs/)
| Area | Purpose |
|---|---|
| tests/sql_example/*.slt | sqllogictest suites for SQL coverage. |
| tests/transaction_tests.rs | Unit tests for MVCC invariants, locking, and visibility. |
| docs/ | This mdBook. Every module adds its own deep-dive chapter, making it straightforward for students to jump from code to guided explanations. |
Testing strategy:
- Developers can run `cargo test -q` for fast feedback. Long-running suites can be wrapped with `timeout` as suggested in `AGENTS.md`.
- Example-driven docs (like this page) mirror the repository layout, so onboarding students can find code and tests with minimal guesswork.
- Encourage students to add sqllogictest cases alongside code changes. Because each case lives in `tests/sql_example`, git diffs double as documentation.
- For modules with heavy concurrency (lock manager, WAL), pair unit tests with tracing: the CI logs become walkthroughs for tricky paths.
12. Suggested Reading Order for Learners
- Introduction ➜ Architecture – get the 10,000ft view.
- Module Overview (this page) – map names to directories.
- Execution + Storage chapters – understand TupleStreams, handles, and MVCC.
- Recovery + Transaction – see how WAL and MVCC interlock.
- Buffer, Index, Background – dive into advanced systems topics once the basics stick.
Treat this document as a living index: update it whenever a subsystem gains new entry points (e.g., asynchronous scans, new index types) so future contributors always know where to look next.
Contributor’s Guide
Welcome, and thank you for your interest in contributing to QuillSQL! Whether you’re fixing a bug, adding a new feature, or improving the documentation, this guide will help you get started.
1. Getting Started: Your Development Environment
Prerequisites
- Rust: QuillSQL is written in Rust. If you don’t have it yet, install it via rustup. This will provide you with `rustc` (the compiler) and `cargo` (the package manager and build tool): `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
- Build Essentials: Ensure you have a C++ compiler like `gcc` or `clang` installed, which is a common dependency for some Rust crates.
Setup
- Fork the Repository: Start by forking the main QuillSQL repository to your own GitHub account.
- Clone Your Fork: Clone your forked repository to your local machine: `git clone https://github.com/feichai0017/QuillSQL.git` and then `cd QuillSQL`.
- Build the Project: Compile the entire project to ensure everything is set up correctly: `cargo build`
2. Development Workflow
Running Tests
Before and after making any changes, it’s crucial to run the test suite to ensure you haven’t broken anything.
- Run all unit and integration tests: `cargo test`
- Run the benchmark suite: `cargo bench`
Code Style and Quality
We adhere to the standard Rust coding style and use tools to enforce it.
- Formatting: Before committing, please format your code with `rustfmt`: `cargo fmt --all`
- Linting: We use `clippy` to catch common mistakes and improve code quality. Please ensure `clippy` passes without warnings: `cargo clippy --all-targets -- -D warnings`
Submitting Your Contribution
- Create a New Branch: Create a descriptive branch name for your feature or bugfix: `git checkout -b my-awesome-feature`
- Make Your Changes: Write your code. Add new tests to cover your changes. Ensure all existing tests still pass.
- Format and Lint: Run `cargo fmt` and `cargo clippy` as described above.
- Commit Your Work: Write a clear and concise commit message: `git add .` followed by `git commit -m "feat: Add support for window functions"`
- Push to Your Fork: Push your branch to your fork on GitHub: `git push -u origin my-awesome-feature`
- Open a Pull Request: Go to the original QuillSQL repository on GitHub. You should see a prompt to open a Pull Request from your new branch. Fill out the PR template with a description of your changes.
3. Working on the Documentation
The documentation is built using mdbook. To preview your changes locally, you’ll need to install it and the mermaid plugin.
- Install `mdbook` and `mdbook-mermaid`: `cargo install mdbook` and `cargo install mdbook-mermaid`
- Serve the Book Locally: Run `mdbook serve docs` from the root of the project. This will build the book and start a local web server; open your browser to `http://localhost:3000` to see the live-previewed documentation.
SQL Front-End
The SQL front-end lives in src/sql/. It turns raw UTF-8 query text into the abstract
syntax trees (ASTs) consumed by planning, while layering Quill-specific name handling
and diagnostics on top of sqlparser.
Responsibilities
- Parse SQL text into `sqlparser::ast::Statement` values.
- Record precise spans so error messages can highlight the exact byte range.
- Normalise identifiers (case folding, quoted names, multi-part paths).
- Provide helper traits so the logical planner can lower AST nodes without duplicating syntax checks.
Directory Layout
| Path | Purpose | Key Types |
|---|---|---|
| lexer.rs | Token helpers that preserve offsets. | Token, TokenExt |
| parser.rs | Single entry point used across the codebase. | parse_sql, SqlInput |
| ast/mod.rs | Planner-facing helpers. | NormalizedIdent, ObjectNameExt |
| error.rs | Span-aware parser errors. | SqlError, SqlSpan |
Parsing Pipeline
- Lexing – wrap sqlparser’s lexer so every token keeps start/end offsets.
- AST generation – invoke sqlparser to produce standard `Statement` structs.
- Normalisation – convert identifiers into `NormalizedIdent`, deal with schema qualifiers, and build pieces of `TableReference`.
- Planner bridge – traits like `ColumnRefExt` expose methods such as `relation()` or `column()` so `LogicalPlanner` can treat different SQL syntaxes uniformly.
Interactions
- Logical planner consumes the AST directly and relies on helper traits from this module to convert identifiers into catalog references.
- Database / Session catch `SqlError` values, so both CLI and HTTP front-ends show consistent caret diagnostics.
- Tests (`tests/sql_example/*.slt`, `tests/sql_parser.rs`) assert on parser output and error strings to keep teaching feedback stable.
Implementation Notes
- `SqlSpan` stores byte offsets, which makes it trivial to slice the original SQL and render highlighted errors.
- Extended statements (e.g., `EXPLAIN`, `BEGIN TRANSACTION`) show how to layer Quill-specific syntax without forking sqlparser entirely.
- We avoid desugaring at this stage so students can trace SQL → AST → logical plan step by step.
Teaching Ideas
- Add a new statement (`CREATE VIEW`, `ALTER TABLE ...`) and follow the AST through the pipeline.
- Improve error hints (“Did you forget FROM?”) to see how better diagnostics aid users.
- Write fuzz tests that round-trip SQL → AST → SQL to discuss parser determinism.
Catalog Module
src/catalog/ acts as QuillSQL’s data dictionary. It tracks schema/table/index metadata,
statistics, and the mapping between logical names and physical storage objects such as
TableHeap and BPlusTreeIndex. Every layer—planner, execution, background workers—uses
the catalog to discover structure.
Responsibilities
- Persist definitions for schemas, tables, columns, indexes, and constraints.
- Map logical `TableReference`s to physical handles (heap files, index roots, file ids).
- Store table statistics (row counts, histograms) that drive ANALYZE and optimization.
- Manage the DDL lifecycle: creation and deletion update the in-memory registry and the on-disk metadata pages.
Directory Layout
| Path | Description | Key Types |
|---|---|---|
| mod.rs | Public API surface. | Catalog, TableHandleRef |
| schema.rs | Schema objects and table references. | Schema, Column, TableReference |
| registry/ | Thread-safe registry for heaps (MVCC vacuum). | TableRegistry |
| statistics.rs | ANALYZE output and helpers. | TableStatistics |
| loader.rs | Boot-time metadata loader. | load_catalog_data |
Core Concepts
TableReference
Unified identifier (database, schema, table). Logical planner, execution, and transaction code all use it when requesting handles from the catalog.
Registries
TableRegistry maps internal IDs to Arc<TableHeap> plus logical names. It is used by
the MVCC vacuum worker to iterate user tables without poking directly into catalog data.
Schema & Column
Schema stores column definitions (type, default, nullability). Execution uses it when
materialising tuples; the planner uses it to check expression types. Schema::project
helps physical operators build projected outputs.
TableStatistics
ANALYZE writes row counts and histograms into the catalog. Optimizer rules and planner
heuristics can consult these stats when deciding whether to push filters or pick indexes.
Each column tracks null/non-null counts, min/max values, and a sample-based distinct
estimate, enabling DuckDB-style selectivity heuristics (1/distinct, uniform ranges).
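A back-of-the-envelope version of those heuristics is shown below; the field names are simplified stand-ins for what `TableStatistics` records per column, but the 1/distinct and uniform-range math is the same.

```rust
// Per-column statistics (assumed field names) driving simple selectivity estimates.
struct ColumnStats {
    non_null: u64,
    nulls: u64,
    min: f64,
    max: f64,
    distinct: u64,
}

impl ColumnStats {
    /// Equality predicate: assume values are spread evenly over distinct values.
    fn eq_selectivity(&self) -> f64 {
        if self.distinct == 0 { 0.0 } else { 1.0 / self.distinct as f64 }
    }

    /// Range predicate `col < v`: assume a uniform distribution between min and max.
    fn lt_selectivity(&self, v: f64) -> f64 {
        if self.max <= self.min {
            return 1.0;
        }
        ((v - self.min) / (self.max - self.min)).clamp(0.0, 1.0)
    }

    /// Fraction of rows that are non-null (NULLs never satisfy the predicate).
    fn non_null_fraction(&self) -> f64 {
        let total = self.non_null + self.nulls;
        if total == 0 { 0.0 } else { self.non_null as f64 / total as f64 }
    }
}

fn main() {
    let age = ColumnStats { non_null: 9_000, nulls: 1_000, min: 18.0, max: 98.0, distinct: 80 };
    let rows = 10_000.0;
    // WHERE age = 30
    println!("age = 30 -> ~{:.0} rows", rows * age.non_null_fraction() * age.eq_selectivity());
    // WHERE age < 30
    println!("age < 30 -> ~{:.0} rows", rows * age.non_null_fraction() * age.lt_selectivity(30.0));
}
```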
Interactions
- SQL / Planner – DDL planning calls `Catalog::create_table`/`create_index`; name binding relies on `Schema`.
- Execution – `ExecutionContext::table_handle` and `index_handle` fetch physical handles through the catalog, so scans never hard-code heap locations.
- Background workers – MVCC and index vacuum iterate the registries via `Arc` clones.
- Recovery – `load_catalog_data` rebuilds the in-memory catalog from control files and metadata pages during startup.
Teaching Ideas
- Extend the schema system with hidden or computed columns and teach the catalog to store the extra metadata.
- Add histogram bins to `TableStatistics` and demonstrate how a simple cost heuristic can choose better plans.
- Turn on `RUST_LOG=catalog=debug` to observe how DDL mutates the registries.
Expression & Scalar Evaluation
The expression subsystem (src/expression/) powers column computations, predicates, and
UPDATE assignments. It keeps expression trees approachable while demonstrating how they
are evaluated during execution.
Responsibilities
- Store planner-produced expression trees (`Expr`) in a serializable, traversable enum.
- Bind column references, constants, and built-in functions.
- Evaluate expressions against `Tuple`s at runtime, yielding `ScalarValue`.
- Provide type inference and casting so arithmetic/comparison operators remain well-typed.
Directory Layout
| Path | Description | Key Types |
|---|---|---|
| mod.rs | Public API and core enum. | Expr, ExprTrait |
| scalar.rs | Runtime scalar representation + conversions. | ScalarValue, DataType |
| binder.rs | Helpers for the planner/SQL binder. | BoundExpr |
Concepts
Expr Enum
Expresses column refs, literals, comparisons, logical ops, arithmetic, and function
invocations. Each variant implements ExprTrait::evaluate(&self, tuple) and returns a
ScalarValue.
ScalarValue
Unified runtime value across types (int, bigint, bool, decimal, varchar, …). Includes
cast_to(DataType) so results can be coerced to the target column type before writes.
Type Inference
Planner code invokes Expr::data_type(schema) to predict result types. Execution then
casts when needed—e.g., UPDATE t SET a = b + 1 uses the column’s declared type for a.
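A stripped-down sketch of that evaluate-then-cast flow follows. The enums are simplified stand-ins for `Expr`/`ScalarValue` (the real ones cover many more types and `cast_to` is fallible), but the UPDATE example has the same shape.

```rust
// Minimal expression tree: column refs, literals, and addition.
#[derive(Clone, Debug, PartialEq)]
enum Scalar {
    Int32(i32),
    Int64(i64),
}

#[derive(Clone, Copy)]
enum DataType {
    Int32,
    Int64,
}

enum Expr {
    Column(usize), // index into the input tuple
    Literal(Scalar),
    Add(Box<Expr>, Box<Expr>),
}

impl Scalar {
    fn as_i64(&self) -> i64 {
        match self {
            Scalar::Int32(v) => *v as i64,
            Scalar::Int64(v) => *v,
        }
    }
    /// Coerce a result to the target column's declared type before the write.
    fn cast_to(&self, ty: DataType) -> Scalar {
        match ty {
            DataType::Int32 => Scalar::Int32(self.as_i64() as i32),
            DataType::Int64 => Scalar::Int64(self.as_i64()),
        }
    }
}

impl Expr {
    fn evaluate(&self, tuple: &[Scalar]) -> Scalar {
        match self {
            Expr::Column(i) => tuple[*i].clone(),
            Expr::Literal(v) => v.clone(),
            Expr::Add(l, r) => Scalar::Int64(l.evaluate(tuple).as_i64() + r.evaluate(tuple).as_i64()),
        }
    }
}

fn main() {
    // UPDATE t SET a = b + 1, where column a is declared INT and b currently holds 41.
    let tuple = vec![Scalar::Int32(41)];
    let expr = Expr::Add(Box::new(Expr::Column(0)), Box::new(Expr::Literal(Scalar::Int64(1))));
    let new_a = expr.evaluate(&tuple).cast_to(DataType::Int32);
    assert_eq!(new_a, Scalar::Int32(42));
}
```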
Interactions
- Planner – builds `Expr` trees with bound columns; execution reuses them verbatim.
- ExecutionContext – exposes `eval_expr` and `eval_predicate`, wrapping expression evaluation plus boolean coercion (`NULL` becomes false for predicates).
- Optimizer – rules like constant folding traverse `Expr` trees and reuse `ScalarValue` arithmetic helpers.
Teaching Ideas
- Add a simple built-in function (`length(expr)`) to follow the pipeline from parsing to evaluation.
- Implement short-circuiting or full three-valued boolean logic and validate with sqllogictest.
- Instrument `Expr::evaluate` with tracing to visualise expression evaluation inside physical operators.
Query Planner Module
src/plan/ bridges parsed SQL and executable operators. It converts the AST into a
logical plan, applies rewrites (via the optimizer), and finally emits a physical plan
(PhysicalPlan) that the Volcano engine can run.
Responsibilities
- LogicalPlanner – walks the AST, binds table/column names using `PlannerContext`, performs type checking, and builds a `LogicalPlan` tree.
- PlannerContext – exposes catalog lookups plus scope information for CTEs, subqueries, and aliases.
- PhysicalPlanner – lowers an optimized `LogicalPlan` into a tree of Volcano operators.
Directory Layout
| Path | Description | Key Types |
|---|---|---|
| logical_plan.rs | Logical algebra nodes. | LogicalPlan, LogicalExpr, JoinType |
| logical_planner.rs | AST → logical transformation. | LogicalPlanner |
| physical_plan.rs | PhysicalPlan enum definition. | PhysicalPlan, Physical* structs |
| physical_planner.rs | Logical → physical lowering. | PhysicalPlanner |
| planner_context.rs | Catalog/scope abstraction. | PlannerContext |
Workflow
- Name binding – `LogicalPlanner` resolves table + column references, creates `TableReference`s, and validates schemas via the catalog.
- Logical tree – each SQL clause becomes a logical node (FROM → `SeqScan`, WHERE → `Filter`, GROUP BY → `Aggregate`, etc.).
- Physical selection – `PhysicalPlanner` picks concrete algorithms (sequential scan, index scan, nested-loop join, sort, limit…). Because every physical node implements `VolcanoExecutor`, the execution engine can pull tuples immediately.
Interactions
- SQL front-end – provides the AST; helper traits (`NormalizedIdent`, etc.) keep name resolution consistent.
- Catalog – `PlannerContext` relies on it to confirm table/index existence and fetch schemas.
- Optimizer – operates purely on `LogicalPlan`; the planner must emit clean, traversable trees so rules can fire.
- Execution – physical nodes carry `TableReference`, `SchemaRef`, and hints that the execution engine passes to the storage layer.
Teaching Ideas
- Implement a new logical operator (e.g., `LogicalDistinct`) and add the corresponding physical operator to trace the full lifecycle.
- Experiment with early projection inside the logical plan and observe its impact on downstream operators.
- Use `pretty_format_logical_plan`/`physical_plan` dumps to visualise rewrites before and after optimizer passes.
Further reading: The Lifecycle of a Query
The Lifecycle of a Query
When you submit a SQL query to a database, it doesn’t just magically produce a result. The database undertakes a sophisticated, multi-stage process to translate the declarative SQL statement (which describes what data you want) into an imperative, efficient execution plan (which describes how to get that data). This entire process is the responsibility of the Query Planner.
In QuillSQL, this process follows a classic, compiler-like pipeline, which is a cornerstone of modern database architecture as taught in courses like CMU 15-445.
The journey from a SQL string to an executable plan involves several transformations:
SQL String -> AST (Abstract Syntax Tree) -> Logical Plan -> Optimized Logical Plan -> Physical Plan
Let’s break down each stage.
Stage 1: Parsing (SQL -> AST)
The first step is purely syntactic. The raw SQL query string is fed into a parser. QuillSQL uses the excellent sqlparser crate for this. The parser checks if the SQL conforms to valid grammar and, if so, converts it into an Abstract Syntax Tree (AST).
An AST is a direct tree representation of the SQL query’s structure. For example, SELECT id FROM users WHERE age > 30 would be parsed into a tree structure with nodes representing the SELECT clause, the table users, the WHERE clause, and the predicate age > 30.
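If you want to see that tree for yourself, you can call the sqlparser crate directly (QuillSQL's own `parse_sql` wrapper adds span tracking and a few extensions on top of this). The snippet assumes `sqlparser` is available as a dependency, which is the crate the project builds on.

```rust
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = "SELECT id FROM users WHERE age > 30";
    // Returns a Vec<Statement>; each statement is a tree mirroring the SQL structure.
    let ast = Parser::parse_sql(&GenericDialect {}, sql).expect("valid SQL");
    println!("{:#?}", ast[0]);
}
```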
Stage 2: Logical Planning (AST -> Logical Plan)
Next, the LogicalPlanner (plan/logical_planner.rs) walks the AST and converts it into a Logical Plan.
A Logical Plan is a tree of relational algebra operators. It describes the query in terms of high-level data operations, completely independent of how the data is stored or which algorithms will be used. It defines what to do, not how to do it.
Key logical operators in QuillSQL (plan/logical_plan/mod.rs) include:
- `TableScan(users)`: Represents reading the entire `users` table.
- `Filter(predicate: age > 30)`: Represents filtering rows based on a condition.
- `Projection(columns: [id])`: Represents selecting specific columns.
- `Join`: Represents joining two data sources.
- `Aggregate`: Represents a `GROUP BY` operation.
- `Sort`: Represents an `ORDER BY` operation.
For our example query, the initial logical plan might look like this:
```text
Projection(columns=[id])
└── Filter(predicate=age > 30)
    └── TableScan(users)
```
Stage 3: Logical Optimization
Before executing the plan, we have a crucial opportunity to make it more efficient. The LogicalOptimizer (optimizer/logical_optimizer.rs) takes the logical plan and applies a series of transformation rules to produce a new, equivalent logical plan that is expected to be faster.
QuillSQL uses a simple but effective rule-based optimizer. A classic example of such a rule is Predicate Pushdown. Consider this query:
SELECT name FROM (SELECT * FROM users JOIN cities ON users.city_id = cities.id) WHERE users.age > 30;
A naive logical plan would first perform a full JOIN between users and cities and then filter the massive result. Predicate pushdown is a rule that would rewrite the plan to apply the age > 30 filter before the join:
Before Optimization:
```text
Filter(users.age > 30)
└── Join(users.city_id = cities.id)
    ├── TableScan(users)
    └── TableScan(cities)
```
After Optimization (Predicate Pushdown):
```text
Join(users.city_id = cities.id)
├── Filter(users.age > 30)
│   └── TableScan(users)
└── TableScan(cities)
```
By filtering early, we dramatically reduce the number of rows that need to be processed by the expensive Join operator. QuillSQL implements similar rules, such as PushDownLimit, which pushes LIMIT clauses down the tree to reduce the amount of data processed.
Stage 4: Physical Planning (Logical Plan -> Physical Plan)
Finally, the PhysicalPlanner (plan/physical_planner.rs) converts the optimized logical plan into a Physical Plan.
A Physical Plan describes exactly how the query will be executed. It maps each logical operator to a concrete algorithm or implementation.
- A `LogicalPlan::TableScan` becomes a `PhysicalSeqScan` (a sequential scan of the table heap).
- A `LogicalPlan::Filter` becomes a `PhysicalFilter`, which implements the filtering logic.
- A `LogicalPlan::Join` becomes a `PhysicalNestedLoopJoin`. This is where the database commits to a specific join algorithm. A more advanced database might have multiple options (e.g., `PhysicalHashJoin`, `PhysicalSortMergeJoin`) and would use a cost model to choose the best one. QuillSQL currently implements Nested Loop Join.
Each node in the physical plan tree is an executor that the Execution Engine can run. This final plan is what gets executed to produce the query result.
Conclusion
This layered approach—from syntax to a logical representation, then to an optimized logical representation, and finally to a concrete physical execution plan—is fundamental to database design. It provides a clear separation of concerns and, most importantly, creates a dedicated optimization stage, which is the key to achieving high performance on a wide variety of SQL queries.
For Study & Discussion
- Logical vs. Physical: Why is the separation between logical and physical plans so important? What would be the disadvantages of a simpler system that converted the AST directly into a physical plan?
- Join Algorithms: QuillSQL currently only implements `NestedLoopJoin`. What are two other common join algorithms? Describe how they work and in what scenarios they would be more performant than a nested loop join.
- Programming Exercise (Advanced): Implement a `PhysicalHashJoin` operator. This is a significant undertaking that involves: a. Creating a `PhysicalHashJoin` struct that implements the `VolcanoExecutor` trait. b. In the `init()` phase, it should consume the entire “build” side (typically the smaller, right-hand table) and build an in-memory hash table from its rows. c. In the `next()` phase, it should read from the “probe” side (the left-hand table) one row at a time, probe the hash table for matches, and emit the joined tuples. d. Modify the `PhysicalPlanner` to choose `PhysicalHashJoin` instead of `PhysicalNestedLoopJoin` for equi-joins.
- Programming Exercise: Add support for the `UNION ALL` operator. This would involve: a. Adding a `Union` variant to the `LogicalPlan` enum. b. Updating the `LogicalPlanner` to recognize the `UNION` syntax in the AST and create a `LogicalPlan::Union` node. c. Creating a `PhysicalUnion` executor that pulls tuples from its first child until it’s exhausted, and then pulls tuples from its second child.
Optimizer Module
src/optimizer/ contains a lightweight, teaching-friendly rule engine. It rewrites
LogicalPlan trees into cheaper equivalents without requiring a full cost-based
framework.
Responsibilities
- Define the `OptimizerRule` trait (“match → rewrite”).
- Ship built-in rules such as predicate pushdown, projection pruning, and limit pushdown.
- Provide a pipeline (`LogicalOptimizer`) that repeatedly applies rules until reaching a fixpoint, while remaining extensible for future cost models.
Directory Layout
| Path | Description | Key Types |
|---|---|---|
| mod.rs | Optimizer entry point. | LogicalOptimizer |
| rule.rs | Trait + shared helpers. | OptimizerRule |
| rules/* | Concrete rewrites. | PushDownFilter, PushDownLimit, … |
How It Works
- `LogicalOptimizer::optimize(plan)` iterates through the registered rule list.
- Each rule implements `fn apply(&LogicalPlan) -> Option<LogicalPlan>`. Returning `Some` means the rule fired; the pipeline restarts to reach a fixpoint (see the driver sketch after this list).
- Rules are pure functions, which keeps them easy to unit test and reason about.
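The driver itself fits in a few generic lines. This is a hypothetical, minimal sketch showing why an `Option` return is enough to detect the fixpoint; the real `LogicalOptimizer` also caps the number of passes and works on `LogicalPlan` trees.

```rust
// Hypothetical fixpoint driver over Option-returning rules.
trait Rule<P> {
    fn apply(&self, plan: &P) -> Option<P>;
}

fn optimize<P>(mut plan: P, rules: &[&dyn Rule<P>], max_passes: usize) -> P {
    for _ in 0..max_passes {
        let mut changed = false;
        for rule in rules {
            if let Some(next) = rule.apply(&plan) {
                plan = next; // the rule fired: keep the rewritten plan
                changed = true;
            }
        }
        if !changed {
            break; // fixpoint: no rule fired during this pass
        }
    }
    plan
}

fn main() {
    // A toy "plan" (an integer) and rule, just to exercise the driver:
    // halve the value while it is even, stop once nothing changes.
    struct HalveEven;
    impl Rule<i32> for HalveEven {
        fn apply(&self, plan: &i32) -> Option<i32> {
            if plan % 2 == 0 { Some(plan / 2) } else { None }
        }
    }
    let rules: [&dyn Rule<i32>; 1] = [&HalveEven];
    assert_eq!(optimize(40, &rules, 16), 5); // 40 -> 20 -> 10 -> 5
}
```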
Examples:
- PushDownFilter moves filters below scans/joins to reduce input size sooner.
- PushDownLimit applies LIMIT before expensive joins/sorts when safe.
- PruneProjection removes unused columns so execution/storage decode less data.
Extending With Statistics
The optimizer intentionally remains heuristics-only, and the physical planner sticks to
simple sequential scans. For coursework, students can still read TableStatistics from
the catalog to prototype their own cardinality estimates or cost heuristics (e.g., to
experiment with when to prefer an index scan), but no estimator ships in-tree.
Interactions
- LogicalPlan – the optimizer only sees logical nodes; physical/storage layers remain untouched.
- Catalog / Statistics – current rules are heuristic, but `TableStatistics` remains available for students who want to prototype their own cost-based decisions.
- Execution – leaner logical plans translate into simpler physical plans (e.g., predicate pushdown allows `PhysicalSeqScan` to discard rows earlier).
Teaching Ideas
- Implement a new rule (join reordering, constant folding) and use `RUST_LOG=trace` to compare plan dumps before/after.
- Discuss pipeline ordering—swap rule order and observe different outcomes.
- Prototype a tiny cost estimator using row counts from `TableStatistics` to decide on index scans vs sequential scans.
Further reading: Rule-Based Optimization
Rule-Based Optimization
After the LogicalPlanner creates an initial LogicalPlan, it’s passed to the LogicalOptimizer. The initial plan is a direct, syntactically correct translation of the SQL query, but it’s often not the most efficient way to execute it. The optimizer’s job is to transform this plan into an equivalent, but more performant, logical plan.
The Optimizer in QuillSQL
QuillSQL implements a Rule-Based Optimizer. This is a common and powerful approach where the optimizer is equipped with a set of predefined transformation rules. It repeatedly applies these rules to the logical plan tree until no more rules can be applied, or a maximum number of passes is reached.
The main components are:
- `LogicalOptimizer` (optimizer/logical_optimizer.rs): The main driver. It holds a list of rules and contains the logic to recursively walk the plan tree and apply them.
- `LogicalOptimizerRule` trait: An interface that every optimization rule must implement. Its core method is `try_optimize`, which takes a plan node and attempts to return a rewritten, optimized version of that node.
Deep Dive: The PushDownLimit Rule
One of the most classic and effective optimizations is “pushing down” operations as far as possible towards the data source. Let’s examine the PushDownLimit rule (optimizer/rule/push_down_limit.rs) to see this in action.
Consider the following query:
SELECT * FROM users ORDER BY signup_date LIMIT 10;
The Naive Plan
A naive logical plan for this query would be:
```text
Limit(10)
└── Sort(by: signup_date)
    └── TableScan(users)
```
If executed directly, this plan would:
- Scan the entire `users` table.
- Sort the entire table by `signup_date`.
- Finally, discard all but the first 10 rows.
This is incredibly inefficient, especially for a large table, as it involves a massive, memory-intensive sort operation.
The Optimization Rule
The PushDownLimit rule is designed to recognize this specific pattern: a Limit operator directly on top of a Sort operator.
When the optimizer applies this rule, the try_optimize method matches on the Limit node. It inspects its child and sees that it’s a Sort node. The rule then knows it can apply its logic.
The Rewritten Plan
The rule rewrites the plan tree by “pushing” the limit information into the Sort node itself:
```text
Limit(10)
└── Sort(by: signup_date, limit: 10)
    └── TableScan(users)
```
Notice the new limit: 10 property on the Sort node. This seemingly small change has a huge performance impact. When the PhysicalSort operator is created from this logical node, it now knows that it only needs to find the top 10 rows. Instead of performing a full sort, it can use a much more efficient algorithm, like a heap sort (using a min-heap of size 10), to find the top 10 rows in a single pass over the data.
This optimization avoids sorting the entire table, dramatically reducing both CPU and memory consumption.
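The heap-based top-N idea maps directly onto `std::collections::BinaryHeap`. This standalone sketch keeps only the `limit` smallest sort keys instead of sorting everything; the real `PhysicalSort` works on full tuples and an `ORDER BY` expression rather than bare integers.

```rust
use std::collections::BinaryHeap;

/// Return the `limit` smallest keys (e.g. the earliest signup dates), in ascending order.
fn top_n(input: impl IntoIterator<Item = i64>, limit: usize) -> Vec<i64> {
    // Max-heap bounded at `limit` entries: the root is the worst candidate kept so far.
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(limit + 1);
    for key in input {
        heap.push(key);
        if heap.len() > limit {
            heap.pop(); // evict the largest, keeping only the N smallest seen so far
        }
    }
    heap.into_sorted_vec()
}

fn main() {
    let signup_days = vec![731, 12, 485, 3, 990, 57, 214];
    // One pass over the input, O(n log limit) work instead of a full O(n log n) sort.
    assert_eq!(top_n(signup_days, 3), vec![3, 12, 57]);
}
```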
Other Rules
QuillSQL implements other simple but effective rules:
- `EliminateLimit`: Removes a `LIMIT` clause if it provides no value (e.g., `LIMIT NULL`).
- `MergeLimit`: If two `LIMIT` clauses are stacked on top of each other (which can happen after other rule transformations), this rule merges them into a single, more restrictive `LIMIT`.
Conclusion
While QuillSQL’s optimizer is currently rule-based and relatively simple, it demonstrates the fundamental principles of query optimization. By separating the logical representation of a query from its physical execution and applying equivalence-preserving transformations, a database can achieve massive performance gains. More advanced systems build on this with a Cost-Based Optimizer, which uses table statistics to estimate the “cost” of different physical plans (e.g., choosing between a NestedLoopJoin and a HashJoin) and pick the cheapest one.
For Study & Discussion
- Rule Ordering: The `LogicalOptimizer` applies its list of rules in a fixed order for a set number of passes. Can the order in which rules are applied affect the final, optimized plan? Can one rule’s transformation enable another rule to be applied in a subsequent pass?
- Cost-Based vs. Rule-Based: What is the primary limitation of a purely rule-based optimizer? When would a rule-based optimizer make a poor decision that a cost-based optimizer (with accurate statistics) would get right? (Hint: consider join algorithm selection).
- Programming Exercise: Implement the classic Predicate Pushdown rule. Your rule should look for a `Filter` operator whose child is a `Join`. If the filter’s predicate only uses columns from one side of the join, the rule should push the `Filter` node down to that side of the join, below the `Join` node. This is one of the most effective optimizations in any database.
- Programming Exercise: Implement a Constant Folding rule. This rule would traverse expression trees and pre-compute constant expressions. For example:
  - An expression `WHERE age = 10 + 5` would be rewritten to `WHERE age = 15`.
  - An expression `WHERE 1 = 1` would be evaluated to `true`, and a smart optimizer could then potentially eliminate the `WHERE` clause entirely.
Execution Engine
src/execution/ drives PhysicalPlan trees using the Volcano (iterator) model. Every
operator pulls tuples from its children, coordinating closely with transactions,
storage, and expression evaluation.
Core Components
| Component | Role |
|---|---|
| PhysicalPlan | Enum covering all physical operators; each implements VolcanoExecutor. |
| ExecutionContext | Shared context carrying the catalog, TxnContext, storage engine, and expression helpers. |
| TupleStream | Unified scan interface returned by table/index handles. |
Execution Flow
- `ExecutionEngine::execute` calls `init` on the root plan (and recursively on children).
- The engine loops calling `next`, with parents pulling tuples from children.
- `ExecutionContext` supplies transaction snapshots, lock helpers, and expression evaluation per call.
- Once `next` returns `None`, the accumulated results are returned to the caller (CLI, HTTP API, or tests).
Operator Examples
- PhysicalSeqScan – acquires a `table_stream` from the storage engine, uses `ScanPrefetch` for batching, and relies on `TxnContext::read_visible_tuple` for MVCC.
- PhysicalIndexScan – uses `index_stream`, tracks `invisible_hits`, and notifies the catalog when garbage accumulates.
- PhysicalUpdate/PhysicalDelete – call `prepare_row_for_write` to re-validate locks and the latest tuple before invoking `apply_update`/`delete`.
- PhysicalNestedLoopJoin – showcases the parent/child pull loop and acts as a baseline for more advanced joins.
Interactions
- StorageEngine – all data access goes through handles/streams, keeping execution storage-agnostic.
- Transaction – `TxnContext` enforces locking, snapshots, and undo logging; operators never talk to `LockManager` directly.
- Expression – `ExecutionContext::eval_expr`/`eval_predicate` evaluate expressions built by the planner.
- Optimizer/Planner – execution honours the plan as-is; all structural choices happen upstream.
Teaching Ideas
- Implement a new operator (e.g., `PhysicalMergeJoin`) to see how the `ExecutionContext` support generalises.
- Add adaptive prefetching inside `PhysicalSeqScan` to explore iterator hints.
- Enable `RUST_LOG=execution=trace` to watch the `init`/`next` call sequence during a query.
Further reading: The Volcano Execution Model
The Volcano Execution Model
Once the Query Planner has produced an optimized PhysicalPlan, it’s the job of the Execution Engine to run it and produce results. The execution engine is the component that brings the plan to life, interacting with the transaction manager and storage layer to process data.
QuillSQL uses the classic Volcano Model, also known as the Iterator Model. This is a pull-based execution model where each physical operator in the plan tree acts as an iterator that the parent operator can “pull” rows from.
1. The VolcanoExecutor Trait
At the heart of the execution model is the VolcanoExecutor trait (execution/mod.rs). Every physical operator implements this simple trait:
```rust
pub trait VolcanoExecutor {
    fn init(&self, context: &mut ExecutionContext) -> QuillSQLResult<()>;
    fn next(&self, context: &mut ExecutionContext) -> QuillSQLResult<Option<Tuple>>;
    fn output_schema(&self) -> SchemaRef;
}
```
- `init()`: This method is called once at the beginning of execution. It allows an operator to set up its initial state (e.g., a `SeqScan` operator would initialize its table iterator here).
- `next()`: This is the core method. When called, the operator produces its next output tuple. It returns `Some(tuple)` if it has a row, or `None` if it has exhausted its data source. The top-level `ExecutionEngine` simply calls `next()` on the root of the plan tree in a loop until it receives `None`, as sketched below.
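Under simplified stand-in types (the real signatures use `ExecutionContext`, `Tuple`, and `QuillSQLResult` as shown above), the engine-side pull loop is just this:

```rust
// Stand-ins for the real ExecutionContext / Tuple / QuillSQLResult types.
struct ExecutionContext;
struct Tuple;
type Result<T> = std::result::Result<T, String>;

trait VolcanoExecutor {
    fn init(&self, ctx: &mut ExecutionContext) -> Result<()>;
    fn next(&self, ctx: &mut ExecutionContext) -> Result<Option<Tuple>>;
}

/// Pull every row through the root operator; each next() call ripples down the tree.
fn execute(root: &dyn VolcanoExecutor, ctx: &mut ExecutionContext) -> Result<Vec<Tuple>> {
    root.init(ctx)?; // operators open iterators, take table locks, etc.
    let mut rows = Vec::new();
    while let Some(tuple) = root.next(ctx)? {
        rows.push(tuple);
    }
    Ok(rows)
}
```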
2. The ExecutionContext
Notice that both init() and next() take a mutable ExecutionContext. This object is the “world” in which the query runs. It is passed down the operator tree and exposes:
- `Catalog` + `StorageEngine`: Operators call `context.table(&table_ref)` to obtain a `TableBinding`. The binding encapsulates heap/index access (scan, insert, delete, update, prepare-row-for-write) so operators never touch `TableHeap` or `MvccHeap` directly.
- `TxnContext`: Provides the current transaction, MVCC snapshot, and helper methods for row/table locks and visibility checks.
- Expression helpers: `eval_expr`/`eval_predicate` evaluate AST expressions against a tuple without leaking `ScalarValue` plumbing into operators.
Thanks to the binding abstraction, the operator code only expresses “what” should happen (scan/update/delete) while the binding implements “how” (MVCC chain navigation, undo logging, index maintenance).
3. Anatomy of Physical Operators
Data flows up the tree from the leaves (scans) to the root. Let’s see how it works by examining a few key operators.
Leaf Operator: PhysicalSeqScan
A sequential scan sits at the leaf of a plan tree and streams tuples from a table binding.
- `init()`:
  - Acquires an `IntentionShared` lock on the table via `TxnContext`.
  - Requests a `TableBinding` (`context.table(&table_ref)`) and calls `binding.scan(...)` to obtain a `TupleStream`. The binding hides the actual `TableIterator` implementation and any MVCC plumbing.
- `next()`:
  - Pops the next `(rid, meta, tuple)` from its prefetch buffer (which in turn pulls from the binding’s stream).
  - Calls `TxnContext::read_visible_tuple` to perform the MVCC visibility check and acquire the required shared row lock.
  - Returns the tuple if visible; otherwise, keeps looping.
Unary Operator: PhysicalFilter
A filter has one child operator (its input). It implements a WHERE clause.
- `next()`: Its logic is a simple, tight loop (sketched in code after this list):
  - It calls `self.input.next()` to get a tuple from its child.
  - If the child returns `None`, the filter is also exhausted and returns `None`.
  - If it receives a tuple, it evaluates its predicate expression (e.g., `age > 30`) against the tuple.
  - If the predicate evaluates to `true`, it returns the tuple. Otherwise, it loops back to step 1.
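That loop, written against a plain Rust iterator standing in for the child operator (the real `PhysicalFilter` receives an `ExecutionContext` and evaluates an `Expr` predicate instead of a function pointer):

```rust
// Minimal filter over a child iterator; types are simplified stand-ins.
struct Tuple {
    age: i64,
}

struct Filter<I> {
    input: I,
    predicate: fn(&Tuple) -> bool,
}

impl<I: Iterator<Item = Tuple>> Filter<I> {
    fn next(&mut self) -> Option<Tuple> {
        loop {
            match self.input.next() {
                None => return None,                     // child exhausted
                Some(t) if (self.predicate)(&t) => return Some(t),
                Some(_) => continue,                     // predicate false: keep pulling
            }
        }
    }
}

fn main() {
    let rows = vec![Tuple { age: 25 }, Tuple { age: 40 }].into_iter();
    let mut filter = Filter { input: rows, predicate: |t| t.age > 30 };
    assert_eq!(filter.next().map(|t| t.age), Some(40));
    assert!(filter.next().is_none());
}
```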
Binary Operator: PhysicalNestedLoopJoin
A join has two children: a left (outer) and a right (inner).
- `next()`: It implements the classic nested loop join algorithm (see the sketch after this list):
  - Fetch one tuple from the outer (left) child and hold onto it.
  - Enter a loop: fetch tuples one by one from the inner (right) child.
  - For each inner tuple, combine it with the held outer tuple and evaluate the join condition. If it matches, return the combined tuple.
  - When the inner child is exhausted, rewind it by calling `self.right_input.init()` again.
  - Go back to step 1 to fetch the next tuple from the outer child.
  - Repeat until the outer child is also exhausted.
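The same algorithm over two in-memory slices, as a hedged sketch; the real operator rewinds its right child with `init()` and pulls one tuple per `next()` call rather than materialising vectors.

```rust
/// Nested-loop join: for every outer row, scan the whole inner side.
fn nested_loop_join<L: Clone, R: Clone>(
    outer: &[L],
    inner: &[R],
    matches: impl Fn(&L, &R) -> bool,
) -> Vec<(L, R)> {
    let mut out = Vec::new();
    for o in outer {                  // step 1: hold one outer tuple
        for i in inner {              // steps 2-4: a full pass over the inner side
            if matches(o, i) {
                out.push((o.clone(), i.clone()));
            }
        }                             // inner exhausted: "rewind" by restarting the loop
    }
    out
}

fn main() {
    let users = [(1, "alice"), (2, "bob")];     // (city_id, name)
    let cities = [(1, "lisbon"), (3, "oslo")];   // (id, city)
    let joined = nested_loop_join(&users, &cities, |u, c| u.0 == c.0);
    assert_eq!(joined, vec![((1, "alice"), (1, "lisbon"))]);
}
```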
4. Putting It All Together
Consider the query SELECT name FROM users WHERE age > 30. The physical plan is Projection -> Filter -> SeqScan.
- The `ExecutionEngine` calls `next()` on the `Projection` operator.
- The `Projection` operator needs a tuple, so it calls `next()` on its child, `Filter`.
- The `Filter` operator needs a tuple, so it calls `next()` on its child, `SeqScan`.
- The `SeqScan` operator asks its `TableBinding` for the next tuple, and after the MVCC check finds a visible tuple for a user with `age = 25`.
- `SeqScan` returns this tuple up to `Filter`.
- `Filter` evaluates `age > 30` on the tuple. It’s false, so it loops, calling `SeqScan.next()` again.
- `SeqScan` pulls another visible tuple (this time `age = 40`, `name = 'Alice'`) through the binding.
- `SeqScan` returns this tuple up to `Filter`.
- `Filter` evaluates `age > 30`. It’s true! It returns the tuple for Alice up to `Projection`.
- `Projection` takes the full tuple for Alice, creates a new tuple containing only the `name` column (`'Alice'`), and returns this new tuple as the result.
This process repeats, with tuples flowing up the tree one at a time, until the SeqScan operator runs out of pages and returns None, which then propagates up the tree, signaling the end of execution.
For Study & Discussion
- Push vs. Pull Models: The Volcano model is a “pull-based” model. An alternative is a “push-based” model, where operators push their results to their parents as soon as they are ready. What are the potential advantages and disadvantages of each model, particularly concerning cache efficiency and control flow?
- Blocking vs. Non-Blocking Operators: Some operators, like `PhysicalFilter`, can produce their first output row as soon as they receive their first input row. These are non-blocking. Other operators, like `PhysicalSort`, must consume their entire input before they can produce even a single row of output. These are blocking. What is the impact of blocking operators on query latency and memory usage?
- Programming Exercise: The current `PhysicalNestedLoopJoin` is simple but can be inefficient as it re-scans the entire inner table for every outer row. Implement a `PhysicalBlockNestedLoopJoin` operator. This version would read a block (a small batch) of tuples from the outer table into an in-memory buffer, and then iterate through the inner table once for that entire block. This can significantly reduce the number of times the inner table needs to be scanned.
- Programming Exercise: Implement the `PhysicalLimit` operator. Its `next()` method should: a. Keep an internal counter. b. If the counter is less than the `offset`, pull and discard tuples from its child. c. If the counter is between `offset` and `offset + limit`, pull a tuple from its child and return it. d. Once the limit is reached, it should stop pulling from its child and return `None` for all subsequent calls. This is important for efficiency, as it stops the execution of the entire sub-tree below it.
Transaction Module
src/transaction/ enforces the Atomicity and Isolation parts of ACID. It combines MVCC
with strict two-phase locking so reads and writes can proceed concurrently without
violating correctness.
Main Components
| Type | Role |
|---|---|
| TransactionManager | Creates/commits/aborts transactions, assigns txn & command ids, coordinates WAL. |
| Transaction | Stores state, held locks, undo chain, and cached snapshot. |
| TxnContext / TxnRuntime | Execution-time wrapper exposing MVCC + locking helpers. |
| LockManager | Multi-granularity locking (IS/IX/S/SIX/X) with deadlock detection. |
| TransactionSnapshot | Tracks xmin/xmax/active_txns for visibility checks. |
Workflow
- SessionContext calls TransactionManager::begin to create a transaction.
- Each SQL statement builds a TxnRuntime, yielding a fresh command id and snapshot.
- Operators call TxnContext::lock_table / lock_row to obey strict 2PL.
- TableHandle::insert/delete/update records undo, acquires locks, and emits WAL via TxnContext.
- Commit: write a Commit record → flush depending on synchronous_commit → release locks.
- Abort: walk the undo list, write CLRs, restore heap/index state, release locks.
MVCC Details
- TupleMeta stores inserting/deleting txn ids and command ids.
- read_visible_tuple checks snapshots and, if needed, rewinds to the latest visible version.
- Isolation levels:
  - Read Uncommitted – minimal snapshot caching.
  - Read Committed – refresh snapshot each command to avoid dirty reads.
  - Repeatable Read / Serializable – capture the snapshot once; RR releases shared locks at statement end, Serializable holds them to commit to avoid phantoms.
- UPDATE skips versions created by the same (txn_id, command_id) to avoid looping back over freshly inserted tuples.
Locking
- Multi-granularity hierarchy: table-level IS/IX/S/SIX/X plus row-level S/X.
- Deadlock detection: LockManager maintains a wait-for graph and periodically chooses a victim (usually the longest waiter).
- Release policy: exclusive/intent locks stay until commit; RR drops shared row locks at statement end, Serializable waits until commit.
Interactions
- ExecutionContext – all helpers (lock acquisition, visibility checks, undo logging) are exposed here, so physical operators never touch LockManager directly.
- StorageEngine – handles call TxnContext before mutating heaps/indexes; MVCC metadata lives in TupleMeta. Deletes and updates now push the affected index keys into the undo chain so heap/index WAL stay in lockstep.
- Recovery – Begin/Commit/Abort records emitted here drive ARIES undo/redo.
- Background – MVCC vacuum reads TransactionManager::oldest_active_txn() to compute safe_xmin.
Teaching Ideas
- Change DatabaseOptions::default_isolation_level and compare SELECT behaviour under RC vs RR.
- Write a unit test that deadlocks two transactions and watch LockManager pick a victim.
- Implement statement-level snapshot refresh or Serializable Snapshot Isolation (SSI) as an advanced exercise.
Lab Walkthrough (à la CMU 15-445)
- Warm-up – Start two sessions, run BEGIN; SELECT ...; under RC vs RR, and trace which snapshot TxnRuntime installs by logging txn.current_command_id().
- MVCC visibility – Extend the transaction_tests.rs suite with a scenario where txn1 updates a row while txn2 reads it. Instrument TupleMeta printing so students see how (insert_txn_id, delete_txn_id) change as versions are linked.
- Undo tracing – Force an abort after a multi-index UPDATE. Watch the undo stack entries unfold: Insert removes the new version + index entries, Delete restores the old version + keys. Map each step to the WAL records that are written.
- Crash drill – Add panic!() right after TransactionManager::commit is called but before locks are released. Reboot, run recovery, and inspect the loser list; students can connect the dots between undo actions, CLRs, and ARIES theory.
Further reading: MVCC and 2PL
MVCC and 2PL
Of the four ACID properties, Isolation is often the most complex to implement. It ensures that concurrent transactions do not interfere with each other, making it appear as if each transaction is executing sequentially, one after another. Without proper isolation, a database would suffer from concurrency-related anomalies like dirty reads, non-repeatable reads, and phantom reads.
Databases typically use two main strategies for concurrency control:
- Pessimistic Concurrency Control: Assumes conflicts are likely and prevents them from happening by using locks. The most common protocol is Two-Phase Locking (2PL).
- Optimistic Concurrency Control: Assumes conflicts are rare. Transactions proceed without locking, and the database validates at commit time that no conflicts occurred. A popular variant is Multi-Version Concurrency Control (MVCC).
QuillSQL, like many modern relational databases (e.g., PostgreSQL, Oracle), implements a powerful hybrid model that combines MVCC with 2PL.
1. MVCC: Reading without Blocking
The core idea of MVCC is “writers don’t block readers, and readers don’t block writers.” This is achieved by never overwriting data in-place. When a row is updated, the database creates a new version of that row, preserving the old one.
Version Chains and TupleMeta
As discussed in the Storage Engine chapter, every tuple on disk has associated metadata (TupleMeta) that includes:
- insert_txn_id: The ID of the transaction that created this version.
- delete_txn_id: The ID of the transaction that marked this version as deleted.
- prev_version / next_version: Pointers (RIDs) that form a linked list of versions for a single logical row, called the version chain.
Transaction Snapshots
When a transaction begins, it asks the TransactionManager for a snapshot of the database state. This snapshot, defined in transaction/mvcc.rs, contains three key pieces of information:
- xmin: The oldest active transaction ID at the time of the snapshot. Any transaction with an ID less than xmin is guaranteed to be either committed or aborted.
- xmax: The next transaction ID to be assigned. Any transaction with an ID greater than or equal to xmax was not yet started when the snapshot was taken.
- active_txns: A list of all other transaction IDs that were active when the snapshot was taken.
The Visibility Check
When a transaction scans the database, for every tuple version it encounters, it performs a visibility check using its snapshot. The logic in MvccSnapshot::is_visible determines if the version should be “seen” by the current transaction. In simplified terms, a tuple version is visible if:
- Its insert_txn_id belongs to a transaction that was already committed before our snapshot was taken.
- AND its delete_txn_id is either not set, OR it belongs to a transaction that was not yet committed when our snapshot was taken.
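A simplified sketch of such a check; the field names are illustrative and the own-transaction and command-id cases are intentionally omitted:

```rust
struct Snapshot {
    xmin: u64,             // oldest active txn at snapshot time (not used in this simplified check)
    xmax: u64,             // next txn id to be assigned
    active_txns: Vec<u64>, // txns still running when the snapshot was taken
}

impl Snapshot {
    /// A txn looks committed to this snapshot if it started before xmax
    /// and was not active when the snapshot was taken.
    fn committed_before(&self, txn_id: u64) -> bool {
        txn_id < self.xmax && !self.active_txns.contains(&txn_id)
    }

    fn is_visible(&self, insert_txn: u64, delete_txn: Option<u64>) -> bool {
        // The inserting transaction must appear committed to us...
        if !self.committed_before(insert_txn) {
            return false;
        }
        // ...and the deleting transaction (if any) must not.
        match delete_txn {
            None => true,
            Some(d) => !self.committed_before(d),
        }
    }
}

fn main() {
    let snap = Snapshot { xmin: 5, xmax: 10, active_txns: vec![7] };
    assert!(snap.is_visible(6, None));     // inserted by a committed txn
    assert!(!snap.is_visible(7, None));    // inserter still active
    assert!(snap.is_visible(6, Some(7)));  // deleter not yet committed for us
    assert!(!snap.is_visible(6, Some(8))); // deleter committed before the snapshot
}
```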
This mechanism elegantly solves several concurrency problems. Since a reader transaction only ever sees a consistent snapshot of the database, it is completely immune to changes being made by other concurrent writer transactions. This prevents dirty reads and non-repeatable reads.
2. Two-Phase Locking (2PL): Preventing Write-Write Conflicts
While MVCC is excellent for read-write conflicts, it does not, by itself, prevent two transactions from trying to modify the same logical row at the same time (a write-write conflict). This is where locking comes in.
QuillSQL implements a strict Two-Phase Locking protocol via the LockManager (transaction/lock_manager.rs).
The Two Phases
- Growing Phase: The transaction can acquire new locks as it executes.
- Shrinking Phase: Once the transaction releases its first lock, it cannot acquire any new locks.
In practice, QuillSQL uses Strict 2PL, where the shrinking phase is delayed until the transaction commits or aborts. All locks are held until the very end.
Hierarchical Locking (Intention Locks)
To be efficient, the LockManager uses a multi-granularity, hierarchical locking scheme. Before a transaction can take a fine-grained lock on a row (a Shared or Exclusive lock), it must first acquire a coarser-grained intention lock on the table.
- IntentionShared (IS): Signals the intent to read rows from the table.
- IntentionExclusive (IX): Signals the intent to modify rows in the table.
This lets a transaction that wants to lock an entire table detect conflicts cheaply: it only has to inspect the table-level locks instead of every row lock. For example, a SELECT * FROM users FOR UPDATE (which needs an X lock on the table) will be blocked if another transaction already holds an IX lock on users.
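For reference, the compatibility rules between these modes can be written as a small function. The matrix below is the standard textbook one and is not copied from QuillSQL's LockManager:

```rust
#[derive(Clone, Copy)]
enum LockMode { IS, IX, S, SIX, X }

fn compatible(held: LockMode, requested: LockMode) -> bool {
    use LockMode::*;
    match (held, requested) {
        // X conflicts with everything; SIX is only compatible with IS;
        // S conflicts with IX/SIX/X; IX conflicts with S/SIX/X.
        (IS, X) | (X, IS) => false,
        (IX, S) | (S, IX) => false,
        (IX, SIX) | (SIX, IX) => false,
        (IX, X) | (X, IX) => false,
        (S, SIX) | (SIX, S) => false,
        (S, X) | (X, S) => false,
        (SIX, SIX) | (SIX, X) | (X, SIX) => false,
        (X, X) => false,
        _ => true, // IS/IS, IS/IX, IS/S, IS/SIX, IX/IX, S/S are compatible
    }
}

fn main() {
    assert!(compatible(LockMode::IX, LockMode::IS)); // row writer vs. intent reader
    assert!(!compatible(LockMode::IX, LockMode::X)); // a row writer blocks a table X lock
}
```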
Deadlock Detection
When two or more transactions are waiting for each other to release locks in a circular chain, a deadlock occurs. The LockManager detects this by building a waits-for graph. When a transaction T1 has to wait for a lock held by T2, an edge T1 -> T2 is added to the graph. If adding an edge creates a cycle, a deadlock is detected, and one of the transactions (the victim) is immediately aborted to break the cycle.
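A compact illustration of cycle detection on such a graph, independent of the actual LockManager code; an edge t1 -> t2 means "t1 waits for a lock held by t2":

```rust
use std::collections::{HashMap, HashSet};

fn has_cycle(graph: &HashMap<u64, Vec<u64>>) -> bool {
    fn dfs(node: u64, graph: &HashMap<u64, Vec<u64>>,
           visiting: &mut HashSet<u64>, done: &mut HashSet<u64>) -> bool {
        if done.contains(&node) { return false; }
        if !visiting.insert(node) { return true; } // back edge => cycle
        for &next in graph.get(&node).map(|v| v.as_slice()).unwrap_or(&[]) {
            if dfs(next, graph, visiting, done) { return true; }
        }
        visiting.remove(&node);
        done.insert(node);
        false
    }
    let mut visiting = HashSet::new();
    let mut done = HashSet::new();
    graph.keys().any(|&n| dfs(n, graph, &mut visiting, &mut done))
}

fn main() {
    let mut g = HashMap::new();
    g.insert(1, vec![2]); // T1 waits for T2
    g.insert(2, vec![1]); // T2 waits for T1 => deadlock, abort a victim
    assert!(has_cycle(&g));
}
```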
3. The Hybrid Model: MVCC + 2PL in Action
In QuillSQL, these two mechanisms work in concert:
- A reader (SELECT) transaction acquires an MVCC snapshot. It uses this snapshot to determine visibility. It only needs to acquire Shared (S) locks on the rows it reads to prevent other transactions from modifying them, thus ensuring repeatable reads in higher isolation levels.
- A writer (UPDATE, DELETE) transaction must acquire an Exclusive (X) lock on the specific row it intends to modify. Once the lock is granted, it knows no other writer can interfere. It can then safely create a new tuple version as part of the MVCC protocol.
This hybrid approach provides the best of both worlds: reads are fast and non-blocking, while write-write conflicts are safely prevented by the locking protocol.
For Study & Discussion
- Isolation Levels: QuillSQL supports multiple SQL isolation levels. How does the behavior of MVCC snapshots and 2PL change between ReadCommitted and RepeatableRead? In ReadCommitted, a transaction gets a new snapshot for every statement, whereas in RepeatableRead, it uses the same snapshot for the entire transaction. What concurrency anomalies does this difference prevent?
- Phantom Reads: Even with MVCC and row-level 2PL, a RepeatableRead transaction can suffer from phantom reads. Imagine T1 runs SELECT COUNT(*) FROM users WHERE age > 30. Then, T2 inserts a new user with age = 40 and commits. If T1 runs its SELECT query again, it will see a new "phantom" row that wasn't there before. How can a database prevent this to achieve the Serializable isolation level? (Hint: research predicate locking and index-range locking).
- Deadlock Handling: QuillSQL detects deadlocks by building a waits-for graph and aborting a transaction. What is an alternative strategy for handling deadlocks? For example, what are the pros and cons of using lock timeouts instead of cycle detection?
- Programming Exercise (Advanced): A full implementation of Serializable isolation often requires index-range locking to prevent phantoms. Extend the LockManager to support locking a range of keys within a B+Tree index. This would require a new lock type and a way to check for overlapping ranges. When a SELECT ... WHERE age > 30 query runs, it would place a shared lock on the (30, +∞) range in the index on the age column, preventing any other transaction from inserting a new user with an age in that range.
Storage Engine
The storage engine persists relational data, covering heap files, indexes, page formats, and the handles exposed to execution. Understanding this layer is key to reasoning about performance, MVCC, and recovery.
Responsibilities
- Manage TableHeap insert/delete/update paths and their MVCC metadata.
- Maintain indexes (see the Index module for details).
- Expose the StorageEngine trait so execution can fetch TableHandle / IndexHandle instances per table.
- Provide TupleStream so sequential and index scans share a unified interface.
Directory Layout
| Path | Purpose | Key Types |
|---|---|---|
| engine.rs | Default engine plus handle definitions. | StorageEngine, TableHandle, TupleStream |
| table_heap/ | Heap storage + MVCC logic. | TableHeap, MvccHeap |
| index/ | B+Tree implementation. | BPlusTreeIndex |
| page/ | Page, RID, tuple metadata. | Page, RecordId, TupleMeta |
| tuple/ | Row encoding and projection helpers. | Tuple |
| disk_manager.rs | File layout and page I/O. | DiskManager |
| disk_scheduler.rs | io_uring-backed async scheduler. | DiskScheduler |
Core Abstractions
StorageEngine Trait
#![allow(unused)]
fn main() {
pub trait StorageEngine {
fn table(&self, catalog: &Catalog, table: &TableReference)
-> QuillSQLResult<Arc<dyn TableHandle>>;
fn indexes(&self, catalog: &Catalog, table: &TableReference)
-> QuillSQLResult<Vec<Arc<dyn IndexHandle>>>;
}
}
The default implementation wraps the row-oriented heap + B+Tree combo, but the trait is ready for column stores, remote storage, or async engines.
TableHandle
Offers full_scan(), insert, delete, update, and
prepare_row_for_write. MVCC, undo, and locking concerns live here so execution operators
only describe intent. Every delete/update now receives the table’s index handles so
HeapTableHandle can delete or re-insert keys in tandem with heap tuples—exactly the
behaviour CMU 15-445’s buffer/heap projects walk you through.
TupleStream
Minimal iterator that returns (RecordId, TupleMeta, Tuple) triples. Index scans use
IndexScanRequest to describe ranges.
Interactions
- Execution – ExecutionContext::table_stream / index_stream delegate to handles.
- Transaction – Handle methods call into TxnContext to acquire locks, record undo, and emit WAL.
- Buffer Manager – TableHeap / BPlusTreeIndex access pages through the shared buffer pool.
- Recovery – Heap/index mutations generate WAL records (HeapInsert, HeapDelete, IndexInsert, …) that ARIES replays.
- Background – MVCC vacuum and index cleanup obtain handles and iterate tuples via the same abstractions as foreground scans.
Teaching Ideas
- Implement a toy columnar handle to show how the execution engine can stay agnostic to storage layout.
- Extend the TableHandle::full_scan / TableIterator plumbing to accept projection hints so students can experiment with column pruning.
- Enable RUST_LOG=storage::table_heap=trace and trace MVCC version chains as updates occur.
- Follow the CMU 15-445 Lab 2 flow: instrument TableBinding::delete to print every RID/key pair, run an UPDATE with multiple indexes, and confirm the WAL stream contains the matching HeapInsert/HeapDelete + IndexLeafInsert/IndexLeafDelete entries.
Further reading: Disk I/O, Page & Tuple Layout, Table Heap & MVCC
Disk I/O — Scheduler, io_uring Data Pages & WAL Runtime
1. Architecture
- Request Path: foreground components enqueue DiskRequest objects via DiskScheduler::{schedule_read, schedule_write, …}. A dispatcher thread drains the global channel and distributes work round-robin to N io_uring workers. Each worker owns its own ring and file-descriptor cache, so once a request is forwarded, execution proceeds entirely off the foreground thread.
- Stable APIs: schedule_read(page_id), schedule_write(page_id, Bytes), schedule_read_pages(Vec<PageId>), schedule_allocate(), schedule_deallocate(page_id) — every call returns a channel the caller can block on or poll.
- Batch Reads: ReadPages fans out per-page SQEs while a shared BatchState tracks completions. Even if the kernel completes I/O out of order, the caller receives a Vec<BytesMut> that preserves the original page order.
2. WAL Runtime (buffered I/O)
- Dedicated WAL runtime threads handle sequential WAL appends/reads using buffered I/O. They now keep a per-thread cache of open segment files, eliminating repeated open()/close() on every log record.
- Worker count defaults to max(1, available_parallelism / 2) but is tunable through IOSchedulerConfig.
- Optional sync on a request triggers sync_data/fdatasync so WalManager can honour synchronous commit or checkpoint barriers. Data pages stay on the io_uring dataplane; WAL always uses buffered writes.
3. io_uring Backend (Linux)
- Each worker owns an IoUring with configurable queue_depth, optional SQPOLL idle timeout, and a pool of registered fixed buffers sized to PAGE_SIZE. Workers submit SQEs asynchronously and drain CQEs in small batches to keep the ring warm.
- Read batching relies on shared BatchState instances (Rc<RefCell<_>>) so multi-page callers see ordered results without blocking the kernel on serialization.
- Writes keep their payload alive until completion; if a fixed buffer slot is available we reuse it, otherwise we fall back to heap buffers. A companion WriteState tracks an optional fdatasync so the caller still observes exactly one Result<()> once all CQEs land.
- Errors (short read/write, errno) are normalised into QuillSQLError values that flow back on the original channel.
4. Configuration
config::IOSchedulerConfig controls:
- workers: number of io_uring workers (default = available parallelism).
- wal_workers: WAL runtime threads (default workers / 2).
- iouring_queue_depth, iouring_fixed_buffers, iouring_sqpoll_idle_ms.
- fsync_on_write: whether data-page writes also issue fdatasync (WAL sync is managed separately by WalManager).
5. Concurrency & Safety
- Worker-local file descriptors plus positional I/O remove shared mutable state on the hot path. The new per-worker handle cache further reduces syscall overhead.
- Shutdown sequence: enqueue Shutdown, dispatcher forwards it to every worker, each worker drains outstanding SQEs/CQEs, and finally dispatcher + workers are joined.
- BufferPool and TableHeap integrate via the same scheduler channels; inflight guards prevent duplicate page fetches even when multiple scans touch adjacent pages.
6. Performance Notes
- Random page access benefits from fewer syscalls and deeper outstanding queue depth than the blocking fallback.
- Only the io_uring backend currently ships (Linux x86_64). A portable fallback remains future work.
- For large sequential scans, rely on the buffer pool's sequential access pattern or add a custom iterator on top of ReadPages if you want to experiment with direct I/O.
7. Future Work
- Queue-depth aware scheduling and CQE bulk harvesting.
- Optional group commit (aggregate writes, single fsync) behind configuration.
- Metrics hooks (queue depth, submit/complete throughput, latency percentiles, error codes).
- Cross-platform fallback backend and richer prioritisation/throttling policies.
- Control-plane knobs for throttling individual background workers.
Page and Tuple Layout
1. The Page: The Atomic Unit of I/O
A database file is not treated as one continuous stream of data. Instead, it is broken down into fixed-size blocks called pages. A page is the atomic unit of transfer between the disk and the in-memory Buffer Pool. Whenever the database needs to read a piece of data (like a single row), it must load the entire page containing that data into memory.
In QuillSQL, the page size is a constant defined at quill-sql/src/buffer/mod.rs:
PAGE_SIZE: 4096 bytes (4 KB)
This fixed-size approach simplifies buffer management and allows for efficient, aligned I/O operations, especially when using Direct I/O to bypass the OS cache.
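Because pages are fixed-size, locating a page in a data file is pure arithmetic. A tiny illustration, with a hypothetical helper function:

```rust
// With 4 KiB pages, a page's byte offset in the file is page_id * PAGE_SIZE.
const PAGE_SIZE: usize = 4096;

fn page_offset(page_id: u64) -> u64 {
    page_id * PAGE_SIZE as u64
}

fn main() {
    assert_eq!(page_offset(0), 0);
    assert_eq!(page_offset(3), 12_288); // the fourth page starts 12 KiB into the file
}
```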
2. TablePage: The Slotted Page Layout
While a page is a generic 4KB block of bytes, pages that store actual table data are structured in a specific way. QuillSQL uses a classic Slotted Page layout, which is a core concept in database implementation (as taught in CMU 15-445).
A TablePage is organized into three main parts:
<------------------------------ 4KB ------------------------------>
+----------------+-----------------+-----------------+--------------+
| Page Header | Slot Array | Free | Tuple |
| (grows ->) | (grows ->) | Space | Data |
| | | | (<- grows) |
+----------------+-----------------+-----------------+--------------+
- Page Header (TablePageHeader): Located at the beginning of the page. It contains metadata about the page itself.
- Slot Array (tuple_infos): An array of TupleInfo structs that grows from after the header. Each entry in this array acts as a "pointer" or "directory entry" for a tuple on the page.
- Tuple Data: The actual raw data of the tuples is stored starting from the end of the page, growing backwards towards the middle.
This design has a key advantage: it decouples a tuple’s logical identifier from its physical location on the page.
The RecordId (RID)
A specific tuple is uniquely identified by a RecordId (RID). The RID is a stable pointer composed of two parts:
- page_id: The ID of the page where the tuple resides.
- slot_num: The index into the Slot Array on that page.
So, RID = (page_id, slot_num).
When the database needs to delete a tuple or if a variable-length tuple is updated and grows in size, the tuple’s data might need to be moved within the page (for compaction). In a slotted page design, we only need to update the offset in the corresponding slot array entry. The tuple’s RID (page_id, slot_num) remains unchanged. This prevents a cascade of updates to all secondary indexes that might be pointing to that tuple.
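A sketch of the identifier itself; the field and type names here are illustrative:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct RecordId {
    page_id: u64,
    slot_num: u32,
}

fn main() {
    let rid = RecordId { page_id: 7, slot_num: 3 };
    // Even if the tuple's bytes move within page 7 during compaction,
    // slot 3 still points at it, so this RID stays valid.
    println!("{:?}", rid);
}
```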
TablePageHeader and Slot (TupleInfo)
Let’s look at the physical layout, as defined in storage/codec/table_page.rs:
- TablePageHeader:
  - lsn: The Log Sequence Number of the last change made to this page, crucial for WAL recovery.
  - next_page_id: The ID of the next page in this table, forming a linked list of pages.
  - num_tuples: The number of active tuples on the page.
  - num_deleted_tuples: The number of "dead" or deleted tuples.
  - tuple_infos: The slot array itself.
- TupleInfo (a single slot in the array):
  - offset: The byte offset from the beginning of the page where the tuple's data begins.
  - size: The size of the tuple's data in bytes.
  - meta: A nested TupleMeta struct containing critical information for concurrency control.
For Study & Discussion
- Layout Trade-offs: What is the main benefit of having the tuple data grow from the end of the page backwards, while the header and slot array grow from the beginning forwards? What happens when they meet?
- Record ID Stability: Why is it so important that a tuple's RecordId does not change even if the tuple's physical data is moved within the page? What would break if the RID was just a direct byte offset?
- Large Objects: The current design assumes a tuple fits entirely on one page. How would you modify this page layout to support tuples that are larger than 4KB (e.g., a long blog post stored in a VARCHAR column)? Research how systems like PostgreSQL handle this with their "TOAST" (The Oversized-Attribute Storage Technique) mechanism.
- Programming Exercise: Implement a defragment() method for the TablePage. After several insertions and deletions, the free space on a page can become fragmented into small, unusable chunks. This method should reorganize the page by moving the existing tuples to be contiguous, creating a single, large block of free space. Remember to update the offset in each TupleInfo slot after moving its corresponding tuple data!
Table Heap and MVCC
The TableHeap (storage/table_heap.rs) is the component responsible for managing the collection of pages that belong to a single table. While the TablePage defines the layout within a single page, the TableHeap provides the high-level API for inserting, updating, and deleting tuples, making it central to the implementation of MVCC.
TableBinding & StorageEngine
From the executor’s perspective, heap access is routed through storage::engine::TableBinding. A binding is produced by the StorageEngine (consulting the catalog) and carries:
- An Arc<TableHeap> for physical page access.
- An MvccHeap wrapper that maintains version chains.
- All indexes defined on the table.
The binding exposes ergonomic methods such as scan, index_scan, insert, update, delete, and prepare_row_for_write. Each method internally uses the TableHeap + MvccHeap pair, appends the appropriate logical WAL record (HeapInsert/HeapDelete), and keeps indexes in sync. This keeps physical operators extremely small and makes it easy to swap out the storage engine for experiments.
Tuple Serialization
A logical Tuple struct in memory must be converted into a byte array to be stored on a page. This process is handled by storage/codec/tuple.rs.
The serialized format of a tuple consists of two parts:
- Null Bitmap: A compact bitmap at the beginning of the tuple data. Each bit corresponds to a column in the schema; if the bit is 1, the column's value is NULL. This avoids storing any data for null fields.
- Attribute Data: The actual data for all non-null columns, serialized one after another.
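A simplified sketch of this layout, assuming only fixed-width i64 columns for brevity (the real codec handles variable-length types):

```rust
// One bit per column (1 = NULL), followed by the bytes of the non-null columns.
fn serialize_row(values: &[Option<i64>]) -> Vec<u8> {
    let bitmap_len = (values.len() + 7) / 8;
    let mut out = vec![0u8; bitmap_len];
    for (i, v) in values.iter().enumerate() {
        if v.is_none() {
            out[i / 8] |= 1 << (i % 8); // mark column i as NULL
        }
    }
    for v in values.iter().flatten() {
        out.extend_from_slice(&v.to_le_bytes()); // only non-null columns store data
    }
    out
}

fn main() {
    let row = serialize_row(&[Some(42), None, Some(7)]);
    // 1 bitmap byte (0b0000_0010) + two 8-byte integers.
    assert_eq!(row.len(), 1 + 16);
}
```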
TupleMeta: The Heart of MVCC
The most important part of a tuple’s on-disk metadata is the TupleMeta struct, which is stored directly within the TupleInfo slot in the page header. This is the heart of QuillSQL’s Multi-Version Concurrency Control (MVCC) implementation.
- insert_txn_id: The ID of the transaction that created this version of the tuple.
- delete_txn_id: The ID of the transaction that "deleted" this version. (A value of 0 or INVALID means it's not deleted).
- is_deleted: A boolean flag indicating the tuple version is considered deleted.
- prev_version: Option<RecordId>: A link (RID) to the previous version of this logical row.
- next_version: Option<RecordId>: A link (RID) to the next version of this logical row.
These fields allow QuillSQL to maintain a version chain for each logical row. When a row is updated, the TableHeap’s mvcc_update method is called, which, instead of overwriting the data, performs the following steps:
1. Creates a new tuple version with the new data by calling insert_tuple.
2. Sets the prev_version pointer of this new version to the RID of the old version.
3. Updates the next_version pointer of the old version to point to the new version.
4. Sets the delete_txn_id on the old version to mark it as "dead".
A transaction can then traverse this chain and use the transaction IDs in the TupleMeta to determine which version of a row is visible to it based on its own transaction ID and isolation level, thus achieving transaction isolation without long-held read locks.
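The following condensed sketch shows the version-chain stitching with simplified metadata; it mirrors the steps above but is illustrative, not QuillSQL's implementation:

```rust
#[derive(Clone, Copy, Debug)]
struct RecordId { page_id: u64, slot_num: u32 }

#[derive(Clone, Debug)]
struct TupleMeta {
    insert_txn_id: u64,
    delete_txn_id: Option<u64>,
    prev_version: Option<RecordId>,
    next_version: Option<RecordId>,
}

struct Version { rid: RecordId, meta: TupleMeta, data: Vec<u8> }

/// Create a new version and stitch it into the chain instead of overwriting.
fn mvcc_update(old: &mut Version, new_rid: RecordId, new_data: Vec<u8>, txn_id: u64) -> Version {
    let new_version = Version {
        rid: new_rid,
        meta: TupleMeta {
            insert_txn_id: txn_id,
            delete_txn_id: None,
            prev_version: Some(old.rid), // new version points back to the old one
            next_version: None,
        },
        data: new_data,
    };
    old.meta.next_version = Some(new_rid); // old version points forward to the new one
    old.meta.delete_txn_id = Some(txn_id); // old version is now "dead" for later snapshots
    new_version
}
```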
Logical Logging (Redo + Undo)
Every heap mutation emits a logical WAL record before the page frame is dirtied:
| Operation | Payload | Redo | Undo |
|---|---|---|---|
| Insert | HeapInsertPayload | Re-insert tuple bytes into the slot | Remove the tuple (implicit via delete logic) |
| Delete | HeapDeletePayload (new tombstone + old tuple) | Mark slot deleted and update version chain | Recreate the prior tuple image |
This makes crash recovery straightforward: redo simply replays the payload, and undo uses the “old” portion of the payload even if the page is currently inconsistent. Physical PageWrite/PageDelta records are still used for metadata-heavy pages (meta, freelist, etc.) to guarantee a consistent starting point, but ordinary table DML lives entirely in these logical heap records.
For Study & Discussion
- Version Chain Traversal: Imagine a transaction with ID T10 needs to read a row. It finds a version of the row that was inserted by T5 (committed) but deleted by T12 (in-progress). Should T10 see this version? What if the deleter was T8 (committed)? What if the deleter was T10 itself? Walk through the visibility check logic.
- Garbage Collection: The MVCC model creates many "dead" tuple versions that are no longer visible to any active or future transaction. What is the long-term problem with leaving these dead versions in the database? This problem is typically solved by a process called vacuuming or garbage collection. When is it safe to physically remove a dead tuple version?
- Programming Exercise (Advanced): Implement a basic VACUUM function for a TableHeap. This function should scan the table and identify dead tuple versions that are no longer visible to any currently active transaction. Once identified, it should physically remove them from their pages. This is a challenging exercise that touches transaction management, storage, and concurrency.
Buffer Manager
The buffer manager (src/buffer/) implements QuillSQL’s shared buffer pool, bridging the
speed gap between RAM and disk. It lets storage/execution read and write pages safely
while coordinating with WAL and asynchronous I/O.
Responsibilities
- Maintain a fixed-size set of page frames caching TableHeap and B+Tree pages.
- Expose RAII-style guards (pin/unpin) that enforce safe concurrent access.
- Keep the page table, replacement policy, dirty-page tracking, and WAL coordination in sync.
- Submit async I/O through
DiskScheduler.
Directory Layout
| Path | Description | Key Types |
|---|---|---|
| buffer_manager.rs | Core buffer pool. | BufferManager, BufferFrame |
| page.rs | Guard types and pin/unpin logic. | ReadPageGuard, WritePageGuard |
| replacer.rs | LRU-K + TinyLFU replacement. | Replacer |
| metrics.rs | Optional instrumentation hooks. | BufferMetrics |
Key Mechanisms
Guard Model
- ReadPageGuard, WritePageGuard, and UpgradeableGuard ensure only compatible access modes coexist on a page.
- Guards drop automatically to release pins; paired with Rust's borrow checker, they make latch semantics tangible.
Replacement Policy
- LRU-K tracks the last K touches to protect hot pages from scan pollution.
- TinyLFU decides whether a new page should enter the cache, offering probabilistic admission against noisy workloads.
WAL Coordination
- Before flushing a dirty page, the buffer checks page_lsn and asks WalManager to flush up to that LSN (write-ahead rule).
- set_wal_manager wires the buffer to WAL so checkpoints can inspect the oldest dirty LSN.
Disk Scheduler
- All physical reads/writes go through
DiskScheduler::submit_*, sharing worker threads with WAL and demonstrating the benefits of a unified I/O layer.
Interactions
- Storage engine – TableHeap and BPlusTreeIndex access pages exclusively through the buffer manager.
- Recovery – checkpoints consult the buffer's dirty page table to build the ARIES DPT.
- Background writer – periodically walks dirty_frames to flush pages in the background.
Teaching Ideas
- Disable TinyLFU via feature flag, rerun sqllogictest, and compare hit rates.
- Swap the replacement policy with CLOCK to experiment with cache algorithms.
- Enable
RUST_LOG=buffer=debugand trace the pin/unpin lifecycle of hot pages.
Further reading: Page & Page Guards, The Buffer Pool
Page & Page Guards
Before the Buffer Manager can hand out a reference to a page in memory, it must ensure that the page won’t be evicted while it’s being used by another thread. This is accomplished by pinning.
Pinning
Pinning simply means incrementing a “pin count” associated with the page’s frame in the buffer pool. A frame with a pin count greater than zero is forbidden from being chosen as a victim by the page replacer.
- When a thread wants to use a page, it must first pin it.
- When the thread is finished with the page, it must unpin it (decrementing the count).
Manually managing pin counts is tedious and error-prone. Forgetting to unpin a page leads to a memory leak, as the frame can never be evicted. To solve this, QuillSQL uses a common and powerful C++ and Rust pattern: Resource Acquisition Is Initialization (RAII).
ReadPageGuard and WritePageGuard
Instead of returning a raw pointer to the page memory, the BufferManager’s fetch_page_* methods return a guard object: ReadPageGuard or WritePageGuard.
These guards are responsible for the lifetime of the pin and the lock on the page:
- Acquisition: When a PageGuard is created, its constructor acquires the appropriate lock (RwLock) on the page's frame and increments the frame's pin count. ReadPageGuard takes a read lock, allowing multiple concurrent readers. WritePageGuard takes an exclusive write lock.
- Usage: The calling code uses the guard object to access the page's data. The guard provides safe, locked access to the underlying byte array.
- Release: When the guard variable goes out of scope (e.g., at the end of a function), its drop() method is automatically called by the Rust compiler. This drop() implementation handles all the cleanup:
  - It decrements the pin count.
  - It releases the lock on the frame.
  - If it's a WritePageGuard and the data was modified, it informs the BufferManager that the page is now dirty.
This RAII pattern makes using the buffer pool much safer and more ergonomic, as it makes it impossible to forget to unpin a page or release a lock.
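A stripped-down sketch of the idea; real guards also hold the frame's RwLock guard, so the types here are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct Frame {
    pin_count: AtomicU32,
}

struct ReadGuard<'a> {
    frame: &'a Frame,
}

impl<'a> ReadGuard<'a> {
    fn new(frame: &'a Frame) -> Self {
        frame.pin_count.fetch_add(1, Ordering::SeqCst); // pin on acquisition
        ReadGuard { frame }
    }
}

impl Drop for ReadGuard<'_> {
    fn drop(&mut self) {
        self.frame.pin_count.fetch_sub(1, Ordering::SeqCst); // unpin automatically
    }
}

fn main() {
    let frame = Frame { pin_count: AtomicU32::new(0) };
    {
        let _guard = ReadGuard::new(&frame);
        assert_eq!(frame.pin_count.load(Ordering::SeqCst), 1);
    } // guard dropped here, pin released
    assert_eq!(frame.pin_count.load(Ordering::SeqCst), 0);
}
```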
The Buffer Pool: Architecture and Lifecycle
1. Core Components & Architecture
QuillSQL’s buffer management is split into two main structs, reflecting a separation of concerns:
- BufferPool (buffer/buffer_pool.rs): A low-level, "dumb" container. It owns the actual memory arena for pages and provides basic mapping from a PageId to a memory location (FrameId).
- BufferManager (buffer/buffer_manager.rs): The high-level, "smart" coordinator. It contains the BufferPool and implements all the logic for fetching pages, choosing which pages to evict, and interacting with other database components like the transaction and recovery managers.
This architecture is centered around three key data structures:
- Frames (The Arena): The BufferPool pre-allocates a large, contiguous block of memory on the heap. This block is divided into a fixed number of smaller chunks called frames. Each frame is exactly PAGE_SIZE (4KB) and can hold the contents of one disk page.
- Page Table (page_table): A hash map (specifically, a concurrent DashMap) that maps a logical PageId to the FrameId where it currently resides in memory. This provides fast O(1) lookups to check if a page is already in the buffer pool.
- Replacer (LRUKReplacer): When a requested page is not in memory and the buffer pool is full, one of the existing pages must be evicted to make room. The Replacer is the component that implements the page replacement policy and decides which page is the best candidate for eviction. QuillSQL uses an LRU-K replacement policy, a sophisticated variant of the classic Least Recently Used (LRU) algorithm.
2. The Lifecycle of a Page Request
When another part of the database (e.g., the execution engine) needs to access a page, it calls buffer_manager.fetch_page_read(page_id) or fetch_page_write(page_id). This initiates a critical sequence of events.
The flow is as follows:
- Request: An executor requests Page P.
- Lookup: The BufferManager consults the PageTable.
- Case 1: Cache Hit
  - The PageTable contains an entry for P, mapping it to FrameId F.
  - The BufferManager increments the pin count for frame F.
  - It informs the LRUKReplacer that the page has been accessed (updating its priority).
  - It returns a PageGuard wrapping a reference to the memory in frame F.
- Case 2: Cache Miss
  - The PageTable has no entry for P.
  - The BufferManager must bring the page from disk. It asks the Replacer to choose a victim frame F_v to evict.
  - If the victim frame F_v is dirty: The BufferManager first writes the contents of F_v back to disk via the DiskScheduler. This is essential for data durability.
  - The PageTable entry for the old page residing in F_v is removed.
  - The BufferManager issues a read request to the DiskScheduler to load page P's data from disk into frame F_v.
  - The PageTable is updated with the new mapping: P -> F_v.
  - The process then continues like a cache hit: frame F_v is pinned and a PageGuard is returned.
3. Concurrency
The buffer pool is a shared resource accessed by many concurrent threads. QuillSQL uses a combination of locking strategies:
- Page Table: A DashMap is used for the page table, which is highly optimized for concurrent reads and writes.
- Frame-Level Locks: Each frame in the pool has its own RwLock. A ReadPageGuard acquires a read lock on the frame, allowing multiple threads to read the same page concurrently. A WritePageGuard acquires an exclusive write lock, ensuring that only one thread can modify a page at a time.
- Replacer/Free List: These shared structures are protected by a Mutex.
4. Integration with WAL and Recovery
The Buffer Manager is a key player in the ARIES recovery protocol.
- Dirty Pages: When a WritePageGuard is dropped, if it has modified the page, it marks the page as dirty. The BufferManager maintains a dirty_page_table that tracks all dirty pages and the Log Sequence Number (LSN) of the first WAL record that caused the page to become dirty.
- Forced WAL (Write-Ahead Logging): Before a dirty page can be written back to disk (either during eviction or a checkpoint), the BufferManager must ensure that all WAL records up to that page's current LSN have been flushed to durable storage. This is the fundamental WAL rule and is enforced by the flush_page method, which calls wal_manager.flush(lsn) before writing the page to disk.
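A hedged sketch of that ordering, using stand-in types rather than QuillSQL's APIs: flush the log up to the page's LSN first, then write the page.

```rust
struct Page { lsn: u64, data: Vec<u8>, dirty: bool }

struct Wal { durable_lsn: u64 }
impl Wal {
    fn flush_up_to(&mut self, lsn: u64) {
        // A real implementation appends and fsyncs log segments here.
        if lsn > self.durable_lsn {
            self.durable_lsn = lsn;
        }
    }
}

struct Disk;
impl Disk {
    fn write_page(&mut self, _page: &Page) { /* schedule the page write */ }
}

fn flush_page(page: &mut Page, wal: &mut Wal, disk: &mut Disk) {
    if !page.dirty {
        return;
    }
    // WAL rule: the log record that dirtied this page must be durable
    // before the page itself reaches disk.
    wal.flush_up_to(page.lsn);
    disk.write_page(page);
    page.dirty = false;
}
```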
For Study & Discussion
- Replacement Policies: QuillSQL uses LRU-K. What are the potential advantages of LRU-K over a simple LRU policy? What kind of workload would benefit most? Conversely, what are the trade-offs of using Clock/Second-Chance instead?
- The "Double Caching" Problem: We mentioned that using Direct I/O helps avoid double caching. If Direct I/O is disabled, how does the database's buffer pool interact with the operating system's file system cache? Why can this lead to suboptimal performance?
- Programming Exercise: Implement a ClockReplacer that adheres to the Replacer trait. Modify the BufferManager to use your new replacer instead of LRUKReplacer. Run the benchmark suite and compare the performance. Does it change?
- Programming Exercise: Add metrics to the BufferManager to track the buffer pool hit rate (i.e., num_hits / (num_hits + num_misses)). You could expose this via a new SHOW BUFFER_STATS; SQL command.
Index Module
Indexes live in src/storage/index/. QuillSQL currently ships a B+Tree (B-link variant)
that is exposed to execution via IndexHandle. Indexes allow point lookups and range
scans in O(log n), dramatically reducing the need for full table scans.
Responsibilities
- Maintain an ordered key →
RecordIdmapping per indexed table. - Support point probes, range scans, insert/update/delete maintenance.
- Cooperate with MVCC: entries reference heap tuples while visibility checks remain in execution/transaction code.
- Provide
IndexHandle::range_scan, returning aTupleStreamso physical operators don’t need to know tree internals.
Directory Layout
| Path | Purpose | Key Types |
|---|---|---|
btree_index.rs | Core B+Tree, page formats, insert/delete logic. | BPlusTreeIndex |
btree_iterator.rs | Range-scan iterator with sibling traversal. | TreeIndexIterator |
btree_codec.rs | Page encode/decode utilities. | BPlusTreeLeafPageCodec |
Key Concepts
B-link Structure
Each leaf stores a pointer to its right sibling. Iterators use this to keep scanning even if a concurrent split occurs, avoiding restarts from the root and enabling latch-free range scans.
Latch Crabbing
Insert/delete operations climb the tree with shared latches and upgrade only when necessary (e.g., right before splitting), reducing contention.
Range Scan → TupleStream
IndexHandle::range_scan wraps TreeIndexIterator and automatically fetches heap tuples,
returning (rid, meta, tuple) triples. Execution remains storage-agnostic.
Inline Maintenance
Index inserts/updates/deletes modify the tree immediately and emit logical WAL for redo. There is no deferred “index vacuum”; once a heap tuple is deleted its index entry is removed in the same transaction.
Interactions
- Catalog – stores Arc<BPlusTreeIndex> instances alongside table metadata so execution can fetch handles directly.
- Execution – PhysicalIndexScan uses ExecutionContext::index_stream; DML operators call insert_tuple_with_indexes so heap writes and index maintenance stay in sync.
- Transaction/MVCC – heaps store transaction metadata; indexes just reference RIDs, so MVCC visibility is enforced when tuples are materialised.
- Recovery – WAL contains IndexInsert/IndexDelete records to replay structural changes after crashes.
Teaching Ideas
- Build a covering index example to show how avoiding heap lookups improves latency.
- Instrument TreeIndexIterator to visualise sibling traversal during range scans.
- Compare SeqScan vs IndexScan on selective predicates to highlight indexing benefits.
Further reading: B+Tree internals
B+ Tree Index — Architecture and Concurrency
1. Architecture Overview
1.1 Node and Page Structure
- Node Types: Internal and Leaf.
  - Internal Nodes: Store separator keys and pointers to child pages. The first pointer is a "sentinel" that points to the subtree for all keys less than the first key in the node.
  - Leaf Nodes: Store (key, RecordId) pairs in sorted order. Leaves are linked together in a singly linked list (next_page_id) to allow for efficient range scans.
- Header Page: A fixed page (header_page_id) that acts as the entry point to the tree. It stores the root_page_id, allowing for atomic updates to the tree root when it splits or shrinks.
- Page Layout:
  - Internal Page: {Header, High Key, Array<(Key, ChildPageId)>}. The High Key is part of the B-link optimization.
  - Leaf Page: {Header, Array<(Key, RID)>}.
1.2 B-link Structure
The tree uses B-link pointers (next_page_id) on all levels (both internal and leaf nodes). This creates a “side-link” to the right sibling. This is crucial for concurrency, as it allows readers to recover from transient inconsistent states caused by concurrent page splits by “chasing” the link to the right sibling.
+------------------+
| Root (Int) |
+------------------+
/ | \
+--------+ +--------+ +--------+
| Int P1 |----->| Int P2 |----->| Int P3 | (Internal B-link pointers)
+--------+ +--------+ +--------+
/ | \ / | \ / | \
+----+ +----+ +----+ +----+ +----+ +----+
| L1 |-| L2 |->| L3 |-| L4 |->| L5 |-| L6 | (Leaf chain / B-links)
+----+ +----+ +----+ +----+ +----+ +----+
2. Concurrency Control
The B+Tree employs a sophisticated, lock-free read path and a high-concurrency write path using latch crabbing.
2.1 Read Path: Optimistic Lock Coupling (OLC) with B-links
- Readers traverse the tree from the root without taking any locks.
- On each page, a
versionnumber is read before and after processing the page’s contents. If the version changes, it indicates a concurrent modification, and the read operation restarts from the root. - If a reader is traversing an internal node and the search key is greater than or equal to the node’s
high_key, it knows a split has occurred. Instead of restarting, it can use thenext_page_idB-link to “chase” to the right sibling and continue the search, minimizing restarts.
2.2 Write Path: Latch Crabbing
Writers (insert/delete) use a technique called latch crabbing (or lock coupling) to ensure safe concurrent modifications.
- Process: A writer acquires a write latch on a parent node before fetching and latching a child node. Once the child is latched, the writer checks if the child is “safe” for the operation (i.e., not full for an insert, not at minimum size for a delete).
- If the child is safe, the latch on the parent (and all other ancestors) is released.
- If the child is unsafe, the parent latch is held, as the child might need to split or merge, which would require modifying the parent.
ContextStruct: This process is managed by aContextstruct that holds the stack of write guards (write_set) for the current traversal path. Releasing ancestor latches is as simple as clearing this stack.
Latch Crabbing on Insert:
1. Latch(Root)
2. Descend to Child C1. Latch(C1).
3. Check if C1 is safe (not full).
IF SAFE:
ReleaseLatch(Root). Path is now just {C1}.
IF UNSAFE (full):
Keep Latch(Root). Path is {Root, C1}.
4. Descend from C1 to C2. Latch(C2).
5. Check if C2 is safe... and so on.
2.3 Deadlock Avoidance
When modifying siblings (during merge or redistribution), deadlocks are possible if two threads try to acquire latches on the same two pages in opposite orders. This is prevented by enforcing a strict PageId-ordered locking protocol. When two sibling pages must be latched, the page with the lower PageId is always latched first.
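A small illustration of the ordering rule with stand-in types; it assumes the two pages are distinct and is not taken from the actual latch code:

```rust
use std::sync::{Mutex, MutexGuard};

struct Page {
    id: u64,
    latch: Mutex<()>,
}

// Always latch the lower-numbered page first, regardless of argument order,
// so two threads latching the same sibling pair cannot deadlock.
fn latch_pair<'a>(a: &'a Page, b: &'a Page) -> (MutexGuard<'a, ()>, MutexGuard<'a, ()>) {
    let (first, second) = if a.id <= b.id { (a, b) } else { (b, a) };
    let g1 = first.latch.lock().unwrap();
    let g2 = second.latch.lock().unwrap();
    (g1, g2)
}
```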
3. Key Algorithms & Features
- Parent-Guided Redirection: During an insert or delete, after a writer has descended to a leaf, it re-validates its position using the parent (if a latch is still held). If a concurrent split has moved the target key range to a different sibling, the writer can jump directly to the correct page instead of traversing the leaf chain, preventing race conditions.
- Iterator: The iterator performs a forward scan by following the leaf chain (next_page_id). It uses a lightweight form of OLC, checking the leaf page version to detect concurrent modifications and restart if necessary to ensure it doesn't miss keys.
  - Sequential Scan Optimization (RingBuffer): For large range scans, the iterator switches to a "synchronous batch fetch + local ring buffer" mode. It fills a small RingBuffer<BytesMut> with consecutive leaf pages (by following next_page_id) and then decodes KVs locally without holding page guards for long. This reduces buffer pool pollution and syscall/lock overhead.
  - Two Iteration Modes:
    - Guard Mode: Keep a ReadPageGuard and decode per step; prefetch next leaf best-effort.
    - Bytes Mode: After switching, decode from BytesMut buffers in the local ring; when a leaf is exhausted, pop next bytes from the ring or refill by following the chain.
- Prefix Compression: Keys in internal nodes are prefix-compressed to save space. Each key is stored as (lcp, suffix_len, suffix). This reduces the size of internal pages, increasing the tree's fanout and reducing its height, which improves cache performance and reduces I/O.
- Split/Merge Safety:
- Split: When a node splits, the new right sibling is written first. Then, the B-link pointer and separator key are published atomically by updating the left sibling and parent. This ensures readers can always navigate the structure correctly.
- Merge/Redistribute: When a node is underfull, the implementation first tries to borrow an entry from a sibling (redistribute). If both siblings are at minimum size, it merges with a sibling. All these operations carefully maintain the B-link chain and parent pointers.
4. Benchmarks & Performance
4.1 Example: Range Scan Benchmark
This benchmark measures the efficiency of the leaf-chain traversal, which is critical for SELECT queries with WHERE clauses on indexed columns. It benefits from iterator prefetching and prefix compression.
#![allow(unused)]
fn main() {
// Pseudo-code for a range scan benchmark
use std::time::Instant;
fn benchmark_range_scan(index: Arc<BPlusTreeIndex>, num_keys: i64, num_passes: usize) {
// 1. Populate the index with sequential keys
for key in 1..=num_keys {
let tuple = create_tuple_from_key(key, index.key_schema.clone());
index.insert(&tuple, create_rid_from_key(key)).unwrap();
}
// 2. Run the benchmark
let start = Instant::now();
let mut count = 0;
for _ in 0..num_passes {
// Create an iterator over the full key range
let mut iter = TreeIndexIterator::new(index.clone(), ..);
while let Some(_) = iter.next().unwrap() {
count += 1;
}
}
let elapsed = start.elapsed();
let total_items_scanned = num_keys as usize * num_passes;
let items_per_sec = total_items_scanned as f64 / elapsed.as_secs_f64();
println!(
"Range Scan: Scanned {} items in {:?}. Throughput: {:.2} items/sec",
total_items_scanned, elapsed, items_per_sec
);
}
}
4.2 Performance Notes
- Hot Reads: Performance on hot-spot reads depends on keeping upper levels and hot leaves resident in the buffer pool. Warm up the cache for read-heavy benchmarks. Protect hot pages from pollution by enabling TinyLFU admission.
- Large Range Scans: Prefer table SeqScan with ring buffer (bypass) when scanning most of the table. For index scans over very large ranges, consider future Bitmap Heap Scan rather than bypassing the pool for leaves.
4.3 Configuration
config::BTreeConfig controls iterator behavior:
- seq_batch_enable (bool): enable batch mode with local ring buffer.
- seq_window (usize): number of leaf pages to prefill into the ring per refill.
- prefetch_enable / prefetch_window: guard-mode prefetch hints to the buffer pool.
Defaults are conservative; increase seq_window for long scans to reduce I/O hops.
5. Future Work
- Stronger OLC with bounded retries and telemetry.
- CSB+-like internal layout and columnar key prefixing.
- NUMA-aware partitioning and router.
- WAL/MVCC for crash recovery and snapshot isolation.
Recovery / WAL Module
src/recovery/ guarantees the Durability part of ACID. Through WAL, checkpoints, and the
ARIES algorithm, QuillSQL can recover to a consistent state after crashes.
Responsibilities
- Generate WAL records for every change and assign monotonically increasing LSNs.
- Manage WAL buffers, synchronous flushes, and the optional background WAL writer.
- Run the ARIES Analysis → Redo → Undo sequence during startup.
- Maintain the control file so the latest checkpoint and log truncation point are known.
Directory Layout
| Path | Description | Key Types |
|---|---|---|
recovery/wal/ | Log writer, buffer, storage, background writer runtime. | WalManager, WalBuffer, WalStorage |
recovery/wal_record.rs | Serialized record variants exposed to subsystems. | WalRecordPayload, ResourceManagerId |
recovery/resource_manager.rs | Registry that maps payloads to redo/undo handlers. | ResourceManager, RedoContext, UndoContext |
recovery/recovery_manager.rs | ARIES driver. | RecoveryManager, RecoverySummary |
recovery/control_file.rs | Persistent metadata (checkpoint info). | ControlFileManager |
WalManager & I/O pipeline
- Buffering – WalManager uses a lock-free WalBuffer ring. append_record_with attaches an LSN, encodes the frame, pushes it into the ring, and auto-flushes when either record count or byte thresholds (max_buffer_records, flush_coalesce_bytes, or a full WAL page) are met. Writers never block on Vec reallocation and thus scale with concurrent transactions.
- Physical logging – log_page_update emits 4 KiB PageWrite FPWs on first touch (tracked via DashSet<PageId>) or when logical redo is unavailable. Payloads carry the previous page_lsn so redo can enforce the WAL rule.
- Segmentation & flush – WalStorage appends frames to rolling segments under the configured WAL directory, queues async write/fsync tickets via the DiskScheduler sink, and recycles sealed segments after checkpoints. flush() drains buffered records up to an optional target LSN, waits for outstanding tickets, updates durable_lsn, and optionally persists the control file snapshot.
- Background writer – start_background_flush spawns WalWriterRuntime, which periodically calls flush(None); WalWriterHandle lives in the background worker registry so shutdown is coordinated with the db lifecycle.
- Checkpoints – log_checkpoint writes a Checkpoint frame containing ATT/DPT, forces a flush, clears the "first-touch" set (so new FPWs are generated as needed), and updates checkpoint_redo_start in the control file. Recovery uses this redo start to avoid rescanning the whole log.
- Readers & durability waits – WalManager::reader iterates frames straight from the WAL directory. wait_for_durable is the gate that synchronous commits call after emitting a commit record; it reuses flush_with_mode and condition variables to block until durable_lsn >= target.
Record types & resource managers
All log frames share the envelope defined in wal_record.rs and are routed by
ResourceManagerId:
- Page (ResourceManagerId::Page) – purely physical PageWrite payloads. The page resource manager checks page_lsn (buffer pool if available) before rewriting the page on redo. No undo is required because these are low-level physical changes.
- Heap (ResourceManagerId::Heap) – logical payloads include relation id, page/slot, tuple metadata, and tuple bytes. HeapResourceManager applies inserts/overwrites at the slot level with LSN checks and repacks slotted pages. Undo uses the stored before-image (metadata + bytes) for loser transactions.
- Index (ResourceManagerId::Index) – logs leaf inserts/deletes plus structural changes (split/merge/redistribute, parent updates, root install/adopt/reset). Redo mutates leaves or rebuilds structural pages with LSN/version guards; undo mirrors leaf inserts/deletes only. Heap and index WAL remain independent.
- Transaction / CLR / Checkpoint – TransactionPayload (BEGIN/COMMIT/ABORT) enables the undo index to link per-transaction log records. ClrPayload documents each undo step so the recovery process can survive crashes mid-undo. Checkpoints snapshot ATT + DPT for the analysis phase.
ensure_default_resource_managers_registered wires Page, Heap, and Index resource
managers together so both redo (RedoExecutor) and undo (UndoExecutor) can dispatch
records uniformly.
Heap / Index WAL design
- Heap – Execution operators call ExecutionContext::insert_tuple_with_indexes (and friends), which ultimately invoke MvccHeap to write WAL before dirtying the page. Each tuple carries TupleMetaRepr (insert/delete txn + CID, MVCC chain pointers) plus encoded column bytes. During redo, HeapResourceManager applies inserts or overwrites at the slot level with LSN checks and repacks the slotted page before writing it back via TableHeap::recovery_view. Undo leverages the stored before-image to restore bytes or toggle delete flags; vacuum code can later reclaim dead slots once safe_xmin passes them.
- Index – B+Tree leaf operations log the logical key/value change, plus structural records for split/merge/redistribute/parent updates and root install/adopt/reset. Redo loads or rebuilds pages (buffer pool or disk), checks page LSN/version to avoid reapplying newer changes, and performs the requested mutation. Undo mirrors leaf-level inserts/deletes only. Heap and index WAL stay independent.
- Page images – For heap/index structural changes that rewrite large sections (e.g., new heap page, B+Tree splits), page guards still emit FPWs through WalManager::log_page_update. Logical heap/index WAL ensures redo still works even when FPWs are skipped after the first touch.
ARIES Recovery
- Analysis – read the latest checkpoint, rebuild the Dirty Page Table (DPT) and Active Transaction Table (ATT).
- Redo – scan forward from the checkpoint LSN, reapplying operations when the DPT indicates a page may be out of date.
- Undo – roll back transactions still in ATT, writing Compensation Log Records (CLR) so the process is idempotent even if another crash happens mid-undo.
RecoverySummary reports how many records were redone and which transactions require
manual attention—great for classroom demonstrations.
Interactions
- TransactionManager – emits Begin, Commit, Abort records and supplies undo information.
- BufferManager – links to WAL via set_wal_manager; checkpoints rely on the buffer's dirty page metadata.
- Background workers – WAL writer and checkpoint worker live in background and use handles exposed by WalManager.
Teaching Ideas
- Simulate crashes (e.g., panic right after wal_manager.flush(None)) and inspect log output on restart.
- Add a new WAL record type (like CreateIndex) to see how RecoveryManager must be extended.
- Compare physical vs logical redo costs to discuss ARIES trade-offs.
Recovery Lab Playbook (CMU 15-445 style)
- Mini ARIES – Disable the background WAL writer, perform a few INSERT/UPDATE operations, crash the process mid-transaction, and single-step through analysis_pass.rs to observe ATT/DPT reconstruction.
- Logical vs physical redo – Comment out WalManager::log_page_update for heap pages and re-run the experiment. Recovery still succeeds thanks to logical heap payloads; re-enable FPWs to contrast log volume.
- Index crash test – Inject a panic after BPlusTreeIndex::insert logs a leaf insert. On restart, watch the undo phase remove the stray key before replay finishes.
- Group commit tuning – Play with WalOptions::flush_coalesce_bytes, writer_interval_ms, and synchronous_commit to demonstrate how throughput vs latency tradeoffs mirror the 15-445 checkpoints/commit labs.
Further reading: ARIES details
The ARIES Recovery Protocol
Of the four ACID properties, Durability is the one that guarantees that once a transaction is committed, its changes will survive any subsequent system failure. In a disk-based database, this is achieved through a careful and robust recovery protocol. QuillSQL implements a recovery strategy inspired by the well-known ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) protocol, centered around a Write-Ahead Log (WAL).
1. The Write-Ahead Logging (WAL) Principle
The fundamental rule of WAL is simple but powerful:
Before a modified data page is ever written from memory back to disk, the log record describing that modification must first be written to durable storage (the log file).
This allows the database to perform most of its data modifications in memory on the Buffer Pool without needing to synchronously write every changed page to disk, which would be extremely slow. In case of a crash, the database can use the log to reconstruct the state of the system and ensure all committed changes are present and all uncommitted changes are undone.
Log Records (WalRecord)
The WAL is a sequential, append-only file composed of log records. Each record is assigned a unique, monotonically increasing Log Sequence Number (LSN). A log record in QuillSQL (src/recovery/wal_record.rs) generally contains:
- LSN: The address of the log record.
- Transaction ID: The ID of the transaction that generated this record.
- Payload: The actual content of the log record, which varies by type.
QuillSQL uses several types of log records:
- `Transaction`: Marks the `BEGIN`, `COMMIT`, or `ABORT` of a transaction.
- `PageWrite`: Physical records describing a change to a page. `PageWrite` contains a full image of the page (used for first-touch or when logical redo is unavailable), a First-Page-Write (FPW) style safeguard.
- `HeapInsert` / `HeapDelete`: Logical tuple-level records. Each delete carries the full before-image so undo can always restore bytes/metadata.
- `Index*`: Logical B+Tree records covering leaf inserts/deletes and structural changes (split/merge/redistribute, parent updates, root install/adopt/reset).
- `Checkpoint`: A special record that marks a point of partial durability, allowing the log to be truncated.
- `CLR` (Compensation Log Record): A special record written during recovery to describe an undo action. CLRs are redo-only and are never undone themselves. (A condensed sketch of these variants follows below.)
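Condensed into one enum, the taxonomy above might look roughly like the following; the field lists are heavily simplified and the real definitions live in `src/recovery/wal_record.rs`:

```rust
// Simplified sketch of the record taxonomy; real payloads carry more fields
// (relation ids, key schemas, prev_lsn chains, ...).
type Lsn = u64;

enum TxnMarker {
    Begin,
    Commit,
    Abort,
}

enum WalPayloadSketch {
    /// BEGIN / COMMIT / ABORT markers.
    Transaction { txn_id: u64, kind: TxnMarker },
    /// Full page image (FPW-style safeguard: first touch, or no logical redo).
    PageWrite { page_id: u32, image: Vec<u8> },
    /// Logical heap records; deletes always keep the full before-image.
    HeapInsert { page_id: u32, slot: u16, tuple: Vec<u8> },
    HeapDelete { page_id: u32, slot: u16, before_image: Vec<u8> },
    /// Point of partial durability; the log before it can be truncated.
    Checkpoint { redo_start: Lsn },
    /// Compensation Log Record: redo-only, never undone.
    Clr { undone_lsn: Lsn, undo_next_lsn: Lsn },
}
```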
2. The ARIES Recovery Protocol
On database startup, the RecoveryManager (recovery/recovery_manager.rs) is invoked to replay() the WAL. This process follows the three phases of ARIES.
Phase 1: Analysis
The goal of the Analysis phase (recovery/analysis.rs) is to figure out the state of the database at the exact moment of the crash.
- It starts by finding the last successful `Checkpoint` record in the WAL.
- It then scans the log forward from that checkpoint to the end of the log.
- During this scan, it builds two critical data structures:
  - Active Transaction Table (ATT): A list of all transactions that have a `BEGIN` record but no corresponding `COMMIT` or `ABORT` record. These are the potential “loser” transactions that will need to be undone.
  - Dirty Page Table (DPT): A list of all pages that were modified in the buffer pool but might not have been written to disk before the crash. For each dirty page, it records the LSN of the first log record that made it dirty (this is called the `recLSN`).
At the end of this phase, the RecoveryManager knows exactly which transactions to roll back and the earliest point in the log from which it needs to start re-applying changes.
Phase 2: Redo (Repeating History)
The goal of the Redo phase (recovery/redo.rs) is to restore the database to its exact state at the moment of the crash, including all changes from both committed and uncommitted (loser) transactions.
- The Redo phase finds the smallest `recLSN` from the Dirty Page Table built during Analysis. This LSN is the starting point for the redo scan.
- It scans the log forward from this starting point.
- For every log record it encounters that describes a change to a page (e.g., `PageWrite`, `Heap*`, `Index*`), it re-applies the change. This is idempotent: if the change is already present on the page (because it was successfully flushed to disk before the crash), re-applying it does no harm.
At the end of this phase, the database state on disk is identical to how it was in memory right before the crash.
Phase 3: Undo (Rolling Back Losers)
The final phase (recovery/undo.rs) is responsible for rolling back all the “loser” transactions identified during Analysis.
- The `UndoExecutor` takes the list of loser transactions.
- For each loser transaction, it scans the WAL backwards, following the chain of log records for that transaction.
- For each operation record (like `HeapInsert`, `HeapDelete`), it performs the logical inverse operation:
  - To undo an `Insert`, it performs a `Delete`.
  - To undo a `Delete`, it restores the deleted data.
- Crucially, for every undo action it performs, it writes a Compensation Log Record (CLR) to the WAL. The CLR contains information about the undo action and, importantly, a pointer to the next log record that needs to be undone for that transaction.
This use of CLRs makes the recovery process itself crash-proof. If the system crashes during the undo phase, the next time recovery runs, it will see the CLRs. It will simply continue the undo process from where it left off by following the pointers in the CLRs, without ever having to undo an undo action.
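The backward walk with CLRs can be sketched as a small loop; the record shape and field names here are invented for illustration and are not the actual `UndoExecutor` types:

```rust
// Illustrative only: a per-transaction undo walk that emits CLRs so a crash
// mid-undo can resume exactly where it stopped.
#[derive(Clone)]
struct LogRecord {
    lsn: u64,
    prev_lsn: Option<u64>,      // previous record of the same transaction
    undo_next_lsn: Option<u64>, // only meaningful on CLRs
    is_clr: bool,
}

fn undo_transaction<F: FnMut(LogRecord)>(log: &[LogRecord], last_lsn: u64, mut append_clr: F) {
    let mut next = Some(last_lsn);
    while let Some(lsn) = next {
        let rec = log.iter().find(|r| r.lsn == lsn).expect("record must exist");
        if rec.is_clr {
            // CLRs are never undone; just jump to the record they point at.
            next = rec.undo_next_lsn;
        } else {
            // 1) apply the logical inverse of `rec` (delete an insert, restore a delete)
            // 2) log a CLR whose undo_next_lsn is rec.prev_lsn, making undo restartable
            append_clr(LogRecord {
                lsn: 0, // the WAL would assign the real LSN
                prev_lsn: None,
                undo_next_lsn: rec.prev_lsn,
                is_clr: true,
            });
            next = rec.prev_lsn;
        }
    }
}
```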
3. Checkpoints
If the log were allowed to grow forever, recovery would become slower and slower. Checkpoints are a background process that periodically creates a point of partial durability.
A checkpoint does the following:
- Temporarily stops new transactions from starting.
- Writes a `BEGIN_CHECKPOINT` record to the WAL, containing the current ATT and DPT.
- Flushes all dirty pages from the buffer pool to disk.
- Writes an `END_CHECKPOINT` record to the WAL.
Once a checkpoint is complete, the RecoveryManager knows that all changes described in log records before the BEGIN_CHECKPOINT record are safely on disk. Therefore, those older parts of the WAL file are no longer needed for recovery and can be truncated or recycled, keeping the recovery process efficient.
For Study & Discussion
- Repeating History: The Redo phase re-applies changes from all transactions, including those that will be undone later (the “losers”). This seems wasteful. Why is this “repeating history” approach a core principle of ARIES? What would go wrong if we tried to only redo changes from committed transactions?
- Compensation Log Records (CLRs): What specific problem would occur if the system crashed during the Undo phase and we didn’t write CLRs? How does a CLR’s “redo-only” nature make the recovery process idempotent and robust against repeated crashes?
- Checkpointing Frequency: What are the performance trade-offs between checkpointing very frequently versus very infrequently? Consider both normal runtime performance and the time it takes to recover after a crash.
- Programming Exercise: Add a new WAL record type. For example, a `HeapReclaim` record that logs the physical removal of a dead tuple by a vacuum process. To do this, you would need to:
  a. Add a variant to the `HeapRecordPayload` enum in `wal_record.rs`.
  b. Update the `codec` functions to handle its serialization.
  c. Decide what information the `HeapReclaim` record needs to contain.
  d. Implement the `redo` logic for this new record in the appropriate `ResourceManager`.
  What should happen when you redo a `HeapReclaim` record? (A starting-point sketch follows below.)
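As a starting point for the exercise, the new variant and its redo arm might be shaped like the sketch below; the surrounding types are simplified stand-ins, not the real `wal_record.rs` definitions:

```rust
// Sketch only: stand-in types for the exercise, not QuillSQL's actual payload enum.
enum HeapRecordPayload {
    Insert { page_id: u32, slot: u16, tuple: Vec<u8> },
    Delete { page_id: u32, slot: u16, before_image: Vec<u8> },
    /// New in this exercise: vacuum physically removed a dead tuple version.
    /// No before-image is needed because the version was already invisible
    /// to every active snapshot when it was reclaimed.
    Reclaim { page_id: u32, slot: u16 },
}

fn redo(payload: &HeapRecordPayload, frame_lsn: u64, page_lsn: u64) {
    // Standard ARIES guard: skip if the page already reflects this record.
    if page_lsn >= frame_lsn {
        return;
    }
    match payload {
        HeapRecordPayload::Insert { .. } => { /* re-insert tuple bytes at the slot */ }
        HeapRecordPayload::Delete { .. } => { /* mark the slot deleted again */ }
        HeapRecordPayload::Reclaim { page_id, slot } => {
            // Redoing a reclaim clears the slot again; if it is already empty
            // the operation is a no-op, which keeps redo idempotent.
            let _ = (page_id, slot);
        }
    }
}
```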
Write-Ahead Logging
This note dives into the WAL subsystem that powers QuillSQL’s recovery story. It explains how frames are generated, buffered, flushed, and replayed, and why logical heap/index records complement traditional physical FPWs.
1. Component map
| Layer | Location | Responsibility |
|---|---|---|
| WalManager | src/recovery/wal/mod.rs | Assigns LSNs, encodes WalRecordPayload, enqueues frames, and drives WalStorage::flush. |
| WalBuffer | src/recovery/wal/buffer.rs | Lock-free ring buffer (ConcurrentRingBuffer) that tracks pending frame count, bytes, and highest end LSN. |
| WalStorage | src/recovery/wal/storage.rs | Maintains WAL directory/segments, builds WalPages, and dispatches write/fsync tickets to a WalSink (default: DiskSchedulerWalSink). |
| WalWriterRuntime | src/recovery/wal/writer.rs | Background thread that periodically calls WalManager::flush(None). |
| ControlFileManager | src/recovery/control_file.rs | Persists durable_lsn, max_assigned_lsn, checkpoint metadata, and redo start for fast crash recovery. |
| Resource managers | src/recovery/resource_manager.rs, storage/*/heap_recovery.rs, storage/*/index_recovery.rs | Decode payloads per ResourceManagerId and execute redo/undo logic for heap/index/page operations. |
End-to-end DML flow:
- `ExecutionContext` mutates tuples/index entries via `TableHeap` / `BPlusTreeIndex`.
- Operators invoke `WalManager::append_record_with` or `log_page_update` after data-page changes succeed.
- Frames enter `WalBuffer`. Once thresholds are met, `flush()` drains frames into `WalStorage`, which schedules asynchronous writes/fsyncs.
- During commit, `TransactionManager` waits on `WalManager::wait_for_durable(lsn)` to guarantee WAL persistence before releasing locks (see the commit-path sketch below).
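The commit-side contract reduces to “append, then wait for durability”. The trait and signatures below are guesses used only to show the ordering, not the real `WalManager` API:

```rust
// Hypothetical trait capturing the ordering guarantee on the commit path.
trait WalSketch {
    /// Append an already-encoded record and return the LSN it received.
    fn append(&self, payload: Vec<u8>) -> u64;
    /// Block until everything up to `lsn` is on stable storage.
    fn wait_for_durable(&self, lsn: u64);
}

fn commit<W: WalSketch>(wal: &W, commit_record: Vec<u8>) {
    // The COMMIT marker is appended like any other record...
    let commit_lsn = wal.append(commit_record);
    // ...and success is reported only after that LSN is durable; locks are
    // released and the client sees COMMIT strictly after this point.
    wal.wait_for_durable(commit_lsn);
}
```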
2. Frame & record taxonomy
wal_record.rs defines the canonical envelope. encode_frame(lsn, prev_lsn, payload)
serializes a record, while WalAppendContext reports the LSN range back to the caller.
| ResourceManagerId | Payload | Purpose |
|---|---|---|
| Page | PageWritePayload | Full-page image (FPW) on first-touch; also used when logical redo is not available. |
| Heap | HeapRecordPayload::{Insert,Delete} | Logical tuple records carrying relation id, page/slot, TupleMetaRepr, and tuple bytes (delete payloads always carry the before-image). |
| Index | IndexRecordPayload::{LeafInsert,LeafDelete,Leaf/Internal Split/Merge/Redistribute,Parent*,Root*} | Logical B+Tree leaf mutations plus structural changes (split/merge/redistribute, parent updates, root install/adopt/reset). |
| Transaction | TransactionPayload | BEGIN / COMMIT / ABORT markers that seed Undo’s per-txn chains. |
| Checkpoint | CheckpointPayload | Captures ATT/DPT snapshots plus redo start. |
| Clr | ClrPayload | Compensation Log Records documenting each undo step and its undo_next_lsn. |
ResourceManagerId determines how RedoExecutor / UndoExecutor route frames:
Page → physical redo only; Heap/Index → logical redo/undo using storage helpers;
Transaction/CLR/Checkpoint → interpreted by the analysis/undo phases.
3. Heap WAL: MVCC-aware logical logging
- Emission points – `TableHeap::{insert,update,delete}_tuple` call `append_heap_record` (see `src/storage/heap/mvcc_heap.rs`) before the data page is overwritten. Each record stores the relation identifier, page/slot, tuple metadata, and tuple bytes (delete payloads include the exact before-image).
- Encoding – `src/storage/heap/wal_codec.rs` serializes `TupleMeta` (insert/delete txn id, command id, MVCC chain pointers) plus tuple bytes in a compact layout (a simplified sketch follows below).
- Redo – `HeapResourceManager` (`src/storage/heap/heap_recovery.rs`) decodes the payload and applies it at the slot level (insert/overwrite) with an LSN check; it repacks the slotted page layout to keep offsets consistent and sets page LSN to the frame LSN.
- Undo – Always restores the stored before-image (metadata + tuple bytes) for loser transactions, removing the optional/branchy path.
- Interaction with FPW – Heap logical redo handles tuple-level replay; FPWs are only used on first-touch or when no logical log exists for a page.
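Conceptually the heap payload is just fixed-width metadata followed by raw tuple bytes. The widths and field set below are illustrative, not the actual `wal_codec.rs` wire format:

```rust
// Illustrative layout; the real codec in src/storage/heap/wal_codec.rs has
// its own field set and framing.
struct TupleMetaSketch {
    insert_txn_id: u64,
    delete_txn_id: u64,
    command_id: u32,
    next_version: u64, // MVCC chain pointer
}

fn encode_heap_record(meta: &TupleMetaSketch, tuple_bytes: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(28 + tuple_bytes.len());
    buf.extend_from_slice(&meta.insert_txn_id.to_le_bytes());
    buf.extend_from_slice(&meta.delete_txn_id.to_le_bytes());
    buf.extend_from_slice(&meta.command_id.to_le_bytes());
    buf.extend_from_slice(&meta.next_version.to_le_bytes());
    // Delete records carry the full before-image here so undo can restore bytes.
    buf.extend_from_slice(tuple_bytes);
    buf
}
```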
4. Index WAL: logical B+Tree leaf operations
- Emission points – `BPlusTreeIndex` logs every leaf insert/delete and all structural changes (split/merge/redistribute, parent updates, root install/adopt/reset) via `append_index_record` (`src/storage/index/btree_index.rs`), using `src/storage/index/wal_codec.rs` to encode key schema, key bytes, page ids, and txns.
- Redo (`src/storage/index/index_recovery.rs`) steps:
  1. Decode the key with `TupleCodec` using the stored schema.
  2. Fetch the target page through the buffer pool (preferred) or `DiskScheduler`.
  3. Skip if `page_lsn >= frame_lsn`; otherwise apply the logical mutation/structural rebuild and bump the page version.
  4. Write the updated page back with the frame LSN (including header root pointer for root install/adopt/reset).
- Undo – Inverts leaf inserts/deletes for loser transactions. Structural records are redo-only (idempotent via LSN/version checks).
- Benefits – Heap and index WAL are decoupled; logical leaf updates and structured payloads avoid full-page writes while remaining crash-safe for splits/merges/root changes.
5. Buffering & flush strategy
- Thresholds – `max_buffer_records` (from `WalOptions::buffer_capacity`), `flush_coalesce_bytes`, and one WAL page (4 KiB) trigger batched flushes (see the decision sketch below).
- Flush mechanics – `WalManager::flush_with_mode` drains frames up to a target LSN, asks `WalStorage::append_records` to write them, then waits on all `WalFlushTicket`s. `flush_until` forces durability before commit or after undo.
- Checkpoints – `log_checkpoint` forces a flush, records `checkpoint_redo_start` in the control file, and clears the “touched pages” set so new FPWs fire only once per checkpoint interval.
- Background writer – `WalWriterRuntime` runs when `WalOptions::writer_interval_ms` is non-zero, smoothing out flush pressure even when foreground transactions are light.
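The flush decision itself is a simple OR over a few counters; the struct below is a sketch with made-up names, standing in for thresholds derived from `WalOptions`:

```rust
// Made-up names standing in for the thresholds derived from WalOptions.
struct PendingStats {
    records: usize,
    bytes: usize,
}

struct FlushPolicy {
    max_buffer_records: usize, // from WalOptions::buffer_capacity
    flush_coalesce_bytes: usize,
    wal_page_bytes: usize,     // one WAL page, e.g. 4096
}

impl FlushPolicy {
    /// Trigger a batched flush as soon as any threshold is crossed.
    fn should_flush(&self, pending: &PendingStats) -> bool {
        pending.records >= self.max_buffer_records
            || pending.bytes >= self.flush_coalesce_bytes
            || pending.bytes >= self.wal_page_bytes
    }
}
```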
6. Relation to ARIES
- Analysis – `AnalysisPass` parses the latest checkpoint, reconstructs ATT/DPT by scanning the tail of the log, and chooses a redo start (`min(dpt.rec_lsn)`); a condensed sketch of this scan follows below.
- Redo (repeat history) – `RedoExecutor` iterates from `start_lsn`, invoking the appropriate resource manager for each frame. Page RM checks pageLSN before applying FPW/delta; Heap/Index RMs use logical payloads to rebuild tuples or leaf entries.
- Undo – `UndoExecutor` chains loser transactions backwards, calling each resource manager’s undo method and emitting CLRs with `undo_next_lsn`. If recovery crashes mid-undo, the next run resumes at the recorded `undo_next_lsn`.
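The analysis scan is essentially one forward pass that grows the ATT and DPT; the record shapes below are simplified for illustration:

```rust
use std::collections::{HashMap, HashSet};

// Simplified record stream; real frames are decoded per ResourceManagerId.
enum Rec {
    Begin { txn: u64 },
    End { txn: u64 },                  // COMMIT or ABORT both retire the txn
    PageTouch { page: u32, lsn: u64 }, // any heap/index/page record
}

/// Returns (ATT, DPT). Redo later starts at min(recLSN) over the DPT.
fn analysis(log: &[Rec]) -> (HashSet<u64>, HashMap<u32, u64>) {
    let mut att = HashSet::new();
    let mut dpt: HashMap<u32, u64> = HashMap::new();
    for rec in log {
        match rec {
            Rec::Begin { txn } => {
                att.insert(*txn);
            }
            Rec::End { txn } => {
                att.remove(txn);
            }
            Rec::PageTouch { page, lsn } => {
                // recLSN is the FIRST record that dirtied the page.
                dpt.entry(*page).or_insert(*lsn);
            }
        }
    }
    (att, dpt)
}
```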
7. Tuning & troubleshooting
- Configuration – `WalOptions` inside `DatabaseOptions` exposes `segment_size`, `sync_on_flush`, `writer_interval_ms`, `synchronous_commit`, `retain_segments`, etc.
- Introspection – `WalManager::pending_records()` dumps in-memory frames for debugging; `background::BackgroundWorkers::snapshot()` reports WAL writer/checkpoint worker metadata. Enabling `RUST_LOG=trace` reveals FPW vs delta decisions and flush cadence.
- Crash testing – Insert a forced `std::process::exit(1)` after specific DMLs, then restart and inspect `RecoverySummary` (redo count + loser transactions) to ensure heap/index WAL cover the intended cases.
With logical heap/index records plus FPWs as a safety net, QuillSQL’s WAL stays teaching-friendly while mirroring production-grade recoverability. When introducing new components (e.g., custom indexes or vacuum steps), define a payload + resource manager and they will automatically participate in ARIES.
Background Services
src/background/ hosts the asynchronous workers that keep a database healthy: WAL
writers, checkpoints, buffer flushers, and MVCC vacuum. A central registry makes it easy
to start/stop workers together—ideal for teaching how background maintenance supports
foreground queries.
Responsibilities
- Start workers according to configuration (`WalOptions`, `MvccVacuumConfig`, etc.).
- Define lightweight traits (`CheckpointWal`, `BufferMaintenance`, `TxnSnapshotOps`) so workers can run without pulling in an async runtime.
- Provide `BackgroundWorkers`, a registry that tracks `WorkerHandle`s and shuts them down when `Database` drops.
Built-in Workers
| Worker | Trigger | Behavior |
|---|---|---|
| WAL writer | wal_writer_interval_ms | Calls WalManager::background_flush to durably write log buffers. |
| Checkpoint | checkpoint_interval_ms | Captures dirty page / active txn tables and emits Checkpoint records to bound recovery. |
| Buffer writer | bg_writer_interval | Flushes dirty frames to reduce checkpoint pressure. |
| MVCC vacuum | MvccVacuumConfig | Removes obsolete tuple versions once safe_xmin advances. |
Every worker registers itself with BackgroundWorkers; shutdown_all() ensures threads
exit cleanly during tests or process teardown.
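The registry pattern itself is small: spawn periodic threads behind a shared stop flag and join them on shutdown. A minimal sketch, not the real `BackgroundWorkers` type:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::thread::{self, JoinHandle};
use std::time::Duration;

// Minimal stand-in for a worker registry: spawn periodic tasks, stop them together.
struct Workers {
    stop: Arc<AtomicBool>,
    handles: Vec<JoinHandle<()>>,
}

impl Workers {
    fn new() -> Self {
        Self { stop: Arc::new(AtomicBool::new(false)), handles: Vec::new() }
    }

    /// Register a periodic worker, e.g. a WAL writer, checkpointer, or vacuum.
    fn spawn_periodic(&mut self, interval: Duration, mut tick: impl FnMut() + Send + 'static) {
        let stop = Arc::clone(&self.stop);
        self.handles.push(thread::spawn(move || {
            while !stop.load(Ordering::Relaxed) {
                tick();
                thread::sleep(interval);
            }
        }));
    }

    /// Mirror of shutdown_all(): signal every worker and join its thread.
    fn shutdown_all(mut self) {
        self.stop.store(true, Ordering::Relaxed);
        for h in self.handles.drain(..) {
            let _ = h.join();
        }
    }
}
```

Each built-in worker then becomes just a closure handed to `spawn_periodic`, which is why they can share one shutdown path.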
Interactions
- WalManager – WAL writer and checkpoint workers operate on `Arc<dyn CheckpointWal>`.
- BufferManager – background flushers inspect dirty frames and help checkpoints capture consistent snapshots.
- TransactionManager – MVCC vacuum queries `TxnSnapshotOps` for `safe_xmin`.
Teaching Ideas
- Tune `MvccVacuumConfig::batch_limit` and chart how quickly old tuple versions disappear.
- Disable a worker in tests to show why unflushed WAL or missing checkpoints lengthen recovery.
- Enable `RUST_LOG=background=info` to trace how these tasks complement foreground load.
Configuration & Runtime Options
src/config/ centralizes tunables used by DatabaseOptions, the CLI/HTTP front-ends, and
background workers. Keeping knobs in one place makes it easy to demonstrate how WAL,
buffering, or vacuum behavior changes under different settings.
Key Types
| Type | Description |
|---|---|
| DatabaseOptions | Top-level options when constructing a database (WAL config, default isolation, etc.). |
| WalOptions | WAL directory, segment size, flush strategy, writer interval, sync mode. |
| IndexVacuumConfig / MvccVacuumConfig | Background worker intervals (buffer writer, MVCC vacuum). |
| BufferPoolConfig | Optional overrides for pool size, TinyLFU, and replacement policy details. |
Usage
- CLI/HTTP front-ends parse env vars or config files into `DatabaseOptions` and pass them to `Database::new_*` (a configuration sketch follows below).
- During `bootstrap_storage`, the database wires these options into `WalManager`, `DiskScheduler`, and `BackgroundWorkers`.
- Workers and execution components receive `Arc` references to the relevant configs so they can adapt at runtime without global state.
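For a lab, a single place to tweak knobs might look like the sketch below; the struct fields echo the tables in this chapter but are illustrative, not the exact definitions in `src/config/`:

```rust
// Illustrative only: the field names echo this chapter, but the real structs
// in src/config/ may differ in shape and defaults.
struct WalOptions {
    segment_size: u64,
    writer_interval_ms: u64,
    synchronous_commit: bool,
}

struct DatabaseOptions {
    wal: WalOptions,
    buffer_pool_pages: usize,
}

fn lab_profile() -> DatabaseOptions {
    DatabaseOptions {
        wal: WalOptions {
            segment_size: 16 * 1024 * 1024, // small segments make truncation visible
            writer_interval_ms: 0,          // 0 disables the background WAL writer
            synchronous_commit: true,       // flip to false to discuss durability trade-offs
        },
        buffer_pool_pages: 64,              // tiny pool to provoke eviction in class
    }
}
```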
Teaching Ideas
- Toggle `WalOptions::synchronous_commit` to discuss commit latency vs durability.
- Shrink the buffer pool to highlight eviction behavior under different replacement policies.
- Adjust `MvccVacuumConfig` intervals and measure how vacuum frequency affects foreground write throughput.
Front-Ends (CLI / HTTP)
The bin/ directory contains the user-facing entry points. Both binaries embed the same
Database type, so they demonstrate how the core engine can power different UIs.
| Binary | Purpose |
|---|---|
| client.rs | Interactive CLI (REPL) that reads SQL, executes it, and prints tabular output. |
| server.rs | HTTP + JSON API for integration tests or web UIs. |
CLI (bin/client.rs)
- Uses `rustyline` to provide history, multi-line editing, and familiar shell shortcuts.
- Each command calls `database.run(sql)` and formats the resulting `Vec<Tuple>`.
- Supports meta commands (e.g., `.tables`) that expose catalog metadata—great for teaching how logical objects map to physical handles.
HTTP (bin/server.rs)
- Built with `axum`/`hyper` (depending on the current `Cargo.toml`), exposing endpoints such as:
  - `POST /query` – run arbitrary SQL and return rows or an error payload.
  - Health/metrics endpoints—which you can extend in labs to surface background worker status or buffer metrics.
- Configuration comes from `QUILL_DB_FILE`, `QUILL_HTTP_ADDR`, `PORT`, etc., mirroring how production services inject settings. A minimal handler sketch follows below.
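Assuming an axum 0.7 + tokio + serde (derive) stack, a stripped-down `POST /query` endpoint could look like this; the request/response shapes and the wiring to `database.run` are placeholders, not the actual `bin/server.rs` code:

```rust
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct QueryRequest {
    sql: String,
}

#[derive(Serialize)]
struct QueryResponse {
    rows: Vec<Vec<String>>,
}

// Placeholder handler: the real server would call database.run(&req.sql)
// and translate the resulting tuples (or error) into the JSON payload.
async fn query(Json(req): Json<QueryRequest>) -> Json<QueryResponse> {
    let _ = req.sql;
    Json(QueryResponse { rows: vec![] })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/query", post(query));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```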
Teaching Ideas
- Extend the CLI with `\describe table` to practice catalog lookups.
- Add transaction endpoints (BEGIN/COMMIT) to the HTTP server to demonstrate how `SessionContext` scopes transactions per connection.
- Combine CLI interaction with `RUST_LOG` tracing to walk through the entire query lifecycle.
Testing & Documentation
QuillSQL is intended for teaching, so the repo invests heavily in examples and automated
verification. The tests/ tree and this mdBook work together to illustrate every module.
Test Suite
| Location | Purpose |
|---|---|
| tests/sql_example/*.slt | sqllogictest suites covering DDL, DML, transactions, and indexes. |
| tests/transaction_tests.rs | Rust unit tests that stress MVCC visibility, lock conflicts, and isolation semantics. |
| tests/storage_* | Component tests for heap/index/buffer internals—perfect references for lab exercises. |
Common commands:
cargo test -q
# focused run
cargo test tests::transaction_tests::repeatable_read_sees_consistent_snapshot_after_update -- --nocapture
For long-running suites, wrap with timeout to guard against hangs.
Documentation (mdBook)
- The `docs/` directory is an mdBook; run `mdbook serve docs` to browse locally.
- Each module, including this page, has a dedicated chapter so instructors can teach subsystem by subsystem.
- Anchor chapters such as `architecture.md`, `transactions.md`, and `wal.md` walk through end-to-end flows and subsystem deep dives.
Teaching Ideas
- Require sqllogictest additions alongside code changes to reinforce “tests as docs”.
- Use the mdBook site during lectures to connect diagrams with source files.
- Assign “doc walk-through” tasks where students extend diagrams or add experiment instructions to existing chapters.