QuillSQL Internals

Welcome to the technical documentation for QuillSQL.

This book provides a deep dive into the internal architecture and implementation details of the database. It is intended for developers, contributors, and anyone interested in understanding how a relational database is built from the ground up, referencing concepts from classic database courses like CMU 15-445.


QuillSQL Architecture

This document explains how a SQL request flows through QuillSQL, how transactional MVCC and background services plug in, and how the main modules collaborate. All diagrams use Mermaid so you can paste them into any compatible renderer for a richer view.


1. End-to-End Request Pipeline

flowchart TD
    Client["CLI / HTTP client"] --> API["bin/client / bin/server"]
    API --> Parser["sql::parser"]
    Parser --> LPlanner["plan::LogicalPlanner"]
    LPlanner --> Optimizer["optimizer::LogicalOptimizer"]
    Optimizer --> PhysPlanner["plan::PhysicalPlanner"]
    PhysPlanner --> Exec["execution::ExecutionEngine (Volcano)"]

    subgraph Txn["Transaction Layer"]
        Session["session::SessionContext"]
        TM["transaction::TransactionManager"]
        LM["transaction::LockManager"]
        Session --> TM --> LM
    end

    Exec -->|Tuple meta & locks| Txn
    Exec --> Storage

    subgraph Storage["Storage & I/O"]
        TableHeap["storage::table_heap::TableHeap"]
        Index["storage::index::B+Tree"]
        Buffer["buffer::BufferManager"]
        Scheduler["storage::disk_scheduler (io_uring)"]
        WAL["recovery::WalManager"]
        TableHeap <--> Buffer
        Index <--> Buffer
        Buffer <--> Scheduler
        WAL --> Scheduler
    end

    Txn -->|WAL payloads| WAL

    Background["background::workers\n(checkpoint, bg writer, MVCC vacuum)"] --> Buffer
    Background --> WAL
    Background --> TM

High-level flow

  1. SQL text is parsed into an AST, planned into a LogicalPlan, optimized by a handful of safe rewrite rules, then lowered into a physical operator tree.
  2. SessionContext either reuses or creates a transaction and injects isolation/access modes.
  3. Each physical operator is driven by the Volcano pull model (init/next). On entry it obtains a TxnRuntime which supplies a command id plus an MVCC snapshot consistent with the transaction’s isolation level.
  4. Operators consult the snapshot for tuple visibility, acquire table/row locks through TxnRuntime, and issue heap/index operations.
  5. DML operators register undo records and append WAL entries via the transaction manager. When the user commits, the manager emits Commit records, waits for durability as configured, and releases locks.
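
The pull model in step 3 can be pictured as a small iterator-style contract. The trait and type names below (VolcanoExecutor, ExecutionContext) are illustrative stand-ins, not the exact definitions in the execution module; they only show the init/next shape that every physical operator follows.

// Illustrative sketch of the Volcano (iterator) execution contract.
// Names are hypothetical; QuillSQL's real trait lives in the execution module.

struct Tuple;            // stand-in for the engine's tuple type
struct ExecutionContext; // stand-in for per-query state (txn runtime, catalog, ...)

trait VolcanoExecutor {
    /// Prepare operator state (open children, build hash tables, ...).
    fn init(&mut self, ctx: &mut ExecutionContext) -> Result<(), String>;

    /// Pull the next tuple; Ok(None) signals that the operator is exhausted.
    fn next(&mut self, ctx: &mut ExecutionContext) -> Result<Option<Tuple>, String>;
}

/// The root of the plan is drained by calling `next` until it returns `None`;
/// every operator pulls from its children in exactly the same way.
fn drain(root: &mut dyn VolcanoExecutor, ctx: &mut ExecutionContext) -> Result<Vec<Tuple>, String> {
    root.init(ctx)?;
    let mut out = Vec::new();
    while let Some(tuple) = root.next(ctx)? {
        out.push(tuple);
    }
    Ok(out)
}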

2. Transaction & MVCC Mechanics

sequenceDiagram
    participant Session
    participant TM as TransactionManager
    participant Runtime as TxnRuntime
    participant Exec as Operator
    participant Heap as TableHeap

    Session->>TM: begin(isolation, access_mode)
    TM-->>Session: Transaction handle
    Exec->>Runtime: TxnRuntime::new(&TM, &mut txn)
    Runtime-->>Exec: {snapshot, command_id}
    loop per tuple
        Exec->>Heap: next()
        Heap-->>Exec: (rid, meta, tuple)
        Exec->>Runtime: is_visible(meta)?
        Runtime-->>Exec: yes/no (uses cached snapshot)
        alt Visible
            Exec->>Runtime: lock_row(...)
            Exec->>TM: record undo + WAL
        end
    end
    Session->>TM: commit(txn)
    TM->>WAL: Transaction(Commit)
    TM->>TM: wait_for_durable / flush_until
    TM-->>Session: release locks, clear snapshot

  • Snapshots

    • Read Committed / Read Uncommitted: Every command refreshes its snapshot and clears any cached value on the transaction handle.
    • Repeatable Read / Serializable: The first command captures a snapshot (xmin, xmax, active_txns) and stores it inside Transaction. Subsequent commands reuse it, ensuring consistent reads.
    • Background MVCC vacuum consults TransactionManager::oldest_active_txn() to compute safe_xmin and prunes tuple versions whose inserting/deleting transactions precede that boundary.
  • Locking
    Multi-granularity 2PL (IS/IX/S/SIX/X) enforced by LockManager. Repeatable Read releases shared locks at the end of each command (after verifying visibility). Serializable keeps shared locks through commit. Deadlocks are detected via a wait-for graph; a victim simply fails its lock request.

  • Undo & WAL
    Transaction maintains logical undo entries. On abort, the manager emits CLR records and performs the inverse heap operations. Commit waits depend on synchronous_commit. Buffer frames retain their page_lsn so WAL-before-data holds.

  • Executor safeguards
    PhysicalUpdate skips tuple versions created by the same (txn_id, command_id) during the current command. This prevents re-processing the freshly inserted MVCC version and thereby avoids infinite update loops.


3. Storage & Buffering

  • TableHeap: Tuple metadata (TupleMeta) stores inserting/deleting txn ids, command ids, and forward/back pointers for version chains. Helpers like mvcc_update stitch new versions while marking old ones deleted.
  • B+Tree: B-link tree implementation with separate codecs for header/internal/leaf pages. Index maintenance occurs in DML operators after the heap change succeeds.
  • BufferManager: Combines a page table, LRU-K replacer (with TinyLFU admission option), and per-frame guards. Dirty pages record their first-dirty LSN, feeding checkpoints. The background writer periodically flushes dirty frames and drives lazy index cleanup.
  • DiskScheduler: Uses io_uring worker threads. Foreground tasks push ReadPage, WritePage, WriteWal, and FsyncWal commands through lock-free queues and receive completion on dedicated channels.

4. Write-Ahead Logging & Recovery

  • WalManager manages log sequence numbers, buffers frames, writes physical (PageWrite, PageDelta) and logical (HeapInsert/Update/Delete) records, and coordinates flushes. First-page-writes guard against torn pages.
  • background::spawn_checkpoint_worker emits Checkpoint records capturing the dirty page table and active transactions so recovery can cut replay short.
  • RecoveryManager executes ARIES-style analysis → redo → undo on restart, leveraging CLRs to keep undo idempotent.
  • WAL and data I/O both use the DiskScheduler, keeping durability guarantees in one place.

5. Background Services

  • WAL writer: Coalesces buffered WAL into durable segments. Trigger: WalManager::start_background_flush.
  • Checkpoint: Flushes LSNs, records ATT + DPT snapshots, resets the FPW epoch. Trigger: configurable interval (WalOptions::checkpoint_interval_ms).
  • Buffer writer: Flushes dirty frames, runs index lazy cleanup based on pending garbage counters. Trigger: IndexVacuumConfig.
  • MVCC vacuum: Iterates table heaps, removing committed-deleted or aborted-inserted tuples older than safe_xmin. Trigger: MvccVacuumConfig (interval + batch limit).

All workers are registered with background::BackgroundWorkers, which stops and joins them on database shutdown.


6. Example Timeline: Repeatable Read UPDATE

T1 (RR)                               T2 (RC)
-----------                           -----------
BEGIN;                                BEGIN;
SELECT ... (snapshot S0)              UPDATE row -> new version V1
                                      COMMIT (WAL + flush)
SELECT again -> sees original value
COMMIT
-- background vacuum later reclaims deleted version when safe_xmin > delete_txn
  • T1’s TxnRuntime caches snapshot S0 on its first command and reuses it, so the second SELECT filters out V1 even though T2 committed.
  • Row-level shared locks acquired during the read are released at the end of the command, while the MVCC snapshot keeps the view consistent.
  • When T1 commits, locks are dropped, snapshot cache is cleared, and WAL commit record becomes durable. Vacuum eventually removes T1’s deleted predecessor when all transactions with txn_id < safe_xmin have finished.

7. Observability & Configuration

  • Enable tracing via RUST_LOG=trace to inspect lock grant/queue events and MVCC vacuum activity.
  • Key knobs exposed through CLI/environment: WAL segment size, background intervals, default isolation level, MVCC vacuum batch size.
  • background::BackgroundWorkers::snapshot() reports active worker metadata; you can surface it for debugging endpoints.

8. Roadmap

  • Predicate locking / SSI to upgrade Serializable beyond strict 2PL.
  • Cost-based optimization with catalog statistics.
  • Smarter vacuum pacing tied to storage pressure and tuple churn.
  • Parallel execution and adaptive readahead hints based on operator feedback.

Even with these future items, the current layering mirrors production databases (e.g., PostgreSQL): MVCC + 2PL, ARIES-style WAL, asynchronous maintenance, and a modular Volcano executor—all while keeping the codebase approachable for teaching and experimentation.

Contributor's Guide

Welcome, and thank you for your interest in contributing to QuillSQL! Whether you're fixing a bug, adding a new feature, or improving the documentation, this guide will help you get started.

1. Getting Started: Your Development Environment

Prerequisites

  • Rust: QuillSQL is written in Rust. If you don't have it yet, install it via rustup. This will provide you with rustc (the compiler) and cargo (the package manager and build tool).
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    
  • Build Essentials: Ensure you have a C/C++ toolchain such as gcc or clang installed; some Rust crates compile native code and depend on it.

Setup

  1. Fork the Repository: Start by forking the main QuillSQL repository to your own GitHub account.

  2. Clone Your Fork: Clone your forked repository to your local machine.

    git clone https://github.com/YOUR_USERNAME/QuillSQL.git
    cd QuillSQL
    
  3. Build the Project: Compile the entire project to ensure everything is set up correctly.

    cargo build
    

2. Development Workflow

Running Tests

Before and after making any changes, it's crucial to run the test suite to ensure you haven't broken anything.

  • Run all unit and integration tests:

    cargo test
    
  • Run the benchmark suite:

    cargo bench
    

Code Style and Quality

We adhere to the standard Rust coding style and use tools to enforce it.

  • Formatting: Before committing, please format your code with rustfmt.

    cargo fmt --all
    
  • Linting: We use clippy to catch common mistakes and improve code quality. Please ensure clippy passes without warnings.

    cargo clippy --all-targets -- -D warnings
    

Submitting Your Contribution

  1. Create a New Branch: Create a descriptive branch name for your feature or bugfix.

    git checkout -b my-awesome-feature
    
  2. Make Your Changes: Write your code. Add new tests to cover your changes. Ensure all existing tests still pass.

  3. Format and Lint: Run cargo fmt and cargo clippy as described above.

  4. Commit Your Work: Write a clear and concise commit message.

    git add .
    git commit -m "feat: Add support for window functions"
    
  5. Push to Your Fork: Push your branch to your fork on GitHub.

    git push -u origin my-awesome-feature
    
  6. Open a Pull Request: Go to the original QuillSQL repository on GitHub. You should see a prompt to open a Pull Request from your new branch. Fill out the PR template with a description of your changes.

3. Working on the Documentation

The documentation is built using mdbook. To preview your changes locally, you'll need to install it and the mermaid plugin.

  1. Install mdbook and mdbook-mermaid:

    cargo install mdbook
    cargo install mdbook-mermaid
    
  2. Serve the Book Locally: Run the following command from the root of the project.

    mdbook serve docs
    

    This will build the book and start a local web server. You can open your browser to http://localhost:3000 to see the live-previewed documentation.

Buffer Manager

In any disk-based database, the speed difference between main memory (RAM) and persistent storage (SSD/HDD) is enormous. The Buffer Manager is the component designed to solve this problem. It acts as a cache, managing a region of main memory called the Buffer Pool and moving data pages between disk and memory as needed.

Its primary goal is to minimize disk I/O by keeping frequently accessed pages in memory.

This section is divided into the following parts:

  • Page & Page Guards: Explains the core concepts of pinning and the RAII guards used to safely access pages.
  • The Buffer Pool: A deep dive into the architecture and lifecycle of a page request, including the page table, replacer, and concurrency.

Page & Page Guards

Before the Buffer Manager can hand out a reference to a page in memory, it must ensure that the page won't be evicted while it's being used by another thread. This is accomplished by pinning.

Pinning

Pinning simply means incrementing a "pin count" associated with the page's frame in the buffer pool. A frame with a pin count greater than zero is forbidden from being chosen as a victim by the page replacer.

  • When a thread wants to use a page, it must first pin it.
  • When the thread is finished with the page, it must unpin it (decrementing the count).

Manually managing pin counts is tedious and error-prone. Forgetting to unpin a page leads to a memory leak, as the frame can never be evicted. To solve this, QuillSQL uses a common and powerful C++ and Rust pattern: Resource Acquisition Is Initialization (RAII).

ReadPageGuard and WritePageGuard

Instead of returning a raw pointer to the page memory, the BufferManager's fetch_page_* methods return a guard object: ReadPageGuard or WritePageGuard.

These guards are responsible for the lifetime of the pin and the lock on the page:

  1. Acquisition: When a PageGuard is created, its constructor acquires the appropriate lock (RwLock) on the page's frame and increments the frame's pin count.

    • ReadPageGuard takes a read lock, allowing multiple concurrent readers.
    • WritePageGuard takes an exclusive write lock.
  2. Usage: The calling code uses the guard object to access the page's data. The guard provides safe, locked access to the underlying byte array.

  3. Release: When the guard variable goes out of scope (e.g., at the end of a function), its drop() method is automatically called by the Rust compiler. This drop() implementation handles all the cleanup:

    • It decrements the pin count.
    • It releases the lock on the frame.
    • If it's a WritePageGuard and the data was modified, it informs the BufferManager that the page is now dirty.

This RAII pattern makes using the buffer pool much safer and more ergonomic, as it makes it impossible to forget to unpin a page or release a lock.
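
A minimal sketch of the idea, assuming a hypothetical Frame type with an atomic pin count; the real ReadPageGuard/WritePageGuard additionally hold the frame's RwLock guard and track the dirty flag.

use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

// Hypothetical frame: just the pieces needed to show pin/unpin via RAII.
struct Frame {
    pin_count: AtomicU32,
    data: [u8; 4096],
}

struct ReadPageGuard {
    frame: Arc<Frame>,
}

impl ReadPageGuard {
    fn new(frame: Arc<Frame>) -> Self {
        // Acquisition: pin the frame so the replacer cannot evict it.
        frame.pin_count.fetch_add(1, Ordering::SeqCst);
        ReadPageGuard { frame }
    }

    fn data(&self) -> &[u8] {
        &self.frame.data
    }
}

impl Drop for ReadPageGuard {
    fn drop(&mut self) {
        // Release: unpin automatically when the guard goes out of scope,
        // so callers can never forget to do it.
        self.frame.pin_count.fetch_sub(1, Ordering::SeqCst);
    }
}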

The Buffer Pool: Architecture and Lifecycle

1. Core Components & Architecture

QuillSQL's buffer management is split into two main structs, reflecting a separation of concerns:

  • BufferPool (buffer/buffer_pool.rs): A low-level, "dumb" container. It owns the actual memory arena for pages and provides basic mapping from a PageId to a memory location (FrameId).
  • BufferManager (buffer/buffer_manager.rs): The high-level, "smart" coordinator. It contains the BufferPool and implements all the logic for fetching pages, choosing which pages to evict, and interacting with other database components like the transaction and recovery managers.

This architecture is centered around three key data structures:

  1. Frames (The Arena): The BufferPool pre-allocates a large, contiguous block of memory on the heap. This block is divided into a fixed number of smaller chunks called frames. Each frame is exactly PAGE_SIZE (4KB) and can hold the contents of one disk page.

  2. Page Table (page_table): A hash map (specifically, a concurrent DashMap) that maps a logical PageId to the FrameId where it currently resides in memory. This provides fast O(1) lookups to check if a page is already in the buffer pool.

  3. Replacer (LRUKReplacer): When a requested page is not in memory and the buffer pool is full, one of the existing pages must be evicted to make room. The Replacer is the component that implements the page replacement policy and decides which page is the best candidate for eviction. QuillSQL uses an LRU-K replacement policy, a sophisticated variant of the classic Least Recently Used (LRU) algorithm.

2. The Lifecycle of a Page Request

When another part of the database (e.g., the execution engine) needs to access a page, it calls buffer_manager.fetch_page_read(page_id) or fetch_page_write(page_id). This initiates a critical sequence of events.

The flow is as follows:

  1. Request: An executor requests Page P.

  2. Lookup: The BufferManager consults the PageTable.

  3. Case 1: Cache Hit

    • The PageTable contains an entry for P, mapping it to FrameId F.
    • The BufferManager increments the pin count for frame F.
    • It informs the LRUKReplacer that the page has been accessed (updating its priority).
    • It returns a PageGuard wrapping a reference to the memory in frame F.
  4. Case 2: Cache Miss

    • The PageTable has no entry for P.
    • The BufferManager must bring the page from disk. It asks the Replacer to choose a victim frame F_v to evict.
    • If the victim frame F_v is dirty: The BufferManager first writes the contents of F_v back to disk via the DiskScheduler. This is essential for data durability.
    • The PageTable entry for the old page residing in F_v is removed.
    • The BufferManager issues a read request to the DiskScheduler to load page P's data from disk into frame F_v.
    • The PageTable is updated with the new mapping: P -> F_v.
    • The process then continues like a cache hit: frame F_v is pinned and a PageGuard is returned.
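
The hit/miss paths above can be condensed into a sketch like the following. All types here (the HashMap page table, the naive victim picker, the disk helpers) are simplified stand-ins for the DashMap, LRUKReplacer, and DiskScheduler described earlier.

use std::collections::HashMap;

type PageId = u64;
type FrameId = usize;

// Minimal stand-ins for the real components.
struct BufferManagerSketch {
    page_table: HashMap<PageId, FrameId>, // PageId -> FrameId
    frames: Vec<Vec<u8>>,                 // the frame arena
    dirty: Vec<bool>,                     // per-frame dirty flag
    pin_count: Vec<u32>,                  // per-frame pin count
    frame_to_page: Vec<Option<PageId>>,   // reverse mapping for eviction
}

impl BufferManagerSketch {
    fn fetch_page(&mut self, page_id: PageId) -> Result<FrameId, String> {
        // Case 1: cache hit -- pin and return the existing frame.
        if let Some(&frame_id) = self.page_table.get(&page_id) {
            self.pin_count[frame_id] += 1;
            self.record_access(frame_id);
            return Ok(frame_id);
        }
        // Case 2: cache miss -- pick a victim, flush it if dirty, reuse its frame.
        let victim = self.pick_victim().ok_or_else(|| "all frames are pinned".to_string())?;
        if self.dirty[victim] {
            self.write_frame_to_disk(victim); // WAL must be flushed first (see section 4)
            self.dirty[victim] = false;
        }
        if let Some(old_page) = self.frame_to_page[victim] {
            self.page_table.remove(&old_page);
        }
        self.read_page_from_disk(page_id, victim);
        self.page_table.insert(page_id, victim);
        self.frame_to_page[victim] = Some(page_id);
        self.pin_count[victim] += 1;
        self.record_access(victim);
        Ok(victim)
    }

    // The helpers below stand in for the replacer and DiskScheduler calls.
    fn pick_victim(&self) -> Option<FrameId> {
        (0..self.frames.len()).find(|&f| self.pin_count[f] == 0)
    }
    fn record_access(&mut self, _frame: FrameId) {}
    fn write_frame_to_disk(&mut self, _frame: FrameId) {}
    fn read_page_from_disk(&mut self, _page: PageId, _frame: FrameId) {}
}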

3. Concurrency

The buffer pool is a shared resource accessed by many concurrent threads. QuillSQL uses a combination of locking strategies:

  • Page Table: A DashMap is used for the page table, which is highly optimized for concurrent reads and writes.
  • Frame-Level Locks: Each frame in the pool has its own RwLock. A ReadPageGuard acquires a read lock on the frame, allowing multiple threads to read the same page concurrently. A WritePageGuard acquires an exclusive write lock, ensuring that only one thread can modify a page at a time.
  • Replacer/Free List: These shared structures are protected by a Mutex.

4. Integration with WAL and Recovery

The Buffer Manager is a key player in the ARIES recovery protocol.

  • Dirty Pages: When a WritePageGuard is dropped, if it has modified the page, it marks the page as dirty. The BufferManager maintains a dirty_page_table that tracks all dirty pages and the Log Sequence Number (LSN) of the first WAL record that caused the page to become dirty.
  • Forced WAL (Write-Ahead Logging): Before a dirty page can be written back to disk (either during eviction or a checkpoint), the BufferManager must ensure that all WAL records up to that page's current LSN have been flushed to durable storage. This is the fundamental WAL rule and is enforced by the flush_page method, which calls wal_manager.flush(lsn) before writing the page to disk.
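
A sketch of that ordering constraint, using a stand-in trait for the WAL manager; the essential point is that the log flush happens before the page write.

// Sketch of the WAL-before-data rule enforced when flushing a dirty page.
// WalManagerLike is a stand-in; the real flush_page calls wal_manager.flush(lsn).

type Lsn = u64;

trait WalManagerLike {
    /// Block until all WAL records with LSN <= `lsn` are durable.
    fn flush(&self, lsn: Lsn) -> Result<(), String>;
}

struct DirtyPage {
    page_lsn: Lsn, // LSN of the last record that touched this page
    bytes: Vec<u8>,
}

fn flush_page(
    wal: &dyn WalManagerLike,
    page: &DirtyPage,
    write_to_disk: impl Fn(&[u8]) -> Result<(), String>,
) -> Result<(), String> {
    // 1. WAL rule: the log must be durable up to the page's LSN first.
    wal.flush(page.page_lsn)?;
    // 2. Only then may the data page itself be written back.
    write_to_disk(&page.bytes)
}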

For Study & Discussion

  1. Replacement Policies: QuillSQL uses LRU-K. What are the potential advantages of LRU-K over a simple LRU policy? What kind of workload would benefit most? Conversely, what are the trade-offs of using Clock/Second-Chance instead?

  2. The "Double Caching" Problem: We mentioned that using Direct I/O helps avoid double caching. If Direct I/O is disabled, how does the database's buffer pool interact with the operating system's file system cache? Why can this lead to suboptimal performance?

  3. Programming Exercise: Implement a ClockReplacer that adheres to the Replacer trait. Modify the BufferManager to use your new replacer instead of LRUKReplacer. Run the benchmark suite and compare the performance. Does it change?

  4. Programming Exercise: Add metrics to the BufferManager to track the buffer pool hit rate (i.e., num_hits / (num_hits + num_misses)). You could expose this via a new SHOW BUFFER_STATS; SQL command.

Storage Engine

The storage engine is where the logical world of SQL tables and rows meets the physical world of disk blocks and bytes. Understanding how data is laid out on disk is fundamental to understanding database performance, concurrency control, and recovery.

This section is divided into the following parts:

  • Disk I/O: A look at the asynchronous I/O layer using io_uring.
  • Page & Tuple Layout: A deep dive into how pages and tuples are physically structured on disk, including the slotted page layout.
  • Table Heap: Explains how tuple versions are managed for MVCC.

Disk I/O — Scheduler, io_uring Data Pages & WAL Runtime

1. Architecture

  • Request Path: foreground components enqueue DiskRequest objects via DiskScheduler::{schedule_read, schedule_write, …}. A dispatcher thread drains the global channel and distributes work round-robin to N io_uring workers. Each worker owns its own ring and file-descriptor cache, so once a request is forwarded, execution proceeds entirely off the foreground thread.
  • Stable APIs: schedule_read(page_id), schedule_write(page_id, Bytes), schedule_read_pages(Vec<PageId>), schedule_allocate(), schedule_deallocate(page_id) — every call returns a channel the caller can block on or poll.
  • Batch Reads: ReadPages fans out per-page SQEs while a shared BatchState tracks completions. Even if the kernel completes I/O out of order, the caller receives a Vec<BytesMut> that preserves the original page order.
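
The ordering guarantee can be illustrated with a toy version of the batch-completion bookkeeping; Completion and assemble_ordered are hypothetical names, and the real BatchState works with BytesMut buffers and channels rather than plain vectors.

// Completions may arrive in any order; each carries the index of the page it
// belongs to, so the batch buffer can be filled positionally.
struct Completion {
    index: usize,   // position of the page in the original request
    bytes: Vec<u8>, // page contents
}

fn assemble_ordered(num_pages: usize, completions: Vec<Completion>) -> Option<Vec<Vec<u8>>> {
    let mut slots: Vec<Option<Vec<u8>>> = vec![None; num_pages];
    for c in completions {
        slots[c.index] = Some(c.bytes);
    }
    // The caller only receives the Vec once every slot has been filled.
    slots.into_iter().collect()
}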

2. WAL Runtime (buffered I/O)

  • Dedicated WAL runtime threads handle sequential WAL appends/reads using buffered I/O. Each thread keeps a cache of open segment files, eliminating repeated open()/close() calls on every log record.
  • Worker count defaults to max(1, available_parallelism / 2) but is tunable through IOSchedulerConfig.
  • Optional sync on a request triggers sync_data / fdatasync so WalManager can honour synchronous commit or checkpoint barriers. Data pages stay on the io_uring dataplane; WAL always uses buffered writes.

3. io_uring Backend (Linux)

  • Each worker owns an IoUring with configurable queue_depth, optional SQPOLL idle timeout, and a pool of registered fixed buffers sized to PAGE_SIZE. Workers submit SQEs asynchronously and drain CQEs in small batches to keep the ring warm.
  • Read batching relies on shared BatchState instances (Rc<RefCell<_>>) so multi-page callers see ordered results without blocking the kernel on serialization.
  • Writes keep their payload alive until completion; if a fixed buffer slot is available we reuse it, otherwise we fall back to heap buffers. A companion WriteState tracks an optional fdatasync so the caller still observes exactly one Result<()> once all CQEs land.
  • Errors (short read/write, errno) are normalised into QuillSQLError values that flow back on the original channel.

4. Configuration

  • config::IOSchedulerConfig controls:
    • workers: number of io_uring workers (default = available parallelism).
    • wal_workers: WAL runtime threads (default workers / 2).
    • iouring_queue_depth, iouring_fixed_buffers, iouring_sqpoll_idle_ms.
    • fsync_on_write: whether data-page writes also issue fdatasync (WAL sync is managed separately by WalManager).

5. Concurrency & Safety

  • Worker-local file descriptors plus positional I/O remove shared mutable state on the hot path. The per-worker handle cache further reduces syscall overhead.
  • Shutdown sequence: enqueue Shutdown, dispatcher forwards it to every worker, each worker drains outstanding SQEs/CQEs, and finally dispatcher + workers are joined.
  • BufferPool, TableHeap, and the streaming scan ring buffer still integrate via channels; inflight guards prevent duplicate page fetches.

6. Performance Notes

  • Random page access benefits from fewer syscalls and deeper outstanding queue depth than the blocking fallback.
  • Only the io_uring backend currently ships (Linux x86_64). A portable fallback remains future work.
  • For large sequential scans, combine ReadPages with the ring-buffer iterator to minimise buffer-pool churn.

7. Future Work

  • Queue-depth aware scheduling and CQE bulk harvesting.
  • Optional group commit (aggregate writes, single fsync) behind configuration.
  • Metrics hooks (queue depth, submit/complete throughput, latency percentiles, error codes).
  • Cross-platform fallback backend and richer prioritisation/throttling policies.
  • Control-plane knobs for throttling individual background workers.

Page and Tuple Layout

1. The Page: The Atomic Unit of I/O

A database file is not treated as one continuous stream of data. Instead, it is broken down into fixed-size blocks called pages. A page is the atomic unit of transfer between the disk and the in-memory Buffer Pool. Whenever the database needs to read a piece of data (like a single row), it must load the entire page containing that data into memory.

In QuillSQL, the page size is a constant defined in quill-sql/src/buffer/mod.rs:

  • PAGE_SIZE: 4096 bytes (4 KB)

This fixed-size approach simplifies buffer management and allows for efficient, aligned I/O operations, especially when using Direct I/O to bypass the OS cache.

2. TablePage: The Slotted Page Layout

While a page is a generic 4KB block of bytes, pages that store actual table data are structured in a specific way. QuillSQL uses a classic Slotted Page layout, which is a core concept in database implementation (as taught in CMU 15-445).

A TablePage is organized into three main parts:

<------------------------------ 4KB ------------------------------>
+----------------+-----------------+-----------------+--------------+
|  Page Header   |   Slot Array    |      Free       |   Tuple      |
| (grows ->)     |   (grows ->)    |      Space      |     Data     |
|                |                 |                 | (<- grows)   |
+----------------+-----------------+-----------------+--------------+
  1. Page Header (TablePageHeader): Located at the beginning of the page. It contains metadata about the page itself.
  2. Slot Array (tuple_infos): An array of TupleInfo structs that grows from after the header. Each entry in this array acts as a "pointer" or "directory entry" for a tuple on the page.
  3. Tuple Data: The actual raw data of the tuples is stored starting from the end of the page, growing backwards towards the middle.

This design has a key advantage: it decouples a tuple's logical identifier from its physical location on the page.

The RecordId (RID)

A specific tuple is uniquely identified by a RecordId (RID). The RID is a stable pointer composed of two parts:

  • page_id: The ID of the page where the tuple resides.
  • slot_num: The index into the Slot Array on that page.

So, RID = (page_id, slot_num).

When the database needs to delete a tuple or if a variable-length tuple is updated and grows in size, the tuple's data might need to be moved within the page (for compaction). In a slotted page design, we only need to update the offset in the corresponding slot array entry. The tuple's RID (page_id, slot_num) remains unchanged. This prevents a cascade of updates to all secondary indexes that might be pointing to that tuple.
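
A sketch of that indirection with simplified types (the real slot also carries TupleMeta, and the codecs live in storage/codec/table_page.rs).

// Sketch of the RID indirection: compaction only rewrites the slot's
// offset/size, so the (page_id, slot_num) pair handed to indexes stays valid.

type PageId = u64;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct RecordId {
    page_id: PageId,
    slot_num: u16,
}

struct Slot {
    offset: u16, // where the tuple bytes start inside the 4 KB page
    size: u16,   // length of the tuple bytes
}

struct TablePageSketch {
    data: [u8; 4096],
    slots: Vec<Slot>,
}

impl TablePageSketch {
    /// Resolve a slot number to the tuple's raw bytes. Moving tuple data only
    /// updates the Slot entry; callers holding the RID are unaffected.
    fn tuple_bytes(&self, slot_num: u16) -> Option<&[u8]> {
        let slot = self.slots.get(slot_num as usize)?;
        let start = slot.offset as usize;
        let end = start + slot.size as usize;
        self.data.get(start..end)
    }
}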

TablePageHeader and Slot (TupleInfo)

Let's look at the physical layout, as defined in storage/codec/table_page.rs:

  • TablePageHeader:

    • lsn: The Log Sequence Number of the last change made to this page, crucial for WAL recovery.
    • next_page_id: The ID of the next page in this table, forming a linked list of pages.
    • num_tuples: The number of active tuples on the page.
    • num_deleted_tuples: The number of "dead" or deleted tuples.
    • tuple_infos: The slot array itself.
  • TupleInfo (A single slot in the array):

    • offset: The byte offset from the beginning of the page where the tuple's data begins.
    • size: The size of the tuple's data in bytes.
    • meta: A nested TupleMeta struct containing critical information for concurrency control.

For Study & Discussion

  1. Layout Trade-offs: What is the main benefit of having the tuple data grow from the end of the page backwards, while the header and slot array grow from the beginning forwards? What happens when they meet?

  2. Record ID Stability: Why is it so important that a tuple's RecordId does not change even if the tuple's physical data is moved within the page? What would break if the RID was just a direct byte offset?

  3. Large Objects: The current design assumes a tuple fits entirely on one page. How would you modify this page layout to support tuples that are larger than 4KB (e.g., a long blog post stored in a VARCHAR column)? Research how systems like PostgreSQL handle this with their "TOAST" (The Oversized-Attribute Storage Technique) mechanism.

  4. Programming Exercise: Implement a defragment() method for the TablePage. After several insertions and deletions, the free space on a page can become fragmented into small, unusable chunks. This method should reorganize the page by moving the existing tuples to be contiguous, creating a single, large block of free space. Remember to update the offset in each TupleInfo slot after moving its corresponding tuple data!

Table Heap and MVCC

The TableHeap (storage/table_heap.rs) is the component responsible for managing the collection of pages that belong to a single table. While the TablePage defines the layout within a single page, the TableHeap provides the high-level API for inserting, updating, and deleting tuples, making it central to the implementation of MVCC.

Tuple Serialization

A logical Tuple struct in memory must be converted into a byte array to be stored on a page. This process is handled by storage/codec/tuple.rs.

The serialized format of a tuple consists of two parts:

  1. Null Bitmap: A compact bitmap at the beginning of the tuple data. Each bit corresponds to a column in the schema; if the bit is 1, the column's value is NULL. This avoids storing any data for null fields.
  2. Attribute Data: The actual data for all non-null columns, serialized one after another.
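
A sketch of that layout, assuming one bit per column with bit i set when column i is NULL; the actual codec in storage/codec/tuple.rs may order bits or length-prefix attributes differently.

// Serialize a row as [null bitmap][attribute data], one bit per column.
fn serialize_tuple(values: &[Option<Vec<u8>>]) -> Vec<u8> {
    let bitmap_len = (values.len() + 7) / 8;
    let mut out = vec![0u8; bitmap_len];
    for (i, v) in values.iter().enumerate() {
        if v.is_none() {
            out[i / 8] |= 1 << (i % 8); // bit set => column is NULL, no data stored
        }
    }
    for v in values.iter().flatten() {
        out.extend_from_slice(v); // only non-null attributes contribute bytes
    }
    out
}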

TupleMeta: The Heart of MVCC

The most important part of a tuple's on-disk metadata is the TupleMeta struct, which is stored directly within the TupleInfo slot in the page header. This is the heart of QuillSQL's Multi-Version Concurrency Control (MVCC) implementation.

  • insert_txn_id: The ID of the transaction that created this version of the tuple.
  • delete_txn_id: The ID of the transaction that "deleted" this version. (A value of 0 or INVALID means it's not deleted).
  • is_deleted: A boolean flag indicating the tuple version is considered deleted.
  • prev_version: Option<RecordId>: A link (RID) to the previous version of this logical row.
  • next_version: Option<RecordId>: A link (RID) to the next version of this logical row.

These fields allow QuillSQL to maintain a version chain for each logical row. When a row is updated, the TableHeap's mvcc_update method is called, which, instead of overwriting the data, performs the following steps:

  1. Creates a new tuple version with the new data by calling insert_tuple.
  2. Sets the prev_version pointer of this new version to the RID of the old version.
  3. Updates the next_version pointer of the old version to point to the new version.
  4. Sets the delete_txn_id on the old version to mark it as "dead".

A transaction can then traverse this chain and use the transaction IDs in the TupleMeta to determine which version of a row is visible to it based on its own transaction ID and isolation level, thus achieving transaction isolation without long-held read locks.
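
The version-chain stitching in steps 2 through 4 can be sketched with a simplified TupleMeta; the Option-based "not deleted" encoding and the helper name are assumptions, not the real API.

type TxnId = u64;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct RecordId { page_id: u64, slot_num: u16 }

// Simplified version of the per-tuple metadata described above.
struct TupleMeta {
    insert_txn_id: TxnId,
    delete_txn_id: Option<TxnId>, // None => this version is not deleted
    prev_version: Option<RecordId>,
    next_version: Option<RecordId>,
}

/// Stitch a freshly inserted version (`new_rid`, `new_meta`) into the chain
/// after `old_meta`, marking the old version as deleted by `txn`.
fn link_new_version(
    old_meta: &mut TupleMeta,
    new_meta: &mut TupleMeta,
    old_rid: RecordId,
    new_rid: RecordId,
    txn: TxnId,
) {
    new_meta.insert_txn_id = txn;
    new_meta.prev_version = Some(old_rid); // step 2
    old_meta.next_version = Some(new_rid); // step 3
    old_meta.delete_txn_id = Some(txn);    // step 4: old version is now "dead"
}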


For Study & Discussion

  1. Version Chain Traversal: Imagine a transaction with ID T10 needs to read a row. It finds a version of the row that was inserted by T5 (committed) but deleted by T12 (in-progress). Should T10 see this version? What if the deleter was T8 (committed)? What if the deleter was T10 itself? Walk through the visibility check logic.

  2. Garbage Collection: The MVCC model creates many "dead" tuple versions that are no longer visible to any active or future transaction. What is the long-term problem with leaving these dead versions in the database? This problem is typically solved by a process called vacuuming or garbage collection. When is it safe to physically remove a dead tuple version?

  3. Programming Exercise (Advanced): Implement a basic VACUUM function for a TableHeap. This function should scan the table and identify dead tuple versions that are no longer visible to any currently active transaction. Once identified, it should physically remove them from their pages. This is a challenging exercise that touches transaction management, storage, and concurrency.

Indexes

Indexes are crucial for database performance. They are redundant data structures that allow the system to find rows matching specific criteria quickly, without having to scan the entire table. While indexes speed up queries (reads), they incur a cost during data modification (writes), as both the table and its indexes must be updated.

QuillSQL currently provides a BPlusTreeIndex.

Key Concepts

  • Data Structure: The index is built using a B+Tree, a self-balancing tree data structure that maintains sorted data and allows for efficient insertion, deletion, and search operations (typically in O(log n) time).

  • Index Keys: The B+Tree stores key-value pairs. The "key" is the value from the indexed column(s) (e.g., a user's ID), and the "value" is the RecordId (RID) of the row containing that key. This allows the database to first find the key in the B+Tree and then use the RID to immediately locate the full row data on its data page.

  • Concurrency Control: The B+Tree implementation must be highly concurrent. QuillSQL uses advanced techniques such as B-link sibling pointers and latch crabbing to allow multiple readers and writers to access the tree simultaneously with minimal contention.

This section contains:

  • B+Tree: A look at the classic B+Tree data structure.

B+ Tree Index — Architecture and Concurrency

1. Architecture Overview

1.1 Node and Page Structure

  • Node Types: Internal and Leaf.
    • Internal Nodes: Store separator keys and pointers to child pages. The first pointer is a "sentinel" that points to the subtree for all keys less than the first key in the node.
    • Leaf Nodes: Store (key, RecordId) pairs in sorted order. Leaves are linked together in a singly linked list (next_page_id) to allow for efficient range scans.
  • Header Page: A fixed page (header_page_id) that acts as the entry point to the tree. It stores the root_page_id, allowing for atomic updates to the tree root when it splits or shrinks.
  • Page Layout:
    • Internal Page: {Header, High Key, Array<(Key, ChildPageId)>}. The High Key is part of the B-link optimization.
    • Leaf Page: {Header, Array<(Key, RID)>}.

The tree uses B-link pointers (next_page_id) on all levels (both internal and leaf nodes). This creates a "side-link" to the right sibling. This is crucial for concurrency, as it allows readers to recover from transient inconsistent states caused by concurrent page splits by "chasing" the link to the right sibling.

              +------------------+
              |   Root (Int)     |
              +------------------+
             /         |         \
   +--------+      +--------+      +--------+
   | Int P1 |----->| Int P2 |----->| Int P3 |  (Internal B-link pointers)
   +--------+      +--------+      +--------+
   /   |   \      /   |   \      /   |   \
+----+ +----+  +----+ +----+  +----+ +----+
| L1 |-| L2 |->| L3 |-| L4 |->| L5 |-| L6 | (Leaf chain / B-links)
+----+ +----+  +----+ +----+  +----+ +----+

2. Concurrency Control

The B+Tree employs a sophisticated, lock-free read path and a high-concurrency write path using latch crabbing.

2.1 Read Path: Optimistic Traversal

  • Readers traverse the tree from the root without taking any locks.
  • On each page, a version number is read before and after processing the page's contents. If the version changes, it indicates a concurrent modification, and the read operation restarts from the root.
  • If a reader is traversing an internal node and the search key is greater than or equal to the node's high_key, it knows a split has occurred. Instead of restarting, it can use the next_page_id B-link to "chase" to the right sibling and continue the search, minimizing restarts.

2.2 Write Path: Latch Crabbing

Writers (insert/delete) use a technique called latch crabbing (or lock coupling) to ensure safe concurrent modifications.

  • Process: A writer acquires a write latch on a parent node before fetching and latching a child node. Once the child is latched, the writer checks if the child is "safe" for the operation (i.e., not full for an insert, not at minimum size for a delete).
    • If the child is safe, the latch on the parent (and all other ancestors) is released.
    • If the child is unsafe, the parent latch is held, as the child might need to split or merge, which would require modifying the parent.
  • Context Struct: This process is managed by a Context struct that holds the stack of write guards (write_set) for the current traversal path. Releasing ancestor latches is as simple as clearing this stack.
Latch Crabbing on Insert:
1. Latch(Root)
2. Descend to Child C1. Latch(C1).
3. Check if C1 is safe (not full).
   IF SAFE:
     ReleaseLatch(Root). Path is now just {C1}.
   IF UNSAFE (full):
     Keep Latch(Root). Path is {Root, C1}.
4. Descend from C1 to C2. Latch(C2).
5. Check if C2 is safe... and so on.

2.3 Deadlock Avoidance

When modifying siblings (during merge or redistribution), deadlocks are possible if two threads try to acquire latches on the same two pages in opposite orders. This is prevented by enforcing a strict PageId-ordered locking protocol. When two sibling pages must be latched, the page with the lower PageId is always latched first.
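
A sketch of that ordering rule for two distinct sibling pages; Page and latch_pair are illustrative, and the real code works with write guards obtained from the buffer pool.

use std::sync::{RwLock, RwLockWriteGuard};

type PageId = u64;

struct Page {
    id: PageId,
    bytes: RwLock<Vec<u8>>,
}

/// Latch two distinct sibling pages, always taking the lower PageId first so
/// no two threads can wait on each other in opposite orders.
fn latch_pair<'a>(
    a: &'a Page,
    b: &'a Page,
) -> (RwLockWriteGuard<'a, Vec<u8>>, RwLockWriteGuard<'a, Vec<u8>>) {
    // Assumes a and b are different pages; latching the same page twice would self-deadlock.
    if a.id <= b.id {
        let first = a.bytes.write().unwrap();
        let second = b.bytes.write().unwrap();
        (first, second)
    } else {
        let second = b.bytes.write().unwrap();
        let first = a.bytes.write().unwrap();
        (first, second)
    }
}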

3. Key Algorithms & Features

  • Parent-Guided Redirection: During an insert or delete, after a writer has descended to a leaf, it re-validates its position using the parent (if a latch is still held). If a concurrent split has moved the target key range to a different sibling, the writer can jump directly to the correct page instead of traversing the leaf chain, preventing race conditions.
  • Iterator: The iterator performs a forward scan by following the leaf chain (next_page_id). It uses a lightweight form of OLC, checking the leaf page version to detect concurrent modifications and restart if necessary to ensure it doesn't miss keys.
    • Sequential Scan Optimization (RingBuffer): For large range scans, the iterator switches to a "synchronous batch fetch + local ring buffer" mode. It fills a small RingBuffer<BytesMut> with consecutive leaf pages (by following next_page_id) and then decodes KVs locally without holding page guards for long. This reduces buffer pool pollution and syscall/lock overhead.
    • Two Iteration Modes:
      • Guard Mode: Keep a ReadPageGuard and decode per step; prefetch next leaf best-effort.
      • Bytes Mode: After switching, decode from BytesMut buffers in the local ring; when a leaf is exhausted, pop next bytes from the ring or refill by following the chain.
  • Prefix Compression: Keys in internal nodes are prefix-compressed to save space. Each key is stored as (lcp, suffix_len, suffix). This reduces the size of internal pages, increasing the tree's fanout and reducing its height, which improves cache performance and reduces I/O.
  • Split/Merge Safety:
    • Split: When a node splits, the new right sibling is written first. Then, the B-link pointer and separator key are published atomically by updating the left sibling and parent. This ensures readers can always navigate the structure correctly.
    • Merge/Redistribute: When a node is underfull, the implementation first tries to borrow an entry from a sibling (redistribute). If both siblings are at minimum size, it merges with a sibling. All these operations carefully maintain the B-link chain and parent pointers.

4. Benchmarks & Performance

4.1 Example: Range Scan Benchmark

This benchmark measures the efficiency of the leaf-chain traversal, which is critical for SELECT queries with WHERE clauses on indexed columns. It benefits from iterator prefetching and prefix compression.

// Pseudo-code for a range scan benchmark. `BPlusTreeIndex`, `TreeIndexIterator`,
// and the `create_*` helpers stand in for the real crate APIs.
use std::sync::Arc;
use std::time::Instant;

fn benchmark_range_scan(index: Arc<BPlusTreeIndex>, num_keys: i64, num_passes: usize) {
    // 1. Populate the index with sequential keys.
    for key in 1..=num_keys {
        let tuple = create_tuple_from_key(key, index.key_schema.clone());
        index.insert(&tuple, create_rid_from_key(key)).unwrap();
    }

    // 2. Scan the full key range `num_passes` times by walking the leaf chain.
    let start = Instant::now();
    let mut total_items_scanned = 0usize;
    for _ in 0..num_passes {
        // Create an iterator over the full key range.
        let mut iter = TreeIndexIterator::new(index.clone(), ..);
        while iter.next().unwrap().is_some() {
            total_items_scanned += 1;
        }
    }
    let elapsed = start.elapsed();
    let items_per_sec = total_items_scanned as f64 / elapsed.as_secs_f64();

    println!(
        "Range Scan: Scanned {} items in {:?}. Throughput: {:.2} items/sec",
        total_items_scanned, elapsed, items_per_sec
    );
}

4.2 Performance Notes

  • Hot Reads: Performance on hot-spot reads depends on keeping upper levels and hot leaves resident in the buffer pool. Warm up the cache for read-heavy benchmarks. Protect hot pages from pollution by enabling TinyLFU admission.
  • Large Range Scans: Prefer table SeqScan with ring buffer (bypass) when scanning most of the table. For index scans over very large ranges, consider future Bitmap Heap Scan rather than bypassing the pool for leaves.

4.3 Configuration

  • config::BTreeConfig controls iterator behavior:
    • seq_batch_enable (bool): enable batch mode with local ring buffer.
    • seq_window (usize): number of leaf pages to prefill into the ring per refill.
    • prefetch_enable/prefetch_window: guard-mode prefetch hints to the buffer pool. Defaults are conservative; increase seq_window for long scans to reduce per-leaf I/O round trips.

5. Future Work

  • Stronger OLC with bounded retries and telemetry.
  • CSB+-like internal layout and columnar key prefixing.
  • NUMA-aware partitioning and router.
  • WAL/MVCC for crash recovery and snapshot isolation.

Recovery Manager (WAL)

The Recovery Manager is the component that ensures the Durability aspect of ACID. It guarantees that once a transaction is committed, its effects are permanent, even in the face of system crashes or power failures.

This is achieved through a Write-Ahead Log (WAL) and a recovery protocol inspired by ARIES.

Core Concepts

  • Write-Ahead Logging: The fundamental principle is that any change to a data page must be recorded in a log file on durable storage before the data page itself is written back to disk.

  • Log Records: The WAL is a sequence of records, each with a unique Log Sequence Number (LSN). Records describe changes (PageDelta, HeapInsert), transaction outcomes (Commit, Abort), or internal state (Checkpoint).

  • ARIES Protocol: On startup, the database replays the log to restore a consistent state. This involves three phases: Analysis (to figure out what was happening during the crash), Redo (to re-apply all changes since the last checkpoint), and Undo (to roll back any transactions that did not commit).

This section contains:

  • The ARIES Recovery Protocol: A walkthrough of WAL records, checkpoints, and the Analysis, Redo, and Undo phases.
The ARIES Recovery Protocol

Of the four ACID properties, Durability is the one that guarantees that once a transaction is committed, its changes will survive any subsequent system failure. In a disk-based database, this is achieved through a careful and robust recovery protocol. QuillSQL implements a recovery strategy inspired by the well-known ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) protocol, centered around a Write-Ahead Log (WAL).

1. The Write-Ahead Logging (WAL) Principle

The fundamental rule of WAL is simple but powerful:

Before a modified data page is ever written from memory back to disk, the log record describing that modification must first be written to durable storage (the log file).

This allows the database to perform most of its data modifications in memory on the Buffer Pool without needing to synchronously write every changed page to disk, which would be extremely slow. In case of a crash, the database can use the log to reconstruct the state of the system and ensure all committed changes are present and all uncommitted changes are undone.

Log Records (WalRecord)

The WAL is a sequential, append-only file composed of log records. Each record is assigned a unique, monotonically increasing Log Sequence Number (LSN). A log record in QuillSQL (src/recovery/wal_record.rs) generally contains:

  • LSN: The address of the log record.
  • Transaction ID: The ID of the transaction that generated this record.
  • Payload: The actual content of the log record, which varies by type.

QuillSQL uses several types of log records:

  • Transaction: Marks the BEGIN, COMMIT, or ABORT of a transaction.
  • PageWrite / PageDelta: Physical/Physiological records describing a change to a page. PageWrite contains a full image of the page (used for the first write after a checkpoint, a technique called First-Page-Write or FPW), while PageDelta contains only the changed bytes.
  • HeapInsert / HeapUpdate / HeapDelete: Logical records describing a high-level heap operation. These are primarily used for generating precise undo operations.
  • Checkpoint: A special record that marks a point of partial durability, allowing the log to be truncated.
  • CLR (Compensation Log Record): A special record written during recovery to describe an undo action. CLRs are redo-only and are never undone themselves.
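
A condensed sketch of that taxonomy as a Rust enum; payload fields are simplified and HeapUpdate/HeapDelete are elided, so treat this as a reading aid rather than the actual definitions in src/recovery/wal_record.rs.

type Lsn = u64;
type TxnId = u64;
type PageId = u64;

// Condensed view of the WAL record kinds listed above.
enum WalRecordSketch {
    Transaction { txn_id: TxnId, kind: TxnMark },
    PageWrite   { page_id: PageId, full_image: Vec<u8> },        // FPW after checkpoint
    PageDelta   { page_id: PageId, offset: u16, bytes: Vec<u8> },
    HeapInsert  { txn_id: TxnId, page_id: PageId, tuple: Vec<u8> }, // HeapUpdate/HeapDelete elided
    Checkpoint  { active_txns: Vec<TxnId>, dirty_pages: Vec<(PageId, Lsn)> },
    Clr         { txn_id: TxnId, undo_next_lsn: Lsn },            // redo-only compensation
}

enum TxnMark { Begin, Commit, Abort }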

2. The ARIES Recovery Protocol

On database startup, the RecoveryManager (recovery/recovery_manager.rs) is invoked to replay() the WAL. This process follows the three phases of ARIES.

Phase 1: Analysis

The goal of the Analysis phase (recovery/analysis.rs) is to figure out the state of the database at the exact moment of the crash.

  1. It starts by finding the last successful Checkpoint record in the WAL.
  2. It then scans the log forward from that checkpoint to the end of the log.
  3. During this scan, it builds two critical data structures:
    • Active Transaction Table (ATT): A list of all transactions that have a BEGIN record but no corresponding COMMIT or ABORT record. These are the potential "loser" transactions that will need to be undone.
    • Dirty Page Table (DPT): A list of all pages that were modified in the buffer pool but might not have been written to disk before the crash. For each dirty page, it records the LSN of the first log record that made it dirty (this is called the recLSN).

At the end of this phase, the RecoveryManager knows exactly which transactions to roll back and the earliest point in the log from which it needs to start re-applying changes.
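
A sketch of the analysis scan under heavily simplified record types; the real pass also seeds the ATT and DPT from the checkpoint record before scanning forward.

use std::collections::HashMap;

type Lsn = u64;
type TxnId = u64;
type PageId = u64;

// Only the record shapes needed to illustrate the Analysis scan.
enum Record {
    Begin(TxnId),
    Commit(TxnId),
    Abort(TxnId),
    PageChange { page_id: PageId },
}

/// Scan forward from the last checkpoint, building the Active Transaction
/// Table (potential losers) and the Dirty Page Table (page -> recLSN).
fn analysis(log: &[(Lsn, Record)]) -> (Vec<TxnId>, HashMap<PageId, Lsn>) {
    let mut att: Vec<TxnId> = Vec::new();
    let mut dpt: HashMap<PageId, Lsn> = HashMap::new();
    for (lsn, record) in log {
        match record {
            Record::Begin(t) => att.push(*t),
            Record::Commit(t) | Record::Abort(t) => att.retain(|x| x != t),
            Record::PageChange { page_id } => {
                // recLSN = the FIRST record that dirtied the page.
                dpt.entry(*page_id).or_insert(*lsn);
            }
        }
    }
    (att, dpt)
}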

Phase 2: Redo (Repeating History)

The goal of the Redo phase (recovery/redo.rs) is to restore the database to its exact state at the moment of the crash, including all changes from both committed and uncommitted (loser) transactions.

  1. The Redo phase finds the smallest recLSN from the Dirty Page Table built during Analysis. This LSN is the starting point for the redo scan.
  2. It scans the log forward from this starting point.
  3. For every log record it encounters that describes a physical change to a page (e.g., PageWrite, PageDelta, Heap*), it re-applies the change. This is idempotent: if the change is already present on the page (because it was successfully flushed to disk before the crash), re-applying it does no harm.

At the end of this phase, the database state on disk is identical to how it was in memory right before the crash.
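
The idempotency argument boils down to comparing the page's LSN with the record's LSN, which a small sketch makes concrete (the in-memory map stands in for fetching pages through the buffer pool).

use std::collections::HashMap;

type Lsn = u64;
type PageId = u64;

/// Redo pass for one record: re-apply the change only if the page has not
/// already seen it. Comparing page_lsn with the record's LSN keeps redo
/// idempotent, so crashing and re-running recovery is always safe.
fn redo_one(
    pages: &mut HashMap<PageId, (Lsn, Vec<u8>)>,
    record_lsn: Lsn,
    page_id: PageId,
    apply: impl Fn(&mut Vec<u8>),
) {
    if let Some((page_lsn, bytes)) = pages.get_mut(&page_id) {
        if *page_lsn < record_lsn {
            apply(bytes);           // change missing on disk: re-apply it
            *page_lsn = record_lsn; // remember how far this page has replayed
        }
        // else: the flushed page already contains this change; skip it.
    }
}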

Phase 3: Undo (Rolling Back Losers)

The final phase (recovery/undo.rs) is responsible for rolling back all the "loser" transactions identified during Analysis.

  1. The UndoExecutor takes the list of loser transactions.
  2. For each loser transaction, it scans the WAL backwards, following the chain of log records for that transaction.
  3. For each operation record (like HeapInsert, HeapUpdate), it performs the logical inverse operation:
    • To undo an Insert, it performs a Delete.
    • To undo a Delete, it restores the deleted data.
    • To undo an Update, it restores the data from before the update.
  4. Crucially, for every undo action it performs, it writes a Compensation Log Record (CLR) to the WAL. The CLR contains information about the undo action and, importantly, a pointer to the next log record that needs to be undone for that transaction.

This use of CLRs makes the recovery process itself crash-proof. If the system crashes during the undo phase, the next time recovery runs, it will see the CLRs. It will simply continue the undo process from where it left off by following the pointers in the CLRs, without ever having to undo an undo action.

3. Checkpoints

If the log were allowed to grow forever, recovery would become slower and slower. Checkpoints are a background process that periodically creates a point of partial durability.

A checkpoint does the following:

  1. Temporarily stops new transactions from starting.
  2. Writes a BEGIN_CHECKPOINT record to the WAL, containing the current ATT and DPT.
  3. Flushes all dirty pages from the buffer pool to disk.
  4. Writes an END_CHECKPOINT record to the WAL.

Once a checkpoint is complete, the RecoveryManager knows that all changes described in log records before the BEGIN_CHECKPOINT record are safely on disk. Therefore, those older parts of the WAL file are no longer needed for recovery and can be truncated or recycled, keeping the recovery process efficient.


For Study & Discussion

  1. Repeating History: The Redo phase re-applies changes from all transactions, including those that will be undone later (the "losers"). This seems wasteful. Why is this "repeating history" approach a core principle of ARIES? What would go wrong if we tried to only redo changes from committed transactions?

  2. Compensation Log Records (CLRs): What specific problem would occur if the system crashed during the Undo phase and we didn't write CLRs? How does a CLR's "redo-only" nature make the recovery process idempotent and robust against repeated crashes?

  3. Checkpointing Frequency: What are the performance trade-offs between checkpointing very frequently versus very infrequently? Consider both normal runtime performance and the time it takes to recover after a crash.

  4. Programming Exercise: Add a new WAL record type. For example, a HeapReclaim record that logs the physical removal of a dead tuple by a vacuum process. To do this, you would need to: a. Add a variant to the HeapRecordPayload enum in wal_record.rs. b. Update the codec functions to handle its serialization. c. Decide what information the HeapReclaim record needs to contain. d. Implement the redo logic for this new record in the appropriate ResourceManager. What should happen when you redo a HeapReclaim record?

Transaction Manager

The Transaction Manager is responsible for one of the most critical aspects of a database: guaranteeing the ACID properties, specifically Isolation and Atomicity.

It allows multiple clients to access the database concurrently without interfering with each other, and it ensures that transactions are treated as single, indivisible operations (all or nothing).

Core Concepts

  • Concurrency Control: To handle simultaneous operations, QuillSQL uses a hybrid approach that combines two powerful techniques:

    1. Multi-Version Concurrency Control (MVCC): Instead of overwriting data, writers create new versions of rows. Readers are given a consistent snapshot of the database, which allows them to proceed without being blocked by writers.
    2. Two-Phase Locking (2PL): To prevent two writers from modifying the same row at the same time, a strict two-phase locking protocol is used. Transactions must acquire locks on data before modifying it and hold those locks until the transaction ends.
  • Atomicity: The transaction manager records all actions performed by a transaction. If the transaction needs to be aborted, it can use this record to perform the necessary undo operations, rolling the database back to its state before the transaction began.

This section contains:

  • MVCC and 2PL: A deep dive into QuillSQL's hybrid concurrency control model.

MVCC and 2PL

Of the four ACID properties, Isolation is often the most complex to implement. It ensures that concurrent transactions do not interfere with each other, making it appear as if each transaction is executing sequentially, one after another. Without proper isolation, a database would suffer from concurrency-related anomalies like dirty reads, non-repeatable reads, and phantom reads.

Databases typically use two main strategies for concurrency control:

  1. Pessimistic Concurrency Control: Assumes conflicts are likely and prevents them from happening by using locks. The most common protocol is Two-Phase Locking (2PL).
  2. Optimistic Concurrency Control: Assumes conflicts are rare. Transactions proceed without locking, and the database validates at commit time that no conflicts occurred. A popular variant is Multi-Version Concurrency Control (MVCC).

QuillSQL, like many modern relational databases (e.g., PostgreSQL, Oracle), implements a powerful hybrid model that combines MVCC with 2PL.

1. MVCC: Reading without Blocking

The core idea of MVCC is "writers don't block readers, and readers don't block writers." This is achieved by never overwriting data in-place. When a row is updated, the database creates a new version of that row, preserving the old one.

Version Chains and TupleMeta

As discussed in the Storage Engine chapter, every tuple on disk has associated metadata (TupleMeta) that includes:

  • insert_txn_id: The ID of the transaction that created this version.
  • delete_txn_id: The ID of the transaction that marked this version as deleted.
  • prev_version / next_version: Pointers (RIDs) that form a linked list of versions for a single logical row, called the version chain.

Transaction Snapshots

When a transaction begins, it asks the TransactionManager for a snapshot of the database state. This snapshot, defined in transaction/mvcc.rs, contains three key pieces of information:

  • xmin: The oldest active transaction ID at the time of the snapshot. Any transaction with an ID less than xmin is guaranteed to be either committed or aborted.
  • xmax: The next transaction ID to be assigned. Any transaction with an ID greater than or equal to xmax was not yet started when the snapshot was taken.
  • active_txns: A list of all other transaction IDs that were active when the snapshot was taken.

The Visibility Check

When a transaction scans the database, for every tuple version it encounters, it performs a visibility check using its snapshot. The logic in MvccSnapshot::is_visible determines if the version should be "seen" by the current transaction. In simplified terms, a tuple version is visible if:

  1. Its insert_txn_id belongs to a transaction that was already committed before our snapshot was taken.
  2. AND its delete_txn_id is either not set, OR it belongs to a transaction that was not yet committed when our snapshot was taken.

This mechanism elegantly solves several concurrency problems. Since a reader transaction only ever sees a consistent snapshot of the database, it is completely immune to changes being made by other concurrent writer transactions. This prevents dirty reads and non-repeatable reads.
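
A simplified sketch of that check. It assumes the snapshot alone is enough to decide whether a transaction finished; the real MvccSnapshot::is_visible must also confirm that a finished transaction actually committed (rather than aborted) and handle the current transaction's own uncommitted changes.

type TxnId = u64;

// Fields mirror the snapshot description above.
struct MvccSnapshotSketch {
    xmin: TxnId,
    xmax: TxnId,
    active_txns: Vec<TxnId>,
}

impl MvccSnapshotSketch {
    /// "Finished before our snapshot": older than every active txn, or inside
    /// the snapshot window but not listed as active. The real check must also
    /// verify the transaction committed rather than aborted.
    fn finished_before_snapshot(&self, txn: TxnId) -> bool {
        txn < self.xmin || (txn < self.xmax && !self.active_txns.contains(&txn))
    }

    fn is_visible(&self, insert_txn: TxnId, delete_txn: Option<TxnId>) -> bool {
        // 1. The inserting transaction must already be finished (committed).
        if !self.finished_before_snapshot(insert_txn) {
            return false;
        }
        // 2. Any deleting transaction must NOT yet be finished from our view.
        match delete_txn {
            None => true,
            Some(d) => !self.finished_before_snapshot(d),
        }
    }
}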

2. Two-Phase Locking (2PL): Preventing Write-Write Conflicts

While MVCC is excellent for read-write conflicts, it does not, by itself, prevent two transactions from trying to modify the same logical row at the same time (a write-write conflict). This is where locking comes in.

QuillSQL implements a strict Two-Phase Locking protocol via the LockManager (transaction/lock_manager.rs).

The Two Phases

  1. Growing Phase: The transaction can acquire new locks as it executes.
  2. Shrinking Phase: Once the transaction releases its first lock, it cannot acquire any new locks.

In practice, QuillSQL uses Strict 2PL, where the shrinking phase is delayed until the transaction commits or aborts. All locks are held until the very end.

Hierarchical Locking (Intention Locks)

To be efficient, the LockManager uses a multi-granularity, hierarchical locking scheme. Before a transaction can take a fine-grained lock on a row (a Shared or Exclusive lock), it must first acquire a coarser-grained intention lock on the table.

  • IntentionShared (IS): Signals the intent to read rows from the table.
  • IntentionExclusive (IX): Signals the intent to modify rows in the table.

Intention locks make conflicts across granularities cheap to detect: a transaction that wants a coarse-grained lock on a table only has to examine the table-level lock state, rather than scanning every individual row lock. For example, a transaction requesting an Exclusive (X) lock on the entire users table will be blocked if another transaction already holds an IX lock on users, because that IX lock signals that some row in the table is currently being modified.
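
The classic multi-granularity compatibility rules behind this scheme fit in a tiny function. This is a sketch of the textbook IS/IX/S/X matrix; QuillSQL's LockManager may name the modes differently or support additional modes such as SIX:

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum LockMode {
    IntentionShared,    // IS
    IntentionExclusive, // IX
    Shared,             // S
    Exclusive,          // X
}

/// Classic multi-granularity compatibility matrix:
///          IS   IX   S    X
///   IS     yes  yes  yes  no
///   IX     yes  yes  no   no
///   S      yes  no   yes  no
///   X      no   no   no   no
pub fn compatible(held: LockMode, requested: LockMode) -> bool {
    use LockMode::*;
    match (held, requested) {
        (X, _) | (_, X) => false,
        (IX, S) | (S, IX) => false,
        _ => true,
    }
}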

Deadlock Detection

When two or more transactions are waiting for each other to release locks in a circular chain, a deadlock occurs. The LockManager detects this by building a waits-for graph. When a transaction T1 has to wait for a lock held by T2, an edge T1 -> T2 is added to the graph. If adding an edge creates a cycle, a deadlock is detected, and one of the transactions (the victim) is immediately aborted to break the cycle.
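
Detecting such a cycle is a simple graph reachability problem. The sketch below assumes the waits-for graph is a map from a waiting transaction to the transactions it waits on; the real LockManager additionally has to pick a victim (for example, the youngest transaction) and wake up the waiters of the transaction it aborts:

use std::collections::{HashMap, HashSet};

type TxnId = u64;

/// Returns true if `start` can reach itself by following waits-for edges,
/// i.e. adding its outgoing edge created a cycle (a deadlock).
pub fn has_cycle_from(waits_for: &HashMap<TxnId, Vec<TxnId>>, start: TxnId) -> bool {
    let mut stack = vec![start];
    let mut seen = HashSet::new();
    while let Some(txn) = stack.pop() {
        for &next in waits_for.get(&txn).into_iter().flatten() {
            if next == start {
                return true; // came back to where we started: deadlock
            }
            if seen.insert(next) {
                stack.push(next);
            }
        }
    }
    false
}

fn main() {
    // T1 waits for T2, T2 waits for T1: deadlock.
    let mut g: HashMap<TxnId, Vec<TxnId>> = HashMap::new();
    g.insert(1, vec![2]);
    g.insert(2, vec![1]);
    assert!(has_cycle_from(&g, 1));
    println!("deadlock detected");
}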

3. The Hybrid Model: MVCC + 2PL in Action

In QuillSQL, these two mechanisms work in concert:

  • A reader (SELECT) transaction acquires an MVCC snapshot. It uses this snapshot to determine visibility. It only needs to acquire Shared (S) locks on the rows it reads to prevent other transactions from modifying them, thus ensuring repeatable reads in higher isolation levels.

  • A writer (UPDATE, DELETE) transaction must acquire an Exclusive (X) lock on the specific row it intends to modify. Once the lock is granted, it knows no other writer can interfere. It can then safely create a new tuple version as part of the MVCC protocol.

This hybrid approach provides the best of both worlds: reads are fast and non-blocking, while write-write conflicts are safely prevented by the locking protocol.


For Study & Discussion

  1. Isolation Levels: QuillSQL supports multiple SQL isolation levels. How does the behavior of MVCC snapshots and 2PL change between ReadCommitted and RepeatableRead? In ReadCommitted, a transaction gets a new snapshot for every statement, whereas in RepeatableRead, it uses the same snapshot for the entire transaction. What concurrency anomalies does this difference prevent?

  2. Phantom Reads: Even with MVCC and row-level 2PL, a RepeatableRead transaction can suffer from phantom reads. Imagine T1 runs SELECT COUNT(*) FROM users WHERE age > 30. Then, T2 inserts a new user with age = 40 and commits. If T1 runs its SELECT query again, it will see a new "phantom" row that wasn't there before. How can a database prevent this to achieve the Serializable isolation level? (Hint: research predicate locking and index-range locking).

  3. Deadlock Handling: QuillSQL detects deadlocks by building a waits-for graph and aborting a transaction. What is an alternative strategy for handling deadlocks? For example, what are the pros and cons of using lock timeouts instead of cycle detection?

  4. Programming Exercise (Advanced): A full implementation of Serializable isolation often requires index-range locking to prevent phantoms. Extend the LockManager to support locking a range of keys within a B+Tree index. This would require a new lock type and a way to check for overlapping ranges. When a SELECT ... WHERE age > 30 query runs, it would place a shared lock on the (30, +∞) range in the index on the age column, preventing any other transaction from inserting a new user with an age in that range.

Query Plan

The Query Planner is the "brain" of the database. It is responsible for taking a declarative SQL query string (which describes what data to retrieve) and converting it into a highly optimized, imperative plan (which describes how to retrieve that data).

This process is one of the most complex and important aspects of a relational database system.

Core Concepts

  • AST (Abstract Syntax Tree): The raw SQL text is first parsed into a tree structure that represents the syntax of the query.

  • Logical Plan: The AST is then converted into a logical plan. This is a tree of relational algebra operators (e.g., Filter, Projection, Join) that describes the logical steps required to fulfill the query, independent of any specific algorithm or data layout.

  • Physical Plan: The logical plan is then converted into a physical plan. This plan consists of concrete operators (or "iterators") that implement specific algorithms (e.g., NestedLoopJoin, SeqScan). This is the plan that is actually executed.

This section contains:

  • The Lifecycle of a Query: How a SQL string is parsed into an AST, planned, optimized, and lowered into an executable physical plan.

The Lifecycle of a Query

When you submit a SQL query to a database, it doesn't just magically produce a result. The database undertakes a sophisticated, multi-stage process to translate the declarative SQL statement (which describes what data you want) into an imperative, efficient execution plan (which describes how to get that data). This entire process is the responsibility of the Query Planner.

In QuillSQL, this process follows a classic, compiler-like pipeline, which is a cornerstone of modern database architecture as taught in courses like CMU 15-445.

The journey from a SQL string to an executable plan involves several transformations:

SQL String -> AST (Abstract Syntax Tree) -> Logical Plan -> Optimized Logical Plan -> Physical Plan

Let's break down each stage.

Stage 1: Parsing (SQL -> AST)

The first step is purely syntactic. The raw SQL query string is fed into a parser. QuillSQL uses the excellent sqlparser crate for this. The parser checks that the SQL is grammatically valid and, if so, converts it into an Abstract Syntax Tree (AST).

An AST is a direct tree representation of the SQL query's structure. For example, SELECT id FROM users WHERE age > 30 would be parsed into a tree structure with nodes representing the SELECT clause, the table users, the WHERE clause, and the predicate age > 30.
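
With the sqlparser crate, this stage is only a few lines. The snippet below is a sketch of the crate's general API; QuillSQL's sql::parser module wraps it with its own dialect choice and error type:

use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = "SELECT id FROM users WHERE age > 30";
    // Parse the raw SQL text into a vector of Statement AST nodes.
    let ast = Parser::parse_sql(&GenericDialect {}, sql).expect("valid SQL");
    // The AST mirrors the query's syntactic structure: projection, FROM, WHERE.
    println!("{:#?}", ast[0]);
}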

Stage 2: Logical Planning (AST -> Logical Plan)

Next, the LogicalPlanner (plan/logical_planner.rs) walks the AST and converts it into a Logical Plan.

A Logical Plan is a tree of relational algebra operators. It describes the query in terms of high-level data operations, completely independent of how the data is stored or which algorithms will be used. It defines what to do, not how to do it.

Key logical operators in QuillSQL (plan/logical_plan/mod.rs) include:

  • TableScan(users): Represents reading the entire users table.
  • Filter(predicate: age > 30): Represents filtering rows based on a condition.
  • Projection(columns: [id]): Represents selecting specific columns.
  • Join: Represents joining two data sources.
  • Aggregate: Represents a GROUP BY operation.
  • Sort: Represents an ORDER BY operation.

For our example query, the initial logical plan might look like this:

Projection(columns=[id])
  └── Filter(predicate=age > 30)
        └── TableScan(users)
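
Seen as code, a logical plan is just a recursive tree. The sketch below is heavily simplified (real plan nodes carry typed expressions and schema information rather than strings), but it shows the shape of the example above:

/// Heavily simplified logical plan tree (illustrative only).
pub enum LogicalPlan {
    TableScan { table: String },
    Filter { predicate: String, input: Box<LogicalPlan> },
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

/// Projection(id) -> Filter(age > 30) -> TableScan(users)
pub fn example_plan() -> LogicalPlan {
    LogicalPlan::Projection {
        columns: vec!["id".to_string()],
        input: Box::new(LogicalPlan::Filter {
            predicate: "age > 30".to_string(),
            input: Box::new(LogicalPlan::TableScan { table: "users".to_string() }),
        }),
    }
}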

Stage 3: Logical Optimization

Before executing the plan, we have a crucial opportunity to make it more efficient. The LogicalOptimizer (optimizer/logical_optimizer.rs) takes the logical plan and applies a series of transformation rules to produce a new, equivalent logical plan that is expected to be faster.

QuillSQL uses a simple but effective rule-based optimizer. A classic example of such a rule is Predicate Pushdown. Consider this query:

SELECT name FROM users JOIN cities ON users.city_id = cities.id WHERE users.age > 30;

A naive logical plan would first perform a full JOIN between users and cities and then filter the massive result. Predicate pushdown is a rule that would rewrite the plan to apply the age > 30 filter before the join:

Before Optimization:

Filter(users.age > 30)
  └── Join(users.city_id = cities.id)
        ├── TableScan(users)
        └── TableScan(cities)

After Optimization (Predicate Pushdown):

Join(users.city_id = cities.id)
  ├── Filter(users.age > 30)
  │     └── TableScan(users)
  └── TableScan(cities)

By filtering early, we dramatically reduce the number of rows that need to be processed by the expensive Join operator. QuillSQL implements similar rules, such as PushDownLimit, which pushes LIMIT clauses down the tree to reduce the amount of data processed.

Stage 4: Physical Planning (Logical Plan -> Physical Plan)

Finally, the PhysicalPlanner (plan/physical_planner.rs) converts the optimized logical plan into a Physical Plan.

A Physical Plan describes exactly how the query will be executed. It maps each logical operator to a concrete algorithm or implementation.

  • A LogicalPlan::TableScan becomes a PhysicalSeqScan (a sequential scan of the table heap).
  • A LogicalPlan::Filter becomes a PhysicalFilter, which implements the filtering logic.
  • A LogicalPlan::Join becomes a PhysicalNestedLoopJoin. This is where the database commits to a specific join algorithm. A more advanced database might have multiple options (e.g., PhysicalHashJoin, PhysicalSortMergeJoin) and would use a cost model to choose the best one. QuillSQL currently implements Nested Loop Join.

Each node in the physical plan tree is an executor that the Execution Engine can run. This final plan is what gets executed to produce the query result.

Conclusion

This layered approach—from syntax to a logical representation, then to an optimized logical representation, and finally to a concrete physical execution plan—is fundamental to database design. It provides a clear separation of concerns and, most importantly, creates a dedicated optimization stage, which is the key to achieving high performance on a wide variety of SQL queries.


For Study & Discussion

  1. Logical vs. Physical: Why is the separation between logical and physical plans so important? What would be the disadvantages of a simpler system that converted the AST directly into a physical plan?

  2. Join Algorithms: QuillSQL currently only implements NestedLoopJoin. What are two other common join algorithms? Describe how they work and in what scenarios they would be more performant than a nested loop join.

  3. Programming Exercise (Advanced): Implement a PhysicalHashJoin operator. This is a significant undertaking that involves:
    a. Creating a PhysicalHashJoin struct that implements the VolcanoExecutor trait.
    b. In the init() phase, it should consume the entire "build" side (typically the smaller, right-hand table) and build an in-memory hash table from its rows.
    c. In the next() phase, it should read from the "probe" side (the left-hand table) one row at a time, probe the hash table for matches, and emit the joined tuples.
    d. Modify the PhysicalPlanner to choose PhysicalHashJoin instead of PhysicalNestedLoopJoin for equi-joins.

  4. Programming Exercise: Add support for the UNION ALL operator. This would involve:
    a. Adding a Union variant to the LogicalPlan enum.
    b. Updating the LogicalPlanner to recognize the UNION syntax in the AST and create a LogicalPlan::Union node.
    c. Creating a PhysicalUnion executor that pulls tuples from its first child until it's exhausted, and then pulls tuples from its second child.

Query Optimizer

A naive, direct translation of a SQL query into an execution plan is often dramatically inefficient. The Query Optimizer is a critical stage in the query processing pipeline that rearranges and transforms the initial logical plan into an equivalent, but far cheaper, plan to execute.

Core Concepts

  • Rule-Based Optimization: QuillSQL uses a rule-based optimizer. It applies a series of pre-defined, heuristic rules to the logical plan to improve it. This is in contrast to a cost-based optimizer, which would estimate the "cost" of many possible plans and choose the cheapest one.

  • Logical Transformations: The key insight is that many different logical plans can produce the exact same result. For example, filtering data before a join is usually much faster than joining two large tables and filtering the result, but the final output is identical. The optimizer's job is to find and apply these beneficial transformations.

This section contains:

  • Rule-Based Optimization: A deep dive into how the rule-based optimizer works, using the PushDownLimit rule as a concrete example.

Rule-Based Optimization

After the LogicalPlanner creates an initial LogicalPlan, it's passed to the LogicalOptimizer. The initial plan is a direct, syntactically correct translation of the SQL query, but it's often not the most efficient way to execute it. The optimizer's job is to transform this plan into an equivalent, but more performant, logical plan.

The Optimizer in QuillSQL

QuillSQL implements a Rule-Based Optimizer. This is a common and powerful approach where the optimizer is equipped with a set of predefined transformation rules. It repeatedly applies these rules to the logical plan tree until no more rules can be applied, or a maximum number of passes is reached.

The main components are:

  • LogicalOptimizer (optimizer/logical_optimizer.rs): The main driver. It holds a list of rules and contains the logic to recursively walk the plan tree and apply them.
  • LogicalOptimizerRule Trait: An interface that every optimization rule must implement. Its core method is try_optimize, which takes a plan node and attempts to return a rewritten, optimized version of that node.
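
In other words, the rule interface looks roughly like the sketch below, assuming the LogicalPlan type from the planning chapter; the exact signatures in optimizer/ may differ (for example, returning the plan by value or wrapping nodes in Arc):

/// Illustrative shape of an optimization rule: inspect a node and either
/// return a rewritten replacement or None to leave it unchanged.
pub trait LogicalOptimizerRule {
    fn name(&self) -> &str;
    fn try_optimize(&self, plan: &LogicalPlan) -> Option<LogicalPlan>;
}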

Deep Dive: The PushDownLimit Rule

One of the most classic and effective optimizations is "pushing down" operations as far as possible towards the data source. Let's examine the PushDownLimit rule (optimizer/rule/push_down_limit.rs) to see this in action.

Consider the following query:

SELECT * FROM users ORDER BY signup_date LIMIT 10;

The Naive Plan

A naive logical plan for this query would be:

Limit(10)
  └── Sort(by: signup_date)
        └── TableScan(users)

If executed directly, this plan would:

  1. Scan the entire users table.
  2. Sort the entire table by signup_date.
  3. Finally, discard all but the first 10 rows.

This is incredibly inefficient, especially for a large table, as it involves a massive, memory-intensive sort operation.

The Optimization Rule

The PushDownLimit rule is designed to recognize this specific pattern: a Limit operator directly on top of a Sort operator.

When the optimizer applies this rule, the try_optimize method matches on the Limit node. It inspects its child and sees that it's a Sort node. The rule then knows it can apply its logic.

The Rewritten Plan

The rule rewrites the plan tree by "pushing" the limit information into the Sort node itself:

Limit(10)
  └── Sort(by: signup_date, limit: 10)
        └── TableScan(users)

Notice the new limit: 10 property on the Sort node. This seemingly small change has a huge performance impact. When the PhysicalSort operator is created from this logical node, it knows that it only needs to produce the top 10 rows. Instead of performing a full sort, it can use a much more efficient Top-K strategy: keep a bounded heap of just 10 rows, evicting the currently worst-ranked row whenever a better one arrives, and find the result in a single pass over the data.

This optimization avoids sorting the entire table, dramatically reducing both CPU and memory consumption.
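
The single-pass idea is easy to see with Rust's standard BinaryHeap. The sketch below finds the K smallest values (i.e. the first K rows in ascending sort order) by keeping a bounded max-heap and evicting the currently worst candidate; it is illustrative only and operates on plain integers rather than tuples:

use std::collections::BinaryHeap;

/// Keep the K smallest values seen so far, using a bounded max-heap:
/// the heap root is the current "worst" of the best K, so it is the one
/// evicted when a smaller value arrives.
fn top_k_ascending(values: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k + 1);
    for v in values {
        if heap.len() < k {
            heap.push(v);
        } else if let Some(&worst) = heap.peek() {
            if v < worst {
                heap.pop();
                heap.push(v);
            }
        }
    }
    let mut out = heap.into_sorted_vec(); // ascending order
    out.truncate(k);
    out
}

fn main() {
    let rows = [42, 7, 19, 3, 88, 25, 11, 5, 64, 30, 2, 50];
    // Equivalent to `ORDER BY value ASC LIMIT 4` without sorting every row.
    assert_eq!(top_k_ascending(rows, 4), vec![2, 3, 5, 7]);
    println!("ok");
}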

Other Rules

QuillSQL implements other simple but effective rules:

  • EliminateLimit: Removes a LIMIT clause if it provides no value (e.g., LIMIT NULL).
  • MergeLimit: If two LIMIT clauses are stacked on top of each other (which can happen after other rule transformations), this rule merges them into a single, more restrictive LIMIT.

Conclusion

While QuillSQL's optimizer is currently rule-based and relatively simple, it demonstrates the fundamental principles of query optimization. By separating the logical representation of a query from its physical execution and applying equivalence-preserving transformations, a database can achieve massive performance gains. More advanced systems build on this with a Cost-Based Optimizer, which uses table statistics to estimate the "cost" of different physical plans (e.g., choosing between a NestedLoopJoin and a HashJoin) and pick the cheapest one.


For Study & Discussion

  1. Rule Ordering: The LogicalOptimizer applies its list of rules in a fixed order for a set number of passes. Can the order in which rules are applied affect the final, optimized plan? Can one rule's transformation enable another rule to be applied in a subsequent pass?

  2. Cost-Based vs. Rule-Based: What is the primary limitation of a purely rule-based optimizer? When would a rule-based optimizer make a poor decision that a cost-based optimizer (with accurate statistics) would get right? (Hint: consider join algorithm selection).

  3. Programming Exercise: Implement the classic Predicate Pushdown rule. Your rule should look for a Filter operator whose child is a Join. If the filter's predicate only uses columns from one side of the join, the rule should push the Filter node down to that side of the join, below the Join node. This is one of the most effective optimizations in any database.

  4. Programming Exercise: Implement a Constant Folding rule. This rule would traverse expression trees and pre-compute constant expressions. For example:

    • An expression WHERE age = 10 + 5 would be rewritten to WHERE age = 15.
    • An expression WHERE 1 = 1 would be evaluated to true, and a smart optimizer could then potentially eliminate the WHERE clause entirely.

Execution Engine

Once a query has been parsed, planned, and optimized, the resulting PhysicalPlan is handed to the Execution Engine. This component is the workhorse of the database, responsible for actually running the plan and producing the final results.

Core Concepts

  • Volcano (Iterator) Model: QuillSQL uses a pull-based execution model. Each operator in the physical plan tree is an "iterator" that provides a next() method. The parent operator calls next() on its children to pull rows upwards through the tree, from the storage layer to the client.

  • Physical Operators: Each logical operation (like a filter or a join) is mapped to a concrete physical operator that implements a specific algorithm (e.g., PhysicalFilter, PhysicalNestedLoopJoin).

  • Execution Context: As the plan is executed, a shared ExecutionContext is passed between operators. It contains vital information, such as the current transaction and its MVCC snapshot, allowing operators to perform visibility checks and locking.

This section contains:

  • The Volcano Execution Model: How the pull-based iterator model drives physical operators, from leaf scans up to the root of the plan.

The Volcano Execution Model

Once the Query Planner has produced an optimized PhysicalPlan, it's the job of the Execution Engine to run it and produce results. The execution engine is the component that brings the plan to life, interacting with the transaction manager and storage layer to process data.

QuillSQL uses the classic Volcano Model, also known as the Iterator Model. This is a pull-based execution model where each physical operator in the plan tree acts as an iterator that the parent operator can "pull" rows from.

1. The VolcanoExecutor Trait

At the heart of the execution model is the VolcanoExecutor trait (execution/mod.rs). Every physical operator implements this simple trait:

pub trait VolcanoExecutor {
    fn init(&self, context: &mut ExecutionContext) -> QuillSQLResult<()>;
    fn next(&self, context: &mut ExecutionContext) -> QuillSQLResult<Option<Tuple>>;
    fn output_schema(&self) -> SchemaRef;
}

  • init(): This method is called once at the beginning of execution. It allows an operator to set up its initial state (e.g., a SeqScan operator would initialize its table iterator here).
  • next(): This is the core method. When called, the operator produces its next output tuple. It returns Some(tuple) if it has a row, or None if it has exhausted its data source. The top-level ExecutionEngine simply calls next() on the root of the plan tree in a loop until it receives None, as sketched below.
  • output_schema(): Returns the schema of the tuples this operator emits, so parent operators know how to interpret the columns they receive.
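
The loop that drives the whole query sits on top of this trait and is tiny. The sketch below uses only the trait shown above and assumes QuillSQLResult is a Result alias so that ? works; the real ExecutionEngine also deals with output schemas and DML result reporting:

/// Sketch of the engine's outermost pull loop, built only on the trait above.
fn pull_all(root: &dyn VolcanoExecutor, ctx: &mut ExecutionContext) -> QuillSQLResult<Vec<Tuple>> {
    root.init(ctx)?;
    let mut rows = Vec::new();
    // Pull tuples one at a time until the root reports exhaustion.
    while let Some(tuple) = root.next(ctx)? {
        rows.push(tuple);
    }
    Ok(rows)
}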

2. The ExecutionContext

Notice that both init() and next() take a mutable ExecutionContext. This object is the "context" or "world" in which the query runs. It is passed down the entire operator tree and gives each operator access to crucial services:

  • Catalog: To look up tables and indexes.
  • TransactionManager and Transaction: To interact with the current transaction. This is how operators perform locking and visibility checks.
  • MvccSnapshot: The specific MVCC snapshot for the current transaction, used to determine which tuple versions are visible.

This design cleanly separates the operator's logic from the transactional context it runs in.

3. Anatomy of Physical Operators

Data flows up the tree from the leaves (scans) to the root. Let's see how it works by examining a few key operators.

Leaf Operator: PhysicalSeqScan

A sequential scan is at the leaf of a plan tree. It's responsible for reading tuples from a table on disk.

  • init(): It acquires an IntentionShared lock on the table and creates a TableIterator for the table heap.
  • next(): In a loop, it does the following:
    1. Pulls the next raw tuple (rid, meta, tuple) from the TableIterator.
    2. Calls context.is_visible(&meta) to perform an MVCC visibility check using the transaction's snapshot.
    3. If the tuple version is visible, it acquires the necessary row lock (e.g., Shared lock) and returns the tuple.
    4. If the tuple is not visible, it ignores it and loops to get the next one.

Unary Operator: PhysicalFilter

A filter has one child operator (its input). It implements a WHERE clause.

  • next(): Its logic is a simple, tight loop:
    1. It calls self.input.next() to get a tuple from its child.
    2. If the child returns None, the filter is also exhausted and returns None.
    3. If it receives a tuple, it evaluates its predicate expression (e.g., age > 30) against the tuple.
    4. If the predicate evaluates to true, it returns the tuple. Otherwise, it loops back to step 1.
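
Stripped of the executor plumbing, that loop is just this. The sketch is self-contained and uses a plain iterator to stand in for the child operator and its tuples:

/// The filter loop as plain Rust: keep pulling from the child until a row
/// satisfies the predicate or the child is exhausted (mirrors the steps above).
fn filter_next<T>(
    child: &mut impl Iterator<Item = T>,
    predicate: impl Fn(&T) -> bool,
) -> Option<T> {
    loop {
        let tuple = child.next()?;  // child exhausted -> None propagates up
        if predicate(&tuple) {
            return Some(tuple);     // predicate true -> emit the tuple
        }
        // predicate false -> loop and pull the next tuple
    }
}

fn main() {
    let ages = vec![25, 40, 33, 17];
    let mut child = ages.into_iter();
    // Equivalent to WHERE age > 30, pulled one row at a time.
    assert_eq!(filter_next(&mut child, |a| *a > 30), Some(40));
    assert_eq!(filter_next(&mut child, |a| *a > 30), Some(33));
    assert_eq!(filter_next(&mut child, |a| *a > 30), None);
    println!("ok");
}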

Binary Operator: PhysicalNestedLoopJoin

A join has two children: a left (outer) and a right (inner).

  • next(): It implements the classic nested loop join algorithm:
    1. Fetch one tuple from the outer (left) child and hold onto it.
    2. Enter a loop: fetch tuples one by one from the inner (right) child.
    3. For each inner tuple, combine it with the held outer tuple and evaluate the join condition. If it matches, return the combined tuple.
    4. When the inner child is exhausted, rewind it by calling self.right_input.init() again.
    5. Go back to step 1 to fetch the next tuple from the outer child.
    6. Repeat until the outer child is also exhausted.
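
The same algorithm, written over in-memory rows instead of child executors, looks like the sketch below; it is illustrative only, since the real operator rewinds the inner side by calling init() rather than re-iterating a slice:

/// Nested loop join over in-memory rows: the same tuple-at-a-time algorithm
/// PhysicalNestedLoopJoin applies (illustrative sketch).
fn nested_loop_join<L: Clone, R: Clone>(
    outer: &[L],
    inner: &[R],
    matches: impl Fn(&L, &R) -> bool,
) -> Vec<(L, R)> {
    let mut out = Vec::new();
    for l in outer {
        // "Rewind" the inner side for every outer row.
        for r in inner {
            if matches(l, r) {
                out.push((l.clone(), r.clone()));
            }
        }
    }
    out
}

fn main() {
    let users = [(1, "alice"), (2, "bob")];  // (city_id, name)
    let cities = [(1, "Oslo"), (3, "Lima")]; // (id, city)
    let joined = nested_loop_join(&users, &cities, |u, c| u.0 == c.0);
    assert_eq!(joined, vec![((1, "alice"), (1, "Oslo"))]);
    println!("{joined:?}");
}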

4. Putting It All Together

Consider the query SELECT name FROM users WHERE age > 30. The physical plan is Projection -> Filter -> SeqScan.

  1. The ExecutionEngine calls next() on the Projection operator.
  2. The Projection operator needs a tuple, so it calls next() on its child, Filter.
  3. The Filter operator needs a tuple, so it calls next() on its child, SeqScan.
  4. The SeqScan operator fetches a raw tuple from the TableHeap, checks its MVCC visibility, and finds a visible tuple for a user with age = 25.
  5. SeqScan returns this tuple up to Filter.
  6. Filter evaluates age > 30 on the tuple. It's false, so it loops, calling SeqScan.next() again.
  7. SeqScan finds another visible tuple, this time for a user with age = 40 and name = 'Alice'.
  8. SeqScan returns this tuple up to Filter.
  9. Filter evaluates age > 30. It's true! It returns the tuple for Alice up to Projection.
  10. Projection takes the full tuple for Alice, creates a new tuple containing only the name column ('Alice'), and returns this new tuple as the result.

This process repeats, with tuples flowing up the tree one at a time, until the SeqScan operator runs out of pages and returns None, which then propagates up the tree, signaling the end of execution.


For Study & Discussion

  1. Push vs. Pull Models: The Volcano model is a "pull-based" model. An alternative is a "push-based" model, where operators push their results to their parents as soon as they are ready. What are the potential advantages and disadvantages of each model, particularly concerning cache efficiency and control flow?

  2. Blocking vs. Non-Blocking Operators: Some operators, like PhysicalFilter, can produce their first output row as soon as they receive their first input row. These are non-blocking. Other operators, like PhysicalSort, must consume their entire input before they can produce even a single row of output. These are blocking. What is the impact of blocking operators on query latency and memory usage?

  3. Programming Exercise: The current PhysicalNestedLoopJoin is simple but can be inefficient as it re-scans the entire inner table for every outer row. Implement a PhysicalBlockNestedLoopJoin operator. This version would read a block (a small batch) of tuples from the outer table into an in-memory buffer, and then iterate through the inner table once for that entire block. This can significantly reduce the number of times the inner table needs to be scanned.

  4. Programming Exercise: Implement the PhysicalLimit operator. Its next() method should:
    a. Keep an internal counter.
    b. If the counter is less than the offset, pull and discard tuples from its child.
    c. If the counter is between offset and offset + limit, pull a tuple from its child and return it.
    d. Once the limit is reached, it should stop pulling from its child and return None for all subsequent calls. This is important for efficiency, as it stops the execution of the entire sub-tree below it.