
QuillSQL Internals
Welcome to the technical documentation for QuillSQL.
This book provides a deep dive into the internal architecture and implementation details of the database. It is intended for developers, contributors, and anyone interested in understanding how a relational database is built from the ground up, referencing concepts from classic database courses like CMU 15-445.
Table of Contents
- Overall Architecture: A high-level overview of the entire system.
- Core Modules
  - Buffer Manager: The in-memory page cache.
  - Storage Engine: How data is physically stored.
  - Indexes: The B+Tree implementation.
  - Recovery Manager (WAL): Crash recovery and the ARIES protocol.
  - Transaction Manager: Concurrency control with MVCC and 2PL.
  - Query Plan: The journey from SQL to an executable plan.
  - Query Optimizer: Rule-based plan transformations.
  - Execution Engine: The Volcano (iterator) execution model.
QuillSQL Architecture
This document explains how a SQL request flows through QuillSQL, how transactional MVCC and background services plug in, and how the main modules collaborate. All diagrams use Mermaid so you can paste them into any compatible renderer for a richer view.
1. End-to-End Request Pipeline
```mermaid
flowchart TD
    Client["CLI / HTTP client"] --> API["bin/client / bin/server"]
    API --> Parser["sql::parser"]
    Parser --> LPlanner["plan::LogicalPlanner"]
    LPlanner --> Optimizer["optimizer::LogicalOptimizer"]
    Optimizer --> PhysPlanner["plan::PhysicalPlanner"]
    PhysPlanner --> Exec["execution::ExecutionEngine (Volcano)"]

    subgraph Txn["Transaction Layer"]
        Session["session::SessionContext"]
        TM["transaction::TransactionManager"]
        LM["transaction::LockManager"]
        Session --> TM --> LM
    end

    Exec -->|Tuple meta & locks| Txn
    Exec --> Storage

    subgraph Storage["Storage & I/O"]
        TableHeap["storage::table_heap::TableHeap"]
        Index["storage::index::B+Tree"]
        Buffer["buffer::BufferManager"]
        Scheduler["storage::disk_scheduler (io_uring)"]
        WAL["recovery::WalManager"]
        TableHeap <--> Buffer
        Index <--> Buffer
        Buffer <--> Scheduler
        WAL --> Scheduler
    end

    Txn -->|WAL payloads| WAL
    Background["background::workers\n(checkpoint, bg writer, MVCC vacuum)"] --> Buffer
    Background --> WAL
    Background --> TM
```
High-level flow
- SQL text is parsed into an AST, planned into a `LogicalPlan`, optimized by a handful of safe rewrite rules, then lowered into a physical operator tree. `SessionContext` either reuses or creates a transaction and injects isolation/access modes.
- Each physical operator is driven by the Volcano pull model (`init`/`next`); a sketch of this interface follows the list. On entry it obtains a `TxnRuntime`, which supplies a command id plus an MVCC snapshot consistent with the transaction's isolation level.
- Operators consult the snapshot for tuple visibility, acquire table/row locks through `TxnRuntime`, and issue heap/index operations.
- DML operators register undo records and append WAL entries via the transaction manager. When the user commits, the manager emits `Commit` records, waits for durability as configured, and releases locks.
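To make the pull model concrete, here is a minimal sketch of a Volcano-style operator interface. The trait, type, and function names are illustrative assumptions; QuillSQL's actual executor API may differ.

```rust
struct Tuple;              // placeholder for the engine's tuple type
struct ExecutionContext;   // placeholder for txn runtime, catalog, buffers, ...

trait VolcanoExecutor {
    /// Prepare the operator (open children, take locks, build state).
    fn init(&mut self, ctx: &mut ExecutionContext) -> Result<(), String>;
    /// Pull the next tuple; `None` means the operator is exhausted.
    fn next(&mut self, ctx: &mut ExecutionContext) -> Result<Option<Tuple>, String>;
}

/// Driving any plan tree is just a pull loop over the root operator.
fn drain(root: &mut dyn VolcanoExecutor, ctx: &mut ExecutionContext) -> Result<Vec<Tuple>, String> {
    root.init(ctx)?;
    let mut out = Vec::new();
    while let Some(t) = root.next(ctx)? {
        out.push(t);
    }
    Ok(out)
}
```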
2. Transaction & MVCC Mechanics
```mermaid
sequenceDiagram
    participant Session
    participant TM as TransactionManager
    participant Runtime as TxnRuntime
    participant Exec as Operator
    participant Heap as TableHeap
    Session->>TM: begin(isolation, access_mode)
    TM-->>Session: Transaction handle
    Exec->>Runtime: TxnRuntime::new(&TM, &mut txn)
    Runtime-->>Exec: {snapshot, command_id}
    loop per tuple
        Exec->>Heap: next()
        Heap-->>Exec: (rid, meta, tuple)
        Exec->>Runtime: is_visible(meta)?
        Runtime-->>Exec: yes/no (uses cached snapshot)
        alt Visible
            Exec->>Runtime: lock_row(...)
            Exec->>TM: record undo + WAL
        end
    end
    Session->>TM: commit(txn)
    TM->>WAL: Transaction(Commit)
    TM->>TM: wait_for_durable / flush_until
    TM-->>Session: release locks, clear snapshot
```
- Snapshots (the isolation-level behaviour is sketched in code after this list):
  - Read Committed / Read Uncommitted: Every command refreshes its snapshot and clears any cached value on the transaction handle.
  - Repeatable Read / Serializable: The first command captures a snapshot (`xmin`, `xmax`, `active_txns`) and stores it inside `Transaction`. Subsequent commands reuse it, ensuring consistent reads.
  - Background MVCC vacuum consults `TransactionManager::oldest_active_txn()` to compute `safe_xmin` and prunes tuple versions whose inserting/deleting transactions precede that boundary.
- Locking: Multi-granularity 2PL (IS/IX/S/SIX/X) enforced by `LockManager`. Repeatable Read releases shared locks at the end of each command (after verifying visibility); Serializable keeps shared locks through commit. Deadlocks are detected via a wait-for graph; a victim simply fails its lock request.
- Undo & WAL: `Transaction` maintains logical undo entries. On abort, the manager emits CLR records and performs the inverse heap operations. Commit waits depend on `synchronous_commit`. Buffer frames retain their `page_lsn` so WAL-before-data holds.
- Executor safeguards: `PhysicalUpdate` skips tuple versions created by the same `(txn_id, command_id)` during the current command. This prevents re-processing the freshly inserted MVCC version and thereby avoids infinite loops.
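A minimal sketch (not the exact QuillSQL API) of how snapshot reuse can depend on the isolation level, as described in the Snapshots bullet above:

```rust
#[derive(Clone)]
struct MvccSnapshot {
    xmin: u64,
    xmax: u64,
    active_txns: Vec<u64>,
}

enum IsolationLevel {
    ReadUncommitted,
    ReadCommitted,
    RepeatableRead,
    Serializable,
}

struct Transaction {
    isolation: IsolationLevel,
    cached_snapshot: Option<MvccSnapshot>,
}

fn snapshot_for_command(
    txn: &mut Transaction,
    take_snapshot: impl Fn() -> MvccSnapshot,
) -> MvccSnapshot {
    match txn.isolation {
        // RC/RU: every command sees a fresh snapshot; nothing is cached.
        IsolationLevel::ReadCommitted | IsolationLevel::ReadUncommitted => {
            txn.cached_snapshot = None;
            take_snapshot()
        }
        // RR/Serializable: the first command captures the snapshot and later
        // commands reuse it for consistent reads.
        IsolationLevel::RepeatableRead | IsolationLevel::Serializable => txn
            .cached_snapshot
            .get_or_insert_with(take_snapshot)
            .clone(),
    }
}
```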
3. Storage & Buffering
| Component | Highlights |
|---|---|
| TableHeap | Tuple metadata (`TupleMeta`) stores inserting/deleting txn ids, command ids, and forward/back pointers for version chains. Helpers like `mvcc_update` stitch new versions while marking old ones deleted. |
| B+Tree | B-link tree implementation with separate codecs for header/internal/leaf pages. Index maintenance occurs in DML operators after the heap change succeeds. |
| BufferManager | Combines a page table, LRU-K replacer (with TinyLFU admission option), and per-frame guards. Dirty pages record their first-dirty LSN, feeding checkpoints. The background writer periodically flushes dirty frames and drives lazy index cleanup. |
| DiskScheduler | Uses io_uring worker threads. Foreground tasks push `ReadPage`, `WritePage`, `WriteWal`, and `FsyncWal` commands through lock-free queues and receive completions on dedicated channels. |
4. Write-Ahead Logging & Recovery
- `WalManager` manages log sequence numbers, buffers frames, writes physical (`PageWrite`, `PageDelta`) and logical (`HeapInsert`/`HeapUpdate`/`HeapDelete`) records, and coordinates flushes. First-page writes guard against torn pages.
- `background::spawn_checkpoint_worker` emits `Checkpoint` records capturing the dirty page table and active transactions so recovery can cut replay short.
- `RecoveryManager` executes ARIES-style analysis → redo → undo on restart, leveraging CLRs to keep undo idempotent.
- WAL and data I/O both use the `DiskScheduler`, keeping durability guarantees in one place.
5. Background Services
| Worker | Description | Trigger |
|---|---|---|
| WAL writer | Coalesces buffered WAL into durable segments | `WalManager::start_background_flush` |
| Checkpoint | Flushes LSNs, records ATT + DPT snapshots, resets FPW epoch | Configurable interval (`WalOptions::checkpoint_interval_ms`) |
| Buffer writer | Flushes dirty frames, runs index lazy cleanup based on pending garbage counters | `IndexVacuumConfig` |
| MVCC vacuum | Iterates table heaps, removing committed-deleted or aborted-inserted tuples older than `safe_xmin` | `MvccVacuumConfig` (interval + batch limit) |
All workers are registered with `background::BackgroundWorkers`, which stops and joins them on database shutdown.
6. Example Timeline: Repeatable Read UPDATE
```text
T1 (RR)                          T2 (RC)
-----------                      -----------
BEGIN;                           BEGIN;
SELECT ... (snapshot S0)         UPDATE row -> new version V1
                                 COMMIT (WAL + flush)
SELECT again -> sees original value
COMMIT
-- background vacuum later reclaims deleted version when safe_xmin > delete_txn
```
- T1's `TxnRuntime` caches snapshot S0 on its first command and reuses it, so the second `SELECT` filters out V1 even though T2 committed.
- Row-level shared locks acquired during the read are released at the end of the command, while the MVCC snapshot keeps the view consistent.
- When T1 commits, locks are dropped, the snapshot cache is cleared, and the WAL commit record becomes durable. Vacuum eventually removes the deleted predecessor version T1 was reading once all transactions with `txn_id < safe_xmin` have finished.
7. Observability & Configuration
- Enable tracing via `RUST_LOG=trace` to inspect lock grant/queue events and MVCC vacuum activity.
- Key knobs exposed through CLI/environment: WAL segment size, background intervals, default isolation level, MVCC vacuum batch size.
- `background::BackgroundWorkers::snapshot()` reports active worker metadata; you can surface it for debugging endpoints.
8. Roadmap
- Predicate locking / SSI to upgrade Serializable beyond strict 2PL.
- Cost-based optimization with catalog statistics.
- Smarter vacuum pacing tied to storage pressure and tuple churn.
- Parallel execution and adaptive readahead hints based on operator feedback.
Even with these future items, the current layering mirrors production databases (e.g., PostgreSQL): MVCC + 2PL, ARIES-style WAL, asynchronous maintenance, and a modular Volcano executor—all while keeping the codebase approachable for teaching and experimentation.
Contributor's Guide
Welcome, and thank you for your interest in contributing to QuillSQL! Whether you're fixing a bug, adding a new feature, or improving the documentation, this guide will help you get started.
1. Getting Started: Your Development Environment
Prerequisites
- Rust: QuillSQL is written in Rust. If you don't have it yet, install it via rustup. This will provide you with `rustc` (the compiler) and `cargo` (the package manager and build tool).

  ```bash
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  ```

- Build Essentials: Ensure you have a C++ compiler like `gcc` or `clang` installed, which is a common dependency for some Rust crates.
Setup
1. Fork the Repository: Start by forking the main QuillSQL repository to your own GitHub account.
2. Clone Your Fork: Clone your forked repository to your local machine.

   ```bash
   git clone https://github.com/YOUR_USERNAME/QuillSQL.git
   cd QuillSQL
   ```

3. Build the Project: Compile the entire project to ensure everything is set up correctly.

   ```bash
   cargo build
   ```
2. Development Workflow
Running Tests
Before and after making any changes, it's crucial to run the test suite to ensure you haven't broken anything.
- Run all unit and integration tests:

  ```bash
  cargo test
  ```

- Run the benchmark suite:

  ```bash
  cargo bench
  ```
Code Style and Quality
We adhere to the standard Rust coding style and use tools to enforce it.
- Formatting: Before committing, please format your code with `rustfmt`.

  ```bash
  cargo fmt --all
  ```

- Linting: We use `clippy` to catch common mistakes and improve code quality. Please ensure `clippy` passes without warnings.

  ```bash
  cargo clippy --all-targets -- -D warnings
  ```
Submitting Your Contribution
1. Create a New Branch: Create a descriptive branch name for your feature or bugfix.

   ```bash
   git checkout -b my-awesome-feature
   ```

2. Make Your Changes: Write your code. Add new tests to cover your changes. Ensure all existing tests still pass.
3. Format and Lint: Run `cargo fmt` and `cargo clippy` as described above.
4. Commit Your Work: Write a clear and concise commit message.

   ```bash
   git add .
   git commit -m "feat: Add support for window functions"
   ```

5. Push to Your Fork: Push your branch to your fork on GitHub.

   ```bash
   git push -u origin my-awesome-feature
   ```

6. Open a Pull Request: Go to the original QuillSQL repository on GitHub. You should see a prompt to open a Pull Request from your new branch. Fill out the PR template with a description of your changes.
3. Working on the Documentation
The documentation is built using `mdbook`. To preview your changes locally, you'll need to install it and the `mermaid` plugin.

1. Install `mdbook` and `mdbook-mermaid`:

   ```bash
   cargo install mdbook
   cargo install mdbook-mermaid
   ```

2. Serve the Book Locally: Run the following command from the root of the project.

   ```bash
   mdbook serve docs
   ```

   This will build the book and start a local web server. You can open your browser to `http://localhost:3000` to see the live-previewed documentation.
Buffer Manager
In any disk-based database, the speed difference between main memory (RAM) and persistent storage (SSD/HDD) is enormous. The Buffer Manager is the component designed to solve this problem. It acts as a cache, managing a region of main memory called the Buffer Pool and moving data pages between disk and memory as needed.
Its primary goal is to minimize disk I/O by keeping frequently accessed pages in memory.
This section is divided into the following parts:
- Page & Page Guards: Explains the core concepts of pinning and the RAII guards used to safely access pages.
- The Buffer Pool: A deep dive into the architecture and lifecycle of a page request, including the page table, replacer, and concurrency.
Page & Page Guards
Before the Buffer Manager can hand out a reference to a page in memory, it must ensure that the page won't be evicted while it's being used by another thread. This is accomplished by pinning.
Pinning
Pinning simply means incrementing a "pin count" associated with the page's frame in the buffer pool. A frame with a pin count greater than zero is forbidden from being chosen as a victim by the page replacer.
- When a thread wants to use a page, it must first pin it.
- When the thread is finished with the page, it must unpin it (decrementing the count).
Manually managing pin counts is tedious and error-prone. Forgetting to unpin a page leads to a memory leak, as the frame can never be evicted. To solve this, QuillSQL uses a common and powerful C++ and Rust pattern: Resource Acquisition Is Initialization (RAII).
ReadPageGuard and WritePageGuard
Instead of returning a raw pointer to the page memory, the `BufferManager`'s `fetch_page_*` methods return a guard object: `ReadPageGuard` or `WritePageGuard`.
These guards are responsible for the lifetime of the pin and the lock on the page:
- Acquisition: When a `PageGuard` is created, its constructor acquires the appropriate lock (`RwLock`) on the page's frame and increments the frame's pin count.
  - `ReadPageGuard` takes a read lock, allowing multiple concurrent readers.
  - `WritePageGuard` takes an exclusive write lock.
- Usage: The calling code uses the guard object to access the page's data. The guard provides safe, locked access to the underlying byte array.
- Release: When the guard variable goes out of scope (e.g., at the end of a function), its `drop()` method is automatically called by the Rust compiler. This `drop()` implementation handles all the cleanup:
  - It decrements the pin count.
  - It releases the lock on the frame.
  - If it's a `WritePageGuard` and the data was modified, it informs the `BufferManager` that the page is now dirty.
This RAII pattern makes using the buffer pool much safer and more ergonomic, as it makes it impossible to forget to unpin a page or release a lock.
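The following is a minimal sketch of the RAII idea behind the page guards. The struct and field names are illustrative; the real guards also hold the frame's `RwLock` guard and a dirty flag.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

struct Frame {
    pin_count: AtomicU32,
    // ... page bytes, RwLock, dirty flag, ...
}

struct ReadPageGuard {
    frame: Arc<Frame>,
}

impl ReadPageGuard {
    fn new(frame: Arc<Frame>) -> Self {
        // Pin the frame so the replacer cannot evict it while in use.
        frame.pin_count.fetch_add(1, Ordering::SeqCst);
        ReadPageGuard { frame }
    }
}

impl Drop for ReadPageGuard {
    fn drop(&mut self) {
        // Unpin automatically when the guard goes out of scope; forgetting to
        // unpin is impossible by construction.
        self.frame.pin_count.fetch_sub(1, Ordering::SeqCst);
    }
}
```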
The Buffer Pool: Architecture and Lifecycle
1. Core Components & Architecture
QuillSQL's buffer management is split into two main structs, reflecting a separation of concerns:
- `BufferPool` (`buffer/buffer_pool.rs`): A low-level, "dumb" container. It owns the actual memory arena for pages and provides basic mapping from a `PageId` to a memory location (`FrameId`).
- `BufferManager` (`buffer/buffer_manager.rs`): The high-level, "smart" coordinator. It contains the `BufferPool` and implements all the logic for fetching pages, choosing which pages to evict, and interacting with other database components like the transaction and recovery managers.
This architecture is centered around three key data structures:
- Frames (The Arena): The `BufferPool` pre-allocates a large, contiguous block of memory on the heap. This block is divided into a fixed number of smaller chunks called frames. Each frame is exactly `PAGE_SIZE` (4KB) and can hold the contents of one disk page.
- Page Table (`page_table`): A hash map (specifically, a concurrent `DashMap`) that maps a logical `PageId` to the `FrameId` where it currently resides in memory. This provides fast O(1) lookups to check if a page is already in the buffer pool.
- Replacer (`LRUKReplacer`): When a requested page is not in memory and the buffer pool is full, one of the existing pages must be evicted to make room. The `Replacer` is the component that implements the page replacement policy and decides which page is the best candidate for eviction. QuillSQL uses an LRU-K replacement policy, a sophisticated variant of the classic Least Recently Used (LRU) algorithm.
2. The Lifecycle of a Page Request
When another part of the database (e.g., the execution engine) needs to access a page, it calls `buffer_manager.fetch_page_read(page_id)` or `fetch_page_write(page_id)`. This initiates a critical sequence of events.
The flow is as follows:
1. Request: An executor requests Page `P`.
2. Lookup: The `BufferManager` consults the `PageTable`.
3. Case 1: Cache Hit
   - The `PageTable` contains an entry for `P`, mapping it to `FrameId` `F`.
   - The `BufferManager` increments the pin count for frame `F`.
   - It informs the `LRUKReplacer` that the page has been accessed (updating its priority).
   - It returns a `PageGuard` wrapping a reference to the memory in frame `F`.
4. Case 2: Cache Miss
   - The `PageTable` has no entry for `P`.
   - The `BufferManager` must bring the page from disk. It asks the `Replacer` to choose a victim frame `F_v` to evict.
   - If the victim frame `F_v` is dirty, the `BufferManager` first writes the contents of `F_v` back to disk via the `DiskScheduler`. This is essential for data durability.
   - The `PageTable` entry for the old page residing in `F_v` is removed.
   - The `BufferManager` issues a read request to the `DiskScheduler` to load page `P`'s data from disk into frame `F_v`.
   - The `PageTable` is updated with the new mapping: `P -> F_v`.
   - The process then continues like a cache hit: frame `F_v` is pinned and a `PageGuard` is returned. (This miss path is sketched in the code right after the list.)
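The following is a condensed sketch of the fetch path described above. Type and method names are illustrative, not the exact QuillSQL API, and the replacer/disk helpers are left as stubs.

```rust
use std::collections::HashMap;

type PageId = u64;
type FrameId = usize;

struct BufferManagerSketch {
    page_table: HashMap<PageId, FrameId>, // PageId -> FrameId
    dirty: Vec<bool>,                     // per-frame dirty flag
    frame_to_page: Vec<Option<PageId>>,   // reverse mapping used on eviction
}

impl BufferManagerSketch {
    fn fetch_page(&mut self, pid: PageId) -> Result<FrameId, String> {
        // Cache hit: pin and return the existing frame.
        if let Some(&fid) = self.page_table.get(&pid) {
            self.pin(fid);
            return Ok(fid);
        }
        // Cache miss: pick a victim, flush it if dirty, then reuse its frame.
        let victim = self.pick_victim()?;
        if self.dirty[victim] {
            self.flush_frame(victim)?; // WAL must already be durable up to the page LSN
        }
        if let Some(old) = self.frame_to_page[victim].take() {
            self.page_table.remove(&old);
        }
        self.read_from_disk(pid, victim)?;
        self.page_table.insert(pid, victim);
        self.frame_to_page[victim] = Some(pid);
        self.pin(victim);
        Ok(victim)
    }

    // Stand-ins for the replacer, disk scheduler, and pin bookkeeping.
    fn pick_victim(&mut self) -> Result<FrameId, String> { unimplemented!() }
    fn flush_frame(&mut self, _f: FrameId) -> Result<(), String> { unimplemented!() }
    fn read_from_disk(&mut self, _p: PageId, _f: FrameId) -> Result<(), String> { unimplemented!() }
    fn pin(&mut self, _f: FrameId) {}
}
```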
3. Concurrency
The buffer pool is a shared resource accessed by many concurrent threads. QuillSQL uses a combination of locking strategies:
- Page Table: A `DashMap` is used for the page table, which is highly optimized for concurrent reads and writes.
- Frame-Level Locks: Each frame in the pool has its own `RwLock`. A `ReadPageGuard` acquires a read lock on the frame, allowing multiple threads to read the same page concurrently. A `WritePageGuard` acquires an exclusive write lock, ensuring that only one thread can modify a page at a time.
- Replacer/Free List: These shared structures are protected by a `Mutex`.
4. Integration with WAL and Recovery
The Buffer Manager is a key player in the ARIES recovery protocol.
- Dirty Pages: When a `WritePageGuard` is dropped, if it has modified the page, it marks the page as dirty. The `BufferManager` maintains a `dirty_page_table` that tracks all dirty pages and the Log Sequence Number (LSN) of the first WAL record that caused each page to become dirty.
- Forced WAL (Write-Ahead Logging): Before a dirty page can be written back to disk (either during eviction or a checkpoint), the `BufferManager` must ensure that all WAL records up to that page's current LSN have been flushed to durable storage. This is the fundamental WAL rule and is enforced by the `flush_page` method, which calls `wal_manager.flush(lsn)` before writing the page to disk. A sketch of this ordering follows the list.
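Here is a minimal sketch of the WAL-before-data ordering that `flush_page` enforces. The types are placeholders; only the ordering of the two steps matters.

```rust
struct Lsn(u64);

struct WalManagerSketch;
impl WalManagerSketch {
    /// Block until every WAL record with LSN <= `upto` is durable.
    fn flush(&self, _upto: &Lsn) -> Result<(), String> { Ok(()) }
}

fn flush_page(
    wal: &WalManagerSketch,
    page_lsn: &Lsn,
    write_page_to_disk: impl FnOnce() -> Result<(), String>,
) -> Result<(), String> {
    // 1. WAL first: the log describing this page's changes must be durable...
    wal.flush(page_lsn)?;
    // 2. ...before the data page itself may be written back.
    write_page_to_disk()
}
```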
For Study & Discussion
- Replacement Policies: QuillSQL uses LRU-K. What are the potential advantages of LRU-K over a simple LRU policy? What kind of workload would benefit most? Conversely, what are the trade-offs of using Clock/Second-Chance instead?
- The "Double Caching" Problem: We mentioned that using Direct I/O helps avoid double caching. If Direct I/O is disabled, how does the database's buffer pool interact with the operating system's file system cache? Why can this lead to suboptimal performance?
- Programming Exercise: Implement a `ClockReplacer` that adheres to the `Replacer` trait. Modify the `BufferManager` to use your new replacer instead of `LRUKReplacer`. Run the benchmark suite and compare the performance. Does it change?
- Programming Exercise: Add metrics to the `BufferManager` to track the buffer pool hit rate (i.e., `num_hits / (num_hits + num_misses)`). You could expose this via a new `SHOW BUFFER_STATS;` SQL command.
Storage Engine
The storage engine is where the logical world of SQL tables and rows meets the physical world of disk blocks and bytes. Understanding how data is laid out on disk is fundamental to understanding database performance, concurrency control, and recovery.
This section is divided into the following parts:
- Disk I/O: A look at the asynchronous I/O layer using `io_uring`.
- Page & Tuple Layout: A deep dive into how pages and tuples are physically structured on disk, including the slotted page layout.
- Table Heap: Explains how tuple versions are managed for MVCC.
Disk I/O — Scheduler, io_uring Data Pages & WAL Runtime
1. Architecture
- Request Path: Foreground components enqueue `DiskRequest` objects via `DiskScheduler::{schedule_read, schedule_write, …}`. A dispatcher thread drains the global channel and distributes work round-robin to N io_uring workers. Each worker owns its own ring and file-descriptor cache, so once a request is forwarded, execution proceeds entirely off the foreground thread.
- Stable APIs: `schedule_read(page_id)`, `schedule_write(page_id, Bytes)`, `schedule_read_pages(Vec<PageId>)`, `schedule_allocate()`, `schedule_deallocate(page_id)`. Every call returns a channel the caller can block on or poll; a usage sketch follows this list.
- Batch Reads: `ReadPages` fans out per-page SQEs while a shared `BatchState` tracks completions. Even if the kernel completes I/O out of order, the caller receives a `Vec<BytesMut>` that preserves the original page order.
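The following is a hypothetical usage sketch of the request/reply pattern described above: scheduling a page read and blocking on the returned channel. The exact QuillSQL signatures and request variants may differ.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

type PageId = u64;
type PageBytes = Vec<u8>;

enum DiskRequest {
    ReadPage { page_id: PageId, reply: Sender<Result<PageBytes, String>> },
    // WritePage, WriteWal, FsyncWal, ...
}

struct DiskSchedulerSketch {
    queue: Sender<DiskRequest>,
}

impl DiskSchedulerSketch {
    /// Enqueue a read and return a channel the caller can block on or poll.
    fn schedule_read(&self, page_id: PageId) -> Receiver<Result<PageBytes, String>> {
        let (reply, rx) = channel();
        self.queue
            .send(DiskRequest::ReadPage { page_id, reply })
            .expect("dispatcher thread is gone");
        rx
    }
}

fn demo(scheduler: &DiskSchedulerSketch) {
    // Foreground side: fire the request, then block until a worker replies.
    let rx = scheduler.schedule_read(42);
    match rx.recv() {
        Ok(Ok(bytes)) => println!("read {} bytes for page 42", bytes.len()),
        Ok(Err(e)) => eprintln!("I/O error: {e}"),
        Err(_) => eprintln!("scheduler shut down"),
    }
}
```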
2. WAL Runtime (buffered I/O)
- Dedicated WAL runtime threads handle sequential WAL appends/reads using buffered I/O. They keep a per-thread cache of open segment files, eliminating repeated `open()`/`close()` calls on every log record.
- Worker count defaults to `max(1, available_parallelism / 2)` but is tunable through `IOSchedulerConfig`.
- An optional `sync` flag on a request triggers `sync_data`/`fdatasync` so `WalManager` can honour synchronous commit or checkpoint barriers. Data pages stay on the io_uring dataplane; WAL always uses buffered writes.
3. io_uring Backend (Linux)
- Each worker owns an `IoUring` with configurable `queue_depth`, an optional SQPOLL idle timeout, and a pool of registered fixed buffers sized to `PAGE_SIZE`. Workers submit SQEs asynchronously and drain CQEs in small batches to keep the ring warm.
- Read batching relies on shared `BatchState` instances (`Rc<RefCell<_>>`) so multi-page callers see ordered results without blocking the kernel on serialization.
- Writes keep their payload alive until completion; if a fixed buffer slot is available it is reused, otherwise the worker falls back to heap buffers. A companion `WriteState` tracks an optional `fdatasync` so the caller still observes exactly one `Result<()>` once all CQEs land.
- Errors (short read/write, errno) are normalised into `QuillSQLError` values that flow back on the original channel.
4. Configuration
`config::IOSchedulerConfig` controls:
- `workers`: number of io_uring workers (default = available parallelism).
- `wal_workers`: WAL runtime threads (default = workers / 2).
- `iouring_queue_depth`, `iouring_fixed_buffers`, `iouring_sqpoll_idle_ms`.
- `fsync_on_write`: whether data-page writes also issue `fdatasync` (WAL sync is managed separately by `WalManager`).
5. Concurrency & Safety
- Worker-local file descriptors plus positional I/O remove shared mutable state on the hot path. The per-worker handle cache further reduces syscall overhead.
- Shutdown sequence: enqueue `Shutdown`, the dispatcher forwards it to every worker, each worker drains outstanding SQEs/CQEs, and finally the dispatcher and workers are joined.
- BufferPool, TableHeap, and the streaming scan ring buffer still integrate via channels; inflight guards prevent duplicate page fetches.
6. Performance Notes
- Random page access benefits from fewer syscalls and deeper outstanding queue depth than the blocking fallback.
- Only the io_uring backend currently ships (Linux x86_64). A portable fallback remains future work.
- For large sequential scans, combine `ReadPages` with the ring-buffer iterator to minimise buffer-pool churn.
7. Future Work
- Queue-depth aware scheduling and CQE bulk harvesting.
- Optional group commit (aggregate writes, single fsync) behind configuration.
- Metrics hooks (queue depth, submit/complete throughput, latency percentiles, error codes).
- Cross-platform fallback backend and richer prioritisation/throttling policies.
- Control-plane knobs for throttling individual background workers.
Page and Tuple Layout
1. The Page: The Atomic Unit of I/O
A database file is not treated as one continuous stream of data. Instead, it is broken down into fixed-size blocks called pages. A page is the atomic unit of transfer between the disk and the in-memory Buffer Pool. Whenever the database needs to read a piece of data (like a single row), it must load the entire page containing that data into memory.
In QuillSQL, the page size is a constant defined at `quill-sql/src/buffer/mod.rs`:
- `PAGE_SIZE`: 4096 bytes (4 KB)
This fixed-size approach simplifies buffer management and allows for efficient, aligned I/O operations, especially when using Direct I/O to bypass the OS cache.
2. TablePage: The Slotted Page Layout
While a page is a generic 4KB block of bytes, pages that store actual table data are structured in a specific way. QuillSQL uses a classic Slotted Page layout, which is a core concept in database implementation (as taught in CMU 15-445).
A `TablePage` is organized into three main parts:
```text
<------------------------------ 4KB ------------------------------>
+----------------+-----------------+-----------------+--------------+
|  Page Header   |   Slot Array    |      Free       |    Tuple     |
|  (grows ->)    |   (grows ->)    |      Space      |    Data      |
|                |                 |                 |  (<- grows)  |
+----------------+-----------------+-----------------+--------------+
```
- Page Header (`TablePageHeader`): Located at the beginning of the page. It contains metadata about the page itself.
- Slot Array (`tuple_infos`): An array of `TupleInfo` structs that grows from after the header. Each entry in this array acts as a "pointer" or "directory entry" for a tuple on the page.
- Tuple Data: The actual raw data of the tuples is stored starting from the end of the page, growing backwards towards the middle.
This design has a key advantage: it decouples a tuple's logical identifier from its physical location on the page.
The RecordId (RID)
A specific tuple is uniquely identified by a `RecordId` (RID). The RID is a stable pointer composed of two parts:
- `page_id`: The ID of the page where the tuple resides.
- `slot_num`: The index into the Slot Array on that page.
So, `RID = (page_id, slot_num)`.
When the database needs to delete a tuple, or when a variable-length tuple is updated and grows in size, the tuple's data might need to be moved within the page (for compaction). In a slotted page design, we only need to update the `offset` in the corresponding slot array entry. The tuple's RID (`page_id`, `slot_num`) remains unchanged. This prevents a cascade of updates to all secondary indexes that might be pointing to that tuple.
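A minimal sketch of this stable identifier is shown below; the real QuillSQL struct may use different field names or integer widths.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct RecordId {
    page_id: u32,  // which page the tuple lives on
    slot_num: u16, // index into that page's slot array
}

// Moving tuple bytes inside the page only changes the slot's `offset`;
// the (page_id, slot_num) pair handed out to secondary indexes stays valid.
```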
TablePageHeader and Slot (TupleInfo)
Let's look at the physical layout, as defined in `storage/codec/table_page.rs`:
- `TablePageHeader`:
  - `lsn`: The Log Sequence Number of the last change made to this page, crucial for WAL recovery.
  - `next_page_id`: The ID of the next page in this table, forming a linked list of pages.
  - `num_tuples`: The number of active tuples on the page.
  - `num_deleted_tuples`: The number of "dead" or deleted tuples.
  - `tuple_infos`: The slot array itself.
- `TupleInfo` (a single slot in the array):
  - `offset`: The byte offset from the beginning of the page where the tuple's data begins.
  - `size`: The size of the tuple's data in bytes.
  - `meta`: A nested `TupleMeta` struct containing critical information for concurrency control.
For Study & Discussion
- Layout Trade-offs: What is the main benefit of having the tuple data grow from the end of the page backwards, while the header and slot array grow from the beginning forwards? What happens when they meet?
- Record ID Stability: Why is it so important that a tuple's `RecordId` does not change even if the tuple's physical data is moved within the page? What would break if the RID were just a direct byte offset?
- Large Objects: The current design assumes a tuple fits entirely on one page. How would you modify this page layout to support tuples that are larger than 4KB (e.g., a long blog post stored in a `VARCHAR` column)? Research how systems like PostgreSQL handle this with their "TOAST" (The Oversized-Attribute Storage Technique) mechanism.
- Programming Exercise: Implement a `defragment()` method for the `TablePage`. After several insertions and deletions, the free space on a page can become fragmented into small, unusable chunks. This method should reorganize the page by moving the existing tuples to be contiguous, creating a single, large block of free space. Remember to update the `offset` in each `TupleInfo` slot after moving its corresponding tuple data!
Table Heap and MVCC
The `TableHeap` (`storage/table_heap.rs`) is the component responsible for managing the collection of pages that belong to a single table. While the `TablePage` defines the layout within a single page, the `TableHeap` provides the high-level API for inserting, updating, and deleting tuples, making it central to the implementation of MVCC.
Tuple Serialization
A logical `Tuple` struct in memory must be converted into a byte array to be stored on a page. This process is handled by `storage/codec/tuple.rs`.
The serialized format of a tuple consists of two parts:
- Null Bitmap: A compact bitmap at the beginning of the tuple data. Each bit corresponds to a column in the schema; if the bit is 1, the column's value is NULL. This avoids storing any data for null fields.
- Attribute Data: The actual data for all non-null columns, serialized one after another.
TupleMeta: The Heart of MVCC
The most important part of a tuple's on-disk metadata is the `TupleMeta` struct, which is stored directly within the `TupleInfo` slot in the page header. This is the heart of QuillSQL's Multi-Version Concurrency Control (MVCC) implementation.
- `insert_txn_id`: The ID of the transaction that created this version of the tuple.
- `delete_txn_id`: The ID of the transaction that "deleted" this version. (A value of 0 or INVALID means it is not deleted.)
- `is_deleted`: A boolean flag indicating the tuple version is considered deleted.
- `prev_version: Option<RecordId>`: A link (RID) to the previous version of this logical row.
- `next_version: Option<RecordId>`: A link (RID) to the next version of this logical row.
These fields allow QuillSQL to maintain a version chain for each logical row. When a row is updated, the `TableHeap`'s `mvcc_update` method is called. Instead of overwriting the data, it performs the following steps (sketched in code below):
1. Creates a new tuple version with the new data by calling `insert_tuple`.
2. Sets the `prev_version` pointer of this new version to the RID of the old version.
3. Updates the `next_version` pointer of the old version to point to the new version.
4. Sets the `delete_txn_id` on the old version to mark it as "dead".
A transaction can then traverse this chain and use the transaction IDs in the `TupleMeta` to determine which version of a row is visible to it, based on its own transaction ID and isolation level, thus achieving transaction isolation without long-held read locks.
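The following is a hypothetical sketch of the version-chain stitching performed by `mvcc_update`, using simplified types. The real method also writes WAL records and tracks command ids.

```rust
#[derive(Clone, Copy)]
struct Rid { page_id: u32, slot_num: u16 }

struct Meta {
    insert_txn_id: u64,
    delete_txn_id: u64,
    prev_version: Option<Rid>,
    next_version: Option<Rid>,
}

struct Tuple; // placeholder for the serialized row

trait HeapOps {
    fn insert_tuple(&mut self, meta: Meta, tuple: Tuple) -> Rid;
    fn meta_mut(&mut self, rid: Rid) -> &mut Meta;
}

fn mvcc_update<H: HeapOps>(heap: &mut H, old_rid: Rid, new_tuple: Tuple, txn_id: u64) -> Rid {
    // 1. Insert the new version; it points back at the old one.
    let new_rid = heap.insert_tuple(
        Meta {
            insert_txn_id: txn_id,
            delete_txn_id: 0,
            prev_version: Some(old_rid),
            next_version: None,
        },
        new_tuple,
    );
    // 2. Link the old version forward and mark it deleted by this transaction.
    let old_meta = heap.meta_mut(old_rid);
    old_meta.next_version = Some(new_rid);
    old_meta.delete_txn_id = txn_id;
    new_rid
}
```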
For Study & Discussion
- Version Chain Traversal: Imagine a transaction with ID T10 needs to read a row. It finds a version of the row that was inserted by T5 (committed) but deleted by T12 (in progress). Should T10 see this version? What if the deleter was T8 (committed)? What if the deleter was T10 itself? Walk through the visibility check logic.
- Garbage Collection: The MVCC model creates many "dead" tuple versions that are no longer visible to any active or future transaction. What is the long-term problem with leaving these dead versions in the database? This problem is typically solved by a process called vacuuming or garbage collection. When is it safe to physically remove a dead tuple version?
- Programming Exercise (Advanced): Implement a basic `VACUUM` function for a `TableHeap`. This function should scan the table and identify dead tuple versions that are no longer visible to any currently active transaction. Once identified, it should physically remove them from their pages. This is a challenging exercise that touches transaction management, storage, and concurrency.
Indexes
Indexes are crucial for database performance. They are redundant data structures that allow the system to find rows matching specific criteria quickly, without having to scan the entire table. While indexes speed up queries (reads), they incur a cost during data modification (writes), as both the table and its indexes must be updated.
QuillSQL currently provides a `BPlusTreeIndex`.
Key Concepts
- Data Structure: The index is built using a B+Tree, a self-balancing tree data structure that maintains sorted data and allows for efficient insertion, deletion, and search operations (typically in O(log n) time).
- Index Keys: The B+Tree stores key-value pairs. The "key" is the value from the indexed column(s) (e.g., a user's ID), and the "value" is the `RecordId` (RID) of the row containing that key. This allows the database to first find the key in the B+Tree and then use the RID to immediately locate the full row data on its data page.
- Concurrency Control: The B+Tree implementation must be highly concurrent. QuillSQL's implementation uses techniques such as B-link page connections and latch crabbing to allow multiple readers and writers to access the tree simultaneously with minimal contention.
This section contains:
- B+Tree: A look at the classic B+Tree data structure.
B+ Tree Index — Architecture and Concurrency
1. Architecture Overview
1.1 Node and Page Structure
- Node Types: Internal and Leaf.
  - Internal Nodes: Store separator keys and pointers to child pages. The first pointer is a "sentinel" that points to the subtree for all keys less than the first key in the node.
  - Leaf Nodes: Store `(key, RecordId)` pairs in sorted order. Leaves are linked together in a singly linked list (`next_page_id`) to allow for efficient range scans.
- Header Page: A fixed page (`header_page_id`) that acts as the entry point to the tree. It stores the `root_page_id`, allowing for atomic updates to the tree root when it splits or shrinks.
- Page Layout:
  - Internal Page: `{Header, High Key, Array<(Key, ChildPageId)>}`. The High Key is part of the B-link optimization.
  - Leaf Page: `{Header, Array<(Key, RID)>}`.
1.2 B-link Structure
The tree uses B-link pointers (`next_page_id`) on all levels (both internal and leaf nodes). This creates a "side-link" to the right sibling, which is crucial for concurrency: it allows readers to recover from transient inconsistent states caused by concurrent page splits by "chasing" the link to the right sibling.
```text
                 +------------------+
                 |    Root (Int)    |
                 +------------------+
                /         |          \
      +--------+     +--------+     +--------+
      | Int P1 |---->| Int P2 |---->| Int P3 |   (Internal B-link pointers)
      +--------+     +--------+     +--------+
      /   |   \      /   |   \      /   |   \
  +----+ +----+  +----+ +----+  +----+ +----+
  | L1 |-| L2 |->| L3 |-| L4 |->| L5 |-| L6 |    (Leaf chain / B-links)
  +----+ +----+  +----+ +----+  +----+ +----+
```
2. Concurrency Control
The B+Tree employs a sophisticated, lock-free read path and a high-concurrency write path using latch crabbing.
2.1 Read Path: Optimistic Lock Coupling (OLC) with B-links
- Readers traverse the tree from the root without taking any locks.
- On each page, a version number is read before and after processing the page's contents. If the version changes, it indicates a concurrent modification, and the read operation restarts from the root.
- If a reader traversing an internal node finds that the search key is greater than or equal to the node's `high_key`, it knows a split has occurred. Instead of restarting, it can use the `next_page_id` B-link to "chase" to the right sibling and continue the search, minimizing restarts. (A sketch of the version-check loop follows this list.)
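The following is a simplified sketch of the optimistic version check described above. The real implementation also handles B-link chasing, lock bits, and bounded retries.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct PageSketch {
    version: AtomicU64, // bumped by every writer that modifies the page
    // ... page contents ...
}

/// Read `page` optimistically: retry the closure until the version is stable.
fn read_optimistic<T>(page: &PageSketch, read: impl Fn(&PageSketch) -> T) -> T {
    loop {
        let v_before = page.version.load(Ordering::Acquire);
        let result = read(page);
        let v_after = page.version.load(Ordering::Acquire);
        if v_before == v_after {
            // No concurrent modification observed; the read is consistent.
            return result;
        }
        // Version changed mid-read: retry (a full tree lookup would restart
        // from the root or chase the right-sibling B-link instead).
    }
}
```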
2.2 Write Path: Latch Crabbing
Writers (insert/delete) use a technique called latch crabbing (or lock coupling) to ensure safe concurrent modifications.
- Process: A writer acquires a write latch on a parent node before fetching and latching a child node. Once the child is latched, the writer checks if the child is "safe" for the operation (i.e., not full for an insert, not at minimum size for a delete).
- If the child is safe, the latch on the parent (and all other ancestors) is released.
- If the child is unsafe, the parent latch is held, as the child might need to split or merge, which would require modifying the parent.
- The `Context` struct: This process is managed by a `Context` struct that holds the stack of write guards (`write_set`) for the current traversal path. Releasing ancestor latches is as simple as clearing this stack.
Latch Crabbing on Insert:

```text
1. Latch(Root)
2. Descend to Child C1. Latch(C1).
3. Check if C1 is safe (not full).
   IF SAFE:
     ReleaseLatch(Root). Path is now just {C1}.
   IF UNSAFE (full):
     Keep Latch(Root). Path is {Root, C1}.
4. Descend from C1 to C2. Latch(C2).
5. Check if C2 is safe... and so on.
```
2.3 Deadlock Avoidance
When modifying siblings (during merge or redistribution), deadlocks are possible if two threads try to acquire latches on the same two pages in opposite orders. This is prevented by enforcing a strict PageId-ordered locking protocol: when two sibling pages must be latched, the page with the lower `PageId` is always latched first.
3. Key Algorithms & Features
- Parent-Guided Redirection: During an insert or delete, after a writer has descended to a leaf, it re-validates its position using the parent (if a latch is still held). If a concurrent split has moved the target key range to a different sibling, the writer can jump directly to the correct page instead of traversing the leaf chain, preventing race conditions.
- Iterator: The iterator performs a forward scan by following the leaf chain (`next_page_id`). It uses a lightweight form of OLC, checking the leaf page version to detect concurrent modifications and restarting if necessary so that it doesn't miss keys.
  - Sequential Scan Optimization (RingBuffer): For large range scans, the iterator switches to a "synchronous batch fetch + local ring buffer" mode. It fills a small `RingBuffer<BytesMut>` with consecutive leaf pages (by following `next_page_id`) and then decodes KVs locally without holding page guards for long. This reduces buffer pool pollution and syscall/lock overhead.
  - Two Iteration Modes:
    - Guard Mode: Keep a `ReadPageGuard` and decode per step; prefetch the next leaf best-effort.
    - Bytes Mode: After switching, decode from `BytesMut` buffers in the local ring; when a leaf is exhausted, pop the next bytes from the ring or refill by following the chain.
- Prefix Compression: Keys in internal nodes are prefix-compressed to save space. Each key is stored as `(lcp, suffix_len, suffix)`. This reduces the size of internal pages, increasing the tree's fanout and reducing its height, which improves cache performance and reduces I/O.
- Split/Merge Safety:
  - Split: When a node splits, the new right sibling is written first. Then the B-link pointer and separator key are published atomically by updating the left sibling and parent. This ensures readers can always navigate the structure correctly.
  - Merge/Redistribute: When a node is underfull, the implementation first tries to borrow an entry from a sibling (redistribute). If both siblings are at minimum size, it merges with a sibling. All these operations carefully maintain the B-link chain and parent pointers.
4. Benchmarks & Performance
4.1 Example: Range Scan Benchmark
This benchmark measures the efficiency of the leaf-chain traversal, which is critical for SELECT
queries with WHERE
clauses on indexed columns. It benefits from iterator prefetching and prefix compression.
```rust
// Pseudo-code for a range scan benchmark. Helpers such as
// `create_tuple_from_key`, `create_rid_from_key`, and `TreeIndexIterator`
// stand in for the real index API.
use std::sync::Arc;
use std::time::Instant;

fn benchmark_range_scan(index: Arc<BPlusTreeIndex>, num_keys: i64, num_passes: usize) {
    // 1. Populate the index with sequential keys
    for key in 1..=num_keys {
        let tuple = create_tuple_from_key(key, index.key_schema.clone());
        index.insert(&tuple, create_rid_from_key(key)).unwrap();
    }

    // 2. Run the benchmark
    let start = Instant::now();
    let mut count = 0;
    for _ in 0..num_passes {
        // Create an iterator over the full key range
        let mut iter = TreeIndexIterator::new(index.clone(), ..);
        while let Some(_) = iter.next().unwrap() {
            count += 1;
        }
    }
    let elapsed = start.elapsed();
    let total_items_scanned = num_keys as usize * num_passes;
    let items_per_sec = total_items_scanned as f64 / elapsed.as_secs_f64();
    println!(
        "Range Scan: Scanned {} items in {:?}. Throughput: {:.2} items/sec",
        total_items_scanned, elapsed, items_per_sec
    );
}
```
4.2 Performance Notes
- Hot Reads: Performance on hot-spot reads depends on keeping upper levels and hot leaves resident in the buffer pool. Warm up the cache for read-heavy benchmarks. Protect hot pages from pollution by enabling TinyLFU admission.
- Large Range Scans: Prefer table SeqScan with ring buffer (bypass) when scanning most of the table. For index scans over very large ranges, consider future Bitmap Heap Scan rather than bypassing the pool for leaves.
4.3 Configuration
`config::BTreeConfig` controls iterator behavior:
- `seq_batch_enable` (bool): enable batch mode with a local ring buffer.
- `seq_window` (usize): number of leaf pages to prefill into the ring per refill.
- `prefetch_enable` / `prefetch_window`: guard-mode prefetch hints to the buffer pool.
Defaults are conservative; increase `seq_window` for long scans to reduce I/O hops.
5. Future Work
- Stronger OLC with bounded retries and telemetry.
- CSB+-like internal layout and columnar key prefixing.
- NUMA-aware partitioning and router.
- WAL/MVCC for crash recovery and snapshot isolation.
Recovery Manager (WAL)
The Recovery Manager is the component that ensures the Durability aspect of ACID. It guarantees that once a transaction is committed, its effects are permanent, even in the face of system crashes or power failures.
This is achieved through a Write-Ahead Log (WAL) and a recovery protocol inspired by ARIES.
Core Concepts
- Write-Ahead Logging: The fundamental principle is that any change to a data page must be recorded in a log file on durable storage before the data page itself is written back to disk.
- Log Records: The WAL is a sequence of records, each with a unique Log Sequence Number (LSN). Records describe changes (`PageDelta`, `HeapInsert`), transaction outcomes (`Commit`, `Abort`), or internal state (`Checkpoint`).
- ARIES Protocol: On startup, the database replays the log to restore a consistent state. This involves three phases: Analysis (to figure out what was happening during the crash), Redo (to re-apply all changes since the last checkpoint), and Undo (to roll back any transactions that did not commit).
This section contains:
- The ARIES Protocol in QuillSQL: A deep dive into the three phases of recovery and how they are implemented.
The ARIES Recovery Protocol
Of the four ACID properties, Durability is the one that guarantees that once a transaction is committed, its changes will survive any subsequent system failure. In a disk-based database, this is achieved through a careful and robust recovery protocol. QuillSQL implements a recovery strategy inspired by the well-known ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) protocol, centered around a Write-Ahead Log (WAL).
1. The Write-Ahead Logging (WAL) Principle
The fundamental rule of WAL is simple but powerful:
Before a modified data page is ever written from memory back to disk, the log record describing that modification must first be written to durable storage (the log file).
This allows the database to perform most of its data modifications in memory on the Buffer Pool without needing to synchronously write every changed page to disk, which would be extremely slow. In case of a crash, the database can use the log to reconstruct the state of the system and ensure all committed changes are present and all uncommitted changes are undone.
Log Records (WalRecord)
The WAL is a sequential, append-only file composed of log records. Each record is assigned a unique, monotonically increasing Log Sequence Number (LSN). A log record in QuillSQL (`src/recovery/wal_record.rs`) generally contains:
- LSN: The address of the log record.
- Transaction ID: The ID of the transaction that generated this record.
- Payload: The actual content of the log record, which varies by type.
QuillSQL uses several types of log records:
- `Transaction`: Marks the BEGIN, COMMIT, or ABORT of a transaction.
- `PageWrite` / `PageDelta`: Physical/physiological records describing a change to a page. `PageWrite` contains a full image of the page (used for the first write after a checkpoint, a technique called First-Page-Write or FPW), while `PageDelta` contains only the changed bytes.
- `HeapInsert` / `HeapUpdate` / `HeapDelete`: Logical records describing a high-level heap operation. These are primarily used for generating precise undo operations.
- `Checkpoint`: A special record that marks a point of partial durability, allowing the log to be truncated.
- `CLR` (Compensation Log Record): A special record written during recovery to describe an undo action. CLRs are redo-only and are never undone themselves. (These record kinds are sketched as a Rust enum after this list.)
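A simplified sketch of these record kinds as a Rust enum is shown below. The field names and payload shapes are illustrative assumptions, not the exact layout used by `wal_record.rs`.

```rust
type Lsn = u64;
type TxnId = u64;
type PageId = u64;

enum TxnMarker { Begin, Commit, Abort }

enum WalPayload {
    Transaction { kind: TxnMarker },
    PageWrite { page_id: PageId, full_image: Vec<u8> },        // FPW after a checkpoint
    PageDelta { page_id: PageId, offset: u16, bytes: Vec<u8> }, // only the changed bytes
    HeapInsert { page_id: PageId, slot: u16, tuple: Vec<u8> },
    HeapDelete { page_id: PageId, slot: u16 },
    Checkpoint { /* dirty page table + active transactions */ },
    Clr { undo_next_lsn: Lsn },                                 // redo-only compensation record
}

struct WalRecord {
    lsn: Lsn,
    txn_id: TxnId,
    payload: WalPayload,
}
```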
2. The ARIES Recovery Protocol
On database startup, the `RecoveryManager` (`recovery/recovery_manager.rs`) is invoked to `replay()` the WAL. This process follows the three phases of ARIES.
Phase 1: Analysis
The goal of the Analysis phase (`recovery/analysis.rs`) is to figure out the state of the database at the exact moment of the crash.
- It starts by finding the last successful `Checkpoint` record in the WAL.
- It then scans the log forward from that checkpoint to the end of the log.
- During this scan, it builds two critical data structures:
  - Active Transaction Table (ATT): A list of all transactions that have a `BEGIN` record but no corresponding `COMMIT` or `ABORT` record. These are the potential "loser" transactions that will need to be undone.
  - Dirty Page Table (DPT): A list of all pages that were modified in the buffer pool but might not have been written to disk before the crash. For each dirty page, it records the LSN of the first log record that made it dirty (this is called the `recLSN`).
At the end of this phase, the `RecoveryManager` knows exactly which transactions to roll back and the earliest point in the log from which it needs to start re-applying changes.
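A high-level sketch of this scan is shown below: rebuild the ATT and DPT from the log records after the last checkpoint. The record type and shapes are simplified placeholders.

```rust
use std::collections::{HashMap, HashSet};

type Lsn = u64;
type TxnId = u64;
type PageId = u64;

enum Rec {
    Begin(TxnId),
    Commit(TxnId),
    Abort(TxnId),
    PageChange { txn: TxnId, page: PageId },
}

fn analysis(log_from_checkpoint: &[(Lsn, Rec)]) -> (HashSet<TxnId>, HashMap<PageId, Lsn>) {
    let mut att: HashSet<TxnId> = HashSet::new();       // potential loser transactions
    let mut dpt: HashMap<PageId, Lsn> = HashMap::new(); // page -> recLSN
    for (lsn, rec) in log_from_checkpoint {
        match rec {
            Rec::Begin(t) => { att.insert(*t); }
            Rec::Commit(t) | Rec::Abort(t) => { att.remove(t); }
            Rec::PageChange { txn, page } => {
                att.insert(*txn); // activity implies the txn was live
                // Keep only the FIRST lsn that dirtied the page (its recLSN).
                dpt.entry(*page).or_insert(*lsn);
            }
        }
    }
    (att, dpt)
}
```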
Phase 2: Redo (Repeating History)
The goal of the Redo phase (`recovery/redo.rs`) is to restore the database to its exact state at the moment of the crash, including all changes from both committed and uncommitted (loser) transactions.
- The Redo phase finds the smallest `recLSN` in the Dirty Page Table built during Analysis. This LSN is the starting point for the redo scan.
- It scans the log forward from this starting point.
- For every log record it encounters that describes a physical change to a page (e.g., `PageWrite`, `PageDelta`, `Heap*`), it re-applies the change. This is idempotent: if the change is already present on the page (because it was successfully flushed to disk before the crash), re-applying it does no harm.
At the end of this phase, the database state on disk is identical to how it was in memory right before the crash.
Phase 3: Undo (Rolling Back Losers)
The final phase (`recovery/undo.rs`) is responsible for rolling back all the "loser" transactions identified during Analysis.
- The `UndoExecutor` takes the list of loser transactions.
- For each loser transaction, it scans the WAL backwards, following the chain of log records for that transaction.
- For each operation record (like `HeapInsert`, `HeapUpdate`), it performs the logical inverse operation:
  - To undo an Insert, it performs a Delete.
  - To undo a Delete, it restores the deleted data.
  - To undo an Update, it restores the data from before the update.
- Crucially, for every undo action it performs, it writes a Compensation Log Record (CLR) to the WAL. The CLR contains information about the undo action and, importantly, a pointer to the next log record that needs to be undone for that transaction.
This use of CLRs makes the recovery process itself crash-proof. If the system crashes during the undo phase, the next time recovery runs, it will see the CLRs. It will simply continue the undo process from where it left off by following the pointers in the CLRs, without ever having to undo an undo action.
3. Checkpoints
If the log were allowed to grow forever, recovery would become slower and slower. Checkpoints are a background process that periodically creates a point of partial durability.
A checkpoint does the following:
1. Temporarily stops new transactions from starting.
2. Writes a `BEGIN_CHECKPOINT` record to the WAL, containing the current ATT and DPT.
3. Flushes all dirty pages from the buffer pool to disk.
4. Writes an `END_CHECKPOINT` record to the WAL.
Once a checkpoint is complete, the `RecoveryManager` knows that all changes described in log records before the `BEGIN_CHECKPOINT` record are safely on disk. Therefore, those older parts of the WAL file are no longer needed for recovery and can be truncated or recycled, keeping the recovery process efficient.
For Study & Discussion
- Repeating History: The Redo phase re-applies changes from all transactions, including those that will be undone later (the "losers"). This seems wasteful. Why is this "repeating history" approach a core principle of ARIES? What would go wrong if we tried to only redo changes from committed transactions?
- Compensation Log Records (CLRs): What specific problem would occur if the system crashed during the Undo phase and we didn't write CLRs? How does a CLR's "redo-only" nature make the recovery process idempotent and robust against repeated crashes?
- Checkpointing Frequency: What are the performance trade-offs between checkpointing very frequently versus very infrequently? Consider both normal runtime performance and the time it takes to recover after a crash.
- Programming Exercise: Add a new WAL record type. For example, a `HeapReclaim` record that logs the physical removal of a dead tuple by a vacuum process. To do this, you would need to: (a) add a variant to the `HeapRecordPayload` enum in `wal_record.rs`; (b) update the `codec` functions to handle its serialization; (c) decide what information the `HeapReclaim` record needs to contain; (d) implement the `redo` logic for this new record in the appropriate `ResourceManager`. What should happen when you redo a `HeapReclaim` record?
Transaction Manager
The Transaction Manager is responsible for one of the most critical aspects of a database: guaranteeing the ACID properties, specifically Isolation and Atomicity.
It allows multiple clients to access the database concurrently without interfering with each other, and it ensures that transactions are treated as single, indivisible operations (all or nothing).
Core Concepts
- Concurrency Control: To handle simultaneous operations, QuillSQL uses a hybrid approach that combines two powerful techniques:
  - Multi-Version Concurrency Control (MVCC): Instead of overwriting data, writers create new versions of rows. Readers are given a consistent snapshot of the database, which allows them to proceed without being blocked by writers.
  - Two-Phase Locking (2PL): To prevent two writers from modifying the same row at the same time, a strict two-phase locking protocol is used. Transactions must acquire locks on data before modifying it and hold those locks until the transaction ends.
- Atomicity: The transaction manager records all actions performed by a transaction. If the transaction needs to be aborted, it can use this record to perform the necessary undo operations, rolling the database back to its state before the transaction began.
This section contains:
- MVCC and 2PL: A deep dive into QuillSQL's hybrid concurrency control model.
MVCC and 2PL
Of the four ACID properties, Isolation is often the most complex to implement. It ensures that concurrent transactions do not interfere with each other, making it appear as if each transaction is executing sequentially, one after another. Without proper isolation, a database would suffer from concurrency-related anomalies like dirty reads, non-repeatable reads, and phantom reads.
Databases typically use two main strategies for concurrency control:
- Pessimistic Concurrency Control: Assumes conflicts are likely and prevents them from happening by using locks. The most common protocol is Two-Phase Locking (2PL).
- Optimistic Concurrency Control: Assumes conflicts are rare. Transactions proceed without locking, and the database validates at commit time that no conflicts occurred. A popular variant is Multi-Version Concurrency Control (MVCC).
QuillSQL, like many modern relational databases (e.g., PostgreSQL, Oracle), implements a powerful hybrid model that combines MVCC with 2PL.
1. MVCC: Reading without Blocking
The core idea of MVCC is "writers don't block readers, and readers don't block writers." This is achieved by never overwriting data in-place. When a row is updated, the database creates a new version of that row, preserving the old one.
Version Chains and TupleMeta
As discussed in the Storage Engine chapter, every tuple on disk has associated metadata (`TupleMeta`) that includes:
- `insert_txn_id`: The ID of the transaction that created this version.
- `delete_txn_id`: The ID of the transaction that marked this version as deleted.
- `prev_version` / `next_version`: Pointers (RIDs) that form a linked list of versions for a single logical row, called the version chain.
Transaction Snapshots
When a transaction begins, it asks the `TransactionManager` for a snapshot of the database state. This snapshot, defined in `transaction/mvcc.rs`, contains three key pieces of information:
- `xmin`: The oldest active transaction ID at the time of the snapshot. Any transaction with an ID less than `xmin` is guaranteed to be either committed or aborted.
- `xmax`: The next transaction ID to be assigned. Any transaction with an ID greater than or equal to `xmax` had not yet started when the snapshot was taken.
- `active_txns`: A list of all other transaction IDs that were active when the snapshot was taken.
The Visibility Check
When a transaction scans the database, for every tuple version it encounters, it performs a visibility check using its snapshot. The logic in `MvccSnapshot::is_visible` determines whether the version should be "seen" by the current transaction. In simplified terms, a tuple version is visible if:
- its `insert_txn_id` belongs to a transaction that was already committed before our snapshot was taken, AND
- its `delete_txn_id` is either not set, OR it belongs to a transaction that was not yet committed when our snapshot was taken.
This mechanism elegantly solves several concurrency problems. Since a reader transaction only ever sees a consistent snapshot of the database, it is completely immune to changes being made by other concurrent writer transactions. This prevents dirty reads and non-repeatable reads.
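Below is a simplified sketch of that visibility rule. The real `MvccSnapshot::is_visible` also has to handle the reading transaction's own uncommitted changes, command ids, and aborted writers; this sketch ignores those cases.

```rust
struct MvccSnapshot {
    xmin: u64,
    xmax: u64,
    active_txns: Vec<u64>,
}

impl MvccSnapshot {
    /// A txn id counts as committed-before-snapshot if it started before xmax
    /// and was not still active when the snapshot was taken.
    fn committed_before_snapshot(&self, txn_id: u64) -> bool {
        txn_id < self.xmax && !self.active_txns.contains(&txn_id)
    }

    fn is_visible(&self, insert_txn_id: u64, delete_txn_id: Option<u64>) -> bool {
        // The inserting transaction must be committed from our point of view...
        if !self.committed_before_snapshot(insert_txn_id) {
            return false;
        }
        // ...and the deleting transaction (if any) must NOT be.
        match delete_txn_id {
            None => true,
            Some(d) => !self.committed_before_snapshot(d),
        }
    }
}
```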
2. Two-Phase Locking (2PL): Preventing Write-Write Conflicts
While MVCC is excellent for read-write conflicts, it does not, by itself, prevent two transactions from trying to modify the same logical row at the same time (a write-write conflict). This is where locking comes in.
QuillSQL implements a strict Two-Phase Locking protocol via the `LockManager` (`transaction/lock_manager.rs`).
The Two Phases
- Growing Phase: The transaction can acquire new locks as it executes.
- Shrinking Phase: Once the transaction releases its first lock, it cannot acquire any new locks.
In practice, QuillSQL uses Strict 2PL, where the shrinking phase is delayed until the transaction commits or aborts. All locks are held until the very end.
Hierarchical Locking (Intention Locks)
To be efficient, the `LockManager` uses a multi-granularity, hierarchical locking scheme. Before a transaction can take a fine-grained lock on a row (a Shared or Exclusive lock), it must first acquire a coarser-grained intention lock on the table.
- `IntentionShared (IS)`: Signals the intent to read rows from the table.
- `IntentionExclusive (IX)`: Signals the intent to modify rows in the table.
This prevents a transaction wanting to lock the entire table from conflicting with another transaction that is already modifying a single row. For example, a `SELECT * FROM users FOR UPDATE` (which needs an X lock on the table) will be blocked if another transaction already holds an IX lock on `users`.
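The following is a hypothetical sketch of table-level mode compatibility for the IS/IX/S/SIX/X modes mentioned in this chapter. It mirrors the classic multi-granularity compatibility matrix; the actual `LockManager` logic may differ in detail.

```rust
#[derive(Clone, Copy)]
enum LockMode {
    IntentionShared,          // IS
    IntentionExclusive,       // IX
    Shared,                   // S
    SharedIntentionExclusive, // SIX
    Exclusive,                // X
}

fn compatible(held: LockMode, requested: LockMode) -> bool {
    use LockMode::*;
    match (held, requested) {
        // X conflicts with everything.
        (Exclusive, _) | (_, Exclusive) => false,
        // SIX is compatible only with IS.
        (SharedIntentionExclusive, IntentionShared)
        | (IntentionShared, SharedIntentionExclusive) => true,
        (SharedIntentionExclusive, _) | (_, SharedIntentionExclusive) => false,
        // S conflicts with IX (X and SIX are already handled above).
        (Shared, IntentionExclusive) | (IntentionExclusive, Shared) => false,
        // IS/IX/S among themselves are compatible.
        _ => true,
    }
}
```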
Deadlock Detection
When two or more transactions are waiting for each other to release locks in a circular chain, a deadlock occurs. The `LockManager` detects this by building a waits-for graph. When a transaction T1 has to wait for a lock held by T2, an edge `T1 -> T2` is added to the graph. If adding an edge creates a cycle, a deadlock is detected, and one of the transactions (the victim) is immediately aborted to break the cycle.
3. The Hybrid Model: MVCC + 2PL in Action
In QuillSQL, these two mechanisms work in concert:
- A reader (`SELECT`) transaction acquires an MVCC snapshot. It uses this snapshot to determine visibility. It only needs to acquire `Shared` (S) locks on the rows it reads to prevent other transactions from modifying them, thus ensuring repeatable reads in higher isolation levels.
- A writer (`UPDATE`, `DELETE`) transaction must acquire an `Exclusive` (X) lock on the specific row it intends to modify. Once the lock is granted, it knows no other writer can interfere. It can then safely create a new tuple version as part of the MVCC protocol.
This hybrid approach provides the best of both worlds: reads are fast and non-blocking, while write-write conflicts are safely prevented by the locking protocol.
For Study & Discussion
- Isolation Levels: QuillSQL supports multiple SQL isolation levels. How does the behavior of MVCC snapshots and 2PL change between `ReadCommitted` and `RepeatableRead`? In `ReadCommitted`, a transaction gets a new snapshot for every statement, whereas in `RepeatableRead`, it uses the same snapshot for the entire transaction. What concurrency anomalies does this difference prevent?
- Phantom Reads: Even with MVCC and row-level 2PL, a `RepeatableRead` transaction can suffer from phantom reads. Imagine `T1` runs `SELECT COUNT(*) FROM users WHERE age > 30`. Then `T2` inserts a new user with `age = 40` and commits. If `T1` runs its `SELECT` query again, it may see a new "phantom" row that wasn't there before. How can a database prevent this to achieve the `Serializable` isolation level? (Hint: research predicate locking and index-range locking.)
- Deadlock Handling: QuillSQL detects deadlocks by building a waits-for graph and aborting a transaction. What is an alternative strategy for handling deadlocks? For example, what are the pros and cons of using lock timeouts instead of cycle detection?
- Programming Exercise (Advanced): A full implementation of `Serializable` isolation often requires index-range locking to prevent phantoms. Extend the `LockManager` to support locking a range of keys within a B+Tree index. This would require a new lock type and a way to check for overlapping ranges. When a `SELECT ... WHERE age > 30` query runs, it would place a shared lock on the `(30, +∞)` range in the index on the `age` column, preventing any other transaction from inserting a new user with an age in that range.
Query Plan
The Query Planner is the "brain" of the database. It is responsible for taking a declarative SQL query string (which describes what data to retrieve) and converting it into a highly optimized, imperative plan (which describes how to retrieve that data).
This process is one of the most complex and important aspects of a relational database system.
Core Concepts
- AST (Abstract Syntax Tree): The raw SQL text is first parsed into a tree structure that represents the syntax of the query.
- Logical Plan: The AST is then converted into a logical plan. This is a tree of relational algebra operators (e.g., `Filter`, `Projection`, `Join`) that describes the logical steps required to fulfill the query, independent of any specific algorithm or data layout.
- Physical Plan: The logical plan is then converted into a physical plan. This plan consists of concrete operators (or "iterators") that implement specific algorithms (e.g., `NestedLoopJoin`, `SeqScan`). This is the plan that is actually executed.
This section contains:
- The Lifecycle of a Query: A deep dive into the journey from SQL string to executable physical plan.
The Lifecycle of a Query
When you submit a SQL query to a database, it doesn't just magically produce a result. The database undertakes a sophisticated, multi-stage process to translate the declarative SQL statement (which describes what data you want) into an imperative, efficient execution plan (which describes how to get that data). This entire process is the responsibility of the Query Planner.
In QuillSQL, this process follows a classic, compiler-like pipeline, which is a cornerstone of modern database architecture as taught in courses like CMU 15-445.
The journey from a SQL string to an executable plan involves several transformations:
SQL String -> AST (Abstract Syntax Tree) -> Logical Plan -> Optimized Logical Plan -> Physical Plan
Let's break down each stage.
Stage 1: Parsing (SQL -> AST)
The first step is purely syntactic. The raw SQL query string is fed into a parser. QuillSQL uses the excellent `sqlparser` crate for this. The parser checks if the SQL conforms to valid grammar and, if so, converts it into an Abstract Syntax Tree (AST).
An AST is a direct tree representation of the SQL query's structure. For example, `SELECT id FROM users WHERE age > 30` would be parsed into a tree structure with nodes representing the `SELECT` clause, the table `users`, the `WHERE` clause, and the predicate `age > 30`.
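For instance, parsing a statement with the `sqlparser` crate looks roughly like this (a standalone sketch; QuillSQL's `sql::parser` module wraps this step, so the dialect choice and error handling here are illustrative):

```rust
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = "SELECT id FROM users WHERE age > 30";
    // Parse the raw SQL text into a Vec<Statement> (the AST).
    let ast = Parser::parse_sql(&GenericDialect {}, sql).expect("valid SQL");
    // Debug-printing the first statement shows the SELECT clause, the table,
    // and the WHERE predicate as nested AST nodes.
    println!("{:#?}", ast[0]);
}
```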
Stage 2: Logical Planning (AST -> Logical Plan)
Next, the `LogicalPlanner` (`plan/logical_planner.rs`) walks the AST and converts it into a Logical Plan.
A Logical Plan is a tree of relational algebra operators. It describes the query in terms of high-level data operations, completely independent of how the data is stored or which algorithms will be used. It defines what to do, not how to do it.
Key logical operators in QuillSQL (`plan/logical_plan/mod.rs`) include:
- `TableScan(users)`: Represents reading the entire `users` table.
- `Filter(predicate: age > 30)`: Represents filtering rows based on a condition.
- `Projection(columns: [id])`: Represents selecting specific columns.
- `Join`: Represents joining two data sources.
- `Aggregate`: Represents a `GROUP BY` operation.
- `Sort`: Represents an `ORDER BY` operation.
For our example query, the initial logical plan might look like this:
Projection(columns=[id])
└── Filter(predicate=age > 30)
└── TableScan(users)
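In code, such a plan is naturally modelled as a recursive enum of operator nodes. The sketch below is a deliberately trimmed-down, hypothetical version; the real `LogicalPlan` in `plan/logical_plan/mod.rs` has more variants and uses typed expression and schema types rather than plain strings.

```rust
// A simplified, illustrative logical plan; expressions are plain strings here,
// whereas the real planner uses typed expression trees.
enum LogicalPlan {
    TableScan { table: String },
    Filter { predicate: String, input: Box<LogicalPlan> },
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

fn main() {
    // Projection(columns=[id]) -> Filter(age > 30) -> TableScan(users)
    let plan = LogicalPlan::Projection {
        columns: vec!["id".to_string()],
        input: Box::new(LogicalPlan::Filter {
            predicate: "age > 30".to_string(),
            input: Box::new(LogicalPlan::TableScan { table: "users".to_string() }),
        }),
    };
    // The planner would now hand `plan` to the optimizer.
    let _ = plan;
}
```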
Stage 3: Logical Optimization
Before executing the plan, we have a crucial opportunity to make it more efficient. The `LogicalOptimizer` (`optimizer/logical_optimizer.rs`) takes the logical plan and applies a series of transformation rules to produce a new, equivalent logical plan that is expected to be faster.
QuillSQL uses a simple but effective rule-based optimizer. A classic example of such a rule is Predicate Pushdown. Consider this query:
SELECT name FROM (SELECT * FROM users JOIN cities ON users.city_id = cities.id) WHERE users.age > 30;
A naive logical plan would first perform a full `JOIN` between `users` and `cities` and then filter the massive result. Predicate pushdown is a rule that would rewrite the plan to apply the `age > 30` filter before the join:
Before Optimization:
Filter(users.age > 30)
└── Join(users.city_id = cities.id)
├── TableScan(users)
└── TableScan(cities)
After Optimization (Predicate Pushdown):
Join(users.city_id = cities.id)
├── Filter(users.age > 30)
│ └── TableScan(users)
└── TableScan(cities)
By filtering early, we dramatically reduce the number of rows that need to be processed by the expensive `Join` operator. QuillSQL implements similar rules, such as `PushDownLimit`, which pushes `LIMIT` clauses down the tree to reduce the amount of data processed.
Stage 4: Physical Planning (Logical Plan -> Physical Plan)
Finally, the `PhysicalPlanner` (`plan/physical_planner.rs`) converts the optimized logical plan into a Physical Plan.
A Physical Plan describes exactly how the query will be executed. It maps each logical operator to a concrete algorithm or implementation.
- A `LogicalPlan::TableScan` becomes a `PhysicalSeqScan` (a sequential scan of the table heap).
- A `LogicalPlan::Filter` becomes a `PhysicalFilter`, which implements the filtering logic.
- A `LogicalPlan::Join` becomes a `PhysicalNestedLoopJoin`. This is where the database commits to a specific join algorithm. A more advanced database might have multiple options (e.g., `PhysicalHashJoin`, `PhysicalSortMergeJoin`) and would use a cost model to choose the best one. QuillSQL currently implements Nested Loop Join.
Each node in the physical plan tree is an executor that the Execution Engine can run. This final plan is what gets executed to produce the query result.
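Conceptually, the lowering step is a recursive match that picks a concrete operator for each logical node. The sketch below uses hypothetical, simplified node types rather than the real `PhysicalPlanner`:

```rust
// Hypothetical, simplified logical and physical node types, just to show the
// shape of the lowering; the real planner maps many more operators.
enum LogicalPlan {
    TableScan { table: String },
    Filter { predicate: String, input: Box<LogicalPlan> },
}

enum PhysicalPlan {
    SeqScan { table: String },
    Filter { predicate: String, input: Box<PhysicalPlan> },
}

fn lower(plan: LogicalPlan) -> PhysicalPlan {
    match plan {
        // Every logical operator commits to a concrete algorithm here.
        LogicalPlan::TableScan { table } => PhysicalPlan::SeqScan { table },
        LogicalPlan::Filter { predicate, input } => PhysicalPlan::Filter {
            predicate,
            input: Box::new(lower(*input)),
        },
    }
}

fn main() {
    let logical = LogicalPlan::Filter {
        predicate: "age > 30".to_string(),
        input: Box::new(LogicalPlan::TableScan { table: "users".to_string() }),
    };
    // A real engine would now hand the physical plan to the execution engine.
    let _physical = lower(logical);
}
```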
Conclusion
This layered approach—from syntax to a logical representation, then to an optimized logical representation, and finally to a concrete physical execution plan—is fundamental to database design. It provides a clear separation of concerns and, most importantly, creates a dedicated optimization stage, which is the key to achieving high performance on a wide variety of SQL queries.
For Study & Discussion
- Logical vs. Physical: Why is the separation between logical and physical plans so important? What would be the disadvantages of a simpler system that converted the AST directly into a physical plan?
- Join Algorithms: QuillSQL currently only implements `NestedLoopJoin`. What are two other common join algorithms? Describe how they work and in what scenarios they would be more performant than a nested loop join.
- Programming Exercise (Advanced): Implement a `PhysicalHashJoin` operator. This is a significant undertaking that involves:
  a. Creating a `PhysicalHashJoin` struct that implements the `VolcanoExecutor` trait.
  b. In the `init()` phase, it should consume the entire "build" side (typically the smaller, right-hand table) and build an in-memory hash table from its rows.
  c. In the `next()` phase, it should read from the "probe" side (the left-hand table) one row at a time, probe the hash table for matches, and emit the joined tuples.
  d. Modify the `PhysicalPlanner` to choose `PhysicalHashJoin` instead of `PhysicalNestedLoopJoin` for equi-joins.
- Programming Exercise: Add support for the `UNION ALL` operator. This would involve:
  a. Adding a `Union` variant to the `LogicalPlan` enum.
  b. Updating the `LogicalPlanner` to recognize the `UNION` syntax in the AST and create a `LogicalPlan::Union` node.
  c. Creating a `PhysicalUnion` executor that pulls tuples from its first child until it's exhausted, and then pulls tuples from its second child.
Query Optimizer
A naive, direct translation of a SQL query into an execution plan is often dramatically inefficient. The Query Optimizer is a critical stage in the query processing pipeline that rearranges and transforms the initial logical plan into an equivalent, but far cheaper, plan to execute.
Core Concepts
- Rule-Based Optimization: QuillSQL uses a rule-based optimizer. It applies a series of pre-defined, heuristic rules to the logical plan to improve it. This is in contrast to a cost-based optimizer, which would estimate the "cost" of many possible plans and choose the cheapest one.
- Logical Transformations: The key insight is that many different logical plans can produce the exact same result. For example, filtering data before a join is usually much faster than joining two large tables and filtering the result, but the final output is identical. The optimizer's job is to find and apply these beneficial transformations.
This section contains:
- Rule-Based Optimization: A deep dive into how the rule-based optimizer works, using the `PushDownLimit` rule as a concrete example.
Rule-Based Optimization
After the `LogicalPlanner` creates an initial `LogicalPlan`, it's passed to the `LogicalOptimizer`. The initial plan is a direct, syntactically correct translation of the SQL query, but it's often not the most efficient way to execute it. The optimizer's job is to transform this plan into an equivalent, but more performant, logical plan.
The Optimizer in QuillSQL
QuillSQL implements a Rule-Based Optimizer. This is a common and powerful approach where the optimizer is equipped with a set of predefined transformation rules. It repeatedly applies these rules to the logical plan tree until no more rules can be applied, or a maximum number of passes is reached.
The main components are:
- `LogicalOptimizer` (`optimizer/logical_optimizer.rs`): The main driver. It holds a list of rules and contains the logic to recursively walk the plan tree and apply them.
- `LogicalOptimizerRule` trait: An interface that every optimization rule must implement. Its core method is `try_optimize`, which takes a plan node and attempts to return a rewritten, optimized version of that node.
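In outline, the rule interface and the driver loop look something like the sketch below. It is a hedged approximation: the real signatures in `optimizer/logical_optimizer.rs` thread `QuillSQLResult` and the project's `LogicalPlan` type through these calls, and report unchanged nodes differently.

```rust
// Illustrative shapes only; LogicalPlan is a stand-in for the real node type.
#[derive(Clone)]
struct LogicalPlan;

trait LogicalOptimizerRule {
    /// Try to rewrite one plan node; return Some(new_plan) if the rule fired.
    fn try_optimize(&self, plan: &LogicalPlan) -> Option<LogicalPlan>;
}

struct LogicalOptimizer {
    rules: Vec<Box<dyn LogicalOptimizerRule>>,
    max_passes: usize,
}

impl LogicalOptimizer {
    fn optimize(&self, mut plan: LogicalPlan) -> LogicalPlan {
        for _ in 0..self.max_passes {
            let mut changed = false;
            for rule in &self.rules {
                if let Some(rewritten) = rule.try_optimize(&plan) {
                    plan = rewritten;
                    changed = true;
                }
            }
            // Stop early once a full pass applies no rule.
            if !changed {
                break;
            }
        }
        plan
    }
}

struct NoOpRule;
impl LogicalOptimizerRule for NoOpRule {
    fn try_optimize(&self, _plan: &LogicalPlan) -> Option<LogicalPlan> {
        None // this example rule never fires
    }
}

fn main() {
    let optimizer = LogicalOptimizer { rules: vec![Box::new(NoOpRule)], max_passes: 3 };
    let _optimized = optimizer.optimize(LogicalPlan);
}
```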
Deep Dive: The PushDownLimit Rule
One of the most classic and effective optimizations is "pushing down" operations as far as possible towards the data source. Let's examine the `PushDownLimit` rule (`optimizer/rule/push_down_limit.rs`) to see this in action.
Consider the following query:
SELECT * FROM users ORDER BY signup_date LIMIT 10;
The Naive Plan
A naive logical plan for this query would be:
Limit(10)
└── Sort(by: signup_date)
└── TableScan(users)
If executed directly, this plan would:
1. Scan the entire `users` table.
2. Sort the entire table by `signup_date`.
3. Finally, discard all but the first 10 rows.
This is incredibly inefficient, especially for a large table, as it involves a massive, memory-intensive sort operation.
The Optimization Rule
The `PushDownLimit` rule is designed to recognize this specific pattern: a `Limit` operator directly on top of a `Sort` operator.
When the optimizer applies this rule, the `try_optimize` method matches on the `Limit` node. It inspects its child and sees that it's a `Sort` node. The rule then knows it can apply its logic.
The Rewritten Plan
The rule rewrites the plan tree by "pushing" the limit information into the `Sort` node itself:
Limit(10)
└── Sort(by: signup_date, limit: 10)
└── TableScan(users)
Notice the new `limit: 10` property on the `Sort` node. This seemingly small change has a huge performance impact. When the `PhysicalSort` operator is created from this logical node, it now knows that it only needs to find the top 10 rows. Instead of performing a full sort, it can use a much more efficient top-K algorithm, such as maintaining a bounded heap of 10 elements, to find the top 10 rows in a single pass over the data.
This optimization avoids sorting the entire table, dramatically reducing both CPU and memory consumption.
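A sketch of the rewrite, using hypothetical trimmed-down plan nodes (the real rule in `optimizer/rule/push_down_limit.rs` operates on QuillSQL's `LogicalPlan` and handles more cases):

```rust
// Hypothetical, simplified plan nodes just to show the pattern match.
#[derive(Clone)]
enum Plan {
    TableScan { table: String },
    Sort { by: String, limit: Option<usize>, input: Box<Plan> },
    Limit { limit: usize, input: Box<Plan> },
}

/// If a Limit sits directly on top of a Sort, copy the limit into the Sort so
/// the physical sort can do bounded top-K instead of a full sort.
fn push_down_limit(plan: Plan) -> Plan {
    match plan {
        Plan::Limit { limit, input } => match *input {
            Plan::Sort { by, input: sort_input, .. } => Plan::Limit {
                limit,
                input: Box::new(Plan::Sort { by, limit: Some(limit), input: sort_input }),
            },
            other => Plan::Limit { limit, input: Box::new(other) },
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::Limit {
        limit: 10,
        input: Box::new(Plan::Sort {
            by: "signup_date".to_string(),
            limit: None,
            input: Box::new(Plan::TableScan { table: "users".to_string() }),
        }),
    };
    // After the rewrite, the Sort node carries limit: Some(10).
    let _optimized = push_down_limit(plan);
}
```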
Other Rules
QuillSQL implements other simple but effective rules:
- `EliminateLimit`: Removes a `LIMIT` clause if it provides no value (e.g., `LIMIT NULL`).
- `MergeLimit`: If two `LIMIT` clauses are stacked on top of each other (which can happen after other rule transformations), this rule merges them into a single, more restrictive `LIMIT`.
Conclusion
While QuillSQL's optimizer is currently rule-based and relatively simple, it demonstrates the fundamental principles of query optimization. By separating the logical representation of a query from its physical execution and applying equivalence-preserving transformations, a database can achieve massive performance gains. More advanced systems build on this with a Cost-Based Optimizer, which uses table statistics to estimate the "cost" of different physical plans (e.g., choosing between a `NestedLoopJoin` and a `HashJoin`) and pick the cheapest one.
For Study & Discussion
- Rule Ordering: The `LogicalOptimizer` applies its list of rules in a fixed order for a set number of passes. Can the order in which rules are applied affect the final, optimized plan? Can one rule's transformation enable another rule to be applied in a subsequent pass?
- Cost-Based vs. Rule-Based: What is the primary limitation of a purely rule-based optimizer? When would a rule-based optimizer make a poor decision that a cost-based optimizer (with accurate statistics) would get right? (Hint: consider join algorithm selection.)
- Programming Exercise: Implement the classic Predicate Pushdown rule. Your rule should look for a `Filter` operator whose child is a `Join`. If the filter's predicate only uses columns from one side of the join, the rule should push the `Filter` node down to that side of the join, below the `Join` node. This is one of the most effective optimizations in any database.
- Programming Exercise: Implement a Constant Folding rule. This rule would traverse expression trees and pre-compute constant expressions. For example:
  - An expression `WHERE age = 10 + 5` would be rewritten to `WHERE age = 15`.
  - An expression `WHERE 1 = 1` would be evaluated to `true`, and a smart optimizer could then potentially eliminate the `WHERE` clause entirely.
Execution Engine
Once a query has been parsed, planned, and optimized, the resulting `PhysicalPlan` is handed to the Execution Engine. This component is the workhorse of the database, responsible for actually running the plan and producing the final results.
Core Concepts
- Volcano (Iterator) Model: QuillSQL uses a pull-based execution model. Each operator in the physical plan tree is an "iterator" that provides a `next()` method. The parent operator calls `next()` on its children to pull rows upwards through the tree, from the storage layer to the client.
- Physical Operators: Each logical operation (like a filter or a join) is mapped to a concrete physical operator that implements a specific algorithm (e.g., `PhysicalFilter`, `PhysicalNestedLoopJoin`).
- Execution Context: As the plan is executed, a shared `ExecutionContext` is passed between operators. It contains vital information, such as the current transaction and its MVCC snapshot, allowing operators to perform visibility checks and locking.
This section contains:
- The Volcano Execution Model: A deep dive into the pull-based execution model and the lifecycle of a query.
The Volcano Execution Model
Once the Query Planner has produced an optimized `PhysicalPlan`, it's the job of the Execution Engine to run it and produce results. The execution engine is the component that brings the plan to life, interacting with the transaction manager and storage layer to process data.
QuillSQL uses the classic Volcano Model, also known as the Iterator Model. This is a pull-based execution model where each physical operator in the plan tree acts as an iterator that the parent operator can "pull" rows from.
1. The VolcanoExecutor Trait
At the heart of the execution model is the `VolcanoExecutor` trait (`execution/mod.rs`). Every physical operator implements this simple trait:
```rust
pub trait VolcanoExecutor {
    fn init(&self, context: &mut ExecutionContext) -> QuillSQLResult<()>;
    fn next(&self, context: &mut ExecutionContext) -> QuillSQLResult<Option<Tuple>>;
    fn output_schema(&self) -> SchemaRef;
}
```
- `init()`: This method is called once at the beginning of execution. It allows an operator to set up its initial state (e.g., a `SeqScan` operator would initialize its table iterator here).
- `next()`: This is the core method. When called, the operator produces its next output tuple. It returns `Some(tuple)` if it has a row, or `None` if it has exhausted its data source. The top-level `ExecutionEngine` simply calls `next()` on the root of the plan tree in a loop until it receives `None`.
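The top-level drive loop is correspondingly tiny. Here is a standalone sketch with the trait simplified to `&mut self` and without the `ExecutionContext` or error handling, so the names are illustrative rather than QuillSQL's exact API:

```rust
// Simplified: a tuple is just a vector of strings, and errors are ignored.
type Tuple = Vec<String>;

trait VolcanoExecutor {
    fn init(&mut self);
    fn next(&mut self) -> Option<Tuple>;
}

/// Pull rows from the root of the plan tree until it is exhausted.
fn run(root: &mut dyn VolcanoExecutor) -> Vec<Tuple> {
    root.init();
    let mut results = Vec::new();
    while let Some(tuple) = root.next() {
        results.push(tuple);
    }
    results
}

/// A toy leaf operator that emits a fixed set of rows, standing in for a scan.
struct ValuesExec {
    rows: Vec<Tuple>,
    cursor: usize,
}

impl VolcanoExecutor for ValuesExec {
    fn init(&mut self) {
        self.cursor = 0;
    }
    fn next(&mut self) -> Option<Tuple> {
        let row = self.rows.get(self.cursor).cloned();
        self.cursor += 1;
        row
    }
}

fn main() {
    let mut scan = ValuesExec {
        rows: vec![vec!["Alice".into()], vec!["Bob".into()]],
        cursor: 0,
    };
    assert_eq!(run(&mut scan).len(), 2);
}
```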
2. The ExecutionContext
Notice that both `init()` and `next()` take a mutable `ExecutionContext`. This object is the "context" or "world" in which the query runs. It is passed down the entire operator tree and gives each operator access to crucial services:
- `Catalog`: To look up tables and indexes.
- `TransactionManager` and `Transaction`: To interact with the current transaction. This is how operators perform locking and visibility checks.
- `MvccSnapshot`: The specific MVCC snapshot for the current transaction, used to determine which tuple versions are visible.
This design cleanly separates the operator's logic from the transactional context it runs in.
3. Anatomy of Physical Operators
Data flows up the tree from the leaves (scans) to the root. Let's see how it works by examining a few key operators.
Leaf Operator: PhysicalSeqScan
A sequential scan is at the leaf of a plan tree. It's responsible for reading tuples from a table on disk.
- `init()`: It acquires an `IntentionShared` lock on the table and creates a `TableIterator` for the table heap.
- `next()`: In a loop, it does the following:
  - Pulls the next raw tuple `(rid, meta, tuple)` from the `TableIterator`.
  - Calls `context.is_visible(&meta)` to perform an MVCC visibility check using the transaction's snapshot.
  - If the tuple version is visible, it acquires the necessary row lock (e.g., a `Shared` lock) and returns the tuple.
  - If the tuple is not visible, it ignores it and loops to get the next one.
Unary Operator: PhysicalFilter
A filter has one child operator (its `input`). It implements a `WHERE` clause.
- `next()`: Its logic is a simple, tight loop:
  1. It calls `self.input.next()` to get a tuple from its child.
  2. If the child returns `None`, the filter is also exhausted and returns `None`.
  3. If it receives a tuple, it evaluates its predicate expression (e.g., `age > 30`) against the tuple.
  4. If the predicate evaluates to `true`, it returns the tuple. Otherwise, it loops back to step 1.
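Stripped of the `ExecutionContext` and error handling, that loop can be sketched as follows (a standalone illustration over a plain Rust iterator, not the real `PhysicalFilter`):

```rust
// Simplified stand-ins: a tuple is (age, name) and the child is any iterator.
type Tuple = (i64, String);

/// Pull tuples from the child until one satisfies the predicate; this mirrors
/// the PhysicalFilter::next() loop described above.
fn filter_next(
    child: &mut impl Iterator<Item = Tuple>,
    predicate: impl Fn(&Tuple) -> bool,
) -> Option<Tuple> {
    loop {
        match child.next() {
            None => return None,                        // child exhausted
            Some(t) if predicate(&t) => return Some(t), // predicate true: emit
            Some(_) => continue,                        // predicate false: keep pulling
        }
    }
}

fn main() {
    let rows = vec![(25, "Bob".to_string()), (40, "Alice".to_string())];
    let mut child = rows.into_iter();
    let next = filter_next(&mut child, |(age, _)| *age > 30);
    assert_eq!(next, Some((40, "Alice".to_string())));
}
```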
Binary Operator: PhysicalNestedLoopJoin
A join has two children: a left (outer) and a right (inner).
- `next()`: It implements the classic nested loop join algorithm:
  1. Fetch one tuple from the outer (left) child and hold onto it.
  2. Enter a loop: fetch tuples one by one from the inner (right) child.
  3. For each inner tuple, combine it with the held outer tuple and evaluate the join condition. If it matches, return the combined tuple.
  4. When the inner child is exhausted, rewind it by calling `self.right_input.init()` again.
  5. Go back to step 1 to fetch the next tuple from the outer child.
  6. Repeat until the outer child is also exhausted.
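The same algorithm, written eagerly over in-memory rows rather than as a pull-based operator, fits in a few lines (illustrative only; the real `PhysicalNestedLoopJoin` interleaves this logic with `next()` calls and rewinds its inner child via `init()`):

```rust
// Simplified rows: users are (id, city_id, name), cities are (id, city_name).
fn nested_loop_join<'a>(
    users: &'a [(u64, u64, &'a str)],
    cities: &'a [(u64, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    let mut out = Vec::new();
    for user in users {            // outer (left) child
        for city in cities {       // inner (right) child, re-scanned per outer row
            if user.1 == city.0 {  // join condition: users.city_id = cities.id
                out.push((user.2, city.1));
            }
        }
    }
    out
}

fn main() {
    let users = [(1, 10, "Alice"), (2, 20, "Bob")];
    let cities = [(10, "Lisbon"), (20, "Osaka")];
    assert_eq!(
        nested_loop_join(&users, &cities),
        vec![("Alice", "Lisbon"), ("Bob", "Osaka")]
    );
}
```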
4. Putting It All Together
Consider the query `SELECT name FROM users WHERE age > 30`. The physical plan is `Projection -> Filter -> SeqScan`.
1. The `ExecutionEngine` calls `next()` on the `Projection` operator.
2. The `Projection` operator needs a tuple, so it calls `next()` on its child, `Filter`.
3. The `Filter` operator needs a tuple, so it calls `next()` on its child, `SeqScan`.
4. The `SeqScan` operator fetches a raw tuple from the `TableHeap`, checks its MVCC visibility, and finds a visible tuple for a user with `age = 25`.
5. `SeqScan` returns this tuple up to `Filter`.
6. `Filter` evaluates `age > 30` on the tuple. It's false, so it loops, calling `SeqScan.next()` again.
7. `SeqScan` finds another visible tuple, this time for a user with `age = 40` and `name = 'Alice'`.
8. `SeqScan` returns this tuple up to `Filter`.
9. `Filter` evaluates `age > 30`. It's true! It returns the tuple for Alice up to `Projection`.
10. `Projection` takes the full tuple for Alice, creates a new tuple containing only the `name` column (`'Alice'`), and returns this new tuple as the result.
This process repeats, with tuples flowing up the tree one at a time, until the `SeqScan` operator runs out of pages and returns `None`, which then propagates up the tree, signaling the end of execution.
For Study & Discussion
- Push vs. Pull Models: The Volcano model is a "pull-based" model. An alternative is a "push-based" model, where operators push their results to their parents as soon as they are ready. What are the potential advantages and disadvantages of each model, particularly concerning cache efficiency and control flow?
- Blocking vs. Non-Blocking Operators: Some operators, like `PhysicalFilter`, can produce their first output row as soon as they receive their first input row. These are non-blocking. Other operators, like `PhysicalSort`, must consume their entire input before they can produce even a single row of output. These are blocking. What is the impact of blocking operators on query latency and memory usage?
- Programming Exercise: The current `PhysicalNestedLoopJoin` is simple but can be inefficient, as it re-scans the entire inner table for every outer row. Implement a `PhysicalBlockNestedLoopJoin` operator. This version would read a block (a small batch) of tuples from the outer table into an in-memory buffer, and then iterate through the inner table once for that entire block. This can significantly reduce the number of times the inner table needs to be scanned.
- Programming Exercise: Implement the `PhysicalLimit` operator. Its `next()` method should:
  a. Keep an internal counter.
  b. If the counter is less than the `offset`, pull and discard tuples from its child.
  c. If the counter is between `offset` and `offset + limit`, pull a tuple from its child and return it.
  d. Once the limit is reached, it should stop pulling from its child and return `None` for all subsequent calls. This is important for efficiency, as it stops the execution of the entire sub-tree below it.