QuillSQL Architecture
This chapter gives a tour of the major subsystems, the flow of a SQL statement, and the contract between execution, transactions, and storage. It uses Mermaid diagrams so you can render the visuals directly inside mdBook or any compatible viewer.
1. End-to-End Pipeline
flowchart LR
subgraph Frontend
CLI["bin/client"] --> Parser["sql::parser"]
HTTP["bin/server"] --> Parser
end
Parser --> LPlan["plan::LogicalPlanner"]
LPlan --> Optimizer["optimizer::LogicalOptimizer"]
Optimizer --> PPlan["plan::PhysicalPlanner"]
PPlan --> Exec["execution::ExecutionEngine (Volcano)"]
subgraph Txn["transaction::*"]
Session["session::SessionContext"]
TM["TransactionManager"]
LM["LockManager"]
Session --> TM --> LM
end
Exec <-->|snapshot, locks| Txn
Exec --> Binding
subgraph Storage["storage::* & buffer::*"]
Binding["storage::engine::TableBinding"]
Heap["storage::table_heap::TableHeap"]
MVCC["storage::heap::MvccHeap"]
Index["storage::index::BPlusTree"]
Buffer["buffer::BufferManager"]
Disk["storage::disk_scheduler (io_uring)"]
WAL["recovery::WalManager"]
Binding --> Heap
Binding --> Index
Heap <--> Buffer
Index <--> Buffer
Buffer <--> Disk
WAL --> Disk
end
Background["background::workers\n(checkpoint, WAL writer, MVCC vacuum)"] --> {Buffer, WAL, TM}
Key takeaways
- The frontend (CLI/HTTP) only knows how to parse SQL and drive the planning stages; all shared state lives below.
- The
ExecutionEnginedrives a Volcano iterator. Each physical operator calls into the transaction runtime for visibility checks and locking, but touches storage exclusively through aTableBinding. This makes the executor easy to reason about in a classroom setting. - Buffering, WAL, and disk I/O are centralized so that durability/ordering guarantees stay in one module.
2. Transactions, MVCC, and the Executor
sequenceDiagram
autonumber
participant Session
participant TM as TransactionManager
participant Runtime as TxnRuntime
participant Exec
participant Binding
participant Heap as TableHeap
Session->>TM: begin_txn(isolation, access_mode)
TM-->>Session: Transaction handle
Exec->>Runtime: TxnRuntime::new(&TM, &mut txn)
Runtime-->>Exec: {snapshot, command_id}
loop per tuple
Exec->>Binding: scan.next()
Binding->>Heap: TableIterator::next()
Heap-->>Binding: (rid, meta, tuple)
Binding-->>Exec: (rid, meta, tuple)
Exec->>Runtime: is_visible(meta)?
Runtime-->>Exec: yes/no
alt visible & needs lock
Exec->>Runtime: lock_row(table, rid, mode)
Runtime-->>Exec: guard
end
alt mutation
Exec->>Binding: prepare_row_for_write(...)
Binding->>Heap: mvcc.full_tuple(...)
Heap-->>Binding: (current_meta, tuple)
Binding->>TM: push_undo + append WAL(HeapRecord)
end
end
Session->>TM: commit(txn)
TM->>WAL: Transaction{Commit}
TM->>TM: flush_until(lsn) if synchronous
TM-->>Session: release locks, clear snapshot
Snapshot policy
Read Uncommitted / Read Committed: every command obtains a fresh snapshot, so repeated statements can see committed updates.Repeatable Read / Serializable: cache the first snapshot on theTransactionhandle; subsequent commands reuse it for consistent reads. The lock manager releases S-locks at the end of each command for RR, but Serializable keeps them until commit.
Undo & WAL
- Each mutation produces a logical
HeapRecordPayloadorIndexRecordPayload. The heap payload already carries redo (new bytes) and undo (old bytes), so recovery can replay forward or backward without re-reading heap pages. - On abort, the manager walks the undo stack, emits CLRs, and re-applies the inverse heap/index operations.
3. Storage Layering
graph TD
subgraph Planner View
Catalog
end
Catalog -->|Arc<TableHeap>| TableBinding
TableBinding --> Heap["TableHeap"]
TableBinding --> Mvcc["MvccHeap"]
TableBinding --> Index["BPlusTreeIndex"]
Heap -->|pages| BufferMgr
Index -->|pages| BufferMgr
BufferMgr --> DiskMgr
BufferMgr --> WalMgr
| Layer | Responsibility | Notes |
|---|---|---|
TableHeap | Physical slotted pages, tuple encoding, WAL page image helpers | Exposes insert/update/delete that emit heap-specific WAL payloads before dirtying frames. |
MvccHeap | Version chain management, delete-marking, undo metadata | Creates new versions, rewires forward/back pointers, and delegates actual I/O to TableHeap. |
TableBinding | Safe façade for the executor | Provides scan, index_scan, insert, delete, update, and prepare_row_for_write, always pairing heap/index changes so operators stay small. |
BufferManager + DiskScheduler | Unified cache + async I/O | Uses LRU-K (+ optional TinyLFU admission) and io_uring to keep the hot set resident. |
WalManager | ARIES-compatible log | Supports logical heap/index records, page write/delta fallback, CLRs, checkpoints, and background flush threads. |
4. TableBinding Contract
classDiagram
class TableBinding {
+scan() -> TupleStream
+index_scan(name, IndexScanRequest) -> TupleStream
+insert(&mut TxnContext, &Tuple)
+delete(&mut TxnContext, RecordId, TupleMeta, Tuple)
+update(&mut TxnContext, RecordId, Tuple, TupleMeta, Tuple) -> RecordId
+prepare_row_for_write(&mut TxnContext, RecordId, &TupleMeta)
+table_heap() -> Arc<TableHeap>
}
class PhysicalOperator {
+init(&mut ExecutionContext)
+next(&mut ExecutionContext) -> Option<Tuple>
}
PhysicalOperator --> TableBinding : uses
This binding hides every MVCC/WAL detail from the operators:
- No more ad-hoc
catalog.table_indexes()calls. - No direct references to
MvccHeaporTableHeap. - Executor code reads like pseudo SQL: “lock table”, “scan binding”, “update tuple”.
5. WAL & Recovery Details
flowchart LR
subgraph WAL Record Types
HI["HeapInsert"] --> redo
HD["HeapDelete (with before-image)"] --> redo & undo
LI["IndexLeaf{Insert,Delete}"]
IS["IndexSplit/Merge/Redistribute/Parent*"]
IR["IndexRoot Install/Adopt/Reset"]
PI["PageWrite (FPW)"]
CI["Checkpoint"]
CL["CLR"]
end
Exec -->|Heap/Index payloads| WalMgr
BufferMgr -->|first-touch FPW| WalMgr
WalMgr --> DiskScheduler --> log files
Recovery -->|analysis/redo/undo| BufferMgr
- Logical logging first: heap/index mutations emit redo + undo at the logical level. This keeps the WAL stream compact and human-readable for teaching.
- Physical fallbacks: metadata-heavy pages (meta, freelist, header) still leverage PageWrite FPWs to guarantee a consistent base image, especially on the first page flush after a crash.
- Restart:
RecoveryManagerperforms the classical ARIES sequence. It uses thedirty_page_tableandactive_txn_tablecaptured by checkpoints to limit redo and undo work.
6. Background Workers
| Worker | Purpose | Configuration |
|---|---|---|
| WAL writer | Periodically flushes WAL buffers, coalesces adjacent writes | WalOptions::writer_interval_ms, buffer_capacity |
| Checkpoint | Records ATT + DPT, gently pushes dirty buffers | WalOptions::checkpoint_interval_ms |
| Buffer writer | Flushes frames when the dirty set grows too large | background::BufferWriterConfig |
| MVCC vacuum | Reclaims obsolete tuple versions based on safe_xmin | MvccVacuumConfig |
| Index vacuum | Cleans up deleted index entries using B-link traversal | IndexVacuumConfig |
Workers register with background::BackgroundWorkers, so the database can stop and join them cleanly on shutdown. Each worker publishes metrics (intervals, batches processed) for observability.
7. Example: Repeatable Read UPDATE
sequenceDiagram
participant T1 as Txn 1 (RR)
participant T2 as Txn 2 (RC)
participant Heap as TableHeap
T1->>Heap: SELECT (snapshot S0)
Heap-->>T1: version V0
T2->>Heap: UPDATE -> insert V1, delete V0
Heap-->>T2: ack
T2->>T2: COMMIT (WAL + locks release)
T1->>Heap: SELECT again
Heap-->>T1: V0 still visible via snapshot S0
T1->>T1: COMMIT
Note over T1,Heap: Vacuum later removes V0 when safe_xmin advances
This timeline demonstrates:
- Snapshots shield Repeatable Read statements from concurrent writers even if row-level locks are released early.
- The MVCC chain (
V1.prev_version = V0,V0.next_version = V1) lets future readers reach the newest committed version while the vacuum worker reclaims obsolete ones lazily.
8. Observability & Configuration Cheat Sheet
- Logging:
RUST_LOG=tracesurfaces lock acquisitions, MVCC vacuums, and io_uring completions. - Runtime knobs:
WalOptions(segment size, sync policy),BufferPoolConfig(capacity, TinyLFU toggle),MvccVacuumConfig, andIndexVacuumConfigcan all be adjusted viaDatabaseOptions. - Metrics:
background::BackgroundWorkers::snapshot()reports worker health; WAL exposes current LSN and flush LSN; the buffer manager can dump the dirty page table for diagnostics.
With these layers in place, QuillSQL remains faithful to production-grade engines (MVCC + WAL + buffer pool) while keeping its code and documentation approachable for coursework and research prototypes.