# QuillSQL Architecture

This chapter gives a tour of the major subsystems, the flow of a SQL statement, and the contract between execution, transactions, and storage. It uses Mermaid diagrams so you can render the visuals directly inside mdBook or any compatible viewer.


## 1. End-to-End Pipeline

```mermaid
flowchart LR
    subgraph Frontend
        CLI["bin/client"] --> Parser["sql::parser"]
        HTTP["bin/server"] --> Parser
    end
    Parser --> LPlan["plan::LogicalPlanner"]
    LPlan --> Optimizer["optimizer::LogicalOptimizer"]
    Optimizer --> PPlan["plan::PhysicalPlanner"]
    PPlan --> Exec["execution::ExecutionEngine (Volcano)"]

    subgraph Txn["transaction::*"]
        Session["session::SessionContext"]
        TM["TransactionManager"]
        LM["LockManager"]
        Session --> TM --> LM
    end

    Exec <-->|snapshot, locks| Txn
    Exec --> Binding

    subgraph Storage["storage::* & buffer::*"]
        Binding["storage::engine::TableBinding"]
        Heap["storage::table_heap::TableHeap"]
        MVCC["storage::heap::MvccHeap"]
        Index["storage::index::BPlusTree"]
        Buffer["buffer::BufferManager"]
        Disk["storage::disk_scheduler (io_uring)"]
        WAL["recovery::WalManager"]
        Binding --> Heap
        Binding --> Index
        Heap <--> Buffer
        Index <--> Buffer
        Buffer <--> Disk
        WAL --> Disk
    end

    Background["background::workers<br/>(checkpoint, WAL writer, MVCC vacuum)"] --> Buffer & WAL & TM
```

**Key takeaways**

- The frontend (CLI/HTTP) only knows how to parse SQL and drive the planning stages; all shared state lives below.
- The `ExecutionEngine` drives a Volcano iterator. Each physical operator calls into the transaction runtime for visibility checks and locking, but touches storage exclusively through a `TableBinding`, which keeps the executor easy to reason about in a classroom setting (see the pull-loop sketch after this list).
- Buffering, WAL, and disk I/O are centralized so that durability and ordering guarantees stay in one module.
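
To make the pull model concrete, here is a minimal, self-contained sketch of the Volcano loop. `Tuple`, `ExecutionContext`, and `MockScan` are simplified stand-ins, not QuillSQL's real types; only the `init`/`next` shape mirrors the engine.

```rust
// A minimal sketch of the Volcano pull loop. `Tuple`, `ExecutionContext`,
// and `MockScan` are simplified stand-ins, not QuillSQL's real types.
struct Tuple(Vec<i64>);
struct ExecutionContext; // in QuillSQL this carries the txn runtime and bindings

trait PhysicalOperator {
    fn init(&mut self, ctx: &mut ExecutionContext);
    fn next(&mut self, ctx: &mut ExecutionContext) -> Option<Tuple>;
}

// A toy scan over an in-memory vector, standing in for a table scan.
struct MockScan {
    rows: Vec<Tuple>,
    pos: usize,
}

impl PhysicalOperator for MockScan {
    fn init(&mut self, _ctx: &mut ExecutionContext) {
        self.pos = 0;
    }
    fn next(&mut self, _ctx: &mut ExecutionContext) -> Option<Tuple> {
        let row = self.rows.get(self.pos).map(|t| Tuple(t.0.clone()));
        self.pos += 1;
        row
    }
}

// The engine drives the plan root: init once, then pull until exhaustion.
fn drive(root: &mut dyn PhysicalOperator, ctx: &mut ExecutionContext) -> usize {
    root.init(ctx);
    let mut count = 0;
    while let Some(_tuple) = root.next(ctx) {
        count += 1; // a real engine would pipe tuples back to the client here
    }
    count
}

fn main() {
    let mut scan = MockScan { rows: vec![Tuple(vec![1]), Tuple(vec![2])], pos: 0 };
    let mut ctx = ExecutionContext;
    assert_eq!(drive(&mut scan, &mut ctx), 2);
}
```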

## 2. Transactions, MVCC, and the Executor

```mermaid
sequenceDiagram
    autonumber
    participant Session
    participant TM as TransactionManager
    participant Runtime as TxnRuntime
    participant Exec
    participant Binding
    participant Heap as TableHeap

    Session->>TM: begin_txn(isolation, access_mode)
    TM-->>Session: Transaction handle
    Exec->>Runtime: TxnRuntime::new(&TM, &mut txn)
    Runtime-->>Exec: {snapshot, command_id}

    loop per tuple
        Exec->>Binding: scan.next()
        Binding->>Heap: TableIterator::next()
        Heap-->>Binding: (rid, meta, tuple)
        Binding-->>Exec: (rid, meta, tuple)
        Exec->>Runtime: is_visible(meta)?
        Runtime-->>Exec: yes/no
        alt visible & needs lock
            Exec->>Runtime: lock_row(table, rid, mode)
            Runtime-->>Exec: guard
        end
        alt mutation
            Exec->>Binding: prepare_row_for_write(...)
            Binding->>Heap: mvcc.full_tuple(...)
            Heap-->>Binding: (current_meta, tuple)
            Binding->>TM: push_undo + append WAL(HeapRecord)
        end
    end

    Session->>TM: commit(txn)
    TM->>WAL: Transaction{Commit}
    TM->>TM: flush_until(lsn) if synchronous
    TM-->>Session: release locks, clear snapshot
```

**Snapshot policy**

- Read Uncommitted / Read Committed: every command obtains a fresh snapshot, so repeated statements can observe newly committed updates.
- Repeatable Read / Serializable: the first snapshot is cached on the `Transaction` handle, and subsequent commands reuse it for consistent reads (sketched below). The lock manager releases S-locks at the end of each command under RR, while Serializable holds them until commit.
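
A compact sketch of that policy, assuming illustrative `Snapshot`/`Transaction` shapes (QuillSQL's actual handle differs in detail):

```rust
// Illustrative snapshot policy; the type and field names here are
// stand-ins, not QuillSQL's actual transaction API.
#[derive(Clone)]
struct Snapshot {
    xmin: u64,        // oldest txn still active when the snapshot was taken
    xmax: u64,        // first txn id that is invisible to this snapshot
    active: Vec<u64>, // txns in flight at snapshot time
}

enum IsolationLevel {
    ReadUncommitted,
    ReadCommitted,
    RepeatableRead,
    Serializable,
}

struct Transaction {
    isolation: IsolationLevel,
    cached_snapshot: Option<Snapshot>, // pinned on first command for RR/Serializable
}

fn snapshot_for_command(txn: &mut Transaction, take_fresh: impl Fn() -> Snapshot) -> Snapshot {
    match txn.isolation {
        // Each statement gets a fresh snapshot, so it can observe updates
        // committed between statements.
        IsolationLevel::ReadUncommitted | IsolationLevel::ReadCommitted => take_fresh(),
        // The first statement pins the snapshot on the handle; later
        // statements reuse it for consistent reads.
        IsolationLevel::RepeatableRead | IsolationLevel::Serializable => {
            txn.cached_snapshot.get_or_insert_with(take_fresh).clone()
        }
    }
}
```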

**Undo & WAL**

- Each mutation produces a logical `HeapRecordPayload` or `IndexRecordPayload`. The heap payload already carries both redo (new bytes) and undo (old bytes), so recovery can replay forward or roll back without re-reading heap pages.
- On abort, the manager walks the undo stack, emits CLRs, and re-applies the inverse heap/index operations, as sketched below.
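
The abort path can be sketched as a walk over hypothetical undo records; the names below are stand-ins, but the order (newest-to-oldest, CLR logged alongside each inverse) follows the description above:

```rust
// Illustrative abort path: undo records carry the before-image, CLRs are
// logged, and inverse operations apply in reverse order. All names here
// are hypothetical stand-ins for QuillSQL internals.
enum UndoRecord {
    HeapInsert { rid: u64 },                  // inverse: mark the tuple deleted
    HeapDelete { rid: u64, before: Vec<u8> }, // inverse: restore the old bytes
}

fn abort(undo_stack: &mut Vec<UndoRecord>, wal: &mut Vec<String>) {
    // Walk newest-to-oldest so inverses apply in the reverse of the original order.
    while let Some(rec) = undo_stack.pop() {
        match rec {
            UndoRecord::HeapInsert { rid } => {
                wal.push(format!("CLR: delete rid {rid}"));
                // ... re-mark the heap slot as deleted ...
            }
            UndoRecord::HeapDelete { rid, before } => {
                wal.push(format!("CLR: restore rid {rid} ({} bytes)", before.len()));
                // ... write the before-image back into the heap ...
            }
        }
    }
}
```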

## 3. Storage Layering

```mermaid
graph TD
    subgraph "Planner View"
        Catalog
    end
    Catalog -->|"Arc<TableHeap>"| TableBinding
    TableBinding --> Heap["TableHeap"]
    TableBinding --> Mvcc["MvccHeap"]
    TableBinding --> Index["BPlusTreeIndex"]
    Heap -->|pages| BufferMgr
    Index -->|pages| BufferMgr
    BufferMgr --> DiskMgr
    BufferMgr --> WalMgr
```

| Layer | Responsibility | Notes |
|-------|----------------|-------|
| `TableHeap` | Physical slotted pages, tuple encoding, WAL page-image helpers | Exposes insert/update/delete that emit heap-specific WAL payloads before dirtying frames. |
| `MvccHeap` | Version chain management, delete-marking, undo metadata | Creates new versions, rewires forward/back pointers, and delegates actual I/O to `TableHeap`. |
| `TableBinding` | Safe façade for the executor | Provides `scan`, `index_scan`, `insert`, `delete`, `update`, and `prepare_row_for_write`, always pairing heap/index changes so operators stay small. |
| `BufferManager` + `DiskScheduler` | Unified cache + async I/O | Uses LRU-K (with optional TinyLFU admission) and io_uring to keep the hot set resident. |
| `WalManager` | ARIES-compatible log | Supports logical heap/index records, page write/delta fallback, CLRs, checkpoints, and background flush threads. |
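
One consequence of this layering is the write-ahead ordering: the heap logs its payload before the frame is dirtied. A small illustrative sketch (the types are placeholders, not the real `BufferManager` frame):

```rust
// Write-ahead ordering: emit the WAL record first, then dirty the frame.
struct Lsn(u64);

struct Frame {
    data: Vec<u8>,
    page_lsn: Lsn, // LSN of the last record that touched this page
    dirty: bool,
}

fn apply_heap_change(frame: &mut Frame, wal_append: impl Fn(&[u8]) -> Lsn, payload: &[u8]) {
    // 1. Append the logical WAL record and remember its LSN.
    let lsn = wal_append(payload);
    // 2. Only then mutate the cached page and mark it dirty. The buffer
    //    manager must not flush this frame until WAL is durable up to `lsn`.
    frame.data.extend_from_slice(payload); // stand-in for the real tuple edit
    frame.page_lsn = lsn;
    frame.dirty = true;
}
```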

## 4. TableBinding Contract

```mermaid
classDiagram
    class TableBinding {
        +scan() -> TupleStream
        +index_scan(name, IndexScanRequest) -> TupleStream
        +insert(&mut TxnContext, &Tuple)
        +delete(&mut TxnContext, RecordId, TupleMeta, Tuple)
        +update(&mut TxnContext, RecordId, Tuple, TupleMeta, Tuple) -> RecordId
        +prepare_row_for_write(&mut TxnContext, RecordId, &TupleMeta)
        +table_heap() -> Arc~TableHeap~
    }

    class PhysicalOperator {
        +init(&mut ExecutionContext)
        +next(&mut ExecutionContext) -> Option~Tuple~
    }

    PhysicalOperator --> TableBinding : uses
```

This binding hides every MVCC/WAL detail from the operators:

- No more ad-hoc `catalog.table_indexes()` calls.
- No direct references to `MvccHeap` or `TableHeap`.
- Executor code reads like pseudo-SQL: "lock table", "scan binding", "update tuple" (see the sketch below).
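
For example, a whole-table DELETE operator reduces to a few lines once the binding owns the MVCC/WAL mechanics. The trait below is a simplified rendering of the class diagram, not the exact QuillSQL signatures:

```rust
// Simplified binding interface, condensed from the class diagram above.
struct RecordId(u64);
struct TupleMeta;
struct Tuple(Vec<u8>);
struct TxnContext;

trait TableBinding {
    fn scan(&self) -> Vec<(RecordId, TupleMeta, Tuple)>;
    fn delete(&self, txn: &mut TxnContext, rid: RecordId, meta: TupleMeta, tuple: Tuple);
}

// DELETE without a predicate: scan, then delete each returned row.
// Visibility checks, index maintenance, and WAL all live inside the binding.
fn delete_all(binding: &dyn TableBinding, txn: &mut TxnContext) -> usize {
    let mut deleted = 0;
    for (rid, meta, tuple) in binding.scan() {
        binding.delete(txn, rid, meta, tuple);
        deleted += 1;
    }
    deleted
}
```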

## 5. WAL & Recovery Details

```mermaid
flowchart LR
    subgraph "WAL Record Types"
        HI["HeapInsert"] --> redo
        HD["HeapDelete (with before-image)"] --> redo & undo
        LI["IndexLeaf{Insert,Delete}"]
        IS["IndexSplit/Merge/Redistribute/Parent*"]
        IR["IndexRoot Install/Adopt/Reset"]
        PI["PageWrite (FPW)"]
        CI["Checkpoint"]
        CL["CLR"]
    end
    Exec -->|Heap/Index payloads| WalMgr
    BufferMgr -->|first-touch FPW| WalMgr
    WalMgr --> DiskScheduler --> Log["log files"]
    Recovery -->|analysis/redo/undo| BufferMgr
```

- Logical logging first: heap/index mutations emit redo + undo at the logical level, which keeps the WAL stream compact and human-readable for teaching.
- Physical fallbacks: metadata-heavy pages (meta, freelist, header) still rely on `PageWrite` full-page writes to guarantee a consistent base image, especially on the first page flush after a crash.
- Restart: `RecoveryManager` performs the classical ARIES sequence, using the `dirty_page_table` and `active_txn_table` captured by checkpoints to bound the redo and undo work (schematic below).
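
A schematic of that restart sequence, with placeholder types and elided bodies (the real `RecoveryManager` is richer):

```rust
// Schematic ARIES restart: analysis, redo, undo. Types are placeholders.
struct DirtyPageTable; // page id -> recLSN, captured by the last checkpoint
struct ActiveTxnTable; // txn id -> lastLSN for txns live at the crash

fn restart() {
    // 1. Analysis: scan forward from the last checkpoint to rebuild the
    //    dirty page table and the active (loser) transaction table.
    let (dpt, att) = analysis_from_checkpoint();
    // 2. Redo: repeat history from the minimum recLSN so every logged
    //    effect, even those of losers, is back in the buffer pool.
    redo(&dpt);
    // 3. Undo: roll back losers from the ATT, emitting CLRs so a crash
    //    during recovery never undoes the same record twice.
    undo(&att);
}

fn analysis_from_checkpoint() -> (DirtyPageTable, ActiveTxnTable) {
    (DirtyPageTable, ActiveTxnTable)
}
fn redo(_dpt: &DirtyPageTable) { /* replay records where pageLSN < recordLSN */ }
fn undo(_att: &ActiveTxnTable) { /* walk prevLSN chains of loser txns */ }
```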

## 6. Background Workers

| Worker | Purpose | Configuration |
|--------|---------|---------------|
| WAL writer | Periodically flushes WAL buffers, coalescing adjacent writes | `WalOptions::writer_interval_ms`, `buffer_capacity` |
| Checkpoint | Records the ATT + DPT and gently pushes dirty buffers | `WalOptions::checkpoint_interval_ms` |
| Buffer writer | Flushes frames when the dirty set grows too large | `background::BufferWriterConfig` |
| MVCC vacuum | Reclaims obsolete tuple versions based on `safe_xmin` | `MvccVacuumConfig` |
| Index vacuum | Cleans up deleted index entries using B-link traversal | `IndexVacuumConfig` |

Workers register with `background::BackgroundWorkers`, so the database can stop and join them cleanly on shutdown; the sketch below shows the pattern with plain threads. Each worker publishes metrics (intervals, batches processed) for observability.
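
A minimal sketch of the register/stop-and-join pattern using std threads. QuillSQL's `background::BackgroundWorkers` differs in detail (named workers, metrics), but the shutdown discipline is the same idea:

```rust
// Register periodic workers, then signal + join them on shutdown.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

struct BackgroundWorkers {
    stop: Arc<AtomicBool>,
    handles: Vec<JoinHandle<()>>,
}

impl BackgroundWorkers {
    fn new() -> Self {
        Self { stop: Arc::new(AtomicBool::new(false)), handles: Vec::new() }
    }

    /// Spawn a periodic worker that re-checks the shared stop flag each tick.
    fn register<F: FnMut() + Send + 'static>(&mut self, interval: Duration, mut tick: F) {
        let stop = Arc::clone(&self.stop);
        self.handles.push(thread::spawn(move || {
            while !stop.load(Ordering::Relaxed) {
                tick();
                thread::sleep(interval);
            }
        }));
    }

    /// Signal every worker, then join them so shutdown is deterministic.
    fn shutdown(self) {
        self.stop.store(true, Ordering::Relaxed);
        for handle in self.handles {
            let _ = handle.join();
        }
    }
}

fn main() {
    let mut workers = BackgroundWorkers::new();
    workers.register(Duration::from_millis(50), || println!("checkpoint tick"));
    thread::sleep(Duration::from_millis(120));
    workers.shutdown(); // stop + join: no worker outlives the database
}
```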


## 7. Example: Repeatable Read UPDATE

```mermaid
sequenceDiagram
    participant T1 as Txn 1 (RR)
    participant T2 as Txn 2 (RC)
    participant Heap as TableHeap

    T1->>Heap: SELECT (snapshot S0)
    Heap-->>T1: version V0
    T2->>Heap: UPDATE -> insert V1, delete V0
    Heap-->>T2: ack
    T2->>T2: COMMIT (WAL + locks release)
    T1->>Heap: SELECT again
    Heap-->>T1: V0 still visible via snapshot S0
    T1->>T1: COMMIT
    Note over T1,Heap: Vacuum later removes V0 when safe_xmin advances
```

This timeline demonstrates:

- Snapshots shield Repeatable Read statements from concurrent writers even when row-level locks are released early.
- The MVCC chain (`V1.prev_version = V0`, `V0.next_version = V1`) lets future readers reach the newest committed version while the vacuum worker lazily reclaims obsolete ones (see the visibility sketch below).
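
The visibility rule behind this timeline can be sketched as follows; the field names are illustrative rather than QuillSQL's real `TupleMeta` layout:

```rust
// Snapshot visibility: a tuple version is visible if its inserter is
// "in the past" of the snapshot and its deleter (if any) is not.
struct Snapshot {
    xmax: u64,        // first txn id started after the snapshot
    active: Vec<u64>, // txns still running at snapshot time
}

struct TupleMeta {
    insert_txn: u64,
    delete_txn: Option<u64>,
}

fn committed_before(s: &Snapshot, txn_id: u64) -> bool {
    // The txn began before the snapshot's horizon and was not still in
    // flight at snapshot time, so any commit it made precedes the snapshot.
    txn_id < s.xmax && !s.active.contains(&txn_id)
}

// V0 stays visible to Txn 1's snapshot S0 even after Txn 2 commits the
// delete, because Txn 2 was still active (or unstarted) when S0 was taken.
fn is_visible(s: &Snapshot, meta: &TupleMeta) -> bool {
    committed_before(s, meta.insert_txn)
        && meta.delete_txn.map_or(true, |d| !committed_before(s, d))
}
```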

## 8. Observability & Configuration Cheat Sheet

- Logging: `RUST_LOG=trace` surfaces lock acquisitions, MVCC vacuums, and io_uring completions.
- Runtime knobs: `WalOptions` (segment size, sync policy), `BufferPoolConfig` (capacity, TinyLFU toggle), `MvccVacuumConfig`, and `IndexVacuumConfig` can all be adjusted via `DatabaseOptions` (illustrated below).
- Metrics: `background::BackgroundWorkers::snapshot()` reports worker health; the WAL exposes its current LSN and flush LSN; and the buffer manager can dump its dirty page table for diagnostics.
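
To make the cheat sheet concrete, here is an illustrative shape for those knobs. The struct and field names follow the options named above but may not match `DatabaseOptions` exactly:

```rust
// Illustrative configuration shape; field names follow the knobs mentioned
// in this chapter and are not guaranteed to match the real structs.
struct WalOptions {
    segment_size: usize,      // bytes per WAL segment file
    synchronous_commit: bool, // flush_until(lsn) on commit vs. background flush
    writer_interval_ms: u64,
    checkpoint_interval_ms: u64,
}

struct BufferPoolConfig {
    capacity: usize, // frames in the pool
    tiny_lfu: bool,  // optional admission filter in front of LRU-K
}

struct DatabaseOptions {
    wal: WalOptions,
    buffer: BufferPoolConfig,
    // MvccVacuumConfig and IndexVacuumConfig would slot in here too.
}

fn main() {
    // Example tuning for a write-heavy workload: larger segments, async commit.
    let opts = DatabaseOptions {
        wal: WalOptions {
            segment_size: 64 * 1024 * 1024,
            synchronous_commit: false,
            writer_interval_ms: 10,
            checkpoint_interval_ms: 30_000,
        },
        buffer: BufferPoolConfig { capacity: 4096, tiny_lfu: true },
    };
    let _ = opts;
}
```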

With these layers in place, QuillSQL remains faithful to production-grade engines (MVCC + WAL + buffer pool) while keeping its code and documentation approachable for coursework and research prototypes.