Request Path: foreground components enqueue DiskRequest objects via DiskScheduler::{schedule_read, schedule_write, …}. A dispatcher thread drains the global channel and distributes work round-robin to N io_uring workers. Each worker owns its own ring and file-descriptor cache, so once a request is forwarded, execution proceeds entirely off the foreground thread.
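In sketch form, the dispatcher is a plain forwarding loop over per-worker channels; the channel types and generic request parameter here are illustrative, not the actual signatures:

```rust
use std::sync::mpsc::{Receiver, Sender};

// Round-robin dispatcher: drain the global channel and forward each request
// to the next worker's private channel. Runs until the global channel closes.
fn dispatch_loop<R>(global_rx: Receiver<R>, workers: Vec<Sender<R>>) {
    let mut next = 0;
    while let Ok(req) = global_rx.recv() {
        let _ = workers[next].send(req);
        next = (next + 1) % workers.len();
    }
}
```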
Stable APIs: schedule_read(page_id), schedule_write(page_id, Bytes), schedule_read_pages(Vec<PageId>), schedule_allocate(), schedule_deallocate(page_id) — every call returns a channel the caller can block on or poll.
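The request/reply shape behind every schedule_* call, reduced to a self-contained std::sync::mpsc sketch (the real request, result, and error types differ):

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

// Each schedule_* call follows this pattern: create a channel, hand the sender
// to the I/O side, and return the receiver so the caller can block or poll.
fn schedule_demo(payload: Vec<u8>) -> Receiver<std::io::Result<Vec<u8>>> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        // Stand-in for an io_uring worker completing the request.
        let _ = tx.send(Ok(payload));
    });
    rx
}

fn main() {
    let rx = schedule_demo(vec![0u8; 8]);
    // Caller may block on the result...
    let result = rx.recv().unwrap();
    assert!(result.is_ok());
    // ...or poll with rx.try_recv() in a loop instead.
}
```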
Batch Reads: ReadPages fans out per-page SQEs while a shared BatchState tracks completions. Even if the kernel completes I/O out of order, the caller receives a Vec<BytesMut> that preserves the original page order.
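A minimal sketch of that bookkeeping, assuming each completion is tagged with its index in the original request (Vec<u8> stands in for BytesMut; field names are illustrative):

```rust
// Collects out-of-order completions and releases them in submission order.
struct BatchState {
    slots: Vec<Option<Vec<u8>>>, // one slot per requested page, in request order
    remaining: usize,
}

impl BatchState {
    fn new(n: usize) -> Self {
        BatchState { slots: vec![None; n], remaining: n }
    }

    // Called once per CQE; `index` is the page's position in the original
    // request, regardless of the order the kernel completed the I/O.
    fn complete(&mut self, index: usize, data: Vec<u8>) -> Option<Vec<Vec<u8>>> {
        if self.slots[index].replace(data).is_none() {
            self.remaining -= 1;
        }
        if self.remaining == 0 {
            // All pages arrived: hand back results in the original order.
            Some(self.slots.drain(..).map(|s| s.unwrap()).collect())
        } else {
            None
        }
    }
}
```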
Dedicated WAL runtime threads handle sequential WAL appends/reads using buffered I/O. They now keep a per-thread cache of open segment files, eliminating repeated open()/close() on every log record.
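A minimal sketch of such a per-thread handle cache, assuming segments are addressed by a numeric id (the directory layout and file naming here are made up):

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io;
use std::path::PathBuf;

// Lives on one WAL runtime thread, so no locking is needed.
struct SegmentCache {
    dir: PathBuf,
    open: HashMap<u64, File>,
}

impl SegmentCache {
    fn new(dir: PathBuf) -> Self {
        SegmentCache { dir, open: HashMap::new() }
    }

    // Returns a cached handle, opening (and creating) the segment only on first use.
    fn get(&mut self, segment_id: u64) -> io::Result<&mut File> {
        if !self.open.contains_key(&segment_id) {
            let path = self.dir.join(format!("{segment_id:016}.wal"));
            let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
            self.open.insert(segment_id, file);
        }
        Ok(self.open.get_mut(&segment_id).unwrap())
    }
}
```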
Worker count defaults to max(1, available_parallelism / 2) but is tunable through IOSchedulerConfig.
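The default, expressed as a sketch on top of std::thread::available_parallelism (the config plumbing is omitted):

```rust
use std::thread::available_parallelism;

// max(1, available_parallelism / 2); falls back to 1 if the count is unknown.
fn default_worker_count() -> usize {
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);
    (cores / 2).max(1)
}
```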
An optional sync flag on a request triggers sync_data / fdatasync so WalManager can honour synchronous commit or checkpoint barriers. Data pages stay on the io_uring data plane; the WAL always uses buffered writes.
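A sketch of the buffered WAL path honouring that flag; the function and parameter names are illustrative:

```rust
use std::fs::File;
use std::io::{self, Write};

// Buffered WAL append: write the record, then durably flush only when the
// caller asked for it (synchronous commit / checkpoint barrier).
fn wal_append(file: &mut File, record: &[u8], sync: bool) -> io::Result<()> {
    file.write_all(record)?;
    if sync {
        // sync_data maps to fdatasync(2) on Linux: flush file data without
        // forcing a metadata-only update such as timestamps.
        file.sync_data()?;
    }
    Ok(())
}
```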
Each worker owns an IoUring with configurable queue_depth, optional SQPOLL idle timeout, and a pool of registered fixed buffers sized to PAGE_SIZE. Workers submit SQEs asynchronously and drain CQEs in small batches to keep the ring warm.
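A sketch of the per-worker ring setup using the io_uring and libc crates; the queue depth, SQPOLL idle timeout, buffer count, and page size are placeholder values, not the project's defaults:

```rust
use io_uring::IoUring;

const PAGE_SIZE: usize = 8192;   // assumed page size
const QUEUE_DEPTH: u32 = 256;    // assumed queue depth
const FIXED_BUFFERS: usize = 64; // assumed pool size

fn build_ring() -> std::io::Result<(IoUring, Vec<Vec<u8>>)> {
    // Optional SQPOLL: a kernel thread polls the SQ and goes idle after 2s.
    let ring = IoUring::builder()
        .setup_sqpoll(2000)
        .build(QUEUE_DEPTH)?;

    // Pool of page-sized buffers registered as fixed buffers so reads/writes
    // can use the *_fixed opcodes and skip per-request buffer mapping.
    let buffers: Vec<Vec<u8>> = (0..FIXED_BUFFERS).map(|_| vec![0u8; PAGE_SIZE]).collect();
    let iovecs: Vec<libc::iovec> = buffers
        .iter()
        .map(|b| libc::iovec {
            iov_base: b.as_ptr() as *mut libc::c_void,
            iov_len: b.len(),
        })
        .collect();
    // Safety: the buffers are returned alongside the ring and must outlive it.
    unsafe { ring.submitter().register_buffers(&iovecs)? };

    Ok((ring, buffers))
}
```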
Read batching relies on BatchState instances shared via Rc<RefCell<_>> among a batch's SQEs, so multi-page callers see results in request order without the worker having to serialize submissions or wait for the kernel to complete them in order.
Writes keep their payload alive until completion; if a fixed buffer slot is available we reuse it, otherwise we fall back to heap buffers. A companion WriteState tracks an optional fdatasync so the caller still observes exactly one Result<()> once all CQEs land.
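A sketch of that write-side bookkeeping, assuming the write contributes one CQE and an optional chained fdatasync contributes another (names and types are illustrative):

```rust
use std::sync::mpsc::Sender;

// Keeps the payload alive until the kernel is done with it and guarantees the
// caller sees exactly one Result once the write and the optional sync land.
struct WriteState {
    payload: Vec<u8>,   // stand-in for Bytes; must outlive the SQE
    pending_cqes: u8,   // 1 for the write, +1 if an fdatasync was chained
    reply: Option<Sender<std::io::Result<()>>>,
    first_error: Option<std::io::Error>,
}

impl WriteState {
    fn on_cqe(&mut self, result: std::io::Result<()>) {
        if let Err(e) = result {
            self.first_error.get_or_insert(e);
        }
        self.pending_cqes -= 1;
        if self.pending_cqes == 0 {
            // Exactly one reply; success only if every CQE succeeded.
            if let Some(tx) = self.reply.take() {
                let outcome = match self.first_error.take() {
                    Some(e) => Err(e),
                    None => Ok(()),
                };
                let _ = tx.send(outcome);
            }
        }
    }
}
```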
Errors (short read/write, errno) are normalised into QuillSQLError values that flow back on the original channel.
Worker-local file descriptors plus positional I/O remove shared mutable state on the hot path. The new per-worker handle cache further reduces syscall overhead.
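Positional I/O in sketch form: the offset is explicit on every call, so workers never share a file cursor (the page size is an assumed constant):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

const PAGE_SIZE: usize = 8192; // assumed

// pread-style read: the offset is passed explicitly, so concurrent workers
// never contend on a shared seek position or lock.
fn read_page(file: &File, page_id: u64, buf: &mut [u8; PAGE_SIZE]) -> io::Result<()> {
    let offset = page_id * PAGE_SIZE as u64;
    file.read_exact_at(buf, offset)
}
```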
Shutdown sequence: enqueue Shutdown, dispatcher forwards it to every worker, each worker drains outstanding SQEs/CQEs, and finally dispatcher + workers are joined.
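The same sequence as a sketch, assuming the scheduler keeps the dispatcher and worker join handles (the struct layout is illustrative):

```rust
use std::sync::mpsc::Sender;
use std::thread::JoinHandle;

enum DiskRequest { /* read/write/... elided */ Shutdown }

struct DiskScheduler {
    global_tx: Sender<DiskRequest>,
    dispatcher: Option<JoinHandle<()>>,
    workers: Vec<JoinHandle<()>>,
}

impl Drop for DiskScheduler {
    fn drop(&mut self) {
        // 1. Ask the dispatcher to stop; it forwards Shutdown to every worker.
        let _ = self.global_tx.send(DiskRequest::Shutdown);
        // 2. Join the dispatcher once the message has been forwarded.
        if let Some(h) = self.dispatcher.take() {
            let _ = h.join();
        }
        // 3. Each worker drains its outstanding SQEs/CQEs, then exits; join them.
        for h in self.workers.drain(..) {
            let _ = h.join();
        }
    }
}
```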
BufferPool, TableHeap, and the streaming scan ring buffer still integrate via channels; inflight guards prevent duplicate page fetches.
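One way such an inflight guard can look, sketched with std channels; this is an assumption about the shape, not the actual BufferPool code:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

type PageId = u64;
type Page = Vec<u8>;

// Either you are the first requester (and must schedule the disk read),
// or you get a receiver that fires when the in-flight fetch finishes.
enum FetchTicket {
    MustFetch,
    Wait(Receiver<Page>),
}

struct InflightGuard {
    pending: Mutex<HashMap<PageId, Vec<Sender<Page>>>>,
}

impl InflightGuard {
    fn begin(&self, page_id: PageId) -> FetchTicket {
        let mut pending = self.pending.lock().unwrap();
        if let Some(waiters) = pending.get_mut(&page_id) {
            // Someone is already fetching this page: piggyback on their result.
            let (tx, rx) = channel();
            waiters.push(tx);
            FetchTicket::Wait(rx)
        } else {
            pending.insert(page_id, Vec::new());
            FetchTicket::MustFetch
        }
    }

    // Called by the fetcher once the page arrives: wake every waiter.
    fn finish(&self, page_id: PageId, page: &Page) {
        let waiters = self.pending.lock().unwrap().remove(&page_id).unwrap_or_default();
        for tx in waiters {
            let _ = tx.send(page.clone());
        }
    }
}
```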