Disk I/O — Scheduler, io_uring Data Pages & WAL Runtime

1. Architecture

  • Request Path: foreground components enqueue DiskRequest objects via DiskScheduler::{schedule_read, schedule_write, …}. A dispatcher thread drains the global channel and distributes work round-robin to N io_uring workers. Each worker owns its own ring and file-descriptor cache, so once a request is forwarded, execution proceeds entirely off the foreground thread.
  • Stable APIs: schedule_read(page_id), schedule_write(page_id, Bytes), schedule_read_pages(Vec<PageId>), schedule_allocate(), schedule_deallocate(page_id) — every call returns a channel the caller can block on or poll (see the sketch after this list).
  • Batch Reads: ReadPages fans out per-page SQEs while a shared BatchState tracks completions. Even if the kernel completes I/O out of order, the caller receives a Vec<BytesMut> that preserves the original page order.
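
That contract, as a minimal sketch from the caller's side. The trait name DiskSchedulerApi, the std::sync::mpsc channel, and the String error type are stand-ins for the crate's real types:

```rust
use std::sync::mpsc::Receiver;

// Hypothetical stand-ins for the crate's real types.
type PageId = u64;
type QuillSQLResult<T> = Result<T, String>;

// The caller-facing surface described above: every schedule_* call returns a
// channel carrying the eventual result, so callers may block or poll.
trait DiskSchedulerApi {
    fn schedule_read(&self, page_id: PageId) -> Receiver<QuillSQLResult<Vec<u8>>>;
    fn schedule_write(&self, page_id: PageId, data: Vec<u8>) -> Receiver<QuillSQLResult<()>>;
}

// Block until the read completes on a worker ring and the result flows back.
fn read_page_blocking(sched: &dyn DiskSchedulerApi, page_id: PageId) -> QuillSQLResult<Vec<u8>> {
    let rx = sched.schedule_read(page_id); // enqueue; the dispatcher picks a worker
    rx.recv().expect("scheduler hung up before answering")
}
```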

2. WAL Runtime (buffered I/O)

  • Dedicated WAL runtime threads handle sequential WAL appends/reads using buffered I/O. Each thread keeps a per-thread cache of open segment files, eliminating repeated open()/close() calls on every log record (sketched below).
  • Worker count defaults to max(1, available_parallelism / 2) but is tunable through IOSchedulerConfig.
  • Optional sync on a request triggers sync_data / fdatasync so WalManager can honour synchronous commit or checkpoint barriers. Data pages stay on the io_uring dataplane; WAL always uses buffered writes.
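
A minimal sketch of that per-thread segment cache, assuming a HashMap keyed by segment id; the struct name, path layout, and open flags are illustrative, not the crate's actual ones:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io;

// Each WAL runtime thread owns its own map of open segment files, so the hot
// append path avoids an open()/close() pair per record.
struct WalThreadState {
    open_segments: HashMap<u64, File>, // segment id -> cached handle
}

impl WalThreadState {
    fn segment(&mut self, seg_id: u64) -> io::Result<&mut File> {
        match self.open_segments.entry(seg_id) {
            Entry::Occupied(e) => Ok(e.into_mut()), // cache hit: no syscall
            Entry::Vacant(v) => {
                // Cache miss: open once, keep the handle for later appends.
                let file = OpenOptions::new()
                    .append(true)
                    .create(true)
                    .open(format!("wal/{seg_id:020}.log"))?;
                Ok(v.insert(file))
            }
        }
    }
}
```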

3. io_uring Backend (Linux)

  • Each worker owns an IoUring with configurable queue_depth, optional SQPOLL idle timeout, and a pool of registered fixed buffers sized to PAGE_SIZE. Workers submit SQEs asynchronously and drain CQEs in small batches to keep the ring warm.
  • Read batching relies on shared BatchState instances (Rc<RefCell<_>>), so multi-page callers see results in the original request order even though the kernel may complete the underlying I/O in any order (see the sketch below).
  • Writes keep their payload alive until completion; if a fixed buffer slot is available we reuse it, otherwise we fall back to heap buffers. A companion WriteState tracks an optional fdatasync so the caller still observes exactly one Result<()> once all CQEs land.
  • Errors (short read/write, errno) are normalised into QuillSQLError values that flow back on the original channel.
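
A minimal sketch of that order-preserving bookkeeping, with Vec<u8> standing in for BytesMut and assumed field names; each CQE fills its slot, and the final completion releases the ordered result on the caller's channel:

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::mpsc::Sender;

// One slot per requested page, filled as CQEs arrive in any order; the
// caller's channel fires exactly once, when the last slot lands.
struct BatchState {
    slots: Vec<Option<Vec<u8>>>,              // index = position in the request
    remaining: usize,                         // completions still outstanding
    tx: Sender<Result<Vec<Vec<u8>>, String>>, // caller's channel
}

// Called from the worker's CQE loop for each completed per-page read.
fn complete_page(state: &Rc<RefCell<BatchState>>, idx: usize, data: Vec<u8>) {
    let mut s = state.borrow_mut();
    s.slots[idx] = Some(data);
    s.remaining -= 1;
    if s.remaining == 0 {
        // Every CQE is in: hand pages back in the original request order.
        let pages: Vec<Vec<u8>> = s.slots.drain(..).map(|p| p.unwrap()).collect();
        let _ = s.tx.send(Ok(pages));
    }
}
```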

4. Configuration

  • config::IOSchedulerConfig controls the following knobs (a sketch with defaults follows the list):
    • workers: number of io_uring workers (default = available parallelism).
    • wal_workers: WAL runtime threads (default = max(1, workers / 2)).
    • iouring_queue_depth, iouring_fixed_buffers, iouring_sqpoll_idle_ms.
    • fsync_on_write: whether data-page writes also issue fdatasync (WAL sync is managed separately by WalManager).
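
A sketch of what such a struct could look like with the documented defaults wired into Default; the queue-depth and fixed-buffer numbers are assumptions, since the doc does not state them:

```rust
use std::thread::available_parallelism;

// Illustrative layout; the real struct lives in the crate's config module and
// may differ in field names and types.
#[derive(Clone, Debug)]
pub struct IOSchedulerConfig {
    pub workers: usize,                      // io_uring workers
    pub wal_workers: usize,                  // WAL runtime threads
    pub iouring_queue_depth: u32,            // SQ/CQ entries per worker ring
    pub iouring_fixed_buffers: usize,        // registered PAGE_SIZE buffers
    pub iouring_sqpoll_idle_ms: Option<u32>, // None disables SQPOLL
    pub fsync_on_write: bool,                // fdatasync after data-page writes
}

impl Default for IOSchedulerConfig {
    fn default() -> Self {
        let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);
        Self {
            workers: cores,                  // default = available parallelism
            wal_workers: (cores / 2).max(1), // default = max(1, workers / 2)
            iouring_queue_depth: 256,        // assumed value
            iouring_fixed_buffers: 64,       // assumed value
            iouring_sqpoll_idle_ms: None,
            fsync_on_write: false,
        }
    }
}
```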

5. Concurrency & Safety

  • Worker-local file descriptors plus positional I/O remove shared mutable state on the hot path. The new per-worker handle cache further reduces syscall overhead.
  • Shutdown sequence: enqueue Shutdown, the dispatcher forwards it to every worker, each worker drains its outstanding SQEs/CQEs, and finally the dispatcher and workers are joined (see the sketch below).
  • BufferPool, TableHeap, and the streaming scan ring buffer still integrate via channels; inflight guards prevent duplicate page fetches.
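
A minimal sketch of that shutdown handshake, assuming a DiskRequest::Shutdown variant and joinable thread handles; the enum and struct layout mirror the sequence above but are not the crate's actual definitions:

```rust
use std::sync::mpsc::Sender;
use std::thread::JoinHandle;

// Assumed request enum; the real one also carries Read/Write/ReadPages/... payloads.
enum DiskRequest {
    Shutdown,
}

struct DiskScheduler {
    tx: Sender<DiskRequest>,
    dispatcher: Option<JoinHandle<()>>,
    workers: Vec<JoinHandle<()>>,
}

impl Drop for DiskScheduler {
    fn drop(&mut self) {
        // 1. Enqueue Shutdown; the dispatcher forwards it to every worker.
        let _ = self.tx.send(DiskRequest::Shutdown);
        // 2. Join the dispatcher first, then each worker once it has drained
        //    its outstanding SQEs/CQEs and exited its loop.
        if let Some(d) = self.dispatcher.take() {
            let _ = d.join();
        }
        for w in self.workers.drain(..) {
            let _ = w.join();
        }
    }
}
```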

6. Performance Notes

  • Random page access benefits from fewer syscalls and deeper outstanding queue depth than a blocking positional-I/O baseline.
  • Only the io_uring backend currently ships (Linux x86_64). A portable fallback remains future work.
  • For large sequential scans, combine ReadPages with the ring-buffer iterator to minimise buffer-pool churn.
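
A hypothetical sketch of that pattern: batch page ids through schedule_read_pages and visit each ordered batch as it completes. The BatchReader trait, batch size, and error type are assumptions; the crate's actual ring-buffer iterator is not shown:

```rust
use std::sync::mpsc::Receiver;

type PageId = u64;

// Assumed trait mirroring the batch-read entry point from §1.
trait BatchReader {
    fn schedule_read_pages(&self, pages: Vec<PageId>) -> Receiver<Result<Vec<Vec<u8>>, String>>;
}

// Scan in fixed-size batches, visiting each ordered batch as it completes
// instead of pulling every page through the buffer pool.
fn sequential_scan<R: BatchReader>(reader: &R, pages: &[PageId], mut visit: impl FnMut(&[u8])) {
    const BATCH: usize = 32; // assumed batch size; tune against queue_depth
    for chunk in pages.chunks(BATCH) {
        let rx = reader.schedule_read_pages(chunk.to_vec());
        if let Ok(Ok(batch)) = rx.recv() {
            for page in &batch {
                visit(page); // pages arrive in request order (see §1)
            }
        }
    }
}
```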

7. Future Work

  • Queue-depth aware scheduling and CQE bulk harvesting.
  • Optional group commit (aggregate writes, single fsync) behind configuration.
  • Metrics hooks (queue depth, submit/complete throughput, latency percentiles, error codes).
  • Cross-platform fallback backend and richer prioritisation/throttling policies.
  • Control-plane knobs for throttling individual background workers.