Rule-Based Optimization
After the LogicalPlanner creates an initial LogicalPlan, it’s passed to the LogicalOptimizer. The initial plan is a direct, syntactically correct translation of the SQL query, but it’s often not the most efficient way to execute it. The optimizer’s job is to transform this plan into an equivalent, but more performant, logical plan.
The Optimizer in QuillSQL
QuillSQL implements a Rule-Based Optimizer. This is a common and powerful approach where the optimizer is equipped with a set of predefined transformation rules. It repeatedly applies these rules to the logical plan tree until no more rules can be applied, or a maximum number of passes is reached.
The main components are:
- `LogicalOptimizer` (`optimizer/logical_optimizer.rs`): The main driver. It holds a list of rules and contains the logic to recursively walk the plan tree and apply them.
- The `LogicalOptimizerRule` trait: An interface that every optimization rule must implement. Its core method is `try_optimize`, which takes a plan node and attempts to return a rewritten, optimized version of that node.
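To make the shape of this machinery concrete, here is a minimal sketch of the rule interface and driver loop. The names `LogicalOptimizer` and `try_optimize` come from the text above, but the `LogicalPlan` enum, the `apply_recursively` helper, and the `max_passes` field are toy stand-ins for illustration, not QuillSQL's actual types:

```rust
// Toy plan representation -- QuillSQL's real LogicalPlan has many more node types.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan { table: String },
    Sort { limit: Option<usize>, input: Box<LogicalPlan> },
    Limit { n: usize, input: Box<LogicalPlan> },
}

trait LogicalOptimizerRule {
    /// Attempt to rewrite one node; `None` means "no change".
    fn try_optimize(&self, plan: &LogicalPlan) -> Option<LogicalPlan>;
}

struct LogicalOptimizer {
    rules: Vec<Box<dyn LogicalOptimizerRule>>,
    max_passes: usize,
}

impl LogicalOptimizer {
    /// Apply every rule over the whole tree, repeating until a fixed point
    /// is reached or the pass budget is exhausted.
    fn optimize(&self, plan: LogicalPlan) -> LogicalPlan {
        let mut current = plan;
        for _ in 0..self.max_passes {
            let mut changed = false;
            for rule in &self.rules {
                let rewritten = self.apply_recursively(rule.as_ref(), &current);
                if rewritten != current {
                    current = rewritten;
                    changed = true;
                }
            }
            if !changed {
                break; // no rule fired this pass: fixed point reached
            }
        }
        current
    }

    /// Try the rule at this node first, then recurse into children.
    fn apply_recursively(&self, rule: &dyn LogicalOptimizerRule, plan: &LogicalPlan) -> LogicalPlan {
        let node = rule.try_optimize(plan).unwrap_or_else(|| plan.clone());
        match node {
            LogicalPlan::Sort { limit, input } => LogicalPlan::Sort {
                limit,
                input: Box::new(self.apply_recursively(rule, &input)),
            },
            LogicalPlan::Limit { n, input } => LogicalPlan::Limit {
                n,
                input: Box::new(self.apply_recursively(rule, &input)),
            },
            leaf => leaf,
        }
    }
}
```

The fixed-point loop is why rule order matters less than it might seem: a rule that misses on pass one may fire on pass two, after another rule has reshaped the tree.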
Deep Dive: The PushDownLimit Rule
One of the most classic and effective optimizations is “pushing down” operations as far as possible towards the data source. Let’s examine the PushDownLimit rule (optimizer/rule/push_down_limit.rs) to see this in action.
Consider the following query:
```sql
SELECT * FROM users ORDER BY signup_date LIMIT 10;
```
The Naive Plan
A naive logical plan for this query would be:
```
Limit(10)
└── Sort(by: signup_date)
    └── TableScan(users)
```
If executed directly, this plan would:

- Scan the entire `users` table.
- Sort the entire table by `signup_date`.
- Finally, discard all but the first 10 rows.
This is incredibly inefficient, especially for a large table, as it involves a massive, memory-intensive sort operation.
The Optimization Rule
The `PushDownLimit` rule is designed to recognize this specific pattern: a `Limit` operator directly on top of a `Sort` operator.

When the optimizer applies this rule, the `try_optimize` method matches on the `Limit` node. It inspects its child and sees that it’s a `Sort` node. The rule then knows it can apply its logic.
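The pattern match can be sketched as follows. This is a hedged, self-contained illustration on a toy `LogicalPlan` enum, not the actual code in `optimizer/rule/push_down_limit.rs`:

```rust
// Toy plan nodes, just enough to express the Limit-over-Sort pattern.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan { table: String },
    Sort { order_by: String, limit: Option<usize>, input: Box<LogicalPlan> },
    Limit { n: usize, input: Box<LogicalPlan> },
}

struct PushDownLimit;

impl PushDownLimit {
    /// Match `Limit` directly over `Sort` and copy the limit into the Sort node.
    fn try_optimize(&self, plan: &LogicalPlan) -> Option<LogicalPlan> {
        if let LogicalPlan::Limit { n, input } = plan {
            if let LogicalPlan::Sort { order_by, limit: _, input: sort_input } = input.as_ref() {
                // Keep the Limit on top (it still truncates the output stream),
                // but tell the Sort it only needs to produce the top `n` rows.
                return Some(LogicalPlan::Limit {
                    n: *n,
                    input: Box::new(LogicalPlan::Sort {
                        order_by: order_by.clone(),
                        limit: Some(*n),
                        input: sort_input.clone(),
                    }),
                });
            }
        }
        None // pattern did not match; leave the node unchanged
    }
}
```

Note that the rewrite is equivalence-preserving: the `Limit` stays in place, so the query still returns at most `n` rows even if the `Sort` implementation were to ignore the hint.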
The Rewritten Plan
The rule rewrites the plan tree by “pushing” the limit information into the `Sort` node itself:

```
Limit(10)
└── Sort(by: signup_date, limit: 10)
    └── TableScan(users)
```
Notice the new `limit: 10` property on the `Sort` node. This seemingly small change has a huge performance impact. When the `PhysicalSort` operator is created from this logical node, it knows it only needs to produce the top 10 rows. Instead of performing a full sort, it can use a much more efficient algorithm: maintain a bounded heap of 10 candidates and find the top 10 rows in a single pass over the data. (For an ascending sort this is a max-heap of size 10, evicting its largest element whenever a smaller row arrives.)
This optimization avoids sorting the entire table, dramatically reducing both CPU and memory consumption.
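The single-pass top-k idea can be shown in isolation. The function below is illustrative (`top_k_smallest` is not a QuillSQL API): it keeps a bounded heap of `k` candidates — a max-heap when the sort is ascending, since the largest candidate is always the one to evict — so memory stays O(k) instead of O(n):

```rust
use std::collections::BinaryHeap;

/// Return the k smallest values in ascending order, scanning the input once.
fn top_k_smallest(values: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    // BinaryHeap is a max-heap: the root is the largest of the k candidates,
    // i.e. exactly the element to discard when a better row shows up.
    let mut heap: BinaryHeap<i64> = BinaryHeap::new();
    for v in values {
        heap.push(v);
        if heap.len() > k {
            heap.pop(); // evict the largest of the k+1 candidates
        }
    }
    // Only k elements remain; sorting them is cheap compared to sorting n.
    let mut result = heap.into_vec();
    result.sort();
    result
}
```

Each input row costs at most O(log k) heap work, so the whole scan is O(n log k) time and O(k) memory, versus O(n log n) time and O(n) memory for a full sort.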
Other Rules
QuillSQL implements other simple but effective rules:
- `EliminateLimit`: Removes a `LIMIT` clause if it provides no value (e.g., `LIMIT NULL`).
- `MergeLimit`: If two `LIMIT` clauses are stacked on top of each other (which can happen after other rule transformations), this rule merges them into a single, more restrictive `LIMIT`.
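The `MergeLimit` pattern is even simpler than `PushDownLimit`. A hedged sketch on a toy plan enum (again, illustrative names, not QuillSQL's actual implementation):

```rust
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan { table: String },
    Limit { n: usize, input: Box<LogicalPlan> },
}

/// Collapse `Limit(a)` over `Limit(b)` into `Limit(min(a, b))`.
/// Returns `None` when the node is not a stacked pair of limits.
fn merge_limit(plan: &LogicalPlan) -> Option<LogicalPlan> {
    if let LogicalPlan::Limit { n: outer, input } = plan {
        if let LogicalPlan::Limit { n: inner, input: grandchild } = input.as_ref() {
            // Two stacked LIMITs are equivalent to the more restrictive one:
            // the inner limit truncates first, the outer can only shrink further.
            return Some(LogicalPlan::Limit {
                n: (*outer).min(*inner),
                input: grandchild.clone(),
            });
        }
    }
    None
}
```

This is also a good example of one rule enabling another: a `Limit` that `PushDownLimit` moves through the tree can end up adjacent to another `Limit`, which `MergeLimit` then collapses on a later pass.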
Conclusion
While QuillSQL’s optimizer is currently rule-based and relatively simple, it demonstrates the fundamental principles of query optimization. By separating the logical representation of a query from its physical execution and applying equivalence-preserving transformations, a database can achieve massive performance gains. More advanced systems build on this with a Cost-Based Optimizer, which uses table statistics to estimate the “cost” of different physical plans (e.g., choosing between a `NestedLoopJoin` and a `HashJoin`) and pick the cheapest one.
For Study & Discussion
- **Rule Ordering:** The `LogicalOptimizer` applies its list of rules in a fixed order for a set number of passes. Can the order in which rules are applied affect the final, optimized plan? Can one rule’s transformation enable another rule to be applied in a subsequent pass?
- **Cost-Based vs. Rule-Based:** What is the primary limitation of a purely rule-based optimizer? When would a rule-based optimizer make a poor decision that a cost-based optimizer (with accurate statistics) would get right? (Hint: consider join algorithm selection.)
- **Programming Exercise:** Implement the classic Predicate Pushdown rule. Your rule should look for a `Filter` operator whose child is a `Join`. If the filter’s predicate only uses columns from one side of the join, the rule should push the `Filter` node down to that side of the join, below the `Join` node. This is one of the most effective optimizations in any database.
- **Programming Exercise:** Implement a Constant Folding rule. This rule would traverse expression trees and pre-compute constant expressions. For example:
  - An expression `WHERE age = 10 + 5` would be rewritten to `WHERE age = 15`.
  - An expression `WHERE 1 = 1` would be evaluated to `true`, and a smart optimizer could then potentially eliminate the `WHERE` clause entirely.
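As a starting point for the constant-folding exercise (not a full solution), here is one way the expression tree and a single folding step for integer addition might look. The names `Expr` and `fold` are illustrative, not QuillSQL's actual expression types:

```rust
// Toy expression tree: enough to represent `age = 10 + 5`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    IntLiteral(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Bottom-up fold: children are folded first, then constant
/// subtrees at this node are pre-computed.
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold(*l), fold(*r)) {
            // Both sides are constants: replace the Add with its result.
            (Expr::IntLiteral(a), Expr::IntLiteral(b)) => Expr::IntLiteral(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Eq(l, r) => Expr::Eq(Box::new(fold(*l)), Box::new(fold(*r))),
        leaf => leaf, // literals and columns cannot be folded further
    }
}
```

A full solution would extend this to the other operators, evaluate constant comparisons like `1 = 1` to a boolean literal, and then let a follow-up rule remove `Filter` nodes whose predicate is the literal `true`.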