10 Query Optimization
Week 10: April 10 - 16, 2022
- COWS Chapter 15 - A Typical Relational Query Optimizer
- Query Optimization on Gradescope.
The beauty of a declarative language like SQL is that the user tells the DBMS what answer they want, not how to get the answer. Now, the DBMS translates this SQL statement into an executable query plan. Over the past few units, we have looked at different ways we can execute a query, for example, by selecting different join algorithms and how each of these choices can drastically impact the performance of a query.
The job of the query optimizer is to figure out the “best” possible plan. It generally employs the following strategies:
- Heuristics/Rules: Rewrite the query using equivalences to prune inefficient plans.
- Cost-based Search: Use a cost-model to estimate the cost of several equivalent query plans to choose one with the least estimated cost.
Any cost-based optimizer can be described by three dimensions:
- Plan space or the space of possible plans to consider for a given query.
- Cost estimation or how to estimate the cost of candidate plans so that the cheaper one can be selected.
- Search strategy or how to efficiently search the massive space of possible plans to find an “optimal” one.
10.1 Plan Space
Two relational algebra expressions are relationally equivalent if they generate the same set of tuples. Given this, the DBMS can identify alternative query plans and even find better ones without a cost model.
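For instance, pushing a selection below a join is a standard relational equivalence whenever the predicate \(p\) references only attributes of \(R\):

\[ \sigma_p(R \bowtie S) \equiv \sigma_p(R) \bowtie S \]

The rewritten plan typically feeds far fewer tuples into the join, so most optimizers apply this rewrite unconditionally.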
Physical equivalences boil down to enumerating the different but equivalent physical implementations of an operator. For example, the different join algorithms (hash vs sort-merge join) or base table accesses (heap scan vs index scan) available.
A query’s plan space is the set of all relationally or physically equivalent query plans.
For queries that involve multiple relations, the number of alternative query plans grows rapidly as the number of tables joined increases. For an \(n\)-way join, the number of possible join-tree shapes is given by the Catalan number \((\approx 4^n)\), where \(n\) is the number of tables. This is too large a space to enumerate. Not only is it infeasible for the DBMS to consider all possible plans, but in many situations, the time to optimize and decide on a query plan should be interactive. You wouldn’t wait an hour for an optimizer to select the best plan when the query only takes seconds to execute!
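To get a feel for this growth, here is a quick Python sketch (illustrative only) that counts the distinct binary join-tree shapes for an \(n\)-way join using the Catalan numbers:

```python
from math import comb

def catalan(n: int) -> int:
    """Number of distinct binary join-tree shapes for joining n tables,
    i.e. the (n-1)-th Catalan number: C(2(n-1), n-1) / n."""
    return comb(2 * (n - 1), n - 1) // n

for n in (2, 4, 8, 12):
    print(n, catalan(n))
# 2 tables -> 1 shape, 4 -> 5, 8 -> 429, 12 -> 58786
```

Note that this counts only tree shapes; multiplying by the possible leaf orderings and the physical operator choices at each node makes the true plan space far larger still.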
Many systems, thus, reduce their search complexity. For example System R only considers left-deep join-tree plans. Left-deep plans have the advantage of allowing you to pipeline data through your operators (remember the iterator model) and they only need to maintain a single join table in memory (compare with bushy plans). Even modern commercial systems like Oracle limit their search space to left deep plans.
What does Postgres do? Well, it doesn’t limit the search space to left-deep plans only. It also considers bushy plans and right-deep plans. It still uses the “Selinger Optimizer”, a dynamic programming algorithm, for join orders. However, it only uses this exhaustive strategy for queries that join fewer than 12 tables by default. Beyond that, it uses a genetic algorithm to select a good join ordering.
What benefits do we get from right-deep plans? Using hash joins, right-deep trees are executed by first building a hash table for every relation except one, then probing all of these hash tables in a pipelined fashion, whereas in left-deep trees, a new hash table is built from the result of each join. So a right-deep tree leads to better pipelining in the case of hash joins. However, this comes at the cost of holding many hash tables in the buffer simultaneously! A left-deep plan naturally avoids this, as each join must wait for the full result of its left (outer) upstream join before its hash table is ready for probing, so only a small number of hash tables are live at a time.
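To make the pipelining concrete, here is a toy Python sketch (not a real executor; the relations and attribute names are made up) of a right-deep chain of hash joins. All hash tables are built up front, then probe tuples stream through every table in a single pipelined pass:

```python
# Toy illustration of right-deep hash-join pipelining.
# Tuples are plain dicts; join keys are hypothetical attribute names.

def build_hash_table(rel, key):
    """Build side: hash each tuple of the relation on its join key."""
    table = {}
    for t in rel:
        table.setdefault(t[key], []).append(t)
    return table

def right_deep_hash_join(probe_rel, inners):
    """Stream probe_rel through a list of (hash_table, key) pairs.

    Every hash table lives in memory at once -- the memory cost the
    text mentions -- but no intermediate result is ever materialized.
    """
    for t in probe_rel:
        candidates = [t]
        for table, key in inners:
            # Probe each hash table in turn; tuples that miss are dropped.
            next_candidates = []
            for c in candidates:
                for match in table.get(c[key], []):
                    next_candidates.append({**c, **match})
            candidates = next_candidates
        yield from candidates

# Hypothetical relations R(a, b), S(b, c), T(c, d)
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"b": 10, "c": 100}, {"b": 30, "c": 300}]
T = [{"c": 100, "d": 7}]

inners = [(build_hash_table(S, "b"), "b"), (build_hash_table(T, "c"), "c")]
results = list(right_deep_hash_join(R, inners))
# Only R's first tuple survives both probes.
```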
10.2 Cost Estimation
The general idea here is to use an internal cost model to estimate the execution costs of different query plans and select the one with the least cost, without actually executing the query.
The cost model is based on estimates that are usually not comparable to real-world costs (this is especially true as the query optimizer may not be aware of the underlying hardware. Is it a direct attached disk or a network disk? Is it an SSD or an HDD, etc? Is it a virtual machine?). The estimates are often based on how different resources are used:
- Disk: Number of blocks accessed - this is the key resource we will focus on in this unit
- Memory: size of memory used - this consideration was briefly touched on when we looked at why we might choose left-deep plans.
- Network: number of messages transferred
- CPU: how much processing per tuple - roughly, the number of computational instructions executed per tuple.
To help with cost estimation, a DBMS maintains different statistics such as table size, attributes, indexes available, etc in the catalog. Higher-end systems have better statistics.
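As a toy illustration of a disk-block cost model (the formulas and constants below are simplified assumptions, not any real system's), compare a sequential heap scan with an unclustered index scan. Catalog statistics like page and tuple counts are exactly what make this arithmetic possible:

```python
# Back-of-the-envelope disk-block cost model (illustrative assumptions only).

def seq_scan_cost(num_pages: int) -> float:
    """A full heap scan reads every page once."""
    return num_pages

def index_scan_cost(num_pages: int, num_tuples: int, selectivity: float,
                    index_height: int = 3) -> float:
    """Traverse the index, then fetch one heap page per matching tuple
    (pessimistic: assumes the index is unclustered)."""
    matching = num_tuples * selectivity
    return index_height + matching

# With 1,000 pages and 100,000 tuples, the index scan only wins for
# highly selective predicates:
print(seq_scan_cost(1000))                    # 1000
print(index_scan_cost(1000, 100_000, 0.001))  # 103.0
print(index_scan_cost(1000, 100_000, 0.10))   # 10003.0
```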
These estimates are quite important. Even the best optimizers can make bad decisions especially if their cost estimates are off.
The main tool we use to estimate result-set cardinality is selectivity.
Beware: “highly selective” in common English means few tuples pass, which corresponds to a low selectivity value \(\left(\frac{|\mathrm{output}|}{|\mathrm{input}|}\right)\).
A useful way of thinking of selectivity is as the probability that a tuple will satisfy a given condition / appear in the output.
Here is a quick summary list:
- \(\mathrm{sel}(A = v) = \frac{1}{|\pi_AR|}\)
- \(\mathrm{sel}(R.A = S.A) = \frac{1}{\mathrm{max}(|\pi_AR|, |\pi_AS|)}\). The size of the join result is \(|R|\times |S|\times \mathrm{sel}(R.A = S.A)\)
- \(\mathrm{sel}(\neg p) = 1 - \mathrm{sel}(p)\), where \(p\) is a predicate term
- \(\mathrm{sel}(A > v) = \frac{\mathrm{High}(A) - v}{\mathrm{High}(A) - \mathrm{Low}(A)}\), you can add an error factor to the denominator (e.g. +1 to avoid division by 0)
- \(\mathrm{sel}(p_1 \wedge p_2) = \mathrm{sel}(p_1) \times \mathrm{sel}(p_2)\), assuming the predicates are independent
- \(\mathrm{sel}(p_1 \vee p_2) = \mathrm{sel}(p_1) + \mathrm{sel}(p_2) - \mathrm{sel}(p_1) \times \mathrm{sel}(p_2)\)
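The rules above translate directly into code. Here is a Python sketch (with the uniformity and independence assumptions from the text baked in; all statistics are passed in as plain numbers):

```python
# Selectivity rules under uniformity and independence assumptions.

def sel_eq_const(distinct_values: int) -> float:
    """sel(A = v) = 1 / |pi_A R|"""
    return 1.0 / distinct_values

def sel_equijoin(distinct_r: int, distinct_s: int) -> float:
    """sel(R.A = S.A) = 1 / max(|pi_A R|, |pi_A S|)"""
    return 1.0 / max(distinct_r, distinct_s)

def sel_not(sel_p: float) -> float:
    """sel(not p) = 1 - sel(p)"""
    return 1.0 - sel_p

def sel_gt(v: float, low: float, high: float) -> float:
    """sel(A > v) = (High(A) - v) / (High(A) - Low(A))"""
    return (high - v) / (high - low)

def sel_and(sel_p1: float, sel_p2: float) -> float:
    """Assumes the predicates are independent."""
    return sel_p1 * sel_p2

def sel_or(sel_p1: float, sel_p2: float) -> float:
    """Inclusion-exclusion under independence."""
    return sel_p1 + sel_p2 - sel_p1 * sel_p2

def join_size(card_r: int, card_s: int, distinct_r: int, distinct_s: int) -> float:
    """Estimated equi-join result size: |R| * |S| * sel(R.A = S.A)"""
    return card_r * card_s * sel_equijoin(distinct_r, distinct_s)
```

For example, joining \(|R| = 1000\) and \(|S| = 2000\) on an attribute with 10 and 50 distinct values respectively gives an estimate of \(1000 \times 2000 \times \frac{1}{50} = 40000\) tuples.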
10.3 Search Strategy
While a cost model estimates the cost of each plan, we still need a search algorithm that enumerates and costs correct alternative query execution plans.
We use a dynamic programming strategy where the base case selects the best plan for single-relation queries and then builds on the base case by joining in one relation at a time until all relations in the query are appropriately combined. Integral to the DP strategy is the assumption of optimal substructure: i.e., the optimal plan overall is built from optimal subplans. This is not always the case: a suboptimal subplan may lead to an optimal overall plan because its result set has an interesting order (e.g., sortedness on a join or grouping attribute) that downstream operators can benefit from. To accommodate this wrinkle, the DP approach stores, in addition to the optimal query subplan, the cheapest query subplan for each interesting order.
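A toy version of this dynamic program over left-deep join orders might look like the following sketch (the cardinalities, the flat join selectivity, and the cost function are all made-up assumptions, and interesting orders are omitted for brevity):

```python
# Toy Selinger-style dynamic program over left-deep join orders.
from itertools import combinations

card = {"R": 1000, "S": 200, "T": 50}   # hypothetical table cardinalities
JOIN_SEL = 0.01                         # flat join selectivity assumption

def result_size(tables: frozenset) -> float:
    """Estimated cardinality of joining this set of tables."""
    size = 1.0
    for t in tables:
        size *= card[t]
    return size * JOIN_SEL ** (len(tables) - 1)

def best_plan(tables):
    # best[S] = (cost, plan string) for each non-empty subset S of tables
    best = {frozenset([t]): (0.0, t) for t in tables}
    for k in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, k)):
            for t in subset:                 # t joins last (left-deep)
                rest = subset - {t}
                sub_cost, sub_plan = best[rest]
                # Cost = subplan cost + size of the intermediate result read.
                cost = sub_cost + result_size(rest)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, f"({sub_plan} ⋈ {t})")
    return best[frozenset(tables)]

cost, plan = best_plan(["R", "S", "T"])
# Joins the two small tables first, deferring the large R to the end.
```

The key DP property is visible in the table `best`: every subset's entry is computed once and reused by every larger subset that contains it, turning the factorial enumeration into one over subsets.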
Query optimization is still an exciting area of research in DBMSs. I highly recommend taking the time to read Leis et al., “How Good Are Query Optimizers, Really?”
This benchmark paper gives lots of fascinating details on real-world optimizers in a very easy to access fashion and explains how estimation errors and poor heuristics can easily compound to disastrous effect.