9 Query Processing & Joins

Week 9: April 3 - 9, 2022

  • COWS Chapter 12 - Provides a quick overview of query evaluation including optimization and processing. Sec 12.4.3 focuses on the iterator query processing model.
  • COWS Chapter 14.4 - Describes the join algorithms in detail.
  • COWS Chapter 14 - You need to read about other operator implementations to complete Lab 2!

9.1 Query Processing

Previously, we have seen that a SQL query is converted into a query plan of relational algebra operators. In this plan, operators are arranged in a tree and data flows from the leaves, which scan data from the relations (using a variety of access methods - heap sequential scan, index scan, etc.), towards the root operator, which outputs the results of the query.

Operators are either unary (have 1 child e.g. scan, project, select, etc.) or binary (have 2 children, e.g. joins).

In this unit, we will study the implementation of each operator and how operators are connected and invoked; this is referred to as the processing model. We will explore, at a pseudocode level, the implementation of a heap scan, the selection operator \(\sigma\), and a 2-pass sort (\(\tau\)) operator, and then dive deep into the many different implementations of joins (\(\bowtie\)). In the previous unit, we looked at how to implement *IO-aware* sorting and hashing, which serve as building blocks for many operations, including sorting, group by, aggregation, and duplicate elimination.

Our emphasis when discussing the different operator implementations is IO cost, not CPU cost.

9.1.1 The Iterator Model

This is the most common query processing model, and almost every row-oriented DBMS uses it. It is also referred to as the pipeline model, as each tuple is passed in pipelined fashion to the next operator in the plan. It is also called the volcano model, as the operators are laid out in a tree-like dataflow with a wide base (if the query scans multiple relations) and tuples flow up through the plan like lava!

To summarize, in the iterator model:

  • Every operator implements a next() function.
  • Each next() call makes the operator return a single tuple, or an 'EOF' marker if there are no more tuples to return.
  • Each operator implements a loop that calls next() on each of its children to retrieve the next tuple to process.
  • Not every operator can process a tuple at a time from its children and return it immediately. These blocking operators must wait until their children emit all of their tuples (e.g. sorting operators are blocking). In contrast, streaming or on-the-fly operators can process tuples one at a time with constant effort per tuple. A sketch of the model in code follows below.
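To make this concrete, here is a minimal sketch of the iterator model in Python, using generators to play the role of next(). The operator set, the tuple layout, and the example plan are illustrative assumptions, not any particular system's API:

def seq_scan(table):
    # Leaf operator: streams tuples out of a base table.
    for tup in table:
        yield tup

def select(predicate, child):
    # Streaming (on-the-fly) operator: constant work per tuple.
    for tup in child:
        if predicate(tup):
            yield tup

def sort(key, child):
    # Blocking operator: must consume ALL of its child's tuples
    # before it can emit its first output tuple.
    yield from sorted(child, key=key)

# Example plan: sort(select(scan(R))); tuples flow from leaf to root.
R = [(3, "c"), (1, "a"), (2, "b")]
plan = sort(lambda t: t[0], select(lambda t: t[0] > 1, seq_scan(R)))
for tup in plan:
    print(tup)  # prints (2, 'b') then (3, 'c')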

9.1.2 Other Processing Models

There are other models, e.g. the materialization model, where each operator processes all its inputs and then emits all its output at once – intermediate results are thus “materialized.” This model works better for transactional workloads and is typically used in memory-resident databases.

Another model is the vectorization or batch model, where each operator outputs a batch of tuples instead of a single tuple at a time. This model works better for analytical workloads that process large amounts of data; by processing a batch at a time, you amortize the overhead of function calls across the batch.
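To illustrate, here is a minimal sketch of the batch model in Python; the batch_size parameter and the list-of-tuples batch representation are illustrative assumptions:

def seq_scan_batched(table, batch_size=1024):
    # Leaf operator: emits batches (lists) of tuples, not single tuples.
    for i in range(0, len(table), batch_size):
        yield table[i:i + batch_size]

def select_batched(predicate, child):
    # The per-call overhead is now amortized over a whole batch.
    for batch in child:
        out = [t for t in batch if predicate(t)]
        if out:
            yield out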

9.2 Joins

There are two decisions that influence the design of join operators:

  1. What to output? Conceptually, given an outer (left) table R and an inner (right) table S, if a tuple r in R matches a tuple s in S under the join condition, the join concatenates (r, s) into a new output tuple and emits it. In reality, the contents of the output tuples generated by a join operator vary; they depend on the processing model, the storage model, and the query (what attributes operators higher up in the query plan actually need). Generally, the join operator emits either:
    • Data: Copy the values of the attributes from the outer and inner tables into tuples placed in an intermediate result table just for that operator. Subsequent operators in the query plan then do not need to access the base tables to get the data they need; however, more memory is needed to materialize the entire tuple. We will assume this approach for simplicity!
    • Record Ids: Copy only the join keys along with the record ids of the matching tuples. This approach, called late materialization, is suitable for column-oriented storage models.
  2. Which join algorithm to use? To make this decision, we need to analyze the costs of different join algorithms. Our cost analysis will focus on the number of disk IOs necessary to compute the join. It does not include the IOs required to write the final output of a join because, regardless of which algorithm you choose, they all output the same number of tuples!

We will express IO costs using the following variables:

  • \(N\) pages in table \(R\), \(n\) tuples total
  • \(M\) pages in table \(S\), \(m\) tuples total

We will study the following algorithms:

  • Nested Loop Join
  • Block Nested Loop Join
  • Index Nested Loop Join
  • Sort Merge Join
  • Hash Join
  • Grace Hash Join

9.2.1 Nested Loops Join

The general idea of this simple join algorithm is to use two nested for loops that iterate over all tuples in both tables and find the tuples that match the join condition. Here is a simple nested loops equi-join, with the pseudocode written as runnable Python (key is assumed to extract the join attribute from a tuple):

def nested_loops_join(R, S, key):
    for r in R:                    # outer loop: every tuple of R
        for s in S:                # inner loop: every tuple of S
            if key(r) == key(s):   # equi-join condition
                yield (r, s)       # emit the concatenated pair

The outer loop iterates over the outer (or left) table and the inner loop iterates over the inner (or right) table.

As you have seen in the video, an IO-aware NLJ improves on the naive NLJ in the following ways:

  • The smaller table is always chosen as the outer table, since you want to buffer as much of the outer table as possible in memory.
  • Tuples are matched across all pages currently loaded in memory, instead of tuple by tuple (see the sketch below).
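Here is a minimal sketch of the block (IO-aware) nested loops join in Python. Tables are modeled as lists of pages (lists of tuples), and B and key are illustrative parameters; a real system would pin actual buffer pages:

def block_nested_loops_join(R_pages, S_pages, key, B):
    # Hold B-2 pages of the outer table R in memory at once
    # (one buffer page is reserved for scanning S, one for output).
    chunk = B - 2
    for i in range(0, len(R_pages), chunk):
        block = [r for page in R_pages[i:i + chunk] for r in page]
        for s_page in S_pages:            # one full scan of S per chunk
            for s in s_page:
                for r in block:           # match against ALL buffered tuples
                    if key(r) == key(s):
                        yield (r, s)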

Cost Analysis:

In the naive NLJ, each tuple in the outer table \(R\) is compared with each tuple in the inner table \(S\); this worst case gives a cost of \(N + nM\). With Block NLJ, we instead compare tuples across all loaded pages, giving a cost of \(N + \lceil\frac{N}{B-2}\rceil M\). With only 3 buffer pages available, this degenerates to the simple Page NLJ, which has a cost of \(N + NM\), and with \(B \geq N+2\) the entire outer table fits in memory and we get the best possible IO cost of \(N+M\).
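As a quick worked example (with made-up sizes): suppose \(N = 1000\) pages, \(M = 500\) pages, \(n = 100{,}000\) tuples, and \(B = 102\) buffer pages. Naive NLJ costs \(1000 + 100{,}000 \times 500 = 50{,}001{,}000\) IOs, while Block NLJ costs \(1000 + \lceil 1000/100 \rceil \times 500 = 6000\) IOs.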

If you have an index on the join key of the inner table, then you can save the cost of a sequential scan to find matches in the inner table and instead do an index lookup to find each match. Assuming the cost of each index probe is some constant \(k\) per tuple, the IO cost is \(N + nk\).

9.2.2 Sort Merge Join

If your tables are sorted on the join attribute, then you don't need to re-scan the inner table for each tuple in the outer table; a single coordinated scan of both tables suffices: effectively, you perform one pass over each table. The algorithm proceeds in two phases:

  • Phase #1 – Sort: sort both input tables on the join attribute.
  • Phase #2 – Merge: Scan the two sorted tables in parallel, and emit matching tuples (see the sketch below).
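Here is a minimal sketch of the merge phase in Python, assuming both inputs are already sorted lists and that the join keys in S are unique (duplicate keys on both sides require backtracking, which is omitted here):

def merge_join(R, S, key):
    # One coordinated pass over two sorted inputs.
    i, j = 0, 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1                 # advance the side with the smaller key
        elif key(R[i]) > key(S[j]):
            j += 1
        else:
            yield (R[i], S[j])     # keys match: emit the pair
            i += 1                 # safe only because keys in S are unique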

See how it's done!

It might be a good idea to revise the external sorting unit from last week if you feel somewhat shaky on the cost analysis of external merge sort.

Cost = (Sort Costs) + (Merge Cost) = (Sort Costs) + \((N+M)\)
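For example (assuming each table can be sorted in two passes, and that each sorting pass reads and writes every page): the sort costs are \(4N + 4M\), giving a total of \(4N + 4M + (N + M) = 5(N+M)\) IOs.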

9.2.3 Grace Hash Join

A simple hash join proceeds in two phases: a build phase, where you build a hash table on the outer table, and a probe phase, where, for each tuple in the inner table, you probe the hash table to find a match.
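Here is a minimal sketch of this in-memory hash join in Python (assuming the hash table on the outer table fits in memory; key is an illustrative accessor for the join attribute):

from collections import defaultdict

def hash_join(R, S, key):
    # Build phase: hash table on the (smaller) outer table R.
    table = defaultdict(list)
    for r in R:
        table[key(r)].append(r)
    # Probe phase: one hash lookup per inner tuple.
    for s in S:
        for r in table.get(key(s), []):
            yield (r, s)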

What happens when you can’t fit the outer hash table in memory?

When the tables do not fit in main memory, you do not want the buffer pool manager constantly swapping pages in and out. The Grace Hash Join is an extension of the basic hash join where you first partition both tables, using the same hash function on the join key, into more manageable partitions (i.e. small enough that an in-memory hash table fits for each partition's outer portion) that are written to disk, and then you run a simple hash join on each pair of matching partitions. Sounds familiar - it is external hashing after all!
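Here is a minimal sketch of the partitioning idea in Python, reusing the hash_join sketch above. Partitions are kept as in-memory lists for illustration; a real DBMS would write each partition out to disk:

def grace_hash_join(R, S, key, num_partitions=8):
    # Partition both tables with the SAME hash function on the join key,
    # so matching tuples always land in the same partition index.
    R_parts = [[] for _ in range(num_partitions)]
    S_parts = [[] for _ in range(num_partitions)]
    for r in R:
        R_parts[hash(key(r)) % num_partitions].append(r)
    for s in S:
        S_parts[hash(key(s)) % num_partitions].append(s)
    # Then run a simple in-memory hash join per partition pair.
    for Rp, Sp in zip(R_parts, S_parts):
        yield from hash_join(Rp, Sp, key)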

The partitioning phase reads and writes both tables, costing \(2(N+M)\), and the build & probe phase reads each partition once, costing \((N + M)\), for a total cost of \(3(N + M)\).

Another strategy for dealing with hash tables that spill to disk uses a common systems design trick: another level of indirection. Here you create a Bloom filter: a very small probabilistic data structure that fits easily in memory. You probe it to check whether the on-disk hash table might have a matching entry before actually accessing it. If the Bloom filter finds no match, you are guaranteed that the hash table has no match; but if the filter says there is a match, then you may or may not find a match in the hash table. These data structures can save you a lot of unnecessary IO! To understand how they work, you can watch this optional video.
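As a toy illustration, here is a tiny Bloom filter sketch in Python; the bit-array size m, the number of hash functions k, and the hash-salting scheme are all illustrative choices (real implementations size m and k for a target false-positive rate):

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions by salting Python's built-in hash.
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False => definitely absent; True => possibly present.
        return all(self.bits[p] for p in self._positions(item))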

9.2.4 Summary