8 Sorting & Hashing
Week 8: March 27 - April 2, 2022
- COWS Chapter 13 - External Sorting
- COWS Section 14.3 - Sorting & Hashing for Projections
- Sorting & Hashing on Gradescope.
8.1 Basic Building Blocks
We will now begin looking at the internal implementations of different relational operators. The choice of which algorithms to use for a relational operator depends on many factors: the size of the input data set in relation to the IO costs of different implementations, how skewed the data is, etc.
Two building blocks for many relational operators are sorting and hashing and so we will begin by describing how these two operations are implemented in such a way to operate over large files that cannot fit entirely in memory. These external implementations aim to reduce IO costs and prefer sequential disk access patterns where possible. For example, these algorithm stream through the input file in passes and hence sequentially access all the pages of the file. A key assumption in these algorithms is that the latency of internal processing is smaller than the IO latency. Thus the choice of algorithms for internally sorting the tuples within a page once it is loaded in memory can be different than the external sorting algorithm, which operates at a page level. As you will see in the following videos, we use merge-sort for external sorting but we can use quick sort to sort the contents of a page in memory.
8.2 External Sorting
8.2.1 2-Way Merge Sort
We will begin by describing the vanilla 2-way merge sort and then describe optimizations to it that aim to reduce overall IOs.
At each pass, we have to read and write the entire file. So each pass costs us \(2N\) IOs. So, the overall cost of a 2-way merge sort depends on the number of passes. Our first streaming pass generates \(N\) sorted runs of length 1. Then we have a series of merge passes. Each merge pass combines two sorted runs of length \(k\) to create a sorted run of length \(2k\). So each merge pass doubles the length of the previous sorted run length.
So the number of merge passes can be derived as follows:
Merge Pass \(p\) | Run Length (\(L_p\)) |
---|---|
1 | \(L_1 = 2\) or \((2^1)\) |
2 | \(L_2 = 2*L_1 = 2*2 = 4\) or \((2^2)\) |
3 | \(L_3 = 2*L_2 = 2*(2*L_1) = 2*4\) or \((2^3)\) |
… | … |
p | \(L_p = 2*L_{p-1} = \Pi_1^{p}(2)\) or \((2^p)\) |
So how many passes do you need? We need to find when does the length of sorted run equal the lenght of the file, i.e. \(2^p \geq N\) or \(p \geq \log_2(N)\) or \(p = \lceil log_2(N) \rceil\).
So the total cost of a two-way merge sort is simply \((\lceil log_2(N) \rceil + 1)\times 2N\).
8.2.2 K-Way Merge Sort
The K-way merge makes two main optimizations:
- Utilizes all the buffer space available to generate sorted runs of length \(B\), where \(B\) is the number of buffer frames available. At the end of the streaming sorting pass, we will now have run length of size \(N/B\)
- Merges \(B-1\) sorted runs so that at end of each pass \(p\), we end up with run lengths \(L_p\) that are \((B-1)\times L_{p-1}\). Thus, we only need \(\lceil \log_{B-1}(N/B)\rceil\) merge passes to fully sort the file.
8.3 External Hashing
Like sorting, we need our hashing algorithm to work externally, i.e. assuming data cannot fit entirley in-memory.
It is a divide-conquer algorithm, where the goal of the divide phase is to partition the file into \(B-1\) partitions that are more manageable, i.e. we can build an in-memory hash table for each partition.
8.4 The Sort-Hash Duality
External hashing and sorting are duals and the choice between using one or the other as a basis for another relational operator’s implementation is subtle.
8.5 Beyond a Single Machine
In the external sorting video, we illustrate how we can sort a 4TB in only two passes with a relatively small memory buffer of roughly 127MB only. However, we would need an additional 8TB to store the partially sorted file from the first streaming sorting pass and for the final sorted output. With commodity disks of a disk bandwidth of 30MB/s, it would take 1.5 days to just read or write the file. With fancy disks of 500MB/s, it would take 2-3 hours just to read or write the file ignoring all the other CPU (and non-sequential IO) costs.
For large data sets, data processing occurs over large clusters with the work partitioned across multiple machines. The first step for external sorting or hashing is now data partitioning: range partitioning (distribute the data by ranges, such that machine A gets range [1,100), machine B get range [100, 200), etc.) and hash partitioning respectively (distribute the data by a hash function modulo the number of machines (\(n\)) in the cluster, such that machine 0 gets all the data where \(h(\mathrm{key})~~\mathrm{mod}~~n\) = 0 and machine 2 gets all the data where \(h(\mathrm{key})~~\mathrm{mod}~~n = 2\), etc.).
Each of these partitioning approaches have their own challenges: how to partition data into ranges if there is data skew (certain ranges are bigger than others)? How to re-balance partitions if the cluster grows or shrink? Consistent hashing is often used to address rebalancing issues in many large scale database systems such as Amazon Dynamo.
Partitioning is a building block for distributed databases systems just as much as sorting and hashing are!