B Benchmarking

This guide will present advanced SQL constructs. Take your time when working through it. Some experimental results will make more sense after you have covered the module on indexing.

In this worksheet, we will explore some SQL constructs and postgres tips and tricks for creating test databases and evaluating different options for indexing or querying, so that you can better understand how the postgres DBMS in particular executes SQL queries.

The main purpose of this worksheet is to help you adopt a more experimental, trial-and-error approach to designing and querying your database. The idea is that when you face questions such as which schema, data types, or indexes to use, where you have several alternatives and are not immediately certain which is best, you try out the different configurations quickly and pick a suitable one with some empirical confidence. This worksheet provides some tips and tricks to get you started on your journey to becoming a systems engineer or an insanely well-paid DBA.

A lot of what you will see here is somewhat unique to postgres, but the ideas port well to other database systems: you just need to figure out their particular quirks and syntax for executing similar commands.

B.1 Generating Synthetic Databases

B.1.1 Generate Data the Synner Way

Our lab (huda-lab.github.io) has built a tool, Synner, to help you create tables with different data types and value distributions for each column. You can access Synner at synner.io.

You can download the data sets as CSV files and then load the CSVs into your database.

For example, suppose you build the following table of biometric identifiers in Synner (you can download the CSV we generated):

Name     Surname   Sex  Age  Height
Dulcy    Bland     F    19   40
Osgood   Castelin  M    41   33
Jennica  King      F    20   6
Nelson   Birchill  M    40   31

First, create your table with the appropriate data types:

create type biologicalsex as enum ('F', 'M');
create table biometrics(
    name varchar(30),
    surname varchar(30),
    sex biologicalsex,
    age smallint,
    height smallint
);

Second, load the CSV into the table:

copy biometrics
  from '/.../biometrics.csv' --- put the full path to your file
  with csv header; 
  --- specify that the format is csv and that there is a header
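
Note that copy ... from reads the file on the database server, so the server process must be able to access that path. If the file lives on your client machine and you are connected through psql, the client-side \copy meta-command is an alternative (same options; the file name here is just a placeholder):

\copy biometrics from 'biometrics.csv' with csv header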

There are many widely available synthetic data generators. Here is another one: Mockaroo, but we think Synner is cooler ;).

B.1.2 Generate Data with your own SQL

Alternatively, you can use SQL generator functions to create all kinds of data sets yourself. Synner itself relies heavily on these generators.

First, let's generate the 1000 numbers from 1 to 1000:

create table seqnums as
(select * from generate_series(1, 1000) as x);

What is the schema of this generated table?

\d seqnums

Now that you can generate a series, you can use it to create 1000 random integers between 0 and 100:

create table randomintegers as
(select round(100*random()) from generate_series(1, 1000) as x);

With all kinds of casting, mathematical expressions, and postgres’ extensive library of built-in functions, you can produce many kinds of synthetic data sets, including strings and tables with multiple columns:

--- generate 1000 alphanumeric 10-character words and random numbers
drop table if exists randomwords;
create table randomwords as
(select substr(md5(random()::text), 1, 10) as word, 
        (random()*100)::integer as num
        from generate_series(1, 1000));
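
A quick sanity check on what you generated never hurts, for example:

select count(*), min(num), max(num), count(distinct word) from randomwords;
select * from randomwords limit 5;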

How long would it take to generate 10,000,000 rows?

\timing
--- generate 10,000,000 alphanumeric 10-character words and random numbers
drop table if exists randomwords;
create table randomwords as
(select substr(md5(random()::text), 1, 10) as word, 
        (random()*100)::integer as num
        from generate_series(1, 10000000));

You can also make use of the explain analyze feature in postgres. explain by itself gives you information on the query plan. Combined with analyze, it actually executes the query and reports running times, including those of intermediate query steps. There is usually some additional time overhead to using this feature.

--- generate 10,000,000 alphanumeric 10-character words and random numbers
drop table if exists randomwords;
explain analyze create table randomwords as
(select substr(md5(random()::text), 1, 10) as word, 
        (random()*100)::integer as num
        from generate_series(1, 10000000));

B.2 Run a Mini Experiment

Now that you have a data set, or at least a synthetic one that resembles your actual data or some projected future scale of it, you can start experimenting with different organizations.

Suppose we often have a workload that executes the following query:

select * from randomwords where num > x;
--- x is some number between 0 and 100.

You are trying to decide whether you should build (or keep) an index on num for your table randomwords to support the above workload of range queries. Here are some of the questions you may ask; each one can be turned into a direct mini-experiment that will help guide you!

  • Would an index on num help?
  • Does the value of \(x\) matter?
  • What if we had an equality instead of an inequality?
  • Does the scale of the data matter (1,000 rows vs. 10,000 vs. 1,000,000 vs. 10,000,000)?
  • Does the index add an overhead to single/batch insertions/deletions?

For any such mini-experiment, though, you have to lay out your baseline(s) and alternative(s).

  • Baseline: The randomwords table as is, without any indexes (let's get our experimental framework up and running on a small table of 1000 rows before doing longer-running scaling experiments where we increase the table size).
  • Indexed: Let’s go for the standard B+ tree index on the num column:
--- create a duplicate table to allow us to compare both variants
create table idxrandomwords as (select * from randomwords);
create index on idxrandomwords(num);
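
Postgres generates a name for this new index automatically (it shows up below as idxrandomwords_num_idx); you can double-check the name and the rest of the table's metadata with psql:

\d idxrandomwords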

Now that we have the two variants, which one is better?

explain (analyze, timing) select * from randomwords where num > 10;
explain (analyze, timing) select * from idxrandomwords where num > 10;

I get two values for execution time: 0.338ms and 0.343ms. The run times are almost identical.

Conclusion: The index makes no difference!

That hardly counts as an experiment! Maybe you need to run several iterations?

I run ten iterations of each query, and these are the runtimes I get:

Baseline (ms)   With Index (ms)
0.286           0.283
0.280           0.279
0.278           0.296
0.282           0.282
0.279           0.279
0.300           0.279
0.287           0.278
0.278           0.278
0.278           0.322
0.425           0.324

The average run times for the two variants are 0.297 ms and 0.290 ms… why does the index not help?

Well, in this case you should have paid attention to the full output of explain analyze, not only the reported execution times. In both cases, no index is used.

So why? Perhaps the condition num > 10 is not selective enough, so it makes no sense to use an index (i.e. with num > 10 we return too many rows, or even most of the table, and the index gives us little benefit in terms of what it selects). What happens with num > 80 instead?
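
To see how (un)selective these predicates actually are, you can simply count the qualifying rows (a quick check on the side, not part of the timed runs):

select count(*) from randomwords where num > 10;
select count(*) from randomwords where num > 80;

With num > 80, these are the run times I get: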

Baseline (ms)   With Index (ms)
0.191           0.189
0.188           0.183
0.182           0.183
0.182           0.185
0.182           0.182
0.182           0.182
0.184           0.182
0.182           0.183
0.182           0.182
0.181           0.182

Now both queries run faster, but again the index makes no difference. On examining the explain output you notice that, with the index, postgres chooses a Bitmap Index Scan on idxrandomwords_num_idx. What is that? You can try to figure it out from the postgres documentation, or wait until we get to the module on indexing.

Either way, you may be feeling somewhat disappointed that the index is not helping. Maybe it helps only with larger data sets? Let's run another mini-experiment, this time at the 10,000,000-row scale and again with the more selective predicate num > 80.
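
To set this up, you can regenerate both variants with 10,000,000 rows by reusing the statements from the previous sections (a sketch):

--- regenerate the baseline at the larger scale
drop table if exists randomwords;
create table randomwords as
(select substr(md5(random()::text), 1, 10) as word,
        (random()*100)::integer as num
        from generate_series(1, 10000000));
--- ... and the indexed variant
drop table if exists idxrandomwords;
create table idxrandomwords as (select * from randomwords);
create index on idxrandomwords(num);

These are the run times I now get: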

Baseline (ms)   With Index (ms)
1569.388        805.823
1537.577        790.420
1533.247        799.750
1534.532        787.080
1537.513        787.541
1534.559        793.736
1538.403        826.346
1589.860        852.379
1571.073        829.126
1568.213        908.542

These results do tell us that at larger scale the index does help, at least with a more selective query (num > 80); but when we decrease selectivity (i.e. use a condition that allows more rows as results), we get no benefit from the index. At what point exactly (what value of \(x\)) does this happen?

You can actually construct a specific experiment to find that threshold! On my setup, this turning point happens at num > 60: the postgres optimizer switches from a sequential scan to a bitmap heap scan at that point, and the average run times for the baseline and indexed tables are 1788 ms and 1404 ms respectively.
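
One way to find that switch-over point without timing anything is to ask postgres for the plan at several values of \(x\) and look at the top plan node. Here is a rough sketch using a plpgsql do block (it only inspects plans; to measure the actual run times at each value, combine it with the template in Section B.2.1):

do $$
declare
  plan_json text;
begin
  for x in 10..90 by 10 loop
    --- plan (but do not execute) the range query for this value of x
    execute format(
      'explain (format json) select * from idxrandomwords where num > %s', x)
      into plan_json;
    raise notice 'num > %: %', x,
      (plan_json::jsonb) -> 0 -> 'Plan' ->> 'Node Type';
  end loop;
end $$;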

Again, these results are counter-intuitive. This is a very simple range query with an index on an integer column. You should be getting better results.

The issue in this case could be that you chose the wrong kind of index, or that you didn't cluster it. If you are reading this guide before the module on indexing, hang in there or read the postgres docs. Briefly, however: a clustered index stores your data sorted by the indexed column, so when you select a range, postgres simply uses the index to find the starting point and then sequentially scans the data from that point to the end of the range. It should be faster than a bitmap heap scan!

cluster verbose idxrandomwords using idxrandomwords_num_idx;
--- simple sanity check that it worked:
select * from idxrandomwords limit 10;
--- the num values should be in order!

Clustering takes a long while for very large tables. Did it work? Run the mini-experiment again. I get exactly the same results as before!

We still need to inform the DBMS that our index is clustered and our data is now sorted; one way to do this is to run the following:

--- the vacuum is actually not needed here, but it helps if you have a table that has sustained a lot of insertions/deletions
vacuum analyze idxrandomwords;

The average run time for selecting rows where num > 60 has dropped for me from 1404 ms down to 1157 ms, and the query plan now shows that postgres uses an Index Scan instead of the Bitmap Heap Scan. It also now switches from a sequential scan to the index at a lower value of \(x\).

Can you force an index to be used? Try the following and see what the overhead is of using an index scan when the optimizer prefers a sequential scan:

set enable_seqscan = false;
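
Just remember to switch sequential scans back on (or reset the setting) once you are done experimenting:

reset enable_seqscan;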

What happens when the workload is more selective, say an equality predicate num = x? I get average run times of 480 ms vs 28 ms for the baseline vs the indexed table.

Note that you can do even better if you include the word attribute in the index itself; this allows for index-only scans. Oh, so many options, so many experiments.
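
For example, one way to get index-only scans here is a covering index (a sketch: the include clause requires PostgreSQL 11 or later, and a plain multicolumn index on (num, word) is a rough substitute on older versions). Index-only scans also rely on an up-to-date visibility map, which is another reason to vacuum:

create index on idxrandomwords(num) include (word);
vacuum analyze idxrandomwords;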

You need to be somewhat thorough with your experimental setup. That means running several iterations, storing the results of each run, and scripting things in a form that lets you execute the experiments systematically for all kinds of parameters, data set sizes, configurations, etc.

Running explain analyze queries one at a time in a terminal and copy-pasting execution times into a spreadsheet is a very inefficient way of doing things.

B.2.1 Taking Things Up a Notch

Try to use functions and even external scripts to help you. Here is a template that executes a specific query 10 (or more) times.

CREATE OR REPLACE FUNCTION miniexp() returns setof numeric AS $$
DECLARE
  result_str text;
  time numeric;
  iteration integer := 0;
BEGIN
  --- (re)create a table that collects one execution time per iteration
  drop table if exists exectimes;
  create table exectimes(time numeric);
  loop
    exit when iteration >= 10;
    iteration := iteration + 1;
    --- run the query under test and capture its plan as JSON
    EXPLAIN (ANALYZE, TIMING, FORMAT JSON) into result_str 
        SELECT * from randomwords where num > 50;
    --- pull the reported execution time (in ms) out of the JSON plan
    time := ((result_str::jsonb)-> 0 -> 'Execution Time')::numeric;
    --- raise notice '%', time;
    insert into exectimes values (time);
  end loop;
  return query select * from exectimes;
END;
$$ LANGUAGE plpgsql;

select * from miniexp();
select avg(time) from exectimes;

You can add a lot more to this template. For example, you can write a Python script that generates a set of experiments covering the different SQL queries that make up the workload you are trying to benchmark. You can add more queries to the function above and store their run times in an expanded exectimes table that includes additional attributes such as query type and parameter settings. You can add timestamps to exectimes to keep track of when you ran each experiment, and stop dropping the table before every run.
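
As a concrete, purely hypothetical starting point, an expanded results table could look like this (the column names are just a suggestion):

create table exectimes(
    ran_at       timestamptz default now(),  --- when the measurement was taken
    query_label  text,     --- e.g. 'range' or 'equality'
    variant      text,     --- e.g. 'baseline', 'indexed', 'clustered'
    param        integer,  --- e.g. the value of x in num > x
    time         numeric   --- execution time in ms
);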

Why is it a good idea to store experimental results in a database table? Because you can now run all kinds of queries over the results: compute average run times and standard deviations, filter by parameter settings, etc.
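
For instance, with the expanded table sketched above, one query summarizes every configuration:

select variant, param,
       avg(time)         as avg_ms,
       stddev_samp(time) as stddev_ms,
       count(*)          as runs
from exectimes
group by variant, param
order by variant, param;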

B.2.2 Other Considerations

When conducting transaction benchmarks, e.g. measuring the overhead that an index adds to insertions/deletions/modifications, you typically report throughput, i.e. how many transactions you can complete per second.
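
As a crude sketch of the insertion-overhead question from Section B.2, you can time the same batch insert into the unindexed and the indexed table and compare (with \timing on; the batch size is arbitrary):

--- 100,000-row batch into the table without an index
insert into randomwords
  select substr(md5(random()::text), 1, 10), (random()*100)::integer
  from generate_series(1, 100000);
--- the same batch into the indexed (and clustered) table
insert into idxrandomwords
  select substr(md5(random()::text), 1, 10), (random()*100)::integer
  from generate_series(1, 100000);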

Depending on the setup, you may need to control for several DBMS configurations that impact performance. It takes quite a bit of experience to become an expert DB tuner, but it does come with a hefty salary. Here are some factors to consider when benchmarking:

  • auto-commit on or off?
  • should my caches be hot or cold?
  • what is the right buffer size?
  • do I make my index concurrently?
  • what is the right memory size to allocate for join algorithms?

There are lots of other knobs; the documentation is a good place to start.
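
For example, from psql you can inspect and adjust some of these settings (a sketch; the value below is arbitrary, and note that shared_buffers can only be changed in postgresql.conf followed by a server restart):

show shared_buffers;      --- size of the shared buffer cache (server-level setting)
show work_mem;            --- per-operation memory for sorts, hashes, and joins
set work_mem = '256MB';   --- session-level change for a quick experiment
reset work_mem;           --- back to the default when you are done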