B File Layouts in Practice

A database is stored as one or more files on disk. These files are usually in a proprietary binary format, so other applications, and even the OS, know nothing about their contents. Some database systems instead use open formats such as Parquet.

createdb layouts
psql -d layouts
create table students(sid int,
name varchar(40), email varchar(20),
major char(20));
insert into students values (1, 'sami', 'sami@nyu.edu', 'CS');
insert into students values (2, 'muneeb', 'muneeb@nyu.edu', 'CS');
insert into students values (3, 'geon', 'geon@nyu.edu', 'CS');
insert into students values (4, 'sophie', 'sophie@nyu.edu', 'CS');
insert into students values (5, 'will', 'will@nyu.edu', 'CE');

Can you find postgres’s database files?

select current_setting('data_directory') || '/' 
|| pg_relation_filepath('students');

Can you read them?
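
If your role is a superuser (by default these file-access functions are superuser-only), you can even peek at the raw bytes from SQL; a rough sketch, just to confirm the file is an opaque binary page image:

-- read the first 32 bytes of the table's first segment file
-- (the path returned by pg_relation_filepath is relative to the data directory)
select pg_read_binary_file(pg_relation_filepath('students'), 0, 32);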

Each file consists of one or more pages. A page is a fixed-size block of data that can contain tuples, metadata, index entries, log records, etc. Each page has an identifier that may be unique per DBMS instance, per database, or per table.

Can you find the page size of postgres?

show block_size;

The most common layout scheme in row-oriented databases is slotted pages. The slot array maps “slots” to the tuples’ starting position offsets. The header keeps track of the # of used slots and the offset of the starting location of the last slot used.

See how it looks in postgres:

select r.ctid, r.* from students r;
 ctid  | sid |  name  |     email      | major
-------+-----+--------+----------------+-------
 (0,1) |   1 | sami   | sami@nyu.edu   | CS
 (0,2) |   2 | muneeb | muneeb@nyu.edu | CS
 (0,3) |   3 | geon   | geon@nyu.edu   | CS
 (0,4) |   4 | sophie | sophie@nyu.edu | CS
 (0,5) |   5 | will   | will@nyu.edu   | CE

ctid gives us the block or page # and the slot id of each tuple.

What happens when we delete a student?

delete from students where sid=3;

What happens when we insert a student?

insert into students values (6, 'maheen', 'maheen@nyu.edu', 'CS');
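
Re-run the slot listing to see where the new row landed; it will most likely take a fresh slot, e.g. (0,6), rather than reusing slot 3:

select r.ctid, r.* from students r;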

Why was the empty slot not filled? Can we compact this to reduce fragmentation?

vacuum full students;
select r.ctid, r.* from students r;

In a heap file organization, special directory pages track the location of data pages in the database files. They also keep metadata about page contents, such as the amount of free space per page, a list of free or empty pages, and the page type (data vs. metadata).

How can we see some of this information in postgres?

First, let’s add some more data:

delete from students where true;
insert into students
select gs, 'student' || gs, 'st' || gs || '@nyu.edu', 
  (ARRAY['CS','CE','EE','Math'])[1 + floor(random()*4)::int]
from generate_series(1, 1000) AS gs;
create extension pg_freespacemap;
select * from pg_freespace('students');

What happens when we delete half our data?

delete from students where random() < 0.5;
select * from pg_freespace('students');

Why did it not update? What happens when we insert new records?

insert into students
select gs, 'student' || gs, 'st' || gs || '@nyu.edu', 
  (ARRAY['CS','CE','EE','Math'])[1 + floor(random()*4)::int]
from generate_series(1001, 1010) AS gs;
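
The free space map is (mostly) maintained by vacuum; run a plain, non-full vacuum and look again to see the reclaimed space show up:

vacuum students;
select * from pg_freespace('students');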

You can see more information about a page using the pageinspect extension:

create extension pageinspect;
--- to inspect page (block) number 5; block numbers start at 0:
select * from heap_page_items(get_raw_page('students', 5));
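--- the header's lower/upper offsets bound the hole between the slot array
--- and the tuple data (the header bookkeeping described earlier):
select * from page_header(get_raw_page('students', 5));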
--- what happens if you insert some null values? t_bits holds the null bitmap
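
One small experiment to try here (the sid 2001 row is made up for illustration): insert a row with a NULL and inspect the page it lands on.

insert into students values (2001, 'nulltest', NULL, 'CS');
select ctid from students where sid = 2001;  -- note the block number
-- replace 9 below with the block number reported above:
select lp, t_infomask, t_bits
from heap_page_items(get_raw_page('students', 9));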

From this exercise we can already see a few issues with slotted pages:

  • Fragmentation: Deletion of tuples can leave gaps in the pages
  • Useless Disk I/O: Due to the block-oriented nature of non-volatile storage, the whole block needs to be fetched to update a tuple.
  • Random Disk I/O: What happens when I need to update 20 different tuples? (see the sketch below)
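
A rough way to see the last two points in postgres (a sketch; about half of these sids were randomly deleted above, which does not change the picture): update a handful of scattered tuples and watch how many whole pages the buffers counters report.

begin;
explain (analyze, buffers)
update students set major = 'DS'
where sid in (1, 51, 101, 151, 201, 251, 301, 351, 401, 451,
              501, 551, 601, 651, 701, 751, 801, 851, 901, 951);
rollback;  -- undo the experiment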

B.1 Analytical Workloads

Let’s revisit the stock simulation example we examined a few days ago, when we were trying to motivate yet another data model and class of DBMS: the stochastic data model and probabilistic database systems.

But now let’s work with more data!

createdb stocksim;
psql -d stocksim;

Let’s generate 100,000 stocks, each with 1,000 possible price scenarios. The scenarios are normally distributed around some mean mu, with the spread controlled by a relative volatility sigma (the code below draws standard normals via the Box-Muller transform).

We want to find the top-5 least profitable stocks in terms of their expected price.

\timing
create table stocks_arr (
  code char(4),
  prices float[]
);

with params as (
  select
    -- 4-letter code (A..Z base-26)
    chr(65 + ((x/17576) % 26)) ||
    chr(65 + ((x/676)   % 26)) ||
    chr(65 + ((x/26)    % 26)) ||
    chr(65 + ( x        % 26))        AS code,
    5  + 495*random()                 AS mu,       -- $5 .. $500
    0.02 + 0.13*random()              AS sigma     -- 2% .. 15% relative volatility
  from (select gs-1 as x FROM generate_series(1,100000) gs) t
)
insert into stocks_arr (code, prices)
select p.code,
  (select array_agg(
      round(GREATEST(0.01, p.mu + p.mu*p.sigma 
      * (sqrt(-2*ln(GREATEST(1e-12, r1))) * cos(2*pi()*r2)  -- Z ~ N(0,1)
      ))::numeric, 2)::float)
    from (
      select random() AS r1, random() AS r2
      from generate_series(1,1000)) g
  ) as prices
from params p;
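
A quick sanity check (an optional aside) that the generator produced 100,000 codes, each with 1,000 simulated prices:

select count(*) as stocks,
       min(cardinality(prices)) as min_sims,
       max(cardinality(prices)) as max_sims
from stocks_arr;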

Normalizing this table:

create table stocks_tall (
  code char(4),                  
  simid int,
  price float
);

insert into stocks_tall
select code, u.ord, u.val
from stocks_arr
cross join lateral unnest(prices)
  with ordinality as u(val, ord);

A very strong argument was made for storing the data this way rather than in normal form. What were some of those arguments?

What about space utilization?

select count(*) from pg_freespace('stocks_arr');
select count(*) from pg_freespace('stocks_tall');

When thinking about disk space there are several factors to consider. It is not about the bytes stored per se; disk is cheap! It is about how well we make use of memory, and how much time we spend reading disk blocks.
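
A more direct look at the on-disk footprint of the two layouts (pg_total_relation_size includes the TOAST storage where the large arrays end up):

select pg_size_pretty(pg_total_relation_size('stocks_arr'))  as arr_size,
       pg_size_pretty(pg_total_relation_size('stocks_tall')) as tall_size;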

What about performance?

select code, (select avg(v) from unnest(prices) as t(v)) as exp_price 
from stocks_arr order by exp_price limit 5;
select code, avg(price) as exp_price 
from stocks_tall group by code order by exp_price limit 5;

Wait. How is this possible? Part of the problem is that arrays are not native relational data types, so there are no built-in functions that can process them without first unnesting them into relational form and then computing aggregates. So we don’t immediately get the benefits of vectorized execution unless we use the right operations, and even then, certain engines are not amenable to fully vectorized query processing.

You can install the aggs_for_arrays extension:

git clone https://github.com/pjungwir/aggs_for_arrays.git
cd aggs_for_arrays
make && sudo make install

If this doesn’t work right away, you might have a bit of a rough time getting the extension to work.

Now try this UDF, which is specifically designed to efficiently compute the mean of an array:

create extension aggs_for_arrays;
select code, array_to_mean(prices) as exp_price 
from stocks_arr order by exp_price limit 5;

What if I need to consider simulation updates, metadata updates, etc.?
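
For instance, compare what each layout has to do to change one simulated price (a sketch; 'AAAA' happens to be the first code the generator produces): the tall table rewrites one small row, while the array table has to rewrite, and re-TOAST, the entire 1000-element array.

update stocks_tall set price = price * 1.01 where code = 'AAAA' and simid = 1;
update stocks_arr  set prices[1] = prices[1] * 1.01 where code = 'AAAA';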

B.1.1 Logical Design or DBMS Choice?

brew install duckdb
duckdb
.open stocksim
install postgres_scanner;
load postgres_scanner;

.timer on

create table stocks_arr as 
(select * from postgres_scan('host=localhost port=5432 dbname=stocksim', 
'public', 'stocks_arr'));

select * from stocks_arr limit 1;

create table stocks_tall as 
(select * from postgres_scan('host=localhost port=5432 dbname=stocksim', 
'public', 'stocks_tall'));

select * from stocks_tall limit 1;

Now let’s run both queries:

select code, list_avg(prices) as exp_price 
from stocks_arr order by exp_price limit 5;

select code, avg(price) as exp_price from 
stocks_tall group by code order by exp_price limit 5;

Why is the second query faster?
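
DuckDB’s profiler is one way to dig into this (a suggestion):

explain analyze
select code, avg(price) as exp_price
from stocks_tall group by code order by exp_price limit 5;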

What compression techniques is duckdb applying and why?

--- you can get more info on the file layout with this:
select * from pragma_storage_info('stocks_arr');
select * from pragma_storage_info('stocks_tall');