B File Layouts in Practice
A database is stored as one or more files on disk. These files are usually in a proprietary binary format, so other applications, and even the OS, know nothing about their contents. Some database systems, however, use open formats such as Parquet.
insert into students values (1, 'sami', 'sami@nyu.edu', 'CS');
insert into students values (2, 'muneeb', 'muneeb@nyu.edu', 'CS');
insert into students values (3, 'geon', 'geon@nyu.edu', 'CS');
insert into students values (4, 'sophie', 'sophie@nyu.edu', 'CS');
insert into students values (5, 'will', 'will@nyu.edu', 'CE');

Can you find postgres’s database files?
Can you read them?
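A sketch of one way to find them from inside psql (paths will differ on your machine):

-- where the server keeps its data
show data_directory;
-- the file backing the students table, relative to the data directory
select pg_relation_filepath('students');

Opening that file in a text editor or with hexdump shows opaque binary pages rather than readable rows.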
Each file consists of one or more pages. A page is a fixed-size block of data that can contain tuples, meta-data, indexes, log records, etc. Each page has an identifier that may be unique per DBMS instance, per database, or per table.
Can you find the page size of postgres?
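A sketch of one way to check: the block_size setting reports the compile-time page size (8 kB by default):

show block_size;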
The most common layout scheme in row-oriented databases is the slotted page. The slot array maps “slots” to the starting offsets of the tuples, and the header keeps track of the number of used slots and the offset of the starting location of the last slot used.
See how it looks in postgres
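The ctid system column exposes each tuple's page number and slot; a query along these lines (a sketch) produces the listing below:

select ctid, * from students;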
| ctid | sid | name | email | major |
|---|---|---|---|---|
| (0,1) | 1 | sami | sami@nyu.edu | CS |
| (0,2) | 2 | muneeb | muneeb@nyu.edu | CS |
| (0,3) | 3 | geon | geon@nyu.edu | CS |
| (0,4) | 4 | sophie | sophie@nyu.edu | CS |
| (0,5) | 5 | will | will@nyu.edu | CE |
ctid gives us the block or page # and the slot id of each tuple.
What happens when we delete a student?
What happens when we insert a student?
Why was the empty slot not filled? Can we compact this to reduce fragmentation?
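A sketch of the experiment (the specific sid and the inserted values are just examples):

delete from students where sid = 3;
select ctid, * from students;   -- slot (0,3) is gone; the other tuples keep their ctids
insert into students values (6, 'lina', 'lina@nyu.edu', 'DS');
select ctid, * from students;   -- the new row typically gets a fresh slot rather than reusing the hole
vacuum full students;           -- rewrites the table, compacting the page
select ctid, * from students;   -- ctids are renumbered with no gaps

A plain vacuum only marks the dead slot as reusable for future inserts; vacuum full actually rewrites the file.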
In a heap file organization, special directory pages track the location of data pages in the database files. They also keep meta-data about pages’ contents such as the amount of free space per page, a list of free or empty pages and page type (data vs. meta-data).
How to see some of this information in postgres?
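One tool for peeking at this bookkeeping (a sketch; the free space map is only refreshed by vacuum) is the pg_freespacemap extension:

create extension pg_freespacemap;
-- at any point, show the tracked free space per block of the students table:
select * from pg_freespace('students');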
First, let’s add some more data
delete from students where true;
insert into students
select gs, 'student' || gs, 'st' || gs || '@nyu.edu',
(ARRAY['CS','CE','EE','Math'])[1 + floor(random()*4)::int]
from generate_series(1, 1000) AS gs;

What happens when we delete half our data?
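A sketch of one way to observe it, using the built-in size functions (sid <= 500 is just one way to drop half the rows):

select pg_size_pretty(pg_table_size('students')) as size_before;
delete from students where sid <= 500;
select pg_size_pretty(pg_table_size('students')) as size_after;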
Why did it not update? What happens when we insert new records?
insert into students
select gs, 'student' || gs, 'st' || gs || '@nyu.edu',
(ARRAY['CS','CE','EE','Math'])[1 + floor(random()*4)::int]
from generate_series(1001, 1010) AS gs;

You can see more information about a page using the pageinspect extension:
create extension pageinspect;
-- to see page (block) number 5; block numbers start at 0:
select * from heap_page_items(get_raw_page('students', 5));
-- what happens if you insert some null values? t_bits holds the null bitmap

From this exercise we can see a few issues already with slotted pages:
- Fragmentation: Deletion of tuples can leave gaps in the pages
- Useless Disk I/O: Due to the block-oriented nature of non-volatile storage, the whole block needs to be fetched to update a tuple.
- Random Disk I/O: What happens when I need to update 20 different tuples? (See the sketch after this list.)
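A small sketch of the last point, assuming the roughly 1000-row students table from above (the new major value and the predicate, which matches about 20 rows, are just examples):

select ctid from students where sid % 25 = 0;          -- the matching tuples are scattered across many different pages
update students set major = 'DS' where sid % 25 = 0;   -- each of those pages must be fetched and written back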
B.1 Analytical Workloads
Let’s revisit the stock simulation example we examined a few days ago, when we used it to motivate yet another data model and DBMS: the stochastic data model and probabilistic database systems.
But now let’s work with more data!
Let’s generate 100,000 stocks, each with 1000 possible price scenarios. The scenarios are normally distributed around some mean \(\mu\) with some variance \(\sigma^2\).
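Concretely, the generation script below draws each scenario price as \(\max(0.01,\ \mu + \mu\sigma Z)\), where \(Z \sim N(0,1)\) comes from two uniform draws \(U_1, U_2\) via the Box–Muller transform \(Z = \sqrt{-2\ln U_1}\,\cos(2\pi U_2)\); here \(\sigma\) acts as a relative (percentage) volatility.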
We want to find the top-5 least profitable stocks in terms of their expected price.
\timing
create table stocks_arr (
code char(4),
prices float[]
);
with params as (
select
-- 4-letter code (A..Z base-26)
chr(65 + ((x/17576) % 26)) ||
chr(65 + ((x/676) % 26)) ||
chr(65 + ((x/26) % 26)) ||
chr(65 + ( x % 26)) AS code,
5 + 495*random() AS mu, -- $5 .. $500
0.02 + 0.13*random() AS sigma -- 2% .. 15% relative volatility
from (select gs-1 as x FROM generate_series(1,100000) gs) t
)
insert into stocks_arr (code, prices)
select p.code,
(select array_agg(
round(GREATEST(0.01, p.mu + p.mu*p.sigma
* (sqrt(-2*ln(GREATEST(1e-12, r1))) * cos(2*pi()*r2) -- Z ~ N(0,1)
))::numeric, 2)::float)
from (
select random() AS r1, random() AS r2
from generate_series(1,1000)) g
) as prices
from params p;

Normalizing this table:
create table stocks_tall (
code char(4),
simid int,
price float
);
insert into stocks_tall
select code, u.ord, u.val
from stocks_arr
cross join lateral unnest(prices) with ordinality as u(val, ord);

A very strong argument was made for not normalizing, i.e., for storing the data in the array layout rather than in normal form. What were some of the arguments?
What about space utilization?
When thinking about disk space, there are several factors to consider. It is not about the number of bytes stored per se; disk is cheap. It is about how well we make use of memory and how much time we spend reading disk blocks.
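A sketch of one way to compare the two layouts, using the standard size functions (pg_total_relation_size includes TOAST storage, which is where the large arrays typically end up):

select pg_size_pretty(pg_total_relation_size('stocks_arr')) as arr_size,
       pg_size_pretty(pg_total_relation_size('stocks_tall')) as tall_size;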
What about performance?
select code, (select avg(v) from unnest(prices) as t(v)) as exp_price
from stocks_arr order by exp_price limit 5;

Wait. How is this possible? Part of the problem is that arrays are not native relational data types, so there are no native functions that can process them without first unnesting them into relational form and then computing aggregates. So we do not immediately get the benefits of vectorized operations unless we use the right operations, and even then, certain engines are not amenable to fully vectorized query processing.
You can install the aggs_for_arrays extension
git clone https://github.com/pjungwir/aggs_for_arrays.git
cd aggs_for_arrays
make && sudo make install

If this doesn’t work right away, you might have a bit of a rough time getting the extension to work.
Now try this UDF, which is specifically designed to efficiently compute the mean of an array:
create extension aggs_for_arrays;
select code, array_to_mean(prices) as exp_price
from stocks_arr order by exp_price limit 5;

What if I need to consider simulation updates, metadata updates, etc.?
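A sketch of the trade-off (the code 'AAAA' and the scenario number are just example values): updating one scenario in the array layout rewrites the entire 1000-element array, while the tall layout only writes one small tuple:

-- array layout: the whole prices array (and its TOASTed copy) is rewritten
update stocks_arr set prices[10] = 42.0 where code = 'AAAA';
-- tall layout: only a single narrow row is affected
update stocks_tall set price = 42.0 where code = 'AAAA' and simid = 10;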
B.1.1 Logical Design or DBMS Choice?
Let’s load both tables into DuckDB, reading them from the running postgres database via the postgres_scanner extension:

.open stocksim
install postgres_scanner;
LOAD postgres_scanner;
.timer on
create table stocks_arr as
(select * from postgres_scan('host=localhost port=5432 dbname=stocksim',
'public', 'stocks_arr'));
select * from stocks_arr limit 1;
create table stocks_tall as
(select * from postgres_scan('host=localhost port=5432 dbname=stocksim',
'public', 'stocks_tall'));
select * from stocks_tall limit 1;

Now let’s run both queries:
select code, list_avg(prices) as exp_price
from stocks_arr order by exp_price limit 5;
select code, avg(price) as exp_price from
stocks_tall group by code order by exp_price limit 5;

Why is the second query faster?
What compression techniques is duckdb applying and why?
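One way to peek at this (a sketch; the exact output columns vary by DuckDB version) is the storage_info pragma, which reports the compression scheme chosen for each column segment:

pragma storage_info('stocks_tall');
pragma storage_info('stocks_arr');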