A SQL Deep Dive

In this worksheet, you will work through some SQL constructs. You can execute these queries in either postgres or sqlite. Since it is preferable to complete these queries in postgres, here are the instructions to install postgres.

A.1 Install Postgres

On Mac OS X, if you have never installed a postgres instance before, I recommend installing postgres from here: http://postgresapp.com/ or using brew.

brew install postgresql

You will need to set your paths as well, see instructions here: http://postgresapp.com/documentation/cli-tools.html to be able to run the command line tools from Terminal.

Or after brew terminates it will print out instructions on how to correctly set up your paths: follow them!

You will also benefit from installing pgAdmin4 but for later assignments. Go here: https://www.pgadmin.org/download/

Finally, if you don’t have a Mac, you can buy one and follow the steps above ;) OR You can install postgres from here, http://www.postgresql.org/download/, pick your OS and install the pre-built binary. I do not recommend compiling from source.

Check your installation works

First, create a database. In your shell (i.e. not inside the psql shell):

createdb worksheet

If for any reason, you want to start fresh, just drop the whole database:

dropdb worksheet

To open the postgres shell at the created dblp database:

psql worksheet

You might need to provide other information like the host (-h localhost), a username or a password if you created ones.

In psql shell, you can directly type SQL commands:

create table foo(bar numeric);
select * from foo;

Other psql commands can help you such as:

\q -- quit (exit psql)
\h -- help
\d -- list relations
\? -- help for internal commands

A.2 Create & Insert Queries

First, create the following table of companies:

cname	stockprice	country
Canon	65	Japan
GizmoWorks	25	USA
Emaratitech	15	UAE
Hitachi	10	Japan

create table company(
    cname char(20) primary key, 
    stockprice float, 
    country char(20));

and populate it with data:

insert into company values ('GizmoWorks', 25, 'USA');
insert into company values ('Canon', 65, 'Japan');
insert into company values ('EmaratiTech', 15, 'UAE');
insert into company values ('Hitachi', 10, 'Japan');

Now create a table of products:

pname	price	category	manufacturer
Gizmo	19.99	Gadgets	GizmoWorks
MTouch2	200.99	Gadgets	Canon
MultiTouch	203.99	Household	EmaratiTech
PowerGizmo	29.99	Gadgets	GizmoWorks
SingleTouch	149.99	Photography	Canon
Widget	150.00	Photography	Hitachi

create table product (
    pname char(20) primary key, 
    price decimal(9, 2), 
    category char(20), 
    manufacturer char(20), 
    foreign key (manufacturer) references company (cname));

and populate it with data:

insert into Product values ('Gizmo', 19.99, 'Gadgets', 'GizmoWorks');
insert into Product values ('PowerGizmo', 29.99, 'Gadgets', 'GizmoWorks');
insert into Product values ('SingleTouch', 149.99, 'Photography', 'Canon');
insert into Product values ('MultiTouch', 203.99, 'Household', 'EmaratiTech');
insert into Product values ('MTouch2', 200.99, 'Gadgets', 'Canon');
insert into Product values ('Widget', 150.00, 'Photography', 'Hitachi');

You can also create the table without a foreign key and add it later:

alter table Product 
ADD FOREIGN KEY (Manufacturer) REFERENCES Company(CName);

Make sure you check that everything works:

select * from company c, product p where c.cname = p.manufacturer;

In the above query we renamed company to c and product to p: this makes writing queries with lots of conditions and multiple tables easier.

Data Types

There are many data types supported by postgres, check out the documentation: https://www.postgresql.org/docs/current/datatype.html. The main ones:

varchar(n) means variable-length strings up to n characters.
char(n) means a fixed-length, blank padded string
text means a variable unlimited length string.
int, bigint, smallint, numeric, float, decimal(x, y) are all different numeric data types.

There are other datatypes that are supported through extensions such as

PostGIS: supports spatial data types like geometry, geography, point, line, polygon, etc.,
hstore: supports storing sets of key-value pairs within a single column, useful for semi-structured data.
pgvector: supports vector data type for storing and querying vector embeddings, commonly used in AI/ML applications for similarity searches.

SQL is case-insensitive so a table named product can be written as PROduct and keyword SELECT as select, etc. Constants need to be in single quotes ‘abc’ not “abc”. Constants are case sensitive so ‘Abc’ != ‘abc’

A.2.1 The primary-foreign key constraint

Can you delete a tuple from company?

delete from company where cname = 'Canon';

No, you will receive an error message the you are violating foreign key constraint. To force the delete, you can alternatively alter the schema as follows:

\d+ product --- lists all the constraints on a table

alter table product 
drop constraint product_manufacturer_fkey1; 
--- drops the original pk-fk relationship
--- you man want to check what exactly is the constraint called first
--- there are differnces across postgres versions

The following forces a deleting cascade so if you delete a company, the products of that company are deleted too!

alter table Product 
ADD constraint pkfk 
FOREIGN KEY (Manufacturer) REFERENCES Company(CName) 
on delete cascade;

delete from company where cname = 'Canon';

A.3 Selection, Projection & Join Queries

Get product categories

select category from product;

Get distinct product categories

select distinct category from product;

Get gadgets that cost more than 10$ and order them by price and pname

select pname, category, price 
from product 
where category = 'Gadgets' and 
price > 10 
order by price desc, pname;

Now find distinct categories but ordered by pname?

select distinct category 
from product 
order by pname;

When selecting distinct, order by expressions must appear in the select list. This behavior makes sense as the user’s intention on what order of appearance to use is ambiguous.

Find all products under $200 manufactured in Japan; return their names and prices

select pname, price, cname, manufacturer, country  
from product, company 
where price < 200 
and manufacturer = cname 
and country = 'Japan';

Find all Japanese companies that manufacture Gadgets.

select cname 
from company, product 
where manufacturer = cname 
and country = 'Japan' 
and category = 'Gadgets';

Find all Japanese companies that manufacture Gadgets or Photography items

select cname 
from company, product 
where manufacturer = cname 
and country = 'Japan' 
and (category = 'Gadgets' or category = 'Photography');

Find all Japanese companies that manufacture both Gadgets and Photography items.

select cname 
from company, product 
where manufacturer = cname 
and country = 'Japan' 
and (category = 'Gadgets' and category = 'Photography');

Why does the above query return 0 tuples? It expects a single category to be both Gadgets and Photography which is impossible.

There are three ways to do this query.

The first way is to self-join or basically to re-pivot the values in Category to appear as two separate columns, which we can then apply the condition on separately. If you haven’t deleted Canon from your table, it should appear as a result.

select cname from company, product p1, product p2 
where p1.manufacturer = cname 
and p2.manufacturer = cname 
and country = 'Japan' 
and p1.category = 'Gadgets' 
and p2.category = 'Photography';

The second way to write this query is to use set division. Make sure you understand how the division operation in relational algebra works.

$R(x, y) \div S(y) = \pi_x(R) - \pi_x((\pi_x(R) \times S) - R)$:

--- let's make a table, S(y), of the categories a company must have.
select * from (values ('Gadgets'), ('Photography')) as S(y); 

--- take the cross product of y with manufacturers.
select manufacturer, y from 
product cross join 
  (select * from (values ('Gadgets'), ('Photography')) as y) as S(y);

--- subtract out the actual manufacturer, category tuples, 
--- this is the bad set
select manufacturer, y from product cross join 
  (select * from (values ('Gadgets'), ('Photography')) as y) as S(y) 
except
  select manufacturer, category from product;
    
--- select manufactures in the bad set W
select manufacturer from 
  (select manufacturer, y from product cross join 
  (select * from (values ('Gadgets'), ('Photography')) as y) as S(y) 
except
  select manufacturer, category from product) as W;

--- now the whole query, select companies not in the bad set W  
select cname from company c where country = 'Japan' and
      cname not in (
        select manufacturer from 
          (select manufacturer, y from product cross join 
            (select * from (values ('Gadgets'), ('Photography')) as y) 
            as S(y) 
          except
            select manufacturer, category from product) as W);

The third way would be to use aggregrates or indicator counts. To have temporary tables that only last per query, use the keyword with.

with v(x) as (select * from (values ('Gadgets'), ('Photography')))
select manufacturer from 
product join v on category = x 
group by manufacturer 
having count(distinct category) = (select count(*) from v);

A.4 Nested Queries or Sub-Queries

Let’s prepare the database again. This time let’s make this about food!

Here is the company table:

cname	stockprice	country
sushiyo	20	japan
shakeshack	25	usa
mandi	5	uae
biryanipot	10	india
mcdonalds	10	usa

Here is the product table:

pname	price	quantity	category	cname
BigMac	15	100	burger	mcdonalds
ShackBurger	30	50	burger	shakeshack
ShroomBurger	25	60	burger	shakeshack
MacArabia	18	120	burger	mcdonalds
ShrimpTempura	45	30	seafood	sushiyo
BeefBiryani	30	30	meat	biryanipot
ShrimpBiryani	40	40	seafood	biryanipot
LambBiryani	50	50	meat	mandi

drop table if exists company cascade;
--- cascade here will drop the primary-foreing key constraint only
--- not the entire product table!
create table company (cname char(20) primary key, 
stockprice int, 
country char(20));

drop table if exists product;
create table product (pname char(20) primary key, 
price int, 
quantity int, 
category char(20), 
cname char(20), 
foreign key (cname) references company (cname));

insert into company values ('sushiyo', 20, 'japan');
insert into company values ('shakeshack', 25, 'usa');
insert into company values ('mandi', 5, 'uae');
insert into company values ('biryanipot', 10, 'india');
insert into company values ('mcdonalds', 10, 'usa');

insert into product values ('bigmac', 15, 100,  'burger', 'mcdonalds');
insert into product values ('shackburger', 30, 100,  'burger', 'shakeshack');
insert into product values ('shroomburger', 25, 100,  'burger', 'shakeshack');
insert into product values ('macarabia', 18, 120 ,  'burger', 'mcdonalds');
insert into product values ('shrimptempura', 45, 30,  'seafood', 'sushiyo');
insert into product values ('beefbiryani', 35, 30,  'meat', 'biryanipot');
insert into product values ('shrimpbiryani', 40, 40,  'seafood', 'biryanipot');
insert into product values ('lambbiryani', 50, 50,  'meat', 'mandi');

For each product return the country it was made in.

select pname, (select country from company 
              where company.cname = p.cname) 
from product p;

The above query works correctly but has unnecessary nesting. You should always flatten such queries:

select pname, country 
from company, product 
where company.cname = product.cname;

Find the number of products made by each company.

select cname, (select count(*) from product p 
              where p.cname = c.cname) 
from company c;

The above query works correctly but has unnecessary nesting. You should always flatten such queries. Make sure you use the group by clause.

select p.cname, count(*) 
from company, product p 
where p.cname = company.cname
group by p.cname;

The above query is correct but it eliminates companies that have no products. How will you fix that? See the section below on NULLs.

Find companies that make some products with price < 20

select distinct c.cname 
from company c, product p 
where c.cname = p.cname 
and p.price < 20;

Let’s look at the query above, there are alternative ways of expressing the same query as above that use other set semantics:

exists

select c.cname from company c 
where exists (select * from product p 
              where p.price < 20 
              and p.cname = c.cname);

select c.cname from company c 
where c.cname in (select cname from product p 
                  where p.price < 20);

any

select c.cname from company c 
where 20 > ANY (select p.price from product p 
              where p.cname = c.cname);

The above clauses make it easy to specify queries that negate these properties. For example:

Find companies that make only products with price <= 25.

not exists

select c.cname from company c 
where not exists (select * from product p 
                        where p.price > 25 
                        and p.cname = c.cname);

not in

select c.cname from company c 
where c.cname not in (select p.cname from product p 
                      where p.price > 25);

all

select c.cname from company c
where 25 > ALL (select p.price from product p 
                  where p.cname = c.cname);

Can we always unnest these queries?

We just described two different kinds of queries: existential and universal ones. Most people find universal queries very trick to write! In an experiment I ran with DB experts, only 2/13 got the correct query in 10 minutes!

There is an important distinction in the semantics of ANY(empty set) vs. ALL (empty set). “x > ANY (empty set)” will always evaluate to false, but “x > ALL(empty set)” will always evaluate to true!

To see how empty sets interact with “all” and “any”. First insert a company that has no products:

insert into company values ('notfoodco', '100', 'uae');

Now execute the following queries:

any(empty set)

select c.cname from company c 
where 0 > ANY (select p.price from product p 
              where p.cname = c.cname);

all(empty set)

select c.cname from company c 
where 0 > ALL (select p.price from product p 
              where p.cname = c.cname);

Why does the any query return 0 companies but the all query returns notfoodco, which has no products?

This is somewhat odd but the way to read these queries is as follows:

Read the any query as follows: return companies where 0 is greater than at least one of their product prices.
Read the all as follows: return companies where none of their products has a price less than 0.

You could now see how a company that has no products is an answer to the second query because none of its products cost less than 0 dollars!

A.5 Logical SQL Query Evaluation Order

Step	Clause	What it does
1	`FROM`	Loads and joins data sources (tables, subqueries, joins, etc.)
2	`WHERE`	Filters rows before grouping or windowing; cannot see column aliases
3	`GROUP BY`	Groups rows by specified columns (if using aggregation)
4	`HAVING`	Filters groups (after aggregation); can use group-level aggregates
5	`WINDOW`	Applies window functions like `ROW_NUMBER`, `RANK`, etc.
6	`SELECT`	Chooses which columns or expressions to output; assigns column aliases
7	`DISTINCT`	Removes duplicates (if used)
8	`ORDER BY`	Sorts the final result set (can see aliases defined in `SELECT`)
9	`LIMIT`, `OFFSET`	Limits the number of returned rows

A few things to note:

WHERE and GROUP BY cannot see SELECT aliases. Use raw expressions there!
LATERAL subqueries are part of the FROM clause, so they are evaluated during FROM, but row-by-row. All items in the FROM clause are evaluated independently (in parallel). But when you mark a subquery (or function/table expression) as LATERAL, it tells the engine: “Evaluate this subquery once per row of the previous FROM item(s), and let me refer to them.”

A.6 Aggregation Queries

The main ones are: sum, count, avg, min, max. Except ‘count’ all aggregates apply to a single attribute (e.g. sum(a) or sum(a+b) — here a+b is an expression that produces a single attribute); you can apply count(a, b).

Count counts duplicates unless you use count distinct.

Let’s try a few queries:

Find the number of products in each category with price < 20

select category, count(*) from product 
where price < 20;

At what cost, can you empty out the food inventory?

select sum(price * quantity) from product;

At what cost, can you buy out all burgers?

select sum(price * quantity) from product 
where category = 'burger';

Find total quantities per category for all products with prices over 20 AED.

select category, sum(quantity) 
from product 
where price > 20 
group by category;

Find total quantities and maximum price by category

select category, max(price), sum(quantity) 
from product 
group by category;

Find total quantities and maximum price by category but only for categories with more than 70 total units.

select sum(quantity) as total, max(price) 
from product 
group by category 
having total > 70;

A.7 Finding Witnesses

Find for each company, its most expensive product.

Let’s breakdown this query. First, we find the maximum price of each company’s products:

select cname, max(price) as mp
from product 
group by cname;

Now let’s compare that price with each product’s price.

select p.pname, p.price, x.mp 
from product p, 
    (select cname, max(price) as mp 
    from product group by cname) as x
where p.price = x.mp 
and p.cname = x.cname;

Here is a more compact solution (albeit not as efficient):

select p.pname, p.price 
from product p 
where p.price >= ALL (select price from product x 
                    where x.cname = p.cname);

You can use lateral joins instead to express it. The keyword `lateral’ simply forces an order on the query so you can reference values in the outer table from the inner table of the join.

select c.cname, p.pname, p.price from company c
left join lateral
  (select * from product p
    where p.cname = c.cname
    order by p.price DESC
    limit 1
  ) p ON true;

Or this

select distinct on (cname) * from product
order by cname, price desc;

Or this, which uses window functions

select cname, pname, price from (
  select p.*, row_number() over (partition by cname order by price desc) as rn
  from product p
) where rn = 1;

Why can’t we flatten this?

select p.*, row_number() over (partition by cname order by price desc) as rn
from product p
where rn = 1;

Look back at the logical order of evaluation, where conditions are applied before computing window functions!

A.8 Fun with Nulls

What are nulls?

When we don’t have a value we put a NULL. Nulls mean several things:

value doesn’t exist
value exists but we don’t know it
value not applicable

A schema can state explicitly whether an attribute can be null or not. Are primary keys nullable? No! because you can’t use an ambiguous value like NULL to uniquely identify a row.

How does SQL interpret NULLS?

Arithmetic: If $x$ = NULL then operations on $x$ $\rightarrow$ null. e.g. 4 + $x$ = NULL
Booelean: In boolean logic, true has a value of 1, false has a value of 0. null has a value of 0.5. Thus, let $x$ = null.

If $y=1$ then, ($x$ AND $y$) = min($x, y$) = $x = 0.5$ and ($x$ OR $y$) = max($x, y$) = 1.
If $y = 0$, then ($x$ AND $y$) = 0; ($x$ OR $y$) = 0.5
(NOT $x$) = (1 - $x$) = 0.5 = null
Boolean operations on $x$, such as select (null ='Joe'); returns null
- What does select (null = null); return?

Let’s add data with null values into the database

insert into company values ('freezerfoods', NULL, 'norway');
insert into company values ('worldfoods', 100, NULL);

SQL only returns tuples that yield true for a selection condition. What happens with?

select * from company where 
stockprice > 10 or stockprice <= 10;

Now let’s run some aggregation queries with null values.

Find the number of companies in each country

select country, count(*) from company group by country;

The query groups all companies with null as the country together. count(*) counts rows.

    select country, count(country) from company group by country;

The above query does not return the correct count for null countries. It shows 0 for null. Why?

Except for the count(*) function, all aggregations perform a null-elimination step, so that nulls are not included in the final result of the calculation. This means that select avg(i) - (sum(i)/count(*)) from R; is not zero if you have null values in column $i$.

select distinct country, (select count(*) from company c 
                            where c.country = x.country) 
from company x;

The above query also does not return the correct count for null countries (It shows 0 for null). Note distinct country does not eliminate nulls, neither does count(*). However, c.country = x.country is a selection condition, which means only true values are returned and the null rows are eliminated because what does null = null mean after all?

Find the number of products made in each country.

select country, count(p.pname) 
from product p, company 
where p.cname = company.cname 
group by country;

The above query eliminates both ‘norway’ and null country companies, why?

select distinct country, (select count(*) from product p 
                          where p.cname = c.cname) 
from company c 
group by country, cname;

The above query counts by company not by country. So it will provide incorrect counts per country!

Here is a nice solution to the problem:

select country, sum(
          (select count(*) from product p 
          where p.cname = c.cname)) 
from company c 
group by country;

To see why the above query works, try understanding first, what the subquery does:

select country, (select count(*) from product p 
                  where p.cname = c.cname) 
from company c;

A.9 Joins

Add a new table called featured

pname	month	feature
ShroomBurger	9	timeout
MacArabia	9	national
lambbiryani	8	gulfnews
shrimpbiryani	3	timeout

create table featured (pname char(20), month int, feature char(20));

insert into featured values ('shroomburger', 9, 'timeout');
insert into featured values ('macarabia', 9, 'national');
insert into featured values ('lambbiryani', 8, 'gulfnews');
insert into featured values ('shrimpbiryani', 3, 'timeout');
insert into featured values ('shroomburger', 7, 'timeout');

Now let’s runs some queries:

Find out in which outlets products were featured

select p.pname, price 
from product p, featured f 
where p.pname = f.pname;

What if you want to also show products that were never featured?

select p.pname, price, feature 
from product p left outer join featured f 
on p.pname = f.pname;

For each product, find the number of features it had in September (9)

select p.pname, count(month) 
from product p left outer join featured f 
on p.pname = f.pname 
where month = 9 
group by p.pname;

What if we want to show products that were never featured in September with a count of 0? The query above eliminates month != 9 from the where condition.

select p.pname, count(month) 
from product p left outer join featured f 
on p.pname = f.pname and f.month = 9 
group by p.pname;

To understand why the above query where works, lets understand the semantics of the left outer join. Let $P$ stand for product and $F$ stand for featured relations. Then \[P ⟕_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F\] \[\equiv\] \[P \bowtie_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F \cup (P - (P \bowtie_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F)) \times \{\text{null}\}\]

Equivalently, you can write this in SQL as follows:

select p.pname, count(month) 
from product p join featured f on p.pname = f.pname and f.month = 9 
group by p.pname

UNION all

select p.pname, count(cast(NULL as integer)) 
from product p 
where not exists 
  (select * from product p1 join featured f 
  on p1.pname = f.pname and f.month = 9 
  and p1.pname = p.pname) 
group by p.pname

Alternatively, you can also write the following query:

select p.pname, sum(coalesce(cast(month=9 as integer), 0))
from product p left outer join featured f on p.pname = f.pname 
group by p.pname;

To understand how the above query works, break it down into parts:

--- get features per product
select p.pname, month
from product p left outer join featured f 
on p.pname = f.pname;
    
--- set isFeatured to 1 if product was featured in 9, 
--- 0 otherwise (including nulls).
select p.pname, coalesce(cast(month=9 as integer), 0) as isFeatured
from product p left outer join featured f 
on p.pname = f.pname;

12 Recovery

B File Layouts in Practice