A SQL Deep Dive
In this worksheet, you will work through some SQL constructs. You can execute these queries in either postgres or sqlite. Since it is preferable to complete these queries in postgres, here are the instructions to install postgres.
A.1 Install Postgres
On Mac OS X, if you have never installed a postgres instance before, I recommend installing postgres from here: http://postgresapp.com/ or using brew.
You will need to set your paths as well, see instructions here: http://postgresapp.com/documentation/cli-tools.html to be able to run the command line tools from Terminal.
Or after brew terminates it will print out instructions on how to correctly set up your paths: follow them!
You will also benefit from installing pgAdmin4 but for later assignments. Go here: https://www.pgadmin.org/download/
Finally, if you don’t have a Mac, you can buy one and follow the steps above ;) OR You can install postgres from here, http://www.postgresql.org/download/, pick your OS and install the pre-built binary. I do not recommend compiling from source.
Check your installation works
First, create a database. In your shell (i.e. not inside the psql shell):
If for any reason, you want to start fresh, just drop the whole database:
To open the postgres shell at the created dblp database:
You might need to provide other information like the host (-h localhost), a username or a password if you created ones.
In psql shell, you can directly type SQL commands:
Other psql commands can help you such as:
A.2 Create & Insert Queries
First, create the following table of companies:
cname | stockprice | country |
---|---|---|
Canon | 65 | Japan |
GizmoWorks | 25 | USA |
Emaratitech | 15 | UAE |
Hitachi | 10 | Japan |
and populate it with data:
insert into company values ('GizmoWorks', 25, 'USA');
insert into company values ('Canon', 65, 'Japan');
insert into company values ('EmaratiTech', 15, 'UAE');
insert into company values ('Hitachi', 10, 'Japan');
Now create a table of products:
pname | price | category | manufacturer |
---|---|---|---|
Gizmo | 19.99 | Gadgets | GizmoWorks |
MTouch2 | 200.99 | Gadgets | Canon |
MultiTouch | 203.99 | Household | EmaratiTech |
PowerGizmo | 29.99 | Gadgets | GizmoWorks |
SingleTouch | 149.99 | Photography | Canon |
Widget | 150.00 | Photography | Hitachi |
create table product (
pname char(20) primary key,
price decimal(9, 2),
category char(20),
manufacturer char(20),
foreign key (manufacturer) references company (cname));
and populate it with data:
insert into Product values ('Gizmo', 19.99, 'Gadgets', 'GizmoWorks');
insert into Product values ('PowerGizmo', 29.99, 'Gadgets', 'GizmoWorks');
insert into Product values ('SingleTouch', 149.99, 'Photography', 'Canon');
insert into Product values ('MultiTouch', 203.99, 'Household', 'EmaratiTech');
insert into Product values ('MTouch2', 200.99, 'Gadgets', 'Canon');
insert into Product values ('Widget', 150.00, 'Photography', 'Hitachi');
You can also create the table without a foreign key and add it later:
Make sure you check that everything works:
In the above query we renamed company to c and product to p: this makes writing queries with lots of conditions and multiple tables easier.
Data Types
There are many data types supported by postgres, check out the documentation: https://www.postgresql.org/docs/current/datatype.html. The main ones:
- varchar(n) means variable-length strings up to n characters.
- char(n) means a fixed-length, blank padded string
- text means a variable unlimited length string.
- int, bigint, smallint, numeric, float, decimal(x, y) are all different numeric data types.
There are other datatypes that are supported through extensions such as
- PostGIS: supports spatial data types like geometry, geography, point, line, polygon, etc.,
- hstore: supports storing sets of key-value pairs within a single column, useful for semi-structured data.
- pgvector: supports vector data type for storing and querying vector embeddings, commonly used in AI/ML applications for similarity searches.
SQL is case-insensitive so a table named product can be written as PROduct and keyword SELECT as select, etc. Constants need to be in single quotes ‘abc’ not “abc”. Constants are case sensitive so ‘Abc’ != ‘abc’
A.2.1 The primary-foreign key constraint
Can you delete a tuple from company?
No, you will receive an error message the you are violating foreign key constraint. To force the delete, you can alternatively alter the schema as follows:
alter table product
drop constraint product_manufacturer_fkey1;
--- drops the original pk-fk relationship
--- you man want to check what exactly is the constraint called first
--- there are differnces across postgres versions
The following forces a deleting cascade so if you delete a company, the products of that company are deleted too!
A.3 Selection, Projection & Join Queries
- Get product categories
- Get distinct product categories
- Get gadgets that cost more than 10$ and order them by price and pname
select pname, category, price
from product
where category = 'Gadgets' and
price > 10
order by price desc, pname;
- Now find distinct categories but ordered by pname?
When selecting distinct, order by expressions must appear in the select list. This behavior makes sense as the user’s intention on what order of appearance to use is ambiguous.
- Find all products under $200 manufactured in Japan; return their names and prices
select pname, price, cname, manufacturer, country
from product, company
where price < 200
and manufacturer = cname
and country = 'Japan';
- Find all Japanese companies that manufacture Gadgets.
select cname
from company, product
where manufacturer = cname
and country = 'Japan'
and category = 'Gadgets';
- Find all Japanese companies that manufacture Gadgets or Photography items
select cname
from company, product
where manufacturer = cname
and country = 'Japan'
and (category = 'Gadgets' or category = 'Photography');
- Find all Japanese companies that manufacture both Gadgets and Photography items.
select cname
from company, product
where manufacturer = cname
and country = 'Japan'
and (category = 'Gadgets' and category = 'Photography');
Why does the above query return 0 tuples? It expects a single category to be both Gadgets and Photography which is impossible.
There are three ways to do this query.
The first way is to self-join or basically to re-pivot the values in Category to appear as two separate columns, which we can then apply the condition on separately. If you haven’t deleted Canon from your table, it should appear as a result.
select cname from company, product p1, product p2
where p1.manufacturer = cname
and p2.manufacturer = cname
and country = 'Japan'
and p1.category = 'Gadgets'
and p2.category = 'Photography';
The second way to write this query is to use set division. Make sure you understand how the division operation in relational algebra works.
\(R(x, y) \div S(y) = \pi_x(R) - \pi_x((\pi_x(R) \times S) - R)\):
--- let's make a table, S(y), of the categories a company must have.
select * from (values ('Gadgets'), ('Photography')) as S(y);
--- take the cross product of y with manufacturers.
select manufacturer, y from
product cross join
(select * from (values ('Gadgets'), ('Photography')) as y) as S(y);
--- subtract out the actual manufacturer, category tuples,
--- this is the bad set
select manufacturer, y from product cross join
(select * from (values ('Gadgets'), ('Photography')) as y) as S(y)
except
select manufacturer, category from product;
--- select manufactures in the bad set W
select manufacturer from
(select manufacturer, y from product cross join
(select * from (values ('Gadgets'), ('Photography')) as y) as S(y)
except
select manufacturer, category from product) as W;
--- now the whole query, select companies not in the bad set W
select cname from company c where country = 'Japan' and
cname not in (
select manufacturer from
(select manufacturer, y from product cross join
(select * from (values ('Gadgets'), ('Photography')) as y)
as S(y)
except
select manufacturer, category from product) as W);
The third way would be to use aggregrates or indicator counts. To have temporary tables that only last per query, use the keyword with.
A.4 Nested Queries or Sub-Queries
Let’s prepare the database again. This time let’s make this about food!
Here is the company table:
cname | stockprice | country |
---|---|---|
sushiyo | 20 | japan |
shakeshack | 25 | usa |
mandi | 5 | uae |
biryanipot | 10 | india |
mcdonalds | 10 | usa |
Here is the product table:
pname | price | quantity | category | cname |
---|---|---|---|---|
BigMac | 15 | 100 | burger | mcdonalds |
ShackBurger | 30 | 50 | burger | shakeshack |
ShroomBurger | 25 | 60 | burger | shakeshack |
MacArabia | 18 | 120 | burger | mcdonalds |
ShrimpTempura | 45 | 30 | seafood | sushiyo |
BeefBiryani | 30 | 30 | meat | biryanipot |
ShrimpBiryani | 40 | 40 | seafood | biryanipot |
LambBiryani | 50 | 50 | meat | mandi |
drop table if exists company cascade;
--- cascade here will drop the primary-foreing key constraint only
--- not the entire product table!
create table company (cname char(20) primary key,
stockprice int,
country char(20));
drop table if exists product;
create table product (pname char(20) primary key,
price int,
quantity int,
category char(20),
cname char(20),
foreign key (cname) references company (cname));
insert into company values ('sushiyo', 20, 'japan');
insert into company values ('shakeshack', 25, 'usa');
insert into company values ('mandi', 5, 'uae');
insert into company values ('biryanipot', 10, 'india');
insert into company values ('mcdonalds', 10, 'usa');
insert into product values ('bigmac', 15, 100, 'burger', 'mcdonalds');
insert into product values ('shackburger', 30, 100, 'burger', 'shakeshack');
insert into product values ('shroomburger', 25, 100, 'burger', 'shakeshack');
insert into product values ('macarabia', 18, 120 , 'burger', 'mcdonalds');
insert into product values ('shrimptempura', 45, 30, 'seafood', 'sushiyo');
insert into product values ('beefbiryani', 35, 30, 'meat', 'biryanipot');
insert into product values ('shrimpbiryani', 40, 40, 'seafood', 'biryanipot');
insert into product values ('lambbiryani', 50, 50, 'meat', 'mandi');
- For each product return the country it was made in.
The above query works correctly but has unnecessary nesting. You should always flatten such queries:
- Find the number of products made by each company.
The above query works correctly but has unnecessary nesting. You should always flatten such queries. Make sure you use the group by clause.
The above query is correct but it eliminates companies that have no products. How will you fix that? See the section below on NULLs.
- Find companies that make some products with price < 20
Let’s look at the query above, there are alternative ways of expressing the same query as above that use other set semantics:
- exists
select c.cname from company c
where exists (select * from product p
where p.price < 20
and p.cname = c.cname);
- in
- any
select c.cname from company c
where 20 > ANY (select p.price from product p
where p.cname = c.cname);
The above clauses make it easy to specify queries that negate these properties. For example:
- Find companies that make only products with price <= 25.
- not exists
select c.cname from company c
where not exists (select * from product p
where p.price > 25
and p.cname = c.cname);
- not in
select c.cname from company c
where c.cname not in (select p.cname from product p
where p.price > 25);
- all
select c.cname from company c
where 25 > ALL (select p.price from product p
where p.cname = c.cname);
Can we always unnest these queries?
We just described two different kinds of queries: existential and universal ones. Most people find universal queries very trick to write! In an experiment I ran with DB experts, only 2/13 got the correct query in 10 minutes!
There is an important distinction in the semantics of ANY(empty set) vs. ALL (empty set). “x > ANY (empty set)” will always evaluate to false, but “x > ALL(empty set)” will always evaluate to true!
To see how empty sets interact with “all” and “any”. First insert a company that has no products:
Now execute the following queries:
- any(empty set)
select c.cname from company c
where 0 > ANY (select p.price from product p
where p.cname = c.cname);
- all(empty set)
select c.cname from company c
where 0 > ALL (select p.price from product p
where p.cname = c.cname);
Why does the any query return 0 companies but the all query returns notfoodco, which has no products?
This is somewhat odd but the way to read these queries is as follows:
- Read the any query as follows: return companies where 0 is greater than at least one of their product prices.
- Read the all as follows: return companies where none of their products has a price less than 0.
You could now see how a company that has no products is an answer to the second query because none of its products cost less than 0 dollars!
A.5 Logical SQL Query Evaluation Order
Step | Clause | What it does |
---|---|---|
1 | FROM |
Loads and joins data sources (tables, subqueries, joins, etc.) |
2 | WHERE |
Filters rows before grouping or windowing; cannot see column aliases |
3 | GROUP BY |
Groups rows by specified columns (if using aggregation) |
4 | HAVING |
Filters groups (after aggregation); can use group-level aggregates |
5 | WINDOW |
Applies window functions like ROW_NUMBER , RANK , etc. |
6 | SELECT |
Chooses which columns or expressions to output; assigns column aliases |
7 | DISTINCT |
Removes duplicates (if used) |
8 | ORDER BY |
Sorts the final result set (can see aliases defined in SELECT ) |
9 |
LIMIT , OFFSET
|
Limits the number of returned rows |
A few things to note:
-
WHERE
andGROUP BY
cannot seeSELECT
aliases. Use raw expressions there! -
LATERAL
subqueries are part of theFROM
clause, so they are evaluated duringFROM
, but row-by-row. All items in the FROM clause are evaluated independently (in parallel). But when you mark a subquery (or function/table expression) as LATERAL, it tells the engine: “Evaluate this subquery once per row of the previous FROM item(s), and let me refer to them.”
A.6 Aggregation Queries
The main ones are: sum, count, avg, min, max. Except ‘count’ all aggregates apply to a single attribute (e.g. sum(a) or sum(a+b) — here a+b is an expression that produces a single attribute); you can apply count(a, b).
Count counts duplicates unless you use count distinct.
Let’s try a few queries:
- Find the number of products in each category with price < 20
- At what cost, can you empty out the food inventory?
- At what cost, can you buy out all burgers?
- Find total quantities per category for all products with prices over 20 AED.
- Find total quantities and maximum price by category
- Find total quantities and maximum price by category but only for categories with more than 70 total units.
A.7 Finding Witnesses
Find for each company, its most expensive product.
Let’s breakdown this query. First, we find the maximum price of each company’s products:
Now let’s compare that price with each product’s price.
select p.pname, p.price, x.mp
from product p,
(select cname, max(price) as mp
from product group by cname) as x
where p.price = x.mp
and p.cname = x.cname;
Here is a more compact solution (albeit not as efficient):
select p.pname, p.price
from product p
where p.price >= ALL (select price from product x
where x.cname = p.cname);
You can use lateral joins instead to express it. The keyword `lateral’ simply forces an order on the query so you can reference values in the outer table from the inner table of the join.
select c.cname, p.pname, p.price from company c
left join lateral
(select * from product p
where p.cname = c.cname
order by p.price DESC
limit 1
) p ON true;
Or this
Or this, which uses window functions
select cname, pname, price from (
select p.*, row_number() over (partition by cname order by price desc) as rn
from product p
) where rn = 1;
Why can’t we flatten this?
select p.*, row_number() over (partition by cname order by price desc) as rn
from product p
where rn = 1;
Look back at the logical order of evaluation, where conditions are applied before computing window functions!
A.8 Fun with Nulls
What are nulls?
When we don’t have a value we put a NULL. Nulls mean several things:
- value doesn’t exist
- value exists but we don’t know it
- value not applicable
A schema can state explicitly whether an attribute can be null or not. Are primary keys nullable? No! because you can’t use an ambiguous value like NULL to uniquely identify a row.
How does SQL interpret NULLS?
-
Arithmetic: If \(x\) =
NULL
then operations on \(x\) \(\rightarrow\)null
. e.g. 4 + \(x\) = NULL -
Booelean: In boolean logic,
true
has a value of 1,false
has a value of 0.null
has a value of 0.5. Thus, let \(x\) =null
.
- If \(y=1\) then, (\(x\) AND \(y\)) = min(\(x, y\)) = \(x = 0.5\) and (\(x\) OR \(y\)) = max(\(x, y\)) = 1.
- If \(y = 0\), then (\(x\) AND \(y\)) = 0; (\(x\) OR \(y\)) = 0.5
- (NOT \(x\)) = (1 - \(x\)) = 0.5 =
null
- Boolean operations on \(x\), such as
select (null ='Joe');
returnsnull
- What does
select (null = null);
return?
- What does
Let’s add data with null values into the database
insert into company values ('freezerfoods', NULL, 'norway');
insert into company values ('worldfoods', 100, NULL);
SQL only returns tuples that yield true
for a selection condition. What happens with?
Now let’s run some aggregation queries with null values.
- Find the number of companies in each country
The query groups all companies with null
as the country together. count(*)
counts rows.
The above query does not return the correct count for null
countries. It shows 0 for null
. Why?
Except for the count(*)
function, all aggregations perform a null-elimination step, so that nulls are not included in the final result of the calculation. This means that select avg(i) - (sum(i)/count(*)) from R;
is not zero if you have null values in column \(i\).
select distinct country, (select count(*) from company c
where c.country = x.country)
from company x;
The above query also does not return the correct count for null countries (It shows 0 for null). Note distinct country
does not eliminate nulls, neither does count(*)
. However, c.country = x.country
is a selection condition, which means only true values are returned and the null rows are eliminated because what does null = null
mean after all?
- Find the number of products made in each country.
select country, count(p.pname)
from product p, company
where p.cname = company.cname
group by country;
The above query eliminates both ‘norway’ and null country companies, why?
select distinct country, (select count(*) from product p
where p.cname = c.cname)
from company c
group by country, cname;
The above query counts by company not by country. So it will provide incorrect counts per country!
Here is a nice solution to the problem:
select country, sum(
(select count(*) from product p
where p.cname = c.cname))
from company c
group by country;
To see why the above query works, try understanding first, what the subquery does:
A.9 Joins
Add a new table called featured
pname | month | feature |
---|---|---|
ShroomBurger | 9 | timeout |
MacArabia | 9 | national |
lambbiryani | 8 | gulfnews |
shrimpbiryani | 3 | timeout |
create table featured (pname char(20), month int, feature char(20));
insert into featured values ('shroomburger', 9, 'timeout');
insert into featured values ('macarabia', 9, 'national');
insert into featured values ('lambbiryani', 8, 'gulfnews');
insert into featured values ('shrimpbiryani', 3, 'timeout');
insert into featured values ('shroomburger', 7, 'timeout');
Now let’s runs some queries:
- Find out in which outlets products were featured
- What if you want to also show products that were never featured?
- For each product, find the number of features it had in September (9)
select p.pname, count(month)
from product p left outer join featured f
on p.pname = f.pname
where month = 9
group by p.pname;
- What if we want to show products that were never featured in September with a count of 0? The query above eliminates month != 9 from the where condition.
select p.pname, count(month)
from product p left outer join featured f
on p.pname = f.pname and f.month = 9
group by p.pname;
To understand why the above query where works, lets understand the semantics of the left outer join. Let \(P\) stand for product and \(F\) stand for featured relations. Then \[P ⟕_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F\] \[\equiv\] \[P \bowtie_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F \cup (P - (P \bowtie_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F)) \times \{\text{null}\}\]
Equivalently, you can write this in SQL as follows:
select p.pname, count(month)
from product p join featured f on p.pname = f.pname and f.month = 9
group by p.pname
UNION all
select p.pname, count(cast(NULL as integer))
from product p
where not exists
(select * from product p1 join featured f
on p1.pname = f.pname and f.month = 9
and p1.pname = p.pname)
group by p.pname
Alternatively, you can also write the following query:
select p.pname, sum(coalesce(cast(month=9 as integer), 0))
from product p left outer join featured f on p.pname = f.pname
group by p.pname;
To understand how the above query works, break it down into parts:
--- get features per product
select p.pname, month
from product p left outer join featured f
on p.pname = f.pname;
--- set isFeatured to 1 if product was featured in 9,
--- 0 otherwise (including nulls).
select p.pname, coalesce(cast(month=9 as integer), 0) as isFeatured
from product p left outer join featured f
on p.pname = f.pname;