A SQL Deep Dive

In this worksheet, you will work through some SQL constructs. You can execute these queries in either postgres or sqlite. Since it is preferable to complete these queries in postgres, here are the instructions to install postgres.

A.1 Install Postgres

On Mac OS X, if you have never installed a postgres instance before, I strongly recommend installing postgres from here: http://postgresapp.com/.

You will need to set your paths as well, see instructions here: http://postgresapp.com/documentation/cli-tools.html to be able to run the command line tools from Terminal.

You will also benefit from installing pgAdmin4 but for later assignments. Go here: https://www.pgadmin.org/download/

Finally, if you don’t have a Mac, you can buy one and follow the steps above ;) OR You can install postgres from here, http://www.postgresql.org/download/, pick your OS and install the pre-built binary. I do not recommend compiling from source.

Check your installation works

First, create a database. In your shell (i.e. not inside the psql shell):

createdb worksheet

If for any reason, you want to start fresh, just drop the whole database:

dropdb worksheet

To open the postgres shell at the created dblp database:

psql worksheet

You might need to provide other information like the host (-h localhost), a username or a password if you created ones.

In psql shell, you can directly type SQL commands:

create table foo(bar numeric);
select * from foo;

Other psql commands can help you such as:

\q -- quit (exit psql)
\h -- help
\d -- list relations
\? -- help for internal commands 

A.2 Create & Insert Queries

First, create the following table of companies:

cname stockprice country
Canon 65 Japan
GizmoWorks 25 USA
Emaratitech 15 UAE
Hitachi 10 Japan
create table company(
    cname char(20) primary key, 
    stockprice float, 
    country char(20));

and populate it with data:

insert into company values ('GizmoWorks', 25, 'USA');
insert into company values ('Canon', 65, 'Japan');
insert into company values ('EmaratiTech', 15, 'UAE');
insert into company values ('Hitachi', 10, 'Japan');

Now create a table of products:

pname price category manufacturer
Gizmo 19.99 Gadgets GizmoWorks
MTouch2 200.99 Gadgets Canon
MultiTouch 203.99 Household EmaratiTech
PowerGizmo 29.99 Gadgets GizmoWorks
SingleTouch 149.99 Photography Canon
Widget 150.00 Photography Hitachi
create table product (
    pname char(20) primary key, 
    price decimal(9, 2), 
    category char(20), 
    manufacturer char(20), 
    foreign key (manufacturer) references company (cname));

and populate it with data:

insert into Product values ('Gizmo', 19.99, 'Gadgets', 'GizmoWorks');
insert into Product values ('PowerGizmo', 29.99, 'Gadgets', 'GizmoWorks');
insert into Product values ('SingleTouch', 149.99, 'Photography', 'Canon');
insert into Product values ('MultiTouch', 203.99, 'Household', 'EmaratiTech');
insert into Product values ('MTouch2', 200.99, 'Gadgets', 'Canon');
insert into Product values ('Widget', 150.00, 'Photography', 'Hitachi');

You can also create the table without a foreign key and add it later:

alter table Product 
ADD FOREIGN KEY (Manufacturer) REFERENCES Company(CName);

Make sure you check that everything works:

select * from company c, product p where c.cname = p.manufacturer;

In the above query we renamed company to c and product to p: this makes writing queries with lots of conditions and multiple tables easier.

Data Types

There are many data types supported by postgres, check out the documentation: https://www.postgresql.org/docs/9.5/datatype.html. The main ones:

  • varchar(n) means variable-length strings up to n characters.
  • char(n) means a fixed-length, blank padded string
  • text means a variable unlimited length string.
  • int, bigint, smallint, numeric, float, decimal(x, y) are all different numeric data types.

SQL is case-insensitive so a table named product can be written as PROduct and keyword SELECT as select, etc. Constants need to be in single quotes ‘abc’ not “abc.” Constants are case sensitive so ‘Abc’ != ‘abc’

A.2.1 The primary-foreign key constraint

Can you delete a tuple from company?

delete from company where cname = 'Canon';

No, you will receive an error message the you are violating foreign key constraint. To force the delete, you can alternatively alter the schema as follows:

\d+ product --- lists all the constraints on a table
alter table product 
drop constraint product_manufacturer_fkey; 
--- drops the original pk-fk relationship

The following forces a deleting cascade so if you delete a company, the products of that company are deleted too!

alter table Product 
ADD constraint pkfk 
FOREIGN KEY (Manufacturer) REFERENCES Company(CName) 
on delete cascade;
delete from company where cname = 'Canon';

A.3 Selection, Projection & Join Queries

  1. Get product categories
select category from product;
  1. Get distinct product categories
select distinct category from product;
  1. Get gadgets that cost more than 10$ and order them by price and pname
select pname, category, price 
from product 
where category = 'Gadgets' and 
price > 10 
order by price desc, pname;
  1. Now find distinct categories but ordered by pname?
select distinct category 
from product 
order by pname;

When selecting distinct, order by expressions must appear in the select list. This behavior makes sense as the user’s intention on what order of appearance to use is ambiguous.

  1. Find all products under $200 manufactured in Japan; return their names and prices
select pname, price, cname, manufacturer, country  
from product, company 
where price < 200 
and manufacturer = cname 
and country = 'Japan';
  1. Find all Japanese companies that manufacture Gadgets.
select cname 
from company, product 
where manufacturer = cname 
and country = 'Japan' 
and category = 'Gadgets';
  1. Find all Japanese companies that manufacture Gadgets or Photography items
select cname 
from company, product 
where manufacturer = cname 
and country = 'Japan' 
and (category = 'Gadgets' or category = 'Photography');
  1. Find all Japanese companies that manufacture both Gadgets and Photography items.
select cname 
from company, product 
where manufacturer = cname 
and country = 'Japan' 
and (category = 'Gadgets' and category = 'Photography');

Why does the above query return 0 tuples? It expects a single category to be both Gadgets and Photography which is impossible. We need to self-join to basically re-pivot the values in Category to appear as two separate columns, which we can then apply the condition on separately. If you haven’t deleted Canon from your table, it should appear as a result.

select cname from company, product p1, product p2 
where p1.manufacturer = cname 
and p2.manufacturer = cname 
and country = 'Japan' 
and p1.category = 'Gadgets' 
and p2.category = 'Photography';

Another way to write this query is to use set division. Make sure you understand how the division operation in relational algebra works.

\(R(x, y) \div S(y) = \pi_x(R) - \pi_x((\pi_x(R) \times S) - R)\):

--- let's make a table, S(y), of the categories a company must have.
select * from (values ('Gadgets'), ('Photography')) as S(y); 

--- take the cross product of y with manufacturers.
select manufacturer, y from 
product cross join 
  (select * from (values ('Gadgets'), ('Photography')) as y) as S(y);
--- subtract out the actual manufacturer, category tuples, 
--- this is the bad set
select manufacturer, y from product cross join 
  (select * from (values ('Gadgets'), ('Photography')) as y) as S(y) 
except
  select manufacturer, category from product;
    
--- select manufactures in the bad set W
select manufacturer from 
  (select manufacturer, y from product cross join 
  (select * from (values ('Gadgets'), ('Photography')) as y) as S(y) 
except
  select manufacturer, category from product) as W;
--- now the whole query, select companies not in the bad set W  
select cname from company c where country = 'Japan' and
      cname not in (
        select manufacturer from 
          (select manufacturer, y from product cross join 
            (select * from (values ('Gadgets'), ('Photography')) as y) 
            as S(y) 
          except
            select manufacturer, category from product) as W);

A.4 Nested Queries or Sub-Queries

Let’s prepare the database again. This time let’s make this about food!

Here is the company table:

cname stockprice country
sushiyo 20 japan
shakeshack 25 usa
mandi 5 uae
biryanipot 10 india
mcdonalds 10 usa

Here is the product table:

pname price quantity category cname
BigMac 15 100 burger mcdonalds
ShackBurger 30 50 burger shakeshack
ShroomBurger 25 60 burger shakeshack
MacArabia 18 120 burger mcdonalds
ShrimpTempura 45 30 seafood sushiyo
BeefBiryani 30 30 meat biryanipot
ShrimpBiryani 40 40 seafood biryanipot
LambBiryani 50 50 meat mandi
drop table if exists company cascade;
--- cascade here will drop the primary-foreing key constraint only
--- not the entire product table!
create table company (cname char(20) primary key, 
stockprice int, 
country char(20));
drop table if exists product;
create table product (pname char(20) primary key, 
price int, 
quantity int, 
category char(20), 
cname char(20), 
foreign key (cname) references company (cname));
insert into company values ('sushiyo', 20, 'japan');
insert into company values ('shakeshack', 25, 'usa');
insert into company values ('mandi', 5, 'uae');
insert into company values ('biryanipot', 10, 'india');
insert into company values ('mcdonalds', 10, 'usa');

insert into product values ('bigmac', 15, 100,  'burger', 'mcdonalds');
insert into product values ('shackburger', 30, 100,  'burger', 'shakeshack');
insert into product values ('shroomburger', 25, 100,  'burger', 'shakeshack');
insert into product values ('macarabia', 18, 120 ,  'burger', 'mcdonalds');
insert into product values ('shrimptempura', 45, 30,  'seafood', 'sushiyo');
insert into product values ('beefbiryani', 35, 30,  'meat', 'biryanipot');
insert into product values ('shrimpbiryani', 40, 40,  'seafood', 'biryanipot');
insert into product values ('lambbiryani', 50, 50,  'meat', 'mandi');
  1. For each product return the country it was made in.
select pname, (select country from company 
              where company.cname = p.cname) 
from product p;

The above query works correctly but has unnecessary nesting. You should always flatten such queries:

select pname, country 
from company, product 
where company.cname = product.cname;
  1. Find the number of products made by each company.
select cname, (select count(*) from product p 
              where p.cname = c.cname) 
from company c;

The above query works correctly but has unnecessary nesting. You should always flatten such queries. Make sure you use the group by clause.

select p.cname, count(*) 
from company, product p 
where p.cname = company.cname
group by p.cname;

The above query is correct but it eliminates companies that have no products. How will you fix that? See the section below on NULLs.

  1. Find companies that make some products with price < 20
select distinct c.cname 
from company c, product p 
where c.cname = p.cname 
and p.price < 20;

Let’s look at the query above, there are alternative ways of expressing the same query as above that use other set semantics:

  • exists
select c.cname from company c 
where exists (select * from product p 
              where p.price < 20 
              and p.cname = c.cname);
  • in
select c.cname from company c 
where c.cname in (select cname from product p 
                  where p.price < 20);
  • any
select c.cname from company c 
where 20 > ANY (select p.price from product p 
              where p.cname = c.cname);

The above clauses make it easy to specify queries that negate these properties. For example:

  1. Find companies that make only products with price <= 25.
  • not exists
select c.cname from company c 
where not exists (select * from product p 
                        where p.price > 25 
                        and p.cname = c.cname);
  • not in
select c.cname from company c 
where c.cname not in (select p.cname from product p 
                      where p.price > 25);
  • all
select c.cname from company c
where 25 > ALL (select p.price from product p 
                  where p.cname = c.cname); 

Can we always unnest these queries?

We just described two different kinds of queries: existential and universal ones. Most people find universal queries very trick to write! In an experiment I ran with DB experts, only 2/13 got the correct query in 10 minutes!

There is an important distinction in the semantics of ANY(empty set) vs. ALL (empty set). “x > ANY (empty set)” will always evaluate to false, but “x > ALL(empty set)” will always evaluate to true!

To see how empty sets interact with “all” and “any.” First insert a company that has no products:

insert into company values ('starveco', '100', 'uae'); 

Now execute the following queries:

  • any(empty set)
select c.cname from company c 
where 0 > ANY (select p.price from product p 
              where p.cname = c.cname);
  • all(empty set)
select c.cname from company c 
where 0 > ALL (select p.price from product p 
              where p.cname = c.cname);

Why does the any query return 0 companies but the all query returns starveco, which has no products?

This is somewhat odd but the way to read these queries is as follows:

  • Read the any query as follows: return companies where 0 is greater than at least one of their product prices.
  • Read the all as follows: return companies where none of their products has a price less than 0.

You could now see how a company that has no products is an answer to the second query because none of its products cost less than 0 dollars!

A.5 Aggregation Queries

The main ones are: sum, count, avg, min, max. Except ‘count’ all aggregates apply to a single attribute (e.g. sum(a) or sum(a+b) — here a+b is an expression that produces a single attribute); you can apply count(a, b).

Count counts duplicates unless you use count distinct.

The evaluation order of aggregations

  1. Evaluate the From to Where clauses.
  2. Then we partition into groups using the group attributes
  3. Compute aggregate in the select statement over each group
  4. Apply having clause conditions and return result.

Let’s try a few queries:

  1. Find the number of products in each category with price < 20
select category, count(*) from product 
where price < 20;
  1. At what cost, can you empty out the food inventory?
select sum(price * quantity) from product;
  1. At what cost, can you buy out all burgers?
select sum(price * quantity) from product 
where category = 'burger';
  1. Find total quantities per category for all products with prices over 20 AED.
select category, sum(quantity) 
from product 
where price > 20 
group by category;
  1. Find total quantities and maximum price by category
select category, max(price), sum(quantity) 
from product 
group by category;
  1. Find total quantities and maximum price by category but only for categories with more than 70 total units.
select sum(quantity) as total, max(price) 
from product 
group by category 
having total > 70;

A.6 Finding Witnesses

Find for each company, its most expensive product.

Let’s breakdown this query. First, we find the maximum price of each company’s products:

select cname, max(price) as mp
from product 
group by cname;

Now let’s compare that price with each product’s price.

select p.pname, p.price, x.mp 
from product p, 
    (select cname, max(price) as mp 
    from product group by cname) as x
where p.price = x.mp 
and p.cname = x.cname;

Here is a more compact solution:

select p.pname, p.price 
from product p 
where p.price >= ALL (select price from product x 
                    where x.cname = p.cname);

A.7 Fun with Nulls

What are nulls?

When we don’t have a value we put a NULL. Nulls mean several things:

  • value doesn’t exist
  • value exists but we don’t know it
  • value not applicable

A schema can state explicitly whether an attribute can be null or not. Are primary keys nullable? No! because you can’t use an ambiguous value like NULL to uniquely identify a row.

How does SQL interpret NULLS?

  1. Arithmetic: If \(x\) = NULL then operations on \(x\) \(\rightarrow\) null. e.g. 4 + \(x\) = NULL
  2. Booelean: In boolean logic, true has a value of 1, false has a value of 0. null has a value of 0.5. Thus, let \(x\) = null.
  • If \(y=1\) then, \(x\) AND \(y\) = min(\(x, y\)) = \(x = 0.5\) and \(x\) OR \(y\) = max(\(x, y\)) = 1.
  • If \(y = 0\), then \(x\) AND \(y\) = 0; \(x\) OR \(y\) = 0.5
  • NOT \(x\) = (1 - \(x\)) = 0.5 = null
  • Boolean operations on \(x\), such as select (null ='Joe'); returns null
    • What does select (null = null); return?

Let’s add data with null values into the database

insert into company values ('freezerfoods', NULL, 'norway');
insert into company values ('worldfoods', 100, NULL);

SQL only returns tuples that yield true for a selection condition. What happens with?

select * from company where 
stockprice > 10 or stockprice <= 10;

Now let’s run some aggregation queries with null values.

  1. Find the number of companies in each country
select country, count(*) from company group by country;

The query groups all company with null as the country together. count(*) counts rows.

    select country, count(country) from company group by country;

The above query does not return the correct count for null countries. It shows 0 for null. Why?

Except for the count(*) function, all aggregations perform a null-elimination step, so that nulls are not included in the final result of the calculation. This means that select avg(i) - (sum(i)/count(*)) from R; is not zero if you have null values in column \(i\).

select distinct country, (select count(*) from company c 
                            where c.country = x.country) 
from company x;

The above query also does not return the correct count for null countries (It shows 0 for null). Note distinct country does not eliminate nulls, neither does count(*). However, c.country = x.country is a selection condition, which means only true values are returned and the null rows are eliminated because what does null = null mean after all?

  1. Find the number of products made in each country.
select country, count(p.pname) 
from product p, company 
where p.cname = company.cname 
group by country;

The above query eliminates both ‘norway’ and null country companies, why?

select distinct country, (select count(*) from product p 
                          where p.cname = c.cname) 
from company c 
group by country, cname;

The above query counts by company not by country. So it will provide incorrect counts per country!

Here is a nice solution to the problem:

select country, sum(
          (select count(*) from product p 
          where p.cname = c.cname)) 
from company c 
group by country;

To see why the above query works, try understanding first, what the subquery does:

select country, (select count(*) from product p 
                  where p.cname = c.cname) 
from company c;

A.8 Joins

Add a new table called featured

pname month feature
ShroomBurger 9 timeout
MacArabia 9 national
lambbiryani 8 gulfnews
shrimpbiryani 3 timeout
create table featured (pname char(20), month int, feature char(20));

insert into featured values ('shroomburger', 9, 'timeout');
insert into featured values ('macarabia', 9, 'national');
insert into featured values ('lambbiryani', 8, 'gulfnews');
insert into featured values ('shrimpbiryani', 3, 'timeout');
insert into featured values ('shroomburger', 7, 'timeout');

Now let’s runs some queries:

  1. Find out in which outlets products were featured
select p.pname, price 
from product p, featured f 
where p.pname = f.pname;
  1. What if you want to also show products that were never featured?
select p.pname, price, feature 
from product p left outer join featured f 
on p.pname = f.pname;
  1. For each product, find the number of features it had in September (9)
select p.pname, count(month) 
from product p left outer join featured f 
on p.pname = f.pname 
where month = 9 
group by p.pname;
  1. What if we want to show products that were never featured in September with a count of 0? The query above eliminates month != 9 from the where condition.
select p.pname, count(month) 
from product p left outer join featured f 
on p.pname = f.pname and f.month = 9 
group by p.pname;

To understand why the above query where works, lets understand the semantics of the left outer join. Let \(P\) stand for product and \(F\) stand for featured relations. Then \[P ⟕_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F\] \[\equiv\] \[P \bowtie_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F \cup (P - (P \bowtie_{P.\text{pname} = F.\text{pname} \wedge F.\text{month} = 9} F)) \times \{\text{null}\}\]

Equivalently, you can write this in SQL as follows:

select p.pname, count(month) 
from product p join featured f on p.pname = f.pname and f.month = 9 
group by p.pname

UNION all

select p.pname, count(cast(NULL as integer)) 
from product p 
where not exists 
  (select * from product p1 join featured f 
  on p1.pname = f.pname and f.month = 9 
  and p1.pname = p.pname) 
group by p.pname

Alternatively, you can also write the following query:

select p.pname, sum(coalesce(cast(month=9 as integer), 0))
from product p left outer join featured f on p.pname = f.pname 
group by p.pname;

To understand how the above query works, break it down into parts:

--- get features per product
select p.pname, month
from product p left outer join featured f 
on p.pname = f.pname;
    
--- set isFeatured to 1 if product was featured in 9, 
--- 0 otherwise (including nulls).
select p.pname, coalesce(cast(month=9 as integer), 0) as isFeatured
from product p left outer join featured f 
on p.pname = f.pname;

Here is another much simpler formulation thanks to Cole Beasley (Class of Fall 2021) that plugs in the condition month=9 into the left outer join selection.

select p.pname, count(month)
from product p left outer join featured f 
on p.pname = f.pname and f.month = 9
group by p.pname;

Can you break down this query to understand why baking in the selection condition into the join condition works?