# 4 Normalization

Week 4: February 13 - 19, 2022

- COWS 19.1-19.7

- Normalization on Gradescope.

After logical design, your relational schema may still need refinement especially to eliminate redundancies.

## 4.1 The Theory: Functional Dependencies

As we described in the video above, functional Dependencies (FDs) are a form of integrity constraints. \(\mathbb{A}\rightarrow\mathbb{B}\), which is read \(\mathbb{A}\) determines \(\mathbb{B}\), means that if two tuples in \(R\) have the same values for the attributes in set \(\mathbb{A}\), then they will have the same values for the attributes in set \(\mathbb{B}\).

FDs come from the application semantics and they help us identify redundancies in our schemas.

In the following video, we will examine how to infer any FD that applies on a relation \(R\) from a given set of FDs \(F\) using a set of axioms known as **Armstrong’s axioms**.

We can try and derive the set of **all** the functional dependencies that apply on a relation \(R\) from the inital set \(F\) using the axioms. This set is known as the FD closure \(F^+\). It is expensive to compute this set as it is exponential in the number of attributes in \(R\).

In the following video, we describe the **attribute closure** (\(\mathbb{A}^+\)) of a set of attributes \(\mathbb{A}\), or what are all the attributes that are determined by \(\mathbb{A}\) given a set of functional dependencies \(F\) that hold on a relation \(R\).

Even though computing \(F^+\) for a set \(F\) is hard, exponential in the number of attributes, we can efficiently check if a functional dependency \(\mathbb{A} \rightarrow \mathbb{B}\) exists in \(F^+\) by computing the *attribute closure* \(\mathbb{A}^+\) of \(\mathbb{A}\), and checking if \(\mathbb{B}\subseteq\mathbb{A}^+\). *For many CS problems, the decision problem is easier than the enumeration problem.*

**AttributeClosure(\(F\), \(\mathbb{A}\))**

**INPUT**\(F :=\) set of given functional dependencies in \(R\)

**INPUT**\(\mathbb{A} :=\) attribute set in \(R\)

**OUTPUT**\(\mathbb{A}^+\)

\(\mathbb{A}^+:= \mathbb{A}\)

**Repeat**until no change in \(\mathbb{A}^+\)

**for**\(X \rightarrow Y \in F\)

**if**\(X \subseteq \mathbb{A}^+\)

\(\mathbb{A}^+ = \mathbb{A}^+ \cup Y\)

Here is the pseudocode that computes attribute closures to determine the membership of \(\mathbb{A} \rightarrow \mathbb{B} \in F^+\):

**VerifyFD(\(\mathbb{A} \rightarrow \mathbb{B}\), \(F\))**

**INPUT**\(F :=\) set of given functional dependencies in \(R\)

**INPUT**\(\mathbb{A} \rightarrow \mathbb{B} :=\) functional dependency to check existence of in \(F^+\)

**OUTPUT**true if \(\mathbb{A} \rightarrow \mathbb{B} \in F^+\)

\(\mathbb{A}^+\) =

**AtributeClosure(**\(F\), \(\mathbb{A}\)

**)**

**return**\((\mathbb{B} \subseteq \mathbb{A}^+)\)

The pseudocode can also be used to check for keys of a relation. Here is a quick recap on keys:

- If \(\mathbb{A}^+ = R\), then \(\mathbb{A}\) is a
**super key**of \(R\). - If \(\forall\) attributes \(A \in \mathbb{A}\) \((\mathbb{A}-A)^+ \neq R\), then \(\mathbb{A}\) is a
**candidate key**for \(R\). - A relation can have many candidate keys, the one we choose to be the key of the relation is the
**primary key**.

Assuming you have a database system that supports assertions (most don’t!), to ensure the correctness or integrity of your database you want to assert that you do not violate any of the functional dependency constraints in \(F^+\). However, that can be really expensive! You will have to check all the constraints on insertion, deletion and update operations.

A **Minimal Cover** is a set of functional dependencies \(F_c\) such that the closure of \({F_c}^+ \equiv F^+\), there are no extraneous attributes in any of the functional dependencies in \(F_c\) and the RHS of the functional dependencies are single attributes.

Why bother with minimal or canoncial covers? Because they provide a minimal set of assertions that we need to implement to ensure integrity.

At this point, you might be asking why did you go through all of that formalism! Hang in there. *By understanding functional dependencies, closures and keys, you will now be able to refine your schemas to ensure that they have no redundancies.*

## 4.2 The Process

**Normalization** is the schema refinement process where you **decompose** a relation recursively until you reach a specific **normal form**.

**Decomposition** is simply splitting a table \(R\) into two or more tables \(R_1, ..., R_n\) such that:

- Attributes(\(R_i\)) \(\subseteq\) Attributes(\(R\)), and
- Attributes(\(R_1\)) \(\cup ... \cup\) Attributes(\(R_n\)) \(=\) Attributes(\(R\)).

Note that at least one column has to be shared across a pair of resulting tables to help us recreate the original table by joining the resulting tables on that column!

A decomposition is good if it achieves the following:

- it is loseless: this is really a requirement!
- it eliminates redundancies
- it is dependency preserving.

**Loseless Decomposition** means you get back the data in the original table by joining the resulting tables of a decomposition.

A decomposition where the intersection of the two resulting tables is a key for one of the two tables is loseless (See theorem 3 in Sec 19.5.1 of the COWS book).

Formally, the decomposition of \(R\) into \(R_1\) and \(R_2\) is lossless with respect to a set of functional dependencies \(F\) that hold over \(R\) if and only if the closure of \(F\), \(F^+\), contains:

- \(R_1 \cap R_2 \rightarrow R_1\), or
- \(R_1 \cap R_2 \rightarrow R_2\)

**Dependency Preservation** means that I can check that the functional dependencies \(F\) that applied over the original table \(R\) still hold on the resulting tables, \(R_1, ..., R_n\) after a decomposition by only checking the functional dependencies that individually hold over each resulting table \(F_1, ..., F_n\). So, I don’t need to join the resulting tables to form the original table to check that the FDs in \(F\) haven’t been violated.

Formally, \((F_1 \cup ... \cup F_n)^+ = F+\). Note that we are testing the equality of the closures not the just FD sets! Work out this example to understand the subtlety here.

\(R = ABC\). We have FDs \(A \rightarrow B, B \rightarrow C, C \rightarrow A\). Suppose we decompose \(R\) to \(R_1 = AB\) and \(R_2=BC\). Is this decomposition dependency preserving^{2}? What about \(C \rightarrow A\)?

While a lossless join is necessary to get the original relation back correctly, dependency preservation relates more to being able to check all dependencies in the relation (either by having redundancies or joining, which is why it’s a performance property). It’s essential that all decompositions used to deal with redundancies be lossless. But, dependency preservation in our decompositions is not essential since we can check dependencies by joining the tables together.

### 4.2.2 BCNF

**Normal Forms** guarantee that a relation meets certain requirements. With respect to FDs, there are four forms (1NF, 2NF, 3NF and BCNF). Other normal forms go beyond functional dependencies (e.g. multivalued dependencies, sections 19.8 onwards in the COWS book if you are interested). We really only care about BCNF. The forms build on each other so a relation in 2NF is also in 1NF and satisfies more restrictions.

**1NF** relations have only atomic values, no list or set values for attributes. (At the get go, JSON, XML, and other contemporary data models are not even in 1NF!

Even though the forms build on each other, it is easiest to work backwards by understanding BCNF and then relaxing the restrictions to get to 3NF and then 2NF.

**Boyce-Codd Normal Form (BCNF)** is a form that guarantees no redundancies with respect to the FDs and is lossless. It may not be dependency preserving.

Formally, a relation is in BCNF if for all non-trivial FDs \(\mathbb{A} \rightarrow \mathbb{B}\), \(\mathbb{A}\) is a superkey for \(R\). Another way to think of this, is for all sets of attributes \(\mathbb{A}\) in \(R\), either \(\mathbb{A}=\mathbb{A}^+\) or \(\mathbb{A}^+=\) all attributes of \(R\).

Here is a BCNF decomposition algorithm:

**BCNFy(\(R\))**:

find \(X\) s.t. \(X \neq X^+ \neq\) [all attributes]

if (not found) then

\(R\) is in BCNF

else

Let \(Y = X^+ - X\)

Let \(Z =\) [all attributes] \(- X^+\)

Let \(R_1 = (X \cup Y)\)

Let \(R_2 = (X \cup Z)\)

BCNFy(\(R_1\))

BCNFy(\(R_2\))

### 4.2.3 3NF

BCNF does not always preserve dependencies and that’s why 3NF exists as a normal form. It is a relaxation of BCNF that trades off eliminating redundancy for dependency preservation. Consider the following relation:

Account | Client | Firm |
---|---|---|

101 | 1st Lehman Bro | Law For Money |

212 | 1st Lehman Bro | New Law |

101 | 2nd Lehman Bro | Law For Money |

333 | Enron | Old Law |

This relation has the following FDs:

- Client, Firm \(\rightarrow\) Account
- Account \(\rightarrow\) Firm

**What is the BCNF decomposition?**

To be in BCNF, we decompose as follows:

- (Account, Firm) for which (Account \(\rightarrow\) Firm) holds, and
- (Account, Client) for which there are no FDs!!

**In BCNF we lost the FD: (Client, Firm \(\rightarrow\) Account)**

So we keep the table as is, i.e. we don’t decompose, which is not BCNF, but is 3NF.

Formally, a relation is in 3NF, if \(A_1, ..., A_n \rightarrow B\) is a non-trivial dependency in R, then \(\{A_1, ..., A_n\}\) is a superkey for \(R\) or \(B\) is part of a key.

## 4.3 Schema Evolution

- This is a short editorial by Hellerstein describing data independence, Hellerstein’s inequality and the generalization of data independence to network independence. If you are still shaky about data independence this is a must read! Also, Hellerstein’s inequality may change how you view many fields within CS!

What if we want to change our database schema? Logical independence means that to a large degree the schema can change without the need to rewrite or update the application. One way that relational database systems support changes to the schema, while still allowing legacy applications to refer to the old schema is through **views**.

For example, suppose our enclosures’ schema changes so that area is now replaced by length and width. How can we still support all our applications that rely on “area?” Views to the rescue!

```
--- enclosures (old): eid, location, open, area
--- new_enclosures (new): eid, location, open, length, width
create view enclosures AS (
select eid, location, open, length*width as area
from new_enclosures
);
```

All queries and some updates are supported by views. So updates to the base table will be reflected in the view: we call this **view maintenance**. But updates to the view may not be easy to push down to the base table: we call this the **view update problem**. To see why this is a difficult problem think about how an update to the area can be translated into updates to length and width? You can specify specific update rules: e.g. \({\sf length=}\sqrt{\sf area}\) and \({\sf width=}\sqrt{\sf area}\).

View maintenance either re-executes the select query within the view or maintains a set of incremental update queries. Views can suffer from performance issues so there is a lot of research into how to maintain views efficiently with incremental updates. This is an exciting area of research!

The view update problem is still an open DBMS research problem.