1 The Relational Model

Week 1: August 26 - August 31, 2025

  • REDBOOK Chapter 9: Languages: The commentary by Hellerstein on the powerful abstractions that database systems offer. Focus on data independence.

A database is an organized collection of inter-related data that models some aspect of the real-world. This module focuses on data models or the tools and concepts that describe the semantics (meaning) of the data, the relationships within the data and the constraints in it.

A database management system (DBMS) is the software that manages the database. Don’t confuse these two.

We often assume that the people interacting with a DBMS are data analysts, decision-makers, etc. The reality is that the people that interact most with a DBMS and use languages like SQL are software developers. A DBMS is just another API for software development. So it is important to ask what is it that they offer to make our lives easier as software developers. Here will explore one such feature: clean data models and their associated higher level data manipulation languages. We will build up to the relation model and in the next module, we will explore relational algebra and SQL as the relational model’s higher level language.

1.1 Flat File Strawman

To understand what a data model and a DBMS offers, let’s start with a world where we just store our data on flat files and rely on the operating system’s file system to take care of things like recovery and concurrency control. What could possibly go wrong?

Issues with the RegistrarApp flat file database:

  • Data Integrity
    • How do we makes sure that repeated information (e.g. student details for a student with multiple entries for each course they took) is consistent?
    • What if we write an invalid string in our csv, skip a field, corrupt one line?
    • How do we store multiple courses for each student? What if each developer chooses a different strategy (e.g. multiple rows with repeated information, a nested map or a nested array of courses for each student?)
  • Implementation
    • How do you search for a student efficiently?
    • How to allow mutliple applications to access the same database?
    • How to allow concurrent access (reads and writes) to the same file at the same time?
  • Durability
    • How do you ensure that the data stays consistent when the system crashes as you are updating the record? You can’t rely on the OS file system, they protect the file’s meta data but not its contents!
    • What if you replicate your database? How to make sure that the replicas are consistent?

We will focus on the first set of problems in the first part of the course, then we will look at implementation problems and finally concurrency and durability challenges at the end of the course.

1.2 Data Models

A data model is a collection of concepts for describing the data in a database.

A schema is a description of a particular collection of data, using a given data model.

We will explore the evolution of data models from a historical perspective starting with the hierarchical tree model, then the network model and finally the relational model.

Different data models satisfy different needs. With the following criteria, watch the videos below to understand why the relational model is a clear winner:

  • Avoid data repetition: Having multiple copies of the same data means you have to ensure that your updates keep all the copies consistent — harder to program

  • Ensure data independence, both physical and logical: Data independence 1 shields application developers from changes in the organization of underlying databases, and shields database administrators from getting involved in the design and maintenance of applications. Physical independence relieves the application developers from thinking about the physical layout of data and access methods (e.g. do I sequentially scan or use an index to get to a record?). Logical independence has to do with the abiltiy to evolve the logical structure (schema) of the database over time without requiring major software overhauls.

  • Supports a high level manipulation language: How can you access the data in the database? Tree models and network models opt for “record-at-a-time” languages, the relational model has an algebra with “set-at-a-time” semantics. You will find arguments for both sides. SQL is another higher-level abstraction over relational algebra.

1.2.1 Tree models

1.2.2 Network models

1.2.3 Relational model

Which is better, set-at-a-time or record-at-a-time?

Whether set-at-a-time or record-at-a-time is better depends on the situation! Think this throught for your weekly assessment and we’ll put a note here next week.

Relational Terms and Definitions

Term Definition
Database A collection of relations.
Relation A table (a set), with rows and columns
Tuple A row in a relation
Attribute A column in a relation (its name and type)
Domain The universe of possible values for an attribute. E.g. the domain of a “country” attribute is the set of all possible countries (e.g. UAE, USA, UK, …)
Schema The name of the relation and the name and type of each column
Keys Keys manage relationships between records. A Primary key uniquely identifies a record. A Foreign key refers to a particular key in another table. A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.
Relational Algebra Relational Algebra is a set of fundamental operations to retrieve and manipulate tuples in a relation. Each operator takes in one or more relations as inputs, and outputs a new relation. To write queries we can “chain” these operators together to create more complex operations.

The optional course reading in this module takes you a step further in seeing the downsides of the relational model.

For those of you who are familiar with pandas, dataframes are their own data model. They are not relations: you can think of them as a matrix / relation blend. What makes them different? There is an inherent (i) row ordering, (ii) you can transpose the dataframe, (iii) and data types are lazily evaluated, which means that a column can have cells of different data types. This is a nice blog post on the origins of the data frames model and its subsequent adulteration.

Does AI, and in particular LLMs and ChatGPT, upend the relational model and with it how we architect database systems?

Think about some of the data needs for LLMs and what can and cannot be captured by traditional database systems and let’s discuss in class.

You can now complete this module’s assessment!

Next week we will take a deep dive into relational algebra and SQL.