1 The Relational Model
Week 1: January 23 - January 29, 2022
- What Goes Around Comes Around by Stonebraker and Hellerstein.
- REDBOOK Chapter 1: Background: The commentary to the paper above by Stonebraker discusses new data models like JSON.
- COWS 1.1 - 1.5
- REDBOOK Chapter 9: Languages: The commentary by Hellerstein on the powerful abstractions that database systems offer. Focus on data independence.
- Data Models on Gradescope.
A database is an organized collection of inter-related data that models some aspect of the real-world. This module focuses on data models or the tools and concepts that describe the semantics (meaning) of the data, the relationships within the data and the constraints in it.
A database management system (DBMS) is the software that manages the database. Don’t confuse these two.
We often assume that the people interacting with a DBMS are data analysts, decision-makers, etc. The reality is that the people that interact most with a DBMS and use languages like SQL are software developers. A DBMS is just another API for software development. So it is important to ask what is it that they offer to make our lives easier as software developers. Here will explore one such feature: clean data models and their associated higher level data manipulation languages. We will build up to the relation model and in the next module, we will explore relational algebra and SQL as the relational model’s higher level language.
1.1 Flat File Strawman
To understand what a data model and a DBMS offers, let’s start with a world where we just store our data on flat files and rely on the operating system’s file system to take care of things like recovery and concurrency control. What could possibly go wrong?
Issues with the RegistrarApp flat file database:
-
Data Integrity
- How do we makes sure that repeated information (e.g. student details for a student with multiple entries for each course they took) is consistent?
- What if we write an invalid string in our csv, skip a field, corrupt one line?
- How do we store multiple courses for each student? What if each developer chooses a different strategy (e.g. multiple rows with repeated information, a nested map or a nested array of courses for each student?)
-
Implementation
- How do you search for a student efficiently?
- How to allow mutliple applications to access the same database?
- How to allow concurrent access (reads and writes) to the same file at the same time?
-
Durability
- How do you ensure that the data stays consistent when the system crashes as you are updating the record? You can’t rely on the OS file system, they protect the file’s meta data but not its contents!
- What if you replicate your database? How to make sure that the replicas are consistent?
We will focus on the first set of problems in the first part of the course, then we will look at implementation problems and finally concurrency and durability challenges at the end of the course.
1.2 Data Models
A data model is a collection of concepts for describing the data in a database.
A schema is a description of a particular collection of data, using a given data model.
We will explore the evolution of data models from a historical perspective starting with the hierarchical tree model, then the network model and finally the relational model.
Different data models satisfy different needs. With the following criteria, watch the videos below to understand why the relational model is a clear winner:
Avoid data repetition: Having multiple copies of the same data means you have to ensure that your updates keep all the copies consistent — harder to program
Ensure data independence, both physical and logical: Data independence1 shields application developers from changes in the organization of underlying databases, and shields database administrators from getting involved in the design and maintenance of applications. Physical independence relieves the application developers from thinking about the physical layout of data and access methods (e.g. do I sequentially scan or use an index to get to a record?). Logical independence has to do with the abiltiy to evolve the logical structure (schema) of the database over time without requiring major software overhauls.
Supports a high level manipulation language: How can you access the data in the database? Tree models and network models opt for “record-at-a-time” languages, the relational model has an algebra with “set-at-a-time” semantics. You will find arguments for both sides. SQL is another higher-level abstraction over relational algebra.
1.2.3 Relational model
Which is better, set-at-a-time or record-at-a-time?
Whether set-at-a-time or record-at-a-time is better depends on the situation!
Programmers do like record-at-time. Certain algorithms are just easier to express when you can individually iterate over each record — think of how often you use foreach. Many annoyances from using SQL stem from the set-at-a-time semantics: you get a result set as a query result and now you have to create cursors over that result set and iterate over it in the application program. And the result set can’t be directly updated within your iteration, you have to create new “update queries.”
Set-at-a-time is better from a data independence / declarative perspective: you can let the optimizer choose what is the best strategy to generate a result set and you only need to define the properties of the set rather than what to do at each record and how to get each record. Let’s say I want the set of junior, CS students. Now with record-at-a-time, I’ll have to get a cursor to the student records in the database, scan each student and check ‘are they junior?’ ‘are they CS students?’ and return each student that is (then do whatever processing is needed) and then continue scanning from where I last returned. With record-at-a-time, you always need to keep track of where you are, which records did you visit, which ones you didn’t? But what if there were intermediate updates!? With a set-at-a-time, you just state your properties (Junior students in CS). Now your optimizer can decide whether it will sequentially scan the students set and construct a result set of the junior, cs students or if it will say use an index (if one exists) to quickly get to the CS students, etc.
Note that record-at-a-time doesn’t exclude the ability of using indices, you would need to know of them and once you navigate to the records at the leaves of the index, you may still need to keep track of where you are.
Set-at-a-time also gives you the ability to string along operations that occur as transformations over the entire set – think lambda architecture! This is all the rage these days.
So set-at-a-time makes a lot of sense from a data analytics perspective. With transactions, however, this may be the wrong abstraction — you often just want to update one customer’s bank account. Set-at-a-time can still generalize in this case: a set of one is equivalent to a single record but it makes things messy: this is the “impedance mismatch” that the optional reading fusses about.
Relational Terms and Definitions
Term | Definition |
---|---|
Database | A collection of relations. |
Relation | A table (a set), with rows and columns |
Tuple | A row in a relation |
Attribute | A column in a relation (its name and type) |
Domain | The universe of possible values for an attribute. E.g. the domain of a “country” attribute is the set of all possible countries (e.g. UAE, USA, UK, …) |
Schema | The name of the relation and the name and type of each column |
Keys | Keys manage relationships between records. A Primary key uniquely identifies a record. A Foreign key refers to a particular key in another table. A foreign key specifies that an attribute from one relation has to map to a tuple in another relation. |
Relational Algebra | Relational Algebra is a set of fundamental operations to retrieve and manipulate tuples in a relation. Each operator takes in one or more relations as inputs, and outputs a new relation. To write queries we can “chain” these operators together to create more complex operations. |
The optional course reading in this module takes you a step further in seeing the downsides of the relational model.
For those of you who are familiar with pandas, dataframes are their own data model. They are not relations: you can think of them as a matrix / relation blend. What makes them different? There is an inherent (i) row ordering, (ii) you can transpose the dataframe, (iii) and data types are lazily evaluated, which means that a column can have cells of different data types. This is a nice blog post on the origins of the data frames model and its subsequent adulteration.
You can now complete this module’s assessment!
Next week we will take a deep dive into relational algebra and SQL.