When I first heard the phrase, database normalization, I wondered if there were abnormal databases, or wild uncouth databases roaming the earth.
Instead, database normalization is an elegant solution to a common design problem in computer science and application design. It's also a solution challenged today by database designs which reject normalization to gain performance benefits. This article provides a high level overview of database normalization and database design.
A Simple Database Design Example
Let's say you are designing an application to store company name, addresses, and contact information. To begin, we might design a database with these columns:
- Contact First Name
- Contact Last Name
- Company Name
- Street Address
- Postal Code
- Main Company Phone Number
- Main Company Fax Number
- Website URL
These are simply the most obvious data points to add. There could be many others.
Now let's assume a few details are true about the companies and individuals in our database:
- A company might have two or more office locations, each with their own address, phone, fax, URL, and other details.
- We might contact two or more people at each company.
- Each individual might have multiple phone numbers, social media contacts, and other data worth keeping.
The Problem Database Normalization Solves
It seems reasonable we could keep our original list of column names and use one database table. However, storing the same company name for each individual in that company, as well as each alternate office location, would be a waste of database space and potentially increase processing time. And it's possible updating the company name in one row of data might not update the company name in another row. To add or update data, we'd have to check every instance of the company name is accurate in every row where the company name appears.
How do you avoid these problems? Database normalization.
Specifically, if we look at our sample data, we can identify several types of data:
- The name of a company.
- The name of an individual.
- All the contact data for each individual.
- All the unique data for each company location.
The solution is to build one database table each for most or all of these types of data then use unique incrementing ID numbers assigned to each record of company name, person, and so on, then in other tables use the unique ID number instead of the plain text company name or other data..
For example, let's say we create these database tables:
- Company Names
- Company Locations
The Company Name table would have two columns, one column to hold a unique incrementing number automatically assigned to each new row of data and a second column for the company name.
The Company Locations table would have columns for address, phone, and other company location related data plus a new column to hold the unique incrementing number assigned to the company name in the Company Name table.
When we display data from our two data tables, our database query would say, in effect, for Company X with unique ID y in the Company Name table, grab all data in the Company Location table where the unique ID y value is stored. If there are five locations for Company X, our database query will return five results from the Company Location table. Because the unique ID y value is numeric, we don't have to worry about having the text Company X accurate in those rows of data in the Company Location table.
Even if we did have to worry about updating an ID number, dividing our database tables logically helps maintain our data. We don't have to worry about duplicate data in multiple tables.
Forms of Database Normalization
Our Company Name and Company Location tables are called single theme tables because they contain only one type of data. By using unique ID numbers, also called foreign keys, to identify company names in the Company Location table, we create implied relationships between types of data in the two tables. This is called second normal form, or 2NF. It is the simplest and most common form of database normalization.
Third normal form, or 3NF, applies to data we might find in the Company Location table. Perhaps each company location has a unique website URL, for example. If so, we might create a Company Websites table with a unique ID to identify each website URL and use the unique ID in the Company Location table. Because a company might have multiple locations with multiple website URLs, it is implied the company name might be assigned multiple URLs.
Where to stop database normalization is a fascinating question. In business, limits are set based on what functionality is needed, including future functionality. The database design takes into account different ways the database tables might need to be expanded over time. Once possible future changes are tested to ensure they would not disrupt the database design too much, the normalization process ends.
EF Codd in 1970 first created the idea of database normalization and then updated his idea in 1971 with third normal form. Codd worked with other computer scientists to identify additional forms of database normalization.
No SQL For You!
Perhaps the most interesting recent development in database design has been the adoption of non-normalized database designs. MongoDB, Google's BigTable, Couch, and other projects instead create massive tables with often repeated data.
Why? You might think, given the benefits of database normalization, creating the fewest possible number of tables would risk complexity and failure.
The answer lies in what databases do: read and write. It turns out database reads are much faster than writing and updating data in a database table. If your application mostly reads data and displays it for people, there is no reason to use a normalized database. The overhead of stitching data together is greater than simply popping all your data into one or a small number of database tables. Writing to your non-normalized database — which requires locking the table and field, adding the data, then unlocking the table and field — in that scenario is no big deal. It doesn't happen often enough to matter.
To return to our company information example we normalized above, in a NoSQL or non-relational database the data might be structured in nested groups. For example, one table might hold all our data with each row of data organized into key:value pairs. The key would be a unique ID. The value would be all the data for each company grouped by type.
In plain English, we would have a tree of data with nested groups of key:value pairs:
In this example, notice how all the data is organized into pairs separated by a colon (:). The items on the left of the colon are keys. The items on the right of the colon are called values. You can see we are able to create one master key:value made of a unique ID number as key and an array (a group of data organized into key:value pairs) as value. Within this array in the master key:value, we have other key:value pairs, specifically, one for Name, one for Location1 and one for Location2. We could, in theory, have 20 locations for this company, or 1. Within the Location1 and Location2 arrays, notice how the key:values differ: the address values differ, as well as web site URLs.
Now imagine, two years from today, we need to add some new type of contact information. With a relational database, we might create a table and use our unique company name identifier as a foreign key to connect the new data to our company data set. With non-relational databases, we’d simply add a new key:value pair then include it within the appropriate Location1, Location2, LocationN key:value set.
Another difference: with our non-relational database, one row of data in a database would hold all the data a relational database would hold in three or more database tables.
Bottomline, non-relational or NoSQL databases use key:value pairs and other strategies as an alternative to the database normalization used for relational databases.