Practical Technology

for practical people.

Relational Databases: The Real Story

| 0 comments

Did anyone else notice a few months ago when every food product in the land was advertised as “Light?” It didn’t matter a darn if there was any truth to their claims. Light was the new hype phrase and so everyone used it.

Breakfast cereals, margarines, and even ice creams were all advertised as being light. They got away with it because while you may think you know what light means, for advertising purposes it’s an ambiguous term. Food advertisers aren’t the only ones to play fast and loose with language, computer companies do the same thing.

Instead of “light”, buzzwords like “turbo” and “object-oriented” bombard us. One term with a precise definition has been bantered about so often that even computer veterans have lost track of its meaning. That phrase is relational database management system (RDBMS).

Companies have been using this phrase for years. There’s only one problem: few programs come even close to being relational database managers.

Thanks to their efforts even database professionals believe that RDBMSs are any program that can access more than one database at a time. Nothing could be further from the truth. Just because a program allows you to view two different tables with its procedural language doesn’t make it relational.

The sad thing about this is that there’s no real excuse for this confusion. The relational database model has been around since 1970. In that year Edgar Codd introduced it to the world in his classic paper, “A Relational Model for Large Shared Data Banks” in Communications of the ACM. Since then he and his cohort have been beating the drum for RDBMS both within and without the walls of IBM. To make certain that everyone got the idea, Codd summed up the definition of RDBMS in twelve rules.

These are:
In a way it’s funny that so many programs try to wrap themselves in the RDBMS mantle. During its first years of existence, nobody except Codd and company supported the RDBMS model. RDBMS, based firmly on mathematical theory and predicate logic, was disliked for its break with traditional database thought. It was reviled as being too theoretical. Many said that it would never be practical. Whoops!

The prevailing database model of the late 60s and early 70s was the hierarchical one. In this still popular system, data is stored in a tree structure. Every data element is defined as being a member of a group in this arrangement. These groups are themselves data elements and can be members of a higher group.

Under this scheme, your zip code is a data element is part of another data element, your address. This could then be a part of another data element, a mailing list for instance. These structures can grow quite complicated and require pointers and indexes to properly track information.

RDBMSs take a much simpler view of data. In a relational database, two dimensional arrays of rows and columns hold all information. This seems too easy and to some extent, that criticism is valid. Not every kind of information can be easily accessed under a RDBMS. Imaging information, for instance, doesn’t easily fit into a RDBMS’ neat patterns of data. Still, a RDBMS works fine for most textual or numeric data. The structure may be simple, but the data retrieval and manipulation powers inherent in it are unparalleled.

In part, this is because RDBMS theoretical underpinning allows for much easier coding. For example, finding records between two points, say all names between Vau and Vo, is a breeze in a relational system. The same search in conventional databases requires many record by record comparisons.

Even the most complicated data relationships can be reduced to two-dimensional table formats in a process called data normalization. This format makes changing and displaying information far easier than under a hierarchical system. The fundamental improvement of a RDBMS is that data can be added, deleted, or changed throughout an entire database by treating it as a single set. Ordinary database managers require record by record updates that drastically slow down performance.

The RDBMS concept sounds easy, but its full implementation does t more difficult to do in practice than in theory. The RDBMS complete definition contained within Codd’s 12 rules has proven difficult to achieve. Any database manager that claims to be relational must conform to the complete dozen. So far [1990] none have.

There are a few which have come close. IBM’s mainframe driven DB2, Ingress, Oracle, and R:Base 3.0 all attempt to meet the demands of the RDBMS model. At one time Codd stated that a relational database manager needed to meet only seven of his rules. By that standard, some have been successful. In order to understand why this has proven so hard to do, a review of some of these laws is in order.

First, here are the laws:

Rule 1: The Information Rule
All data should be presented to the user in table form.

Rule 2: Guaranteed Access Rule
All data should be accessible without ambiguity. This can be accomplished through a combination of the table name, primary key, and column name.

Rule 3: Systematic Treatment of Null Values
A field should be allowed to remain empty. This involves the support of a null value, which is distinct from an empty string or a number with a value of zero.

Rule 4: Dynamic On-Line Catalog Based on the Relational Model
A relational database must provide access to its structure through the same tools that are used to access the data.

Rule 5: Comprehensive Data Sublanguage Rule
The database must support at least one clearly defined language that includes functionality for data definition, data manipulation, data integrity, and database transaction control.

Rule 6: View Updating Rule
Data can be presented to the user in different logical combinations, called views. Each view should support the same full range of data manipulation that direct-access to a table has available.

Rule 7: High-level Insert, Update, and Delete
Data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables.

Rule 8: Physical Data Independence
The user is isolated from the physical method of storing and retrieving information from the database.

Rule 9: Logical Data Independence
How a user views data should not change when the database’s logical structure (tables structure) changes.

Rule 10: Integrity Independence
The database language (like SQL) should support constraints on user input that maintain database integrity.
” No component of a primary key can have a null value. (see rule 3)
” If a foreign key is defined in one table, any value in it must exist as a primary key in another table.

Rule 11: Distribution Independence
A user should be totally unaware of whether or not the database is distributed

Rule 12: Nonsubversion Rule
There should be no way to modify the database structure other than through the multiple row database language (such as SQL).

Now, let’s dive in. The fundamental rule of RDBMS is that all information must be manageable entirely through relational means. This means that on the logical level everything in a relational database must be represented by values in tables. Here is where many database managers fall short. It is all too tempting to take a short cut in data representation or manipulation for the sake of short term efficiency. Unfortunately, this easy road leads quickly away from the basic model.

Other rules re-enforce this concept. In a true RDBMS, the database description, or catalog, must be contained in tables and be controlled by the data manipulation language.

Perhaps the most troublesome rule here is the number three which deals with nulls. In a RDBMS, nulls represent missing or inapplicable data. They are a vital part of the relational concept. This isn’t the same thing as empty or blank fields or the concept of zero.

If that sounds confusing, you’re right, it is. English isn’t the appropriate language for discussing the issue. To make it clearer, here’s the best definition I know for the concept. Null represents information which isn’t known at the time of the record creation or modification. For instance, a hospital keeps birth records. Sometimes happy new parents don’t have a first name for their little darlings. Now, there are several ways this could be handled in the hospital files. The baby could be given a false name, “John Doe” or “Smith’s daughter” and the record changed latter to the true name. This will work but it makes both record updating and reporting more complicated. Under a RDBMS, the missing information is represented by the null value.

The problem with this is that even within the relational framework there is no agreement on how to manipulate record sets containing null values. Debate rages on what to do with nulls. Since it’s at the heart of the theory, it can’t just be disregarded. For example SQL, sequential query language, the most popular relational language, identifies primary record keys by the combination of their unique identity and by being “not null.” How do you handle a situation where the primary record key includes a null value? That’s a good question with many ways to answer it.

The waters have been further muddied by Codd’s suggestions that two types of nulls should be recognized. One would represent missing information and the other inappropriate data. There’s no sign that a definite answer on this thorny question will be quickly forthcoming. [January 29th, 2008, we’re still no closer to a universal answer.]

Another rule that has proved difficult to implement is that a RDBMS must have distributional independence. In other words, the data manager must be able to cope with distributed databases. A true RDBMS shouldn’t care if the view on your screen is made up of a table from your PC and another from a VAX across the country.

As any database manager watcher knows, this is one area where progress is rapidly being made. Database servers and clients are the hottest development in the field. The RDBMS model is largely responsible for this. Its simple design and data integrity rules have made it the cornerstone of most such schemes. [In 2008, this issue has long been solved. How to handle replicating date in case of a disconnection or data corruption remains a practical problem, but the core design problem has long been solved.]

You might think that you could combine the best features of a RDBMS with other systems. Many software companies have thought just that. They’ve tried to put a coat of relational paint on top of other systems. Codd addresses this with his final rule. The gist of which is that if a relational system does have an ordinary single-record-at-a-time language it can’t be used to detour around the system’s relational rules. Sour grapes over the spread of programs that did just that wasn’t the reason for this final commandment. The relational model is fundamentally different from other ways of viewing and manipulating data. A true hybrid system could never produce the full gains promised by relational theory.

In some circles, talking about database theory is like talking about religion or politics: you’re sure to have an argument. I’m in favor of the relational model, but I’ll be the first to admit that it has some problems. There are still grey areas, like nulls, that need a clearer definition. There are practical problems as well. RDBMS require comparatively large amounts of RAM and disk storage. To put it more bluntly, they tend to be resource hogs. It should also be considered that for many purposes there’s nothing wrong with hierarchical systems.

Most flat file database managers work quite well within their theoretical constraints. In point of fact, most database programs, including such popular favorites as dBase, FoxPro, Clipper, Superbase, and Paradox owe more to this model than they do to the relational one. Still, as time goes by the theoretically superior relational model will be successfully implemented on more platforms. Then, and only then, will we have products that can honestly be presented as being relational.

Until that time, the best that can be said of most programs is that they include some relational features.

Today, in 2008, most serious DBMSs–MySQL, Oracle, PostgreSQL–are while not perfectly relational, are based in large part on Codd’s rules.

A version of this article first appeared in Byte Magazine in 1990.

Leave a Reply