Sunday, July 6, 2008

Normalization for Normal People, Part I--The "Language" of the Database, Installment Four

After such a lengthy break I feel I should do a quick review of previous thoughts on the language of the database and how it impacts the way we design our tables.

As I see it, the language of relational databases has three components:


The nouns are the names of the things we want to keep track of: People, Places, Projects, Transactions, Events and so on.

The adjectives are the descriptive terms we use to distinguish between one noun and another: FirstName and LastName for People, Address for Places, and so on.

The verbs of the database world, which I haven't addressed in detail yet in this blog, are the action or interaction words which describe how one noun relates to another: People place orders, thus creating transactions, for example. Companies publish catalogs. Schools schedule classes. Teachers teach students (or classes).

Unlike natural languages such as English, a relational database can not tolerate any degree of ambiguity. That means every noun must have one and only one meaning, every adjective must refer to one and only one attribute, or quality of a noun, and every verb must refer to one and only one type of interaction between nouns.

In previous blogs I've devoted considerable time and space to ways a "Zero Ambiguity" requirement can play out in a database.

Today, I'd like to push on a bit further in that area, exploring a few issues around the fact that it isn't always possible to avoid ambiguity entirely, and what we can do about that.

You Can't Always What You Get Want.

The problem is that, although we want to insist every item in a database must have one and only one meaning, there are times when it just doesn't work out that way.

My Rough And Rowdy Ways

I'm going to cheat on my sweetheart a bit here, in order to make a point. In the following discussion I'm going to treat "address" as an attribute of a person (or an organization). There is a sense in which that is accurate, but there is another sense in which it is probably less so. Without going to far astray from the central them today, i'll address that difference by pointing out that is possible to consider "Organization" and "Address" as two entities, and that they are related to one another by the verb "resides at", or "gets mail at", or "works at". In that sense, an address not so much an attribute.

However, there is another sense in which one can think of an address as an attribute of the entity: It's one of a number of facts about the entity which help to distinguish one from another.

For example, I've lived in a number of houses since I've been in Seattle--at least five. Only ONE of them is my CURRENT address, of course. If I'm giving the insurance company my address so they can insure my home, it's really important that they record that address, that complete address, and nothing but that address.

However, the fact is that there has been a historical, or time, dimension to my living arrangements. Sometimes it makes no difference, but sometimes it does.

One quick example is filling out a credit application. On the form are two lines: Current Address___________, and Former Address___________. In very fine print, the form tells me that I have to fill out the "Former Address" line only if I've lived at my current address two years or less.

Of course, that previous address information is stored somewhere in the dark nether reaches of my mind and I could, if necessary, call up that former address. I don't have to make the conscious decision to remember/not remember former addresses, that's just the way our brains work. (On the other hand, if I had to remember the address I lived at prior to that one, or heaven forbid, two moves before that, well, sometimes the memory just can't get us what we want.)

When you are designing your new database, of course, you must consciously make the decision whether to store only one, presumably current, address for people being tracked, or whether you need to keep the history associated with their addresses.

And that applies to every attribute which can have one or more values for the same entity.

The reason this is an issue--from the perspective of the language and structure of the database--is that variable attributes must be handled differently from those which are static (our dates of birth, for example).

Does Anybody Really Know What Time It Is?

The way I see it, when you are identifying the attributes of an entity, only those attributes which are static can be stored in the same table as the entity. Static attributes have no time or space dimension to them.

To use the language analogy, if the table is the noun, then only adjectives which can never have more than one value belong in the same table with it.

Historical Variance
If an adjective can have one value on one occasion, but a different value on a different occasion, then it should not be stored in the same table as its noun. This I call historical variance because it does involve a time dimension.

Concurrent Variance
Historical variance is easy to identify and easy to deal with, but there is another way in which an adjective describing a particular noun can have two or more different values, and sometimes those are a lot harder to identify. An adjective that describes a particular noun can have two different values at the same time. What, you might be asking yourself, is this guy talking about now? Two values at the same time? Well, yes.

The most obvious example is the one we've already presented: George Jones has two homes, and therefore, two addresses. They are concurrent with one another in that George owns both at the same time. The separation between them is based on space, not time.

Here are some other examples:

Willie is currently enrolled in three classes at the local community college. They don't all meet at the same time, but if you want to know Tommy's schedule for Fall Quarter, you have to list all three, not just one.

Waylon has three email addresses.

In cases where an attribute is variable, either because of historical or concurrent variance, that variable should not be stored in the same table as the entity to which it applies. Variable attributes should be in their own tables, unless you create and apply a Business Rule which specifically permits you to disregard that principle.

I've already given you the example of a person's address to illustrate this point, so let's continue to use it.

First You Say You Do, and Then You Don't

Because most people don't live at the same address for their entire lives, "Address" is one of those variable attributes which should be moved into its own, separate table. And, for the most part, every database should start out assuming that this will be the case.

However, your business may adopt a rule which says something to the effect of: "We don't care what happened in the past. All we want to know is where we can find you today. Just give us your current address. Also, if you have more than one address (like a primary residence and a summer home), we don't care. We just want that one address where we can find you when we need you."

If a business rule such as this applies, it is okay to violate the principle that variable attributes should be stored in their own, separate and move that attribute into the main table with the entity it describes. However, as the statement of the rule above indicates, this approach WILL result in some loss of data. What happened last year is lost when the current value is entered, replacing the old one. Information about any other values not captured in the record are simply not available, and more over, there is no way to even know if there ARE any other values.

In other words, deciding to store a variable attribute as a static attribute introduces a level of ambiguity into the data. It's up to you and your clients to decide if that kind of ambiguity is acceptable or not.

Obviously, once you've identified any variable attributes in your data, you must decide on the business rule that will apply to them. And from there, you can build your tables accordingly.

How Long Has This Been Going On?

Next time. Serial variance versus historical variance. Does time really matter?