Sunday, August 3, 2008

Normalization for Normal People, "How Long Has This Been Going On and Who Else is Involved?"

There is just a bit of irony in the title to today's installment. The cynical answer might be "Way too long already, dude. You've been talking about attributes for WEEKS now." That's true, but it's a fascinating topic and there is still a bit more to be said, so I'm going to say it. This week I'm also going to shift gears and bring a description of physical table structure into the discussion.

Variance, Past, Present and Future
The previous installment of my blog dwelt a lot on VARIANCE with regard to attributes. By that, I mean that for certain entities, some attributes have more than one value. I'm of the opinion that variable attributes can be the source of much confusion for novice or less experienced developers, so it really does make sense to dig into the subject at some length. Last time out, for example, we went through both concurrent and sequential, or historical, variance. Concurrent variance is determined by a spatial, or locational, dimension, while sequential, or historical, variance is determined by a temporal, or time-based, dimension.

To Each His Own
In both the case of concurrent variance and sequential or historical variance, sound table design normally requires that the attribute in question be removed from the Roster Table that lists entities in question. Variant attributes are properly stored in their own table in which each variant value for each entity makes up a single record.

That can't be emphasized enough. Variable attributes almost always require a table of their own.

Here's are some new examples of what I'm talking about and an expanded discussion of the importance of capturing historical variance.

I'm Leaving On a Jet Plane, Don't Know When I'll Be Back Again.
Let's say you have a database in which you want to track "incidents" involving aircraft at various airports. In that database, two of the tables will be roster tables, one listing Aircraft and one listing Airports. Both Aircraft and Airports are entities (the "nouns" of our database language). In order to accurately describe them, we need to include the attributes that uniquely describe each aircraft and each airport. I won't list all of the possible attributes (or "adjectives", in the language of the database) for this database, partly because they are not relevant here, and partly because I just don't know what all of them might be. We'll just talk about a couple that help illustrate our point for today: how to properly handle variant attributes

One of the "adjectives" that describes an airport is "runway length". This factor is important in evaluating incidents involving aircraft because one type of incident might be overshooting a runway. However, some airports have one runway, some have two, some have three, some have more. This is an example of concurrent variation. All of those runways exist at the same time; the difference between them is spatial, not temporal.

One of the adjectives that describes an aircraft is "service hours". Each time a plane goes up, it accrues additional service hours. This factor is also important in evaluating incidents involving aircraft because one type of incident might be equipment failure. Each time a maintenance measurement is taken, the service hours for a plane are different, but the date and time when that measurement was taken is an integral part of that measurement. This is an example of historical variation.

Therefore, this database will have at least four tables: Aircraft (a Primary Roster Table), Airports (a Primary Roster Table), Runways (a Secondary Roster Table) and AircraftMaintenance (a History Table).

Primary and Secondary Rosters
I have used the more generic term "Roster Table" up to now, but today, in order to differentiate between rosters which list only entities and their static variables and rosters which list variable attributes for those entities, we need to break Rosters down into Primary and Secondary Rosters.

Primary Rosters
Airport is a primary roster. Primary Rosters really only REQUIRE two fields (although they can have more.) Primary rosters need a primary key field and one Name or description field. If there are other static attributes, they can be included as well.

Secondary Rosters
Secondary Rosters REQUIRE at least three fields because they only exist by virtue of the relationship to the Primary Roster for which they are created. They need a primary key field of their own, plus a foreign key field "pointing back" to the related record in the Primary Roster table, and at least one field for the variable attributes being tracked.

In this aircraft incident database, Runway is a secondary roster containing at least two fields in addition to its primary key. The first required field is the foreign key field which "points back to" the Primary Roster table. For example, if LAX, the airport in Los Angeles, is identified by primary key "1" in the Primary Roster Table of Airports, then each of the runways at LAX will be identified in the Secondary Roster Table of Runways as Foreign Key 1 as well.

Obviously, the reason for creating the secondary roster table is that we need to store one record for each of the runways at LAX because we need to track the LENGTH of those runways. That means the third field in the runway table will be "Runway Length".
Three fields are the minimum requirement for a Secondary Roster Table, but in this case, a fourth field in the runway table will be the compass bearing, or direction of the runway. That's an important attribute that varies by runway as well.

And finally, it's probably necessary to have a fifth field for the name, or other designation for runways, e.g. "Runway One", which runs North to South and measures 10,000 feet in length. ("No" I didn't look that up, I'm just making up examples, so don't bother to correct me, okay?)

Notice that, if this Secondary Roster table did NOT have an associated Primary Roster table (i.e. if it were not related to "airports"), it would not have a foreign key field and it would therefore be a Primary Roster Table.

Another way to say this is that Primary and Secondary Roster Tables really differ only in that Secondary Rosters have a foreign key pointing to their related Primary Roster.

What's that you say about relational databases? Is the presence of relationships between two tables what they mean? Are we onto some important point here? I think it very well could be the case, don't you?
Most Secondary Roster Tables are like this runway table: A Primary Key field, a Foreign Key field and one or two variable attribute fields. They exist ONLY because you need to capture the variable attributes which can't be included in the Primary Roster table, and they always have a one-to-many relationship to their Primary table.

Historical Tables
Historical Tables resemble Secondary Roster tables in all aspects but one. Their name is the dead give away: "Historical".

Now I Long for Yesterday
You can't talk about history without including dates and times. Remember, one of the great enemies of sound databases is ambiguity. Relative references like "Today", "Yesterday" and "Tomorrow" are not acceptable. We need the real deal--that is, dates like "August 1st, 2008", "June 22nd, 2007" and so on.

With that in mind, it should be obvious that the required additional field for historical tables is "Date". Well, not exactly "Date" because that's too ambiguous. Maybe, in the case of an Aircraft Maintenance table, it would be "ServiceDate" or "MaintanenceDate" or "InspectionDate".

So, to create a historical table, which tracks variable attributes at specific points in time, we need a minimum of four fields: A Primary Key, a Foreign Key that points back to the Primary Roster Table, the "As Of Date" field, and, last but not least, the variable attribute which we are tracking. In the case of our Aircraft Maintenance table, that is "ServiceHours".

One last observation about historical tables, and then we'll call it a blog.

Is you Is or Is You Ain't My Baby?*
While many historical tables are intended to capture ONE variable attribute, it is just as common to find several such attributes in a single historical table. In the aircraft maintenance table, for example, "ServiceHours", meaning the number of hours an aircraft has flown as of the date it was inspected, or maintained, is being tracked. However, other kinds of information related to that inspection instance can be tracked as well, things like, whether or not the hydraulic and engine fluids were replaced, whether tires were replaced, and so on. Anything that relates to a single instance of Maintenance in the work flow at hand can be tracked in that single table.

That, unfortunately, is another trap for the unwary. Because a typical historical table can, and often does, include more than one variable attribute, it's a place where an inappropriate attribute can sneak into the table. Just make sure, therefore, that every such variable attribute added to your historical tables actually pertains to the incidents being tracked in that table.

I'll Buy That
That pretty much wraps it up for this part of our journey. Next time, we'll move from historical tables to transactional tables. Here's a clue to the difference. Not every transaction involves money, but they all require two to tango.


==========
*In case this particular song reference is too obscure for some of you, it's a 1944 Louis Jordan song later recorded by, among others, Bing Crosby and Nat King Cole, both of which I heard growing up in the fifties.