Friday, March 13, 2020

Answering the Big Questions with Simple Answers

Mathematical and Linguistic Logic — The Rules of Normalization and the Rules of Rhetoric.

Relational database design is based on predicate logic mathematics, more specifically, the mathematical theory of sets. Unfortunately for many folks, the mathematics involved tend to be dauntingly complex. And I'm speaking from personal experience here.
At least partly for that reason, people have codified a set of design principles, or rules, to make it more accessible to a wider audience. We know these principles as the Rules of Normalization.
There are at least five Rules of Normalization, although only the first three are considered crucial to sound database design in many, if not most, cases. To make matters more challenging, different writers have come up with different formulations of the rules!  Here's a fairly typical overview of one of those formulations.
For a table to be in the First Normal Form, it should follow these 4 rules:
  1. It should only have single (atomic) valued attributes/columns.
  2. Values stored in a column should be of the same domain
  3. All the columns in a table should have unique names.
  4. The order in which data is stored does not matter.
For a table to be in the Second Normal Form,
  1. It should be in the First Normal form.
  2. It should not have Partial Dependency.
A table is said to be in the Third Normal Form when,
  1. It is in the Second Normal form.
  2. It doesn't have Transitive Dependency.

Do What Now?

See what I mean about accessibility? What are atomic valued attributes? What is a partial dependency and how does it differ from a transitive dependency and how do you know either of them when you see them? It's enough to make a grown person cry, right?
Seriously, though, while it is does take dedication and study, nearly anyone can understand and implement the Rules of Normalization. As a matter of fact, if you intend to work with MS Access to build relational database applications for yourself, your organization or clients and customers, you must do so.
As self-identified MS Access trainer and advisor, one of the ways I've tried to penetrate that layer of complexity on behalf of new developers is to offer analogies between the natural languages we all speak and the "language of the database" inherent in the Rules of Normalization. In a recent post, for example, I compared Tables, Fields and Relationships  between tables in a database to the Nouns, Adjectives and Verbs of a natural language.
I would like to expand on that further in this post.

Natural Language and Rhetoric, Database Language and Normalization

While the range of potential topics for discussion in natural languages is open-ended, we rely on a commonly understand set of rhetorical patterns for all of them. We use the same language structures to talk about an endless variety of topics: poetry, politics, business, science, wisdom, war, disease, culture, family, religion, and on and on and on. The important concept at work here is that there is considerable overlap both in vocabulary and—more importantly—in the rhetorical structures that organize our discussions of each of those subjects. That is the central thesis of Will Pitkin's PhD Dissertation, A Pedagogical Model of Discourse in which he argued that virtually all "discourse blocs" follow the same clearly identifiable rhetorical patterns. Understanding how discourse is composed of those building blocs is an essential skill. We seldom pay attention to that fact, but it is crucial to our ability to communicate complex subject matter. There are, as I see it, "Rules of Normalization" for rhetoric and we all intuitively rely on them to communicate.

Databases, in my opinion, are very much like that too. Any given database uses the same building blocks regardless of the subject matter it contains. A college professor, for example, might have a database to organize her library of poetry and literature, but it would probably not have a component devoted to rare mammals of the tropics. The scientist studying those mammals, on the other hand, would have tables devoted to species and habitats, but probably nothing about rhyme and meter. Yet poet and scientist both rely on the components of a database—tables, attributes and relationships between tables—in exactly the same ways.

The Role of Ambiguity

The problem is that natural languages tend to be a lot more forgiving of ambiguity.
Another way in which we can contrast natural languages and database languages is to recognize that ambiguity is an inherent, and sometimes even a desirable, element of natural language discourse. Poetry relies on it, as do politics and humor. The interplay between two or more possible interpretations of a phrase or sentence delights us, confuses us, and even enrages or mislead us. All of us have employed ambiguity at one time or another to achieve one or more of those goals. And the more skillful the writer, the more enjoyable it can be.

Technical discourse—which is the closest natural language analogue to the relational database applications of interest to us—is the most likely rhetoric to try to limit ambiguity.

To be useful, the language of a database must exclude ambiguity to the greatest extent possible. That is, after all, one of the primary reasons the scientist goes to the trouble of using it to classify the specimens collected for study. If you can't differentiate between those specimens, what's the point? It matters a great deal that the attributes of a small furry mammal are clearly and unambiguously stored in a table.
So how does this relate to our theory of database design? Let's take a look at the three components needed to "describe" any subject in the language of the database.

The Language of the Database

  • Tables are the Nouns of the Database Language
  • Fields are the Adjectives of the Database Language
  • Relationships are the Verbs of the Database Language

Tables—The "Nouns" of the Database Language

Tables are the objects in which we store data about the subjects of concern to that database. In a database of poetry, we'll need to talk about things like "poets" and "poems", at the very least. If the database is intended to support analysis of poems, then it needs to talk about things like "rhyme" and "meter" and so on. If it's intended to support biographical history of poets, it will need still other tables. Each of those topics—or entities to use a more database-friendly term— is a candidate to become one table. And the names of those things correspond to the names of the tables. A table called Poet", a table called "Poem", a table called "Rhyme", and a table called "Meter" would appear in this database. More importantly, only those things identified as "poems" can be stored in the poem table. On the surface, this seems obvious when we are in the natural language realm. A poem is a poem, not a helicopter. But it's not always that simple, unfortunately. Is James Joyce's Ulysses intended to be a novel or a poem? It's certainly much longer than other poems like a novel is, but the language and style of it set it apart from most novels. It's been compared to Homer's Odyssey, itself an extended poetic tour de force hundreds of pages long. So, poem or novel, or a bit of both? That's where ambiguity fits nicely in a natural language, but much less so in a database. You can talk about Ulysses without ever classifying definitively. You can't store it in a database both as a poem and as a novel.
And that means we have to identify and define the characteristics by which we classify "Poems" in this database. By identifying the appropriate attributes, we can safely store data without ambiguity.

Attributes—The "Adjectives" of the Database Language

To make it clear what is, and what isn't, a Poem, the Poem table needs to store attributes, or characteristics, that pertain to Poems and ONLY to Poems. Poems may or may not have rhyme schemes, for example. And if they do, that rhyme scheme will be one of those previously defined,  iambic pentameter, for example, or blank verse. That means one of the fields needed for the Poem table is "RhymeScheme". There are many different possible values for that attribute, but there's only one category, or field, for them all. I'll provide more information on this in a later blog.

Relationships Between Tables—The "Verbs" of the Database Language

Poets WRITE poetry. Poems ARE WRITTEN BY poets. No other relationship exists. Poems are not BAKED BY poets. Poets don't DRIVE poems. In order for the database of poetry to make sense, then, there is one and only one possible relationship between the tables for Poets and Poems.
The verb "WRITE" in this natural language formulation corresponds to the logical relationship between the Poet table and the Poem table.
Let's foreshadow a future discussion by pointing out that Poets seldom limit themselves to a single poem in their entire career, at least they don't do so by design. So, while the relationship itself is between "poet" and "poem", the reality is that we have a table of poets and a table of poems, there will be one OR MORE poems in the poem table for each of the poets.
I think that's enough for now. I'll continue my thoughts on the Language of Databases in future installments.