Thursday, March 26, 2020

You Say Whatchamacallit, I Say Department

Last time out I tried to explain some of my thinking behind selecting the names of things we need to track in a relational database application. Specifically, I likened them to the nouns of a natural language as a way to help differentiate between entities themselves and the attributes that qualify them. It's more complex than that, though. Let's drill into one aspect of that complexity using the example of departments within an organization.

In terms of the Discourse Blocs analysis originated in Will Pitkin's original formulation, the specific structure here is:
X is one kind of Y.
The relationship between X and Y is that Y is a higher level group of similar X's, a hierarchy. X does not modify Y, which would be another possible X and Y rhetorical structure.
X is an attribute of Y.
Let's go back and look at two of the proposed tables I outlined in the previous post.

  • 2018 Budget
  • 2019 Budget
  • 2020 Budget
  • Current Employee Assignment
  • Department
  • Human Resources
  • Location
  • Employee
  • Payroll
  • Sick Days
  • Vacation Days

We're focusing on "Human Resources" and  "Department".

It's obvious to users of our natural language, that the rhetorical structure here is that "Human Resources" is ONE KIND OF "Department". (At least I think it's obvious to most of us.) We recognize that "Human Resources" is a qualifier (also referred to as an adjective) for the noun "Department". But the underlying rhetorical structure is that Human Resources is ONE KIND of Department, which is different in rhetorical meaning from, say, "Large Department" or "Small Department". "Large" and "Small" are attributes of ANY department.

In company A, Human Resources is a large Department.
In company B, Human Resources is a small Department.

There are rhetorically speaking, two X and Y structures at work.

  • X is ONE KIND of Y.
    Human Resources is a Department in Company A and in Company B.
  • X is AN ATTRIBUTE of Y.
    Human Resources is large in Company A and small in Company B.

Turning to the way we apply this understanding to the way we design tables in a relational database application, we can make the following conclusions.

In a relational database application that means we have one table called "Department", i.e. the noun, and that "Human Resources" is one of the records in that table  along with "Operations",  "Marketing", etc. They are all "One Kind Of", and that means they are all records in the Department table.

In other words, the specific terms differentiate between instances of "Department". Each such instance becomes one record. (I'll get to another aspect of this principle later, when I talk more about the problem of repeating groups and spreadsheet style tables.)

In the previous post, I also described the relationship between Employees and Managers along the same X and Y principle.

A "Manager" is ONE KIND OF "Employee".

Note that unlike Departments, we don't normally attach descriptors like "Manager" to nouns like "Employees" even though the X is one Kind of Y structure is the same. I.e. we don't say "John is a Manager Employee." We DO recognize, though, that the underlying rhetorical concept is consistent, i.e. "X is a kind of Y".

 Pitkin explored this important concept in his original work: most rhetorical structures can be expressed in more than one way. It's important to look for and identify those rhetorical relationships and not focus exclusively on the specific words in a block of discourse.

I realize that I'm pounding hard on this point. I'm doing so because it's so important to my way of thinking about table and relationship design.

It's not just the specifics of the words in a phrase that we need to understand. We also have to understand the principles by which we describe one thing in terms of another. At the risk of over-simplification, what I'm proposing is that it's not enough to notice that we can, and do, refer to a "Human Resources Department", even though we do not refer to "Manager Employee" or "Salesperson Employee". It's not just the actual syntax that matters, it's also the semantics behind that syntax which matters.

And now we have a better way to understand that meaning in terms of the X and Y structures.

Let me illustrate it this way. One of these statements is true, one is not. It has to do with whether the X and Y relationship is ONE KIND OF or AN ATTRIBUTE OF.

Once a Human Resources Department, always a Human Resources Department.
Once a receptionist, always a receptionist.

If we go to work on Monday in the Human Resources Department, we don't expect to go into work on Tuesday and find that the Human Resources Department is now in charge of Sales instead.

If we go to work on Monday as a receptionist, it's quite possible that we can go back to work on Tuesday as a Marketing Specialist due to a transfer or promotion.

And that's the kind of logic we need to bring to our tables in a relational database application.

I'll sum it up by saying that, if you decide you are looking at a potential component of your database, and you decide that it is an example of ONE KIND OF thing, it becomes a record in a table. If you decide it is an example of AN ATTRIBUTE OF, it becomes an attribute, or field, in a table.


Records and Attributes in a Relational Table





Thursday, March 19, 2020

I Need One of Those Whatchamacallits

Tables—The "Nouns" of the Database Language

The first rule of normalization specifies how fields and columns in a table should be set up. The other rules add greater specificity. However, in my opinion, before we can even get to that point, we have to have a clear idea of what the table itself should look like. I like to make the analogy that the tables in a relational database application play the role played by nouns in a natural language. In other words, they are the names of the things we want to talk about. But we have to be very disciplined in selecting those names and the things they represent.

Let's consider a more or less random list of names of things that you might want to materialize as tables in a relational database application. At this point, they are still candidates for inclusion, still in the realm of natural language, because we are still just thinking about the things we see and talk about in our day to day work; things we may want to create tables for in a relational database application.

As I review them, I'll try to explain why some of them ARE good candidates to become tables, and some ARE NOT. If we can codify the thinking behind those decisions, we can reuse that decision-making to create other tables for other databases. Also, we'll identify at least one table that is missing from our initial list of candidates.
  • 2018 Budget
  • 2019 Budget
  • 2020 Budget
  • Current Employee Assignment
  • Department
  • Employee
  • Human Resources 
  • Location
  • Manager
  • Payroll
  • Sick Days
  • Vacation Days

Human Resources Functions, I Presume?

First, we can deduce that these might be tables you'd find in a Human Resources relational database application. In fact, Human Resources is one of the names in the list of candidates to be materialized as a table. So let's see what else we know about Human Resources. 

Most organizations organize themselves into  departments; Human Resources is one of them along with Operations, Marketing, Information Services, and usually others. People who work in Human Resources are tasked with managing the people who work for the organizations, how and how much they get paid, which departments they work in and so on. People working in Operations are responsible for organizing the way the organization does its work, Marketing employees develop programs to sell the organization's goods and services, and so on.

The first thing that jumps out to me is that the list of candidates includes Department and Human Resources. Obviously, we are primarily concerned about HR functions here, so it's understandable that HR would be proposed as one of the table candidates. However, what we already know about how organizations work in the normal world tells us this is not going to work in a relational database application.

Egos aside, does it make sense to lump other departments into one category while isolating the HR department into another, separate category at the same level? From a natural language perspective, if you were going to create a presentation to the board of directors, would this rhetorical structure make sense?
The XYZ Company consists of two divisions: Human Resources and Other Departments.
Is it really a binary choice between two categories? Probably not.

You might see a rhetorical organization like this, though:
The XYZ Company consists of numerous divisions, including Human Resources, which is the subject of our presentation today.
According to Discourse Bloc theory, the basic rhetorical structure of these two formations is very similar. Both follow one of the common Bloc structures found throughout technical discourse. Specifically, that would be:
Bloc X consists of one or more units of Bloc Y.
For example:
Animal species (Bloc X) include (or consists of) mammals, reptiles, insects, and others (Bloc Y). 
 In our relational database application, the rhetorical formulation is this:
Bloc X (the XYZ Company) is organized into multiple Bloc Y parts (divisions or departments). 
  • The first formulation artificially limits the Y number of parts to two UNEQUAL parts. 
  • The second formulation recognizes that there are actually multiple, EQUAL parts.
You probably agree with me that that the first formulation would make no sense. In fact, it might get you in trouble with the other departments represented at that meeting. However, the second formulation is semantically, as well as rhetorically, sound because it a) correctly identifies the relationship between whole (XYZ Corporation) and its parts ( XYZ Departments), b) acknowledges the parity of Human Resources with other departments, and c) identifies the primary focus of the presentation.

In our database language, then, the noun we choose for our table name should reflect the same logic. We need tables which accurately reflect the real world.  I suspect the logic is a lot easier to recognize "in the real world" because that's where we spend most of our time. We know that there are many  more or less equal departments even if we only want to talk about one of them. In a relational database application, sometimes we lose sight of that.

In the database we need one table, called Department (the whole). One of the departments (the parts) listed in that table will be Human Resources. That's true even though all of the other things we name and talk about in this database all pertain to that one department. We're tracking the functions of HR, but we're not ignoring reality to do so. In this relational database application, we're not tracking what the Operations or Training Departments do, we're tracking the people, i.e. the Human Resources, who work in all of the organization's many departments.

Conclusion: We eliminate "Human Resources" from further consideration to become a table. We retain "Department" as a table candidate. Department is the name of the entity we want to track in our database. It's the proper noun.

The Things Human Resources Must Track

So, if we know that these are all related to Human Resources activities, the next task is to decide which of these proposed tables should be included, and if they are included, should they be used "as is" or further refined.

Managers are People Too

There is a proposed table for "Managers" and a proposed table for "Employees". Does that make sense? I suggest not, for this reason.

The people who work for XYZ Corportation  includes all of those people who are employed to "do something" on behalf of the organization. The Discourse Bloc for this is similar to the one we saw for departments, Bloc X consists of multiple Bloc Ys.
Bloc X (people who work for XYZ) fills several Bloc Y roles (managers, sales people, IT Developers, assembly line workers, and so on).
Formulated in that X/Y structure, it's pretty easy to see that we need a higher level, Bloc X category for all of the Bloc Y roles. The obvious choice is Employee. All managers are employees; all delivery drivers are employees, all software engineers are employees.

Once again, we drop the lower level, Y Bloc, terminology from consideration as a table. We should have an employee table; we should have no manager table.

Y Blocs Don't Disappear

At this point, we need to point out that removing "Human Resources" and "Manager" from consideration as tables in themselves does not mean they aren't relevant. They are. They are "Bloc Y" members of the two remaining "Bloc X" categories. And that means they'll become records in their respective tables, along with any other Bloc Y units that catch our attention.

Another way to say this is that "Department" and "Employee" are two of the nouns we need to include in order to talk logically about Human Resources activities at the XYZ Corporation.

We can continue review and analysis of the other candidates on the list; some will make into the Bloc X group that become tables, some into the Bloc Y group that will become records in those tables. I'll take up that task in my next installment.


Friday, March 13, 2020

Answering the Big Questions with Simple Answers

Mathematical and Linguistic Logic — The Rules of Normalization and the Rules of Rhetoric.

Relational database design is based on predicate logic mathematics, more specifically, the mathematical theory of sets. Unfortunately for many folks, the mathematics involved tend to be dauntingly complex. And I'm speaking from personal experience here.
At least partly for that reason, people have codified a set of design principles, or rules, to make it more accessible to a wider audience. We know these principles as the Rules of Normalization.
There are at least five Rules of Normalization, although only the first three are considered crucial to sound database design in many, if not most, cases. To make matters more challenging, different writers have come up with different formulations of the rules!  Here's a fairly typical overview of one of those formulations.
For a table to be in the First Normal Form, it should follow these 4 rules:
  1. It should only have single (atomic) valued attributes/columns.
  2. Values stored in a column should be of the same domain
  3. All the columns in a table should have unique names.
  4. The order in which data is stored does not matter.
For a table to be in the Second Normal Form,
  1. It should be in the First Normal form.
  2. It should not have Partial Dependency.
A table is said to be in the Third Normal Form when,
  1. It is in the Second Normal form.
  2. It doesn't have Transitive Dependency.

Do What Now?

See what I mean about accessibility? What are atomic valued attributes? What is a partial dependency and how does it differ from a transitive dependency and how do you know either of them when you see them? It's enough to make a grown person cry, right?
Seriously, though, while it is does take dedication and study, nearly anyone can understand and implement the Rules of Normalization. As a matter of fact, if you intend to work with MS Access to build relational database applications for yourself, your organization or clients and customers, you must do so.
As self-identified MS Access trainer and advisor, one of the ways I've tried to penetrate that layer of complexity on behalf of new developers is to offer analogies between the natural languages we all speak and the "language of the database" inherent in the Rules of Normalization. In a recent post, for example, I compared Tables, Fields and Relationships  between tables in a database to the Nouns, Adjectives and Verbs of a natural language.
I would like to expand on that further in this post.

Natural Language and Rhetoric, Database Language and Normalization

While the range of potential topics for discussion in natural languages is open-ended, we rely on a commonly understand set of rhetorical patterns for all of them. We use the same language structures to talk about an endless variety of topics: poetry, politics, business, science, wisdom, war, disease, culture, family, religion, and on and on and on. The important concept at work here is that there is considerable overlap both in vocabulary and—more importantly—in the rhetorical structures that organize our discussions of each of those subjects. That is the central thesis of Will Pitkin's PhD Dissertation, A Pedagogical Model of Discourse in which he argued that virtually all "discourse blocs" follow the same clearly identifiable rhetorical patterns. Understanding how discourse is composed of those building blocs is an essential skill. We seldom pay attention to that fact, but it is crucial to our ability to communicate complex subject matter. There are, as I see it, "Rules of Normalization" for rhetoric and we all intuitively rely on them to communicate.

Databases, in my opinion, are very much like that too. Any given database uses the same building blocks regardless of the subject matter it contains. A college professor, for example, might have a database to organize her library of poetry and literature, but it would probably not have a component devoted to rare mammals of the tropics. The scientist studying those mammals, on the other hand, would have tables devoted to species and habitats, but probably nothing about rhyme and meter. Yet poet and scientist both rely on the components of a database—tables, attributes and relationships between tables—in exactly the same ways.

The Role of Ambiguity

The problem is that natural languages tend to be a lot more forgiving of ambiguity.
Another way in which we can contrast natural languages and database languages is to recognize that ambiguity is an inherent, and sometimes even a desirable, element of natural language discourse. Poetry relies on it, as do politics and humor. The interplay between two or more possible interpretations of a phrase or sentence delights us, confuses us, and even enrages or mislead us. All of us have employed ambiguity at one time or another to achieve one or more of those goals. And the more skillful the writer, the more enjoyable it can be.

Technical discourse—which is the closest natural language analogue to the relational database applications of interest to us—is the most likely rhetoric to try to limit ambiguity.

To be useful, the language of a database must exclude ambiguity to the greatest extent possible. That is, after all, one of the primary reasons the scientist goes to the trouble of using it to classify the specimens collected for study. If you can't differentiate between those specimens, what's the point? It matters a great deal that the attributes of a small furry mammal are clearly and unambiguously stored in a table.
So how does this relate to our theory of database design? Let's take a look at the three components needed to "describe" any subject in the language of the database.

The Language of the Database

  • Tables are the Nouns of the Database Language
  • Fields are the Adjectives of the Database Language
  • Relationships are the Verbs of the Database Language

Tables—The "Nouns" of the Database Language

Tables are the objects in which we store data about the subjects of concern to that database. In a database of poetry, we'll need to talk about things like "poets" and "poems", at the very least. If the database is intended to support analysis of poems, then it needs to talk about things like "rhyme" and "meter" and so on. If it's intended to support biographical history of poets, it will need still other tables. Each of those topics—or entities to use a more database-friendly term— is a candidate to become one table. And the names of those things correspond to the names of the tables. A table called Poet", a table called "Poem", a table called "Rhyme", and a table called "Meter" would appear in this database. More importantly, only those things identified as "poems" can be stored in the poem table. On the surface, this seems obvious when we are in the natural language realm. A poem is a poem, not a helicopter. But it's not always that simple, unfortunately. Is James Joyce's Ulysses intended to be a novel or a poem? It's certainly much longer than other poems like a novel is, but the language and style of it set it apart from most novels. It's been compared to Homer's Odyssey, itself an extended poetic tour de force hundreds of pages long. So, poem or novel, or a bit of both? That's where ambiguity fits nicely in a natural language, but much less so in a database. You can talk about Ulysses without ever classifying definitively. You can't store it in a database both as a poem and as a novel.
And that means we have to identify and define the characteristics by which we classify "Poems" in this database. By identifying the appropriate attributes, we can safely store data without ambiguity.

Attributes—The "Adjectives" of the Database Language

To make it clear what is, and what isn't, a Poem, the Poem table needs to store attributes, or characteristics, that pertain to Poems and ONLY to Poems. Poems may or may not have rhyme schemes, for example. And if they do, that rhyme scheme will be one of those previously defined,  iambic pentameter, for example, or blank verse. That means one of the fields needed for the Poem table is "RhymeScheme". There are many different possible values for that attribute, but there's only one category, or field, for them all. I'll provide more information on this in a later blog.

Relationships Between Tables—The "Verbs" of the Database Language

Poets WRITE poetry. Poems ARE WRITTEN BY poets. No other relationship exists. Poems are not BAKED BY poets. Poets don't DRIVE poems. In order for the database of poetry to make sense, then, there is one and only one possible relationship between the tables for Poets and Poems.
The verb "WRITE" in this natural language formulation corresponds to the logical relationship between the Poet table and the Poem table.
Let's foreshadow a future discussion by pointing out that Poets seldom limit themselves to a single poem in their entire career, at least they don't do so by design. So, while the relationship itself is between "poet" and "poem", the reality is that we have a table of poets and a table of poems, there will be one OR MORE poems in the poem table for each of the poets.
I think that's enough for now. I'll continue my thoughts on the Language of Databases in future installments.

Tuesday, March 10, 2020

Speaking of Relationships, How You Doin' ?

Many years ago I first started thinking about relationships. At that point, I was studying language and linguistics in preparation for what I thought would be a career in teaching English. That relationship didn't work out.

Many years later I discovered MS Access and fell in love all over again, starting another relationship that's lasted quite a bit longer.

What I didn't realize at the time, though, is that much of what I'd learned about rhetorical competence way back in the 70's was applicable to understanding normalization in developing relational database applications some 20 years later. And it's still relevant today as a I look back over several decades of a (mostly) enjoyable career.

Specifically, the linear legacy in both careers is that relationships are at the heart of everything.

The concepts behind Bloc Discourse* included the realization that comprehension of the content of any individual block of text is only one of the skills needed to understand expository discourse. There is a higher level principle at work. That principle depends on understanding the relationships between various units of discourse--sentences and paragraphs as well as longer blocs.

These days, when I try to explain Normalization to new developers, I often call upon a metaphor that combines both relational database design and rhetorical competence.

First, I like to think of Entities as the database counterpart of Nouns and phrases in a natural language. They are the names of things of interest.

Second, Entities and nouns don't stand alone. We need Attributes to provide the detail that makes our Entities meaningful within a context. They are the counterparts of adjectives in a natural language. They provide the nuance and flavor that distinguish one instance of an entity from another. The red car as opposed to the blue car. In a table of cars, therefore, we have a field called "CarColor" to hold that attribute of each car.

And third, in order to construct meaningful sentences, we need the linking words, verbs in a natural language and relationships in a database language. "Customers PURCHASE cars." "Dealers SELL cars."

All three components are required to communicate meaningfully, whether we are talking about a technical article describing the structure of discourse or a relational database application describing the sales operation of Red and Blue Motor Car Company.






*A Pedagogical Model of Discourse, Willis J. Pitkin, 1973

Monday, March 9, 2020

I May Have Been Ahead of Myself By 20 Years

First, a bit of background is in order. Recently, my ex-wife and her new husband suggested, if that's the right word, I might want to remove a few boxes of my things from their garage. They've been patient for a while, so even though I have no space for it all here, I went and picked them up.

One of those boxes contains papers I wrote while working on an advanced degree in English in the early 1970's, so maybe 45 years ago or there about.

I pulled one out and re-read it. It was on the general topic of rhetorical competence and the ability to understand technical papers by recognizing the structural relationships between units of discourse in those papers. My conclusions included the observation that, while content is important to reading competently, understanding those relationships and the writers' intent in establishing them was an equally important skill. In other words, the relationships between blocs* of text are just as important as the semantic meaning of those blocs of text.

So, as I read this paragraph, I couldn't help but think I was foreshadowing the days when I moved on to the development of relational database applications and returned to that fundamental concept: relationships between entities.

And finally, during the course of this analysis, it became clear that there were different kinds of relationships involved between certain units Specifically, it is possible to identify units from a structural viewpoint exclusively, in terms of bloc signals and sequencing of units. But in many cases, the analysis here resorted to what is more properly termed an "operational" analysis, wherein other considerations were made in deciding what the relationship with and between blocs should be. (emphasis added)
Forty five years ago, the key role of relationships was already being developed in how I understood effective technical communication. Twenty years later, when I first encountered Database Normalization, I guess it's fair to say that I was already primed to "get it".

And for that I will always be grateful to the late Will Pitkin and the other members of my thesis committee, John E. Lackstrom and Karl J. Krahnke.




=====================
*I used the spelling "blocs" because much of the thinking behind this approach came from a PhD Dissertation written by one of my professors, Dr. Willis J. Pitkin, who used that term in his work.