Wednesday, March 16, 2016

It's all about the data.


That's so obvious it seems silly to even comment on it in a page that's all about database design, doesn't it? But, every day, while addressing Access questions on my favorite Access forum, Utter Access, I run across questions can only be answered by looking at the specific data involved, not at the code used to manipulate it.

A recent example might help explain what I’m talking about. I’ve rephrased the question so as to avoid making it too easy to identify the source.

“My query raises a division by zero error. The query includes two calculated fields. The data is from three subqueries. The SQL for the final is shown below. How do I avoid the division by zero issue?”
Not picking on anyone, but the basic mathematician in me says “this ain’t a database question, it’s a math question. You avoid division by zero by not including zeros in the divisor.” But to the questioner, that simply hadn’t occurred, I guess. He or she was looking for “an Access solution” to the math problem.

There are, of course, two answers to this.

First, if records in the underlying tables have either Nulls or zeroes in one of the fields going into the calculation, then those values have to be resolved before you even START writing queries against that data. Exclude those records from the selection before you try to do any math on the remaining records.

There is, also in that sense, “an Access solution” to the question, "How do you handle Nulls and Zeroes in SQL so that they don’t blow up calculations?" The answers, of course, involve functions to convert Nulls and zeroes, as needed, when they appear. But that’s not the point of today’s comments.

Second, there is an even earlier, more fundamental, question to be addressed. If you are doing math (and a division by zero  error definitely signals that math is being attempted), then you have to decide whether Null is valid for the data at hand, and whether Zero is valid for the data at hand. If so, then why do you then want to do math on either of those values, knowing that your calculation is not valid under any circumstances?

For example, if you want to calculate the average number of days between the date an order is placed and the date it is shipped, you have to decide, right at the beginning, if you want to try to include orders without a ship date. Those orders are still being processed and not yet shipped. I submit the answer to that one is obvious. You must exclude them because it’s only meaningful to ask about average processing days for orders that were actually processed and shipped. So, by the time you write that SQL with a division in it, there are no nulls to fret over.

Another example would have to do with calculating error rates in a manufacturing process. Let’s say you want to calculate a percentage of errors detected in Quality Control versus errors reported by customers after products are delivered. Unfortunately, if the calculation is QA Errors divided by Customer Reported Errors, it’s entirely possible (and one would have to hope, likely) to have one or more errors found in QA and Zero reported by customers for a particular product. See what’s going to happen there? Yup, a divide by Zero error.

The answer to that one is equally obvious, I think. You don’t do the arithmetic that way in the first place. You come up with a more appropriate, mathematically valid, way to calculate this metric. And how you do that is as much a business rule as it is a math problem. Maybe the next approach might be, well, okay, lets SUM the QA and Customer Reported errors first and then divide the QA errors by that total to get a ratio of QA errors. Uh uh! If there are no errors of either sort, then you’re still dividing by Zero.

At this point, I will step in and acknowledge that there is an Access way to handle it, but you can only get there by understanding the math—and the business rule—behind it.

It's all about understanding your data first. After that, it's all just code.