Friday, September 7, 2012

Forward to the Past: "Out-clevering" the DBMS



When I see a title like "Software Engineering: What are some clever programming techniques that aren't used often enough?" my antennae come up. More often than not (1) programmers consider certain things clever only because they are not familiar with history (2) in its quest for solving problems engineering sometimes tries to out-clever science, which usually spells trouble.


Timothy Bushnell writes:
I won't ask for "algorithms" or "data structures" specifically, because they require a formal definition and those questions already exist. Instead I'm intentionally being more abstract and asking for "techniques". That way you can feel free to pull out any abstract thing you thought was worth a little exposure.
More abstract with "techniques" than with formal structures?
If you are programming against a database, pay some attention to getting the data structures right before you start coding, as will make a considerable difference to performance and programming.
This is good. We are in agreement.
Don't be too rigid about normalisation. For instance, I have no problem storing the product's SupplierID in the purchase table as well as the product table. This may be duplication but can help later relating customers to suppliers without requiring the extra join. This is just an example. Disk space is not as expensive as it was - a little redundancy is okay.
I would have been willing, if reluctantly, to accept this if it was qualified as follows:
  • Given certain specific DBMS implementations;
  • Having ascertained that all physical factors affecting performance are not responsible;
  • Having considered the costs of redundancy in terms of its control (either constraints, which need to be coded and will offset performance gains, if any, or acceptance of certain amount of inconsistency; and of bias for some applications and against others, including future ones).
Disk space is not a real issue here, but the illusion that denormalization is costless is.
Consider using integer data types for boolean values - this will make it easier to add extra values like "Not Sure" or "Not Known" if addition to normal Yes/No of the requirement changes. Think ahead.
I agree that in accordance with the principle of cautious design we should always minimize possible disruptions caused by changes, but it is not clear how integer data types instead of Yes/No would be a significant improvement.
If you have multiple boolean fields in a record, consider combining them into a single Integer field. For instance in a User record create a single UserType field instead of 6 separate field for  IsTrainee, IsManager, IsTrainer, IsHR, IsSupplier, IsSupport. By assigning 1,2,4,8 and 16, 32 as "yes" values for these then we can say that a value of 3 in this UserType field tell us that they are both Trainee and a Manager; 36 that they are the Trainer, and they are responsible for Support. The advantage of combining these into one field is that is another type can be added - eg. IsFirstAider=64 - without adding a field.
When it comes to column overloading, however, Bushnell and I part company. It was gratifying to see readers demonstrate some knowledge of fundamentals. Although the first reader gave the post "+1 on considering de-normalization for performance" without qualifications, he also wrote:
I would choose to have the database be more self documenting with explicit fields. You can always create special multi-key indexes on the combinations of booleans that are used frequently and modern databases often use integers under the hood rather than bits for performance reasons.
An anonymous reader discerned the incorrect perspective on levels of abstraction and the importance of cautious in the design principle:
I have to disagree with some of the items in this post. Unless you're really tight on space, then type safety and proper abstractions trump the ideas here. To be honest, the suggestions sort of remind me of the kind of behavior C macro enthusiasts partake in. Some of the "think ahead" syndrome inflicted some of my coworkers, and they decided that that meant "store everything as strings, so we could store potentially any data format in any syntax we'd like." This ended up being disastrous out in the field.
But the best answer came from Toby Thain. How do I know? Because he referred Bushnell to dbdebunk :).
With respect, this advice is backward from a correctness point of view and violates the letter and spirit of the relational model—please see any decent relational text (try E.F. Codd, C.J. Date, Hugh Darwen, or Fabian Pascal).

Redundancy is NOT OK in a relational data model. Using excessively general types is NOT OK. Putting composite values in a column is NOT OK (relations provide the means to model this correctly). Review the "information principle".

It gets worse, though. With your bitset representation—or any non-relational representation—typically you can't index & join efficiently. In this case, you would also need relations to add attributes to subtypes. And non-relational representation complicates queries.

Also you did not mention constraints of any kind, which actually are essential to integrity.
But it would not be me if I did not quibble a bit with the reply:

1. I would not use 'correctness'. Violations of relational principles inflict costs, so the more appropriate term would be 'not very useful' or 'has drawbacks'.

2. Not "in a relational data model", but rather 'in relational database design'.

Overloading of columns does, indeed, violate Codd's core Information Principle (IP):
The entire information content of a relational database is represented in one and only one way: namely, as [explicit] attribute [column] values within tuples [rows] within relations [R-tables].
R-tables are by definition in First Normal Form (1NF) or normalized by obeying the discipline of single-valued columns. Bushnell's trick is too clever by half, because it "hides" information from the DBMS, violating IP and 1NF in the process. Keeping information secret from the DBMS means that it will be only known by those applications written to exploit it and, therefore, not amenable to relational manipulation by the DBMS and its integrity enforcement. 
  • What happens when the data changes--which applications will be affected?
  • Who will ensure that all pertinent applications will protect data integrity?
  • What about performance optimization by the DBMS? 
  • What happens when the developers of the pertinent applications leave?
There is an anecdote that circulates and I paraphrase it. Many years back Chris Date and Michael Stonebraker sat on a conference panel. Date said "We have to give up on the older, mainframe programmers who are used to old ways, and hope that the young generation of developers learn the relational model in school and will apply it". To which Stonebraker, a professor at Berkeley at the time, replied "Chris, you don't know what you're talking about. All these young guys were not around when we had all those horrible problems that we got rid of with the relational model and are reinventing all of them."

In this context, see this week's quote. Those who don't know the past...

No comments:

Post a Comment

View My Stats