Sunday, May 18, 2014

For Codd's Sake -- UPDATED

UPDATE: Correction on 6/8/14

This is a response to comments by a reader on one of my posts.
L: I realize that you have taken much further what Codd wrote on the first page of his 1970 paper but it's still remarkable how many people in the data business are not able to refer to, let alone talk productively, about his "natural structure of data". And many treat RT as a fait accompli when it is still evolving, not to mention those who, as you've pointed out many times, treat SQL gizmos such as outer join as if they come from RT when they don't.

I have two joint interpretations of the Codd's meaning of 'natural', that are actually not inconsistent with one another.
  • Everything based purely on content (values), not on computing internals
  • Application-independent, in their own right
The failure L refers to is due, on the one hand, to application developers who have an application-specific perspective and disregard for theory and data fundamentals.
L: Surely, as I think you suggest, "optimal" means the results we need for the least effort.
When I use 'optimal' I often mean the generality-to-simplicity ratio (g2s)  of a data model: broadest representational scope with least complexity. I am unaware of an alternative model that betters RM on it. Indeed, about the only two that come close to being data models in the Codd sense are the hierarchic and network models, which the RM has replaced decades ago due to their inferior ratio (for the criteria on which to compare data models see Business Modeling for Database Design).
L: Codd set the stage by giving us a vernacular, getting rid of the fields everybody was using forty years ago and replacing them with attributes.
That is correct, although we should not confuse two distinct uses of 'field'. Codd replaced the one associated with physical data files that preceded databases. The other one is the display-field appearing on the screen at the user interface.
L: He got rid of specific navigations and replaced them with logic that applies to all navigations. So he came up with a smaller context and a smaller language for the same purpose.  In theory, the logic amounts to a grand total of two or three operators, not too many for a child to learn.  Lots of people think the many variations of SQL operators reflect complexity of RT but the ones that are valid are based only on those two or three. Given a relational structure, those are all that's needed to discuss the meaning of "any analytical project".
I am not sure that just the basic logic operators are sufficient from a user's perspective. Even Codd realized that and did not limit the relational algebra strictly to its primitive operators, but added convenient combinations thereof e.g. join. Neither am I convinced that pure logic is always intuitive or easy just because the language is smaller, as any formulator or validator of integrity constraints knows (see the Appendix of the above mentioned paper). It is precisely for this reason that education is critical, yet how many data professionals learn logic, including CS graduates? That also explains not just the poor knowledge in the industry, but the pride commonly taken in disregarding/dismissing sound theoretical foundations.
L: Regarding the choice of relations to use, Codd introduced some approaches for data design too, but it's not clear to me that the objective of avoiding redundancy is the best way of describing what is needed. It's been years since I read your normalization paper and I guess much of that has changed by now, plus from what I gather your approach is more top-down than that of some other experts, but what I've seen elsewhere (including continuing refinements by various experts) suggests that avoiding redundancy in data is still the main focus. Sometimes I think they take that purpose too far which might obscure RT's usefulness for the rest of us, eg., there hasn't been as much written about the ambiguity that results when data designs lose the ability to distinguish the propositions involved in a join.
If I remember correctly, even the initial version of my paper specified several benefits of full normalization, aside from avoiding redundancy. Indeed, that was one emphasis of the paper. In fact, I am currently working on several posts @All Analytics in which I demonstrate the very problem you refer to--wrong query results o due poor understanding of normalization.
L: Adhoc ambiguity seems to be rampant by other people who ignore some of what you've said elsewhere, e.g., in other techniques where everything is specific to an application and people pretend that "new clothes" like XML can connect different apps.
True, but on the other hand redundancy carries a major integrity risk for the common denormalization for performance, of which most practitioners are unaware. They do not realize that if and when they get performance gains, if any, after denormalizing--and that by no means always happens, or is significant--it comes not from the change in logical design per se (which is impossible), but rather from the failure to control the redundancy by adding the necessary integrity constraints. Those constraints effect the very joins denormalization is intended to avoid. In other words, any gains come from trading consistency for them (see The Costly illusion: Normalization, Performance and Integrity).
L: Ambiguity is also just a cover for contradictions, eg., most SQL systems assume an exception should be raised when a duplicate key insert is attempted, but from a logical perspective the exception isn't needed, only a result that excludes the rows representing the two contradictory propositions.  Doing otherwise is an arbitrary choice, not a logical requirement.
If I understand this correctly, it is important to distinguish between database functions that are the domain of the DBMS and application functions that are in the purview of application programs. The latter are mainly communication with the user and presentation of results. An exception can be used to inform the user that he attempted to insert a duplicate row, or that he violated some constraint, with some action conditioned on that.
L: I mention this because much of what I read here and elsewhere has to do with application choices people make as opposed to theoretical conclusions. The relational fans, myself included, are sometimes guilty of trying to make that theory into more than it is, not recognizing that Codd was applying first-order logic to a purpose, not implementing it as is. These are some reasons why it's strange to me why should it be even asked what is optimal. People should ask what other approach is more optimal than the RT one!  So far, we have nothing more universal than the theory Codd started, even though he didn't finish it.
There are so many factors involved in this, that one would have to read my entire body of writings and teaching to explain it and that still would not be exhaustive. A major one is, again, educational failure.

Indeed, the common dismissal in the industry of the RM is rooted in failure to realize that it is theory applied to database management, which Codd adapted to practical needs. One example is keys: a mathematical relation does not have keys because its tuples are unique by definition. However, their uniqueness is by virtue of the combination of all their values. Database R-table rows, however, represent facts about real world entities, which are unique by means of one or more identifying attributes, hence the necessity of the key constraint. It is highly ironic that many of those who dismiss the RM as "just theory" do not consider keys mandatory, or are proponents of universal deployment of surrogate keys that have no real world counterpart.

And, by the way, another missed aspect is that RM is not just the theory of relations, but has a dual theoretical foundation--the equivalence of first order predicate logic to it: a FOPL expression is equivalent to a relational expression! That's where Codd's genius lies--his realization that this equivalence can be exploited fordatabase management.

No comments:

Post a Comment

View My Stats