Tuesday, March 18, 2014

Science, Religion, EAV and the Relational Model

Note: This is an 11/06/17 revision. Thanks to Erwin Smout for reviewing a draft and suggesting improvements.

The claims that (1) the relational data model (RDM) is old and, by implication, obsolete -- the industry has purportedly "progressed" -- and (2) promoting it as a superior alternative to NoSQL, Hadoop and other "modern" data management technologies is "religious" in character are routine. They popped up again in a LinkedIn exchange and I responded as I usually do, by asking why the promotion of a scientific approach is deemed religious, while pushing ad-hoc alternatives is not.

"I did not say that science is a cult. I said that there are RM cultists. These statements are not equivalent, but you read what you want to read to suit your agenda. Is there any evidence to support my statement, e.g. indications of a cult of personality, or of blind rejection of counter points of view? “In Codd’s Name”, no. “For Codd’s Sake”, I wouldn’t dream of suggesting – though others might – that your pompous, deluded and disingenuous rejection of anything you don’t understand represents a defensive isolationist (cultist) stance."
I am always amused by criticism of my "non-rationality" expressed in highly emotional language that I myself never use -- the irony escapes my critics. And setting up exaggerated strawmen only to demolish them is a known trick. Readers will have to judge for themselves whether the adjectives attributed to me fit (do I blindly reject counter points of view?). And, incidentally, is "defensive isolationist" not pompous?

Do I invoke Codd's name in arguments from authority? My paper The Last NULL in the Coffin argues that Codd's 4VL solution to missing data was wrong. Codd made mistakes, but they are outweighed by his enormous contribution, as I demonstrate in Truly Relational: What it Really Means. His stuff is real data science, not what passes for it in the industry these days. Even SQL, with its poor relational fidelity, confers advantages that are now taken for granted, but the RDM must be appreciated in historical context, relative to what it replaced. Such recognition is not a cult of personality, but giving Codd his due.
"Is the relational field, science, whole science and nothing but science? Of course the core of RM is science, underpinned by set theory and relational algebra. What about the 12 rules? Debatable. What about the “whole science” part – as in “the RM solves every data management problem you’ll ever encounter and <<…obviates the need…>> for anything else”? Bunk, clearly."

    A relation is a kind of set. Codd's genius was in the realization that a fraction of first order predicate logic (FOPL) can express the relational algebra (RA) -- the set operations that he adapted to database management which, when applied to relations, produce relations. Every RA expression -- relation specification -- has an associated relation predicate (RP). In other words, the RDM is theory -- logic and mathematics -- applied to database management. Not only is this science, but the plethora of relational advantages,
    • Semantic correctness and system-guaranteed logical validity;
    • Declarative, decidable data language;
    • Physical and logical independence;
    • Simplicity and flexibility;
    derive directly from the theoretical foundation.
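The closure property mentioned above (operations applied to relations produce relations) can be illustrated with a toy sketch. It is only an illustration under stated assumptions, not a DBMS: a relation is modeled as a Python frozenset of tuples, each tuple as a frozenset of (attribute, value) pairs, and the names `relation`, `restrict`, `project`, `join`, `EMP` and `DEPT` are all hypothetical.

```python
# Toy model (assumption): a relation is a frozenset of tuples, and each tuple
# is a frozenset of (attribute, value) pairs, so neither tuples nor attributes
# have any ordering and duplicates cannot exist.

def relation(*rows):
    """Build a relation from dicts that share the same attribute names."""
    return frozenset(frozenset(row.items()) for row in rows)

def restrict(rel, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return frozenset(t for t in rel if predicate(dict(t)))

def project(rel, attrs):
    """Keep only the named attributes; duplicate tuples collapse."""
    return frozenset(
        frozenset((a, v) for a, v in t if a in attrs) for t in rel
    )

def join(r, s):
    """Natural join: combine tuples that agree on all common attributes."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return frozenset(out)

# Hypothetical sample data.
EMP = relation(
    {"emp": "Ann", "dept": "D1"},
    {"emp": "Bob", "dept": "D1"},
    {"emp": "Cy", "dept": "D2"},
)
DEPT = relation(
    {"dept": "D1", "city": "Oslo"},
    {"dept": "D2", "city": "Rome"},
)

# Every operator returns a relation, so expressions nest freely.
oslo_emps = project(
    restrict(join(EMP, DEPT), lambda t: t["city"] == "Oslo"),
    {"emp"},
)
depts = project(EMP, {"dept"})
```

Because each result is itself a relation, the output of one operator can always serve as the input of another, which is what allows arbitrarily nested expressions to be evaluated without leaving the model.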

    This is part of what I usually refer to as data fundamentals, of which the industry is largely dismissive and to which few practitioners are introduced. Online exchanges are not the appropriate vehicle to convey them -- one can only alert readers to their importance and refer them to sources. They should be an integral and mandatory part of the education preceding tool practice, but the industry is going in the opposite direction, eliminating whatever little education is left and substituting coding and tool training.

    For practical benefits to materialize, the theory underlying the RDM must be concretized within DBMS software. Codd's 12 rules were whipped up as a quick way to expose vendors who were adding a /R to the name of their DBMSs and claiming they were relational. As Date pointed out, they are not systematic, independent, or complete and, thus, are not a proper definition of relational fidelity and were never claimed to be science. It is ironic that those who dismiss the RDM as "just theory" and, by implication, not practical, at the same time criticize demonstrations of its practical applicability as "not scientific".

    I dare anybody to produce evidence that either I, or any other serious relational proponent, ever claimed that “the RDM solves every data management problem you’ll ever encounter and obviates the need for anything else". What I have always argued, and still do, is that where there exists a sound theoretical foundation, it is preferable to ad-hoc approaches, not out of "purity" considerations, but because it yields practical benefits, not the least of which is soundness.

    The RDM is a general purpose formal system of deduction applied to database management -- tuples of base relations represent axioms, tuples of derived relations represent theorems and queries applying the RA are proofs. So it can tackle any inferencing tasks. But to benefit users it (1) must be incorporated fully and correctly in DBMS software and (2) databases must be properly designed by adhering to three fundamental principles:
    • The Principle of Orthogonal Design (POOD);
    • The Principle of Expressive Completeness (POEC);
    • The Principle of Representational Parsimony (PORP);
    that jointly imply the Principle of Full Normalization (POFN) (although this is an as yet unproven conjecture). Neither of these two conditions has yet been met, and criticizing the RDM for the poor relational fidelity of SQL and its implementations, and for the poor jobs of database designers, is absurd and defeats the purpose.
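The deductive reading given above (base tuples as axioms, derived tuples as theorems, queries as proofs) can be made concrete with a very small sketch. It assumes plain Python sets of pairs as a stand-in for relations; the names `PARENT` and `GRANDPARENT` and the sample facts are hypothetical.

```python
# Base relation (axioms): each pair (p, c) asserts "p is a parent of c".
PARENT = {("Ann", "Bob"), ("Bob", "Cy"), ("Bob", "Dee")}

# Derived relation (theorems): join PARENT with itself where the child of one
# tuple is the parent in the other, then keep only the outer two values.
# Each resulting pair (g, c) asserts "g is a grandparent of c".
GRANDPARENT = {
    (g, c)
    for (g, p) in PARENT
    for (p2, c) in PARENT
    if p == p2
}
```

No grandparent fact was ever stored; each derived tuple follows from two stored ones, and the query expression is the derivation.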

    If straightforward sequential access to some set of data is all that will ever be needed for that set of data, then throwing in a relational engine is overkill. Experience, however, shows that even when the initial needs are modest, hyped non-relational products such as, currently, NoSQL and Hadoop end up limiting in the long run (see, for example, Why You Should Never Use MongoDB and my comments in the Anatomy of a Data Management Project series @All Analytics).
    "You think (you can “prove”) that EAV has serious flaws. So what? How “scientific” is your assessment of “serious”? Can you prove that the “serious” flaws outweigh the potential benefits? No. You have no idea about the benefits, because your mind is closed – just like any other cultist. For the record, an EAV approach provides a number of major opportunities:
    • A means to future proof solutions, e.g. to incorporate newly identified diseases/symptoms into a medical application, or a newly developed product into a sales solution.
    • A means (together with other modelling/implementation techniques) to manage sub-typing and inheritance.
    • A means to implement data driven integration with legacy systems, in both operational and data warehousing solutions.
    • Various other more subtle benefits, e.g. to do with data validation and cross-validation, or derived attribute processing.
    Perhaps these hugely important business considerations have had no significance in all the systems that you have successfully delivered to clients over the decades? Your supreme confidence in the superiority of your own “foundation knowledge” proves only that you understand so very little about the issue of data management and systems development in the round. Not even wrong."
    Points arising:
    • I did not claim to prove anything. I only pointed out that the EAV approach has serious drawbacks that are amply documented. I have been contacted by users who asked me for advice on working around them. I did not assign any scientific value to my comments.
    • EAV designs can be implemented with SQL DBMSs (the closest one comes to a relational system in the industry), so my criticism is not relational per se. Rather, it alerts to the mistake commonly made in database practice of focusing upfront exclusively on structure -- here, the ability to change the schema frequently -- at the cost of a prohibitive long-term data integrity and manipulation burden. This approach can be considered if and only if (1) there is no significant requirement for integrity and (2) there is no significant risk of manipulation of the data becoming too cumbersome. Such circumstances are extremely rare exceptions in database management, not the rule.
    • SQL's failure to support entity supertype-subtypes (ESS) is, indeed, one of the many SQL (not relational) deficiencies. I do not know if such support can be added to SQL, but I suspect EAV's drawbacks are too high a price for it.
    • I understand legacy system constraints, but they are hardly an argument against superior relational technology. Certainly not while claiming that my "needle is stuck in the 70's".
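The integrity and manipulation burden attributed to EAV in the points above can be illustrated with a small sketch. It uses Python's standard `sqlite3` module purely as a convenient SQL stand-in; the tables `patient` and `patient_eav`, their columns, and the sample rows are hypothetical and not taken from any system discussed in the exchange.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Conventional design: one column per attribute, constraints declared once.
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT    NOT NULL,
        age        INTEGER NOT NULL CHECK (age BETWEEN 0 AND 150)
    );
    -- EAV design: every value is a string keyed by an attribute name.
    CREATE TABLE patient_eav (
        entity_id INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, attribute)
    );
""")

con.execute("INSERT INTO patient VALUES (1, 'Ann', 42)")
con.executemany(
    "INSERT INTO patient_eav VALUES (?, ?, ?)",
    [(1, "name", "Ann"), (1, "age", "42")],
)

# Integrity: the declared constraint rejects an impossible age ...
try:
    con.execute("INSERT INTO patient VALUES (2, 'Bob', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# ... while the EAV table has no per-attribute constraint to violate.
con.executemany(
    "INSERT INTO patient_eav VALUES (?, ?, ?)",
    [(2, "name", "Bob"), (2, "age", "minus five")],
)

# Manipulation: the same question asked of both designs.
conventional = con.execute(
    "SELECT name, age FROM patient WHERE age > 40"
).fetchall()

eav = con.execute("""
    SELECT n.value, CAST(a.value AS INTEGER)
    FROM patient_eav AS n
    JOIN patient_eav AS a ON a.entity_id = n.entity_id
    WHERE n.attribute = 'name'
      AND a.attribute = 'age'
      AND CAST(a.value AS INTEGER) > 40
""").fetchall()
```

In this sketch the EAV query needs one self-join per attribute plus a cast, and the nonsensical age is stored without complaint, so any validation would have to be written and maintained in application code rather than declared to the DBMS.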
    I am not "supremely confident in the superiority of my foundation knowledge". I am confident that knowledge and understanding of data fundamentals are critically important for data management and system development practice, that they are sorely lacking in the industry and that the consequences are predictable and visible, as I have demonstrated with evidence for more than four decades.
