Tuesday, March 18, 2014

Science, Religion, EAV and the Relational Model

Note: Thanks to Erwin Smout for his review of a draft and suggesting improvements.

The claims that (1) the relational model (RM) is old and, by implication, obsolete--as the market has "progressed"--and that (2) promoting it as an alternative NoSQL, Hadoop and other modern technologies is "religious" in character are quite common. They have popped again in a recent LinkedIn exchange and I responded as I usually do: why is the promotion of a scientific approach deemed religious, while ad-hoc alternatives are not?

JS: 1. I did not say that science is a cult. I said that there are RM cultists. These statements are not equivalent, but you read what you want to read to suit your agenda. Is there any evidence to support my statement, e.g. indications of a cult of personality, or of blind rejection of counter points of view? “In Codd’s Name”, no. “For Codd’s Sake”, I wouldn’t dream of suggesting – though others might – that your pompous, deluded and disingenuous rejection of anything you don’t understand represents a defensive isolationist (cultist) stance.
I am always amused by criticism of my non-rationality expressed in highly emotional language that I never use--the irony escapes my critics. And setting exaggerated strawman only to demolish them is a very old trick. I will disregard the personals--readers will have to make their own judgment whether the adjectives attributed to me fit (do I blindly reject counter points of view?) And, incidentally, is "defensive isolationist" not pompous?

Do I invoke Codd's name in arguments from authority? My paper The Last NULL in the Coffin shows Codd's 4VL solution to missing data was wrong. In The Key To Keys: A Matter of Identity I agree with C. J. Date that Codd's primary key mandate was too strong, given that PK selection is imformal/pragmatic, not formal/theoretical. Codd made a couple of mistakes, but they are overwhelmed by his enormous contribution, as I demonstrate in Truly Relational: What it Really Means, which is incomparable to what passes for "data science" these days. Many of its benefits are taken for granted now, but RM must be appreciated in its historical context and relative to what it replaced. Such recognition gives Codd his due and is not a cult of personality.
JS: Is the relational field, science, whole science and nothing but science? Of course the core of RM is science, underpinned by set theory and relational algebra. What about the 12 rules? Debatable. What about the “whole science” part – as in “the RM solves every data management problem you’ll ever encounter and <<…obviates the need…>> for anything else”? Bunk, clearly.
I don't know what the "relational field" and its "core" are. What I do know:
  • the theory of relations--a kind of sets: mathematics
  • the first order predicate logic (FOL): logic
  • the relational model of data: database management
Codd's genius was in his realization that FOL and the theory of relations are equivalent--every relational algebra expression (R-table description) is associated with a FOL expression (predicate)--and this equivalence can be exploited for database management. In other words, RM is theory (logic and math) applied to database management. A plethora of practical benefits derive from this, among them data independence, soundness (provably logically correct query answers with respect to the real world), generality with simplicity, and flexibility.

This is part of what I usually refer to as data fundamentals, of which the industry is largely dismissive and to which few practitioners are properly exposed, if at all. Online exchanges are not the appropriate vehicle to convey them (they should be an integral and mandatory part of education preceding practice). One can only alert to their relevance and importance in specific online contexts and refer to sources.

For its benefits to materialize, the theory (RM) must be concretized within (implemented by) a DBMS. Codd's 12 rules were devised in the context of the industry faking it (relabeling non-relational DBMS's relational, often by just adding a /R to their name), as informal quick rules of thumb for users to assess the validity of vendor claims. They were not claimed to be science (C. J. Date pointed out they are not systematic, independent, or complete). Another irony is that those who dismiss RM as just a theory, criticize at the same time attempts to show its practical applicability as "not scientific".

I dare JS or anybody else to produce evidence that either I or any other serious relational proponent ever claimed “the RM solves every data management problem you’ll ever encounter" and "obviates the need for anything else". What I do argue is that where there exists a sound theoretical foundation, it is preferable to ad-hoc approaches, not out of "purity" considerations, but because it yields practical benefits. (When I used the phrase "obviates the need", it was in the completely different context of foundation knowledge vs. "data patterns").

The RM is a general purpose system, and it can address any data management problem anyone could ever encounter, but of course for those properties to turn into something beneficial to its users, someone somewhere will, first, have to implement it and, second, reality will have to be modeled into a relational structure. No one has ever claimed that every problem should be solved the relational way. If straightforward sequential access to some set of data is all that will ever be needed for that set of data, then throwing in a relational engine is overkill. Experience, however, shows that even when the initial needs are modest, hyped non-relational products such as, currently, NoSQL and HADOOP end up limiting in the long run (see, for example, Why You Should Never Use MongoDB and my comments in the Anatomy of a Data Management Project series @All Analytics.
JS: 2. You think (you can “prove”) that EAV has serious flaws. So what? How “scientific” is your assessment of “serious”? Can you prove that the “serious” flaws outweigh the potential benefits? No. You have no idea about the benefits, because your mind is closed – just like any other cultist. For the record, an EAV approach provides a number of major opportunities:
  • A means to future proof solutions, e.g. to incorporate newly identified diseases/symptoms into a medical application, or a newly developed product into a sales solution.
  • A means (together with other modelling/implementation techniques) to manage sub-typing and inheritance.
  • A means to implement data driven integration with legacy systems, in both operational and data warehousing solutions.
  • Various other more subtle benefits, e.g. to do with data validation and cross-validation, or derived attribute processing.
Perhaps these hugely important business considerations have had no significance in all the systems that you have successfully delivered to clients over the decades? Your supreme confidence in the superiority of your own “foundation knowledge” proves only that you understand so very little about the issue of data management and systems development in the round. Not even wrong.
Points arising:
  • I did not claim to prove anything. I only alerted that the EAV approach has serious drawbacks documented online. I have also been contacted by users who asked me for advice to work around them. I did not assign any scientific value to my comment.
  • EAV designs can be implemented with SQL DBMS's (the closest one comes to a relational system in the industry), so my criticism is not relational per se. Rather, it refers to a mistake commonly made in database practice of focusing upfront exclusively on structure--here, ability to change the schema frequently--but at the cost of prohibitive data integrity and manipulation burden. This approach can  be considered if there is no extensive requirement for integrity and if there is no significant risk of manipulation of the data becoming too cumbersome, but that is an exception, not the rule.
  • SQL failure to support entity supertype-subtypes (ESS) is, indeed, one of the many SQL (not relational) deficiencies. I do not know if such support can be added to SQL, but I suspect EAV's drawbacks are too high a price for it.
  • I understand legacy system constraints, but they are hardly an argument against relational technology, while claiming (as JS does) that "my needle is stuck in the 70's".
I am not "supremely confident in the superiority of my foundation knowledge". I am confident that knowledge and understanding of data fundamentals are critically important for data management and system development practice, that they are sorely lacking in the industry and that the consequences are predictable and visible, as I demonstrate with evidence.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment