Tuesday, March 18, 2014

Science, Religion, EAV and the Relational Model


Note: Thanks to Erwin Smout for his review of a draft and suggesting improvements.

The claims that (1) the relational model (RM) is old and, by implication, obsolete--as the market has "progressed"--and that (2) promoting it as an alternative NoSQL, Hadoop and other modern technologies is "religious" in character are quite common. They have popped again in a recent LinkedIn exchange and I responded as I usually do: why is the promotion of a scientific approach deemed religious, while ad-hoc alternatives are not?

JS: 1. I did not say that science is a cult. I said that there are RM cultists. These statements are not equivalent, but you read what you want to read to suit your agenda. Is there any evidence to support my statement, e.g. indications of a cult of personality, or of blind rejection of counter points of view? “In Codd’s Name”, no. “For Codd’s Sake”, I wouldn’t dream of suggesting – though others might – that your pompous, deluded and disingenuous rejection of anything you don’t understand represents a defensive isolationist (cultist) stance.
I am always amused by criticism of my non-rationality expressed in highly emotional language that I never use--the irony escapes my critics. And setting exaggerated strawman only to demolish them is a very old trick. I will disregard the personals--readers will have to make their own judgment whether the adjectives attributed to me fit (do I blindly reject counter points of view?) And, incidentally, is "defensive isolationist" not pompous?

Do I invoke Codd's name in arguments from authority? My paper The Last NULL in the Coffin shows Codd's 4VL solution to missing data was wrong. In The Key To Keys: A Matter of Identity I agree with C. J. Date that Codd's primary key mandate was too strong, given that PK selection is imformal/pragmatic, not formal/theoretical. Codd made a couple of mistakes, but they are overwhelmed by his enormous contribution, as I demonstrate in Truly Relational: What it Really Means, which is incomparable to what passes for "data science" these days. Many of its benefits are taken for granted now, but RM must be appreciated in its historical context and relative to what it replaced. Such recognition gives Codd his due and is not a cult of personality.
JS: Is the relational field, science, whole science and nothing but science? Of course the core of RM is science, underpinned by set theory and relational algebra. What about the 12 rules? Debatable. What about the “whole science” part – as in “the RM solves every data management problem you’ll ever encounter and <<…obviates the need…>> for anything else”? Bunk, clearly.
I don't know what the "relational field" and its "core" are. What I do know:
  • the theory of relations--a kind of sets: mathematics
  • the first order predicate logic (FOL): logic
  • the relational model of data: database management
Codd's genius was in his realization that FOL and the theory of relations are equivalent--every relational algebra expression (R-table description) is associated with a FOL expression (predicate)--and this equivalence can be exploited for database management. In other words, RM is theory (logic and math) applied to database management. A plethora of practical benefits derive from this, among them data independence, soundness (provably logically correct query answers with respect to the real world), generality with simplicity, and flexibility.

This is part of what I usually refer to as data fundamentals, of which the industry is largely dismissive and to which few practitioners are properly exposed, if at all. Online exchanges are not the appropriate vehicle to convey them (they should be an integral and mandatory part of education preceding practice). One can only alert to their relevance and importance in specific online contexts and refer to sources.

For its benefits to materialize, the theory (RM) must be concretized within (implemented by) a DBMS. Codd's 12 rules were devised in the context of the industry faking it (relabeling non-relational DBMS's relational, often by just adding a /R to their name), as informal quick rules of thumb for users to assess the validity of vendor claims. They were not claimed to be science (C. J. Date pointed out they are not systematic, independent, or complete). Another irony is that those who dismiss RM as just a theory, criticize at the same time attempts to show its practical applicability as "not scientific".

I dare JS or anybody else to produce evidence that either I or any other serious relational proponent ever claimed “the RM solves every data management problem you’ll ever encounter" and "obviates the need for anything else". What I do argue is that where there exists a sound theoretical foundation, it is preferable to ad-hoc approaches, not out of "purity" considerations, but because it yields practical benefits. (When I used the phrase "obviates the need", it was in the completely different context of foundation knowledge vs. "data patterns").

The RM is a general purpose system, and it can address any data management problem anyone could ever encounter, but of course for those properties to turn into something beneficial to its users, someone somewhere will, first, have to implement it and, second, reality will have to be modeled into a relational structure. No one has ever claimed that every problem should be solved the relational way. If straightforward sequential access to some set of data is all that will ever be needed for that set of data, then throwing in a relational engine is overkill. Experience, however, shows that even when the initial needs are modest, hyped non-relational products such as, currently, NoSQL and HADOOP end up limiting in the long run (see, for example, Why You Should Never Use MongoDB and my comments in the Anatomy of a Data Management Project series @All Analytics.
JS: 2. You think (you can “prove”) that EAV has serious flaws. So what? How “scientific” is your assessment of “serious”? Can you prove that the “serious” flaws outweigh the potential benefits? No. You have no idea about the benefits, because your mind is closed – just like any other cultist. For the record, an EAV approach provides a number of major opportunities:
  • A means to future proof solutions, e.g. to incorporate newly identified diseases/symptoms into a medical application, or a newly developed product into a sales solution.
  • A means (together with other modelling/implementation techniques) to manage sub-typing and inheritance.
  • A means to implement data driven integration with legacy systems, in both operational and data warehousing solutions.
  • Various other more subtle benefits, e.g. to do with data validation and cross-validation, or derived attribute processing.
Perhaps these hugely important business considerations have had no significance in all the systems that you have successfully delivered to clients over the decades? Your supreme confidence in the superiority of your own “foundation knowledge” proves only that you understand so very little about the issue of data management and systems development in the round. Not even wrong.
Points arising:
  • I did not claim to prove anything. I only alerted that the EAV approach has serious drawbacks documented online. I have also been contacted by users who asked me for advice to work around them. I did not assign any scientific value to my comment.
  • EAV designs can be implemented with SQL DBMS's (the closest one comes to a relational system in the industry), so my criticism is not relational per se. Rather, it refers to a mistake commonly made in database practice of focusing upfront exclusively on structure--here, ability to change the schema frequently--but at the cost of prohibitive data integrity and manipulation burden. This approach can  be considered if there is no extensive requirement for integrity and if there is no significant risk of manipulation of the data becoming too cumbersome, but that is an exception, not the rule.
  • SQL failure to support entity supertype-subtypes (ESS) is, indeed, one of the many SQL (not relational) deficiencies. I do not know if such support can be added to SQL, but I suspect EAV's drawbacks are too high a price for it.
  • I understand legacy system constraints, but they are hardly an argument against relational technology, while claiming (as JS does) that "my needle is stuck in the 70's".
I am not "supremely confident in the superiority of my foundation knowledge". I am confident that knowledge and understanding of data fundamentals are critically important for data management and system development practice, that they are sorely lacking in the industry and that the consequences are predictable and visible, as I demonstrate with evidence.


Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

10 comments:

  1. .............Note: There is a rider to these 12 rules known as Rule Zero: "For any system that is claimed to be a relational database management system, that system must be able to manage data entirely through its relational capabilities."

    but after all: 12 + 0 = 12
    eugene

    ReplyDelete
    Replies
    1. Of course. The rest of the rules are implications of it.

      Delete
  2. I keep track this blog and linkedin discussions with great interest and for me the claim "RM is old" seems to be a nonsence too. Implementations of the model (systems) can be old, not model itself. Moreover for me the RM is only real data model which potencial has not be opened fully yet.

    From my point of view, today implementations of RM is very restricted. The main point of the restriction is that the RDBMS looks at relation not only as at a form of how data can be manipulated and represented for users (according to Rule Zero), but also only as at way how an object domain can be described. Of course it's very direct and simple way, because if object domain is describer in terms of relations all data on the object domain will be guarantedly presented as set of relations. But is it the only possible way?

    I believe that more complex implementation can exist too. As an example - http://www.odbms.org/2014/03/object-oriented-management-systems-relational-databases-rxo-dbms/ The formal description of how the work is given in terms of RM fully.

    So, my question is - if system has ability to turn complex non-relationan data description into description of pure relational structures, and translate all operations on data described in complex terms into relational operation, and represent all the data as set of relations - is that system relational?

    Evgeniy

    ReplyDelete
    Replies
    1. If I understand you correctly, the short answer is no. Besides, why would you want to complexify something if a simpler way exists? That would defeat the point of RM.

      Delete
  3. Why OO languages has appeared? It's simpler to operate with data in linear memory without all these compiler. And if the compilers change the principles of linear memory organization? They just allow programmer believe that they work with complex objects (because the programmers use OO language) but in fact theu operate on computer (nothing new here).

    As a user I can want to think about a some bills or about some sales as about whole objects without all there "header" and "lines". So I describe a whole complex structures. The system analize this description and creates two internal relations which are similar to "header" and "lines", They are created automatically, but not by hand. If I try to get data from such system I use dot-separeted name sequences like "bill.items.position" and system after analizing of the name sequences builds a query on thr internal realtiona in a way like the "header" and "line" were JOINed by progammer's hand. Even if I write the complex algorithm on my bill the system transform the algoritnm into procedure on the "header" and "line" realtions. As a result such system allows programmer believe that they work with complex objects but in fact he operates with relational system. And all the data is presented in terms of relations because it's the only way possible for relational system which are executed all the actions.

    Why such system is not relational?

    Evgeniy.

    ReplyDelete
    Replies
    1. You asked a question, I answered.

      Just a couple of comments:

      1. OO is a programming, not data paradigm
      2. Memory and compiling are physical, not logical.

      That's the best I can do in a comment here.

      Delete
    2. Thank you for answer, Fabian. But without explanation it seems too short to be understood.

      On your remarks.

      1. Sure! Absolutely! Imagine a pure realtional DBMS. It has a memory in form of relvars. It has input/output operation. It has ability to operate with the data in the relvars and even to define procedires on data in the relvars.Sure RM is not programming paradigm! It's paradigm of a target mashine! I can use the mashine as it is, or I can create translator which implement some programming paradigm keeping the mashine unchanged.

      2) O no! In java-mashine or IM the memory is not physical. For users memory is defined only by commands which have be used to configure the memory and operate with data in the memory. From this point of view there is no difference between "malloc" and "create table" or between "mov" and "insert into" it's just a commands, nothig more.

      Delete
  4. RM is not pure math. It is math APPLIED to database management. The math has been adapted to suit database management.

    The question is what does the user and app developer see? If they see R-tables and apply R-algebra to manipulate them and express integrity constraints, then it the system is relational. Otherwise it is not.

    ReplyDelete
    Replies
    1. The question is that the verb "see" is not exact. Data will be always presented in form of realtions because of pure relational systems used as target machine. It's not a question even.

      To explain what I mean, let's show on something "object.reference.items.value >> cout" in C++. Here the complex expression is just a text which allow me believe that I work with complex structure of data. In fact there is not the structure in computer memory, only scalar values including pointers. So system returns me scalar value (because it's all that it can) but I believe that this value is a part of complex structure.

      In the system (I wrote about) user can write "SELECT c.d FROM a.b". Here the user thinks about about some complex structue where "a.b.c.d" is a correct dot-separated expression. Sytems analizes all the name sequences and as a result performs JOINs on internal tables so result will be always a relation. It's very important that we can perform only relational operation(because this all that target machine can do) on only tables (because it's all possible form of data in th target machine). But as I used complex names in command, I continue to believe that I works with complex structure.

      So user will see only relations, continuing to think that he operates with complex structures.

      Delete
    2. This is not something that can be discussed online, sorry.

      I suggest you write something on the subject and publish it. Subject to time constraints I will take a look and if I understand it I will comment on it.

      Delete