Sunday, September 17, 2017

Database Management: No Progress Without Data Fundamentals



I have recently -- yet again -- been accused in a LinkedIn exchange  of "gibberish without any evidence" and of claiming that "nobody know what they're doing" with databases. I will leave it to readers to judge whether (1) five decades worth of writings and teaching is "no evidence" and (2) my comments in the exchange are gibberish. Here I would like to dare anybody to find claims to that effect in any of my pronouncements. What I did, do and will say is that most data professionals do not know and understand data and relational fundamentals -- an incontrovertible fact proved not just by me[1], but also by others[2,3] and that this inhibits real progress in database management. 

As I wrote two weeks ago:
"The RDM put database management on a formal, scientific foot. Consequently, tool experience and relational terminology are insufficient -- foundation knowledge is necessary. Unfortunately, most data professionals do not possess it, in part because they have been misled by the industry and in part because few go through an education -- as distinct from training -- program that teaches the RDM and teaches it correctly. Consequently, even those with the heart in the right place defend the RDM without a full understanding, their views distorted by what passes for it (stay tuned for a debunking of such a recent example)."
I will now fulfill the promise by debunking just such a "heart-in-the-right-place" defense of the RDM. 

--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
--------------------------------------------------------------------------------- 

It  starts as follows:
"I don’t like talking about the relational theory of data. It is absolutely fundamental to any deep understanding of data, but most practitioners get along fine without it. It’s more the implementers of database management systems (DBMSs) who need to understand relational theory, so teaching relational theory to ordinary practitioners is a bit like tormenting people with irrelevant theory before you let them get on with the business at hand. Moreover, some of those who understand relational theory use their knowledge to beat other people over the head with it. I don’t want to be associated with that high-handed approach to this important theory.

But I’ve been goaded. Google made me do it. My attention was drawn to a video put out by some folks at Google, Data Modeling for BigQuery. The video is fine for the most part, but it makes some misstatements about relational theory that just drive me crazy. They repeat commonly accepted misconceptions about relational databases—misconceptions that, unfortunately, have driven some of the “advances” we’ve seen of late in the realm of database technology. There have definitely been some true advances, but some new technology is merely different without being better."

If you’re a practitioner, designing, implementing, and using databases, whether SQL or NoSQL, this won’t matter much to you, although it never hurts to learn a little more about the theory of data. However, if you are a programmer who might be the one who builds the next NoSQL mega-star that will replace decades-old technology, you need to know this, because this knowledge will enable you to blind-side every established DBMS vendor, whether SQL or NoSQL. So, I’m going to correct the record here. --Ted Hills, Understand Relational to Understand the Secrets of Data

Something is fundamentally wrong if a data professional "doesn't like talking about something that is absolutely fundamental to deep understanding of data". Now, of course, it should be obvious that it is impossible to design RDBMSs without a full grasp of the RDM, as deficient SQL products demonstrate (although not to many a DBMS designer). But to argue that it is "irrelevant theory" for database practitioners -- DBAs, application developers and yes, even some end-users -- who "can get along fine without it" is outright wrong. It is precisely such arguments that are responsible for the very misstatements that raised Hills ire and his failure to realize it suggests his own fundamental deficiency. Without relationally knowledgeable users, not only are there bad database practices and weak DBMSs, but also no incentive for DBMS designers to educate themselves on the RDM and produce true RDBMSs.

Some of Hill's arguments are correct, for example:

"Relational data theory doesn’t tell us how to store data (in tables); rather, it tells us how to understand data. On the surface, it seems that relational theory introduced the idea of storing data in tabular form, and there’s some truth to that. However, when you treat relational theory as purely logical, divorced from any assumptions about physical storage, you realize that it provides the only co mprehensive theory of data that works for data stored in any form, whether tabular (row-oriented, column-oriented, or key/value), document (XML or JSON), graph, or whatever."
The RDM is a general formal theory of data and physical independence (PI) is, indeed, a core motive and advantage of the RDM. But here I will focus on his problematic arguments.


Relations, Relationships and Levels of Representation

 "All databases, not just relational ones, mimic relationships in the real world. Document databases mimic real-world relationships between fields of a document, and certainly graph databases mimic relationships between nodes of a graph. Tabular databases mimic relationships between the columns of a table. Perhaps this belief that relational theory is uniquely about mimicking real-world relationships comes from the fact that, in ordinary English, “relation” and “relationship” are synonyms. Unfortunately, the “relation” in relational theory does not refer to relationships. To put it simply, the “relation” of relational theory is a table where the order of its rows and columns carry no information. Relationships are something else. It’s a terminological tragedy that relational theory overloaded the word “relation” and created this confusion with “relationship”."
Hills argues, correctly, that the RDM is not about tables, but then says a relation "is an unordered table" and refers to "tabular databases" that "mimic relationships between the columns of a table". First, a database relation is not a table, but a kind of mathematical set specially adapted to database management that can be visualized as a R-table on some physical medium[4,5,6]. Second, columns are not "elements of the real world", but just visualizations of attributes that represent properties in the real world[7]. Third,
the "terminological tragedy" confuses relations with relationships between relations, but a relation is a relationship among attributes -- not columns -- that mimics a relationship among object properties in the real world[8,9].

Note: Document fields, graph nodes and their relationships are elements of representations -- albeit non-relational -- of the real world. Document and graph databases are not mimicking their relationships, but rather their relationship mimick some real world relationship (unless, of course, documents are viewed as real world objects rather than as representations within their content of real world objects).



Redundancy, Performance, Integrity and Distribution

"The relational theory of data was introduced to the world in 1970 when IBM researcher E. F. “Ted” Codd published a paper, A Relational Model of Data for Large Shared Data Banks. In this paper he states, “Although it is logically unnecessary to store both a relation and some permutation of it, performance considerations could make it advisable.” In other words, far from recommending against data duplication, he advocates duplicating data if it will make the database perform better.

The desire to eliminate data duplication was not because storage costs were high in 1970. It was because data duplication leads to the potential for inconsistent data. For instance, data in one part of a database might be updated while data in another part of the database that is supposed to be identical (that is, duplicate) might be incorrectly left unchanged. The result could be that various queries on the database return inconsistent results.

Keep in mind that this paper was written in an era when no one was thinking of distributed databases with hundreds of servers working hard to keep data consistent across the nodes. The assumption in the Relational Model paper was that all the data was in a single-node database. Inconsistency would be the result, then, not of physical data distribution, but of a design allowing data duplication. The goal of making everything dependent on the key was one of correctness, not of minimizing storage. And, again, Codd makes allowance for deliberately duplicating data if it helps with performance."

Yes, but -- and it's a big but. Redundancy is permissible for performance maximization if and only if it is declared to and controlled by the DBMS via the enforcement of constraints added specifically to prevent the associated integrity risk, beyond those enforcing the relational discipline and database consistency with the conceptual model. As I demonstrated for redundancy introduced by denormalization, however, adding such constraints defeats the performance purpose[10]. I suspect that Codd would warn practitioners that they must make a conscious tradeoff of integrity for performance, but few practitioners make the choice knowingly.

While the RDM was not conceived in a distributed environment, RDBMSs are not inherently non-distributable. Early on C. J. Date explained why the relational approach is best suited for distribution and published the 12 rules for distributed DBMSs that would guarantee integrity[11]. It is significant that the only product to come closest to implementing the rules was IngresStar  -- particularly so since its data sub-language, QUEL, was relationally superior to SQL.

Note: There is an important distinction between a distributed DBMS and distributed database -- you can have the latter without the former.



Data Must Reflect Understanding of the Real World

"The reality is that, in order to understand any set of data, one needs to know which parts of the data make it unique and distinguishable from other sets of data. Those parts that make it unique are its key. That is a logical distinction, and is unrelated to whether the data is stored in rows, columns, documents, graphs, or whatever."
A relational database is a formalization of an informal conceptual model of a segment of reality of interest. The business rules comprising the model specify the meaning to be assigned to the database -- the defining properties of the real world object groups -- and are formalized as constraints, one type of which is the key constraint, which is the formalization of an object uniqueness rule. 

So to avoid confusion of levels of representation, I would phrase it differently. In order to understand how to design a database, one must first conceptualize the object groups which relations represent in the database and their defining properties. Objects in the real world are distinguishable by a combination of one or more properties, represented by a primary key (PK) -- a combination of one or more attributes with unique values. A uniqueness constraint on the relation -- representing the distinguishability rule -- constrains the PK values to be unique[12].


Can't Get There From Here

 "So Why Does Relational Theory Matter? Relational theory, when properly understood, and when divorced from all considerations of physical storage, gives us the ability to understand, describe, and design any data at the logical level. If properly applied, it gives us data independence, which means independence from the form of storage. That independence is critical, so that we can store the same data in many different physical forms without distorting or corrupting it in some way. DBMS Rock-Star, are you Listening?"
I have, of course, no argument with this, except to stress that in contrast to what passes for "data science" in the industry, the RDM is the real data science -- a formal database foundation grounded in set theory and first order predicate logic (FOPL) and there is much more to the to the RDM that than PI, not the least of which are semantic correctness and system-guaranteed logical validity[13].
"The DBMS that will wipe out all its competition will:
- Treat all possible physical data organizations as implementation details, and support them all;
- Have, as a basic feature, the ability to manage the deliberate duplication of data—not only duplication in indexes, but also deliberate duplication in denormalized tables, documents, and graphs; and
- Have a logical data language that can describe any data, no matter how it’s stored or represented, without reference to its storage or representation.
The last feature can only be achieved by a data language that’s faithful to relational data theory."
Indeed. Unfortunately, this will not happen unless and until both DBMS designers and database practitioners are properly educated on fundamentals. The industry, however, is going in the opposite direction, so I won't hold my breath.


References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER.

[2] Hepburn, N., Why even the most intelligent software architects don't understand the Relational Model.

[3] McGoveran, D., Comments on Jim Starkey's "Is the Relational Data Model Spent?"

[4] Pascal, F., The Interpretation and Representation of Database Relations.

[5] Pascal, F., Don't Confuse Tables with Relations!

[6] Pascal, F., Multidimensional Relations, Flat Tables and Logical-Physical Confusion.

[7] Pascal, F., The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[8] Pascal, F., Levels of Representation: Relationships, Rules, Relations and Constraints.

[9] Pascal, F., Relationships and Relations.

[10] Business Modeling for Database Design: Formalizing the Informal.

[11] Date, C. J., 12 Rules for Distributed DBMS.

[12] Integrity Is Not Only Referential.


[13] Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness



No comments:

Post a Comment

View My Stats