Saturday, March 11, 2017

What Is a True Relational System (and What It Is Not)

(This is a rewrite of a 12/10/16 post, to bring it in line with McGoveran's interpretation of Codd's RDM.)

Here's what's wrong with last week's picture, namely:
"A quick-and-dirty definition for a relational database might be: a system whose users view data as a collection of tables related to each other through common data values.

The whole basis for the relational model follows this train of thought: data is stored in tables, which are composed of rows and columns. Tables of independent data can be linked, or related, to one another if they each have columns of data that represent the same data value, called keys. This concept is so common as to seem trivial; however, it was not so long ago that achieving and programming a system capable of sustaining the relational model was considered a longshot with limited usefulness.

If a vendor’s database product didn’t meet Codd’s 12 item litmus tests, then it was not a member of the club ... these rules determine whether the database engine itself can be considered truly “relational”. These rules were constructed to support a data model that would ensure the ACID properties of transactions and also eliminate a variety of data manipulation anomalies that frequently occurred on non-relational database platforms (and **still do**)." --Kevin Kline, SQLBlog.com


Note: I know that insisting on not to use the terms database and DBMS interchangeably is dismissed as pedantry, but the database field is so loaded with confused and careless use of terms with costly consequences [1], that I will continue to insist. Here's one example why: for the practical advantages of the RDM to materialize, both the DBMS and databases must be truly relational, which has a distinct meaning for each.


A Relational System Is a Formal System


Quick and dirty is what we had prior to the RDM and insisting on it is what is regressing the industry to those pre-RDM and even pre-database unsound, complex and inflexible products that the RDM saved the industry from in the 60s. It is hardly the best way to explain a technology intended to put database management on a rigorous and sound scientific basis.

The most important characterization of a relational system--which includes both a RDBMS and relational databases--is that it is grounded in the dual theoretical foundation of simple set theory (SST) and first order predicate logic (FOPL) [2]. It is this that confers the practical advantages, the core being system-guaranteed logical and semantic correctness of query and update results.

Thhat means:
  • Compliance of the DBMS with the SST/FOPL set-oriented formal structural, manipulative and integrity requirements of the RDM and
  • Adherence of databases to the Principle of Full Normalization (POFN) which integrates three design principles:
  • Principle of Representational Minimality (PORM)
  • Principle of Orthogonal Design (POOD)
  • Principle of Expressive Completeness (POEC)
The relational benefits will not materialize without them, but all the costly consequences do [3]. Practically nobody in the industry understand this, which is how SQL was developed and how it has been effectively sold as relational.


Codd's 12 Rules Are Deprecated


While a true RDBMS will comply with Codd's 12 Rules, compliance is insufficient for relational fidelity. The rules were initially whipped up by Codd in response to DBMS vendors promoting their non-relational products as relational by prefixing or suffixing their names with an R. They are neither a complete and systematic set, nor independent of one another and were intended just as a quick and dirty test  that could expose false claims [4]. The rules are deprecated precisely because they are quick and dirty, another example of why that is not a good idea.

The rules and the data model have nothing to do with transactions and the ACID properties. In a relational context anomalies are the undesirable side-effects of operating on relations that are not fully normalized, which results in loss of guaranteed semantic correctness [5].
 


SQL DBMSs Are Not Relational


The numerous reasons which disqualify SQL DBMSs as truly and fully relational have been amply documented and I will not repeat them here. The consequence of confusing RDM with SQL is that the latter fails to confer anywhere near the RDM benefits, which is blamed on SQL being relational, rather on the fact that it isn't. If you are one of those who ask "If the RDM is so great, why hasn't a true RDBMS ever been implemented?", there is no need when everybody believes
SQL systems are it. IBM and Oracle have a lot to answer for regarding the stagnation of--indeed, regress in--the database field. Lack of, or poor education is another culprit.


R-tables Are Not Relations


A second confusion is of relations with tables. A R-table--not just any table--is only one way to represent a relation on some physical medium (e.g., paper or screen) and, as Codd explicitly stated, the physical arrangement of tuples and attributes as rows and columns "is not part of the relational view of data" [6]--i.e., it encodes no information. The confusion is responsible for erroneously criticizing the RDM as simplistic because "flat tables cannot handle complex data". But an image of a thing is not the same as the thing itself--do we deem people or objects in photographs flat? Relations are multidimensional: each attribute is a dimension. R-tables are depictions of relations on a medium that is flat.

Any system can present users with a tabular view of data, but that does not make it relational.



Relations Are Not Stored


The data of only some of them are, and if they are, they are not stores as relations. While a direct image representation in storage is possible (sequential files are the storage isomorph of tables), it is neither mandatory and certainly not the only way. Logical-physical confusion (LPC) is responsible for the erroneous criticism of RDBMSs for poor performance because "they don't explicitly handle the physical level". This is backwards: physical independence (PI) means that the RDM gives DBMS designers complete freedom to deploy any storage and access methods to maximize performance and DBAs to change them at will without impairing applications. Some commercial SQL products offer some of that, but not enough; others store attribute rather than tuple data. These are implementation details, which are the exclusive determinants of performance, with which the math and logic of the RDM have nothing to do.

Note: I would have liked Kline to dispel the misconception that 'relational' does not come from "relating tables", but rather from mathematical relations, which are sets (as one commenter correctly pointed out).

"Representing the same data value" is a rather poor expression. More accurately, the related relations are subject to a referential constraint that enforces matches between values of a referencing FK and those of the referenced PK. The two keys must be defined on the same domain--i.e., they must represent the same real world property.



Relational Systems Demand Knowledge and Reason


Kline's claim of superiority of SQL databases over the hierarchic and network systems that it replaced is, of course, correct. But given the nature of those systems, it wasn't very hard to be better and SQL is far from being truly and fully relational (some of the commenters provide good explanations why). 


Kline claims that "where NoSQL databases excel at storing data that is moderately important and requires eventual consistency, SQL database excel at storing data that is of paramount importance and requires immediate consistency." I very much doubt that this is the real attraction of NoSQL. They are often promoted for "when you haven't figured out what exactly you want to do with the data yet". The relational approach requires that you know and think carefully upfront about your information needs and model the business and design the database accordingly, while NoSQL products invite design laziness by letting you defer the upfront effort at the expense of usually costlier future effort of using the database. Expediency is too strong a temptation to resist, but what Codd realized--and many don't want to hear--is that by the time you figure out how to use the data, a RDBMS is the least costly solution and you better resist the temptation.



References


[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS (Database Debunkings, 2017).


[2] Pascal, F., Truly Relational: What It Really Means, PRACTICAL DATABASE FOUNDATIONS paper #5, www.dbdebunk.com.

[3] See Why even the most intelligent software architects don't understand the RDM on the LINKS page.
 
[4] Codd, E. F., Does your DBMS run by the rules?, ComputerWorld (1985).

[5] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS (Alternative Technologies, forthcoming).

[6] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Comm. ACM 13(6), 1970.


Note: I will not publish or respond to anonymous comments. If you want to say something, stand behind it. Otherwise don't bother, it'll be ignored.





Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

1 comment:

  1. Regarding "the core being system-guaranteed logical and semantic correctness of query and update results", it is easy to show that any dbms that defines an "insert" operation on a base relation as set union of itself and a set of tuples when another base relation has an existing tuple that does not agree on common attributes before the operation but does agree after the operation is not relational. Logically, such a result involves "deletion" aka set difference on the latter relation as well as the set union on the former.

    I imagine this logical error must include pretty much all dbms'es today, including ones that claim to be "truly" relational. (I think you have previously stated base updates should be avoided.)

    I call such systems pseudo-relational file systems. One of the characteristics of a file system is that any structurally compatible (in the physical sense) result is allowed, in other words what is not logically possible can be ignored, as in the above example where User semantics are allowed to trump logical results.

    ReplyDelete