Tuesday, October 16, 2012

What a Truly Relational System Is (and What It Is Not)

Rewrite 2/1/17.

Here's what's wrong with last week's picture, namely:

"A quick-and-dirty definition for a relational database might be: a system whose users view data as a collection of tables related to each other through common data values.

The whole basis for the relational model follows this train of thought: data is stored in tables, which are composed of rows and columns. Tables of independent data can be linked, or related, to one another if they each have columns of data that represent the same data value, called keys. This concept is so common as to seem trivial; however, it was not so long ago that achieving and programming a system capable of sustaining the relational model was considered a longshot with limited usefulness.

If a vendor’s database product didn’t meet Codd’s 12 item litmus tests, then it was not a member of the club ... these rules determine whether the database engine itself can be considered truly “relational”. These rules were constructed to support a data model that would ensure the ACID properties of transactions and also eliminate a variety of data manipulation anomalies that frequently occurred on non-relational database platforms (and **still do**)." --Kevin Kline, SQLBlog.com


A Relational System Is a Formal System


Quick and dirty is what we had prior to the RDM and insisting on it is what is regressing the industry to those pre-RDM and even pre-database unsound, complex and inflexible products that the RDM saved the industry from in the 60s. It is hardly the best way to explain a technology intended to put database management on a rigorous and sound scientific basis.

The most important characterization of a relational system--which includes both a RDBMS and relational databases--is that it is grounded in the dual theoretical foundation of simple set theory (SST) and first order predicate logic (FOPL) [1]. It is this that confers all the practical advantages, the core being system-guaranteed logical and semantic correctness of query and update results.

I know that insisting on not to use the terms database and DBMS interchangeably is dismissed as pedantry, but the database field is so loaded with confused and careless use of terms with costly consequences [2], that I will continue to insist. Here's one example why: for the practical advantages of the RDM to materialize, both the DBMS and databases must be truly relational, which has a distinct meaning for each.


Codd's 12 Rules Are Deprecated


While a true RDBMS will comply with Codd's 12 Rules, compliance is insufficient for relational fidelity. The rules were initially whipped up by Codd in response to DBMS vendors promoting their non-relational products as relational by prefixing or suffixing their names with an R. They were intended as a quick and dirty test  that could expose false claims [3]. The rules are deprecated precisely because they are quick and dirty, another example of why that is not a good idea.

The rules and the data model have nothing to do with transactions and the ACID properties. In a relational context anomalies are the undesirable side-effects of operating on derived relations that are not fully normalized--query results that are not guaranteed to be semantically correct [4].
 

SQL DBMSs Are Not Relational


The numerous reasons which disqualify SQL DBMSs as relational have been amply documented and I will not repeat them here. The consequence of confusing RDM with SQL is that the latter fails to confer anywhere near the RDM benefits, which is blamed on SQL being relational, rather on the fact that it isn't. If you are one of those who ask "If the RDM is so great, why hasn't it ever been implemented?", why do it when most believe SQL systems are it? IBM and Oracle have a lot to answer for regarding the stagnation of--indeed, regress in--the database field. Lack of, or poor education is another culprit.

R-tables Are Not Relations


A second confusion is of relations with tables. A R-table--not just any table--is only one way to represent a relation on some physical medium (e.g., paper or screen) and, as Codd explicitly stated, the physical arrangement of tuples and attributes as rows and columns "is not part of the relational view of data" [5]. The confusion is responsible for erroneously criticizing the RDM as simplistic because "flat tables cannot handle complex data". But an image of a thing is not the same as the thing itself--do we deem people or objects in photographs flat? Relations are multidimensional: each attribute is a dimension. R-tables are depictions of relations on a medium that is flat.

Any system can present users with a tabular view of data, but that does not make it relational. Compliance of the DBMS with the RDM and of databases with the three principles of database design do and the relational benefits will not materialize without them. If you don't understand that I can sell you any system as relational and you won't be the wiser. This is what actually happens in the industry.


Relations Are Not Stored


The data of only some of them are, and if they are, not as relations. While a direct image representation in storage is possible (sequential files are the storage isomorph of tables), it is neither mandatory, nor the only way. Logical-physical confusion (LPC) is responsible for the erroneous criticism of RDBMSs for poor performance because "they don't explicitly handle the physical level". This is backwards: physical independence (PI) means that the RDM gives DBMS designers complete freedom to deploy any storage and access methods to maximize performance and DBAs to change them at will without impairing applications. Some commercial SQL products offer some of that, but not enough; others store attribute rather than tuple data. These are implementation details, which are the exclusive determinants of performance, with which the math and logic of the RDM have nothing to do.

Note: I would have liked Kline to dispel the misconception that 'relational' does not come from "relating tables", but rather from mathematical relations, which are sets (as one commenter correctly pointed out).

"Representing the same data value" is a rather poor expression. More accurately, the related relations are subject to a referential constraint that enforces matches between values of a referencing FK and those of the referenced PK. The two keys must be defined on the same domain--i.e., they must represent the same real world property.

Relational Systems Require Thinking


Kline's claim of superiority of SQL databases over the hierarchic and network systems that it replaced is, of course, correct. But given the nature of those systems, it wasn't very hard to be better and SQL is far from being truly and fully relational (some of the commenters provide good explanations why). 


Klein claims that "where NoSQL databases excel at storing data that is moderately important and requires eventual consistency, SQL database excel at storing data that is of paramount importance and requires immediate consistency." I very much doubt that this is the real attraction of NoSQL. They are often promoted for "when you haven't figured out whar exactly you want to do with the data yet". The relational approach requires that you think carefully upfront about your information needs and model the business and design the database accordingly, while NoSQL products invite design laziness by letting you defer the upfront effort at the expense of usually costlier future effort of using the database. Expediency is too strong a temptation to resist, but what Codd realized--and many don't want to hear--is that by the time you figure out how to use the data, a RDBMS is the least costly solution and you better resist the temptation.

References


[1] Pascal, F., Truly Relational: What It Really Means, PRACTICAL DATABASE FOUNDATIONS paper #5, www.dbdebunk.com.

[2] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS (Database Debunkings, 2017).

[3] Codd, E. F., Does your DBMS run by the rules? ComputerWorld (1985).

[4] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS (Alternative Technologies, forthcoming).

[3] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Comm. ACM 13(6), 1970.


Note: I will not publish or respond to anonymous comments. If you want to say something, stand behind it. Otherwise don't bother, it'll be ignored.







Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment