Saturday, March 17, 2018

Physical Independence Part 2: Logical-physical Confusion

Note: This is a rewrite of older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

Revised 3/17/18

(Continued from Part 1)

This is the second part of my response today to an old DBDebunk query:

"You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?"
In Part 1 I explained physical independence (PI) and claimed that the  industry has failed to internalize its importance. Here I provide evidence to that effect and discuss some consequences.


I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you.



  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

Given the horrible experience with pre-SQL non-relational DBMSs that did not support PI, Codd could not have imagined that the RDM and logical database design, would end up being absurdly blamed for poor performance. But data professionals lack familiarity with the history of the field, do not have a grasp of the RDM and confuse levels of representation. 
As we have seen, performance is determined exclusively at the physical implementation level, outside the RDM. So the responsibility for performance optimization is relegated to DBMS vendors and DBAs, away from application developers and end users. When SQL DBMSs (which are far from being relational) yield unsatisfactory performance, data professionals absurdly blame the RDM and logical database design, instead of DBMS and database implementations, and "denormalize for performance". This logical-physical confusion (LPC) is akin to looking for lost keys where there is light, not where they were lost, but practitioners insist that when they denormalize databases, performance improves. There are fallacies involved.

Logic Cannot Affect Performance

First, normalization is poorly understood and in many, if not most, of the claimed instances, either the poor performance is not with databases that are really fully normalized, or the redesign for performance is not denormalization.

Second, to the extent that performance improves after real denormalization, the gain does not come from it. Denormalization introduces redundancy and associated integrity risks, which requires enforcement of additional constraints to protect database consistency. Data professionals are mostly unaware of this, do not know how to formulate the constraints, disregard them, and/or SQL DBMSs do not support them, so they are not enforced. The performance gains, if any, come from this tradeoff of integrity for them, not from denormalization (if the constraints were added, the gains would disappear)[2].

Thus, consistent with PI, all else being equal, unsatisfactory performance with a properly constrained 5NF relational database cannot be due to the RDM or logical design, but only to the physical implementation of the DBMS, the database, or both. In other words, iff (1) performance is unsatisfactory and (2) the database physical implementation is optimal, then (3) the DBMS implementation is responsible. In that rare case there may be no choice but denormalize, but that should not lead users to confuse levels of representation and distract  them from the real culprit.

Since we now contend that relations are in 5NF by definition (otherwise they are not relations)[3], when you denormalize you not only introduce redundancy, but are out of the RDM altogether and all bets are off[4]. You should be aware of the loss of all the relational advantages you lose (system-guaranteed logical validity, semantic correctness, declarative decidable languages, physical and logical independence, and so on) and consider this cost against any performance gains. Unfortunately, this does not occur in practice.

Data professionals keep asking "If the RDM is so great, why aren't there true RDBMSs?" Well, if, unaware of the costs, they act against their own interest and so willingly denormalize for performance, why should DBMS vendors, themselves lacking foundation knowledge, implement well performing true RDBMSs?

Designing for Consistency, Implementing for Performance

It is in order to prevent confusion of levels of representation that I recommend the three-fold terminology of conceptual modeling, logical design and physical implementation[5].

Properly constrained relations are formal database representations of object groups, so the first priority in designing them is database consistency with those groups (i.e., accurate representation thereof)[6]. This does not mean performance is not important, only that business modeling and logical design should not be contaminated with performance considerations that should be deferred to physical implementation. Yet in practice database design focuses almost exclusively on performance.

"When you design your database tables there are some important things to think of...:
  • Normalize to remove redundant data
  • Use the smallest datatype possible
  • Create as few indexes as possible, but not too few
  • Avoid redundant indexes
  • Every table must have clustered index."
Indexing is part of the implementation. Domains (SQL data types are not the same thing) should be chosen, first and foremost, such that the attributes defined on them are accurate representations of object properties. The only logical design advice is "normalize", and even that is inaccurate. Contrary to conventional wisdom (such as it is), if relations are correctly designed in 5NF, there is no need to further normalize. Incidentally, avoiding redundancy is not the only reason for 5NF[2].

Obsession with performance at the expense of everything else can lead to nonsense.

"First is not to get mesmerized by the marketing hype. Second, ask yourself, does my relational database give me the performance that I need with the data I have? If the answer is yes, even if it's a little slow, then they have no business looking at big data. The reason I say that so emphatically is that big data is not the next generation of database technology. There are tradeoffs to be made."
Mesmerization by marketing hype is substituted with mesmerization by performance. Choice of DBMS should consider the data model it supports and its fit to the informational use of the data (i.e., its structural, integrity and manipulative components), not just performance. If a data model (or, probably, more than one) underlies big data systems, it has not been formally defined and no theoretical foundation has been claimed for it. Non-relational systems usually have no hope to support PI and most other relational advantage,

It is in their interest for data professionals to stop confusing levels of representation and create demand that would incent vendors to produce true RDBMSs.

Taking Advantage of the RDM

We are now in a position to answer the original question.

The RDM is a data model used to formalize enterprise-specific conceptual models as enterprise-specific logical models for database representation. It is essentially applied math and logic, which have nothing to do with performance. Any physical constructs can be used to implement relational databases, as long as the DBMS supports PI and insulates users and applications from them.

SQL DBMSs are the closest the industry has -- and probably will -- ever come to the RDM. They are far from being relational, but they have been sold as such and most data professionals on both the vendor and user side, lacking foundation knowledge, believe they are. But SQL DBMSs provide at least some PI support.

In absence of proper education and foundation knowledge, the best a knowledgeable practitioner can do is (1) proper conceptual modeling (2) correct logical database design using the RDM, and (3) plug any holes caused by relational failures of SQL DBMSs. The worst they can do is regress to non-SQL products that are relabeled versions of the non-relational systems that SQL made obsolete five decades ago, or worse.


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.

[3] Pascal, F., What Relations Really Are and Why They Are Important.

[4] Pascal, F., Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness.

[5] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation

[6] Pascal, F., What Meaning Means Business Rules, Predicates, Integrity Constraints and Database Consistency.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

No comments:

Post a Comment

View My Stats