Wednesday, July 4, 2018

N-ary vs. Binary Relations: Anatomy of a Compound Misconception

Note: This is a re-write of an older post, to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM, which includes refinements, corrections, and extensions of his own.
“Codd has based his model on n-ary relations and that is the key mistake he has made; that leads to complex structure (absolutely not necessary) and situations where there are no values known and as a consequence the need of the concept we know too well -> the null pointers; binary relations (smallest possible) are sufficient to express any predicate/sententional formula and there is no possibility to have something like null; if a value is unknown then we do not know it thus it is not a fact for us thus it is not in our database.”
We suggest you test your foundation knowledge by figuring what's wrong with this picture before you proceed.


I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 


NEW: The Key to Relational Keys: A New Perspective


I deleted my Facebook account. You can follow me on Twitter:
The DBDebunk page will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back. I created a #RelModel hashtag for this purpose.
The PostWest pages will contain links to evidence for, and my take on Dystopian Western Decadence, The Only Acceptable Racism Left, and The Weaponized Myth of a "Palestinian Nation" for which the anti-semitic world fell lock stock and barrel. I will create a #PostWest hash tag for this purpose.


  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

Binary Relations: More Complexity, Not Simplicity

In fact, Codd initially thought to base the RDM on binary relations, but realized it would not be practical. A n-ary relation represents (facts about) members of an entity group (with all their properties). Let's assume that a binary relation represents an individual property associated with those entities. It should be clear that:

  • Many more constraints would have to be declared by users and enforced by the DBMS than in the n-ary case, to ensure that the relationships among the disparate properties represented by binary relations jointly represent the entity group correctly; and,
  • Operations on them to query about those entities would be more complex.

Note very carefully, however, that as soon as any manipulation on binary relations is performed, it unavoidably must contend with n-ary relations anyway.

So while "binary relations are sufficient to express any [single property] predicate", they are less useful for expressing the compound predicates descriptive of the real world represented by databases. The belief that binary relations bring more simplicity than n-ary relations is grounded in the common tendency of practitioners to focus on structure, and ignore integrity and manipulation implications, where complexity accrues[2]. 

N-ary Relations Do Not "Lead" to NULLs

“... n-ary relations ... leads to ... where there are no values known ... the need of the concept we know too well -> the null pointers" suggests that n-ary relations are responsible for NULLs representing unknown data, which purportedly binary relations would not. This lumps together three misconceptions.”
First, it is trivial to show that insofar as unknown data are concerned, there is no difference between binary and n-ary relations. Suppose the value of a property for a specific entity is unknown and missing. If missing values were represented by NULLs -- and this brings me to the second misconception (see next) -- the attribute representing the property would have a NULL in either a binary, or n-ary relation! And, of course, database relations cannot contain missing unknown values.

Second, NULL is just the SQL-specific representation of missing values, and for reasons that have been exhaustively explained elsewhere[3], SQL tables that contain them are not relations, rendering the binary/n-ary distinction moot.

Third, the industry commonly distinguishes between two kinds of missing data -- unknown, and inapplicable. We have shown that the latter is a misconception -- no data are missing, the NULLs are an unnecessary artifact of poor design[4,5]. The comment's author refers to unknown, but means inapplicable, which explains why he thinks "n-ary relations lead to NULLs": it is only with n-ary, but not binary relations that poor design can manufacture the need for inapplicable NULLs (why?), which can create, in the absence of foundation knowledge, the impression that they, not the bad design, are the culprit. In other words, while "there is no possibility to have something like a[n inapplicable] NULL" in a SQL binary table, it sure is possible to have an unknown NULL. Practitioners use NULL for both, excacerbating the problems.

Note: NULLs are, of course, not pointers. While Codd's four-valued logic (4VL) -- I-Mark for inapplicable data, and A-mark for unknown data -- was a mistake, it was only an inspiration for SQL's "3.5VL" -- single NULL mark for both in SQL -- disastruous in both concept and implementation. Neither is consistent with the formal foundation of the RDM, and the relational solution to missing data lies elsewhere[6].

Unknown Data Are Facts, But About Data

That some data of interest are unknown is a fact of interest that should be recorded, but indeed, "not in our database", because it is not data about the real world, but about data (i.e., metadata) that belongs in the database catalog and should be managed by the DBMS[6].


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.


[3] Pascal, F., Chapter 10, PRACTICAL ISSUES IN DATABASE MANAGEMENT (esp. bibliography).

[4] Pascal, F., The Principle of Orthogonal Database Design Parts I,II,III.

[5] Pascal, F., Meaning Criteria and Entity Supertype-Subtypes.

[6] Pascal, F., The Last NULL In the Coffin: A Relational Solution to Missing Data.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

No comments:

Post a Comment