Thursday, December 3, 2020

OBG: Missing Data -- "Horizontal Decomposition" Part 3

Note: To demonstrate correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing "oldies but goodies" from the old DBDebunk (2000-06), so that you can judge for yourself how well my then arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.

In Part 1 we re-published a reader's comments on "horizontal decomposition" -- Hugh Darwen's proposal on How to Handle Missing Information without Using NULLs relative to our The Final NULL in the Coffin: A Relational Solution to Missing Data;  In Part 2 we re-published Darwen's response. Here's my reply revised for consistency with the current state of knowledge.


DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

07/22/20: LINKS update: Added “An Argument for Controlled Natural Languages in Mathematics”, “Let’s Make Set Theory Great Again”.
- 07/21/20 LINKS update: Added “How Gödel’s Proof Works”.

- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
- @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.


Note: Even though there are relvars "under the covers" in relational databases, in simple set theory expressible in first order logic (SST/FOPL) -- the theoretical foundation of the RDM -- there are only relations. This is one of the issues tackled by McGoveran's effort to re-interpret, extend and formalize Codd's RDM[1], which is beyond the scope of this discussion.

In Hugh's simple example there are entities of two types, employees and projects, whereby an employee is assigned to one or (over time) more (though, for simplicity, not simultaneous) projects. Assignments are "relationship entities": an assignment is a 1:M relationship between an employee and one or more projects. According to our new understanding of the RDM, all "relationship entity types" such as assignments (not just N:M, but 1:M also) are represented by association relations.

Given EMPLOYEES and PROJECTS relations, assignments would normally be represented by the relation:

unless data are missing, in which case ASSIGNMENTS is not a relation (why?) and all bets are off. Hugh's "horizontal decomposition" avoids that, but he admits "possible disadvantages over other solutions". He mentions constraint complexity, but there are also similar implications for database design and query formulation (why?). This is what our solution avoids.

What we record in a database is our knowledge of the real world, which is often imperfect: if all assignments have end-dates, but some are unknown to the system, the TO attribute representing the end-date property has missing data. 

But note very carefully that whether the end-date is known or not is not a property of assignments. Strictly speaking it is "our property" -- lack of knowledge about the property -- essentially data about data, or meta-data which, in a relational database is recorded in the catalog and managed by the DBMS. Thus, every user relation has an associated catalog relation which records, for every missing value, the attribute name and primary key value of its tuple. In our example, for ASSIGNMENTS:

where MD_ASSIGNMENTS is a catalog relation, identifying every TO missing value in ASSIGNMENTS. Its predicate is:
For employee (E#) assigned to project (P#) property (ATTR_NAME} is unknown.
For details we refer the reader to our paper -- for the purposes of this discussion we argue as follows:

  • Our solution satisfies Darwen's requirements 1 and 2;
  • With respect to 3, we deem ours a relational (i.e., theoretically sound) solution because:

- the DBMS operates exclusively on relations (i.e., in both 1NF and 5NF and no NULLs) and constraints are truth-valued expressions, albeit, with "no-op" enforcement, just like RA operators;
- it is consistent with our new understanding of the RDM derived from McGoveran's work, in which "multi-relations" play a central role beyond missing data, which inspires confidence (although we do not constrain ourselves to Date's and Darwen's THE THIRD MANIFESTO as currently specified);
While we admit that the effect on the operators are yet to be spelled out (excellent research project), we believe they are expressible in the RA, although subject to certain interpretation by the DBMS in the presence of missing data.

A major difference between the two approaches is that Darwen’s imposes database design and constraint and query formulation burdens on users, which are relatively complex and prone to error. By contrast, our approach places most of the burden where it really belongs, on the DBMS -- except for those interpretations, users don’t have to do anything different than what they do in the absence of missing data. And that is surely preferable on both correctness and pragmatic grounds. 

Continued in Part 4


Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.

No comments:

Post a Comment

View My Stats