Friday, November 27, 2020

OBG: Missing Data -- "Horizontal Decomposition" Part 2

Note: To demonstrate the correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing "Oldies But Goodies" material from the old (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.

In Part 1 we re-published a reader's response to "horizontal decomposition" -- Hugh Darwen's How to Handle Missing Information without Using NULLs  -- in comparison to our The Final NULL in the Coffin: A Relational Solution to Missing Data). Here's Hugh's response.


DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

07/22/20: LINKS update: Added “An Argument for Controlled Natural Languages in Mathematics”, “Let’s Make Set Theory Great Again”.
- 07/21/20 LINKS update: Added “How Gödel’s Proof Works”.

- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
- @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.


Hugh Darwen:
Thank you for responding to my request for more information about your objections to and questions on my slide presentation (not a paper, yet!), HTHMIWUN. Let me first respond to your point on "horizontal decomposition", then clarify one or two things that seem to need clarifying in relation to that presentation

a. The term "horizontal decomposition" first came up in connection with the study of temporal databases that Date and I published ("Temporal Data and The Relational Model" in 2001. Consider the relation

for example, whose predicate is
"Employee E# is working on project P#".
If we wish to "temporalise" this, so that we can keep a historical record of all projects an employee has ever worked on, we run into difficulties if we just add FROM and TO attributes to WORKS_ON, because we don't yet know the TO dates for the projects currently being worked on. We therefore decompose the putative
Yes, the constraints needed on that decomposition are pretty horrendous, so in the book we propose various shorthands to address that problem.

b. Most importantly, notice that the decomposition I have just described is really just recognizing that we are dealing with two distinct predicates:
(a) "Employee E# has been working on project P# since day SINCE"; and
(b) "Employee E# worked on project J# from day FROM until day TO".

c. Horizontal decomposition is a solution to problems that arise when you try to deal with a disjunctive predicate. If we tried to combine WORKS_ON and WORKED_ON into a single relation variable (relvar), what is the predicate for that relvar? It can't be
"Either employee E# has been working on project J# since day SINCE, or E# worked on J# from day FROM until day TO"
because the two disjuncts do not involve the same variables -- do not yield type-compatible relations. To make them type-compatible, we have to replace that SINCE by a FROM and a TO:
"Either employee E# has been working on project J# since day FROM (in which case day TO refers to the end of time), or E# worked on J# from day FROM until day TO"  
But now we are landing ourselves in a bit of trouble, not unlike some of the trouble that arises from the use of NULLs.

The horizontal decomposition shown in HTHMIWUN is likewise concerned with avoiding disjunctive predicates.  I can't handle
"Either employee E# is paid salary S, or we don't know what E#'s salary is, or E# isn't eligible for a salary"
in a single relvar. Therefore I split the sentence into three, each of which, being atomic, is easily handled. Granted, we now have a bit of a problem with the constraints, but at least I have a design for a relational database in which we can record what we want to be recorded.

Notice that whereas horizontal decomposition involves removing awkward "or"s, the more familiar "vertical decomposition" that takes place in classical normalization involves removing awkward "and"s.  One could loosely characterize the sixth normal form (6NF), described by Date and myself in the temporal book, as removal of all "and"s; when you add horizontal decomposition, you are effectively removing all conjunctions, yielding predicates that are "atomic".

I claim that it makes good sense, as a preliminary to database design, to write down all the atomic predicates that are required to be represented in that database. I further claim that a database consisting of distinct relvars for each of those atomic predicates will provide a valid solution, with several advantages over other solutions as well as possible disadvantages over other solutions.  I further surmise that distinct relvars for each atomic predicate provide the safest and best starting point for any design--so you might say that I prefer design by denormalization to design by normalization, though I don't think I would ever recommend a non-5NF design (so denormalization never goes further than replacement of two or more 6NF relvars by a single, equivalent 5NF relvar).

Now, to clarify the intent of HTHMIWUN: I see that the notes I wrote on the first slide refer to the material as a recommendation. I now think that's perhaps a little too strong and I'm surprised to see that I wrote those words. As I wrote to Fabian the other day, I think of HTHMIWUN more as an attempt to prove that the so-called "missing information problem" can be adequately addressed in a relational database -- by which I mean, a database constructed by use of a DBMS that conforms to THE THIRD MANIFESTO (TTM). I prefer to think of it as a strawman proposal.

By the way, you remarked that my proposal had the advantage of being implementable using existing technology. Well, if you look at slide #5, you'll see that I disagree with you on that count! Slide #5 says: "I also comment on the extent (<100%) to which the proposed solutions are available in today’s technology."  That "<100%" refers to the requirement for support for what Date and I call "multiple assignment". For example, if an employee who was previously ineligible for a salary suddenly starts to be paid one, then simultaneous updates to EARNS and UNSALARIED are needed. (You could argue that the "deferred constraint checking" that is available with some existing SQL implementations can be used for the purpose at hand, but deferred constraint checking is not supported by TTM, for reasons given in the book.)

Anyway, I am far from claiming that there aren't other approaches worth considering. I do, however, suggest a few "rules of engagement", should we be faced with several "competing" proposals:

1. The requirements are those indicated by my atomic predicates shown on slides #9 and #10.
2. The problems to be addressed are those indicated by the four points on my slide #5 (database structure, constraints, updates, queries). (Did I leave any out?)
3. The proposed solution must not involve any departure from the relational model (ideally, from my point of view, from TTM). In particular: the database structure must consist of relvars only; the constraints must be truth-valued expressions, not referencing any variables other than the database relvars; the updates must be expressible in terms of (multiple) relvar assignment; the queries must be expressible in relational algebra.

Pascal's solution interests me, too, but it doesn't meet all the requirements mine meets and it seems to involve a "multiple query" operator (not yet fleshed out), that involves more than just relational algebra.

I hope you can see, now, why I didn't attempt an approach that would involve a DBMS capable of understanding semantics in the sense expressed in your point 1). That would be beyond the scope of the exercise. Besides, at least we already know how to make a relational DBMS, even if we don't yet have any industry-strength ones.

I believe I have given something of an answer to your question as to how the db designer can know when to apply horizontal decomposition: if the predicate is disjunctive, split it; otherwise, don't. That seems pretty close to being a formal rule. (And for 6NF, the apply the same rule with "con-" instead of "dis-".)

Regarding the complexity of my constraints: (a) I do propose convenient shorthands; (b) don't those shorthands, combined with multiple assignment, adequately push the complexity under the covers? (c) how can they be avoided?

(Continued in Part 3)


Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

No comments:

Post a Comment

View My Stats