ON DARWEN’S “HANDLING OF MISSING INFORMATION WITHOUT USING NULLS”
with Hugh Darwen and Fabian Pascal

 

 

 

From: BS

To: Hugh Darwen

Date: 14 Oct 2004

 

Early this year, a friend and I were a bit confused by your paper How To Handle Missing Information Without Using NULLs (HTHMIWUN).  We spent a lot of time going analyzing it. I intended to ask you some questions, but as I was formulating questions, I kept finding more and more until I had too many to write a simple e-mail.  So, I began writing a website to cover the different questions that came to mind.  In the end, I found that I could not produce anything that was both comprehensive and cohesive, so I abandoned the idea.  I realized that the root of the problem was that I was working from a paper too small to fully explain your idea, and I decided that perhaps I should wait until you or your colleagues wrote more about your solution to precluding nulls.

 

Since you have asked, I shall do my best to briefly mention my most serious points.  My main issue is with horizontal decomposition, which seems to be the heart of HTHMIWUN.  I believe that horizontal decomposition is not a logical process, but rather a semantic process, which leads to two problems:

 

1) A DBMS does not understand semantics. Handling multiple base relations with predicates of the same semantic field, predicates that may even be combined at the higher level of virtual relations, seems very difficult at best.  I hope that someday, computers will be able to handle semantics as well as humans.  In the meantime, it seems to me that your approach would require the addition of some meta-logical data to work well as a general solution.  (Perhaps "semantically tagging" relations or using higher-order logic to group relations of the same semantic field could help.)

 

2) Logical processes have rules to help arrive at proper solutions (normalization, for example); semantic processes, as far as I know, can not have such rules, due to the rather arbitrary (or unknown) nature of semantics.  So, how can a database designer know when and how to use horizontal decomposition effectively?  Even more strictly, can there be any formal rules for such a process?

 

Other problems come to mind, such as the difficulty of virtual relations (especially on updates), complexity of distributed key constraints, and potential difficulties for database users, but they all stem from the problems of horizontal decomposition.  All things considered, I'd rather not use your method to preclude nulls.

 

Here is my abandoned site, as unpolished as it is, if you wish to see it.  However, it is large, sometimes long-winded, and doesn't really say anything more important than what I have addressed here.  (It might be good for a chuckle, though.)

 

I am rather pleased with Pascal's approach in PRACTICAL DATABASE FOUNDATIONS paper #8, The Final NULL in the Coffin, to precluding NULLS, even if it does have some obstacles in the area of implementation.  If you continue to believe in your method, I hope that you will write more about it.  Perhaps my misgivings are due mainly to the brevity of your paper.

 

As silly as this may sound, I see horizontal decomposition as a kind of frontier for relational database modeling. The relational topic that interests me most is updating virtual relations, and I feel that horizontal decomposition may play a big role in that.  So, what I'm getting at is even if you abandon your approach to NULLS, I hope that you will continue exploration of horizontal decomposition.

 

 

Hugh Darwen Responds:  Thank you for responding to my request for more information about your objections to and questions on my slide presentation (not a paper, yet!), HTHMIWUN.  Let me first respond to your point on "horizontal decomposition", then clarify one or two things that seem to need clarifying in relation to that presentation [Lettering added for convenience of commenting].

 

a. The term "horizontal decomposition" first came up in connection with the study of temporal databases that Date and I published in "Temporal Data and The Relational Model" in 2001.  Consider the relation WORKS_ON { E#, P# }, for example, whose predicate is "Employee E# is working on project P#".  If we wish to "temporalise" this, so that we can keep a historical record of all projects an employee has ever worked on, we run into difficulties if we just add FROM and TO attributes to WORKS_ON, because we don't yet know the TO dates for the projects currently being worked on.  We therefore decompose the putative WORKS_ON { E#, P#, FROM, TO } into WORKS_ON { E#, P#, SINCE } and WORKED_ON { E#, P#, FROM, TO }.  Yes, the constraints needed on that decomposition are pretty horrendous, so in the book we propose various shorthands to address that problem.

 

b. Most importantly, notice that the decomposition I have just described is really just recognizing that we are dealing with two distinct predicates: (a) "Employee E# has been working on project P# since day SINCE", and (b) "Employee E# worked on project J# from day FROM until day TO".

 

c. Horizontal decomposition is a solution to problems that arise when you try to deal with a disjunctive predicate.  If we tried to combine WORKS_ON and WORKED_ON into a single relation variable (relvar), what is the predicate for that relvar?  It can't be "Either employee E# has been working on project J# since day SINCE, or E# worked on J# from day FROM until day TO", because the two disjuncts do not involve the same variables--do not yield type-compatible relations.  To make them type-compatible, we have to replace that SINCE by a FROM and a TO: "Either employee E# has been working on project J# since day FROM (in which case day TO refers to the end of time), or E# worked on J# from day FROM until day TO".  But now we are landing ourselves in a bit of trouble, not unlike some of the trouble that arises from the use of nulls.

 

The horizontal decomposition shown in HTHMIWUN is likewise concerned with avoiding disjunctive predicates.  I can't handle "Either employee E# is paid salary S, or we don't know what E#'s salary is, or E# isn't eligible for a salary" in a single relvar.  Therefore I split the sentence into three, each of which, being atomic, is easily handled.  Granted, we now have a bit of a problem with the constraints, but at least I have a design for a relational database in which we can record what we want to be recorded.

 

Notice that whereas horizontal decomposition involves removing awkward "or"s, the more familiar "vertical decomposition" that takes place in classical normalization involves removing awkward "and"s.  One could loosely characterize the sixth normal form (6NF), described by Date and myself in the temporal book, as removal of all "and"s; when you add horizontal decomposition, you are effectively removing all conjunctions, yielding predicates that are "atomic".

 

I claim that it makes good sense, as a preliminary to database design, to write down all the atomic predicates that are required to be represented in that database.  I further claim that a database consisting of distinct relvars for each of those atomic predicates will provide a valid solution, with several advantages over other solutions as well as possible disadvantages over other solutions.  I further surmise that distinct relvars for each atomic predicate provide the safest and best starting point for any design--so you might say that I prefer design by denormalization to design by normalization, though I don't think I would ever recommend a non-5NF design (so denormalization never goes further than replacement of two or more 6NF relvars by a single, equivalent 5NF relvar).

 

Now, to clarify the intent of HTHMIWUN:

 

I see that the notes I wrote on the first slide refer to the material as a recommendation.  I now think that's perhaps a little too strong and I'm surprised to see that I wrote those words.  As I wrote to Fabian the other day, I think of HTHMIWUN more as an attempt to prove that the so-called "missing information problem" can be adequately addressed in a relational database (by which I mean, a database constructed by use of a DBMS that conforms to The Third Manifesto).  I prefer to think of it as a strawman proposal.

 

By the way, you remarked that my proposal had the advantage of being implementable using existing technology.  Well, if you look at slide #5, you'll see that I disagree with you on that count!  Slide #5 says: "I also comment on the extent (<100%) to which the proposed solutions are available in today’s technology."  That "<100%" refers to the requirement for support for what Date and I call "multiple assignment".  For example, if an employee who was previously ineligible for a salary suddenly starts to be paid one, then simultaneous updates to EARNS and UNSALARIED are needed.  (You could argue that the "deferred constraint checking" that is available with some existing SQL implementations can be used for the purpose at hand, but deferred constraint checking is not supported by The Third Manifesto, for reasons given in the book.)

 

Anyway, I am far from claiming that there aren't other approaches worth considering.  I do, however, suggest a few "rules of engagement", should we be faced with several "competing" proposals:

 

1. The requirements are those indicated by my atomic predicates shown on slides #9 and #10.

 

2. The problems to be addressed are those indicated by the four points on my slide #5 (database structure, constraints, updates, queries).  (Did I leave any out?)

 

3. The proposed solution must not involve any departure from the relational model (ideally, from my point of view, from TTM).  In particular: the database structure must consist of relvars only; the constraints must be truth-valued expressions, not referencing any variables other than the database relvars; the updates must be expressible in terms of (multiple) relvar assignment; the queries must be expressible in relational algebra.

 

Pascal's solution interests me, too, but it doesn't meet all the requirements mine meets and it seems to involve a "multiple query" operator (not yet fleshed out), that involves more than just relational algebra.

 

I hope you can see, now, why I didn't attempt an approach that would involve a DBMS capable of understanding semantics in the sense expressed in your point 1).  That would be beyond the scope of the exercise.  Besides, at least we already know how to make a relational DBMS, even if we don't yet have any industry-strength ones.

 

I believe I have given something of an answer to your question as to how the db designer can know when to apply horizontal decomposition: if the predicate is disjunctive, split it; otherwise, don't. That seems pretty close to being a formal rule. (And for 6NF, the apply the same rule with "con-" instead of "dis-".)

 

Regarding the complexity of my constraints: (a) I do propose convenient shorthands; (b) don't those shorthands, combined with multiple assignment, adequately push the complexity under the covers? (c) how can they be avoided?

 

 

Fabian Pascal Comments:  I always found it problematic to infer general database design principles from simple, incompletely specified examples, that run the risk of being contrived to produce such inferences.

 

a. Darwen refers to “keeping a historical record of all projects an employee has ever worked on”. This suggests there are multiple projects per employee, but given that the assignments to projects are TO: and FROM:, it’s not clear whether there are also multiple period assignments of any employee to any project. For simplicity let’s assume not (otherwise the EP relvar would have a four-attribute composite key, which is not pretty). Similarly, neither is it clear whether the corresponding FROM: and SINCE values are always the same. If they are, then Darwen’s design contains redundancy (which, we suppose, contributes to the added constraint burden). We will assume such equality, for simplicity.

 

Darwen claims that the two-relvar design “is really just recognizing that we are dealing with two distinct predicates”. But conceptual design—the specification of business rules—is based on subjective perceptions of the world (see PRACTICAL DATABASE FOUNDATIONS paper #4, Un-muddling Modeling). And there is no scientific way to prefer one perception to another. Since predicates are formal representations in the database of informal business rules, it follows that their formulation is, in this sense, arbitrary. In other words, the choice of perception, business rules and, therefore, predicates--is, made on pragmatic, not formal grounds. This is reflected in Darwen’s recognition that his solution has “several advantages over other solutions, as well as possible disadvantages over other solutions”.

 

His perception of the real world consists of

 

1. Two business rules that map to relvar predicates (headings)

 

Employee (E#) worked for project (P#) from (FROM:) to (TO:).

Employee (E#) works for project (P#) from (SINCE:).

 

2. Some pertinent business rules that map to database predicates (constraints).

 

Our perception separates between the real world, and our knowledge of it: whether a TO: value is known or not is not a property of the project assignment, but of the TO: attribute—that is, data about the data—and, therefore, we perceive

 

1. One business rule that maps to a relvar predicate

 

Employee (E#) worked for project (P#) from (FROM:) to (TO:).

 

2. One business rule that maps to a metadata relvar predicate

 

The value of attribute (ATTRIBUTE) for employee (E#) who worked on project (P#) is unknown.

 

where ATTRIBUTE is the name of any attribute with unknown values, here, TO:.

 

There is no additional constraint burden on the user.

 

So there are two equally valid perceptions of the world, and the question is which is preferable on pragmatic grounds.

 

Darwen proposes four criteria on which to base this decision. Our solution satisfies 1 and 2. With respect to 3, we believe that:

 

·   It does not involve any departure from the relational model, and therefore, expressible in multi-relvar assignments (although, we do not constrain ourselves to TTM as currently specified);

·         The DBMS operates on relvars only (albeit, with iterations and no-ops (see our paper);

·   The constraints are truth-valued expressions (albeit, with “no-ops” enforcement, like the relational operators which, after all, they are for enforcement purposes);

·   While we admit in our paper that the effect on the operators are yet to be spelled out, we believe they are expressible in the relational algebra, although subject to certain interpretation by the DBMS in the presence of missing data.

 

A major difference between the two approaches is that Darwen’s imposes burdens on users, via database design, constraint, and query formulation, which are relatively complex and prone to error. By contrast, our approach places the entire burden where it really belongs, on the DBMS; users don’t have to do anything different than what they do in the absence of missing data. And that is surely preferable on pragmatic grounds.

 

 

Posted 2/4/05