From: BS
To: Hugh Darwen
Date: 14 Oct 2004
Early this year, a friend and I were a bit confused by your
paper How
To Handle Missing Information Without Using NULLs (HTHMIWUN). We spent a lot of time going analyzing it. I
intended to ask you some questions, but as I was formulating questions, I kept
finding more and more until I had too many to write a simple e-mail. So, I began writing a website to cover the
different questions that came to mind.
In the end, I found that I could not produce anything that was both
comprehensive and cohesive, so I abandoned the idea. I realized that the root of the problem was that I was working
from a paper too small to fully explain your idea, and I decided that perhaps I
should wait until you or your colleagues wrote more about your solution to
precluding nulls.
Since you have asked, I shall do my best to briefly mention
my most serious points. My main issue
is with horizontal decomposition, which seems to be the heart of HTHMIWUN. I believe that horizontal decomposition is
not a logical process, but rather a semantic process, which leads to two
problems:
1) A DBMS does not understand semantics. Handling multiple
base relations with predicates of the same semantic field, predicates that may
even be combined at the higher level of virtual relations, seems very difficult
at best. I hope that someday, computers
will be able to handle semantics as well as humans. In the meantime, it seems to me that your approach would require
the addition of some meta-logical data to work well as a general solution. (Perhaps "semantically tagging"
relations or using higher-order logic to group relations of the same semantic
field could help.)
2) Logical processes have rules to help arrive at proper
solutions (normalization, for example); semantic processes, as far as I know,
can not have such rules, due to the rather arbitrary (or unknown) nature of
semantics. So, how can a database
designer know when and how to use horizontal decomposition effectively? Even more strictly, can there be any formal
rules for such a process?
Other problems come to mind, such as the difficulty of
virtual relations (especially on updates), complexity of distributed key
constraints, and potential difficulties for database users, but they all stem
from the problems of horizontal decomposition.
All things considered, I'd rather not use your method to preclude nulls.
Here is my abandoned site, as
unpolished as it is, if you wish to see it.
However, it is large, sometimes long-winded, and doesn't really say
anything more important than what I have addressed here. (It might be good for a chuckle, though.)
I am rather pleased with Pascal's approach in PRACTICAL DATABASE FOUNDATIONS paper
#8, The Final NULL
in the Coffin, to precluding NULLS, even if it does have some obstacles
in the area of implementation. If you
continue to believe in your method, I hope that you will write more about
it. Perhaps my misgivings are due
mainly to the brevity of your paper.
As silly as this may sound, I see horizontal decomposition as
a kind of frontier for relational database modeling. The relational topic that
interests me most is updating virtual relations, and I feel that horizontal
decomposition may play a big role in that.
So, what I'm getting at is even if you abandon your approach to NULLS, I
hope that you will continue exploration of horizontal decomposition.
Hugh Darwen Responds:
Thank you for responding to my request for more information about your
objections to and questions on my slide presentation (not a paper, yet!), HTHMIWUN.
Let me first respond to your point on "horizontal decomposition",
then clarify one or two things that seem to need clarifying in relation to that
presentation [Lettering added for convenience of commenting].
a. The term "horizontal decomposition" first came
up in connection with the study of temporal databases that Date and I published
in "Temporal Data and The Relational Model" in 2001. Consider
the relation WORKS_ON { E#, P# }, for example, whose predicate is
"Employee E# is working on project P#". If we wish to
"temporalise" this, so that we can keep a historical record of all
projects an employee has ever worked on, we run into difficulties if we just
add FROM and TO attributes to WORKS_ON, because we don't yet know the TO dates
for the projects currently being worked on. We therefore decompose the
putative WORKS_ON { E#, P#, FROM, TO } into WORKS_ON { E#, P#, SINCE } and
WORKED_ON { E#, P#, FROM, TO }. Yes, the constraints needed on that
decomposition are pretty horrendous, so in the book we propose various
shorthands to address that problem.
b. Most importantly, notice that the decomposition I have
just described is really just recognizing that we are dealing with two distinct
predicates: (a) "Employee E# has been working on project P# since day
SINCE", and (b) "Employee E# worked on project J# from day FROM until
day TO".
c. Horizontal decomposition is a solution to problems that
arise when you try to deal with a disjunctive predicate. If we tried to
combine WORKS_ON and WORKED_ON into a single relation variable (relvar), what
is the predicate for that relvar? It can't be "Either employee E#
has been working on project J# since day SINCE, or E# worked on J# from day
FROM until day TO", because the two disjuncts do not involve the same
variables--do not yield type-compatible relations. To make them
type-compatible, we have to replace that SINCE by a FROM and a TO: "Either
employee E# has been working on project J# since day FROM (in which case day TO
refers to the end of time), or E# worked on J# from day FROM until day
TO". But now we are landing ourselves in a bit of trouble, not
unlike some of the trouble that arises from the use of nulls.
The horizontal decomposition shown in HTHMIWUN is likewise
concerned with avoiding disjunctive predicates. I can't handle
"Either employee E# is paid salary S, or we don't know what E#'s salary
is, or E# isn't eligible for a salary" in a single relvar. Therefore
I split the sentence into three, each of which, being atomic, is easily
handled. Granted, we now have a bit of a problem with the constraints,
but at least I have a design for a relational database in which we can record
what we want to be recorded.
Notice that whereas horizontal decomposition involves
removing awkward "or"s, the more familiar "vertical
decomposition" that takes place in classical normalization involves
removing awkward "and"s. One could loosely characterize the
sixth normal form (6NF), described by Date and myself in the temporal book, as
removal of all "and"s; when you add horizontal decomposition, you are
effectively removing all conjunctions, yielding predicates that are "atomic".
I claim that it makes good sense, as a preliminary to
database design, to write down all the atomic predicates that are required to
be represented in that database. I further claim that a database
consisting of distinct relvars for each of those atomic predicates will provide
a valid solution, with several advantages over other solutions as well as
possible disadvantages over other solutions. I further surmise that
distinct relvars for each atomic predicate provide the safest and best starting
point for any design--so you might say that I prefer design by denormalization
to design by normalization, though I don't think I would ever recommend a
non-5NF design (so denormalization never goes further than replacement of two
or more 6NF relvars by a single, equivalent 5NF relvar).
Now, to clarify the intent of HTHMIWUN:
I see that the notes I wrote on the first slide refer to the
material as a recommendation. I now think that's perhaps a little too
strong and I'm surprised to see that I wrote those words. As I wrote to
Fabian the other day, I think of HTHMIWUN more as an attempt to prove that the
so-called "missing information problem" can be adequately addressed
in a relational database (by which I mean, a database constructed by use of a
DBMS that conforms to The Third Manifesto). I prefer to think of it as a
strawman proposal.
By the way, you remarked that my proposal had the advantage
of being implementable using existing technology. Well, if you look at
slide #5, you'll see that I disagree with you on that count! Slide #5
says: "I also comment on the extent (<100%) to which the proposed
solutions are available in today’s technology." That
"<100%" refers to the requirement for support for what Date and I
call "multiple assignment". For example, if an employee who was
previously ineligible for a salary suddenly starts to be paid one, then
simultaneous updates to EARNS and UNSALARIED are needed. (You could argue
that the "deferred constraint checking" that is available with some
existing SQL implementations can be used for the purpose at hand, but deferred
constraint checking is not supported by The Third Manifesto, for reasons given
in the book.)
Anyway, I am far from claiming that there aren't other
approaches worth considering. I do, however, suggest a few "rules of
engagement", should we be faced with several "competing"
proposals:
1. The requirements are those indicated by my atomic
predicates shown on slides #9 and #10.
2. The problems to be addressed are those indicated by the
four points on my slide #5 (database structure, constraints, updates,
queries). (Did I leave any out?)
3. The proposed solution must not involve any departure from
the relational model (ideally, from my point of view, from TTM). In
particular: the database structure must consist of relvars only; the
constraints must be truth-valued expressions, not referencing any variables
other than the database relvars; the updates must be expressible in terms of
(multiple) relvar assignment; the queries must be expressible in relational
algebra.
Pascal's solution interests me, too, but it doesn't meet all
the requirements mine meets and it seems to involve a "multiple
query" operator (not yet fleshed out), that involves more than just
relational algebra.
I hope you can see, now, why I didn't attempt an approach
that would involve a DBMS capable of understanding semantics in the sense
expressed in your point 1). That would be beyond the scope of the
exercise. Besides, at least we already know how to make a relational
DBMS, even if we don't yet have any industry-strength ones.
I believe I have given something of an answer to your
question as to how the db designer can know when to apply horizontal
decomposition: if the predicate is disjunctive, split it; otherwise, don't. That
seems pretty close to being a formal rule. (And for 6NF, the apply the same
rule with "con-" instead of "dis-".)
Regarding the complexity of my constraints: (a) I do propose
convenient shorthands; (b) don't those shorthands, combined with multiple assignment,
adequately push the complexity under the covers? (c) how can they be avoided?
Fabian Pascal Comments: I always found it problematic to infer general database design
principles from simple, incompletely specified examples, that run the risk of
being contrived to produce such inferences.
a. Darwen refers to “keeping a historical record of all
projects an employee has ever worked on”. This suggests there are multiple
projects per employee, but given that the assignments to projects are TO: and
FROM:, it’s not clear whether there are also multiple period assignments of any
employee to any project. For simplicity let’s assume not (otherwise the EP
relvar would have a four-attribute composite key, which is not pretty).
Similarly, neither is it clear whether the corresponding FROM: and SINCE values
are always the same. If they are, then Darwen’s design contains redundancy
(which, we suppose, contributes to the added constraint burden). We will assume
such equality, for simplicity.
Darwen claims that the two-relvar design “is really just
recognizing that we are dealing with two distinct predicates”. But conceptual
design—the specification of business rules—is based on subjective
perceptions of the world (see PRACTICAL
DATABASE FOUNDATIONS paper #4, Un-muddling Modeling).
And there is no scientific way to prefer one perception to another.
Since predicates are formal representations in the database of informal
business rules, it follows that their formulation is, in this sense,
arbitrary. In other words, the choice of perception, business rules and,
therefore, predicates--is, made on pragmatic, not formal grounds. This
is reflected in Darwen’s recognition that his solution has “several advantages
over other solutions, as well as possible disadvantages over other solutions”.
His perception of the real world consists of
1. Two business rules that map to relvar predicates
(headings)
Employee (E#)
worked for project (P#) from (FROM:) to (TO:).
Employee (E#)
works for project (P#) from (SINCE:).
2. Some pertinent business rules that map to database
predicates (constraints).
Our perception separates between the real world, and our knowledge
of it: whether a TO: value is known or not is not a property of the project
assignment, but of the TO: attribute—that is, data about the data—and,
therefore, we perceive
1. One business rule that maps to a relvar predicate
Employee (E#)
worked for project (P#) from (FROM:) to (TO:).
2. One business rule that maps to a metadata relvar
predicate
The value of
attribute (ATTRIBUTE) for employee (E#) who worked on project (P#) is unknown.
where ATTRIBUTE is the name of any
attribute with unknown values, here, TO:.
There is no additional constraint burden on the user.
So there are two equally valid perceptions of the world, and
the question is which is preferable on pragmatic grounds.
Darwen proposes four criteria on which to base this decision.
Our solution satisfies 1 and 2. With respect to 3, we believe that:
·
It does not involve any departure from the relational
model, and therefore, expressible in multi-relvar assignments (although, we do
not constrain ourselves to TTM as currently specified);
·
The DBMS operates on relvars only (albeit, with
iterations and no-ops (see our paper);
·
The constraints are truth-valued expressions (albeit,
with “no-ops” enforcement, like the relational operators which, after all, they
are for enforcement purposes);
·
While we admit in our paper that the effect on the
operators are yet to be spelled out, we believe they are expressible in the
relational algebra, although subject to certain interpretation by the DBMS in
the presence of missing data.
A major difference between the two approaches is that
Darwen’s imposes burdens on users, via database design, constraint, and
query formulation, which are relatively complex and prone to error. By
contrast, our approach places the entire burden where it really belongs, on the
DBMS; users don’t have to do anything different than what they do in the
absence of missing data. And that is surely preferable on pragmatic grounds.
Posted
2/4/05