ABSTRACT
This article contains the text of a debate between Dr. E. F.
Codd and myself (C. J. Date) on nulls and related matters. It probably doesn't resolve anything, but it
at least touches on most of the arguments on both sides of the issue. The article was originally published in Database
Programming & Design 6, No. 10 (October 1993), where it was described
as "a point/counterpoint on the tough issue of missing values." An edited version of the original article
subsequently appeared in my book RELATIONAL
DATABASE WRITINGS 1991-1994.
What follows is a slightly revised version of that chapter.
INTRODUCTION
(The introduction that follows is a lightly edited version of
Database Programming & Design's own introduction to the original
debate.)
Dr. E. F. Codd and C. J. Date are two of the best known
figures in the history, development, and exposition of what was a breakthrough
concept in database technology: the relational model. Ever since the model was first defined by Codd in 1970,*
in his famous paper A Relational Model of Data for Large Shared Data Banks,
we have been reading, listening to, and interpreting their commentaries on
it. While Codd and Date have agreed
upon much during the course of the relational model's evolution and
implementation, on some issues they definitely do not agree. One important disagreement—and the topic of
this debate—centers on the issue of nulls and missing values, and the
underlying theoretical problems of three- and four-valued logic.
----------
*
Actually Codd's first paper on the relational model (Derivability,
Redundancy, and Consistency of Relations Stored in Large Data Banks) was
dated 1969.
----------
Both have written extensively on these topics, as is noted in
the “References and Bibliography" section at the end. The comments presented here were sparked by
Date's columns in our magazine beginning last December, when he discussed the
three-valued logic approach to missing information. Codd then sent us his criticisms of Date's writings; his
commentary is presented here. Date then
provided a rebuttal to specific points of Codd's, which follows Codd's
remarks. Finally, we give Codd a chance
to rebut Date's rebuttal.
For the reader's convenience, we kept Codd's commentary
together, so it may be read as a whole.
We noted throughout, however, where Date's specific rebuttals apply, and
should be read. This way, the reader
may follow Codd's comments all the way through, and then return to his essay,
reading Date's rebuttals as appropriate.
At first, the issues may seem arcane and theoretical, but
most developers and DBAs know they clearly are not, and merit serious
debate. Missing values remain one of
the toughest—and potentially, most dangerous--problems in database technology.
[Ed. Note: The
reader is referred to PRACTICAL
DATABASE FOUNDATIONS paper #8, The Final NULL in the
Coffin for a possible relational solution to the missing values
problem.]
We begin with Dr. Codd's commentary.
CODD'S COMMENTARY
Although C. J. Date has been a strong supporter of the
relational approach to database management for over 20 years, from time to time
I have found that his criticisms of the relational model have been
incorrect. I do agree with many
of his criticisms of SQL: However, he
often fails to make a clear distinction between SQL and the relational
model. SQL came after the relational
model was described; it was invented by a small IBM group in the Yorktown
Heights [N.Y.] Research Laboratory. In my
book, THE RELATIONAL MODEL FOR DATABASE MANAGEMENT: VERSION 2 [1], I
make it clear what semantic properties a relational language should have if it
is to conform to the model, and label such a language RL. I also describe three major shortcomings of
SQL (there are, of course, numerous others [2]):
Ø As
a user option, SQL permits rows to occur within a single relation that are
complete duplicates of each other. I
call this a tabular error because it is based on two
misconceptions:
1. That relations and tables are in
one-to-one correspondence
2. That duplicate rows are essential
to some applications
Ø Full
support of first-order predicate logic is sacrificed in the name of user
friendliness. I call this a psychological
mixup: A logically sound language
is absolutely necessary as a foundation.
Any useful "user-friendly features" should be grafted as a
layer on top, along with rigorously defined translation between layers.
Ø The
treatment of missing information is wrong for two reasons:
1. Support in the language for
multi-valued logic is grossly inadequate.
2. A user is permitted to designate
a value that is acceptable to a column specifically to indicate the fact that
some value is missing from that column.
I call this latter error one of missing-value misrepresentation.
Date has criticized the multi-valued logic approach to
missing values in the relational model, claiming it can lead to catastrophic
errors. He has advocated the
missing-value misrepresentation approach, which he calls the default value
approach. In 1986, when Date had his
original paper reprinted in the U.S., I prepared a technical response [3].
The ideas behind Date's default value approach came
completely from prerelational products that used single-record-at-a-time
processing. The default-value approach
appealed to RDBMS vendors because it placed all of the responsibility for the
representation and handling of missing values in a relational database
completely on the users. However, I
think it is best described as a nonsolution to the problem, and a complete
evasion of the issue. The approach
contains no clear description of how missing values in a column are to be
treated. That means that the treatment
will often be invented by application programmers and buried in their programs. It also means that there are likely to be
many different treatments buried in numerous programs.
Now that we are dealing with RDBMSs that employ
multiple-record-at-a-time processing, this default-value approach is
unacceptable for the following reasons:
1. The meaning of the fact that a value is missing
from some part or column of a relational database is quite different from the
meaning of a value that is legitimate within that part or column.
2. A single relational request can touch many different
columns in a relational database, and therefore it is intolerable that in
conceiving such a request the user should have to understand and cope with as
many different representations and treatments of missing values as the columns
that are touched. In a relational
database, both the representation and treatment of missing values must
be uniform across the entire database.
[See Date's Rebuttal I]
Date and other critics of multi-valued logic claim that
serious errors are inevitable if a multi-valued logic is made available to
users. However, such critics have
failed to provide a single example of a severely wrong answer being
delivered as a result of a multi-valued logic.
A result is severely incorrect if the logical expression is evaluated by
the DBMS to be either:
· True
when it is actually false or unknown
· False when it is actually true
or unknown
A result is mildly incorrect if the DBMS evaluates an
expression as unknown when it is actually either true or false. In the paper in which I introduced
three-valued logic (3VL) [4], I cited an example of a request mildly
mishandled by 3VL: For some
requests the condition would be evaluated as unknown when the correct
answer was true or false, if the DBMS were unable to recognize
tautologies. This example shows that
simple 3VL should be augmented by some inferential capability. An example would be the following: Suppose that the birth year is recorded for
most employees, but it is missing from the database for a few. Now, consider the request: Retrieve the serial numbers and names of
employees for each of whom
1. The birth year is 1960, or
2. The birth year is earlier than 1960, or
3. The birth year is later than 1960.
Suppose the DBMS does NOT have the capability of recognizing
that the whole condition must be true for every employee, whether the
birth year happens to be missing or not.
That is, it is unable to detect tautologies or contradictions. Then for
those employees whose birth year is unknown, the DBMS comes up with unknown
for each of the three subconditions.
And, using the rule that for truth values
unknown OR unknown is unknown,
it evaluates the whole condition to be unknown. This is an example of a mild error. This kind of error is just as likely to
occur (and other kinds much more likely) if the responsibility for handling
missing information is placed totally on the users.
[See Date's Rebuttal II]
Now, an obvious cure for this is to equip the DBMS not only
with 3VL, but also with the capability of recognizing for any whole condition
whether it is a tautology. This would
be easy if only propositional logic were being supported. However, the relational model requires the
more powerful predicate logic to be supported in specifying the condition part
of a request. It is well known that it
is a logically undecidable problem to determine whether an arbitrary formula in
predicate logic is a tautology or a contradiction.
Therefore, it is pointless to search for an allegedly
universal algorithm for detecting all possible tautologies and all possible
contradictions. A reasonably good
algorithm can be developed that will take care of at least all of the simple
cases that will be encountered in commercial activities, and this algorithm
should be incorporated into every RDBMS product. The RDBMS will then make mild errors only when a most unusual
request is made. An RDBMS must admit
its inability to deduce a sound response to a user request whenever this is
impossible because of missing values.
Also, present treatment by SQL of missing values is, in my opinion,
totally unsatisfactory. For a more
complete treatment of missing values and a refutation of Date's criticisms,
refer to my book [5].
[See Date's Rebuttal III]
Date's argument that true and false are the
only truth values, and that, therefore, unknown cannot be treated as a
logical value makes no sense to me.
After all, it is very common in mathematics to label unknown values by
letters such as m, n, x, y, z.
The fact that the letters m, n do not "look like" any
of the integers does not prevent them from actually having integer values in an
expression such as m + n, m - n, or an assertion that m
* m = m. In any event,
when dealing with missing values, an RDBMS must be able to determine whether
NOT A, A OR B, and A AND B is true, false, or unknown when A, or
B, or both are unknown.
Date's argument that the number of distinct functions from
truth values to truth values is very large, and that fact makes 3VL and
four-valued logic (4VL) unusable is ridiculous. After all, the number of distinct functions from integers to
integers is infinite, because the number of distinct integers is infinite. However, no one in his right mind would use
that as an argument that integers are unusable.
Taking the whole of Date's article into consideration, I
completely reject Date's claims:
· To
have inserted "more nails into the 3VL coffin"
· That
it is time to drop the pretense that 3VL is a good thing
[See Date's Rebuttal IV]
DATE'S REBUTTAL I
Before I begin, let me make one thing crystal clear: My quarrel is not with the relational
model. On the contrary, I felt at the
time when it was first introduced (and I still feel) that the original model
was a work of genius. All of us owe Dr.
Codd a huge debt of gratitude for his major contribution. And, as the originator and "elder
statesman" of relational theory, Codd always deserves the courtesy of very
close attention to his remarks on relational matters.
So my quarrel is not with the relational model, but rather
with nulls and three-valued logic (3VL), which—in a database context—were
first discussed by Codd in 1979. It is
true that Codd now regards 3VL as an integral part of the relational model, but
I do not (and I am not alone in taking this position). Indeed, the whole question of how to handle
missing information is largely independent of whether the underlying model is
relational or something else. Thus, I
would like to distinguish very carefully between what we might call
"RM" (the original model, with two-valued logic) and
"RM+3VL" (Codd's version, with three-valued logic). My quarrel, to repeat, is with the
"3VL" portion of "RM+3VL."
Now, regarding Codd's first point (that default values
misrepresent the fact that information is missing): I do not dispute this!
However, I would make two points:
1. It is default
values, not nulls, that we use in the real world, as I pointed out in my
December 1992 column in Database Programming & Design [6].
2. Nulls
misrepresent the semantics too (see below).
In other words, I don't think we yet know how not to misrepresent
the semantics; and given that this is so, I take the position that we should
not undermine the solid foundation of the relational model with something as
suspect as 3VL, when it demonstrably doesn't solve the problem anyway.*
----------
* In other words, I invoke the Principle of
Cautious Design—see the article of that name in RELATIONAL DATABASE WRITINGS
1989-1991, by C. J. Date and Hugh Darwen.
----------
Note: When I
talk about "undermining the foundations of the relational model,"
what I mean is that a "relation" that includes nulls, whatever else
it might be, is not a relation!—at least, not in the formal sense of
that term. As a consequence, the entire
foundation crumbles; we can no longer be sure of any aspect of the
underlying theory, and all bets are off. I cannot believe that Codd really wants to destroy the entire
edifice that he has so painstakingly constructed over the years.
As for nulls also misrepresenting the semantics, consider the
following two points:
Ø A
(Codd-style) 3VL system supports just one type of null, "value
unknown." There is thus a strong
likelihood that users will use that null for purposes for which it is not
appropriate. For example, suppose
employee Joe is not a salesperson and so does not qualify for a
commission. Then Joe's commission is
quite likely to be misrepresented as "value unknown" (it should of
course be "value does not apply").
One simple consequence of this misrepresentation error is that Joe's
total compensation (salary plus commission) will incorrectly evaluate to
"unknown" instead of to just the salary value.
What's more, an analogous argument
will continue to apply so long as the system supports fewer types of null than
are logically necessary. In other
words, simply adding support for a "value does not apply" null might
solve the specific problem mentioned in the previous paragraph, but it will not
solve the general problem. Thus, a
system that supports fewer types of null than are logically necessary is just
as open to misuse--perhaps even more so--than a system that does not support
nulls at all.
Ø Now
suppose the system supports two kinds of null, "value unknown" and
"value does not apply" and four-valued logic (4VL), and
suppose employee Joe's job is unknown.
What do we do about Joe's commission?
It surely must be null--the information is surely missing--but we don't
know whether that null should be "value unknown" or "value does
not apply." Perhaps we need
another kind of null, and five-valued logic ... This argument clearly
goes on for ever, leading to an apparent requirement for an infinite number
of kinds of null. What do we
conclude from this state of affairs?
Next, regarding Codd's allegation that the default-value approach
lacks a "clear description of how [default] values are to be
treated": I have published several
such descriptions over the past few years, the first in 1982, the most recent
in 1992 [7].
Of course, Codd is quite right to warn of the dangers of undisciplined
use of default values. That's why I
have consistently advocated a disciplined approach. By the way, a system that supports nulls can
still be used in an undisciplined way, as I have already shown. In fact, an argument can be made that such a
system is more susceptible to lack of discipline, partly (a) because of
the false sense of security provided by the fact that nulls are supported
("Missing information? Don't worry
about it, the system can handle it"), and partly (b) because
1. The system designers assume that users are
going to use nulls, and therefore
2. They typically don't provide explicit
system support for defaults, and therefore
3. Users who have made the (in my opinion,
very sensible) decision to avoid nulls are on their own--the system doesn't
help (in fact, it positively hinders).
Finally, I completely reject Codd's suggestion that the
default values idea comes from prerelational systems—on the contrary, it comes
from the real world, as I have already said.
I also reject the suggestion that it has anything to do with
"record-at-a-time" thinking—how to deal with missing information has
nothing to do with whether the operators are record-at-a-time or
set-at-a-time.
On behalf of the vendors, I also reject the suggestion that
default values appealed to them "because it placed all of the
responsibility on the users."
Might it not have been that the vendors had their own misgivings
concerning 3VL? In any case, I know of
no vendor that actually supported a proper default values scheme before
supporting 3VL. Moreover, a proper
default values scheme does not "place all of the responsibility on
the users." To contend otherwise
is to misrepresent the semantics of the default values scheme.*
----------
* I
don't claim that default values are a good solution to the problem,
however. I'd still like to find
something better.
----------
DATE'S REBUTTAL II
First, a small point regarding Codd's claim that the mild
error he notes "is just as likely to occur (and other kinds much more
likely)" in a default values scheme.
It seems to me that there is all the difference in the world
between:
·
Building a system--i.e., one based on 3VL--in which we know
errors will occur, because the system has logical flaws in it, and
·
Building a system that is at least logically correct
but is open to misuse. Any
system is open to misuse. That's why we
have to have discipline.
Next, and more important:
Contrary to Codd's claim that "[I] have failed to provide a single
example of a severely wrong answer," I gave the following example in 1989
[8], and repeated it in my December 1992 column in Database Programming
& Design. The database (DB1) is
shown in Fig. 1 (the "--" represents "value unknown"). The query is:
SELECT E#
FROM DEPT, EMP
WHERE NOT ( DEPT.D# =
EMP.D# AND EMP.D# = 'D1' )
+----------------------------------------------------------------+
¦ +----+ +---------+ ¦
¦ DEPT ¦ D# ¦ EMP ¦ E# ¦ D# ¦ ¦
¦ +----¦ +----+----¦ ¦
¦ ¦ D2 ¦ ¦ E1 ¦ -- ¦ ¦
¦ +----+ +---------+ ¦
+----------------------------------------------------------------+
Fig. 1: Example database DB1
It's not worth going through the example again in detail
here. The basic point is that the
expression in the WHERE clause is "actually unknown" (Codd's
phraseology) but is treated as false, with the result that employee
number E1 is not retrieved but in fact should be (the real-world answer to the
query is E1). This is a severe error by
Codd's definition.
Please note too that the foregoing is a 3VL error, not just
an SQL error. Paraphrasing slightly
from Codd's book [9], he states:
"Executing a query delivers only those cases in which the condition
part evaluates to true."
This is tantamount to treating unknown as false.
In case the reader is not convinced, let me give another
example. The database (DB2) is given in
Fig. 2. The query is "Does anyone
in department D1 earn a salary of 100K?"
This query will involve a test to see whether the literal row
"<D1,100K>" appears in the projection of EMP over D# and
SAL. In forming that projection,
however, the row "<D1,--->" will be eliminated [10]. Result:
The 3VL answer to the query is false; the real-world answer, by
contrast, is unknown. This is
surely also a severe error by Codd's definition.
+----------------------------------------------------------------+
¦ +---------------+ ¦
¦ EMP ¦ E# ¦ D# ¦ SAL ¦ ¦
¦ +----+----+-----¦ ¦
¦ ¦ E1 ¦ D1 ¦ 50K ¦ ¦
¦ ¦ E2 ¦ D1 ¦ --- ¦ ¦
¦ +---------------+ ¦
+----------------------------------------------------------------+
Fig. 2: Example database DB2
Finally, and much more important still: The whole business of "severe" vs.
"mild" errors is in any case surely nonsense. It seems to me that it is nothing more than
a rearguard attempt to shore up an already suspect position. After all, if we were talking about integers
instead of truth values, what would we think of a system that occasionally
produced the answer 2 when the correct answer was 1 or 3? And in what sense could this be any more
acceptable than one that occasionally produced 1 or 3 when the correct answer
was 2?
Suppose the DBMS says it doesn't know whether Country
X is developing a nuclear weapon, whereas in fact Country X is not doing
so; and suppose Country Y therefore decides to bomb Country X "back to the
Stone Age," just in case. The error
here can hardly be said to be mild.
(This example is not to my taste; I choose it deliberately for its shock
value.)
DATE'S REBUTTAL III
Here I would just like to raise a few questions.
1. What
evidence is there that "a reasonably good algorithm can be
developed"?
2. Is there a
precise characterization of the "simple cases" that such an algorithm
will handle?
3. Is that
characterization intuitively understandable?
In other words, will the user be able to predict with any confidence
whether or not the DBMS is going to give the right answer to a given
query?
4. If the
answer to the previous question is no, then why would any user ever use the
system for any purpose at all?
5. In fact,
Codd is requiring the DBMS itself to "admit its inability to deduce a
sound response to a user request" whenever applicable. In other words, he is asking for a decision
procedure regarding the decidability of formulas in three-valued logic. What evidence is there that such a procedure
exists?
6. What percentage
of real-world queries that are "encountered in commercial activities"
are "simple" in the foregoing sense?
7. What
evidence exists to support the answer to the previous question?
8. If we are
limited to using "simple" queries only, exactly what incremental
value is the "RM+3VL" system providing over a prerelational,
record-at-a-time system?
9. What does
"commercial activities" include?
Does it include financial applications?
Manufacturing applications?
Hardware and software design applications? Geographic and mapping applications? Document-handling applications?
Medical applications? Chemical
applications?
I think it is time to quote Wittgenstein again: All logical differences are big
differences.
Now, turning to Codd's discussion of missing values in his
book, and refutation of my views therein:
Codd claims that his book "refutes my criticisms in
detail." I don't think it does. The two major criticisms dealt with in his
book are (1) "the alleged counterintuitive nature [of nulls and 3VL]"
and (2) "the alleged breakdown of normalization."
Ø Regarding
(1), Codd does not address the counterintuitive nature of 3VL per se,
but instead claims that default values are counterintuitive too. In doing so, incidentally, he confuses the semantics
of the two very different expressions "not known" and "known
not"--a trap that is all too easy to fall into, of course (indeed, this
confusion is precisely one of the reasons why I claim that 3VL is
counterintuitive). In an earlier paper
[11], I gave an example of a (very simple!) query involving 3VL that Codd and I
both seriously misinterpreted when we first considered it (after a
somewhat lengthy discussion too, I might add).
The misinterpretation rested once again on the distinction between
"not known" and "known not." I stand by my contention that 3VL is difficult to deal with on an
intuitive level.
Ø Regarding
(2), I originally claimed that "the fundamental theorem of
normalization" breaks down in the presence of nulls [12], and so it does. Codd's counterargument is unconvincing.
In any case, I have several other serious criticisms of 3VL
that Codd's chapter does not address at all.
They include (and this is not an exhaustive list):
Ø The
fact that we apparently need an infinite number of types of null
Ø The
semantic overloading or "misrepresentation" that will occur if not
all types of null are supported (bound to be the case, given the previous
point)
Ø The
lack of a convincing justification for the different treatment of equality of nulls
in comparisons vs. equality of nulls in duplicate elimination
Ø The
fact that the (admittedly informal) argument in support of the entity integrity
rule ("primary keys in base relations do not permit nulls") quite
obviously extends to every column in the database--implying that nulls
should be inadmissible everywhere
Ø If
TABLE_DUM corresponds to false and TABLE_DEE corresponds to true,
what corresponds to unknown? (see my March 1993 column in Database
Programming & Design [13])
DATE'S REBUTTAL IV
Regarding my argument that there are only two truth
values: Codd's counterargument here
makes no sense to me. Is he suggesting
that unknown is not a truth value after all, but just a variable whose
actual value at any given time is either true or false? So we aren't really dealing with 3VL after
all?
The only way I might make sense of Codd's position here is to
interpret his remarks as actually agreeing with what I said in my
January 1993 column (which I'm sure was not what he intended). Here's what I said in that column [14]:
How many truth values are there? The answer, of course, is two, namely true and false. We might SAY that unknown is a
third truth value, but that doesn't make it one. After all, I might say that oggle-poggle is another integer,
but that doesn't make it one; it has absolutely no effect on the set of all
integers. Likewise, the set of all
truth values just IS the set {true, false}, and there is
nothing more to be said.
If we are given some proposition, say the proposition
"Employee E1 works in department D1," then that proposition is either
true or false. I might
not know which it is, but it is one of the two (if it isn't, it isn't a
proposition). Let's assume, in fact,
that I don't know which it is. Then I
certainly might say, informally, that the truth value of the proposition is
unknown to me; but that "unknown" is a very different kind of object
from the truth values true and false themselves. And pretending that it is the same kind of
object--in other words, pretending that we have three truth values--is bound to
lead to problems of interpretation (as of course it does).
Note: Of
course, I understand the point that we are free to define a purely formal
system in which there are as many "truth values" as we like. This does not alter the fact that, in the
real world, the values true and false (on the one hand) and the
value unknown (on the other) are totally different kinds of things.
Finally, regarding my argument concerning truth-valued
functions: Here Codd both misrepresents
my position and misses the point. My
argument was not that because there were so many functions, we should not
support 3VL; rather, it was that if we want to support 3VL, we should be sure
that we support all possible 3VL functions.
Now, in the case of integers, it is true that the total number of
functions is infinite--BUT we know that any computable function is
expressible in terms of a small number of primitive operators, so all we have
to do is support those primitives properly.
Likewise, in 2VL we know that all 2VL functions can be expressed in
terms of a small (very small!) number of primitive operators, and so again all
we have to do is support those primitives properly.
For 3VL, therefore, I was asking, first, for a suitable set
of primitive operators that would guarantee that all 19,710 logical functions
were supported (indeed, if any of those functions are not supported,
then it cannot be claimed that the system we are dealing with is 3VL). I was also asking for a suitable set of useful
operators (not necessarily the same thing as primitive operators). I was also asking for a proof of
completeness. I was also raising
questions of testing, debugging, and usability. And then I was asking the analogous questions all over again for
4VL, where there are over four billion possible functions. I believe these are serious questions that
advocates of 3VL and 4VL are morally obliged to address.
Taking the whole of Codd's comments into consideration, I
stand more firmly than ever by my original position.
REBUTTING THE REBUTTALS
(To close out this discussion, Dr. Codd offers some
comments on Date's rebuttals.)
Just about every database contains missing values scattered
over numerous parts of the database.
For example, an employee's birthdate might have to be marked
"missing but applicable," because it is at present unknown. Or, the employee's year-to-date commission
may have to be marked as missing and inapplicable because he or she is not a
salesperson.
Database management would be simpler if missing values didn't
exist.*
Unfortunately, for a variety of reasons, they do occur and need to be
managed. Date's assertion that a
relation containing missing values is not a relation is unacceptable. While relations that contain missing values
are not normally encountered in mathematics, the same operators in the
relational model continue to be applicable.
Requests expressed in a relational language must be able to cope with
missing values, without resorting to guessing.
----------
* I
like this sentence!
----------
With Date's default value approach, both the representation
and treatment of missing values can be peculiar to the columns in which missing
values are permitted. This might be
acceptable in a single-record-at-a-time DBMS; it's clearly not in a
multiple-record-at-a-time DBMS. The
principal reason for adopting an approach that is uniform across the entire
database is that a single relational request may involve data from numerous
distinct columns of the database, and many of these columns are likely to be
permitted to have missing values.
Imagine a request that deals with 12 or more such columns: A significant part of formulating this
request would involve detailed knowledge of the 12 or more representations and
treatments of missing values, if the default value scheme were adopted.
For uniformity, one might look to support in hardware. Today's memory technology, however, can't
distinguish between values to be taken seriously and those that are not--such as
those left in some condition by some previous activity. Once a disk is formatted, the computer
regards every bit as part of the value of something. In my relational model version 2 (RM/V2), any column in which
missing values are permitted is assigned one extra byte to indicate, for each
row, whether the associated value is:
1. To be treated seriously, or
2. Missing and applicable, or
3. Missing and inapplicable.
IBM's DB2 partially supports this representation.
A basic ground rule in the relational model is that the
DBMS must NOT provide a definite response to a query when it is not certain
about the response because values are missing. I remember well when I first arrived in New York City to reside
in the U.S. It was the fall of 1948 and
I was looking vigorously for a job.
Often I would have to ask people on the street how to get to specific
parts of the city. I received detailed
directions, but almost invariably these directions were wrong. I stopped asking and used street maps
instead. Similarly, people who use a
DBMS that guesses the answer to a query but delivers it as if it were
not guessing are likely to abandon its use.*
----------
* I
(Date) remark in passing that "guessing the answer and delivering it as if
it were not guessing" is exactly what SQL does when it applies an
aggregate function such as AVG to a collection of values that happens to
include some nulls. Suppose the user
asks for the average salary of employees, and at least one employee is shown in
the database as having a null salary.
Then SQL simply ignores such employees and returns the average salary of
the rest!--instead of responding "I don't know," which would be more
honest. (It does admittedly return a
warning code also, but there is no way to force the user to check for that
code.)
----------
Re: Date's Rebuttal I
In Date's Rebuttal I, he decries the inadequacies of
three-valued logic (3VL), which I first discussed in 1979. In 1986, I proposed a significant
improvement [3], which distinguished between two categories of missing database
values:
1. Temporarily unknown
2. Inapplicable, and hence unknowable
Semantics make this distinction necessary. I also proposed four-valued logic (4VL) and additional
general-purpose functions to permit adequate handling of missing
information. Date, in many of his
examples, ignores both. This double
oversight makes his examples merely cases of the incorrect use of the missing
value machinery in RM/V2 [1].
In his Rebuttal I,
Date asserts: "It is default values,
not nulls, that we use in the real world." Arguments of this type can, and have, been used [sic] to
delay every technical or scientific step forward. It could have been used to argue against the introduction of
computers: "It is mental
arithmetic that we use in the real world, not machines." The phrase "real world" is a
serious trap, because what is real is continually changing.
Let's take one of Date's examples: A user enters data with an element missing, and he or she does
not know whether the element is applicable or inapplicable. Date would claim that, as a consequence, we
need a third kind of missing value. In
his rebuttal, he goes on to say that we need more and more distinct types of
missing values.
I reject this sequence of arguments. In RM/V2, I discontinued using
"null" because the term has been so often misinterpreted. As I pointed out earlier, missing values are
either A-marked (applicable, presently unknown) or I-marked
(inapplicable, hence unknowable). Let's
assume we have an RDBMS that is faithful to RM/V2. As background, remember that:
·
For each column other than a primary key column, the
DBA may declare that A-marks be permitted or prohibited.
·
For each column other than a primary or foreign key
column, the DBA may declare that I-marks be permitted or prohibited.
· A-marks
are weaker and more flexible than I-marks [1].
·
A-marks likely occur more often than I-marks in a
relational database that is in conceptual normal form (p) because p
is the maximum percentage of I-marked values in any column, and p is
normally set by the DBA to be considerably less than 1.
·
Whenever a tuple is entered with a missing value, this
value is A-marked in the database, unless an integrity constraint exists that
clearly indicates an I-mark must be recorded.
In this example, we must assume that both types of marking
are permitted in the pertinent column, and that no declared integrity
constraint resolves the issue of whether an A-mark or I-mark should be
used. Then RM/V2 marks the missing
value as applicable. Later, if it is
discovered that the value should have been I-marked, not A-marked, then the DBA
or someone with suitable authorization changes the marking on this missing
value. Thus, I fail to see the need for
more than two kinds of markings.
Date also asserts that multi-valued logic destroys the
foundation upon which the Relational Model is built. I do not agree. There is
no theoretical impairment and no loss of usability, whereas both scope and
usability are lost if the default value scheme advocated by Date is
adopted.
Date also asserts that normalization becomes invalid when
multi-valued logic is introduced. This
is false, providing that this logic is introduced correctly (few RDBMSs do this
today) and care is taken with its use.
For example, if the RDBMS supported DBA-defined requests (few do), the
DBA could define integrity constraints that will be stored in the catalog to
enable the RDBMS to enforce the functional, multi-valued, and inclusion
dependencies discovered at database design time. However, the RDBMS must withhold the enforcement of these
constraints from the missing tuple components until they are replaced by
actual values. This enforcement should
occur at the time of attempted replacement.
Finally, I oppose the use of default values only if it's done
to represent that a value is missing.
Default values may be useful in other contexts. For example, a bank teller shouldn't be
required to re-enter his or her terminal identifier every time he or she enters
a customer transaction. The terminal
should handle this itself.
TECHNICAL CORRESPONDENCE
The foregoing debate, perhaps not surprisingly, led to a
number of letters from readers. In
fact, more than one reader wrote in with attempts to salvage the idea that
"automatic" null support be provided without having to stray
beyond the boundaries of conventional two-valued logic (2VL). I regret that I did not find these attempts
very convincing, and choose not to discuss them here.
Ceuan Clement-Davies
from Frankfurt, West Germany, offered the following comment, which I cannot do
better than quote verbatim:
One thing struck me forcibly ... [Codd's] remarks on tautologies
seemed a significant admission. Since
the example he gives [regarding birth years] isn't a tautology in Lukasiewicz's
system (and any system of three-valued logic in which this was made to be a
tautology would show unfortunate effects elsewhere), it isn't at all clear to
me whether Codd is suggesting that a RDBMS should use two-valued logic
to detect tautologies, and three-valued logic for everything else. This would be a curious mixture.
Stephen Ferg of
the U.S. Department of Labor (Bureau of Labor Statistics) wrote claiming that:
1. The concept of null is deeply embedded in
the relational model, and probably cannot be removed from it.
2. Nulls often, and in some cases must, have
no semantic content whatever: They
simply mean that there is no value in a given column of a given row.
He then went on to
say (this is an edited extract from his letter):
Consider an EMP table with columns E#, ETYPE, and TOTSALES. ETYPE indicates whether the employee in
question is a member of the sales staff ... TOTSALES indicates the total number
of sales the employee has made since the beginning of the year ... For an
employee not on the sales staff, TOTSALES will be null. For a sales employee, TOTSALES may be null
until the employee files his or her first sales report (until then, the number
is applicable, but unknown).
It is this kind of example that both Codd and Date seemed to
have in mind during their debate. But
there is another use for null that is far more important. Suppose we have two entities, EMP and DEPT,
and a relationship, ASSIGNED, with the constraint that (at any given time) each
employee is assigned to at most one department. There are two employees (E1, E2) and two departments (D1,
D2). E1 is assigned to D1, but E2 is
not currently assigned to any department (E2 is on leave of absence and will be
assigned to a department when [he or she] returns from that leave). A logically ideal implementation of this
situation is shown in Fig. 3.
+----------------------------------------------------------------+
¦ EMP ASSIGNED DEPT ¦
¦ +----+ +---------+ +----+ ¦
¦ ¦ E# ¦ ¦ E# ¦ D# ¦ ¦ D# ¦ ¦
¦ +----¦ +----+----¦ +----¦ ¦
¦ ¦ E1 ¦ ¦ E1 ¦ D1 ¦ ¦ D1 ¦ ¦
¦ ¦ E2 ¦ +---------+ ¦ D2 ¦ ¦
¦ +----+ +----+ ¦
+----------------------------------------------------------------+
Fig. 3: The departments and employees database (first
version)
Because ASSIGNED is a many-to-one relationship, however, the
ASSIGNED table and the EMP table have the same primary key. This permits the schema to be
"optimized" by merging the two tables, so that ASSIGNED.D# becomes a
foreign key in the EMP table, producing the structure shown in Fig. 4.
+----------------------------------------------------------------+
¦
EMP DEPT ¦
¦
+---------+ +----+ ¦
¦ ¦ E# ¦
D# ¦ ¦ D# ¦ ¦
¦
+----+----¦ +----¦ ¦
¦ ¦ E1 ¦
D1 ¦ ¦ D1 ¦ ¦
¦ ¦ E2 ¦
-- ¦ ¦ D2 ¦ ¦
¦
+---------+ +----+ ¦
+----------------------------------------------------------------+
Fig. 4: The departments and employees database (second
version)
There are three things worth noting about Fig. 4.
1. First, such an optimized design is extremely common--so
common that many database designers think of it as the ideal relational
implementation of such a situation, rather than as an optimized implementation
one step removed from the ideal implementation.
2. Second, such a design requires the use of nulls in foreign
keys ...
3. Third, in such a design a null has no semantic content
whatever: It exists only because of
optimization and implementation considerations ... It is not the case that E2
cannot be assigned to a department, and it is not true that E2's department
number is "missing" or "unknown" (as if such a number
really existed but we just don't know what it is). So ... we have a third kind of null: It doesn't mean "inapplicable" and it doesn't mean
"unknown"--it simply means that the column has no value.
Note that the design of Fig. 3 also does not avoid the need for
a "simply no value" null: If
we do a left outer join of the EMP and ASSIGNED tables in that figure, we will
produce a result table that is exactly the same as the EMP table in Fig.
4. An outer join produces nulls because
we ask the DBMS to show us data from one table even when no matching data can
be found in another table. Such nulls
have no meaning, no semantic content ...
My conclusions:
1. Despite the difficulties with null that Date points out, we
cannot have a relational model that is altogether free of null, so we had
better learn to live with it.
2. Despite Codd's impulse to distinguish different kinds of null
on semantic grounds, it is probably better just to let null mean "no value
here" ... [in order to avoid] the unmanageable complexities of 3-valued,
4-valued, ..., n-valued logic.
My reactions to Ferg's letter are as follows.
First of all, I agree with Ferg that the design of Fig. 3 is
preferable to that of Fig. 4. However,
Ferg claims that nulls are "deeply embedded in the relational model and
probably cannot be removed from it."
But he provides no proof of this claim, and I could not possibly
disagree with it more. (After all, the
model survived very well without nulls for about ten years!) In fact, I challenge him to produce an example
of a problem that appears to need nulls for its solution that cannot also be
solved without them.
Ferg goes on to claim that nulls "often ... have no
semantic content whatever," but subsequently contradicts himself on this
very point. In the example he uses to
support his claim, the null department number in the EMP row <E2,-->
certainly does have "semantic content"--it means, loosely, that the
employee will be assigned to a department when [he or she] returns from leave
of absence (Ferg's own words).
Later Ferg says that the nulls that appear in the result of
an outer join also "have no meaning, no semantic content." Again I disagree. The point is not that the nulls have no meaning, but rather that
(in such a result) different nulls have different meanings, as I have
illustrated elsewhere (see "Watch Out for Outer Join," in C. J. Date
and Hugh Darwen, Relational Database Writings 1989-1991, Addison-Wesley,
1992, pp. 328-330).
In any case, the question "Do nulls have semantic
content?" is the wrong question to ask.
The point is, operators (logical, computational, relational, or
whatever) must be defined to deal with nulls somehow. And the behavior of those operators
effectively defines the semantics of nulls. In other words, nulls must always--necessarily--have some
"semantic content." Ferg's
suggestion that we should just let nulls mean "no value here" thus
really makes no sense.
To pursue the previous point a moment longer: A large part of my objection to nulls is
based on the fact that the particular "semantic content" defined by
the operators of three-valued logic is of no practical value for the purpose at
hand, because that "semantic content" does not mimic the way the real
world works. As I put it in my original
December 1992 column [6], answers that 3VL says are correct are often not the
answers that are correct in the real world.
One further point:
Ferg also gives an example in which the TOTSALES figure for a sales
employee who has not yet filed a sales report is shown as null. It should of course be zero! This is another illustration of the kinds of
mistakes that people are likely to make in a system that supports nulls. Indeed, SQL itself makes the same mistake
when it defines the sum of an empty set to be null. See my April 1993 column "Empty Bags and Identity
Crises" [15].
Another correspondent, Martin H. Rusoff of Banc One Financial
Card Services in Columbus, Ohio, wrote as follows (again this is an edited
extract):
It seems to me that the entire discussion is slightly off target
... There are usually several alternatives to handling [missing
information]. These can range from
ignoring it up to taking exceptional measures to discover the missing
data. Any of these could be automated,
but it requires knowledge of what the data means and possibly why it
is missing to decide the correct response.
While this might be decided in the data model, more often it depends on
the use to which the data will be put--i.e., on the application.
Example: Suppose we are given table EMP with a row as
shown in Fig. 5, and we need to calculate the total amount needed to meet
payroll. Then there are a number of
possible responses:
+----------------------------------------------------------------+
¦
+------------------------------------------+ ¦
¦ EMP ¦ E# ¦ ENAME ¦ GRADE ¦ SALARY ¦ COMMISSION
¦ ¦
¦
+----+-------+-------+--------+------------¦ ¦
¦ ¦ E1 ¦ Smith
¦ T4 ¦ -- ¦ --
¦ ¦
¦
+------------------------------------------+ ¦
+----------------------------------------------------------------+
Fig. 5: Example EMP row
· I don't know.
· Use
the top and bottom figures for the salary for grade T4 to compute maximum and
minimum amounts (a similar technique could be used to determine the commission,
based on whether grade T4 is eligible for commission or not, and then looking
at the maximum and minimum values in the database).
· Use a
default of some kind, possibly calculated based on the maximums and minimums
computed above.
· Ignore
this employee.
· Use
statistical data stored elsewhere to come up with probable amounts and then
calculate a confidence [rating] for the entire result.
· Initiate
exception processing to determine the answer, possibly then continuing to
calculate the result ignoring the missing values.
Depending on the
situation, any of these might be acceptable.
However, an accountant would probably calculate the maximums after
determining if a commission might apply and attach a note saying that the data
was missing for certain listed individuals.
I do not see how a database engine could decide this. End of example.
I fully agree that the
current treatment is error prone ... [However,] I am not sure that legislating
3VL, fuzzy logic (with ranges or probabilities), or using defaults is the right
answer ... In the end, it might be appropriate to always generate an error
whenever a null is located and permit an application to rerun the query using
additional qualifiers telling the database engine how to process the nulls.
Instead of debates of
this kind, I would like to see an explanation of how the existing facilities
can be used correctly to handle missing data.
I responded to these comments as follows:
I completely agree with Martin Rusoff that it will usually be
the case that only the application can decide what to do when nulls are
encountered. His example makes the
point admirably. But I do not agree
with him that "the entire discussion is slightly off target." To be candid, I will admit to a sneaking
sympathy with this position--I know the debate must have seemed somewhat
academic and not too relevant to the rough-and-tumble of day-to-day
operations--but the fact is that such a perception is sadly mistaken.
First of all (as I have observed many times, in the pages of Database
Programming & Design and elsewhere), theory is practical! That is, the theory on which a given DBMS is
based necessarily has very practical consequences for the user of that
DBMS. And if that theory is bad, the
consequences will be bad too.
To see that this is so, it is sufficient to realize that it
will often be the DBMS, not the application, that has to "decide what to
do when nulls are encountered." I
have argued this point before for the particular case of the optimizer
component of the DBMS (see reference [6]).
Note, moreover, that not all nulls "preexist" in base
tables--some are generated dynamically (i.e., in the middle of executing a
query). As a result, it is a virtual
certainty that it will be the DBMS, not the application, that will have to
decide how to deal with them. And the
DBMS will typically not have the application-specific knowledge to enable it to
make the correct application-specific decisions. Consequence: Wrong
answers!
Also, of course, there is the point--admittedly only a
psychological point, not a logical one, but a point that is very much a
practical consideration--that users will be lulled into a false sense of
security by the fact that the system can "handle" missing
information. That is, applications
often do not include the necessary logic to deal with nulls, even when
they should, because the user is under the misapprehension that the system has
taken that burden off the user's shoulders.
Finally, James R. Alexander of the Goochland-Powhatan
Community Services Board, Goochland, Virginia, wrote as follows:
If I do not know the gender of Person A and I am asked "Is
Person A female?", I respond "I don't know." However, if I am asked "Do you know if
Person A is female?", I respond "No." The first question concerns Person A's gender, the second
question concerns my knowledge of Person A's gender. A query of a database is certainly a question of the second type
...
I have developed and
been using for over three years a data collection engine which is the front-end
for a database ... This engine understands that every attribute has, in
addition to its explicit domain, an implicit domain ... composed of unknown,
not applicable, and not represented (i.e., other), which are consistently
represented by ?, !, and #, respectively.
The human services organization I work for uses this data collection
engine to maintain data for many typical business applications ... We use
various off-the-shelf reporting [programs], which employ two-valued logic, and
we get correct results.
I could not agree more with Mr. Alexander. I said much the same thing in reference [11]
(page 234): "[We] obviously cannot
ask the system questions about the real world per se, only about its
knowledge of the real world as represented by values in the
database." The critical point, as
Mr. Alexander observes, is to stay within two-valued logic. I am glad to hear he has been using his
technique successfully for several years.
REFERENCES AND BIBLIOGRAPHY
1. E. F. Codd: The
Relational Model for Database Management Version 2. Reading, Mass.: Addison-Wesley (1990). RM/V2 treatment of missing values may be
found in Chapters 8 and 9. For a
discussion of the two types of missing values, see p. 191.
2. Ibid.,
Chapter 23. Author's note: "This chapter describes adverse
consequences of these errors, suggests corrective steps that DBMS vendors
should apply, and precautionary steps users should take in the meantime."
3. E. F. Codd:
"Missing Information (Applicable and Inapplicable) in Relational
Databases," ACM SIGMOD Record 15, No. 4 (1986). See also E. F. Codd, "More Commentary
on Missing Information," ACM SIGMOD Record 15, No. 5 (1986).
4. E. F. Codd:
"Extending the Database Relational Model to Capture More Meaning," ACM
TODS 4, No. 4 (September 1979).
5. E. F. Codd:
Chapter 23 of The Relational Model for Database Management Version 2. Reading, Mass.: Addison-Wesley (1990). Author's note: "Date's criticisms are refuted in
detail in Chapter 9. (Unfortunately, in
the first printing, two errors appeared in the truth table for logical OR on
pages 182 and 236. These have been
corrected.)"
6. C. J. Date:
"Why Three-Valued Logic Is a Mistake," in Relational Database
Writings 1991-1994. Reading, Mass.:
Addison-Wesley, 1995.
7. C. J. Date:
Section 5.5 ("Null Values") of An Introduction to Database
Management: Volume II. Reading,
Mass.: Addison-Wesley (1982). A more
recent treatment of this topic may be found in C. J. Date, "The Default
Values Approach to Missing Information," in C. J. Date and Hugh Darwen, Relational
Database Writings 1989-1991.
Reading, Mass.: Addison-Wesley (1992).
8. C. J. Date:
"Three-Valued Logic and the Real World," InfoDB 4, No. 4
(Winter 1989). This article was
republished in C. J. Date and Hugh Darwen, Relational Database Writings
1989-1991. Reading, Mass.:
Addison-Wesley (1990).
9. E. F. Codd: The
Relational Model for Database Management Version 2, page 183. Reading, Mass.: Addison-Wesley (1990). (The quotation is a paraphrasing.)
10. Ibid., page 189.
11. C. J. Date: "NOT Is Not "Not"! (Notes on
Three-Valued Logic and Related Matters," in Relational Database
Writings 1985-1989. Reading, Mass.:
Addison-Wesley (1990).
12. C. J. Date: Section 5.5 ("Null Values") of An
Introduction to Database Management: Volume II. Reading, Mass.: Addison-Wesley (1982).
13. C. J. Date: "Tables with No Columns," in Relational
Database Writings 1991-1994.
Reading, Mass.: Addison-Wesley, 1995.
14. C. J. Date: "Nothing in Excess," in Relational
Database Writings 1991-1994.
Reading, Mass.: Addison-Wesley, 1995.
15. C. J. Date: "Empty Bags and Identity Crises,"
in Relational Database Writings 1991-1994. Reading, Mass.: Addison-Wesley, 1995.