MUCH ADO ABOUT NOTHING
by E. F. Codd and C. J. Date

 

 

 

ABSTRACT

 

This article contains the text of a debate between Dr. E. F. Codd and myself (C. J. Date) on nulls and related matters.  It probably doesn't resolve anything, but it at least touches on most of the arguments on both sides of the issue.  The article was originally published in Database Programming & Design 6, No. 10 (October 1993), where it was described as "a point/counterpoint on the tough issue of missing values."  An edited version of the original article subsequently appeared in my book RELATIONAL DATABASE WRITINGS 1991-1994.  What follows is a slightly revised version of that chapter. 

 

 

INTRODUCTION

 

(The introduction that follows is a lightly edited version of Database Programming & Design's own introduction to the original debate.) 

 

Dr. E. F. Codd and C. J. Date are two of the best known figures in the history, development, and exposition of what was a breakthrough concept in database technology: the relational model.  Ever since the model was first defined by Codd in 1970,* in his famous paper A Relational Model of Data for Large Shared Data Banks, we have been reading, listening to, and interpreting their commentaries on it.  While Codd and Date have agreed upon much during the course of the relational model's evolution and implementation, on some issues they definitely do not agree.  One important disagreement—and the topic of this debate—centers on the issue of nulls and missing values, and the underlying theoretical problems of three- and four-valued logic. 

 

----------

*  Actually Codd's first paper on the relational model (Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks) was dated 1969. 

----------

 

Both have written extensively on these topics, as is noted in the “References and Bibliography" section at the end.  The comments presented here were sparked by Date's columns in our magazine beginning last December, when he discussed the three-valued logic approach to missing information.  Codd then sent us his criticisms of Date's writings; his commentary is presented here.  Date then provided a rebuttal to specific points of Codd's, which follows Codd's remarks.  Finally, we give Codd a chance to rebut Date's rebuttal. 

 

For the reader's convenience, we kept Codd's commentary together, so it may be read as a whole.  We noted throughout, however, where Date's specific rebuttals apply, and should be read.  This way, the reader may follow Codd's comments all the way through, and then return to his essay, reading Date's rebuttals as appropriate. 

 

At first, the issues may seem arcane and theoretical, but most developers and DBAs know they clearly are not, and merit serious debate.  Missing values remain one of the toughest—and potentially, most dangerous--problems in database technology.

 

[Ed. Note: The reader is referred to PRACTICAL DATABASE FOUNDATIONS paper #8, The Final NULL in the Coffin for a possible relational solution to the missing values problem.] 

 

We begin with Dr. Codd's commentary. 

 

 

CODD'S COMMENTARY

 

Although C. J. Date has been a strong supporter of the relational approach to database management for over 20 years, from time to time I have found that his criticisms of the relational model have been incorrect.  I do agree with many of his criticisms of SQL:  However, he often fails to make a clear distinction between SQL and the relational model.  SQL came after the relational model was described; it was invented by a small IBM group in the Yorktown Heights [N.Y.] Research Laboratory.  In my book, THE RELATIONAL MODEL FOR DATABASE MANAGEMENT: VERSION 2 [1], I make it clear what semantic properties a relational language should have if it is to conform to the model, and label such a language RL.  I also describe three major shortcomings of SQL (there are, of course, numerous others [2]): 

 

Ø       As a user option, SQL permits rows to occur within a single relation that are complete duplicates of each other.  I call this a tabular error because it is based on two misconceptions: 

 

1. That relations and tables are in one-to-one correspondence

 

2. That duplicate rows are essential to some applications

 

Ø       Full support of first-order predicate logic is sacrificed in the name of user friendliness.  I call this a psychological mixup:  A logically sound language is absolutely necessary as a foundation.  Any useful "user-friendly features" should be grafted as a layer on top, along with rigorously defined translation between layers. 

 

Ø       The treatment of missing information is wrong for two reasons: 

 

1. Support in the language for multi-valued logic is grossly inadequate. 

 

2. A user is permitted to designate a value that is acceptable to a column specifically to indicate the fact that some value is missing from that column.  I call this latter error one of missing-value misrepresentation

 

Date has criticized the multi-valued logic approach to missing values in the relational model, claiming it can lead to catastrophic errors.  He has advocated the missing-value misrepresentation approach, which he calls the default value approach.  In 1986, when Date had his original paper reprinted in the U.S., I prepared a technical response [3]. 

 

The ideas behind Date's default value approach came completely from prerelational products that used single-record-at-a-time processing.  The default-value approach appealed to RDBMS vendors because it placed all of the responsibility for the representation and handling of missing values in a relational database completely on the users.  However, I think it is best described as a nonsolution to the problem, and a complete evasion of the issue.  The approach contains no clear description of how missing values in a column are to be treated.  That means that the treatment will often be invented by application programmers and buried in their programs.  It also means that there are likely to be many different treatments buried in numerous programs. 

 

Now that we are dealing with RDBMSs that employ multiple-record-at-a-time processing, this default-value approach is unacceptable for the following reasons: 

 

1. The meaning of the fact that a value is missing from some part or column of a relational database is quite different from the meaning of a value that is legitimate within that part or column. 

 

2. A single relational request can touch many different columns in a relational database, and therefore it is intolerable that in conceiving such a request the user should have to understand and cope with as many different representations and treatments of missing values as the columns that are touched.  In a relational database, both the representation and treatment of missing values must be uniform across the entire database. 

 

[See Date's Rebuttal I]

 

Date and other critics of multi-valued logic claim that serious errors are inevitable if a multi-valued logic is made available to users.  However, such critics have failed to provide a single example of a severely wrong answer being delivered as a result of a multi-valued logic.  A result is severely incorrect if the logical expression is evaluated by the DBMS to be either: 

 

·  True when it is actually false or unknown

·   False when it is actually true or unknown

 

A result is mildly incorrect if the DBMS evaluates an expression as unknown when it is actually either true or false.  In the paper in which I introduced three-valued logic (3VL) [4], I cited an example of a request mildly mishandled by 3VL:  For some requests the condition would be evaluated as unknown when the correct answer was true or false, if the DBMS were unable to recognize tautologies.  This example shows that simple 3VL should be augmented by some inferential capability.  An example would be the following:  Suppose that the birth year is recorded for most employees, but it is missing from the database for a few.  Now, consider the request:  Retrieve the serial numbers and names of employees for each of whom

 

 1. The birth year is 1960, or

 2. The birth year is earlier than 1960, or

 3. The birth year is later than 1960. 

 

Suppose the DBMS does NOT have the capability of recognizing that the whole condition must be true for every employee, whether the birth year happens to be missing or not.  That is, it is unable to detect tautologies or contradictions. Then for those employees whose birth year is unknown, the DBMS comes up with unknown for each of the three subconditions.  And, using the rule that for truth values

 

    unknown OR unknown is unknown,

 

it evaluates the whole condition to be unknown.  This is an example of a mild error.  This kind of error is just as likely to occur (and other kinds much more likely) if the responsibility for handling missing information is placed totally on the users. 

 

[See Date's Rebuttal II]

 

Now, an obvious cure for this is to equip the DBMS not only with 3VL, but also with the capability of recognizing for any whole condition whether it is a tautology.  This would be easy if only propositional logic were being supported.  However, the relational model requires the more powerful predicate logic to be supported in specifying the condition part of a request.  It is well known that it is a logically undecidable problem to determine whether an arbitrary formula in predicate logic is a tautology or a contradiction. 

 

Therefore, it is pointless to search for an allegedly universal algorithm for detecting all possible tautologies and all possible contradictions.  A reasonably good algorithm can be developed that will take care of at least all of the simple cases that will be encountered in commercial activities, and this algorithm should be incorporated into every RDBMS product.  The RDBMS will then make mild errors only when a most unusual request is made.  An RDBMS must admit its inability to deduce a sound response to a user request whenever this is impossible because of missing values.  Also, present treatment by SQL of missing values is, in my opinion, totally unsatisfactory.  For a more complete treatment of missing values and a refutation of Date's criticisms, refer to my book [5]. 

 

[See Date's Rebuttal III]

 

Date's argument that true and false are the only truth values, and that, therefore, unknown cannot be treated as a logical value makes no sense to me.  After all, it is very common in mathematics to label unknown values by letters such as m, n, x, y, z.  The fact that the letters m, n do not "look like" any of the integers does not prevent them from actually having integer values in an expression such as m + n, m - n, or an assertion that m * m = m.  In any event, when dealing with missing values, an RDBMS must be able to determine whether NOT A, A OR B, and A AND B is true, false, or unknown when A, or B, or both are unknown

 

Date's argument that the number of distinct functions from truth values to truth values is very large, and that fact makes 3VL and four-valued logic (4VL) unusable is ridiculous.  After all, the number of distinct functions from integers to integers is infinite, because the number of distinct integers is infinite.  However, no one in his right mind would use that as an argument that integers are unusable. 

 

Taking the whole of Date's article into consideration, I completely reject Date's claims: 

 

·  To have inserted "more nails into the 3VL coffin"

·  That it is time to drop the pretense that 3VL is a good thing

 

[See Date's Rebuttal IV]

 

 

DATE'S REBUTTAL I

 

Before I begin, let me make one thing crystal clear:  My quarrel is not with the relational model.  On the contrary, I felt at the time when it was first introduced (and I still feel) that the original model was a work of genius.  All of us owe Dr. Codd a huge debt of gratitude for his major contribution.  And, as the originator and "elder statesman" of relational theory, Codd always deserves the courtesy of very close attention to his remarks on relational matters. 

 

So my quarrel is not with the relational model, but rather with nulls and three-valued logic (3VL), which—in a database context—were first discussed by Codd in 1979.  It is true that Codd now regards 3VL as an integral part of the relational model, but I do not (and I am not alone in taking this position).  Indeed, the whole question of how to handle missing information is largely independent of whether the underlying model is relational or something else.  Thus, I would like to distinguish very carefully between what we might call "RM" (the original model, with two-valued logic) and "RM+3VL" (Codd's version, with three-valued logic).  My quarrel, to repeat, is with the "3VL" portion of "RM+3VL." 

 

Now, regarding Codd's first point (that default values misrepresent the fact that information is missing):  I do not dispute this!  However, I would make two points: 

 

1. It is default values, not nulls, that we use in the real world, as I pointed out in my December 1992 column in Database Programming & Design [6]. 

 

2. Nulls misrepresent the semantics too (see below).  In other words, I don't think we yet know how not to misrepresent the semantics; and given that this is so, I take the position that we should not undermine the solid foundation of the relational model with something as suspect as 3VL, when it demonstrably doesn't solve the problem anyway.* 

 

----------

* In other words, I invoke the Principle of Cautious Design—see the article of that name in RELATIONAL DATABASE WRITINGS 1989-1991, by C. J. Date and Hugh Darwen. 

----------

 

Note:  When I talk about "undermining the foundations of the relational model," what I mean is that a "relation" that includes nulls, whatever else it might be, is not a relation!—at least, not in the formal sense of that term.  As a consequence, the entire foundation crumbles; we can no longer be sure of any aspect of the underlying theory, and all bets are off.  I cannot believe that Codd really wants to destroy the entire edifice that he has so painstakingly constructed over the years. 

 

As for nulls also misrepresenting the semantics, consider the following two points: 

 

Ø       A (Codd-style) 3VL system supports just one type of null, "value unknown."  There is thus a strong likelihood that users will use that null for purposes for which it is not appropriate.  For example, suppose employee Joe is not a salesperson and so does not qualify for a commission.  Then Joe's commission is quite likely to be misrepresented as "value unknown" (it should of course be "value does not apply").  One simple consequence of this misrepresentation error is that Joe's total compensation (salary plus commission) will incorrectly evaluate to "unknown" instead of to just the salary value. 

 

What's more, an analogous argument will continue to apply so long as the system supports fewer types of null than are logically necessary.  In other words, simply adding support for a "value does not apply" null might solve the specific problem mentioned in the previous paragraph, but it will not solve the general problem.  Thus, a system that supports fewer types of null than are logically necessary is just as open to misuse--perhaps even more so--than a system that does not support nulls at all. 

 

Ø       Now suppose the system supports two kinds of null, "value unknown" and "value does not apply" and four-valued logic (4VL), and suppose employee Joe's job is unknown.  What do we do about Joe's commission?  It surely must be null--the information is surely missing--but we don't know whether that null should be "value unknown" or "value does not apply."  Perhaps we need another kind of null, and five-valued logic ... This argument clearly goes on for ever, leading to an apparent requirement for an infinite number of kinds of null.  What do we conclude from this state of affairs? 

 

Next, regarding Codd's allegation that the default-value approach lacks a "clear description of how [default] values are to be treated":  I have published several such descriptions over the past few years, the first in 1982, the most recent in 1992 [7]. 

 

Of course, Codd is quite right to warn of the dangers of undisciplined use of default values.  That's why I have consistently advocated a disciplined approach.  By the way, a system that supports nulls can still be used in an undisciplined way, as I have already shown.  In fact, an argument can be made that such a system is more susceptible to lack of discipline, partly (a) because of the false sense of security provided by the fact that nulls are supported ("Missing information?  Don't worry about it, the system can handle it"), and partly (b) because

 

 1. The system designers assume that users are going to use nulls, and therefore

 2. They typically don't provide explicit system support for defaults, and therefore

 3. Users who have made the (in my opinion, very sensible) decision to avoid nulls are on their own--the system doesn't help (in fact, it positively hinders). 

 

Finally, I completely reject Codd's suggestion that the default values idea comes from prerelational systems—on the contrary, it comes from the real world, as I have already said.  I also reject the suggestion that it has anything to do with "record-at-a-time" thinking—how to deal with missing information has nothing to do with whether the operators are record-at-a-time or set-at-a-time. 

 

On behalf of the vendors, I also reject the suggestion that default values appealed to them "because it placed all of the responsibility on the users."  Might it not have been that the vendors had their own misgivings concerning 3VL?  In any case, I know of no vendor that actually supported a proper default values scheme before supporting 3VL.  Moreover, a proper default values scheme does not "place all of the responsibility on the users."  To contend otherwise is to misrepresent the semantics of the default values scheme.* 

 

----------

*  I don't claim that default values are a good solution to the problem, however.  I'd still like to find something better. 

----------

 

 

DATE'S REBUTTAL II

 

First, a small point regarding Codd's claim that the mild error he notes "is just as likely to occur (and other kinds much more likely)" in a default values scheme.  It seems to me that there is all the difference in the world between: 

 

·   Building a system--i.e., one based on 3VL--in which we know errors will occur, because the system has logical flaws in it, and

 

·   Building a system that is at least logically correct but is open to misuse.  Any system is open to misuse.  That's why we have to have discipline. 

 

Next, and more important:  Contrary to Codd's claim that "[I] have failed to provide a single example of a severely wrong answer," I gave the following example in 1989 [8], and repeated it in my December 1992 column in Database Programming & Design.  The database (DB1) is shown in Fig. 1 (the "--" represents "value unknown").  The query is: 

 

SELECT E#

FROM   DEPT, EMP

WHERE  NOT ( DEPT.D# = EMP.D# AND EMP.D# = 'D1' )

 

 

+----------------------------------------------------------------+

¦                   +----+          +---------+                  ¦

¦              DEPT ¦ D# ¦      EMP ¦ E# ¦ D# ¦                  ¦

¦                   +----¦          +----+----¦                  ¦

¦                   ¦ D2 ¦          ¦ E1 ¦ -- ¦                  ¦

¦                   +----+          +---------+                  ¦

+----------------------------------------------------------------+

Fig. 1: Example database DB1

 

It's not worth going through the example again in detail here.  The basic point is that the expression in the WHERE clause is "actually unknown" (Codd's phraseology) but is treated as false, with the result that employee number E1 is not retrieved but in fact should be (the real-world answer to the query is E1).  This is a severe error by Codd's definition. 

 

Please note too that the foregoing is a 3VL error, not just an SQL error.  Paraphrasing slightly from Codd's book [9], he states:  "Executing a query delivers only those cases in which the condition part evaluates to true."  This is tantamount to treating unknown as false

 

In case the reader is not convinced, let me give another example.  The database (DB2) is given in Fig. 2.  The query is "Does anyone in department D1 earn a salary of 100K?"  This query will involve a test to see whether the literal row "<D1,100K>" appears in the projection of EMP over D# and SAL.  In forming that projection, however, the row "<D1,--->" will be eliminated [10].  Result:  The 3VL answer to the query is false; the real-world answer, by contrast, is unknown.  This is surely also a severe error by Codd's definition. 

 

+----------------------------------------------------------------+

¦                        +---------------+                       ¦

¦                    EMP ¦ E# ¦ D# ¦ SAL ¦                       ¦

¦                        +----+----+-----¦                       ¦

¦                        ¦ E1 ¦ D1 ¦ 50K ¦                       ¦

¦                        ¦ E2 ¦ D1 ¦ --- ¦                       ¦

¦                        +---------------+                       ¦

+----------------------------------------------------------------+

Fig. 2: Example database DB2

 

Finally, and much more important still:  The whole business of "severe" vs. "mild" errors is in any case surely nonsense.  It seems to me that it is nothing more than a rearguard attempt to shore up an already suspect position.  After all, if we were talking about integers instead of truth values, what would we think of a system that occasionally produced the answer 2 when the correct answer was 1 or 3?  And in what sense could this be any more acceptable than one that occasionally produced 1 or 3 when the correct answer was 2? 

 

Suppose the DBMS says it doesn't know whether Country X is developing a nuclear weapon, whereas in fact Country X is not doing so; and suppose Country Y therefore decides to bomb Country X "back to the Stone Age," just in case.  The error here can hardly be said to be mild.  (This example is not to my taste; I choose it deliberately for its shock value.) 

 

 

DATE'S REBUTTAL III

 

Here I would just like to raise a few questions. 

 

1. What evidence is there that "a reasonably good algorithm can be developed"? 

2. Is there a precise characterization of the "simple cases" that such an algorithm will handle? 

3. Is that characterization intuitively understandable?  In other words, will the user be able to predict with any confidence whether or not the DBMS is going to give the right answer to a given query? 

4. If the answer to the previous question is no, then why would any user ever use the system for any purpose at all? 

5. In fact, Codd is requiring the DBMS itself to "admit its inability to deduce a sound response to a user request" whenever applicable.  In other words, he is asking for a decision procedure regarding the decidability of formulas in three-valued logic.  What evidence is there that such a procedure exists? 

6. What percentage of real-world queries that are "encountered in commercial activities" are "simple" in the foregoing sense? 

7. What evidence exists to support the answer to the previous question? 

8. If we are limited to using "simple" queries only, exactly what incremental value is the "RM+3VL" system providing over a prerelational, record-at-a-time system? 

9. What does "commercial activities" include?  Does it include financial applications?  Manufacturing applications?  Hardware and software design applications?  Geographic and mapping applications?  Document-handling applications?  Medical applications?  Chemical applications? 

 

I think it is time to quote Wittgenstein again:  All logical differences are big differences

 

Now, turning to Codd's discussion of missing values in his book, and refutation of my views therein:  Codd claims that his book "refutes my criticisms in detail."  I don't think it does.  The two major criticisms dealt with in his book are (1) "the alleged counterintuitive nature [of nulls and 3VL]" and (2) "the alleged breakdown of normalization." 

 

Ø       Regarding (1), Codd does not address the counterintuitive nature of 3VL per se, but instead claims that default values are counterintuitive too.  In doing so, incidentally, he confuses the semantics of the two very different expressions "not known" and "known not"--a trap that is all too easy to fall into, of course (indeed, this confusion is precisely one of the reasons why I claim that 3VL is counterintuitive).  In an earlier paper [11], I gave an example of a (very simple!) query involving 3VL that Codd and I both seriously misinterpreted when we first considered it (after a somewhat lengthy discussion too, I might add).  The misinterpretation rested once again on the distinction between "not known" and "known not."  I stand by my contention that 3VL is difficult to deal with on an intuitive level. 

 

Ø       Regarding (2), I originally claimed that "the fundamental theorem of normalization" breaks down in the presence of nulls [12], and so it does.  Codd's counterargument is unconvincing. 

 

In any case, I have several other serious criticisms of 3VL that Codd's chapter does not address at all.  They include (and this is not an exhaustive list): 

 

Ø       The fact that we apparently need an infinite number of types of null

 

Ø       The semantic overloading or "misrepresentation" that will occur if not all types of null are supported (bound to be the case, given the previous point)

 

Ø       The lack of a convincing justification for the different treatment of equality of nulls in comparisons vs. equality of nulls in duplicate elimination

 

Ø       The fact that the (admittedly informal) argument in support of the entity integrity rule ("primary keys in base relations do not permit nulls") quite obviously extends to every column in the database--implying that nulls should be inadmissible everywhere

 

Ø       If TABLE_DUM corresponds to false and TABLE_DEE corresponds to true, what corresponds to unknown? (see my March 1993 column in Database Programming & Design [13])

 

 

DATE'S REBUTTAL IV

 

Regarding my argument that there are only two truth values:  Codd's counterargument here makes no sense to me.  Is he suggesting that unknown is not a truth value after all, but just a variable whose actual value at any given time is either true or false?  So we aren't really dealing with 3VL after all? 

 

The only way I might make sense of Codd's position here is to interpret his remarks as actually agreeing with what I said in my January 1993 column (which I'm sure was not what he intended).  Here's what I said in that column [14]: 

 

How many truth values are there?  The answer, of course, is two, namely true and false.  We might SAY that unknown is a third truth value, but that doesn't make it one.  After all, I might say that oggle-poggle is another integer, but that doesn't make it one; it has absolutely no effect on the set of all integers.  Likewise, the set of all truth values just IS the set {true, false}, and there is nothing more to be said. 

 

If we are given some proposition, say the proposition "Employee E1 works in department D1," then that proposition is either true or false.  I might not know which it is, but it is one of the two (if it isn't, it isn't a proposition).  Let's assume, in fact, that I don't know which it is.  Then I certainly might say, informally, that the truth value of the proposition is unknown to me; but that "unknown" is a very different kind of object from the truth values true and false themselves.  And pretending that it is the same kind of object--in other words, pretending that we have three truth values--is bound to lead to problems of interpretation (as of course it does). 

 

Note:  Of course, I understand the point that we are free to define a purely formal system in which there are as many "truth values" as we like.  This does not alter the fact that, in the real world, the values true and false (on the one hand) and the value unknown (on the other) are totally different kinds of things. 

 

Finally, regarding my argument concerning truth-valued functions:  Here Codd both misrepresents my position and misses the point.  My argument was not that because there were so many functions, we should not support 3VL; rather, it was that if we want to support 3VL, we should be sure that we support all possible 3VL functions.  Now, in the case of integers, it is true that the total number of functions is infinite--BUT we know that any computable function is expressible in terms of a small number of primitive operators, so all we have to do is support those primitives properly.  Likewise, in 2VL we know that all 2VL functions can be expressed in terms of a small (very small!) number of primitive operators, and so again all we have to do is support those primitives properly. 

 

For 3VL, therefore, I was asking, first, for a suitable set of primitive operators that would guarantee that all 19,710 logical functions were supported (indeed, if any of those functions are not supported, then it cannot be claimed that the system we are dealing with is 3VL).  I was also asking for a suitable set of useful operators (not necessarily the same thing as primitive operators).  I was also asking for a proof of completeness.  I was also raising questions of testing, debugging, and usability.  And then I was asking the analogous questions all over again for 4VL, where there are over four billion possible functions.  I believe these are serious questions that advocates of 3VL and 4VL are morally obliged to address. 

 

Taking the whole of Codd's comments into consideration, I stand more firmly than ever by my original position. 

 

 

REBUTTING THE REBUTTALS

 

(To close out this discussion, Dr. Codd offers some comments on Date's rebuttals.) 

 

Just about every database contains missing values scattered over numerous parts of the database.  For example, an employee's birthdate might have to be marked "missing but applicable," because it is at present unknown.  Or, the employee's year-to-date commission may have to be marked as missing and inapplicable because he or she is not a salesperson. 

 

Database management would be simpler if missing values didn't exist.*  Unfortunately, for a variety of reasons, they do occur and need to be managed.  Date's assertion that a relation containing missing values is not a relation is unacceptable.  While relations that contain missing values are not normally encountered in mathematics, the same operators in the relational model continue to be applicable.  Requests expressed in a relational language must be able to cope with missing values, without resorting to guessing. 

 

----------

*  I like this sentence! 

----------

 

With Date's default value approach, both the representation and treatment of missing values can be peculiar to the columns in which missing values are permitted.  This might be acceptable in a single-record-at-a-time DBMS; it's clearly not in a multiple-record-at-a-time DBMS.  The principal reason for adopting an approach that is uniform across the entire database is that a single relational request may involve data from numerous distinct columns of the database, and many of these columns are likely to be permitted to have missing values.  Imagine a request that deals with 12 or more such columns:  A significant part of formulating this request would involve detailed knowledge of the 12 or more representations and treatments of missing values, if the default value scheme were adopted. 

 

For uniformity, one might look to support in hardware.  Today's memory technology, however, can't distinguish between values to be taken seriously and those that are not--such as those left in some condition by some previous activity.  Once a disk is formatted, the computer regards every bit as part of the value of something.  In my relational model version 2 (RM/V2), any column in which missing values are permitted is assigned one extra byte to indicate, for each row, whether the associated value is: 

 

1. To be treated seriously, or

2. Missing and applicable, or

3. Missing and inapplicable. 

 

IBM's DB2 partially supports this representation. 

 

A basic ground rule in the relational model is that the DBMS must NOT provide a definite response to a query when it is not certain about the response because values are missing.  I remember well when I first arrived in New York City to reside in the U.S.  It was the fall of 1948 and I was looking vigorously for a job.  Often I would have to ask people on the street how to get to specific parts of the city.  I received detailed directions, but almost invariably these directions were wrong.  I stopped asking and used street maps instead.  Similarly, people who use a DBMS that guesses the answer to a query but delivers it as if it were not guessing are likely to abandon its use.* 

 

----------

*  I (Date) remark in passing that "guessing the answer and delivering it as if it were not guessing" is exactly what SQL does when it applies an aggregate function such as AVG to a collection of values that happens to include some nulls.  Suppose the user asks for the average salary of employees, and at least one employee is shown in the database as having a null salary.  Then SQL simply ignores such employees and returns the average salary of the rest!--instead of responding "I don't know," which would be more honest.  (It does admittedly return a warning code also, but there is no way to force the user to check for that code.) 

----------

 

 

Re: Date's Rebuttal I

 

In Date's Rebuttal I, he decries the inadequacies of three-valued logic (3VL), which I first discussed in 1979.  In 1986, I proposed a significant improvement [3], which distinguished between two categories of missing database values: 

 

1. Temporarily unknown

2. Inapplicable, and hence unknowable

 

Semantics make this distinction necessary.  I also proposed four-valued logic (4VL) and additional general-purpose functions to permit adequate handling of missing information.  Date, in many of his examples, ignores both.  This double oversight makes his examples merely cases of the incorrect use of the missing value machinery in RM/V2 [1]. 

 

 In his Rebuttal I, Date asserts:  "It is default values, not nulls, that we use in the real world."  Arguments of this type can, and have, been used [sic] to delay every technical or scientific step forward.  It could have been used to argue against the introduction of computers:  "It is mental arithmetic that we use in the real world, not machines."  The phrase "real world" is a serious trap, because what is real is continually changing. 

 

Let's take one of Date's examples:  A user enters data with an element missing, and he or she does not know whether the element is applicable or inapplicable.  Date would claim that, as a consequence, we need a third kind of missing value.  In his rebuttal, he goes on to say that we need more and more distinct types of missing values. 

 

I reject this sequence of arguments.  In RM/V2, I discontinued using "null" because the term has been so often misinterpreted.  As I pointed out earlier, missing values are either A-marked (applicable, presently unknown) or I-marked (inapplicable, hence unknowable).  Let's assume we have an RDBMS that is faithful to RM/V2.  As background, remember that: 

 

·   For each column other than a primary key column, the DBA may declare that A-marks be permitted or prohibited. 

 

·   For each column other than a primary or foreign key column, the DBA may declare that I-marks be permitted or prohibited. 

 

·  A-marks are weaker and more flexible than I-marks [1]. 

 

·   A-marks likely occur more often than I-marks in a relational database that is in conceptual normal form (p) because p is the maximum percentage of I-marked values in any column, and p is normally set by the DBA to be considerably less than 1. 

 

·   Whenever a tuple is entered with a missing value, this value is A-marked in the database, unless an integrity constraint exists that clearly indicates an I-mark must be recorded. 

 

In this example, we must assume that both types of marking are permitted in the pertinent column, and that no declared integrity constraint resolves the issue of whether an A-mark or I-mark should be used.  Then RM/V2 marks the missing value as applicable.  Later, if it is discovered that the value should have been I-marked, not A-marked, then the DBA or someone with suitable authorization changes the marking on this missing value.  Thus, I fail to see the need for more than two kinds of markings. 

 

Date also asserts that multi-valued logic destroys the foundation upon which the Relational Model is built.  I do not agree.  There is no theoretical impairment and no loss of usability, whereas both scope and usability are lost if the default value scheme advocated by Date is adopted. 

 

Date also asserts that normalization becomes invalid when multi-valued logic is introduced.  This is false, providing that this logic is introduced correctly (few RDBMSs do this today) and care is taken with its use.  For example, if the RDBMS supported DBA-defined requests (few do), the DBA could define integrity constraints that will be stored in the catalog to enable the RDBMS to enforce the functional, multi-valued, and inclusion dependencies discovered at database design time.  However, the RDBMS must withhold the enforcement of these constraints from the missing tuple components until they are replaced by actual values.  This enforcement should occur at the time of attempted replacement. 

 

Finally, I oppose the use of default values only if it's done to represent that a value is missing.  Default values may be useful in other contexts.  For example, a bank teller shouldn't be required to re-enter his or her terminal identifier every time he or she enters a customer transaction.  The terminal should handle this itself. 

 

 

TECHNICAL CORRESPONDENCE

 

The foregoing debate, perhaps not surprisingly, led to a number of letters from readers.  In fact, more than one reader wrote in with attempts to salvage the idea that "automatic" null support be provided without having to stray beyond the boundaries of conventional two-valued logic (2VL).  I regret that I did not find these attempts very convincing, and choose not to discuss them here. 

 

    Ceuan Clement-Davies from Frankfurt, West Germany, offered the following comment, which I cannot do better than quote verbatim: 

 

One thing struck me forcibly ... [Codd's] remarks on tautologies seemed a significant admission.  Since the example he gives [regarding birth years] isn't a tautology in Lukasiewicz's system (and any system of three-valued logic in which this was made to be a tautology would show unfortunate effects elsewhere), it isn't at all clear to me whether Codd is suggesting that a RDBMS should use two-valued logic to detect tautologies, and three-valued logic for everything else.  This would be a curious mixture. 

 

    Stephen Ferg of the U.S. Department of Labor (Bureau of Labor Statistics) wrote claiming that: 

 

 1. The concept of null is deeply embedded in the relational model, and probably cannot be removed from it. 

 

 2. Nulls often, and in some cases must, have no semantic content whatever:  They simply mean that there is no value in a given column of a given row. 

 

    He then went on to say (this is an edited extract from his letter): 

 

Consider an EMP table with columns E#, ETYPE, and TOTSALES.  ETYPE indicates whether the employee in question is a member of the sales staff ... TOTSALES indicates the total number of sales the employee has made since the beginning of the year ... For an employee not on the sales staff, TOTSALES will be null.  For a sales employee, TOTSALES may be null until the employee files his or her first sales report (until then, the number is applicable, but unknown). 

 

It is this kind of example that both Codd and Date seemed to have in mind during their debate.  But there is another use for null that is far more important.  Suppose we have two entities, EMP and DEPT, and a relationship, ASSIGNED, with the constraint that (at any given time) each employee is assigned to at most one department.  There are two employees (E1, E2) and two departments (D1, D2).  E1 is assigned to D1, but E2 is not currently assigned to any department (E2 is on leave of absence and will be assigned to a department when [he or she] returns from that leave).  A logically ideal implementation of this situation is shown in Fig. 3. 

 

 

+----------------------------------------------------------------+

¦           EMP             ASSIGNED            DEPT             ¦

¦          +----+          +---------+         +----+            ¦

¦          ¦ E# ¦          ¦ E# ¦ D# ¦         ¦ D# ¦            ¦

¦          +----¦          +----+----¦         +----¦            ¦

¦          ¦ E1 ¦          ¦ E1 ¦ D1 ¦         ¦ D1 ¦            ¦

¦          ¦ E2 ¦          +---------+         ¦ D2 ¦            ¦

¦          +----+                              +----+            ¦

+----------------------------------------------------------------+

Fig. 3: The departments and employees database (first version)

 

Because ASSIGNED is a many-to-one relationship, however, the ASSIGNED table and the EMP table have the same primary key.  This permits the schema to be "optimized" by merging the two tables, so that ASSIGNED.D# becomes a foreign key in the EMP table, producing the structure shown in Fig. 4. 

 

+----------------------------------------------------------------+

¦                   EMP                 DEPT                     ¦

¦                  +---------+         +----+                    ¦

¦                  ¦ E# ¦ D# ¦         ¦ D# ¦                    ¦

¦                  +----+----¦         +----¦                    ¦

¦                  ¦ E1 ¦ D1 ¦         ¦ D1 ¦                    ¦

¦                  ¦ E2 ¦ -- ¦         ¦ D2 ¦                    ¦

¦                  +---------+         +----+                    ¦

+----------------------------------------------------------------+

Fig. 4: The departments and employees database (second version)

 

There are three things worth noting about Fig. 4. 

 

1. First, such an optimized design is extremely common--so common that many database designers think of it as the ideal relational implementation of such a situation, rather than as an optimized implementation one step removed from the ideal implementation. 

 

2. Second, such a design requires the use of nulls in foreign keys ...

 

3. Third, in such a design a null has no semantic content whatever:  It exists only because of optimization and implementation considerations ... It is not the case that E2 cannot be assigned to a department, and it is not true that E2's department number is "missing" or "unknown" (as if such a number really existed but we just don't know what it is).  So ... we have a third kind of null:  It doesn't mean "inapplicable" and it doesn't mean "unknown"--it simply means that the column has no value. 

 

Note that the design of Fig. 3 also does not avoid the need for a "simply no value" null:  If we do a left outer join of the EMP and ASSIGNED tables in that figure, we will produce a result table that is exactly the same as the EMP table in Fig. 4.  An outer join produces nulls because we ask the DBMS to show us data from one table even when no matching data can be found in another table.  Such nulls have no meaning, no semantic content ...

 

My conclusions: 

 

1. Despite the difficulties with null that Date points out, we cannot have a relational model that is altogether free of null, so we had better learn to live with it. 

 

2. Despite Codd's impulse to distinguish different kinds of null on semantic grounds, it is probably better just to let null mean "no value here" ... [in order to avoid] the unmanageable complexities of 3-valued, 4-valued, ..., n-valued logic. 

 

My reactions to Ferg's letter are as follows. 

 

First of all, I agree with Ferg that the design of Fig. 3 is preferable to that of Fig. 4.  However, Ferg claims that nulls are "deeply embedded in the relational model and probably cannot be removed from it."  But he provides no proof of this claim, and I could not possibly disagree with it more.  (After all, the model survived very well without nulls for about ten years!)  In fact, I challenge him to produce an example of a problem that appears to need nulls for its solution that cannot also be solved without them. 

 

Ferg goes on to claim that nulls "often ... have no semantic content whatever," but subsequently contradicts himself on this very point.  In the example he uses to support his claim, the null department number in the EMP row <E2,--> certainly does have "semantic content"--it means, loosely, that the employee will be assigned to a department when [he or she] returns from leave of absence (Ferg's own words). 

 

Later Ferg says that the nulls that appear in the result of an outer join also "have no meaning, no semantic content."  Again I disagree.  The point is not that the nulls have no meaning, but rather that (in such a result) different nulls have different meanings, as I have illustrated elsewhere (see "Watch Out for Outer Join," in C. J. Date and Hugh Darwen, Relational Database Writings 1989-1991, Addison-Wesley, 1992, pp. 328-330). 

 

In any case, the question "Do nulls have semantic content?" is the wrong question to ask.  The point is, operators (logical, computational, relational, or whatever) must be defined to deal with nulls somehow.  And the behavior of those operators effectively defines the semantics of nulls.  In other words, nulls must always--necessarily--have some "semantic content."  Ferg's suggestion that we should just let nulls mean "no value here" thus really makes no sense. 

 

To pursue the previous point a moment longer:  A large part of my objection to nulls is based on the fact that the particular "semantic content" defined by the operators of three-valued logic is of no practical value for the purpose at hand, because that "semantic content" does not mimic the way the real world works.  As I put it in my original December 1992 column [6], answers that 3VL says are correct are often not the answers that are correct in the real world. 

 

One further point:  Ferg also gives an example in which the TOTSALES figure for a sales employee who has not yet filed a sales report is shown as null.  It should of course be zero!  This is another illustration of the kinds of mistakes that people are likely to make in a system that supports nulls.  Indeed, SQL itself makes the same mistake when it defines the sum of an empty set to be null.  See my April 1993 column "Empty Bags and Identity Crises" [15]. 

 

Another correspondent, Martin H. Rusoff of Banc One Financial Card Services in Columbus, Ohio, wrote as follows (again this is an edited extract): 

 

It seems to me that the entire discussion is slightly off target ... There are usually several alternatives to handling [missing information].  These can range from ignoring it up to taking exceptional measures to discover the missing data.  Any of these could be automated, but it requires knowledge of what the data means and possibly why it is missing to decide the correct response.  While this might be decided in the data model, more often it depends on the use to which the data will be put--i.e., on the application

 

    Example:  Suppose we are given table EMP with a row as shown in Fig. 5, and we need to calculate the total amount needed to meet payroll.  Then there are a number of possible responses: 

 

+----------------------------------------------------------------+

¦          +------------------------------------------+          ¦

¦     EMP  ¦ E# ¦ ENAME ¦ GRADE ¦ SALARY ¦ COMMISSION ¦          ¦

¦          +----+-------+-------+--------+------------¦          ¦

¦          ¦ E1 ¦ Smith ¦    T4 ¦   --   ¦     --     ¦          ¦

¦          +------------------------------------------+          ¦

+----------------------------------------------------------------+

Fig. 5: Example EMP row

 

·    I don't know.

·   Use the top and bottom figures for the salary for grade T4 to compute maximum and minimum amounts (a similar technique could be used to determine the commission, based on whether grade T4 is eligible for commission or not, and then looking at the maximum and minimum values in the database). 

·   Use a default of some kind, possibly calculated based on the maximums and minimums computed above. 

·   Ignore this employee. 

·   Use statistical data stored elsewhere to come up with probable amounts and then calculate a confidence [rating] for the entire result. 

·   Initiate exception processing to determine the answer, possibly then continuing to calculate the result ignoring the missing values. 

 

   Depending on the situation, any of these might be acceptable.  However, an accountant would probably calculate the maximums after determining if a commission might apply and attach a note saying that the data was missing for certain listed individuals.  I do not see how a database engine could decide this.  End of example

   I fully agree that the current treatment is error prone ... [However,] I am not sure that legislating 3VL, fuzzy logic (with ranges or probabilities), or using defaults is the right answer ... In the end, it might be appropriate to always generate an error whenever a null is located and permit an application to rerun the query using additional qualifiers telling the database engine how to process the nulls. 

   Instead of debates of this kind, I would like to see an explanation of how the existing facilities can be used correctly to handle missing data. 

 

I responded to these comments as follows: 

 

I completely agree with Martin Rusoff that it will usually be the case that only the application can decide what to do when nulls are encountered.  His example makes the point admirably.  But I do not agree with him that "the entire discussion is slightly off target."  To be candid, I will admit to a sneaking sympathy with this position--I know the debate must have seemed somewhat academic and not too relevant to the rough-and-tumble of day-to-day operations--but the fact is that such a perception is sadly mistaken. 

 

First of all (as I have observed many times, in the pages of Database Programming & Design and elsewhere), theory is practical!  That is, the theory on which a given DBMS is based necessarily has very practical consequences for the user of that DBMS.  And if that theory is bad, the consequences will be bad too. 

 

To see that this is so, it is sufficient to realize that it will often be the DBMS, not the application, that has to "decide what to do when nulls are encountered."  I have argued this point before for the particular case of the optimizer component of the DBMS (see reference [6]).  Note, moreover, that not all nulls "preexist" in base tables--some are generated dynamically (i.e., in the middle of executing a query).  As a result, it is a virtual certainty that it will be the DBMS, not the application, that will have to decide how to deal with them.  And the DBMS will typically not have the application-specific knowledge to enable it to make the correct application-specific decisions.  Consequence:  Wrong answers! 

 

Also, of course, there is the point--admittedly only a psychological point, not a logical one, but a point that is very much a practical consideration--that users will be lulled into a false sense of security by the fact that the system can "handle" missing information.  That is, applications often do not include the necessary logic to deal with nulls, even when they should, because the user is under the misapprehension that the system has taken that burden off the user's shoulders. 

 

Finally, James R. Alexander of the Goochland-Powhatan Community Services Board, Goochland, Virginia, wrote as follows: 

 

If I do not know the gender of Person A and I am asked "Is Person A female?", I respond "I don't know."  However, if I am asked "Do you know if Person A is female?", I respond "No."  The first question concerns Person A's gender, the second question concerns my knowledge of Person A's gender.  A query of a database is certainly a question of the second type ...

 

    I have developed and been using for over three years a data collection engine which is the front-end for a database ... This engine understands that every attribute has, in addition to its explicit domain, an implicit domain ... composed of unknown, not applicable, and not represented (i.e., other), which are consistently represented by ?, !, and #, respectively.  The human services organization I work for uses this data collection engine to maintain data for many typical business applications ... We use various off-the-shelf reporting [programs], which employ two-valued logic, and we get correct results. 

 

I could not agree more with Mr. Alexander.  I said much the same thing in reference [11] (page 234):  "[We] obviously cannot ask the system questions about the real world per se, only about its knowledge of the real world as represented by values in the database."  The critical point, as Mr. Alexander observes, is to stay within two-valued logic.  I am glad to hear he has been using his technique successfully for several years. 

 

 

REFERENCES AND BIBLIOGRAPHY

 

 1. E. F. Codd: The Relational Model for Database Management Version 2.  Reading, Mass.: Addison-Wesley (1990).  RM/V2 treatment of missing values may be found in Chapters 8 and 9.  For a discussion of the two types of missing values, see p. 191. 

 

 2. Ibid., Chapter 23.  Author's note:  "This chapter describes adverse consequences of these errors, suggests corrective steps that DBMS vendors should apply, and precautionary steps users should take in the meantime." 

 

 3. E. F. Codd: "Missing Information (Applicable and Inapplicable) in Relational Databases," ACM SIGMOD Record 15, No. 4 (1986).  See also E. F. Codd, "More Commentary on Missing Information," ACM SIGMOD Record 15, No. 5 (1986). 

 

 4. E. F. Codd: "Extending the Database Relational Model to Capture More Meaning," ACM TODS 4, No. 4 (September 1979). 

 

 5. E. F. Codd: Chapter 23 of The Relational Model for Database Management Version 2.  Reading, Mass.: Addison-Wesley (1990).  Author's note:  "Date's criticisms are refuted in detail in Chapter 9.  (Unfortunately, in the first printing, two errors appeared in the truth table for logical OR on pages 182 and 236.  These have been corrected.)" 

 

 6. C. J. Date: "Why Three-Valued Logic Is a Mistake," in Relational Database Writings 1991-1994.  Reading, Mass.: Addison-Wesley, 1995. 

 

 7. C. J. Date: Section 5.5 ("Null Values") of An Introduction to Database Management: Volume II.  Reading, Mass.: Addison-Wesley (1982).  A more recent treatment of this topic may be found in C. J. Date, "The Default Values Approach to Missing Information," in C. J. Date and Hugh Darwen, Relational Database Writings 1989-1991.  Reading, Mass.: Addison-Wesley (1992). 

 

 8. C. J. Date: "Three-Valued Logic and the Real World," InfoDB 4, No. 4 (Winter 1989).  This article was republished in C. J. Date and Hugh Darwen, Relational Database Writings 1989-1991.  Reading, Mass.: Addison-Wesley (1990). 

 

 9. E. F. Codd: The Relational Model for Database Management Version 2, page 183.  Reading, Mass.: Addison-Wesley (1990).  (The quotation is a paraphrasing.) 

 

10. Ibid., page 189. 

 

11. C. J. Date: "NOT Is Not "Not"! (Notes on Three-Valued Logic and Related Matters," in Relational Database Writings 1985-1989.  Reading, Mass.: Addison-Wesley (1990). 

 

12. C. J. Date: Section 5.5 ("Null Values") of An Introduction to Database Management: Volume II.  Reading, Mass.: Addison-Wesley (1982). 

 

13. C. J. Date: "Tables with No Columns," in Relational Database Writings 1991-1994.  Reading, Mass.: Addison-Wesley, 1995. 

 

14. C. J. Date: "Nothing in Excess," in Relational Database Writings 1991-1994.  Reading, Mass.: Addison-Wesley, 1995. 

 

15. C. J. Date: "Empty Bags and Identity Crises," in Relational Database Writings 1991-1994.  Reading, Mass.: Addison-Wesley, 1995.