I ended Part 1 with the following paragraph from Stephen Henley's Missing Data and left you to ponder it:
If the SiO2 attribute contains what is reported about the SiO2 content of sample 102 then there is no obvious reason to exclude semi-numeric data (such as "below 0.1% detection limit") or non-numeric data (such as "sample contaminated, submitting for re-analysis") or (Heaven forbid!) "missing" - even the word "null" if this is not a red rag to a bull. Any such data values (or non-values) might perfectly validly be transcribed from a laboratory report, where conventionally a "-" character is used to signify that an analysis value is missing (or alternatively some such code as "n/a" for "not analysed"). If the attribute is interpreted as "reported as ...", there is no a priori reason to discriminate against putting such codes into the database.

"Reported as", or anything of the kind, means nothing to the DBMS: it belongs to the R-table interpretation that only users understand. This is the very advantage of the DBMS being (thankfully) oblivious to meaning--it gives it the versatility to represent any kind of fact. Consequently, however, when the CWA does not hold, the DBMS cannot treat the data correctly in manipulation and integrity enforcement--query results are not guaranteed to be correct with respect to the real world.
What we will have is of course not the SQL 'null', which is deprecated by Date, Darwen, and Pascal. However, that does not remove the basic problem - the reason why they feel that 'null' must be prohibited. This basic problem is that if one allows 'null' or anything like it (such as text descriptors in a 'reported as...' attribute), a corollary is the need for a three-valued logic, which includes 'unknown' as a third truth value. Date's main justification for his insistence on the closed world assumption and a strict true/false two-valued logic is that this represents the 'real world'. This may indeed be the real world of warehouse management and parts ordering, but it is most certainly not the real world of observational science, where data items are quite routinely imprecise, incomplete, or missing. In this real world, the correct result of a query is not in general either 'true' or 'false' but can be 'unknown' or even 'estimated probability 0.317'.
In Date and Darwen (1998), the only solution offered for avoiding nulls is user-defined 'special-value' coding on such 'reported as...' attributes, with types that are not simple character or numeric. However, apart from the practical implementation difficulties and, more seriously, the threats to database integrity from possible non-standard coding by users, this leaves open the very door to 3VL that Date otherwise wishes to keep firmly closed.
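One practical danger of in-band special values can be shown directly. The following is my own minimal sketch, not Date and Darwen's actual proposal: a hypothetical sentinel such as -999.0 standing in for "missing" silently corrupts any query that forgets to exclude it, and the DBMS has no way to enforce that exclusion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE samples (id INTEGER, sio2 REAL NOT NULL)")
# -999.0 is an illustrative sentinel meaning "missing":
cur.executemany("INSERT INTO samples VALUES (?, ?)",
                [(101, 52.3), (102, -999.0), (103, 48.1)])

# Every query must remember to exclude the sentinel, or results are
# silently wrong -- and nothing in the schema forces the user to remember:
naive_avg = cur.execute("SELECT avg(sio2) FROM samples").fetchone()[0]
careful_avg = cur.execute("SELECT avg(sio2) FROM samples "
                          "WHERE sio2 <> -999.0").fetchone()[0]
print(naive_avg)    # about -299.5 -- nonsense
print(careful_avg)  # 50.2
```

The sentinel is indistinguishable from data to the DBMS, which is exactly the kind of problem Codd raised with the default-value approach.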
However, Codd (1990, pp. 203-4) had already pointed out no fewer than six problems with such a special-value approach (he calls it default-value) - and these problems were not addressed by Date's suggestion. The fact remains that, although Date, Darwen, and Pascal have each proposed methods by which the 'null' representation of missing data can be avoided while working within the CWA and 2VL, none has suggested any way in which the 'missingness' of data can properly be manipulated. The basic reason for this is that when the required correct answer is "unknown", it simply cannot be produced by a two-valued logic which knows only "true" or "false".

Actually, there are two basic problems. First, there is no sound--consistent and sufficient--3VL (look for David McGoveran's four-part series on the subject). Worse, NULL lures users--who think in 2VL--into a false sense of security that obscures from them the implications of SQL's (botched) 3VL implementation for integrity enforcement, manipulation and the data language--confusion, unintuitiveness, complexity and wrong results. This is precisely why NULL behavior is ad hoc, arbitrary and very often wrong (particularly since NULL is used to represent 4VL, not 3VL, databases; how?).
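The unintuitive behavior of SQL's 3VL can be demonstrated directly. A minimal sketch using SQLite from Python (the table, column, and values are my illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE samples (id INTEGER, sio2 REAL)")
cur.executemany("INSERT INTO samples VALUES (?, ?)",
                [(101, 52.3), (102, None), (103, 48.1)])

# NULL = NULL evaluates to UNKNOWN, so WHERE drops row 102 even for a
# predicate that looks like a tautology:
self_eq = [r[0] for r in
           cur.execute("SELECT id FROM samples WHERE sio2 = sio2")]

# A predicate OR its negation does not cover the whole table:
covered = cur.execute("SELECT count(*) FROM samples "
                      "WHERE sio2 > 50 OR NOT (sio2 > 50)").fetchone()[0]

# Column aggregates silently skip NULLs, while count(*) does not:
counts = cur.execute("SELECT count(sio2), count(*) FROM samples").fetchone()

print(self_eq)  # [101, 103] -- row 102 vanished
print(covered)  # 2 of 3 rows
print(counts)   # (2, 3)
```

Users reasoning in 2VL expect a tautology to return every row and the two counts to agree; under SQL's 3VL neither holds, which is the "wrong results" problem in miniature.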
But, second, the fundamental problem is what Date calls "confusion over realms"--confusing the real world with our (imperfect) knowledge of it--and it traps Henley too. The evidence is in his statement, quoted again below--can you figure it out?
This may indeed be the real world of warehouse management and parts ordering, but it is most certainly not the real world of observational science, where data items are quite routinely imprecise, incomplete, or missing. In this real world, the correct result of a query is not in general either 'true' or 'false' but can be 'unknown' or even 'estimated probability 0.317'.

Note: Probabilistic databases--in which probabilities would have to be represented explicitly--are a separate issue, beyond the scope of this discussion.
Henley argues that the real world obeys 3VL and that 2VL databases misrepresent it. But this is backwards: to reiterate, the real world obeys 2VL independently of our knowledge of it (we're not in the quantum world here), and it's 3VL databases that are unfaithful representations of it! This confusion of the "2VL-perfect" world with our imperfect knowledge of it is the root problem that inhibits the ability to comprehend why 3VL, NULLs and any other deviations from 2VL/CWA are the problem, not the solution. Overcome this confusion and a relational solution reveals itself.
I urge you to read The Last NULL in the Coffin and judge whether, at least insofar as my proposed method is concerned, Henley is right that it is not possible to represent imperfect knowledge without abandoning CWA/2VL. Hint: Think of Donald Rumsfeld's "known unknowns".
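Without presuming the specifics of the method proposed in The Last NULL in the Coffin, the "known unknowns" idea can be sketched relationally: record what is known, and what is known to be missing, as facts in separate NULL-free relations, so every query stays within 2VL and the CWA. The table names and reason texts below are my illustrative assumptions, not the article's actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE sio2_measured (sample_id INTEGER PRIMARY KEY,
                                sio2 REAL NOT NULL);
    CREATE TABLE sio2_missing  (sample_id INTEGER PRIMARY KEY,
                                reason TEXT NOT NULL);
    INSERT INTO sio2_measured VALUES (101, 52.3), (103, 48.1);
    INSERT INTO sio2_missing  VALUES
        (102, 'below 0.1% detection limit'),
        (104, 'sample contaminated, submitting for re-analysis');
""")

# "Which samples are known to exceed 50% SiO2?" -- plain 2VL, no NULLs:
high = [r[0] for r in cur.execute(
    "SELECT sample_id FROM sio2_measured WHERE sio2 > 50")]

# "For which samples is SiO2 missing, and why?" -- the 'known unknowns'
# are ordinary facts, queryable like any others:
missing = cur.execute("SELECT sample_id, reason FROM sio2_missing "
                      "ORDER BY sample_id").fetchall()
print(high)     # [101]
print(missing)
```

Every relation here admits only 'true'/'false' answers under the CWA, yet nothing about the missingness of the data -- including Henley's "reported as" texts -- has been lost.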