A proposition is a statement that is unequivocally true (a fact), or false. A proposition about the real world is true or false whether we know which it is or not. In other words, the real world obeys a two-valued logic (2VL). If a relational database represents the real world, it also obeys 2VL. This requires that users authorized to populate and update the database have perfect knowledge of the real world, specifically, they always know whether propositions are true or false and represent only the former and never the latter in the database. Consequently:
- The Closed World Assumption (CWA) holds: The rows present in every R-table represent all the propositions that are true in the real world; and all the rows absent from it represent false propositions;
- 2VL inferences from the database--query results--are provably logically correct with respect to the real world.
No DBMS can know whether a row submitted for insertion into the database represents a proposition that is true in the real world or not. A row submitted by an authorized user that does not violate integrity constraints in effect is accepted by the DBMS as an explicit assertion by the user that the represented proposition is true (unsubmitted rows are implicitly taken as false propositions). In other words, the CWA holds only if knowledge is perfect.
In Missing Data Stephen Henley writes (I prefer R-table to relation and row to tuple):
There are two ways of interpreting the following tuple:
sample #102 contains 49.2% SiO2
sample #102 is reported to contain 49.2% SiO2If sample #102 exists and has some % of SiO2, the % either is 49.2, or not, which the DBMS has no way of knowing. But if knowledge is perfect, what is reported is true and interpretation 2 is logically identical to interpretation 1.
In other words, interpretation (1) is a statement about the real world (from his writings it seems that this is how Chris Date intends databases to be understood). Interpretation (2), in contrast, is a statement about our knowledge of the real world.
What if knowledge is not perfect, as is often, if not usually, the case? What if users do not know whether the % is 49.2 or not? Should the above row be submitted to the DBMS, or not? If it is, should the DBMS accept, or reject it? On what grounds?
Under interpretation (1) the omission of this tuple from a relation to which it would otherwise belong, according to the CWA would mean that the SiO2 value in sample 102 is not 49.2%. If there is no tuple at all in this relation for sample 102, it would mean that there is no valid value of SiO2 for sample 102. This is patently untrue since the sample MUST contain some level of SiO2 (even if zero). Hence if this relation and the database that contains it are considered to be models of the real world, the interpetation is inconsistent with the CWA - and therefore the CWA cannot apply.If the row is submitted, the user is asserting the proposition is true when he does not know it to be a fact. If the row is not submitted, the only interpretation can be that there is no sample #102 which, indeed, is incorrect with respect to the real world. In other words, in either case the database will not represent the real world and 2VL queries will not not be guaranteed to be correct with respect to it.
Under interpretation (2) the omission of this tuple from a relation to which it would otherwise belong, according to the CWA would mean that the SiO2 value reported for sample 102 is not 49.2%. If there is no tuple at all in this relation for sample 102, it would mean that there is no reported value for SiO2 for sample 102 - a perfectly valid statement. However, the cost of this statement is that the type of attribute SiO2 is descriptive ("reported as 49.2") and not numeric ("49.2"). Hence numeric operators such as >, <, and perhaps even = cannot be used. The tuple would appear as:As Henley recognizes, this "trick" violates domain types, including operations applicable to it (question: what if the domain is of type string?). But that is not the main problem here.
102 reported as 49.2
It means that this attribute is a step removed from the underlying reality which Date requires to support his case for prohibition of nulls - and we are prevented from carrying out perfectly reasonable numerical queries and operations.
A possible way around the problem would be to allow the attribute to have an unconventional type "numeric OR text". Where there is a numeric value, the attribute will be numeric (just "49.2"), and where there is not, then it will contain text. Such a solution would produce relations that are similar to the 'PERS_INFO' relation of Darwen's "How to Handle Missing Data Without Using Nulls" , in which details of salary are held in a SAL_INFO attribute which can also contain descriptive text such as 'unsalaried' or 'unknown' [FP: I have reservations about Darwen's solution.]When Codd tried to find a domain-independent solution to missing data to replace the special values/exception application code employed at the time, he made his one major mistake: he proposed missing and (worse) inapplicable marks for the absence of values. The bad implementation of this idea is IBM's SQL NULL. Later Date came up with a default value scheme along the line suggested by Henley, that encapsulated the special values into domains, but he subsequently renounced it. Given that neither Codd, nor Date found a theoretically sound solution, permit me to conclude that a solution to missing data is a non-trivial issue.
This will allow numeric operators to work as normal on the numeric values. However, where an operand is non-numeric, the normal numeric operations are invalid and special rules would need to be created to define the results - and the results of queries cannot in general be restricted to 'true' or 'false'.
If the SiO2 attribute contains what is reported about the SiO2 content of sample 102 then there is no obvious reason to exclude semi-numeric data (such as "below 0.1% detection limit") or non-numeric data (such as "sample contaminated, submitting for re-analysis") or (Heaven forbid!) "missing" - even the word "null" if this is not a red rag to a bull. Any such data values (or non-values) might perfectly validly be transcribed from a laboratory report, where conventionally a "-" character is used to signify that an analysis value is missing (or alternatively some such code as "n/a" for "not analysed"). If the attribute is interpreted as "reported as ...", there is no a priori reason to discriminate against putting such codes into the database.Do you agree? Why?
Stay tuned for part 2.
Do you like this post? Please link back to this article by copying one of the codes below.URL: HTML link code: BB (forum) link code: