MORE ON “THE FINAL NULL IN THE COFFIN”
with Fabian Pascal and David McGoveran

 

 

 

From: SH

To: Editor

Date: 6 Feb 2005

 

Have now had a fair amount of time to digest The Final Null in the Coffin, and I have compiled some notes that you might like to respond to. These form one section of a paper that I'm writing on the problems of nulls, incomplete data, and suchlike, in observational science databases. Maybe I've missed the point, and you were just making valid criticisms of SQL, but it seemed to me that you were using these as arguments against the use of NULL. Though I agree they are certainly valid reasons not to use NULL in an SQL database system !

 

Still needs a lot more thinking, but one of the key differences between scientific and business databases is that you can't accept the 'closed world' assumption for scientific data. By its very nature it is 'open-world', i.e. the database will always be incomplete, and absence of information from the database doesn't imply its non-existence or falsehood. Although some people have suggested that open-world requires nVL, I don't think we need to go that far. After all, that would require us to ditch the relational model for scientific data - even though there is no practical alternative.

 

Pascal makes a series of criticisms of the way in which SQL handles NULL, and these criticisms must be considered insofar as they are potentially relevant to the use of 'NULL' itself. It must be noted that SQL in itself is a language developed in a (flawed) attempt to implement relational theory. Although it has been taken up very widely by the developers of database management systems, SQL must never be confused with relational theory - and criticisms of SQL, even if upheld, are not necessarily of general relevance. Pascal's 2004 paper contains (page 9) a set of 10 specific criticisms (italicised in the list below) of SQL and its incomplete and inconsistent 3VL. Each of Pascal's items will be examined below to identify any general points which might arise in relation to the use of NULL :

 

·   Aggregate functions e.g. SUM(), AVG() ignore NULLs, COUNT() does not. However useful they may be, aggregate functions have nothing to do with relational database theory. Strictly speaking they are applications, and it should be possible to define whatever behaviour and rules are required into appropriate aggregate functions.

·   A scalar expression on a table without rows evaluates incorrectly to NULL, instead of 0.This behaviour of SQL may indeed be wrong, but this says nothing about whether NULL should be acceptable or otherwise.

·   The expression NULL=NULL evaluates to NULL, yet ORDER BY treats NULLs as equal.NULL=NULL correctly evaluates as NULL. The behaviour of ORDER BY is immaterial. In a relvar the ordering of tuples is irrelevant - this is merely an export/presentation matter. In any case, the ordering of NULLs is by definition arbitrary and whatever ordering is chosen by SQL cannot be 'wrong'.

·   SQL's NOT is not "not" of natural language. This is a problem with SQL, and not relevant to the NULL question.

·   It's unknown how SQL's EXISTS should behave, because no 3VL is defined for SQL (the definition is ambiguous in 3VL systems). Again this is an SQL problem because it attempts to use a 3VL, but is irrelevant to the question of the acceptability or otherwise of NULL.

·   Expressions evaluating to NULL do not violate integrity constraints (NULL is treated as true). NULL is of course not in itself a truth value, and expressions in a 2VL cannot evaluate to unknown. This is an SQL problem rather than a problem with NULL. The only expressions that can evaluate to NULL are those that are of a defined data type. Given a truth-value data type, it can indeed contain NULLs but these data items do not have any truth value at all - by definition - and they are certainly not of truth value 'unknown' in the SQL sense.

·   The truth-valued data type can take the values true, false, and the non-value NULL, but the literal NULL cannot appear in contexts in which any literals can appear. This is a problem with SQL and SQL-based DBMSs. It violates the principle that NULL must not be capable of confusion with any valid data representation. This principle must be rigidly adhered to in any truly relational DBMS. Codd's marks, kept separately from the data, are one solution to this problem.

·   Since 3VL referencing rules are not defined, it's unknown which primitive operators out of the 27 monadic and 19,683 dyadic would suffice to express others and, therefore, whether all necessary operators are supported or not. This is a possibly valid criticism of SQL's 3VL, but does not bear directly on the NULL question.

·   Highly complex operators. This is a matter concerning SQL as an implementation language but irrelevant to the NULL question.

 

Thus it appears that none of Pascal's criticisms of SQL's NULLs have any general validity as criticism of the use of NULL in relvars - though they do contribute to a very strong case that SQL does not offer an appropriate environment for handling databases whose relvars may contain NULLs. They are criticisms of SQL, not of NULL.

 

They do of course leave open the question of whether or not NULL itself is acceptable (i.e. quite separately from its flawed implementation in SQL). In the observational sciences, such as geology, there are very often gaps in knowledge - missing data. This is the nature of science. Sometimes the data are not altogether missing, or there are qualifiers explaining the reason for its missingness. But this is another story for another day. It is possible that the closed-world assumption built into database management systems for business applications, is itself inappropriate, and that an open-world assumption should be used for observational science data. If that is the case, then some writers have suggested that a 2VL is inadequate and we need to look at nVL solutions. I would not go that far yet - since that implies throwing out the entire relational model for scientific data. This seems a little extreme. There is another option - to use the relational model, and accept the prohibition on NULLs (where they are strictly placeholders for items on which nothing at all is known), but allow for rather richer and more intelligent logic to handle incomplete data.

 

 

Fabian Pascal Responds: I’ll let David handle the deeper issue, and respond to the more mundane one. Many-valued logic—which, as the paper argues, practically reduces to three-valued logic—is outside the scope of the relational model, and that of two-valued logic of the real world, on which the model is based. Therefore, even if a consistent three-valued logic were defined, inferences made from databases supporting it cannot be guaranteed to be correct in the real world. Furthermore, such a logic would be so complex and unintuitive, that confusion and errors would be unavoidable. So while the criticisms SH refer to are specifically of NULL, SQL's version of three-valued logic (which is neither well-defined, nor consistent), the problems it and its commercial implementations exhibit are inherently endemic to many-valued logics, in itself reason enough to eschew them. Indeed, practically no discussion on the subject that we are aware of is devoid of confusion and errors (see, for example, NULLs Nullified).

 

 

David McGoveran Responds: At the ontological level, yes, the scientific method is an "open world". But in theory making it is crucial that the presumed universe of discourse be carefully defined. This step is often overlooked when scientists are working in a particular field, because it is "understood" and within the "culture", but even within the most narrow scientific fields errors and miscommunication can result. The universe may be iteratively redefined, but permitting enlargement (or contraction) without re-evaluation of the impact on other hypotheses (axioms and therefore database design in the database world) is a good way to draw erroneous conclusions. Indeed, a changeable universe of discourse requires second order logic which permits paradoxes (i.e., contradictions not subject to resolution, therefore predictions that can't be disproved by evidence, and so NOT science).

 

While reasoning within any theory, the closed world assumption (CWA) holds. Its universe of discourse is determined by the entities modeled, relationships, constraints, dependencies, and by the domains of permissible values. So, if you build a database for testing a theory (however "imprecise") and find real world data that either doesn't fit or creates a contradiction, you've invalidated the theory. You should then redesign the database to meet the next hypothesis. This may be as simple as dropping a constraint, or require an entirely new design with different entities: It all depends on how broad the initial theory used for the design, and how radical (relatively speaking) the new hypothesis. Welcome to science.

 

Point: In a scientific database such as the type to which you allude, a reasonable interpretation of True and False under CWA is "valid by experiment and consistent with hypotheses" and "not validated by experiment or inconsistent with hypotheses". If you give this differentiation up with CWA and nulls, you've given up scientific reasoning and the scientific method.

 

Usually this doesn't have to be burdensome. Ordinary theorizing and experimentation makes reasonable assumptions about domains. At the theory level, a domain might be "infinite", but at the level of experiments that always terminate in some finite measurement, and at the level of the database which resides on a real computer with real physical limitations used by real people with finite brains and time, the domain can be given a bounded representation that will suffice for many hypothesis within the broader theory.

 

Don't confuse CWA with an assumption that we must know everything and embed that knowledge in a permanently fixed database schema. All we need is understand the current boundaries on our application or investigation. If we don't understand that, we doing philosophy, not science.

 

 

Posted 4/1/05