From: SH
To: Editor
Date: 6 Feb 2005
Have now had a fair amount of time to digest The Final Null in the
Coffin, and I have compiled some notes that you might like to respond
to. These form one section of a paper that I'm writing on the problems of
nulls, incomplete data, and suchlike, in observational science databases. Maybe
I've missed the point, and you were just making valid criticisms of SQL, but it
seemed to me that you were using these as arguments against the use of NULL.
Though I agree they are certainly valid reasons not to use NULL in an SQL
database system !
Still needs a lot more thinking, but one of the key
differences between scientific and business databases is that you can't accept
the 'closed world' assumption for scientific data. By its very nature it is
'open-world', i.e. the database will always be incomplete, and absence of
information from the database doesn't imply its non-existence or falsehood.
Although some people have suggested that open-world requires nVL, I don't think
we need to go that far. After all, that would require us to ditch the
relational model for scientific data - even though there is no practical
alternative.
Pascal makes a
series of criticisms of the way in which SQL handles NULL, and these criticisms
must be considered insofar as they are potentially relevant to the use of
'NULL' itself. It must be noted that SQL in itself is a language developed in a
(flawed) attempt to implement relational theory. Although it has been taken up
very widely by the developers of database management systems, SQL must never be
confused with relational theory - and criticisms of SQL, even if upheld, are
not necessarily of general relevance. Pascal's 2004 paper contains (page 9) a
set of 10 specific criticisms (italicised in the list below) of SQL and
its incomplete and inconsistent 3VL. Each of Pascal's items will be examined
below to identify any general points which might arise in relation to the use
of NULL :
·
Aggregate functions e.g. SUM(), AVG() ignore
NULLs, COUNT() does not. However useful they may be, aggregate
functions have nothing to do with relational database theory. Strictly speaking
they are applications, and it should be possible to define whatever behaviour
and rules are required into appropriate aggregate functions.
·
A scalar expression on a table without rows
evaluates incorrectly to NULL, instead of 0.This behaviour of
SQL may indeed be wrong, but this says nothing about whether NULL should be
acceptable or otherwise.
·
The expression NULL=NULL evaluates to NULL, yet
ORDER BY treats NULLs as equal.NULL=NULL correctly evaluates as
NULL. The behaviour of ORDER BY is immaterial. In a relvar the ordering of
tuples is irrelevant - this is merely an export/presentation matter. In any
case, the ordering of NULLs is by definition arbitrary and whatever ordering is
chosen by SQL cannot be 'wrong'.
·
SQL's NOT is not "not" of natural language.
This is a problem with SQL, and not relevant to the NULL question.
·
It's unknown how SQL's EXISTS should behave,
because no 3VL is defined for SQL (the definition is ambiguous in 3VL systems).
Again this is an SQL problem because it attempts to use a 3VL, but is
irrelevant to the question of the acceptability or otherwise of NULL.
·
Expressions evaluating to NULL do not violate
integrity constraints (NULL is treated as true). NULL is of course not
in itself a truth value, and expressions in a 2VL cannot evaluate to unknown.
This is an SQL problem rather than a problem with NULL. The only expressions
that can evaluate to NULL are those that are of a defined data type. Given a
truth-value data type, it can indeed contain NULLs but these data items do not
have any truth value at all - by definition - and they are certainly not of
truth value 'unknown' in the SQL sense.
·
The truth-valued data type can take the values
true, false, and the non-value NULL, but the literal NULL cannot appear in
contexts in which any literals can appear. This is a problem with SQL
and SQL-based DBMSs. It violates the principle that NULL must not be capable of
confusion with any valid data representation. This principle must be rigidly
adhered to in any truly relational DBMS. Codd's marks, kept separately from the
data, are one solution to this problem.
·
Since 3VL referencing rules are not defined, it's
unknown which primitive operators out of the 27 monadic and 19,683 dyadic would
suffice to express others and, therefore, whether all necessary operators are
supported or not. This is a possibly valid criticism of SQL's 3VL, but
does not bear directly on the NULL question.
·
Highly complex operators. This is a
matter concerning SQL as an implementation language but irrelevant to the NULL
question.
Thus it appears that
none of Pascal's criticisms of SQL's NULLs have any general validity as
criticism of the use of NULL in relvars - though they do contribute to a very
strong case that SQL does not offer an appropriate environment for handling databases
whose relvars may contain NULLs. They are criticisms of SQL, not of NULL.
They do of course
leave open the question of whether or not NULL itself is acceptable (i.e. quite
separately from its flawed implementation in SQL). In the observational sciences,
such as geology, there are very often gaps in knowledge - missing data. This is
the nature of science. Sometimes the data are not altogether missing, or there
are qualifiers explaining the reason for its missingness. But this is another
story for another day. It is possible that the closed-world assumption built
into database management systems for business applications, is itself
inappropriate, and that an open-world assumption should be used for
observational science data. If that is the case, then some writers have
suggested that a 2VL is inadequate and we need to look at nVL solutions. I
would not go that far yet - since that implies throwing out the entire
relational model for scientific data. This seems a little extreme. There is
another option - to use the relational model, and accept the prohibition on
NULLs (where they are strictly placeholders for items on which nothing at all
is known), but allow for rather richer and more intelligent logic to handle
incomplete data.
Fabian Pascal Responds: I’ll let David handle the
deeper issue, and respond to the more mundane one. Many-valued logic—which, as
the paper argues, practically reduces to three-valued logic—is outside the
scope of the relational model, and that of two-valued logic of the real world,
on which the model is based. Therefore, even if a consistent three-valued logic
were defined, inferences made from databases supporting it cannot be guaranteed
to be correct in the real world. Furthermore, such a logic would be so complex
and unintuitive, that confusion and errors would be unavoidable. So while the
criticisms SH refer to are specifically of NULL, SQL's version of three-valued
logic (which is neither well-defined, nor consistent), the problems it and its
commercial implementations exhibit are inherently endemic to many-valued
logics, in itself reason enough to eschew them. Indeed, practically no
discussion on the subject that we are aware of is devoid of confusion and
errors (see, for example, NULLs
Nullified).
David McGoveran
Responds: At the ontological level, yes, the scientific method is an
"open world". But in theory making it is crucial that the presumed
universe of discourse be carefully defined. This step is often overlooked when
scientists are working in a particular field, because it is
"understood" and within the "culture", but even within the
most narrow scientific fields errors and miscommunication can result. The
universe may be iteratively redefined, but permitting enlargement (or
contraction) without re-evaluation of the impact on other hypotheses (axioms
and therefore database design in the database world) is a good way to draw
erroneous conclusions. Indeed, a changeable universe of discourse requires
second order logic which permits paradoxes (i.e., contradictions not subject to
resolution, therefore predictions that can't be disproved by evidence, and so
NOT science).
While reasoning within any theory, the closed world
assumption (CWA) holds. Its universe of discourse is determined by the entities
modeled, relationships, constraints, dependencies, and by the domains of
permissible values. So, if you build a database for testing a theory (however
"imprecise") and find real world data that either doesn't fit or
creates a contradiction, you've invalidated the theory. You should then
redesign the database to meet the next hypothesis. This may be as simple as
dropping a constraint, or require an entirely new design with different
entities: It all depends on how broad the initial theory used for the design,
and how radical (relatively speaking) the new hypothesis. Welcome to science.
Point: In a scientific database such as the type to which you
allude, a reasonable interpretation of True and False under CWA is "valid
by experiment and consistent with hypotheses" and "not validated by
experiment or inconsistent with hypotheses". If you give this
differentiation up with CWA and nulls, you've given up scientific reasoning and
the scientific method.
Usually this doesn't have to be burdensome. Ordinary
theorizing and experimentation makes reasonable assumptions about domains. At
the theory level, a domain might be "infinite", but at the level of
experiments that always terminate in some finite measurement, and at the level
of the database which resides on a real computer with real physical limitations
used by real people with finite brains and time, the domain can be given a
bounded representation that will suffice for many hypothesis within the broader
theory.
Don't confuse CWA with an assumption that we must know
everything and embed that knowledge in a permanently fixed database schema. All
we need is understand the current boundaries on our application or
investigation. If we don't understand that, we doing philosophy, not science.
Posted
4/1/05