From: Miro Jakubowski
Date: 9 Nov 2005
I am trying to put together a presentation whose purpose
would be to convince colleagues of mine that NULL is a serious flaw of SQL,
among others. They strongly believe that NULLs are just fine, nothing wrong
with them at all. Proof? ORACLE handles them very well and they can always
decorate their queries with IS NULL, IS NOT NULL, and IFNULL(). So, where is
the problem? Our ORACLE database has over 1000 tables and tonnes of NULLs. Some
tables are basically NULLs. The only non-NULL columns are surrogate and a
natural keys. All the rest is just NULLs. The best argument would be to find a
case, or better yet several cases, that would show that this database would
return wrong answers to some queries.
From: Fabian Pascal
So what else is new?
My paper The Final NULL in the
Coffin only outlines generally problems with NULLs. You may want to provide
a list of all the references by myself, McGoveran and Date/Darwen on the
detailed problems.
Even so, don't set your expectations too high, though: even
with tons of evidence, most people fail to get it. The industry and society are
so dumbed down, that people are no longer capable of thinking and judging for
themselves, and they don't even want to. Whatever Oracle or MS (or the
government or media, for that matter) say or do, must be right. Why bother
thinking, which is so hard?
From: Hugh Darwen
I'm afraid I can't answer that one right away, but I'll try
to remember to ask Chris Date next time I speak to him.
Most of his writings on the subject, and my
own comparatively meagre contribution, are to be found in RELATIONAL DATABASE WRITINGS,
1985-1989, RELATIONAL
DATABASE WRITINGS, 1989-1991 and RELATIONAL DATABASE WRITINGS
1991-1994, all published by Addison-Wesley.
But I couldn't find that particular title (which I do
remember). Fabian Pascal (copied on
this) might be able to help.
Regarding "incorrect results", I agree completely
with Lex's take. With every language,
you get the result of the question you asked.
With bad languages, the question you asked isn't always the one you
meant to ask.
Regarding "null values", when I joined the ISO SQL
standards committee in 1988, the standard actually contradicted itself,
explicitly. Somewhere it said that NULL
is a value in every data type; somewhere else it said the NULL is not a value.
When Date and I developed THE THIRD MANIFESTO, we
decided that in view of the vast amount of confusion we perceived in the
literature—academic as well as industrial—we would make no assumptions about
the meanings of the terms for the concepts that underpin our foundation:
"value", "variable", "operator" and
"type". So we took the
trouble to make it as clear as possible precisely what we understood by them.
And that cleared the matter up for us easily.
For if x is a value, then by definition x=x is TRUE.
(And also by definition, equals comparison
is defined for all values.)
I also like to make the point that relational theory is very
firmly founded on 2-valued logic, with its Law of The Excluded Middle.
And I don't just mean WHERE conditions and
IF ... THEN ... ELSE constructs. I am
referring to relations themselves and The Closed World Assumption.
Our interpretation of an n-ary relation r in a given context
is an n-adic predicate P. The presence
of a tuple t in r is interpreted as expressing a belief that P(t) is TRUE;
contrariwise, the absence of t is interpreted as expressing a belief that P(t)
is FALSE. If a language is to be
founded on 3-valued logic, then it needs a way of expressing some third
possible value for P(t), and—excuse me for shouting—WE NEED AN INTERPRETATION
FOR THAT THIRD VALUE!
SQL admits a third value but has no way of representing (via
something akin to a relation) a belief that P(t) has that value.
And SQL has no fixed interpretation of that
third truth value, in the way that the whole of humanity has a common
understanding of TRUE and FALSE (perhaps I exaggerate).
For example (one of Lex's), the third value
cannot always mean that we do not know of a given proposition whether it is
TRUE or FALSE; for we all do know that the sum of no numbers is zero and yet
when SQL compares the sum of the empty set with 0 it yields UNKNOWN.
We conclude that the introduction of a third truth value is
futile. Some (e.g. Codd) choose to
slide further down the slippery slope by introducing a fourth truth value, and
then a fifth, and so it goes on.
(People who wish to ridicule such endeavour—and I include myself among
those people—enjoy exploring the multiplicity of reasons why a piece of
information might be "missing".
I believe the record is somewhere in the low twenties.)
Finally, I wish to add something to what Lex has said with
reference to the claim that "it doesn't really matter: we know how NULL
behaves and so all we have to do is be careful to write correct code that
avoids the pitfalls". Lex has
given our reasons for dismissing the claim, but I think it is important to
understand why such claims are made, and made so vehemently, in the first
place. I strongly believe that the
answer lies in "job protection".
Bad languages give rise to experts, who can charge highly for their
expertise. SQL is such an incredibly
bad language (on all sorts of counts, not just NULL) that I was able make a
good living out of for the last 10 years of my long career in IBM, not by being
a consultant to users of SQL databases, or an SQL application program
developer, or an SQL DBA, but by drafting text for the SQL international
standard. Some of the text I had to
draft was more difficult than anything I had ever had to write, in programming
languages, in my former job as a computer programmer!
I.e., it is more difficult to write the (retrospective)
specification than it is to write the code!
From: Fabian Pascal
I don't have anything to add to what Hugh says (I am not
familiar with Lex's stuff, though), except that in my opinion the 2V-logically
correct solution to missing (read: unknown and only unknown) data that
is transparent to the user and places the burden where it belongs, the DBMS, is
the one outlined in The
Final NULL in the Coffin.
Regarding “the multiplicity of reasons why a piece of information
might be ‘missing’”, as I argue in my paper, unless one strives to keep track
of those reasons, they are not an issue with respect to logic: essentially
there is only one kind of missing value, unknown (for whatever reasons);
inapplicable is an artifact of incorrect database design, and was indeed
a big Codd mistake. SQL exacerbated this mistake by allowing only one kind of
NULL for both unknown and inapplicable, which clearly produces logically
incorrect query results.
From: Miro Jakubowski
Thanks. Unfortunately your case cannot be executed in my
ORACLE as it supports SQL99 only. Have to wait or find something else. IS NULL,
NOT IS NOT NULL and NOT IS NULL and IS NOT NULL are no equivalent for row with
more than one nullable column. I do not think SQL99 supports IS NULL and IS NOT
NULL for rows. I'd might be mistaken though.
From: Hugh Darwen
Well, you are mistaken, but Oracle's claim of conformance is
probably only for what is known as "Core SQL", defined in the
Conformance Clause towards the end of each Part. I think you will find
that ROW( ... ) is not supported in Core.
However, try leaving out the word ROW. (c1, c2) IS [NOT] NULL is defined
to mean the same as ROW ( c1, c2 ) IS [NOT] NULL. In fact, c1 IS [NOT]
NULL is defined to mean the same as ROW(c1) IS [NOT] NULL.
I believe the shorthands, omitting the word ROW, were available in original SQL
(System R) and in Oracle-1979.
The shorthands were a bad idea, violating certain well-known language design
principles. They were a convenience that made the language impossible to
extend in a proper manner when certain other deficiencies, also the result of
naive original language design decisions, came to light in the form of user requirements.
The mistakes in original SQL that are relevant here:
1. Invention of NULL, violating the principle of "no ad hoc-ery".
2. Support for expressions of types that have no name, violating the principle
of orthogonality. Such types, although they exists in parts of the language,
cannot be the declared type of a variable, column,
parameter or user-defined function (or anything else for
which a type has to be declared by name). The
types in question here are what are now called row types. A similar
mistake was made in connection with truth-valued expressions, leading to the type BOOLEAN in SQL:1999 but again not widely implemented.
3. Support for "coercion", or implicit type
conversion, whereby the result of an expression whose type
does not match that of a comparand or a target is implicitly "converted"
to a corresponding value of the required type.
Coercion is handy at times but in general not thought by language experts to be
handy enough to compensate for the problems it gives rise to. I guess
these are mainly problems encountered when the language is to be
extended. In "c1 IS NULL", c1 is implicitly converted into a
singleton row, and that's one of the main reasons why "is the null
value" and IS NULL do not mean the same thing. There are several
other cases of coercion in SQL, notably the so-called scalar subquery, as in
SELECT * FROM T1 WHERE c1>= ( SELECT AVG(c2) FROM T2 ). The expression
in parens is a table but c1 is a number. As > isn't supported for
tables but is for numbers, the system implicitly converts the table to a number,
namely the number in the only column of the only row of the table. This
bad design decision by the System R cavaliers made it impossible to introduce
support explicit table types into the language except by means of special key
words, and even then it turned out to need more than one key word (e.g., TABLE,
LATERAL) because of other stupid mistakes that I understand but are too messy
for me to explain here.
4. Allowing a parenthesised expression to have a different meaning from the
expression enclosed by the parens. The expression (1+2) sometimes denotes
the integer 3 but at other times denotes a row having 3 as the value of its
single column. It denotes 3 in (1+2)-3 but denotes the row in (1+2) IS
NULL, and also, strangely, in (1+2) = 3, where the second comparand, 3, also in
this case denotes a row! The rules governing when (1+2) denotes a row and
when it denotes 3 expressed in the BNF and are bizarre. The user requirement for explicit row types is clear (think of
postal addresses, for example), but adding them to the language was such a mess
that the international standard is quite possibly the only place where they
exist!
5. Last and perhaps least, IS NULL doesn't even mean what it says.
It should be ARE ALL NULL, because in general the operand is a list of expressions
(denoting a row, as already explained) and the intention of (a, b, c ... ) IS
NULL was to return TRUE if and only if each one of a, b, c, ... "is the
null value". (In the end, the intention wasn't met in any case, as
already explained.) Perhaps IS NULL could have been a synonym for use in
the special case of x IS NULL.
Finally, as Oracle doesn't support explicit row types, let me give you a
similar example involving something that Oracle certainly does support: user-defined types. Suppose values of
type
POINT are represented by components defined as X REAL and Y REAL.
Consider the point p5N whose X coordinate is 5.0 and whose Y coordinate is CAST
NULL AS REAL. What is the result of the comparison P5N = P5N? And
P5N IS NULL? And what about the point pNN whose coordinates are both
"the null value"? Consider the wider implications of the
anomaly you have just observed (assuming you answered the questions correctly:
UNKNOWN and FALSE, respectively, in both cases). Think of applications
and user-defined functions that like to use IS NULL to
test that their input contains no appearances of NULL. (I am aware of the
fact that hardly anybody uses user-defined types in SQL,
in spite of the enormous amount of money that was spent of their development in
the 1990s.)
I know that your colleague will remain completely unfazed by these
observations. What saddens me is that such people are ignorant of what
might have been, if only Codd's inspiration had led to a decent implementation
(which it did, actually, but nobody seemed to notice, thanks in part to Larry
Ellison but perhaps mainly because the decent implementation, ISBL, which
predated System R, wasn't developed in the U.S.A.).
From: Miro Jakubowski
Just did test the user defined types and NULL. I've got
different answers. Here is what I did:
- CREATE TYPE POINT AS OBJECT (X
REAL, Y REAL);
- CREATE TABLE T (P POINT);
- INSERT INTO T VALUES(POINT(NULL,NULL));
- INSERT INTO T VALUES(POINT(5.0,NULL));
- INSERT INTO T VALUES(POINT(5.0,10.0));
- INSERT INTO T VALUES(POINT(NULL,10.0));
- SELECT * FROM T;
C(X, Y)
--------------
POINT(NULL, NULL)
POINT(5, NULL)
POINT(5, 10)
POINT(NULL, 10)
- SELECT * FROM T WHERE C=C;
C(X, Y)
------------------------------------------
POINT(5, 10)
- SELECT * FROM T WHERE
C=POINT(NULL,NULL)
no rows selected
- SELECT * FROM T WHERE NOT C=POINT(NULL,NULL);
no rows selected
- SELECT * FROM T WHERE C=POINT(5.0,NULL);
no rows selected
- SELECT * FROM T WHERE NOT C=POINT(5.0,NULL);
no rows selected
- SELECT * FROM T WHERE C=POINT(5.0,10.0);
C(X, Y)
---------------
POINT(5, 10)
- SELECT * FROM T WHERE NOT C=POINT(5.0,10.0);
no rows selected
- SELECT * FROM T WHERE C=POINT(NULL,10.0)
no rows selected
- SELECT * FROM T WHERE NOT C=POINT(NULL,10.0);
no rows selected
- SELECT * FROM T WHERE C IS NULL;
no rows selected
- SELECT * FROM T WHERE NOT C IS NULL;
C(X, Y)
------------------------------------------
POINT(NULL, NULL)
POINT(5, NULL)
POINT(5, 10)
POINT(NULL, 10)
- SELECT * FROM T WHERE C IS NOT NULL;
C(X, Y)
------------------------------------------
POINT(NULL, NULL)
POINT(5, NULL)
POINT(5, 10)
POINT(NULL, 10)
Thus, I've got:
- IS NOT NULL = NOT IS NULL
- C=POINT(X,Y) OR NOT C=POINT(X,Y) holds only for X=5 and
Y=10. Hence, Tertium non datur doesn't hold.
More, C=C returns only one row and C IS NOT NULL returns all
four rows.
From: Hugh Darwen
Magnificent findings! I have a feeling that some of
these results are contrary to the international standard, but I don't have time
or inclination to check right now. I can't tell you how much I hate
putting my brain into 3VL gear!
But these results are to my mind a graphic and compelling demonstration of the
folly of even considering basing a would-be relational language on 3VL.
Thanks for the good work!
On reflection, I think the Oracle results conform to the
standard with the exception of the key word OBJECT used in the CREATE TYPE
statement (what does it mean?).
POINT(null,null) IS NULL is defined to return FALSE. Under the default
method for evaluation of "equals", POINT(null,null) = POINT (null,
null) is defined to return UNKNOWN.
One of my observations about the introduction of row types and user-defined
types was that IS NULL returning FALSE is no longer a guarantee that X=X yields
TRUE.
From: Lex de Haan
This is on Oracle, right?
All these results "make sense" to me, within the
insanity of 3VL of course. Oracle simply decomposes UDTs to the level of the
basic built-in data types, so all these comparisons are about comparing two
numbers.
as soon as one of the four operands involved (when comparing
two point values) is NULL, the = comparison operator returns UNKNOWN and the
negation of UNKNOWN is UNKNOWN.
the only one that confuses me at first sight is "C IS
NOT NULL returns all four rows" but unfortunately I have no time right now
to investigate this any further—although I do remember vaguely that
POINT(NULL,NULL) is treated differently from just NULL in a UDT column.
From: Miro Jakubowski
It makes sense to me. POINT(NULL,NULL) is a value of type
POINT. Thus, it is not NULL.
Yes, it is Oracle 9i release 2.
From: Hugh Darwen
Isn't "sense" a bit of an exaggeration here? :-)
You appear to be assuming that for all x, for all t, x is a value of type t
implies that x is not NULL. But we aren't even clear on what "is not
NULL" means. We know that if we study SQL very carefully indeed we
can find out exactly what the SQL expression "x IS NOT NULL" means,
but when we do so we find that it is not consistent with our intuitive understanding
of the English expression "x is not NULL".
Also, I have to point out that POINT(NULL, NULL) is not actually a valid SQL
expression. To obtain the effect that I think we all agree is required
here, you would have to write POINT(CAST NULL AS REAL, CAST NULL AS
REAL). In the light of that observation, I have to ask if you think CAST
NULL AS REAL denotes a value of type REAL. Care is needed, because the
expression "CAST NULL AS REAL IS NULL" is most definitely defined to
return TRUE.
I hope you get my drift. If CAST NULL AS REAL does not denote a value of
type REAL, how can it be justified that POINT ( CAST NULL AS REAL, CAST NULL AS
REAL ) does denote a value of type POINT?
From: Miro Jakubowski
Here's what I've got:
1.
SQL> SELECT
* FROM T;
C(X, Y)
------------
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
7 rows selected.
2. SQL> SELECT * FROM T WHERE CAST (NULL AS
REAL) IS NULL;
C(X, Y)
------------
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
7 rows selected.
3. SQL>
SELECT * FROM T WHERE NULL IS NULL;
C(X, Y)
------------
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
7 rows selected.
4. SQL> SELECT
* FROM T WHERE NOT CAST (NULL AS REAL) IS NULL;
no rows selected
5. SQL> SELECT
* FROM T WHERE NOT NULL IS NULL;
no rows selected
6. SQL> SELECT
* FROM T WHERE CAST (NULL AS REAL) IS NOT NULL;
no rows selected
7. SQL>
SELECT * FROM T WHERE NULL IS NOT NULL;
no rows selected
8. SQL>
SELECT * FROM T WHERE NOT CAST (NULL AS REAL) IS NOT NULL;
C(X, Y)
------------
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
7 rows selected.
9. SQL>
SELECT * FROM T WHERE NOT NULL IS NOT NULL;
C(X, Y)
------------
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
7 rows selected.
Thus, even if I cast NULL as REAL, it keeps behaving as NULL
not as a value of type REAL.
10.
SQL> SELECT * FROM T WHERE C=C OR NOT C=C;
C(X, Y)
-----------
POINT(1, 2)
POINT(1, 2)
Thus, OR is not commutative!?
11. SQL>
INSERT INTO T VALUES (POINT(NULL,2));
1 row created.
12. SQL>
SELECT * FROM T;
C(X, Y)
-------
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
POINT(1, NULL)
POINT(1, 2)
POINT(NULL, NULL)
POINT(NULL, 2)
8 rows selected.
13. SQL>
SELECT * FROM T WHERE NOT C=C OR C=C;
C(X, Y)
-------
POINT(1, NULL)
POINT(1, 2)
POINT(1, NULL)
POINT(1, 2)
14. SQL>
SELECT * FROM T WHERE C=C OR NOT C=C;
C(X, Y)
-------
POINT(1, 2)
POINT(1, 2)
Thus, NOT C=C OR C=C holds for POINT(1,NULL) but doesn't for
POINT(NULL,2). Why?
From: Hugh Darwen
To: Miro Jakubowski
There are some bizarre results here.
By the way, unless type BOOLEAN is supported it should be
impossible to give a WHERE condition that returns rows in which C IS NOT NULL
and yet C=C returns UNKNOWN. The
standard makes this possible by supporting IS UNKNOWN (along with IS TRUE and
IS FALSE) as part of its definition of the BOOLEAN type.
Thus, with support for BOOLEAN you could
write WHERE C = POINT(1,1) IS UNKNOWN.
Note the arbitrary comparand here,
You could also write WHERE C = C IS UNKNOWN, but that could be
burdensome when C is replaced by a very complicated expression.
10. I don't
understand. WHERE NOT C=C OR C=C should give the same result and in fact SQL's
OR is defined to be commutative. Incidentally, WHERE NOT ( C=C OR NOT C=C
) should also give the same result! Where did that column heading come
from? One expects to see a column name. I can see the
justification, though.
11. Ugh!! We
don't "create" rows!
13. I wish you
had inserted parens to show the order of operation,
but it shouldn't make any difference. The only rows that should appear in this
result are those which have genuine POINT values (i.e., the two with
POINT(1,2)).
14. Yes, that's
the "right" answer. And now
we can suspect that OR is not commutative, if NOT C=C OR C=C is (NOT ( C=C)) OR
C=C. If instead it is NOT ( C=C OR C=C
), then we have to conclude that p OR p isn't equivalent to p!
Search me!
Posted 2/3/06
© Fabian Pascal 2006 All Rights Reserved