Saturday, July 28, 2012

NULL Confusion


AP: Just got this remark from one of my pages:
I was asked to look at NULLs in a Database by someone for whom I'd done some performance improvements to their production database, which included making a number of fields nullable. The basis of the argument in the article - that the value of a NULL field is unknown - is completely wrong. We know precisely what the value is. A NULL value in a field means that it "has no value". This is not the same as unknown, unsure, don't know, forgotten, haven't been told, etc... as all of those imply that is does have some value. It doesn't. Space precludes me from giving a full explanation of this concept, but on request I will happily do that plus give examples. But your example of count(available) from accounts belies your lack of understanding of the concept. With three rows in the table of which one is NULL, the count MUST return two, as only two rows have a value. The rules of mathematics state this.


Regarding your final points, introducing NULLs does not mean n-valued logic - it means 3-valued logic (three state logic). Neither does it introduce uncertainty. In fact, the reverse is true. I have a recent example where a number field was made not nullable, and held zero when it should have held NULL. The problem was they could not differentiate between records that really had a zero value and those that had just defaulted to zero (and should have been NULL). I have been designing relational databases for twenty years now, and it worries me that such articles get published where a lack of understanding of basic mathematics is shown. People will get the wrong impression, and write bad solutions. The only good thing is that as a freelance, I may get plenty of work.
Well, well.



This is one of the many excellent demonstrations that SQL's NULL is a bad idea: it confuses practitioners over and over and yet they fail to draw the logical conclusion that the confusion is due precisely to the problems of three-valued logic (3VL) and its bad implementation as NULL in SQL (which is itself due to the same).

If "a column does not have some value", there are only two possibilities: either there should be a value, but it is missing because it is unknown, or the value is inapplicable, in which case nothing is missing.

He seems to imply that NULL means inapplicable and that is what 3VL is about. Alas, if that is the case, NULL is inapplicable too: if certain attributes do not apply for some entities, the columns representing those attributes should not apply to the rows representing those entities. In other words, there is an entity supertype-subtype situation and the table was improperly designed and this created a fake need for NULLs.

That is why I agree that we need to worry only about 3VL--true/false/unknown, but that leaves us enough to worry about. By focusing on their representation in tables, but ignoring their implications for integrity enforcement and manipulation  most practitioners are lured by NULLs into a false sense of security. An enormous amount of evidence for the problematics involved has been provided by C. J. Date and others  and I won't repeat it here (see the references in my The Final NULL in the Coffin paper).

If his case is about unknown, here are some examples:

  • Confusion #1: It is not the "NULL field" that is unknown, but the value that ought to reside in that field and is missing.
  • Confusion #2: A "NULL value" is a contradiction in terms. Per his own admission the whole point of a NULL is precisely that it is not a value, but rather a mark for the absence thereof. How can "there is no value" mean "we know exactly what the value is"?
  • Confusion #3: What is the query to which COUNT(available) is the answer? If the DBMS yields 2, it answers the question "What is the number of accounts with known information"; if it yields 3 it answers the question "What is the total number of accounts?". Which did the user intend, on what grounds does the SQL DBMS decide and does the user know which query was answered? Does he care? He certainly should.

What about those 0's? Well, if they stood for inapplicable, then they were due to poor design, not to non-nullability and repairing the design would correct the mistake. But if they stood for unknown, then all the problematics of 3VL and its bad SQL implementation would kick in.

What is more, SQL's NULL is used for both unknown and inapplicable. This exacerbates the problems by applying a poorly implemented SQL 3VL to a 4VL situation. This would not be a problem if nobody ever misdesigned tables with NULLs representing inapplicable, but it looks like this is exactly what happened in his case and yet he defends NULLs and criticizes others for what he is guilty of.

For an outline of a relational solution to missing data without NULLs, see my above mentioned paper.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

3 comments:

  1. Allen,

    Can you please reply to this comment? Thanks.

    ReplyDelete
  2. Do I understand you correctly on proper database design, in the following attempted paraphrase?

    Both issues arise in real-world information management:

    1) Some records are such that a given attribute is inapplicable to them.

    For example:
    Problem - some contacts have fax numbers but others do not
    Solution - design the CONTACTS table not to have the attribute [fax number]; make that attribute part of another table such as FAX NUMBERS

    2) At times information which is applicable to all records in a table is simply missing from the database in certain records.

    For example:
    Problem - a database of published books is missing some of the ISBNs
    Solution - design without NULLs [see your paper "The Final NULL in the Coffin"]; (how to do this is what most of us don't know)

    ReplyDelete
  3. 1) The situation is one where entities have some attributes in common and some unique attributes -- iow, an entity supertype-subtype situation, which can be somewhat tricky if there are multiple kinds of entity subtypes. In the simple case of one entity supertype and one subtype, the Principle of Orthogonal Design (POD) calls for (1) one table for the entity supertype with all common columns (2) one table for the entity subtype, with ***both the common and unique columns*** and (3) a disjunctive constraint: no entity can appear in both tables.

    2. My paper is a concept paper. I outline the idea and illustrate it. It requires serious research to flesh out its many implications for integrity and manipulation. And for users to take advantage of it, it must be implemented in DBMSs.

    If it is proven correct, the beauty of it is that the user does not have to do anything differently--you will just need to get used to getting multitable query results that will signal to you the presence of missing data.

    ReplyDelete