Saturday, January 9, 2021

OBG: Missing Data -- Many-valued Logics and NULL Part 1

Note: To demonstrate the correctness and stability of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.

In response to a LinkedIn exchange we continue the series about missing data, NULL and the RDM. In Parts 1,2 and 3 we re-published a past exchange between myself and Hugh Darwen on the pros and cons of our relational solution to missing data vs. Hugh's "horizontal decomposition".

Here we re-publish my debunking of reactions to an article of mine exhibiting the common confusions evoked by NULL.


DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.


-12/24/20: Added 2021 to the POSTS page

-12/26/20: Added “Mathematics, machine learning and Wittgenstein" to LINKS page

- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- The PostWest blog for monthly samples of global Antisemitism – the only universally acceptable hatred left – as the (traditional) response to the existential crisis of decadence and decline of Western  civilization (including the US).
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

NULLs Nullified

(originally published 03/2008; revised for re-publication for consistency with current state of knowledge and for clarity) 

“The idea that you will always know everything is arrogant.” Joe Celko 

“The idea that anything can be inferred from what you don’t know is ignorant.” Fabian Pascal

It is not possible to be a database professional — distinct from a sheer practitioner — without a good grasp of data fundamentals. The majority in the industry possess little thereof, if any and, worse, are outright unaware of such. Most of what is taught and published in academia and the trade media centers almost exclusively on tools and is totally devoid of fundamentals[1]. On the rare occasions that it refers to them, it is wrong[2].

In "If You Liked SQL, You’ll Love XQuery" I criticized the principal author of SQL — which, after all, was supposed to be a relational data sub-language — for numerous violations of relational principles, one of which is NULL. I wrote:

“The relational model is first order predicate logic [adapted for and] applied to databases. Predicate logic is the real-world’s true/false two-valued logic (2VL) ... that guarantees correctness — defined as [logical validity and semantic] consistency. It is to preserve correctness, therefore, that Codd’s Information Principle mandates all information in relational databases be represented [explicitly and exactly in one and only one way:] as values [of attributes defined on domains] in relations. The term "NULL values" suggests that Chamberlin does not realize the core problem [with] NULL [is precisely] that it is not a value — indeed, it is [a mark for] the absence of a value — [a violation of 2VL and, thus, RDM's] IP. Whatever a table with NULLs is, [it is not a R-table that visualizes a relation], because they cannot be treated as a value by the DBMS.”

Consider now some reactions to my article at

“If you removed NULLs from relational database design, people would reinvent them (poorly) -- probably by using IDs of -1 or 0, or IDs to a special magic "null" row, which I suspect is what he’s talking about by "it can be handled relationally." To suggest that missing or inapplicable values are not part of "the real world" is so wrong it’s ... well ... wrong. Anyone who’s actually done database work (or programming work, for that matter) knows this.”
Most practitioners are unfamiliar with the history of their field of practice and, therefore, it's hardly surprising that they would regress back to the pre-database world of those very "special values", for the problems of which NULL was intended as a solution. But those who don’t know the past are condemned to repeat it[3]: SQL authors did not understand the RDM either — NULLviolates it and replaces the complexity of  special values in application code with another in the data sublanguage, SQL statements heavily qualified with IS NULL/IS NOT NULL is only a small part thereof. So no, that's not what I meant by "can be handled relationally" — see our above paper for that

Furthermore, it is important not to confuse:

  • the real world and our imperfect knowledge thereof: missing data do not reflect the former, which obeys 2VL — facts are either true or false — but the latter, which obeys 3VL (true/false/unknown). That's what I mean by "NULL does not represent anything in the real world" — 3VL results are incorrect in it. Humans think in 2VL and, thus, like all many-valued logics, besides serious complications and problems (which are beyond the scope of this discussion[4]), 3VL is also counterintuitive, which induces errors.
  • missing with inapplicable data: data is missing only when information is unknown; if it does not apply nothing is missing. For example, consider salaried employees, only some of whom also earn a commission. In the SQL table
the rows for the employees who don't has NULLs in the COMMISSION column. But an inapplicable property means the entities don't have it, so why is there a COMMISSION column representing that property for the rows representing the tuples representing those entities? Those NULLs are, in fact, an artifact of semantically inconsistent design (i.e., a database representation unfaithful to the conceptual model — for the correct design see our paper). Worse, even if SQL implemented 3VL and the RDM correctly (it doesn't), practitioners use NULL to represent both unknown/missing and inapplicable data and SQL DBMSs do not support 4VL. Thus, while SQL DBMSs yield results that are not only inconsistent with the real world, but also with the 3VL of imperfect knowledge too! This causes no end of confusion.

Note: We do not know what "null row" means (there is such a thing as "NULL row" in SQL, but that’s not what is meant here).

“SQL NULLs are the worst thing since unsliceable bread. They break boolean logic. You would think that if (X = Y) is false, then (X != Y) would be true. With SQL, if either X or Y or both are NULL, then any expression evaluating it is false. I understand the argument (NULL indicates no data — so you can’t claim it’s equal to anything). Academic ##@!&. Anyone who’s maintained code using SQL NULL semantics will agree. If you really want to claim that NULL is so much 'not a value’ that you can’t compare it to anything, then do it the man’s way and throw a ##@!& exception. Of course, anyone can see that doing so would make code which MIGHT encounter a null value even MORE difficult to maintain, so they came up with this ‘any comparison to NULL is false’ ##@!&. The easiest way to define NULL is that it’s equal to another NULL value, but not equal to anything else. Then I don’t need any special ‘is null’ clause either. The very definition of broken.”

  • That any nVL where n>2 is complex, counterintuitive and prone to errors in both implementation and use is confirmed by both critics and defenders of NULLs often making the same mistakes when they justify their position, which were also made by the authors of the SQL standard and its various commercial dialects[5].
  • If you don't know X and Y, neither do you know whether X = Y is true or false — comparison of any value or NULL to NULL always evaluates to NULL.
  • It's not that "we want to claim that NULL is so much 'not a value'" — the absence of a value is really not a value and cannot be treated as such for integrity and manipulation purposes. That's why "NULL value" is a contradiction in terms, a violation of IP and the RDM, with loss of guaranteed correctness.

“It’s not the SQL standard’s fault if your code’s logic can’t handle that case. Nor is it the standard’s fault that you can’t see fit to NOT USE the NULL feature when you don’t want to (and any reasonable database even goes so far as to give you the option of making *sure* you don’t use it — that’s what the NOT NULL declaration when defining a column is for). Getting rid of NULL isn’t going to help you when you suddenly discover that you really DO need to be able to represent ‘missing data’ somehow.”
Aside from the internal inconsistency in this paragraph, it’s the SQL standard committee, not the application code, that is at fault. NULL is a blunder — a unnecessary one given that there is a relational solution to missing data.
“The simple reason for NULL being an integral part of relational databases comes from the foundations of the concept: relations are sets, and the whole idea is based on set theory. A table is a set of rows, and a row is a set of attributes. Every set contains the null set as an element. Therefore, without changing the relational model to be based on some other premise, null must exist. Granted, there are some problems, but it is what it is...”

  • NULL is a violation, not an integral part of the theoretical foundation of relational databases and all bets are off;
  • Relations are grounded in naive (i.e. simple) set theory (SST)[6,7]: sets of tuples drawing values from simple domains (that do not have sets as values)[8,9] (and NULL has nothing to do with the null set);
  • R-tables can visualize relations, but play no role in the RDM[10] — tables that display NULLs are not R-tables;
  • NULL advocates focus on relational structure (which they confuse with tables[6,7]) and are oblivious to the implications for data integrity and manipulation (RA), which is where the problems manifest themselves[8].

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.



[1] Pascal, F., Database Management No Progress Without Data Fundamentals


[3] Pascal, F., Graph Databases They Who Forget the Past...

[4] McGoveran, D., Nothing From Nothing (4 part series)

[5] Date, C. J., and Darwen, H., A GUIDE TO THE SQL STANDARD, 4th ed.

[6] Pascal, F., What Relations Really Are and Why They Are Important

[7] Pascal, F., Tables — So What?

[8] McGoveran, D., and Date, C. J., Nothing to Do With the Case





No comments:

Post a Comment

View My Stats