Saturday, April 1, 2017

"NULL Value" is a Contradiction in Terms



There is nothing wrong with Hugo Kornelis' picture of SQL NULL in NULL: The database's black hole. In fact, I recommend the series of which it is one part. It's SQL's picture of how to treat missing data that's wrong.
"Let’s first take a look at what NULL is supposed to be. Here is the definition of NULL from the SQL-2003 standard: null value--A special value that is used to indicate the absence of any data value."
While the absence of a value may be represented by some value at the implementation level, I strongly recommend users not think of NULL as any kind of value at the model level. The problems with NULL stem precisely from the fact that it is not a value, but rather a marker for the absence of a value. NULL value is a contradiction in terms that distracts from the problems.

--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
---------------------------------------------------------------------------------
"... NULL does not mean “not applicable”. Of course, a NULL in a table is often a result of the attribute not being applicable for a specific occurrence of the entity stored in the table (e.g., a column “birthday” in a customer table that stores details of businesses as well as humans)--but in other columns and other tables, or even in another row of the same table, there might be a completely different reason for the data being missing (absent)!"
Indeed. As I explained many times, the need for "inapplicable NULLs" is an artifact of poor design. If a property of a real world object "does not apply", the object does not have that property, so there should be no attribute representing the property for the tuple representing (the fact about) that object. The only reason NULL arises is because the design assigns an attribute to a tuple that does not have it--i.e., the logical model does not represent the conceptual model accurately [1]. Nothing is missing!
"... NULL is also definitely not meant to signify “unknown”. Again, a NULL in a table might result from the value being unknown at data entry time ... but there might be other reasons as well. Unfortunately, many text books insist on explaining the behaviour of NULL in expressions by describing NULL as unknown, rather than missing or absent, causing this misconception about NULL to be the most widespread and the hardest to combat ... If the business doesn’t care why a birthday is missing, then the fact that there might be different causes doesn’t have any consequences on the database. If, on the other hand, the business does case about the reason ... then this reason should of course be modeled and stored in the database--but not in the same column ..."
Couldn't agree more and I explicitly state this in [2], including how the DBMS could document the reasons in the database catalog. But it is important to be clear that while a value can be missing for any of a number of reasons, if it is missing for any reason, it is unknown to the DBMS.
"Things get more complex when NULLable columns are used in expressions and predicates. In a procedural language, this wouldn’t have been a problem--if a procedural program fails to find the information it needs, it enters a conditional branch to handle this situation, as defined by the programmer. In a declarative, set-based language such as SQL, this was not possible. The alternatives were either to have the SQL developer add conditional expressions for each nullable column in a query to handle missing data, or to define a decent default behavior in SQL for missing data so that developers only have to write explicit conditional expressions if they need to override the default behavior."
While NULL default behavior may have been expedient for SQL implementers, they disregarded implications for the formal theoretical foundation of the RDM. The RDM is an adaptation of simple set theory (SST) and first order predicate logic (FOPL) to database management that works correctly only with two-valued logic (2VL) and under the Closed World Assumption (CWA): in the real world a fact is unequivocally true or false, whether we know which is the case or not. If our knowledge of the world is imperfect, a database system based on many-valued logic (nVL where n>2) will not yield results that are correct in the real world (in fact, efforts notwithstanding, no consistent nVL has been found for any n>2, and not for lack of trying [3]). nVL makes data language expressions complex, unintuitive and prone to error due to realm confusion. Users think in 2VL, while SQL systems think in a (SQL's poorly implemented) 3VL. Worse, in practice NULL is used to represent both 'missing unknown' and 'inapplicable' (i.e., a 3VL is implemented for 4VL databases), exacerbating the problems: queries are not always the intended ones and results are easily misinterpreted (see next).
"The default NULL handling in expressions is very much based on how humans would handle similar situations ... if I refuse to specify my birthday, you would be unable to answer the question--so you wouldn’t answer it at all. And this is exactly what a database does--if any value to be used in an expression is missing (NULL), there won’t be any result of the expression; in other words, the result is missing (NULL) as well ... the rule of NULL propagation causes any expression that has a NULL in it to have no result, as if the other input values disappear as well ... the result being either true (row is included), false (row is omitted), or unknown (in which case the row is omitted as well)."
Except that when a human says explicitly "I don't know" you can then probe to find what exactly he does and does not know and figure out the implications. A SQL system eliminates parts of the result--in fact, more than is necessary--without the user being aware of it, not to mention the query and integrity implications.
"Hey dude, I can’t answer that" is obviously not a valid value in any numeric, date, time, or datetime domain ... NULL however is valid, in any domain. So there really was no other choice but to return NULL if input data for an expression is missing ... This choice solves the problem, since we are now able to evaluate each predicate in a WHERE clause for each row..."
Actually the validity of NULL in every domain obscures, rather than solves the problems, which are not in data representation, but in data manipulation and integrity enforcement. Practitioners focused on the former are oblivious to the latter.

As we have demonstrated, these problems can be avoided by a 2VL, relational solution to missing data without NULL [2]. It shifts the complexity burden from users to DBMS vendors, where it belongs. But, sadly, because they lack foundation knowledge, instead of demanding it, many data professionals defend NULLs, some ferociously. If you are one of those who ask, "If true relational systems are so great, why has nobody implemented one?", therein lies the the answer. Why bother if everybody believes that SQL systems are relational, which they are not and NULL is one reason why. [4]


See also: The worst mistake of computer science
 

References


[1] Pascal, F., Business Modeling for Database Design, PRACTICAL DATABASE FOUNDATIONS paper #3.

[2] Pascal, F., The Final NULL in the Coffin: A Relational Solution to Missing Data, PRACTICAL DATABASE FOUNDATIONS paper #6.

[3] McGoveran, D., Nothing From Nothing, Parts 1-4.

[4] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS: A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL (Database Debunkings, 2017).








14 comments:

  1. Clear and informative reading as always. Pursuant to NULL, I noted the following answer on Quora the other day: https://www.quora.com/What-are-C-J-Dates-criticisms-of-SQL. The author of that answer noted:

    >>As for NULLs, the embedded database I helped to implement started out with >>them, and we realized that supporting NULLs massively bloated out the engine, >>the storage layer, and the generated C code, and greatly hindered >>performance, in addition to lots of semantic evil.

    >>After much thought, we decided to rip out NULL support. This made the core >>engine about 15% smaller, reduced storage footprint by about 10%, and made >>the generated C about 75% smaller. It meant we had to jump through hoops to >>support outerjoins (we ended up having a non-storable NULL and "ifnull()" >>function to support them), but few customers missed NULLs once they learned >>to deal with them not being available.

    Interesting, and interesting also would be if the customer's of that DBMS requested a DBMS solution to missing data as you suggest vs. having to "learn to deal with" a solution not being available at the DBMS level. Do you think as a customer of a SQL DBMS like Oracle or SQL Server it would be a good idea to ask for such support as an alternative to NULL, or is NULL so baked into SQL that the only solution is as Date suggests to start over with a new language specification?

    ReplyDelete
  2. The fundamental problem with NULL is that it violates two-valued logic (2VL), the Closed World Assumption (CWA), the Information Principle (IP) and, therefore, the formal theoretical foundation of simple set theory (SST) and first-order-predicate logic of the RDM. Otherwise put, correctness is no longer system-guaranteed all bets are off. All complications and problems are rooted in that.

    The solution is certainly not to "rip out NULL" and leave it to users and application developers to deal with the implications of missing data--that's why we used to do before we had databases and we invented them to avoid this--but to substitute a logically correct 2VL solution that is relationally compliant, such as the one proposed in "The Final NULL in the Coffin".

    Neither DBMS customers, nor vendors have the decent enough grasp of data fundamentals and the RDM necessary to appreciate the problems and their causes and to ask for and to implement the correct solution. Indeed, to a vast majority SQL systems ARE relational and blame the RDM blame for problems due to SQL's relational INFIDELITY. Upside down and backwards.

    Under these circumstances what is the incentive for producing true RDBMSs/data languages and where is the expertise to do it?

    Every new system that I see is ad-hoc and arbitrary and it's not even clear what is their formal data model, because the designers are unaware that one is needed and what it should consist of.

    ReplyDelete
  3. Fabian, Would you mind adding a practical example to demonstrate the problem of NULL?

    ReplyDelete
    Replies
    1. As I state in every post, as a rule I do not respond to anonymous comments.

      There isn't A problem with NULL, there are many problems which have been amply documented in the literature e.g., by Date and McGoveran beside myself. I recommend my paper "The Last NULL in the Coffin" which lists references.

      But without some basic notion of data and relational fundamentals it's harder to grasp all the problems.

      Delete
    2. And then those references are only about NULL in database land. Remarkably similar misery has been observed with NULL in programming land : https://www.lucidchart.com/techblog/2015/08/31/the-worst-mistake-of-computer-science/

      Delete
    3. The consequences of violating 2VL do not go away if you're in higher logic than FOPL.

      Delete
  4. You can also try the references in my post and the link that Todd provided in his comment.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. There's nothing relationally wrong with "ripping out nulls". If all that's desired is that some entities have attributes that others don't have, the relational approach requires distinct relations. Nothing Codd wrote prevents the presentation of multiple relations, he only assumed the selection (ie. qualification he described in his first two papers - "Some Linguistic Aspects") and updating of tuples involves a single relation AS FAR AS THE USER VIEW IS CONCERNED (multiple relations might be updated without the user knowing). The original use of nulls (was it by System R?) seems to have been an attempt to force-fit table presentation aspects into relations,

    As long as whatever presentation facilities (eg. a presentation language distinct from relational algebra) guarantee that a function such as expressed by "isnull(...)" qualifies a single declared relation or subset of one (eg. one with a distinct predicate or set membership), a relational dbms can select and update such single relations. Of course in practice by itself a device such as "isnull" is not likely to always to qualify a set of tuples in a single relation which is to say one or more application so-called entities so you do have to wonder why it would be needed when Codd provided primary keys for qualifying subsets or speaking roughly sets of entities.

    ReplyDelete
  7. what about modeling a non-identifying relationship where the FK can be null? How can you model that if you reject all NULLs?

    ReplyDelete
    Replies
    1. I have no idea what a "non-identifying relationship" is. What does a NULL FK *****MEAN*****? Anything that you cannot identify is meaningless.

      My paper "The Last NULL in the Coffin answers your question.

      Delete
    2. well have a look at the "Nonidentifying/optional" part on page 279 of book "Beginning Relational Data Modeling"

      https://books.google.com/books?id=62CFtFea0NsC&pg=PA279&lpg=PA279&dq=nullable+foreign+key+optionality&source=bl&ots=CM0t0Jfb6L&sig=9F_WdkXta3v-3Uo6Ym8r7UsYJ9A&hl=en&sa=X&ved=0ahUKEwi1v_Sm77rTAhUJOsAKHXM_BMgQ6AEISjAG#v=onepage&q=nullable%20foreign%20key%20optionality&f=false

      Delete
    3. 99% of what is published on the RDM is clueless. My recommendation is to be extremely skeptical of what you read. Anybody who accepts NULLs and uses that kind of terminology cannot possibly have a grasp of the RDM.

      I urge you to read my paper and see if it proves my points.

      Delete

View My Stats