Saturday, January 16, 2021

OBG: Missing Data -- Many-valued Logics and NULL Part 2



Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

In response to a LinkedIn thread where interest was expressed in a "NULL research" project, we re-publish as "Oldies But Goodies" a series of old exchanges with readers reacting to our The Final NULL in the Coffin: A Relational Solution to Missing Data: in Parts 1, 2, 3  we re-published a past exchange between myself and Hugh Darwen with a reader on the pros and cons of our solution vs. Hugh Darwen's "horizontal decomposition" NULLless solution, and in Part 4 my debunking of reactions to an article of mine on the subject which exhibit the confusion inevitably evoked by NULL.

We complete this series with an exchange with a reader who claimed that our criticism applies to SQL's specific implementations of NULL, not to the NULL concept itself. I and David McGoveran explained that the implementation flaws in SQL derive directly from the deeper problems endemic to many-valued logics

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST UPDATES
-12/24/20: Added 2021 to the
POSTS page

-12/26/20: Added “Mathematics, machine learning and Wittgenstein to LINKS page

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- The PostWest blog for monthly samples of global Antisemitism – the only universally acceptable hatred left – as the (traditional) response to the existential crisis of decadence and decline of Western  civilization (including the US).
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

More on "The Final NULL in the Coffin"

(originally posted on 4/2005, revised for consistency with the new understanding of the RDM and for clarity)

“Have now had a fair amount of time to digest The Final Null in the Coffin, https://www.dbdebunk.com/p/papers_3.html and I have compiled some notes that you might like to respond to. These form one section of a paper that I'm writing on the problems of nulls, incomplete data, and suchlike, in observational science databases. Maybe I've missed the point, and you were just making valid criticisms of SQL, but it seemed to me that you were using these as arguments against the use of NULL. Though I agree they are certainly valid reasons not to use NULL in an SQL database system !

Still needs a lot more thinking, but one of the key differences between scientific and business databases is that you can't accept the 'closed world' assumption for scientific data. By its very nature it is 'open-world', i.e. the database will always be incomplete, and absence of information from the database doesn't imply its non-existence or falsehood. Although some people have suggested that open-world requires nVL, I don't think we need to go that far. After all, that would require us to ditch the relational model for scientific data - even though there is no practical alternative.

Pascal makes a series of criticisms of the way in which SQL handles NULL, and these criticisms must be considered insofar as they are potentially relevant to the use of 'NULL' itself. It must be noted that SQL in itself is a language developed in a (flawed) attempt to implement relational theory. Although it has been taken up very widely by the developers of database management systems, SQL must never be confused with relational theory - and criticisms of SQL, even if upheld, are not necessarily of general relevance. Pascal's 2004 paper contains (page 9) a set of 10 specific criticisms (italicised in the list below) of SQL and its incomplete and inconsistent 3VL. Each of Pascal's items will be examined below to identify any general points which might arise in relation to the use of NULL:

  • Aggregate functions e.g. SUM(), AVG() ignore NULLs, COUNT() does not. However useful they may be, aggregate functions have nothing to do with relational database theory. Strictly speaking they are applications, and it should be possible to define whatever behaviour and rules are required into appropriate aggregate functions.
  • A scalar expression on a table without rows evaluates incorrectly to NULL, instead of 0.This behaviour of SQL may indeed be wrong, but this says nothing about whether NULL should be acceptable or otherwise.
  • The expression NULL=NULL evaluates to NULL, yet ORDER BY treats NULLs as equal.NULL=NULL correctly evaluates as NULL. The behaviour of ORDER BY is immaterial. In a relvar the ordering of tuples is irrelevant - this is merely an export/presentation matter. In any case, the ordering of NULLs is by definition arbitrary and whatever ordering is chosen by SQL cannot be 'wrong'.
  • SQL's NOT is not "not" of natural language. This is a problem with SQL, and not relevant to the NULL question.
  • It's unknown how SQL's EXISTS should behave, because no 3VL is defined for SQL (the definition is ambiguous in 3VL systems). Again this is an SQL problem because it attempts to use a 3VL, but is irrelevant to the question of the acceptability or otherwise of NULL.
  • Expressions evaluating to NULL do not violate integrity constraints (NULL is treated as true). NULL is of course not in itself a truth value, and expressions in a 2VL cannot evaluate to unknown. This is an SQL problem rather than a problem with NULL. The only expressions that can evaluate to NULL are those that are of a defined data type. Given a truth-value data type, it can indeed contain NULLs but these data items do not have any truth value at all - by definition - and they are certainly not of truth value 'unknown' in the SQL sense.
  • The truth-valued data type can take the values true, false, and the non-value NULL, but the literal NULL cannot appear in contexts in which any literals can appear. This is a problem with SQL and SQL-based DBMSs. It violates the principle that NULL must not be capable of confusion with any valid data representation. This principle must be rigidly adhered to in any truly relational DBMS. Codd's marks, kept separately from the data, are one solution to this problem.
  • Since 3VL referencing rules are not defined, it's unknown which primitive operators out of the 27 monadic and 19,683 dyadic would suffice to express others and, therefore, whether all necessary operators are supported or not. This is a possibly valid criticism of SQL's 3VL, but does not bear directly on the NULL question.
  • Highly complex operators. This is a matter concerning SQL as an implementation language but irrelevant to the NULL question.
  • Thus it appears that none of Pascal's criticisms of SQL's NULLs have any general validity as criticism of the use of NULL in relvars - though they do contribute to a very strong case that SQL does not offer an appropriate environment for handling databases whose relvars may contain NULLs. They are criticisms of SQL, not of NULL.

They do of course leave open the question of whether or not NULL itself is acceptable (i.e. quite separately from its flawed implementation in SQL). In the observational sciences, such as geology, there are very often gaps in knowledge - missing data. This is the nature of science. Sometimes the data are not altogether missing, or there are qualifiers explaining the reason for its missingness. But this is another story for another day. It is possible that the closed-world assumption built into database management systems for business applications, is itself inappropriate, and that an open-world assumption should be used for observational science data. If that is the case, then some writers have suggested that a 2VL is inadequate and we need to look at nVL solutions. I would not go that far yet - since that implies throwing out the entire relational model for scientific data. This seems a little extreme. There is another option - to use the relational model, and accept the prohibition on NULLs (where they are strictly placeholders for items on which nothing at all is known), but allow for rather richer and more intelligent logic to handle incomplete data.”

Note on re-publication: The specific points raised have been extensively discussed over the years -- the reader is referred to the references listed in our paper and in the chapter on missing data in my third book, PRACTICAL ISSUES IN DATABASE MANAGEMENT, listed on the BOOKS page.

Fabian Pascal
I’ll let David handle the deeper CWA issue, and respond to the more mundane ones. Many-valued logics (nVL where n>2) -- which, as our paper argues, reduce to three-valued logic (3VL) for database purposes -- is inconsistent with two-valued logic (2VL) of the real world in which the RDM is grounded. Even if a consistent 3VL were defined, inferences made by DBMSs implementing it would not be correct with respect to the real world. Because humans think in 2VL, the complexity and unintuitiveness of 3VL inevitably induce errors, the NULL implementation flaws in SQL being a reflection thereof and practically no discussion of NULL escapes them (see the example below). SQL's NULL implementation is neither a well-defined, nor consistent 3VL (and is used as 4VL to boot!), but that is due to the inevitable errors inherently endemic to many-valued logics. This is reason enough to eschew them, given the relational 2VL solution that we proposed in the paper (and which is consistent with McGoveran's new understanding of the RDM).

David McGoveran
At the ontological level, yes, the scientific method is an "open world". But in theory making it is crucial that the presumed universe of discourse be carefully defined. This step is often overlooked when scientists are working in a particular field, because it is "understood" and within the "culture", but even within the most narrow scientific fields errors and miscommunication can result. The universe may be iteratively redefined, but permitting enlargement (or contraction) without re-evaluation of the impact on other hypotheses (axioms and therefore database design in the database world) is a good way to draw erroneous conclusions. Indeed, a changeable universe of discourse requires second order logic which permits paradoxes (i.e., contradictions not subject to resolution, therefore predictions that can't be disproved by evidence, which is NOT science).

While reasoning within any theory the closed world assumption (CWA) holds.  Its universe of discourse is determined by the entities and properties/relationships modeled (and, consequently their formal representation (dependencies constraints, domains of permissible values and so on). So, if you build a database for testing a theory (however "imprecise") and find real world data that either doesn't fit or creates a contradiction, you've invalidated the theory. You should then redesign the database to meet the next hypothesis. This may be as simple as dropping a constraint, or require an entirely new design with different entities: It all depends on how broad the initial theory used for the design, and how radical (relatively speaking) the new hypothesis. Welcome to science.

Point: In a scientific database such as the type to which you allude, a reasonable interpretation of True and False under CWA is "valid by experiment and consistent with hypotheses" and "not validated by experiment or inconsistent with hypotheses". If you give this differentiation up with CWA and NULLs, you've given up scientific reasoning and the scientific method.

Usually this doesn't have to be burdensome. Ordinary theorizing and experimentation makes reasonable assumptions about domains. At the theory level, a domain might be "infinite", but at the level of experiments that always terminate in some finite measurement, and at the level of the database which resides on a real computer with real physical limitations used by real people with finite brains and time, the domain can be given a bounded representation that will suffice for many hypothesis within the broader theory.

Don't confuse CWA with an assumption that we must know everything and embed that knowledge in a permanently fixed database schema. All we need is understand the current boundaries on our application or investigation. If we don't understand that, we're doing philosophy, not science.

SQL's NULL -- just one example

(originally posted on 2/4/05)

“I've been reading your paper with great interest... Haven't finished it completely yet (it's getting late here), but I surely will tomorrow. Nevertheless, there is a mistake regarding NULLs and the SQL standard though.

According to your paper, page 9, COUNT does not ignore NULLs. That is not entirely correct. According to the SQL standard, NULLs are ignored if that's the only thing your counting. This does get you pretty weird results though - have a look at this example:
CREATE TABLE null_count
(id INTEGER);
INSERT INTO null_count(id) VALUES (1);
INSERT INTO null_count(id) VALUES (2);
INSERT INTO null_count(id) VALUES (NULL);
INSERT INTO null_count(id) VALUES (NULL);
This returns 4:
SELECT COUNT(*)
FROM null_count;
This, however, returns 2:
SELECT COUNT(id)
FROM null_count;
And this returns 0:
SELECT COUNT(id)
FROM null_count
WHERE id IS NULL;
Unlike this, which returns 2 again:
SELECT COUNT(*)
FROM null_count
WHERE id IS NULL;
And you mention that x IS NOT NULL is not the same as NOT(x IS NULL). I'm unsure if this is the case, as the following two statements return the same for me:
SELECT *
FROM null_count
WHERE NOT (id IS NULL);
and
SELECT *
FROM null_count
WHERE id IS NOT NULL;
Just letting you know.”

C. J. Date

“COUNT(*) does not ignore NULLs.

x IS NOT NULL and  NOT(x IS NOT NULL) are the same if the tuple X has degree one. Not otherwise. Note that X here is always a tuple (in the standard), not, as most people believe, a scalar!

Point is, there are two operators in SQL both (lovely) called COUNT:

COUNT(*)
COUNT(…)
and they behave differently (different semantics).”

 

 

 

 

No comments:

Post a Comment

View My Stats