Saturday, April 8, 2023

MISSING DATA: RDM VS SQL -- A REAL WORLD COMPARISON (t&n)



Note: "Then & Now" (t&n) is a new version of what used to be the "Oldies but Goodies" (OBG) series. To demonstrate the superiority of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, as well as the disregarded evolution/progress of RDM, I am re-visiting my old debunkings, bringing them up to the current state of knowledge. This will enable you to judge how well arguments have held up and realize the increasing gap between industry stagnation -- and scientific progress.

(This a revised version of an earlier post with clarity improvements).

Q: “What would you suggest for a datetime field where the value is not known and should therefore be not-applicable?”
A: ”NULL sounds good to me.”
While searching through records I came across an old consulting project involving the migration of a neo-natal research database from Focus -- an old hierarchic DBMS -- used to record extensive details about hundreds of monthly births at a university hospital for more than 20 years. The person who had designed the Focus database was the only one who knew and understood its complexity sufficiently to maintain it. Each time a researcher needed some subset of data to analyze, he would extract it and serve it upon request. Aside from the inefficiency of the process, the person was retiring at a time when hierarchic DBMSs reached the end of their usefulness, Focus experts were already few and expensive and "relational" (read: SQL) was the dominant fad. 

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

03/31 I Left Ceausescu's Romania for AI Algorithms??? At Least Him I Understood!!

03/26 RELATIONSHIPS AND THE RDM V2 PART 1: RELATIONS & DATABASE RELATIONS

03/19 ON PROPERTIES IN CONCEPTUAL MODELING (rm)

UPDATES

04/03 Added First OrderLogic to LINKS page

04/03 Added Mathematical Logic - Reasoning in First Order Logic to LINKS page

03/26 Added Modeling of Integrity Constraints Dependencies to LINKS page

03/14 Added Russell’s On Denoting to LINKS page

03/14 Added Russell’s Paradox to LINKS page.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)

08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I re-published only the 2017 columns @dbdebunk, and within them links to sources external to AllAnalytics may or may not work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
------------------------------------------------------------------------------------------------------------------

 I was hired and tasked with designing the SQL database to replace it. It took about two years to extract from the person's mind all the information necessary to reconstruct the conceptual model to be represented by the new database at the logical level (too complex to documents and, besides, no documentation = job security).

With perfect knowledge of the reality being recorded -- births details -- the conceptual model of a multi-group -- a collection of related entity groups, each representing entities of a single type -- each group would have formalized as a database relation (in 5NF), the attributes representing the research variables being recorded for analysis.

"Inapplicable Data"

There was, however, a significant complication: new variables/columns were continuously being added and old ones dropped. In a one-relation-per group design, each new attribute would have ended up as 'inapplicable data' prior to its introduction and each dropped attribute subsequent to its discontinuation. Consider, for example, the Admissions entity group: six admission variables were recorded since the database’s inception in January 1976, with one variable added in March 1995. Mapping the group to one database relation would yield

ADMISSIONS {ADM#,PAT_ID,CLASS,SPONSOR,HOSP_ID,THOSP_ID}

where THOSP_ID represents the variable added in 1995. With hundreds of admissions per month recorded over 20 years, ADMISSIONS would not be a database relation because in SQL it would contain a large number of NULLs. For the Deliveries group -- the bulk of the data -- new variables were added in 1/76, 1/78, 5/78, 1/80, 9/80, 10/80, 7/86, 8/89, 4/94, 3/95 and others dropped on other various dates.

We have already established, however, that "inapplicable data" is an artifact of poor design: if the data does not apply, the entities do not have the pertinent properties, so why should the relation represent them in the database? In other words, such design would be contradictory -- a failure to represent accurately the conceptual model ("NULL VALUE" IS A CONTRADICTION IN TERMS)

Note: No DBMS can prevent users from such designs -- this is a matter of user discipline; even if the DBMS prohibits NULLs (which true RDBMs should and would) it's always likely that users would use default/special values, which is not a solution either.

The quote at the beginning of this post is one of the reactions I got during the project. Aside from NULL violating the RDM (MISSING DATA: MANY-VALUED LOGICS AND NULL), even if you tried to stick to it, it would still require four-valued logic (4VL) -- true, false, inapplicable, unknown -- that SQL does not support and for good reason: it is mind-bogglingly more problematic than SQL's 3VL (which is sufficiently confusing itself to have induced an erroneous SQL implementation (READ MY LIPS: IF THERE'S NULLs, IT'S NOT RELATIONAL).

The conceptualization that should obviate such designs is entity supertype-subtypes (ESS) relationship. That would produce the design:

ADM1976 {ADM#,PAT_ID,CLASS,SPONSOR,HOSP_ID}
ADM1995 {ADM#,THOSP_ID}


with the necessary mutual-exclusivity constraints, where ADM1978 would represent the entity supergroup with data since inception and ADM1995 a post-1995 entity subgroup -- a design obeying THE PRINCIPLE OF ORTHOGONAL DESIGN.  But SQL DBMSs do not support ESS relationships either. What is more, the dropping of variables at random times posed an additional complication -- the ESS relationships changed with time!

The complexity and dynamics of such realities are the reason why the DBMS, not users/applications, should manage them. For reasons beyond the scope of this discussion we believe that a truly relational DBMS can not only support ESS(including time-varying) relationships, but do it in 2VL and without NULLs and we outlined a truly relational solution in (THE FINAL NULL IN THE COFFIN: A RELATIONAL SOLUTION TO MISSING DATA) and (MISSING DATA AND MULTI-RELATION QUERY RESULTS).

In its absence, SQL systems will relegate data management functions to users and applications -- precisely what the RDM was introduced to prevent. Even if users adhered relational design principles, the problems of unknown (truly missing) data would still be there to contend with. And the only way to guarantee correctness and minimize unnecessary complexity is for both the DBMS and user practices to comply with RDM, which means 2VL, which means our proposed solution.

 

 

 

 

No comments:

Post a Comment

View My Stats