Saturday, January 19, 2019

Data and Meaning Part 4: Query and Result Correctness




As we have seen in Parts 1, 2, and 3, the RDM is a formal theory adapted and applied to database management: database relations (1) preserve the formal properties of mathematical relations, but also (2) have interpretations -- carry a real world meaning assigned by a conceptual model: facts about entities, entity groups, and multigroups (i.e., their properties, some of which are relationships, specified by business rules (BR)). A relation is formally in 5NF and constrained for semantic consistency (i.e., to represent facts about an entity group).
“When we create specific domains, relations, and attributes we are constraining (restricting) an abstract logical system to a specific interpretation (meaning). Seen the other way around, an interpretation of the logical system is a representation of a specific segment of the world, and that is exactly the purpose of database design. For example, an attribute name created by the designer is assigned meaning intended by the modeler as representing an entity property, which is the very meaning of semantics. That is why full normalization cannot be achieved or assessed without reference to some conceptual model -- what attribute names mean, and how they are related to each other (i.e., their dependencies), and so on.” --David McGoveran
Yet requesting and giving design advice without a conceptual model is routine in the industry[1]. What is more, most practitioners are oblivious to the implications for correctness of queries and results[2].


------------------------------------------------------------------------------------------------------------------
SUPPORT THIS SITE 

Up to 2018, DBDebunk maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 the website was has been discontinued. The content is not available anywhere else, and if you deem it useful, particularly if you are a regular reader, please ensure its continuation and free availability by supporting it with as much as you can afford via purchases of publications, or donations. Thank you.

LATEST PUBLICATIONS 

· Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING OF THE REAL RDM series, is available for ordering here.
· Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING OF THE REAL RDM series, is available for ordering here.

SOCIAL MEDIA 

I deleted my Facebook account. You can follow me on Twitter:
  • @dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back.
  • @ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.
HOUSEKEEPING
  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
------------------------------------------------------------------------------------------------------------------

Here's an example from actual practice. Table MPR displays data of only one region in the corporate database of a distributor of products by various manufacturers to retailers in multiple regions.

 MPR
---------------------------------------
 MFGR             PROD      RET
=======================================
 Morphy Richards  Iron      Debenhams
 Morphy Richards  Iron      Selfridges
 Russell Hobbs    Iron      Selfridges
 Russell Hobbs    Kettle    Selfridges
---------------------------------------

Tables MP and PR display this data as recorded in the region's database.

 MP
--------------------------
 MFGR             PROD
==========================
 Morphy Richards  Iron
 Russell Hobbs    Kettle
--------------------------
 PR
---------------------
 PROD     RET     
=====================
 Iron     Debenhams   
 Iron     Selfridges
---------------------

An analyst joins MP and PR and obtains table MPR' with a row that is not in MPR.

 MPR'
---------------------------------------
 MFGR             PROD      RET
=======================================
 Morphy Richards  Iron      Debenhams
 Morphy Richards  Iron      Selfridges
 Russell Hobbs    Iron      Debenhams
 Russell Hobbs    Iron      Selfridges
 Russell Hobbs    Kettle    Selfridges
---------------------------------------

He interprets the row as an insight ("valuable sales are concealed by the corporate database"). A consultant argues, however, that "the projections are an error of overnormalization": “...a few had heard of 5NF, but no one had any idea what it was, other than some esoteric data rule that might have some relevance in academia, but none in a commercial enterprise”; querying individual projections is OK, but joining them produces "false" rows.

Test your foundation knowledge: who is right?


Correctness = Truth and Provability


We query databases to make inferences: from premises -- facts recorded in the database (axioms), data manipulation derives conclusions -- facts that are logical implications of the axioms (theorems). Query results are correct (i.e., the derived facts are true) if and only if:
  • The database facts are true;
  • The derivation is logically valid (i.e., the conclusion is provable from the premises via logic rules of inference).

Truth is sensitive to interpretation (intended meaning) of the data: a fact that is inconsistent with the BRs by definition cannot be true, but a consistent fact can be false. That is why we rely on both:
  • The database designer to formalize the BRs as constraints[3], and the DBMS to enforce them; and,
  • Trusted authorized users to record only true facts in the database[4].

Logical validity, on the other hand, is strictly a matter of provability: conclusions follow, via logic rules of inference, from the premises. It does not depend on what the data means, only on the soundness of the inference system implemented by the DBMS (i.e., data manipulation).

Given a conceptual model, courtesy of the RDM's SST-FOPL formal foundation, if the database designer and users do their part for database facts, a true RDBMS enforces the constraints corresponding to BRs and implements the relational algebra (RA), guaranteeing correctness of query results (i.e., truth of the derived facts), one of the core advantages of the RDM[5] that practitioners fail to appreciate.

It follows that without a database designed for consistency with a conceptual model, and the soundness of a RDBMS, there is no system-guaranteed correctness.

All Bets Off


Given a semantically consistent (i.e., properly constrained) relation, as a consequence of 5NF the join of its complementary projections recovers the original relation (why?). Thus, if P1(PK,A1,A2) and P2(PK,A3,A4) are complementary projections of relation R(PK,A1,A2,A3,A4) then:
P1 JOIN P2 = R
if the projection and join are RA operations. Thus, if (1) MPR is a R-table (i.e., displays a semantically consistent relation), and (2) RA projections and join are applied, the result is MPR.

As we have seen in Part 3, MPR would be a bona-fide R-table under only one out of several possible models but, as is common, the example specifies no model, and recall that the model cannot and should not be inferred from tables. Given the state of knowledge in the industry (and in the company), even if there was a model, the chance that (1) it was the one under which MPR is a R-table, and (2) included all the pertinent BRs, and (3) all the corresponding constraints were formalized, is practically nil. Moreover, there are no true RDBMSs, and SQL DBMSs are not relational[6]: they do not enforce all relational constraints[7], nor do they manipulate data in strict conformity with the RA[8]. It is hardly surprising, therefore, that MPR' ≠ MPR.

Note: For the results to be correct, the queries themselves must be meaningful under the conceptual model. Consider the Cartesian product RA operation: each tuple of one relation is concatenated with every tuple of another relation. The result is logically valid -- in fact, Cartesian product is part of joins -- but on its own carries no real world meaning.

So the analyst is wrong. But the consultant is not quite right either.

Full normalization means, informally, that a relation represents facts about entities of one type (i.e. one entity group), but -- as we contend today -- 5NF is a necessary but insufficient formal requirement of the RDM[9], and is well defined. Consequently, 1NF-4NF -- which mean "bundled" representation of facts about more than one group -- are violations of the RDM that lose relational advantages, including system-guaranteed correctness. But they are also formally well-defined, which is why we can repair <5NF designs by further normalization (i.e., "unbundling" groups each to its own 5NF relation)[10].

"Overnormalization", on the other hand, is not a formal, well defined concept. If, as implied, it refers to complementary base RA projections, they represent less facts about one group, which is not the same as facts about less than one group, which would follow from what full normalization is; moreover, while they have practical drawbacks (which?), unlike 1NF-4NF, they are relations, and, thus, preserve relational advantages and correctness.

But if that is not what it means, it's not clear what else it could mean. And given that MPR is not a R-table, and/or that SQL projections are not RA projections, "overnormalization" outside the RDM is a misleading term anyway.

Note: We ignore constraint inheritance, for the purpose of this discussion, which a true RDBMS would support, but SQL DBMSs do not[11].
 

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


References

[1] Pascal, F., Don't Design Databases Without Foundation Knowledge and Conceptual Models.

[2] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS: A REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER.

[3] Pascal, F., Relationships and the RDM Part 2: Integrity Constraints.

[4] Pascal, F., Don't Confuse/Conflate Database Consistency with Truth.

[5] Pascal, F., SQL Sins.

[6] Pascal, F., What Is a True Relational System (and What It Is Not).

[7] Pascal, F., To Really Understand Integrity, Don't Start with SQL.

[8] Pascal, F., DISTINCT and ORDER BY Are Not Relational.

[9] Pascal, F., Database Design: What It Is and Isn't.

[10] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.

[11] Pascal, F., Integrity Is Not Only Referential: DBMS vs Application Enforced Constraints.




No comments:

Post a Comment

View My Stats