Thursday, November 11, 2021

Nobody Understands the Relational Model: Semantics, Relational Closure and Database Correctness Part 2



 with David McGoveran 

(Title inspired by Richard Feynman)

In Part 1 we explained that all database relations are, mathematically, relations, but not all relations are database relations, which are in both 1NF and 5NF and we agreed with a statement in a LinkedIn discussion ending as follows: "Update anomalies are not as big of a problem as an algebra where relations aren't closed under join". Unfortunately, update anomalies, closure, and how relational operators were defined are all interrelated and represent an even "bigger problem". Update anomalies are not "bugs", let alone irrelevant, but actually a reflection of  that much bigger problem.

In this second part we delve into that problem.

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

- 11/05 OBG: Database Consistency and Physical Truth

- 10/27 Nobody Understands the Relational Model: Semantics, Relational Closure and Database Correctness Part 1

- 10/09 Relational and Referential Integrity

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

An algebra is closed if and only if the class of objects it operates on and of the objects it produces is the same. The RA certainly should have the property of closure: all relational expressions on relations produce relations. But this implies that the relational operators cannot have anomalies. Anomalies indicate that the operations and the class of both input and output objects over which they are defined do not form a properly defined algebra. As relational theory stands, we have treated normalization beyond 1NF and update anomalies as -- somehow -- outside or independent of the definition of RA, despite simultaneously claiming that anomalies need to be systematically avoided or prevented. In fact, update anomalies should be called the semantic errors they are, and it should be recognized that they are built into, or inherent, in the traditional definition of the algebra.

The fact that normalization beyond 1NF is not preserved by relational operations (i.e., the RA is not closed over database relations) means that fully normalizing a set of base relations just defers the appearance of those semantic errors until re-operating on the relations derived from them with RA. What we need is an improved RA with closure over database relations. In the meantime, we should be doing everything we can to ensure that every relational expression we use produces a database relation.

So, first, given the POFN mandate, in RDM a proper relational algebra (RA) would be closed iff it operated on 1NF+5NF (i.e., database) relations and produced 1NF+5NF relations (i.e. 5NF closure) (McGoveran has defined such a RA but has not yet published it).

And second, because databases carry real world meaning, for consistency a RA must preserve information, namely every relationship in the conceptual model, whether formally expressed as a constraint, or the special type of constraint we call a dependency. Otherwise RA operations will lose the semantics of the database design and the DBMS cannot guarantee correctness. The complete information (business rules) expressed formally as semantic constraints (domain, attribute, key, relationship, dependency and so on) is formalized in relation (and database) predicates (RP, DP).

Note: Information expressed declaratively in FOPL based data sublanguage and enforced by the RDBMS is not lost by RA, but there are many ways non-declarative dependencies can be, or result tables can imply erroneous dependencies. This subject is complex and beyond the scope of this post.

“Closure, consistency and RA operators must be defined to work seamlessly together, and a tabular representation is not expressively powerful enough to capture all the relationship information required -- it has to be augmented with the RPs and DPs. This is why I've always argued that unique names for relations, attributes, domains, etc. are not sufficient -- predicates are required and so, names and unique naming must never be used as anything other than a unique mnemonic for a formal predicate known to and used by the RDBMS.”
--David McGoveran
So a proper RA must have 5NF closure and integrate RPs and DPs to preserve information. Alas, the "bigger problem" is that this is not the case for the RA as currently defined, conventionally "understood" and implemented in industry DBMSs (SQL). 5NF closure requires significant redefinition of more sophisticated RA operators and is part of McGoveran's work and forthcoming book (unfortunately, the early draft chapters do not include those on this specific subject).

As is obvious from the existence of update anomalies, 5NF closure is required for correctness. That, in turn, requires full integration of the RPs/DPs into every RA operation to avoid semantic loss and anomalies. But the RA operators were never defined to provide 5NF closure and they are not guaranteed to preserve any normal form other than 1NF. The original 1969 join was meaningful only for two 5NF relations, produced 1NF relations and there was no way to join two 1NF relations, which meant it was of very limited utility. Update anomalies existed even then, because 5NF relations joined to 5NF relations guarantees only a 1NF relation, so updating that result relation could produce anomalies; further joining that anomalous result with a third 5NF relation produces another 1NF result and so on.
“An update anomaly is just a situation in which the DBMS has lost information about dependencies and, given the definitions of RA operators, is unable to consistently enforce all the corresponding semantic constraints. This can only be fixed by re-defining the RA operations in a way that they are "cognizant" of all the dependencies (functional or other). Requiring all dependencies to be manifest in the ("physical") table representation isn't possible, unless a table can represent multiple relations and the constraints/dependencies among them. This is another way of saying that RPs (integrity) rules over tabular representation (structure)! Domains, attributes, and all dependencies must be taken into account and tabular representation is not expressive enough for the purpose.”
--David McGoveran

This is why I must keep telling practitioners that the meaning of a relation is not visible in its tabular presentation, nor is its normal form determinable except with reference to the conceptual model.

“Where can I find the definition of a "multi-relation result"? You linked to a post "on view updating", is that what you mean by the "missing data paper"? Because that doesn't answer that question either.
Note: The old post on view updating did not involve multi-relation results. I linked to it to illustrate one effect of update anomalies. By way of validation, the disagreement about whether view updating is a theoretically solved problem (McGoveran) or is not (Date) hinges precisely on the reliance by the latter on the tabular representation without sufficient RP augmentation. There are other differences between what I call the McGoveran and the Date interpretations of the RDM (e.g., TTM permits encapsulated nested relations which, while not a violation of 1NF raises other issues that are way beyond the scope of this post).

The missing data paper illustrates the workings of multi-relation results at which I arrived independently in the context of missing data. It proved consistent with earlier work by McGoveran:
“I described multi-relation results in the Nothing from Nothing series. The idea is that the DBMS automatically fully normalizes the usual results, potentially producing multiple relations. The easiest example is an union of arbitrary relations. A special case is an outer union, which normally forces inclusion of nullable attributes in order to produce a single "relation" -- not database relation! -- result. A better solution is for union to produce multiple database relations (each without NULLs). At some point, TTM added this concept, despite its having been rejected by Date in the early 1990s."
--David McGoveran
The post I linked to in a comment has an example of an equi-join view of two database relations. If, as we propose, a M:1 relationship between relations is represented the same as a M:N relationship, by an associative relation instead of foreign keys, a join re-defined for 5NF closure could produce a result of three database relations instead of the usual single relation result, and more easily preserve richer information with a tabular presentation.
“This last post you linked to was very insightful about the issues I've been talking about, and helps me understand this perspective a lot more. So after re-reading it I think I understand it now. I now think that the sentence: "relations can be manipulated mathematically as sets by the relational algebra (RA) with relations as results (relational closure)" is potentially ambiguous. I was (apparently incorrectly by your intent) reading it as:"relations can be manipulated mathematically as sets by the relational algebra and relations will result (i.e. there will be relational closure)". But the former link makes me think this reading is intended: "relations can be manipulated mathematically as sets by the relational algebra (correctly) only in those cases where relations are the result (i.e. where there is relational closure)."”

No, the first reading is correct. Under the McGoveran interpretation of RDM, the RA is re-defined for 5NF closure (i.e., database relations are closed under it), hence our contention that both base and derived database relations are fully normalized by definition and results are multi-relation. It might help to understand that every traditionally defined RA operation on 5NF relations that would produce a non-5NF relation can be re-written to produce a set of 5NF relations, thereby obtaining 5NF closure.

(Continued in Part  3)



References

First Normal Form in Theory and Practice (series)
Simple Domains and Value Atomicity
Normalization and Further Normalization (series)
5NF, Association Relations, and Join

Database Design: What It Is and Isn't

No comments:

Post a Comment

View My Stats