Friday, August 28, 2020

TYFK: Denormalization Does Not Have Fundamentals

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).
  “Main Question: How do we trade off while doing denormalization?
  • Sub-question 1: the standard to implement
- Do we always have to denormalize a model? For what kind of project must we use denormalization techniques, while others may not?
- Since denormalization has its gains and losses, how far should we denormalize a data model? Perhaps the more completely we denormalize, the more complex, uncertain and poor the situation will be.
  • Sub-question 2: the characteristics of normalization
- Does denormalization have several levels/forms the same as those of normalization? For instance: 1DNF, 2DNF...
- Given we can denormalize a data model, it may never be restored to the original one, because to do normalization one can have many ways, while to build a data model you can have multiple choices in determining entities, attributes, etc.”

In Part 1 we discuss the relevant fundamentals in which we will ground the debunking in Part 2.



Levels of Representation

We have long deplored the industry's common and entrenched confusion of the three levels of representation, which inhibits understanding of data management. To help avoid it, we recommend the following three-fold terminology:
  • Conceptual modeling (defines scope and ontology -- relevant objects and properties, including relationships);
  • Logical database design (defines domains, attributes, and relations and relationships among them -- whether expressed implicitly via structure, or explicitly via constraints -- so as to establish a logical model that is correct and consistent under the relational algebra (RA));
  • Physical implementation (determines throughput, response time, concurrency, availability, scalability, and recoverability -- all via choice of physical data structures that can be mapped to and from the logical structure without information loss: resource allocation/deallocation for data structures, memory blocks, file structures, indexes, concurrency control methods, recovery methods, and locality of reference).
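The logical/physical separation above can be sketched with a minimal sqlite3 example (the CUSTOMERS/ORDERS schema is hypothetical): the logical design declares relations and constraints, while a physical choice such as an index changes only access paths, leaving queries and results untouched.

```python
import sqlite3

# Logical design: relations and constraints (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- logical: key constraint
        name        TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    )""")
conn.execute("INSERT INTO customers VALUES (7, 'Acme')")
conn.execute("INSERT INTO orders VALUES (1, 7, 100.0)")

query = ("SELECT c.name, o.total FROM customers c "
         "JOIN orders o ON c.customer_id = o.customer_id")
before = conn.execute(query).fetchall()

# Physical implementation choice: an index affects access paths only.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
after = conn.execute(query).fetchall()

# Same logical model, same query, same result -- with or without the index.
assert before == after == [("Acme", 100.0)]
```

The point of the sketch is precisely what the list states: performance levers live entirely on the physical side, invisible to the logical model.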
DBMS selection has an enormous effect (1) on the utility of a logical model, depending on how correctly and completely the RDM is implemented and, therefore, on what workarounds and corruptions are forced on the database designer; and (2) on the physical options the DBMS provides -- data structures, indexes, allocation/deallocation, access methods, concurrency control and so on. Moreover, in a relational system (i.e., an RDBMS and a properly designed database):
“The one critical component that connects logical database design to the physical implementation is the DBMS optimizer, which has four key functions:
  • To translate any query language expression into a relational algebra (RA) expression (including relevant constraints);
  • To optimize that RA expression;
  • To map relevant attributes and relations to (or from) their physical implementations via optimal access methods (e.g., indexed, sequential, or hashed); and,
  • To optimize RA operations on relations via choice of a suitable algorithm for processing the data.
The optimizer is the "glue" that binds the logical design to the physical implementation and, ultimately, determines the implementation characteristics above. Notice that it must perform both purely logical and purely physical optimizations and connect them.” --David McGoveran
SQL DBMS limitations aside, performance problems usually arise from sub-optimal user exploitation of the physical options the DBMS offers, and from applications designed navigationally -- insisting on browse-and-update, record-at-a-time conditional logic, and so on, rather than set-at-a-time, declarative operations (the former may or may not improve response time, but it certainly harms optimization and throughput).
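The contrast can be sketched as follows (the `accounts` table and the 1% adjustment are hypothetical): the navigational loop hides the operation from the optimizer one record at a time, while the declarative statement presents it whole.

```python
import sqlite3

# Hypothetical table for contrasting the two styles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 50.0), (2, 150.0), (3, 300.0)])

# Navigational style: browse and update, record-at-a-time conditional logic.
rows = conn.execute("SELECT id, balance FROM accounts").fetchall()
for acc_id, balance in rows:
    if balance > 100:
        conn.execute("UPDATE accounts SET balance = balance * 1.01 "
                     "WHERE id = ?", (acc_id,))

# Declarative, set-at-a-time equivalent of the entire loop above:
#   UPDATE accounts SET balance = balance * 1.01 WHERE balance > 100;
```

Both formulations yield the same database state; only the second is a single expression the optimizer can plan as one set operation.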

The combination of poorly implemented non-relational DBMSs and the absence of foundation knowledge among both vendors and users induces logical-physical confusion (LPC) and the illusions of "denormalization for performance" (DfP) and of trading integrity for it.

Normal Forms and Correctness

Denormalization clearly refers to logical database design, since it is the "reverse" of normalization, which is logical. However, using it in connection with performance, which is determined exclusively at the physical level, reflects LPC.

Normal forms are properties of logical database relations (i.e., mathematical relations adjusted and applied to database management). As we have explained:

  • A relation in 1NF has attributes drawing their values from simple domains, the values of which are treated as atomic ("non-decomposable") by the data sublanguage (i.e., even if they have internal structure/components, they are not accessible to the language).
  • A relation in 5NF is one in which the only attribute dependencies that hold are functional dependencies of the non-key attributes on the primary key (PK) (informally, it represents a group of entities of a single type).
5NF implies 1NF.
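The informal 5NF criterion above can be checked mechanically: a relation fails it when some non-key attribute depends on another non-key attribute rather than solely on the primary key. A pure-Python sketch, with hypothetical employee data bundling two entity types (employees and their departments):

```python
# Hypothetical relation: (emp_id PK, dept_id, dept_name).
employees = [
    (1, 10, "Sales"),
    (2, 10, "Sales"),      # dept_name repeats with dept_id
    (3, 20, "Support"),
]

def fd_holds(rows, lhs, rhs):
    """True if attribute at index lhs functionally determines
    the attribute at index rhs in this set of tuples."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

# dept_id -> dept_name holds, so dept_name does not depend solely on the
# PK emp_id: informally, the relation mixes two entity types and is not 5NF.
assert fd_holds(employees, 1, 2)
```

A functional-dependency check like this only detects the symptom in the data at hand; the dependency itself is a property of the design, not of any particular population.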

The relational algebra (RA) operations were initially (1969) defined assuming that every relation was in 1NF, except for join, which was applicable strictly to (something close to) 5NF relations and was otherwise invalid. In effect, this meant that the set of RA expressions (including the now familiar operations of join, projection, restriction, and set operations like union) on a set of relations would require all those relations to be in 5NF, else the join operation would be disallowed. For (mathematical) reasons the RA was revised in later papers so that join was no longer so restricted, but it permitted loss of information (it appears that this situation was a consequence of exploring the causes of this loss of information and preventing it). Loss of information is a formal way of saying incorrect query results. By 1971, the time of Codd's paper on "further normalization" beyond 1NF, this concern with information loss ("update anomalies") had been generalized from a characteristic of RA operations (especially join) to a characteristic of relations themselves, as operands of RA operations.
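The "loss of information" at issue can be made concrete with the classic lossy-decomposition sketch (hypothetical supplier data): projecting a relation onto overlapping attribute sets and re-joining can *gain* spurious tuples, i.e., incorrect inferences.

```python
# Hypothetical relation over (supplier, part, project).
supplies = {
    ("S1", "P1", "J1"),
    ("S2", "P1", "J2"),
}

# Two projections sharing only the attribute `part`.
sp = {(s, p) for (s, p, j) in supplies}   # (supplier, part)
pj = {(p, j) for (s, p, j) in supplies}   # (part, project)

# Natural join of the projections over `part`.
joined = {(s, p, j) for (s, p) in sp for (p2, j) in pj if p == p2}

# The join "gains" ("S1", "P1", "J2") -- a tuple the original relation
# never asserted: an incorrect inference from a non-lossless decomposition.
assert ("S1", "P1", "J2") in joined
assert ("S1", "P1", "J2") not in supplies
```

This is the operand-level version of the concern the paragraph describes: whether join preserves information depends on properties of the relations it is applied to.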

"Unlike a 1NF relation, a 5NF relation need not be susceptible to update anomalies if the engine of a true RDBMS is properly implemented. All RA operations (and in consequence relational expressions) should preserve information. Once this property of RDM is nailed down, all RA operations can be properly defined so that relational closure is obtained without possibility of loss of information. Although this is a consequence of the RDM, the RDM must be slightly enhanced by a minor re-definition of the operations such that they are closed over and produce 5NF relations and thus are information preserving. This is one of the objectives of my current work.” --David McGoveran
In other words, to ensure correct query results, (1) relations must be in 5NF and (2) RA operations must be closed over them.

Note: Closure is a possible property of an abstract algebra, which consists of a set of operations, each operating on some number of objects (operands) of given types and producing a result of some type. If, for some operation, all the operands have the same type, the operation is said to be closed over that type (e.g., the numeric algebra called arithmetic is closed over, and produces, numbers). In the RDM the relational algebra -- and so all its operations -- is closed over relations. Without this property, RA operations would not be arbitrarily nestable (e.g., as in sub-queries).
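Closure can be illustrated with a toy algebra (the representation is an assumption for the sketch: a relation is a pair of an attribute tuple and a set of value tuples). Because every operation takes relations and returns a relation, expressions nest arbitrarily -- the analogue of sub-queries.

```python
# Toy relational operations, each closed over the relation type.
def restrict(rel, pred):
    attrs, rows = rel
    return attrs, {r for r in rows if pred(dict(zip(attrs, r)))}

def project(rel, keep):
    attrs, rows = rel
    idx = [attrs.index(a) for a in keep]
    return tuple(keep), {tuple(r[i] for i in idx) for r in rows}

# Hypothetical relation.
orders = (("order_id", "customer", "total"),
          {(1, "C1", 50.0), (2, "C2", 200.0), (3, "C1", 120.0)})

# Nested expression -- well-typed only because of closure:
big_customers = project(restrict(orders, lambda t: t["total"] > 100),
                        ("customer",))
assert big_customers == (("customer",), {("C1",), ("C2",)})
```

Remove closure (say, have `restrict` return a bare list of rows) and the composition `project(restrict(...))` no longer type-checks -- which is exactly why non-relational results break arbitrary nesting.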

Note very carefully that not only updates, but retrievals can also lose information.

“Every update can be formulated in terms of a relational expression (a read-only expression) and relational assignment (necessarily an update). When an RDBMS processes a relational expression of even moderate complexity (e.g., involving a join), it often creates temporary relations and updates them. If there are non-relations in the database (e.g., SQL tables), voila: update anomalies! Accepting each intermediate result as a relation when it isn't just compounds information loss -- the more the nesting of expressions, the more denormalized the results -- either something valid (i.e., a correct inference) is lost, or something invalid (i.e., an incorrect inference) is gained, or both. The consequences for query results may be difficult to detect, or can even be masked in SQL DBMSs. In the more obvious cases, vendors work hard to correct them internally. However, it can be shown that it is impossible to prevent all of them all of the time. If you value correctness, repairing the logical design (i.e., full normalization) is critical.” --David McGoveran
This is why we currently contend that database relations should be in 5NF by definition; otherwise they should be considered non-relations, and all bets are off.

As we mentioned elsewhere, an as-yet-unproven McGoveran conjecture holds that adherence to his three core logical database design principles produces 5NF databases (though a 5NF database does not imply compliance with the principles). If the conjecture is true, correct logical database design (i.e., adherence to the three principles) obviates the need for explicit normalization (to 1NF) and/or further normalization (to 5NF), which become necessary only to "repair" bad designs (i.e., those that did not adhere to the principles) -- though we warn that such cases probably require full conceptual re-modeling.

If logical correctness requires 5NF relations and performance is determined at the physical level, what induces the LPC underlying DfP -- why denormalize?

Anomalies vs. Redundancies

First, as far as we know, claims to the contrary notwithstanding, until his 1990 RM/V2 book Codd did not mention elimination of redundancy as motivation for further (i.e., beyond 1NF) normalization, but emphasized admissible (consistent) states of the database (i.e., correctness). He motivated and explained "further normal forms" by the objective of removing update anomalies (he called them "dependencies"), "...making ... relations easier to understand and control, simpler to operate upon, and more informative...". Unfortunately, this was later forgotten, and anomalies came to be explained in terms of (physical!) redundancy, mistaking the logical objective of full normalization for physical efficiency (i.e., saving storage space). For example:
“... And usually, going to [higher normal form] ... will improve performance as well ... because it will reduce the size of your data.”
“Database normalization is the process of efficiently organizing data in a database ... Eliminating redundant data ... reduce the amount of space a database consumes and ensures that data is logically stored.”
Second, industry tools and practice completely conflate logical database design with physical implementation:
“There are many physical structures that could be used to represent a logical relation in storage (e.g., a linked list with multiple orderings, or an indexed file storing the join of the relations and so on, all maintained transparently by the DBMS), but products do not offer such choices -- only one is usually possible (essentially one or two flat files, possibly ordered by some key in SQL DBMSs). Because of this 1:1 mapping of a logical model to one specific physical representation, practitioners end up modifying the former to manifest a change in the latter, which creates the (false) impression that denormalization improves performance.” --David McGoveran
Take a SQL DBMS with direct-image representation: the CUSTOMERS and ORDERS logical tables are stored as two flat files. To avoid joins, practitioners denormalize them into one logical non-relation. Because of the 1:1 mapping, the non-relation is stored in one file. It is this physical change from two files to one -- not the logical denormalization -- that may affect performance, but it creates the DfP illusion, distracting from the responsibility of the DBMS (limited storage options, weak physical independence) for poor performance (and from its vendor's obligation to improve relational fidelity and implementation). Practitioners accept trading correctness for performance as a solution, and blame the RDM and normalization for it.
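The integrity cost of that trade can be sketched with sqlite3 (hypothetical columns): the denormalized non-relation repeats customer data on every order, so a partial update leaves the database asserting two contradictory "facts" about one customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized non-relation: two entity types bundled in one table.
conn.execute("""
    CREATE TABLE customer_orders (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER,
        customer_name TEXT,          -- repeated per order: redundancy
        total         REAL
    )""")
conn.executemany("INSERT INTO customer_orders VALUES (?,?,?,?)",
                 [(1, 7, "Acme", 100.0), (2, 7, "Acme", 250.0)])

# Update anomaly: the customer is renamed on one order only.
conn.execute("UPDATE customer_orders SET customer_name = 'Acme Corp' "
             "WHERE order_id = 1")

names = {n for (n,) in conn.execute(
    "SELECT DISTINCT customer_name FROM customer_orders "
    "WHERE customer_id = 7")}
# One customer, two names: the DBMS can no longer enforce consistency.
assert names == {"Acme", "Acme Corp"}
```

In the fully normalized design the name is a single fact in one CUSTOMERS row, and no such inconsistent state is representable.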


Denormalization cannot improve performance; it is a workaround for DBMS limitations and poor practices, and should not be accepted as a solution -- it obscures them and inhibits improvements.

DfP is, in fact, a misnomer. Practitioners should use physical optimization for performance instead -- this would dispel the very harmful LPC underlying DfP -- and focus on where real progress can be achieved: relational and implementation improvements of DBMSs and practices, including applications that take full advantage of the RDM. There will be no progress without education of vendors and users.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

Further Reading

The Costly Illusion: Normalization, Integrity and Performance

Database Design: What It Is and Isn't

Denormalization: Database Bias, Integrity Trade-off and Complexity

Denormalization for Performance: Don't Blame the Relational Model

Normalization and Further Normalization series
