Monday, May 3, 2021

OBG: Don't Confuse Levels of Representation Part 2

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

This is a continuation of an email exchange with readers in response to my post Normalization and Performance: Never the Twain Shall Meet started in Part 1

On Normalization, Performance and Database Correctness

(originally posted 03/22/2001)

“I read your article on Normalization and database speed. The rules of normalization are used to provide a guide when designing a logical approach to managing information. If that information management plan (IMP) is to be implemented using a database, then the ultimate "test" of the IMP is its performance on the physical level. The IMP architecture is commonly expressed using an Entity Relationship Diagram or ERD. Sure CPU, RAM, and RAID all play an important role in database performance. But when a DBA changes the way tables, columns, or indexes are structured he is changing the IMP and corresponding ERD. Now there are two sides to this story.
  • The Pro De-Normalization group viewpoint: If the above change results in a "faster" database then denormalization has been "proven" to be necessary.
  • The Pro Normalization group viewpoint: A change is only valid if it maintains the integrity, and flexibility that are inherent in a fully Normalized database. For example, violation of the second rule "Eliminate Redundant data" can result in corrupt data when data in two or more locations is not properly update, changed, or inserted in all locations.
When the second rule is violated then data corruption must be safeguarded against. The safeguard must be implemented on the database level because that is the only way corrupt data can be prevented regardless of the source (i.e. data loaded from a flat file, DBAs and users logged onto SQL*Plus changing information, other application software that also has access to the database, and last but not least; application developers that are not aware of the duplicate data stored in more than one location). I usually implement the safeguard by using database triggers so that when a record is changes in one table a trigger will make the corresponding change in another table. I have found that most "performance gains" achieved by denormalization are eradicated by the use of the trigger needed to safeguard the integrity of the data. The bottom line is that De-Normalization can often increase database speed. But this is most often achieved at the expense of data integrity.”


DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

-03/15/21: Pruned the
POSTS page

-12/26/20: Added “Mathematics, machine learning and Wittgenstein to LINKS page

- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- The PostWest blog for monthly samples of global Antisemitism – the only universally acceptable hatred left – as the (traditional) response to the existential crisis of decadence and decline of Western  civilization (including the US).
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.


FP: The point of my article was that [logical design (i.e., normalization and denormalization) cannot possibly affect performance: only physical implementation can]. Period. Even when performance of fully normalized databases [which is what they must be to be relational] improves after denormalization, all effects are still due to implementation factors, but [lacking foundation knowledge] most practitioners miscredit denormalization (see below).

[Even practitioners who are aware of the correctness and integrity implications of denormalization -- most are not -- fail to enforce the additional constraints necessary to prevent corruption (they rarely know how to formulate them and SQL DBMSs don't support them). Such constraints, if enforced, would cancel out denormalization gains, making practitioners realize that they are trading integrity for performance, but they are not enforced.

I assume that what you enforce with triggers are the "regular" constraints such as PKs, referential and so on. But I dare you to show me one database where the special additional constraints imposed by denormalization are enforced for every denormalized relation. And without those all bets are off.

“Do you believe that when a DBA changes the way tables, columns, or indexes are structured he is changing the IMP and corresponding ERD?”


FP: ERDs are at the conceptual level [and are irrelevant in this context]. [Relations and attributes, not tables and columns] are at the logical level, indexes are at the physical level. You are thoroughly confusing levels of representation, which is what is causing you to link logical design with performance: Normalization and denormalization are purely logical constructs -- changes in the logical structure of databases. [Performance is determined exclusively at the physical level.] Reread the library analogy given in my original article (I don't know what you mean by "change way the structure of columns".

Had they been relational, DBMSs would have allowed DBAs to make any physical changes they saw fit to maximize performance, without affecting the fully normalized logical structure. But that's not true for SQL DBMSs -- they have poor implementation options and support of physical independence [see below]. But that is an implementation problem, not a normalization [i.e., logical design] problem.

I suggest you read the chapters on redundancy, integrity and normalization in my book  and if you have any specific questions on that, ask.

Comments on Republication

Currently I recommend my paper denormalization and all posts here on the subject (see a sample below).

Normalization (to 1NF) refers to "un-nesting" relations, further normalization (to 5NF) to "unbundling" data about distinct entity groups each to its own relation. They are unnecessary except as repair of poor designs.

Here's an example of how DBMS poor implementations and support of physical independence induce, in the absence of foundation knowledge, the "denormalization for performance" illusion (we shall assume that SQL tables are relations). Assume a fully normalized design (i.e., every entity group is represented in its own table) and the DBMS implements each table as a single ISAM file.

Note, first, that if the performance is poor, it is due to the physical implementation (ISAM files), not the logical design.

Second, if you denormalize" by bundling multiple entity groups from multiple into single tables, the data that was stored in multiple files is now stored in single files. If performance improves, again, it is that physical change that is responsible, not the logical re-design -- hence the "denormalization for performance illusion". Note very carefully that this happens because the DBMS supports a single implementation and imposes a relationship between a logical table and it. 

But now suppose a true RDBMS supports multiple implementations (storage and access methods) for a given logical design and any one can be chosen, or changed without it being affected (e.g., for two relations their join can be stored as a single file at the physical level and voila: the improved performance with full normalization -- what we refer to as physical independence. You can then change the implementation of a fully normalized database at will to maximize performance, thus preserving the relational benefits, foremost among them system guaranteed logical validity and by-design semantic consistency.

Further Reading

The Costly Illusion: Normalization, Integrity and Performance

Denormalization for Performance Don't Blame the Relational Model

Denormalization Fundamentals

Denormalization Database Bias, Integrity Tradeoff and Complexity

Data Independence and Physical Denormalization

Don't Confuse Levels of Representation


No comments:

Post a Comment

View My Stats