Sunday, June 29, 2014

Denormalization: Database Bias, Integrity Tradeoff and Complexity

The common and entrenched misconception about normalization was recently visible yet again in a LinkedIn exchange.
R: Unless the need is for ACID compliant transactions, denormalization is generally not considered logically, physically or whatever-ally, so essentially a thoroughly normalized model is relevant for a write-infrequently consumption of data and data integrity can be guaranteed by design.

Points arising:
  • ACID is separate and orthogonal to normalization.
  • Denormalization has costly practical drawbacks for any database, whether it is being updated (transactions), or is read-only (see next).
  • There are at least three inter-related drawbacks to denormalization (see The Costly Illusion: Normalization, Integrity and Performance).
1. Redundancy: Data values are stored more than once
2. Update anomalies: Certain reasonable data operations are not possible
3. Complexity/proneness to errors: Harder to understand, protect the integrity of and query databases
Out of these, only 1 and 2 pertain primarily to updates. 3 is a major, vastly underestimated, if not almost entirely ignored problem for read-only denormalized databases too (see, for example, Big Data, Normalization & Analytics).
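To make drawbacks 1 and 2 concrete, here is a minimal Python sketch; the table, column names and values are hypothetical, invented purely for illustration:

```python
# Hypothetical denormalized "orders" table: the customer's city is repeated
# on every order row instead of being stored once in a customers table.
orders = [
    {"order_id": 1, "customer_id": 10, "city": "Paris", "amount": 50},
    {"order_id": 2, "customer_id": 10, "city": "Paris", "amount": 75},
    {"order_id": 3, "customer_id": 20, "city": "London", "amount": 20},
]

# 1. Redundancy: customer 10's city is stored twice.
cities = {row["city"] for row in orders if row["customer_id"] == 10}
assert cities == {"Paris"}

# 2. Update anomaly: changing only one of the redundant copies leaves the
# table asserting two contradictory "facts" about customer 10's city.
orders[0]["city"] = "Lyon"
cities = {row["city"] for row in orders if row["customer_id"] == 10}
assert cities == {"Lyon", "Paris"}
```

In a normalized design the city would be stored exactly once, so the inconsistency could not arise in the first place.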
R: When the need is for read over and over again (as frequently is the case for reporting/analytics) in a business context where data from multiple entities needs to be sewn together, denormalization is a reasonable, popular and efficient practice. With the right controls in place, data integrity CAN be ensured – so Fabian – I have to disagree with your statement “The gains do not come from denormalization, but from trading integrity for it” ... Question simply boils down to whether the organization has the stomach to bless & absorb the rigor and cost associated with ensuring this integrity and provide the data infrastructure to honor/meet SLAs.
I do not deny that denormalization is popular, but that is hardly proof that it is always reasonable or efficient. Since it involves "bundling" multiple entity classes into one table in order to avoid joins, consider what happens to queries that require access only to data about individual classes: they read more data than necessary, which degrades performance. It depends, of course, on the patterns of access to data and their relative importance and response requirements. Furthermore, denormalized databases are harder to understand, which complicates the formulation of queries and the interpretation of results, increasing proneness to error.
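The extra-read point can be illustrated with a small, hypothetical sketch (the tables, names and row counts are invented for the example):

```python
# Normalized design: customer facts live in their own narrow table.
customers = [
    {"customer_id": 10, "city": "Paris"},
    {"customer_id": 20, "city": "London"},
]
# Denormalized design: every order row repeats the customer's city.
orders_denorm = [
    {"order_id": 1, "customer_id": 10, "city": "Paris", "amount": 50},
    {"order_id": 2, "customer_id": 10, "city": "Paris", "amount": 75},
    {"order_id": 3, "customer_id": 20, "city": "London", "amount": 20},
]

# "In which cities do we have customers?" -- same answer either way, but
# the denormalized table forces a scan of more (and wider) rows.
assert {r["city"] for r in customers} == {r["city"] for r in orders_denorm}
assert len(orders_denorm) > len(customers)
```

The effect only grows with the number of orders per customer.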

As for updates, my paper mentioned above proves that denormalization for performance is a costly illusion. Performance is determined exclusively at the physical level, by a myriad of factors. To the extent that (1) all physical optimization options are exhausted (a very big if!) and (2) performance improves significantly when a database is denormalized, the gains do not come from denormalization per se, but rather from the failure to add the integrity constraints necessary to control the redundancy it introduces. Such constraints involve the very joins denormalization was intended to avoid! Adding them to every denormalized R-table imposes a serious burden on both users and the DBMS, only to defeat the whole purpose of denormalization.
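What such a controlling constraint amounts to can be sketched in Python. On a hypothetical denormalized table, integrity requires that customer_id still determine city, a cross-row check that in effect reconstructs the projection a normalized design would have kept as its own table (the function and data below are illustrative assumptions, not any product's API):

```python
def fd_holds(rows, determinant, dependent):
    """True iff every value of `determinant` maps to one value of `dependent`."""
    seen = {}
    for row in rows:
        key = row[determinant]
        if key in seen and seen[key] != row[dependent]:
            return False
        seen[key] = row[dependent]
    return True

orders = [
    {"order_id": 1, "customer_id": 10, "city": "Paris"},
    {"order_id": 2, "customer_id": 10, "city": "Paris"},
]
assert fd_holds(orders, "customer_id", "city")    # redundant copies consistent

orders[1]["city"] = "Lyon"                        # one redundant copy drifts
assert not fd_holds(orders, "customer_id", "city")
```

The DBMS (or application) must perform this kind of multi-row check on every update, which is precisely the work the join was supposed to avoid.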

If an organization knew and understood these fundamentals, appreciated their implications and consciously decided to trade integrity for performance, or bias databases for some queries and against others, fine. But that is hardly the reality.

What is more, without pressing vendors for more efficient implementations, organizations are dooming themselves to continue to make such popular compromises, to their own detriment.


1 comment:

  1. The unfortunate term denormalization is orthogonal in the ways you say as well as to pretty much everything relational. Products are partly at fault but so are the numerous immature posters on LinkedIn and elsewhere who conflate what Codd called compound sentences or propositions, such as those represented by a join's tuple, with simple, non-compound sentences.

    The many immature products encourage that conflation and I'd bet that the majority of practitioners believe that usage of such a tuple is no different from one that represents a simple sentence. When they dominate the audience or at least the vociferous in the audience, it's no wonder that vendors can foist the nonsense that joins can't perform as well as so-called denormalized tables. So the popular wisdom says pretend the join is a non-join and presto, the whole logical basis goes out the window.

    Nothing in Codd's theory prevents stored joins. Neither does anything in his theory require the application interface to specify storage. Why should products require users to specify storage aka materialization? Nostalgia?

    Part of the widespread immaturity is a hangover from file system days, when the logical join phenomenon wasn't made apparent by practical systems, so all updating involved the pretence that there is only one kind of sentence/fact, the simple non-compound one. We now have the absurd situation where some users, e.g. denormalization fans, think the proposition P(t) = P1(t) AND P2(t) means NOT P(t) = NOT P1(t) AND NOT P2(t), at the same time as vendors refuse to support NOT P1(t) IMPLIES NOT P(t). Codd allowed P, P1 and P2 to have distinct keys (more than allowed, he strongly encouraged such), so the deletion of simple sentences is actually just a special case of join deletion and it's generally never necessary to specify NOT P(t). NOT P(key of P) is enough, likewise NOT P1(key of P1), if desired.
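    The conflation described above can be checked mechanically. A minimal truth-table sketch, taking P(t) = P1(t) AND P2(t) for a fixed tuple t (the variable names simply mirror the propositions):

    ```python
    from itertools import product

    for p1, p2 in product([True, False], repeat=2):
        p = p1 and p2
        # De Morgan: NOT P(t) is NOT P1(t) OR NOT P2(t) -- an OR, not an AND.
        assert (not p) == ((not p1) or (not p2))
        # NOT P1(t) IMPLIES NOT P(t) does always hold.
        assert p1 or not p

    # The mistaken identity fails when exactly one conjunct is false:
    p1, p2 = True, False
    assert (not (p1 and p2)) != ((not p1) and (not p2))
    ```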