Sunday, June 29, 2014

Denormalization: Database Bias, Integrity Tradeoff and Complexity

The common and entrenched misconception about normalization was recently visible yet again in a LinkedIn exchange.
R: Unless the need is for ACID compliant transactions, denormalization is generally not considered logically, physically or whatever-ally-–so essentially a thoroughly normalized mode is relevant for a write-infrequently consumption of data and data integrity can be guaranteed by design.

Points arising:
  • ACID is separate and orthogonal to normalization.
  • Denormalization has costly practical drawbacks for any database, whether it is being updated (transactions), or is read-only (see next).
  • There are at least three, inter-related drawbacks to denormalization (see The Costly Illusion: Normalization, Integrity and Performance).
1. Redundancy: Data values are stored more than once
2. Update anomalies: Certain reasonable data operations are not possible
3. Complexity/proneness to errors: Harder to understand, protect the integrity of and query databases
Out of these, only 1 and 2 pertain primarily to updates. 3 is a major, vastly underestimated, if not almost entirely ignored problem for read-only denormalized databases too (see, for example, Big Data, Normalization & Analytics).
R: When the need is for read over and over again (as frequently is the case for reporting/analytics) in a business context where data from multiple entities needs to sewn together, denormalization is a reasonable, popular and efficient practice. With the right controls in place, data integrity CAN be ensured – so Fabian – I have to disagree with your statement “The gains do not come from denormalization, but from trading integrity for it” ... Question simply boils down to whether the organization has the stomach to bless & absorb the rigor and cost associated with ensuring this integrity and provide the data infrastructure to honor/meet SLAs.
I do not deny denormalization is popular, but that is hardly proof that it is always reasonable or efficient. Since it involves "bundling" of multiple entity classes into one table in order to avoid joins, consider what happens to queries that require access only to data about individual classes: they read more data than necessary, which degrades performance. It depends, of course, on the patterns of access to data and their relative importance and response requirements. Furthermore, denormalized databases are harder to understand and complicates the formulation of queries and interpretation of results, increasing error-proneness.

As for updates, my paper mentioned above proves that denormalization for performance is a costly illusion. Performance is determined exclusively at the physical level by a myriad of factors. To the extent that (1) all  physical optimization options are exhausted (very big if!) and (2) performance improves significantly when a database is denormalized, the gains do not come from denormalization per se, but rather from failure to add the integrity constraints necessary to control the redundancy introduced by the denormalization. Such constraints involve the very joins denormalization was intended to avoid! So adding them to every denormalized R-table imposes a serious burden on both users and the DBMS, only to defeat the whole purpose of denormalization.

If an organization knew and understood these fundamentals, appreciated their implications and consciously decided to trade integrity for performance, or bias databases for some queries and against others, fine. But that is hardly the reality.

What is more, without pressing vendors for more efficient implementations, organizations are dooming themselves to continue to make such popular compromises, to their own detriment.

No comments:

Post a Comment