ON NORMALIZATION
with Fabian Pascal

 

 

 

From: LC
To: Editor

 

I recently discovered dbdebunk.com, and am very interested in what you are doing. I work for a business intelligence software company, and work with customers to implement our query tools against their existing relational data warehouses.

 

I am amazed on a regular basis about the lack of fundamental understanding of databases in the industry. Many implementers are constantly in search of a "quick fix" to each problem they find, and are unwilling or unable to step back and realize the fundamental design flaws in their projects. I think a major reeducation of the industry is in order, and I appreciate what you're doing.

 

However, I would like to respectfully disagree with one of your recurring assertions, and hopefully get your feedback. An example of the assertion is in the "The Logical-Physical Confusion" article, from the May 2002 issue of the "Journal of Conceptual Modeling".

You state repeatedly that denormalization is essentially a logical fallacy, and that the idea of performance gain from denormalization is a widespread myth. I understand the heart of the issue: that logical and physical design are (wrongly) tied so closely in current RDBMS products that to change the physical storage design, in many cases the logical table design must be changed. I agree that this is a problem.

 

However, you go on to assert that the denormalization efforts are still not worth the effort, even given the current state of the products:

 

"What is ironic about all this is that the whole concept of denormalization for performance is actually an illusion. While performance does improve sometimes, this is possible only if the integrity implications of denormalization are ignored. The industry does not realize that denormalized databases suffer from redundancy and, thus, have a considerably higher risk of data corruption, which imposes a highly prohibitive integrity burden on users."

 

I agree that this does apply to multi-purpose database systems, which are used both for data entry and analysis. However, in my field the database systems we work with are entirely optimized for analysis. Generally, a customer will have one or more transactional systems that are optimized for insert/update performance, and a central warehouse system that is optimized for analysis of large volumes of data.

 

In these analysis-optimized systems, the only insert/update operations are performed via a single, centrally administered loading process. The process is designed to denormalize tables where appropriate, and to carefully guarantee consistency throughout the process. Generally the load occurs during time periods in which users are not performing heavy analysis, so the overhead required for this consistency checking does not affect the ultimate goal of the system, which is complex query performance.

 

I have seen many concrete examples of systems that achieve vast gains in query performace via denormalization. The designers of these systems understand the consistency implications, and while it is true that ensuring consistency does require significant resources, this resource consumption is time-shifted in such a way that it does not hinder performance.

 

So, while the current crop of RDBMS products inadequately separate the logical from the physical design of a system, it is not a myth that denormalization can achieve huge performance gains if it is properly understood and carefully implemented.

 

I would love to hear your feedback to this, if you are interested.
 
 

Fabian Pascal Responds: Read the chapter on normalization in my book, my several articles in the Against the Grain series (particularly my answers to questions piece), and The Dangerous Illusion: Normalization, Performance and Integrity, Part 1 and Part 2.

 

1.      There are no relational DBMSs or databases (warehouses are certainly not that). There are only SQL ones--which are not even close--or worse.

 

2.       The 'for analysis only' is somewhat of an illusion, given that warehouses/DSS databases are populated from production databases which are themselves poorly designed and managed by DBMSs that do not guarantee correctness themselves. Databases/DBMSs were invented to avoid relying on users/applications, and with all due respect to your company's "single, centrally administered loading process... designed to denormalize tables where appropriate, and to carefully guarantee consistency throughout the process", I see no reason to relax the general principle. Moreover, performance tradeoffs occur not just between retrievals and updates, but between different retrievals too; queries that need to access the narrower normalized tables will be forced to access wider denormalized tables.

 

3.      Correctness must be ensured not just for data in the database, but also for its manipulation. Relational operations expect fully normalized databases, in the absence of which even if results are correct, interpretability may be difficult.

 

4.       Acceptance of denormalization as a solution to performance problems is usually based on lack of awareness of its (integrity) costs and a major factor robbing vendors from the incentive to develop correct and true RDBMSs which would perform well with fully normalized databases. I'm afraid that if those became available your company would have to find something else to do, but it would certainly benefit all users.

 

 

Posted 07/05/02

 

 

 

[ABOUT] [QUOTES] [LINKS]