From: LC
To: Editor
I recently discovered dbdebunk.com, and am very interested in
what you are doing. I work for a business intelligence software company, and
work with customers to implement our query tools against their existing
relational data warehouses.
I am amazed on a regular basis about the lack of fundamental
understanding of databases in the industry. Many implementers are constantly in
search of a "quick fix" to each problem they find, and are unwilling
or unable to step back and realize the fundamental design flaws in their
projects. I think a major reeducation of the industry is in order, and I
appreciate what you're doing.
However, I would like to respectfully disagree with one of
your recurring assertions, and hopefully get your feedback. An example of the
assertion is in the "The Logical-Physical Confusion" article, from
the May 2002 issue of the "Journal of Conceptual Modeling".
You state repeatedly that denormalization is essentially a
logical fallacy, and that the idea of performance gain from denormalization is
a widespread myth. I understand the heart of the issue: that logical and
physical design are (wrongly) tied so closely in current RDBMS products that to
change the physical storage design, in many cases the logical table design must
be changed. I agree that this is a problem.
However, you go on to assert that the denormalization efforts
are still not worth the effort, even given the current state of the products:
"What is ironic about all this
is that the whole concept of denormalization for performance is actually an
illusion. While performance does improve sometimes, this is possible only if
the integrity implications of denormalization are ignored. The industry does
not realize that denormalized databases suffer from redundancy and, thus, have
a considerably higher risk of data corruption, which imposes a highly
prohibitive integrity burden on users."
I agree that this does apply to multi-purpose database
systems, which are used both for data entry and analysis. However, in my field
the database systems we work with are entirely optimized for analysis.
Generally, a customer will have one or more transactional systems that are
optimized for insert/update performance, and a central warehouse system that is
optimized for analysis of large volumes of data.
In these analysis-optimized systems, the only insert/update
operations are performed via a single, centrally administered loading process.
The process is designed to denormalize tables where appropriate, and to
carefully guarantee consistency throughout the process. Generally the load
occurs during time periods in which users are not performing heavy analysis, so
the overhead required for this consistency checking does not affect the
ultimate goal of the system, which is complex query performance.
I have seen many concrete examples of systems that achieve
vast gains in query performace via denormalization. The designers of these
systems understand the consistency implications, and while it is true that
ensuring consistency does require significant resources, this resource
consumption is time-shifted in such a way that it does not hinder performance.
So, while the current crop of RDBMS products inadequately
separate the logical from the physical design of a system, it is not a myth
that denormalization can achieve huge performance gains if it is properly
understood and carefully implemented.
I would love to hear your feedback to this, if you are
interested.
Fabian Pascal Responds: Read the chapter on
normalization in my
book, my several articles in the Against
the Grain series (particularly my answers to questions piece), and The
Dangerous Illusion: Normalization, Performance and Integrity, Part 1 and Part 2.
1.
There are no relational DBMSs or databases (warehouses
are certainly not that). There are only SQL ones--which are not even
close--or worse.
2.
The 'for analysis
only' is somewhat of an illusion, given that warehouses/DSS databases are
populated from production databases which are themselves poorly designed and
managed by DBMSs that do not guarantee correctness themselves. Databases/DBMSs
were invented to avoid relying on users/applications, and with all due respect
to your company's "single, centrally administered loading process...
designed to denormalize tables where appropriate, and to carefully guarantee
consistency throughout the process", I see no reason to relax the general
principle. Moreover, performance tradeoffs occur not just between retrievals
and updates, but between different retrievals too; queries that need to access
the narrower normalized tables will be forced to access wider denormalized
tables.
3.
Correctness must be ensured not just for data in the database,
but also for its manipulation. Relational operations expect fully
normalized databases, in the absence of which even if results are correct,
interpretability may be difficult.
4.
Acceptance of denormalization as a solution to
performance problems is usually based on lack of awareness of its (integrity)
costs and a major factor robbing vendors from the incentive to develop correct
and true RDBMSs which would perform well with fully normalized databases. I'm
afraid that if those became available your company would have to find something
else to do, but it would certainly benefit all users.
Posted
07/05/02
[ABOUT]
[QUOTES]
[LINKS]