“If
liberty means anything at all, it means the right to tell people what they do
not want to hear."
--George
Orwell
From: AC
To: Editor
I recently found the website dbdebunk.com, and I'm riveted
... having worked as a DBMS expert for years, I've found some of your insights
quite, well, rather insightful.
On the other hand, I have to take issue with the article The
Dangerous Illusion: Denormalization, Performance and Integrity, Part 1. You
seem to have fallen victim to a bugbear of your own devising: the alleged
Physical vs. Logical confusion! But wait. Don't jump to conclusions. I'll
explain.
Yes, there is a body of developing knowledge called
"Relational Theory". But equally, there is a body of developing
knowledge called "Designing Databases to Meet User Requirements".
Sometimes these two fundamentally different fields converge, at other times
they do not. How and why is this sorry state of affairs possible? Well, here's
my take on it. There is a gaping hole in Relational Theory. Not
*intrinsically*, but rather in its application outside of the field of, well,
Database Theory. The gaping hole is the notion of efficiency, or more
specifically, the idea of 'Cost'... and the notion of Cost as distinct from the
details of any idea of physical implementation.
Relational Theory purports to be applicable to the design of
real-world databases... but it is not. By failing to adequately address Cost,
the Theory becomes inapplicable to real world environments where Cost is the
single overriding factor! This notion of Cost is not a confusion of Physical
vs. Logical: rather, this is a function, directly, of the logical model. Costs
need not be rooted in any physical implementation. We might simply allow that a
the cost of a particular join is some function of the number of rows of the
relevant tables, perhaps, or some function of the number of tables involved, or
an arbitrary constant, or anything. The point is, we have a metric, and we can
build a usage model for our database. By modeling usage of the database, the
possible growth of tables, the various data volumes, etc, we can construct a
purely logical construct of a database in use. The set of metrics and functions
being used to define Costs *could* be matched to the observed physical
properties of any particular implementation of a database, but it might not. In
fact, the opposite might be true: the availability of a cost-based logical
model could be used to define the development goals of physical
implementations!
If Relational Theory is intended to be a complete tool for
the design of databases (and not just one of several tools), it *must* address
the notion of Cost. You don't address this issue... instead, you specifically
*avoid* it, instead suggesting that any mention of Cost by denormalization
apologists is actually the work of a deficient intellect incorrectly separating
logical from physical!
Suffice to say that any discussion of the relative merits of
denormalizing vs. normalizing has to be quantitative with regards to Cost, in
addition to qualitative. Why? Because if Cost is not a factor, then we might as
well not normalize at all! We normalize a data set because there are Costs
associated with *not* normalizing. But these costs remain vague, implied
things. For example, on one hand, you defend the implementation of fully
normalized databases, suggesting that it is possible that faulty programmatic
logic would allow data integrity issues into a denormalized database--a very
low level, physical-level consideration of code quality and the related costs
of data integrity issues--while on the other hand, you disallow any possible
benefit from denormalization, based on the idea that any consideration of
physical implementation with regards to normalization is due to faulty
reasoning.
Thoughts?
From: Fabian Pascal
To: AC
When you have to take so long to explain something so simple,
it is obviously wrong. I am sorry, but that is the case. I think you either
worked too long in the industry, or do not have a formal education, or both.
And there is little I can do via email to correct that.
Let me just say that you do not understand what theory is,
and your perspective on the article is off.
From: AC
I'm surprised that you seem so hostile. Certainly your use of
argumentum ad hominem might appear a little out of place in a technical
discussion?
I'll phrase the argument a couple of different ways, short
and sweet:
- Since the only rational reason for normalization is
cost management--costs being measured in terms of the risk of data
integrity issues or in terms of some sort of theoretical computational
operation required to maintain integrity in a denormalized database--a
complete theory for the design of a database must address relative cost.
- A database design theory that fails to address the
relative merits of two different database designs--indicating that both
are "correct", when in fact, given a particular cost function,
one or other is better--is clearly incomplete.
"...take so long to explain something so simple, it is
obviously wrong...": You're such a comedian... at least my argument
doesn't run to two *articles*, my friend. ;)
Anyway, keep up the good work!
From: Fabian Pascal
To: AC
1st, I have no idea where you find hostility. My reply that
you are wrong was factual, not hostile.
2nd, I have heard your kind of argument for as long as I've
been in the field and all my writings debunk it, so my patience for it has
faded over the years. As Chris Date says, life is short and there's a limit to
how many times we can respond to the same error that we explained so many times
before, and with utmost clarity.
3rd, this is especially true for long messages, which require
even longer replies. This is a free endeavor and our time is not unlimited.
4th, your argument is not technical, but one of what you call
"cost-effectiveness" (see below).
[Your new phrasing] may be short, but it is certainly not
sweet. Why is a denormalized database more costly than a normalized one? Do you
include the cost of corruption in your calculations? If so, how do you
calculate it? (Frankly, I don't believe you were even aware of the integrity
cost before you read my article; 99% of practitioners are not.)
The reason for which people denormalize is not due to cost,
but to ignorance of the integrity implications. Neither do they know/understand
that the products are flawed and force them to denormalize and, thus, trade-off
performance for integrity, which would be unnecessary with a well implemented,
truly relational DBMS
This is pure BS--it sounds reasonable, but is too
vague and lacks any justification to mean anything. Correct means logically
correct, not economically effective. Db practitioners work in a field founded
on logic, of which they know appallingly little, and can't even get their
concepts and terms right, let alone reason. Even the concept of theory you
don't understand, they equate it, like you, with lack of practicality. Can you
define theory?
I reiterate: how do you factor in the cost of corruption, if
you do at all (which I don't believe)? My bet is that you were unaware of the
integrity implications until you read my article. If so, than you are
inconsistent with your own framework of "cost function". If anything
is incomplete, it's that.
Actually, it is you who try to be a comedian, but fail: do
you expect to be taken seriously with such comments?
Here's my suggestion: publish your message as an article and
make it available to public review as I did, then we'll see if my article is
too long, or yours make sense. But before you do, I suggest you check out my
rebuttals to two other reactions to my article, and see if you
want to expose yourself to that.
From: AC
Thanks for your reply. In understand that you do not suffer
fools gladly. I actually find healthy debate very stimulating. ;)
(1)
Since normalization, one tool suggested by Relational Theory,
does not address every possible flaw in a proposed schema--for example,
intra-table dependencies--it can hardly be suggested that a schema that
achieves a certain normal form is "correct", nor can it be claimed
that normalization is "complete" in any sense.
(2)
If you truly do not care about economic effectiveness, why do
you bother talking to people for whom this is the overriding concern, that is,
professionals?
I'll spend some more time working through more of the
material on your excellent site.
From: Fabian Pascal
Well, there is little one can do about fools, so there’s no
point investing time and effort. What I don't suffer is laziness: there are
dozens of my articles, a web site and 3 books which already addressed
repeatedly the issues people keep raising for me to answer over and over again.
And if they think that the issues have not been addressed, well, then they
don't understand and there is little I can do about that too.
Key in your statement are debate and healthy:
there must be a debate and it must be healthy. In most cases neither
is true, and that includes your arguments.
1st, normalization is not a tool, but a set of design
principles/guidelines.
2nd, let me get this straight: are you saying that because
normalization does not address every possible design flaw, then it should
not address any of those that it can address? And if relational design
addresses intra-table dependencies separately from normalization, then
normalization should not be used to address those dependencies that it
can address?
You've got the common problem of poor reasoning. How people
who cannot think logically can do good work in a field founded on logic
escapes me. It's like a quantum theorist who does not know mathematics.
3rd, you avoid answering my questions by asking
another. To repeat:
·
Why is denormalization less costly than
normalization?
·
Is corrupt data a cost of denormalization and does it
enter in your cost-effectiveness calculations at all? If so, how?
Answering this question would make your question (2) moot -
more evidence of poor reasoning.
It says something about the quality of discourse (and
intellectual abilities) in the IT world that all reactions to my article
completely ignore the core of its message: the integrity cost of
denormalization. And you call this "healthy debate that stimulates"?
Sounds like you need much more than [reading my site].
From: AC
I'll be frank: I think you're dragging this debate all over
the place. My original question was simple... where does cost come into the
database design picture. In your article you position normalization as the
be-all-and-end-all of database design, when in fact, I posit, *efficiency* is
the be-all-and-end-all. Normalization is simply a tool for achieving efficiency
in a database that is prone to random access update. Yes, it is very satisfying
in it's application; yes, I'm sure a lot of your correspondents are morons, but
no, and no again, normalization does not address cost, and so cannot be
considered a complete methodology for database design.
I will not be drawn into defending denormalization, I don't
advocate it, and I am not its apologist. I'm merely suggesting that there is
more work in this area that still needs to be done to address cost, in all its
forms, at the logical level.
Let me briefly change the subject. I greatly admire your work
and your apparent goals, but you, after all, for all your comedic vitriol, are
just as human as those poor sods learning database design from "J2EE for
Dummies", and treating these people as less than you is not helping your
cause. You see, normalization does not have any problems addressing
"intra-table dependencies". In fact, there is no such term. I
invented it. No, nothing to do with an improvement suggested by Ling, Tompa,
and Kameda. Gotcha. ;)
From: Fabian Pascal
It is now obvious that your problem is more serious than just
[lack of] technical knowledge and understanding--you don't seem to have
understood anything in my replies and ignored all my questions and challenges.
You keep repeating the same vague, meaningless statements and I get the
distinct impression that you simply cannot address the problems in your
arguments that I raised. No wonder you want to change the subject from
technical issues (which you claimed this is all about) to style issues, which
are irrelevant to the arguments at hand.
As to your gotcha, you have actually put your foot in your
mouth. It so happens that there is such a thing as intra-table
dependencies--this is in fact what normalization is all about!! The notion that
you "invented" the term makes it very clear that you don't understand
the subject you purport to argue about. What is more, neither did you catch my
own "gotcha" (although I did not intend it to be that): normalization
does deal with those dependencies. So you see, you're out of your
league.
I find that this exchange is--as I suspected from the
beginning--neither a debate, nor stimulating, but a waste of time. Should have
listened to my gut and never allowed myself yet again to be baited into this
kind of nonsense--my instincts are almost always right.
I will post the exchange on the site and let the readers
decide if there is anything behind your "cost" and
"efficiency" mantra, who got whom, and what is being dragged where.
Ed. Comments: That a “DBMS expert” with
“years of experience”
knows so little about the foundation of his field and exhibits such inability
to see the logical implications of his arguments is excellent evidence for my
take on the sad state of database management. For consider:
Ø Relational
theory is not just “a body of developing knowledge” or “purports to be
applicable only to the design of real-world databases”. It is the scientific
foundation of database management - logic as it applies to databases. Does AC
mean that the correctness of databases and the answers they produce - of which
logic is the guarantor--is not part of user requirements? How can logic be
“fundamentally different” than those requirements? And if it is, what does AC
propose as an equivalent/better substitute? AC got it backwards here: the sad
story is precisely that database practice diverges from, instead of converging
with logic, as it should. That does not reduce costs, it increases them (By
correctness I mean, of course, the only kind a DBMS is capable of ensuring,
namely consistency, see PRACTICAL
ISSUES IN DATABASE MANAGEMENT).
Ø Says
AC “Costs need not be rooted in any physical implementation. We might simply
allow that a the cost of a particular join is some function of the number of
rows of the relevant tables, perhaps, or some function of the number of tables
involved, or an arbitrary constant, or anything”. But unless the join is
performed - that is, executed physically - what on earth is its so-called
“logical cost” and how does the number of tables or rows enter into it? And how
can any theory be applicable to anything arbitrary? (AC seems to subscribe to
the notion, rather common in the industry, that theory is “just theory and,
therefore, not practical”, which is, of course, bunk. Would he advocate
building bridges that violate physical theories, to produce “less costly”
results? Or medical treatments that violate biochemical theories, for “higher
efficiency”? What happens when this is attempted is pretty well known.)
Ø Any
“metric” can be devised, the question is what does it measure that is
meaningful and valuable? I have no idea what AC’s metric is (at least based on
his “explanation” of it, I don’t), but whatever it is, it has little to do with
normalization. Once a business (or conceptual) model already exists - which
means that the business rules have been fixed - more than one logical model may
be possible (a choice of which may or may not have some costs attached to it),
but whichever one is chosen, it must be normalized, if data integrity is to be
maximized at minimal cost. The point of my article that AC misses is that there
is no cost-function that justifies undernormalization of a given logical model.
Ø AC
provides no concrete example of a situation where a denormalized design would
be less costly, in his sense, than its normalized version. He misses most of
the points of my article, among them:
·
Whether AC’s vague “logical costs” exist or not,
normalization in the “real-world” is performed because of flaws - imaginary or
real - in current, mainly SQL products (which are not really relational), that
force users to trade integrity for performance, without them being aware of it;
·
The number of logical tables--and, thus, of rows
too--per se says nothing about efficiency, which is a property of processing;
depending on storage and access, more rows can be processed more efficiently
than less rows;
·
In a rebuttal of another critique I give the example of
a DBMS physically storing table joins, rather than the “base tables” stored by
SQL products, in which case no join needs to be processed. Better still, what
about the new implementation technology I have alluded to in several writings,
which does away with physical rows altogether?
·
And I do not even mention efficiency trade-offs between
queries and updates, or even between queries (again, see PRACTICAL
ISSUES IN DATABASE MANAGEMENT).
Ø What
a vast majority of practitioners, including AC, fail to understand is that
relational technology cuts several major costs that tend to overwhelm
all others. Among them, keeping the logical level the users see insulated from
the physical implementation level allows DBMS vendors and DBAs complete
freedom to do whatever they damn please to maximize (physical) efficiency, without
affecting what users and applications see at the logical level, or its
integrity! Unlike AC’s vague cost, this is “big time”, clear, measurable
cost cutting. AC, in fact, has it backwards: he criticizes relational theory
for failure to address cost, when in reality, this is exactly what it achieves by
being silent on efficiency!!! Is it the theory’s fault that vendors flout
it and, thus, products fail to deliver its benefits, or impose unnecessary
costs? What is more, they get away with it precisely because practitioners
think like AC does.
Ø Relational
theory is not a “tool” for database design, let alone a complete one (see
above). In fact, it is more accurate to say that normalization is an adjunct to
the relational model, rather than part of it. And my book and articles cover
many other design principles (e.g. normalization is only one section of the
chapter on redundancy). However, normalization principles are general to all
logical designs, regardless of any other principles that are pertinent to them.
AC uses terms such as “quantitative vs. qualitative cost” for
effect, which obscures the fuzziness of his argument. This is that it lacks any
meaning. How about one concrete, proper application of his “cost function” to a
real-world design from his vast practical experience, that favored an
undernormalized model over the normalized version on purely logical grounds?
I will conclude by noting that this is hardly the first time
that my response is deemed “hostile”. But saying that an argument is bunk, or
vague, or uninformed, in conjunction with evidence to that effect, is a statements
of fact. Thinking practitioners should focus on whether it is true or
false, rather than allow their emotions to assign hostility to it. If they
cannot afford the latter, they should either make sure that they argue from
informed, well-reasoned positions, or refrain from public arguments altogether.
This is a thankless job, but somebody’s gotta do it.
Posted
11/23/02