From: PS
To: Editor
I was once asked at an employment interview whether I had
ever denormalized a table. Since I had
only worked in Oracle, and the like, and hadn't recently created a table without
a primary key (thereby disallowing multiple identical rows), and thinking any
table in first normal form is not denormalized, I said, "No." I didn't get the job.
Since then, I have been subjected to
numerous assertions by colleagues that this or that table has been denormalized
for performance, yet in all instances, the supposedly denormalized table was
obviously in third normal form. (In all
instances, the tables had surrogate keys, and no obvious composite key
candidates, so 2NF, BCNF, 4NF and 5NF did not apply.)
In all of these assertions that a table is denormalized, it
appeared to me that the people making the assertion believed that normalization
somehow extends across rows, and maybe across the entire database. Yet William
Kent, in A Simple Guide to Five Normal Forms In Relational Database Theory,
Section 6, Interrecord Redundancy, states in the very first sentence of that
section, "The normal forms discussed here deal only with redundancy
occurring within a single record type."
Believing that normalization extends across rows, these colleagues
assert that a tactic of placing an account balance column in an accounts table,
and perhaps creating a trigger on an account activity table to maintain it,
constitutes denormalization, because the account table balance column is
"redundant."
The belief that the example provided above is a genuine
example of denormalization for performance appears pervasive, yet, as far as I
can tell, it has nothing to do with normalization at all. Perhaps, since few of my colleagues have
ever worked with any DBMS other than Oracle, and the like, the definition of
1NF makes no sense. They have no
experience with tables having non-unique columns names, nor with tables that
have multiple definitions for records, and so they think I'm lying when I say
such things exist, or did exist. I
believe this inability to comprehend 1NF leads them to interpret redundancy as
"interrecord," not intrarecord, and so everything beyond 1NF gets
confused.
I think perhaps that you and Chris Date have overestimated
the sophistication of the IT community.
More generally, it seems to me that the two of you keep trying to pitch
balls to people who haven't even picked up a bat, nor know that they should, much
less made it to the batter's box.
Fabian Pascal Responds: I’ve been accused of many
things – which tells me I must be doing something right – but
overestimating the IT industry’s sophistication has not been one of them, so
this is a first.
There is no aspect of database management that is not
characterized by lack of foundation knowledge and riddled with confusion, and
normalization is not any different. It’s not just a matter of “intrarecord” and
“interrecord” confusion – there are all sorts of other confusions and
fallacies, and I have documented many of them in my writings and lectures. In
fact, in a recent exchange with a reader who responded to my two-part article in
DMReview,
he made precisely the error you are referring to. When I asked him to give an
example that refutes my general argument that denormalization may some times
achieve performance gains with current products only at the expense of
increased integrity risk, he used the following example:
“I want users to be able to quickly retrieve total monthly sales
for product A. They do this hundreds of
times a month. I create a table keyed
on Year, Month and Product, to hold the total sales. I then update the total as orders are
processed. In a completely normalized database, the
query to get the total would have to read thousands of rows of order lines, and
would be orders of magnitude slower.”
As long as the industry neither requires, nor rewards
knowledge of fundamentals, why should we expect anything else?
From: PS
Message-id:
<000b01c24b89$42be5d90$06fea8c0@xp01>
MIME-version: 1.0
X-MIMEOLE: Produced
By Microsoft MimeOLE V6.00.2600.0000
I once laid out a jobs table for a new construction
electrical contractor. New home construction sites were always identified by a
job number (by a previous and long established convention). The most important block of managerial
information with respect to a job were the twelve or thirteen inspections
dates. So I defined a jobs table with
the job number, some other job specific info, and the twelve or thirteen
inspection dates. I was labeled a dummy for creating a table with a repeating
group.
I noted in my previous email that my colleagues, because they
only have experience with products such as Oracle, SQL Server, and such, find
the definition of first normal form completely unintelligible. In this case, because they have no
experience with a DBMS that can create a repeating group, they create separate
tables for things such as the date fields I mention above (a table was
suggested with three fields and thirteen rows per job), because they believe
the several inspection dates conform to the definition of a repeating group.
There are performance implications the misidentified
repeating group imposes. There is the unnecessary join (by splitting the data
into two tables), an index for the second table, and its search, and the extra
data that gets dragged across the network when one row from the job table is
joined to the thirteen date rows in the inspection dates table.
In another case, the apparel industry and its customer have
long accepted the convention of providing a shipping address in one column, and
an alternate address in an immediately adjacent column. It was asserted that this was a repeating
group and should be removed to a separate table. My view was, firstly, the two addresses were not
an example of a
repeating group, and secondly, the industry was completely satisfied with the
provision of a single alternate address (no demand for multiple alternate
addresses), and so a separate table was unnecessary for business reasons, and
removal would only create an unnecessary additional table, an additional index
(and index search), and impose an unnecessary join.
I suspect some of the alleged performance problems attributed
to normalization are because the concept of repeating groups is misunderstood.
Unnecessary tables are being created, unnecessary indexes are being created and
searched, unnecessary joins are being written, and excess data is being moved
across networks. As I said in my
previous email, I don't believe my colleagues understand 1NF!
When I was teaching economics, statistics and finance, I
found very quickly that I could not ask students if they understood
anything. I found it was a stupid
question because, quite simply, how would they know if they understood? So instead, I would
demonstrate some
problem. Show them how to solve it.
Leave that problem on the board, but start another one immediately adjacent to
the first, then ask them to tell me how to solve the second one. Usually, they had considerable
trouble. I
had to point to the first example to get them started, and point to the first
example for each additional step in solving the second. Only after several repetitions would they
actually understand how to work the problems.
I suspect you may need to do the something similar in your
lectures and writings, because if people get stuck thinking they know 1NF when
they really don't, they'll be lost and confused for the rest of the lecture.
AND!!! Never, ever, believe anyone when
they say they've learned a lot! They
may think they have, but really, they haven't.
When they go home or back to work, and try to apply what they think
they've learned, they find themselves confused and unable to apply it.
Fabian Pascal Responds: Another very common fallacy
due to lack of foundation knowledge. In yet another response to my
above-mentioned article, a critic tries to persuade the reader that “multivalue
databases” have resolved the performance problems of relational systems by
using – guess what – repeating groups! And he then insists this does not
violate relational principles (see my rebuttals).
I have documented and debunked the fallacy you refer to years
ago in my second book, where I quoted Joe Celko in DBMS:
"The relational model states that in order to be in 1NF
[First Normal Form], a table can have no repeating groups. But since SQL does
not support repeating groups, the only way] people can write a repeating group
in SQL ... is:
CREATE TABLE employees
(emp# INTEGER PRIMARY
KEY,
ename CHAR(15) NOT
NULL,
child1 CHAR(15),
child2 CHAR(15),
: :
childN CHAR(15));”
Celko was a member of the ANSI SQL committee and is
considered an industry expert. This is hardly the only, or worst ignorance or
confusion he has exhibited over the years, and even illustrious industry
leaders are not much better e.g. Comments On an
Interview with Jim Gray. Is there any wonder that products and
practice are what they are? What can we expect from the average practitioner,
under these circumstances?
While using one table instead of two is, indeed, not a
violation of 1NF, this is still a logical design decision. Maximizing
performance is a physical optimization issue, which should not contaminate
logical design. If products were truly relational and properly implemented,
they would not lead users oblivious to the integrity risk down this slippery
slope (see Note
on a Quote.)
Ed. Note: On the current definition of 1NF see p. 144-45 in THE THIRD MANIFESTO
and Hugh Darwen’s Will the Real First Normal Stand Up? in RELATIONAL DATABASE
WRITINGS 1989-1991.
Posted
11/01/02
[ABOUT]
[QUOTES]
[LINKS]