ON NORMALIZATION AND REPEATING GROUPS
with Fabian Pascal

 

 

From: PS

To: Editor

 

I was once asked at an employment interview whether I had ever denormalized a table.  Since I had only worked in Oracle, and the like, and hadn't recently created a table without a primary key (thereby disallowing multiple identical rows), and thinking any table in first normal form is not denormalized, I said, "No."  I didn't get the job.  Since then, I have been subjected to numerous assertions by colleagues that this or that table has been denormalized for performance, yet in all instances, the supposedly denormalized table was obviously in third normal form.  (In all instances, the tables had surrogate keys, and no obvious composite key candidates, so 2NF, BCNF, 4NF and 5NF did not apply.)

 

In all of these assertions that a table is denormalized, it appeared to me that the people making the assertion believed that normalization somehow extends across rows, and maybe across the entire database. Yet William Kent, in A Simple Guide to Five Normal Forms In Relational Database Theory, Section 6, Interrecord Redundancy, states in the very first sentence of that section, "The normal forms discussed here deal only with redundancy occurring within a single record type."  Believing that normalization extends across rows, these colleagues assert that a tactic of placing an account balance column in an accounts table, and perhaps creating a trigger on an account activity table to maintain it, constitutes denormalization, because the account table balance column is "redundant."

 

The belief that the example provided above is a genuine example of denormalization for performance appears pervasive, yet, as far as I can tell, it has nothing to do with normalization at all.  Perhaps, since few of my colleagues have ever worked with any DBMS other than Oracle, and the like, the definition of 1NF makes no sense.  They have no experience with tables having non-unique columns names, nor with tables that have multiple definitions for records, and so they think I'm lying when I say such things exist, or did exist.  I believe this inability to comprehend 1NF leads them to interpret redundancy as "interrecord," not intrarecord, and so everything beyond 1NF gets confused.

 

I think perhaps that you and Chris Date have overestimated the sophistication of the IT community.  More generally, it seems to me that the two of you keep trying to pitch balls to people who haven't even picked up a bat, nor know that they should, much less made it to the batter's box.

 

 

Fabian Pascal Responds: I’ve been accused of many things – which tells me I must be doing something right – but overestimating the IT industry’s sophistication has not been one of them, so this is a first.

 

There is no aspect of database management that is not characterized by lack of foundation knowledge and riddled with confusion, and normalization is not any different. It’s not just a matter of “intrarecord” and “interrecord” confusion – there are all sorts of other confusions and fallacies, and I have documented many of them in my writings and lectures. In fact, in a recent exchange with a reader who responded to my two-part article in DMReview, he made precisely the error you are referring to. When I asked him to give an example that refutes my general argument that denormalization may some times achieve performance gains with current products only at the expense of increased integrity risk, he used the following example:

 

“I want users to be able to quickly retrieve total monthly sales for product A.  They do this hundreds of times a month.  I create a table keyed on Year, Month and Product, to hold the total sales.  I then update the total as orders are processed.  In a completely normalized database, the query to get the total would have to read thousands of rows of order lines, and would be orders of magnitude slower.”

 

As long as the industry neither requires, nor rewards knowledge of fundamentals, why should we expect anything else?

 

 

From: PS

 

Message-id: <000b01c24b89$42be5d90$06fea8c0@xp01>

MIME-version: 1.0

X-MIMEOLE: Produced By Microsoft MimeOLE V6.00.2600.0000

I once laid out a jobs table for a new construction electrical contractor. New home construction sites were always identified by a job number (by a previous and long established convention).  The most important block of managerial information with respect to a job were the twelve or thirteen inspections dates.  So I defined a jobs table with the job number, some other job specific info, and the twelve or thirteen inspection dates. I was labeled a dummy for creating a table with a repeating group.

 

I noted in my previous email that my colleagues, because they only have experience with products such as Oracle, SQL Server, and such, find the definition of first normal form completely unintelligible.  In this case, because they have no experience with a DBMS that can create a repeating group, they create separate tables for things such as the date fields I mention above (a table was suggested with three fields and thirteen rows per job), because they believe the several inspection dates conform to the definition of a repeating group.

 

There are performance implications the misidentified repeating group imposes. There is the unnecessary join (by splitting the data into two tables), an index for the second table, and its search, and the extra data that gets dragged across the network when one row from the job table is joined to the thirteen date rows in the inspection dates table.

 

In another case, the apparel industry and its customer have long accepted the convention of providing a shipping address in one column, and an alternate address in an immediately adjacent column.  It was asserted that this was a repeating group and should be removed to a separate table.  My view was, firstly, the two addresses were not an example of a repeating group, and secondly, the industry was completely satisfied with the provision of a single alternate address (no demand for multiple alternate addresses), and so a separate table was unnecessary for business reasons, and removal would only create an unnecessary additional table, an additional index (and index search), and impose an unnecessary join.

 

I suspect some of the alleged performance problems attributed to normalization are because the concept of repeating groups is misunderstood. Unnecessary tables are being created, unnecessary indexes are being created and searched, unnecessary joins are being written, and excess data is being moved across networks.  As I said in my previous email, I don't believe my colleagues understand 1NF!

 

When I was teaching economics, statistics and finance, I found very quickly that I could not ask students if they understood anything.  I found it was a stupid question because, quite simply, how would they know if they understood?  So instead, I would demonstrate some problem.  Show them how to solve it. Leave that problem on the board, but start another one immediately adjacent to the first, then ask them to tell me how to solve the second one.  Usually, they had considerable trouble. I had to point to the first example to get them started, and point to the first example for each additional step in solving the second.  Only after several repetitions would they actually understand how to work the problems.

 

I suspect you may need to do the something similar in your lectures and writings, because if people get stuck thinking they know 1NF when they really don't, they'll be lost and confused for the rest of the lecture. AND!!!  Never, ever, believe anyone when they say they've learned a lot!  They may think they have, but really, they haven't.  When they go home or back to work, and try to apply what they think they've learned, they find themselves confused and unable to apply it.

 

 

Fabian Pascal Responds: Another very common fallacy due to lack of foundation knowledge. In yet another response to my above-mentioned article, a critic tries to persuade the reader that “multivalue databases” have resolved the performance problems of relational systems by using – guess what – repeating groups! And he then insists this does not violate relational principles (see my rebuttals).

 

I have documented and debunked the fallacy you refer to years ago in my second book, where I quoted Joe Celko in DBMS:

 

"The relational model states that in order to be in 1NF [First Normal Form], a table can have no repeating groups. But since SQL does not support repeating groups, the only way] people can write a repeating group in SQL ... is:

 

  CREATE TABLE employees

   (emp# INTEGER PRIMARY KEY,

    ename CHAR(15) NOT NULL,

    child1 CHAR(15),

    child2 CHAR(15),

      :       :

    childN CHAR(15));”

 

Celko was a member of the ANSI SQL committee and is considered an industry expert. This is hardly the only, or worst ignorance or confusion he has exhibited over the years, and even illustrious industry leaders are not much better e.g. Comments On an Interview with Jim Gray. Is there any wonder that products and practice are what they are? What can we expect from the average practitioner, under these circumstances?

 

While using one table instead of two is, indeed, not a violation of 1NF, this is still a logical design decision. Maximizing performance is a physical optimization issue, which should not contaminate logical design. If products were truly relational and properly implemented, they would not lead users oblivious to the integrity risk down this slippery slope (see Note on a Quote.)

 

Ed. Note: On the current definition of 1NF see p. 144-45 in THE THIRD MANIFESTO and Hugh Darwen’s Will the Real First Normal Stand Up? in RELATIONAL DATABASE WRITINGS 1989-1991.

 

 

Posted 11/01/02

 

 

 

[ABOUT] [QUOTES] [LINKS]