ON THE HIGHER NORMAL FORMS
with Chris Date

 

 

 

From: SB

To: Editor

 

Since I am dealing with relational databases, I come across the 4th and 5th normal forms, which I never believed really existed.  After the visit of Fabian Pascal's Zurich seminar, I bought and read Fabian's PRACTICAL ISSUES IN DATABASE MANAGEMENT and Chris' INTRODUCTION TO DATABASE SYSTEMS. In both the 4th and 5th normal form is presented with the very same examples found in every book, which mentions them.

 

I am very interested in your opinion to the following:

 

On page 390 in INTRODUCTION TO DATABASE SYSTEMS the example for a violation of the 4th normal form is a relation CTX containing course (C), teacher (T), and text (X).  It is said in the text that the relation is “all key”.

 

I doubt that: A key by definition has to be the identifier of something. Something must be accessible ('opened') via this key. Even if we are not interested in (storing) the attribute identified by this key, it must exist, otherwise it makes no sense at all to call something a key. I was not able to find an attribute, which is identified by the triple CTX. So the process of packing the three attributes together into one relation and declaring them as 'key' was senseless and therefore in my opinion wrong from the beginning.

 

In my definition a relation consists of a key and at least one (non-key-member) attribute, functionally dependent on this key. For practical reasons, a relation can consist of a key only (if we are not interested in any real attribute), but it must be shown, that there EXISTS an attribute, which depends on this key. I believe, that with this definition every relation is automatically in 4th normal form.

 

Regarding the 5th normal form (page 395), I noticed the following: the relations SPJ, SP, PJ, and JS have something in common; none can be replaced by any others.

 

SP says, "Supplier S supplies part P"

PJ says "part P is used in project J"

SJ says "supplier S supplies for project J"

SPJ says "supplier S supplies part P for project J"

 

I understood the example in a way that we are interested in the fact stated by the binary relations. So "S supplies P" could be a fact, which is never brought in relation with any project. Only if all three form a circle, we have a fourth, the ternary relation.

Now, suppose in every relation we are interested in the first day the fact became known to the database.

 

SP. supplies_since

PJ. used_since

SJ. supplies_since

SPJ. supplies_since

 

Even if at some time SPJ is a fact, we cannot get rid of the three binary relations (and the other way round) because we could loose information. We could know the supplying of a part long before we learn about the supply of this part in a certain project. The four relations can contain 4 different facts, which are linked with a rule (whenever SPJ then SP, PJ and SJ must hold as well).

 

So far I did not find a way to get rid of the redundancy. Reducing the 4 relations to 3 binaries is possible in the special case mentioned in the book, where there are no non-key attributes. As soon as attributes are added to any of the relations, this won't work for any case. The only solution I see at the moment is to have all 4 and to formulate a constraint that guarantees consistency despite the redundancy.

 

As these thoughts are turning in my head for quite a long time now, and I never found anybody who understood enough of 4th and 5th normal form to be able to challenge my views, I would very much appreciate an answer and/or debate about this topic.

 

 

Chris Date Responds:I'll take your points in order.

 

Ø       Since I invented those examples myself, I'd be flattered if they truly were "found in every book," but I frankly doubt whether they are. 

 

Ø       Oh dear.  There are many separate points all muddled together in this objection.  Let me see if I can disentangle them, somewhat. 

 

First off, I wish you would use the terminology of relvars, not relations.  Proper and precise terminology is so important in the kind of discussion we're trying to have. Thus, please consider: To say that something has a certain key is to say that the something in question is subject to a certain constraint.  And to say that something is subject to a constraint, of any kind, is to say that the something in question is--by definition!--a variable. So we're talking about relation variables, not relation values (this latter being what the term "relations," unqualified, means). We allow "relation variable" to be abbreviated to just "relvar" and (as already mentioned) "relation value" to be abbreviated to just "relation." 

 

Now, the relvar (not relation) CTX just is "all key." I'm appealing here to the formal definition of the term "key."  After all, the relational model is a formal system, and all of its components have formal definitions, necessarily. Of course, we hope those formally defined components have a good correspondence with certain informally defined constructs in the real world; to the extent such a correspondence can be found, the relational model will be a good basis for dealing with real-world problems. Do please note, however, that the correspondence will be informal too!--we can't have a formal mapping between something formal and something informal, by definition.  (That really is "by definition," by the way.) 

 

By contrast, you seem to think the term "key" should be used to mean something informal, not formal.  Well, you're at liberty to use the term that way if you want to.  But you're not then at liberty to criticize the relational model because its use of terminology is different from yours! 

 

Even with your informal concept of what a "key" is, I can't believe you really want to insist that what you would probably call "entities" MUST have additional "attributes."  For one example, consider the relvar HOLIDAYS, with just one attribute, DATE, which simply lists all the national holidays for some country.  For another example, consider the relvar FRIENDS with two attributes A and B, with semantics "A and B are friends."  And so on. 

 

"Packing C, T, and X together was senseless in the first place" (paraphrased): Now, that's probably true.  And the discussion in AN INTRODUCTION TO DATABASE SYSTEMS goes on to make exactly that point (see page 394)!  But the point is an informal one ... The whole idea of normalization theory is to come up with formal arguments to bolster up our informal intuitions.  Informally, we can say we think CTX is a bad design. Formally, normalization theory tells us exactly why it's a bad design.  What's more, if we can formalize such matters, we can go on to mechanize them, too (i.e., we can get the machine to do the work). So your objection here really misses the point.  (You're not alone here, by the way--most critics of normalization tend to make exactly the same mistake.) 

 

Ø       Again, very muddled, I'm afraid. Here's a relvar (not a relation!)--with, I trust, self-explanatory semantics--that "consists of a key and at least one (non-key-member) attribute [that is] functionally dependent on this key": 

 

 EMP {EMP#,DEPT#,BUDGET}

 

Here {EMP#} is the key, and the functional dependencies (FDs)

 

{EMP#} _ {DEPT#}

{EMP#} _ {BUDGET}

{DEPT#} _ {BUDGET}

 

all hold--but this relvar isn't even in 3NF, let alone BCNF or 4NF. So your third sentence is wrong. 

 

Also, please note that if a relvar has key {K} and some "non-key-member" attribute A, the FD {K} _ {A} must hold!  So your first sentence leaves something to be desired, too. 

 

I can't make any sense of your second sentence. 

 

Ø       Your text beginning "Regarding the 5th normal form" and ending "The only solution ... is to have all 4 and to formulate a constraint which guarantees consistency despite the redundancy."  I can't follow all of your detailed arguments in this discussion, but I do agree with your overall conclusion.  My book AN INTRODUCTION TO DATABASE SYSTEMS makes the same point--see, for example, the annotation to reference [12.13] or the answer to Exercise 12.3.  The essential point is this:  If all relvars are in 5NF, then there is no redundancy in the database that can be removed by taking projections.  But this fact doesn't mean there's no redundancy!  It just means, to repeat, that there's no redundancy that can be removed by taking projections.  Some redundancies cannot be removed by taking projections. 

 

 

Ed. Comment: There are two very common fallacies in the industry regarding the relationship between redundancy and normalization.

 

·   On the one hand, many think, like SB, that a normalized design cannot have redundancy. Otherwise put, by fully normalizing an undernormalized design, all redundancies are eliminated. It is easy to demonstrate that is not the case: tables that do not have keys--permitted in SQL--can have duplicates, and normalization will not eliminate this kind of redundancy.

 

·   On the other hand, some believe that any procedure that eliminates any redundancy is normalization (see, for example, On Denormalization and Repeating Groups). This is obviously incorrect too. Raising normalization levels can eliminate only redundancies due to certain types of inter-column dependencies.

 

For a discussion of all the possible kinds of redundancy, including those due to undernormalization, and how they can be eliminated see the chapters on redundancy and normalization in my PRACTICAL ISSUES IN DATABASE MANAGEMENT.

 

Posted 02/14/03

 

 

 

[ABOUT] [QUOTES] [LINKS]