From: SB
To: Editor
Since I am dealing
with relational databases, I come across the 4th and 5th normal forms, which I
never believed really existed. After
the visit of Fabian Pascal's Zurich seminar, I bought and read Fabian's PRACTICAL
ISSUES IN DATABASE MANAGEMENT and Chris' INTRODUCTION TO DATABASE
SYSTEMS. In both the 4th and 5th normal form is presented with the very
same examples found in every book, which mentions them.
I am very interested
in your opinion to the following:
On page 390 in INTRODUCTION
TO DATABASE SYSTEMS the example for a violation of the 4th normal form is a
relation CTX containing course (C), teacher (T), and text (X). It is said in the text that the relation is
“all key”.
I doubt that: A key by
definition has to be the identifier of something. Something must be accessible
('opened') via this key. Even if we are not interested in (storing) the
attribute identified by this key, it must exist, otherwise it makes no sense at
all to call something a key. I was not able to find an attribute, which is
identified by the triple CTX. So the process of packing the three attributes
together into one relation and declaring them as 'key' was senseless and
therefore in my opinion wrong from the beginning.
In my definition a
relation consists of a key and at least one (non-key-member) attribute,
functionally dependent on this key. For practical reasons, a relation can
consist of a key only (if we are not interested in any real attribute), but it
must be shown, that there EXISTS an attribute, which depends on this key. I
believe, that with this definition every relation is automatically in 4th
normal form.
Regarding the 5th
normal form (page 395), I noticed the following: the relations SPJ, SP, PJ, and
JS have something in common; none can be replaced by any others.
SP says,
"Supplier S supplies part P"
PJ says "part P
is used in project J"
SJ says "supplier
S supplies for project J"
SPJ says
"supplier S supplies part P for project J"
I understood the
example in a way that we are interested in the fact stated by the binary
relations. So "S supplies P" could be a fact, which is never brought
in relation with any project. Only if all three form a circle, we have a
fourth, the ternary relation.
Now, suppose in every
relation we are interested in the first day the fact became known to the
database.
SP. supplies_since
PJ. used_since
SJ. supplies_since
SPJ. supplies_since
Even if at some time
SPJ is a fact, we cannot get rid of the three binary relations (and the other
way round) because we could loose information. We could know the supplying of a
part long before we learn about the supply of this part in a certain project.
The four relations can contain 4 different facts, which are linked with a rule
(whenever SPJ then SP, PJ and SJ must hold as well).
So far I did not find
a way to get rid of the redundancy. Reducing the 4 relations to 3 binaries is
possible in the special case mentioned in the book, where there are no non-key
attributes. As soon as attributes are added to any of the relations, this won't
work for any case. The only solution I see at the moment is to have all 4 and
to formulate a constraint that guarantees consistency despite the redundancy.
As these thoughts are
turning in my head for quite a long time now, and I never found anybody who
understood enough of 4th and 5th normal form to be able to challenge my views,
I would very much appreciate an answer and/or debate about this topic.
Chris Date
Responds:I'll take
your points in order.
Ø
Since I invented those examples myself, I'd be
flattered if they truly were "found in every book," but I frankly
doubt whether they are.
Ø
Oh dear. There
are many separate points all muddled together in this objection. Let me see if I can disentangle them,
somewhat.
First off, I wish you would use the terminology of relvars,
not relations. Proper and precise
terminology is so important in the kind of discussion we're trying to have.
Thus, please consider: To say that something has a certain key is to say
that the something in question is subject to a certain constraint. And to say that something is subject to a
constraint, of any kind, is to say that the something in question is--by
definition!--a variable. So we're talking about relation variables, not
relation values (this latter being what the term "relations,"
unqualified, means). We allow "relation variable" to be abbreviated
to just "relvar" and (as already mentioned) "relation
value" to be abbreviated to just "relation."
Now, the relvar (not relation) CTX just is "all
key." I'm appealing here to the formal definition of the term
"key." After all, the
relational model is a formal system, and all of its components have formal
definitions, necessarily. Of course, we hope those formally defined components
have a good correspondence with certain informally defined constructs in the
real world; to the extent such a correspondence can be found, the relational
model will be a good basis for dealing with real-world problems. Do please note,
however, that the correspondence will be informal too!--we can't have a
formal mapping between something formal and something informal, by
definition. (That really is "by
definition," by the way.)
By contrast, you seem to think the term "key" should
be used to mean something informal, not formal. Well, you're at liberty to use the term that way if you want
to. But you're not then at
liberty to criticize the relational model because its use of terminology is
different from yours!
Even with your informal concept of what a "key" is,
I can't believe you really want to insist that what you would probably call
"entities" MUST have additional "attributes." For one example, consider the relvar
HOLIDAYS, with just one attribute, DATE, which simply lists all the national
holidays for some country. For another
example, consider the relvar FRIENDS with two attributes A and B, with
semantics "A and B are friends."
And so on.
"Packing C, T, and
X together was senseless in the first place" (paraphrased): Now, that's probably true. And the discussion in AN INTRODUCTION TO
DATABASE SYSTEMS goes on to make exactly that point (see page
394)! But the point is an informal
one ... The whole idea of normalization theory is to come up with formal
arguments to bolster up our informal intuitions. Informally, we can say we think CTX
is a bad design. Formally, normalization theory tells us exactly why
it's a bad design. What's more, if we
can formalize such matters, we can go on to mechanize them, too (i.e.,
we can get the machine to do the work). So your objection here really misses
the point. (You're not alone here, by
the way--most critics of normalization tend to make exactly the same mistake.)
Ø
Again, very muddled, I'm afraid. Here's a relvar (not a
relation!)--with, I trust, self-explanatory semantics--that "consists of a
key and at least one (non-key-member) attribute [that is] functionally
dependent on this key":
EMP {EMP#,DEPT#,BUDGET}
Here {EMP#} is the key,
and the functional dependencies (FDs)
{EMP#} _ {DEPT#}
{EMP#} _ {BUDGET}
{DEPT#} _ {BUDGET}
all hold--but this relvar
isn't even in 3NF, let alone BCNF or 4NF. So your third sentence is wrong.
Also, please note that if
a relvar has key {K} and some "non-key-member" attribute A,
the FD {K} _ {A} must hold! So your first sentence leaves something to be desired, too.
I can't make any sense of
your second sentence.
Ø
Your text beginning "Regarding the 5th normal
form" and ending "The only solution ... is to have all 4 and to
formulate a constraint which guarantees consistency despite the
redundancy." I can't follow all of
your detailed arguments in this discussion, but I do agree with your overall
conclusion. My book AN INTRODUCTION TO
DATABASE SYSTEMS makes the same point--see, for example, the annotation
to reference [12.13] or the answer to Exercise 12.3. The essential point is this:
If all relvars are in 5NF, then there is no redundancy in the database
that can be removed by taking projections.
But this fact doesn't mean there's no redundancy! It just means, to repeat, that there's no
redundancy that can be removed by taking projections. Some redundancies cannot be removed by taking projections.
Ed.
Comment: There are two very
common fallacies in the industry regarding the relationship between redundancy
and normalization.
·
On the one hand, many think, like SB, that a normalized
design cannot have redundancy. Otherwise put, by fully normalizing an
undernormalized design, all redundancies are eliminated. It is easy to
demonstrate that is not the case: tables that do not have keys--permitted in
SQL--can have duplicates, and normalization will not eliminate this kind of
redundancy.
·
On the other hand, some believe that any
procedure that eliminates any redundancy is normalization (see, for
example, On Denormalization
and Repeating Groups).
This is obviously incorrect too. Raising normalization levels can eliminate
only redundancies due to certain types of inter-column dependencies.
For a discussion of all the possible kinds of
redundancy, including those due to undernormalization, and how they can be
eliminated see the chapters on redundancy and normalization in my PRACTICAL ISSUES IN
DATABASE MANAGEMENT.
Posted
02/14/03
[ABOUT]
[QUOTES]
[LINKS]