From: Dan-Andrei Sitar-Tăut
Date: Oct 31 2005
The reason for writing this letter is some concerns that I
have related to database normalization issues. Two years ago, I was deeply
involved into this subject. One of the problems is related to EKNF and the
second one to 3NF.
1. EKNF: It is told that it would be stronger than 3NF (?).
Let’s take some classical examples:
DEPT, MNGR, ACC# or STUDENTNUM, SUBJECTCODE, SUBJECTTITLE
DEPT and MNGR are replaceable
because DEPT -> MNGR, but also MNGR -> DEPT. None of them is candidate
key. In these circumstances, we will have a candidate (primary) key such as
DEPT, ACC#.
This relation is in 1NF but it
does not comply with 2NF because an non-fully functional dependency is occurred
here: DEPT, ACC# ->MNGR (due DEPT
-> MNGR). Why should we check for EKNF because not even 2NF is accomplished
here?
2. 3NF: Let’s take a little bit artificial example:
IDADDRESS, CITY, STREET, STREET#, ZIPCODE.
We have only different addresses here. IdAddress is the
primary key. We have a transitive dependency: City, Street, Street# ->
ZipCode. This means that we have to split, for the beginning, the original
relation in the following two: (IdAddress, City, Street, Street#) and (City,
Street, Street#, ZipCode). Because we will rarely have duplicates according to
the zip code (we do not intent to cover all possible addresses), should we
really split the original relation in order to bring it into 3NF? The
redundancy will be enormous. If we would have ZipCode -> City, Street,
Street# the things would be OK. Have I missed something?
Of course, I do have other questions related to
normalization, but I don’t think that the moment is chosen properly. This
assertion stands also for my questions.
If you have the chance to read this and if you consider that
my requests are not inadequate, please explain me briefly what mistakes I have
committed.
P.S. Related to the First Great Blunder, if relvar would be
equal to class object, then as we have the RDB (Relational DataBase) then the
corresponding name for OODB should be … CDB, where “C” is from “class”,
“classed” or something. Almost nothing academic for this, let’s say,
“etymological” assertion. I know it is only a naming coincidence; in fact a
mis-coincidence.
Dan Sitar
From: C. J. Date
I'll reformulate your questions here in order to make this
response more self-contained. First, you give a relvar (call it R) with
attributes DEPT, MNGR, and ACC#, in which the following FDs hold (I think it's
a good idea always to show the left and right sides of FDs (as well as keys)
enclosed in braces, to stress the point that we're talking about sets of
attributes in every case):
{ DEPT } -> { MNGR} { MNGR } -> { DEPT}
There are two candidate keys: { DEPT, ACC# } and { MNGR, ACC#
}. And you ask (paraphrasing): Why should we check R for EKNF when it's not
even in 2NF?
Well, actually your latter statement is incorrect: R is in
2NF, and in fact in 3NF also. That's because those normal forms as precisely
defined in E. F. Codd, Further Normalization of the Data Base Relational
Model, in Randall J. Rustin (ed.), DATA BASE SYSTEMS, Courant
Computer Science Symposia Series 6. Englewood Cliffs, N.J.: Prentice Hall
(1972)—not as they're usually, and loosely, defined in most of the literature—
don't require an attribute to be irreducibly (or
"non-fully") dependent on each key if the attribute in question is
itself a component of some key of the relvar. (Most people don't know this!)
Thus, for example, the fact that DEPT isn't irreducibly dependent on
{MNGR,ACC#} doesn't of itself mean that R isn't in 2NF or 3NF. See my book
AN INTRODUCTION TO DATABASE SYSTEMS.
So R is in 2NF, and indeed 3NF also; however, it is of course
not in BCNF, and so the discipline of BCNF would suggest that it be decomposed
into projections appropriately. Note: You actually ask about EKNF, not BCNF
(EKNF being a sort of halfway house between 3NF and BCNF), but I think BCNF is
the real issue here. It's my impression that nobody bothers much with EKNF
these days. I could be wrong.
Turning to your second question: You give a relvar (again
I'll call it R) with attributes IDADDRESS, CIT¥, STREET, STREET#, AND ZIPCODE,
in which the following FD holds:
{ CITY, STREET, STREET# } -> { ZIPCODE }
The sole key is { IDADDRESS}. And you go on to say
(paraphrasing): The discipline of 3NF would suggest that R be decomposed into
its projections on { IDADDRESS, CITY, STREET, STREET# } and { CITY, STREET,
STREET#, ZIPCODE}. But should we really perform this decomposition "in
order to [achieve] 3NF"-since, as you observe, "the redundancy will
be enormous" if we do?
By the way, you say this FD is "transitive," but
your use of this term is incorrect. If
A -> B and B -> C, then it's the FD A -> C that's transitive, not the
FD B -> C. There are several things I want to say here:
·
First, I hope you understand-I'm sure you do-that the
existing relvar R without this decomposition (but with the stated FD) certainly
isn't in 3NF. So if you want to achieve 3NF, you certainly have to do
something.
·
Second, the point doesn't seem to be widely appreciated
that normal forms are defined only with respect to a certain given set of
dependencies (see AN INTRODUCTION TO DATABASE SYSTEMS, page 361,
footnote); in other words, normalization should be carried out with respect to
all relevant dependencies, which isn't necessarily the same thing as all
dependencies. You don't have to state a given FD if you really don't care about
it very much (in particular, if you don't care whether it's ever violated, or
if you have some other mechanism in place for guaranteeing that it won't be).
In the case at hand, it might be perfectly reasonable not to specify the FD
{ CITY, STREET, STREET# } -> { ZIPCODE }
if you don't want to. In which
case, R will be in 3NF after all (absent any other explicitly specified FDs).
·
Third, if you do care about the FD, it might well be a
good idea to introduce a surrogate, CSS# say, to stand for the combination of
City, street, and Street#. Then your 3NF design might consist of three relvars,
thus:
Rl { idaddress, css# } KEY { idaddress }
R2 { css#, city, street, street# } KEY { css# }
KEY { city, street, street# }
R3 { city, street, street#, zipcode } KEY { city, street, street#
}
Now the redundancy goes away. But of course normalization as
such has nothing to say about the use of surrogates to reduce redundancy in
such a manner. While it's true that normalization is the one major piece of
science available to us in the endeavor of database design, the sad fact is
that it doesn't even begin to address the majority of design issues.
As I've written elsewhere: We need more science!
From: Dan-Andrei Sitar-Tăut
I read the 8th edition of your book. It is remarkable. I used
the Romanian version whose preface was written by Prof. Marin Fotache, a big
theorist and also practitioner in relational database field. Because I did not
have the original version I could not find exactly the paragraph you mentioned.
Anyway, in normalization process the functional dependencies have to not be
studied when the determined members could be the candidate keys. This is
briefly what you wanted to say, isn’t it? On the other hand, there are serious
problems related to 2NF in literature. I think 90% of papers debate the
particular case, that one when we have a single candidate key (the primary
one). The definitions do not entirely cover what 2NF really is.
Related to the second issue, I had already ignored those
functional dependencies which have multi-determinants. I ignored them, as also
my scientific advisor – Prof. Ştefan Niţchi – suggested before. I
apologize for the lack in expression when I wrote “transitive dependency”. I
know exactly what it is.
The thesis is entitled Database Distributed Systems. A
part of its content is based on 16 published papers and a book. Also, the
thesis is now already published. Briefly, it states that the relational model
is the most reliable for supporting a distributed system. I took a very “sick”
older database containing the students from our faculty. There were a lot of
inconsistencies there. Thus, I had to normalize it (not in all 11 NFs mentioned
in thesis). Even if I had enriched its (database’s) semantic content, bringing
it into 3NF caused a more than 23 times decrease of database size. Also the
main causes for update anomalies disappeared. Of course, according to physical
implementation, the normalized tables do not have only (straight or indirect)
advantages. Even if normalization is a theoretical concept I had to intercalate
some pragmatic aspects. Resulted tables were then nominated for distribution
due to some certain facts. I made a mixed fragmentation and a replicated
allocation design according to some operational criteria. The distributed
database systems are not, of course, the universal solution. In my opinion, the
current trend is developing federative database systems. For now I have
considered that in our case a distributed/federated Web integrated system would
be OK. As future directions I have proposed some warehousing, OLAP, or data
mining extensions.
Even if we pleaded for relational databases I have also
talked about OO systems, and also about so-called object-relational ones. I
pointed out the good and also the bad sides of all these systems.
From: C. J. Date
In my previous letter, I referred you to my book AN
INTRODUCTION TO DATABASE SYSTEMS, 8th edition, page 369.
As it turns out, the text I was referring to
is also on page 369 of the Romanian translation; it's the paragraph beginning
just above the middle of the page.
You say: "[The]
functional dependencies have to not be studied when the determined members
could be the candidate keys." If
my understanding of this sentence is accurate, it's not quite correct.
What I think it should say is: "2NF and 3NF do not exclude FDs in
which attributes on the right hand side [which is what I think you mean when
you refer to determined members] are components of a candidate key."
But you're quite right in saying there are
serious problems related to 2NF in the literature. Of course,
I don't really think 2NF is very interesting, anyway
(and the same goes for 3NF); in fact, I now believe the only important normal
forms are BCNF and 6NF (and/or possibly 5NF).
Turning to what you call "the second issue":
I don't really understand what you mean by
"functional dependencies which have multi-determinants"—but I'm quite
happy to ignore the matter if you're comfortable with it.
From: Dan-Andrei Sitar-Tăut
You are right about the paragraph from page #369. Is that
note that stipulates the existence of overridden candidate keys? Those keys
contain replaceable (interchangeable) attributes – such DEPT and MNGR – and the
other attributes from these keys are identical in each case. I have had to
admit that I did not count this rule and I have been also convinced that 90%
from literature did the same thing. Generally it is told that the problem of
candidate keys is brought into discussion for the first time only in the case
of BCNF study. Maybe I have exaggerated a little bit that 90%. Anyway, I’m
surprised how these people could arrive to test the 3NF, sometimes EKNF, and
BCNF compliance if they do not know this rule. Without considering this, a lot
of relvars would be restructured when they are brought in the 2NF and further
normalization process becomes objectless. It is another reason for clarifying
the 2NF’s definition. Of course it is my opinion and probably I’m wrong.
I noticed that you rated the same normal forms – BCNF and
5/6NF – as Mr. Darwen did. Of course, this fact has not to surprise me because
you are “old army comrades”. When your new co-production book will see the
sunlight? For my work the 1NF, 2NF, and 3NF provided pretty good results. That
23:1 reduction could have been more important if I had not been also interested
enriching the informational content of the database.
Sorry for the unconventional expression “functional
dependencies which have multi-determinants”. Remember I warned you in regard
with my “over 10,000 miles far” English.;-) I think my “term” can be better
explained with this example:
Attributei, Attributek, Attributej à Attributel
Attributei, Attributek, Attributej is
the “multi-determinant” and I debated the case when it is neither the primary
nor a candidate key.
What is your opinion related to the federated systems? I know
that the distributed systems have stronger theoretical foundations but there
are many cases when they cannot be practically applied or they are proved as
inefficient. That’s why we are looking for alternative solutions.
Posted 2/17/06
© Fabian Pascal 2006 All Rights Reserved