ON NORMALIZATION AND DISTRIBUTED DATABASES
with C. J. Date

 

 

 

From: Dan-Andrei Sitar-Tăut

Date: Oct 31 2005

 

The reason for writing this letter is some concerns that I have related to database normalization issues. Two years ago, I was deeply involved into this subject. One of the problems is related to EKNF and the second one to 3NF.

 

1. EKNF: It is told that it would be stronger than 3NF (?). Let’s take some classical examples:

 

DEPT, MNGR, ACC# or STUDENTNUM, SUBJECTCODE, SUBJECTTITLE

 

DEPT and MNGR are replaceable because DEPT -> MNGR, but also MNGR -> DEPT. None of them is candidate key. In these circumstances, we will have a candidate (primary) key such as DEPT, ACC#.

This relation is in 1NF but it does not comply with 2NF because an non-fully functional dependency is occurred here:  DEPT, ACC# ->MNGR (due DEPT -> MNGR). Why should we check for EKNF because not even 2NF is accomplished here?

 

2. 3NF: Let’s take a little bit artificial example:

 

IDADDRESS, CITY, STREET, STREET#, ZIPCODE.

 

We have only different addresses here. IdAddress is the primary key. We have a transitive dependency: City, Street, Street# -> ZipCode. This means that we have to split, for the beginning, the original relation in the following two: (IdAddress, City, Street, Street#) and (City, Street, Street#, ZipCode). Because we will rarely have duplicates according to the zip code (we do not intent to cover all possible addresses), should we really split the original relation in order to bring it into 3NF? The redundancy will be enormous. If we would have ZipCode -> City, Street, Street# the things would be OK. Have I missed something?

 

Of course, I do have other questions related to normalization, but I don’t think that the moment is chosen properly. This assertion stands also for my questions.

 

If you have the chance to read this and if you consider that my requests are not inadequate, please explain me briefly what mistakes I have committed.

 

P.S. Related to the First Great Blunder, if relvar would be equal to class object, then as we have the RDB (Relational DataBase) then the corresponding name for OODB should be … CDB, where “C” is from “class”, “classed” or something. Almost nothing academic for this, let’s say, “etymological” assertion. I know it is only a naming coincidence; in fact a mis-coincidence.

Dan Sitar

 

 

From: C. J. Date

 

I'll reformulate your questions here in order to make this response more self-contained. First, you give a relvar (call it R) with attributes DEPT, MNGR, and ACC#, in which the following FDs hold (I think it's a good idea always to show the left and right sides of FDs (as well as keys) enclosed in braces, to stress the point that we're talking about sets of attributes in every case):

 

{ DEPT } -> { MNGR} { MNGR } -> { DEPT}

 

There are two candidate keys: { DEPT, ACC# } and { MNGR, ACC# }. And you ask (paraphrasing): Why should we check R for EKNF when it's not even in 2NF?

 

Well, actually your latter statement is incorrect: R is in 2NF, and in fact in 3NF also. That's because those normal forms as precisely defined in E. F. Codd, Further Normalization of the Data Base Relational Model, in Randall J. Rustin (ed.), DATA BASE SYSTEMS, Courant Computer Science Symposia Series 6. Englewood Cliffs, N.J.: Prentice Hall (1972)—not as they're usually, and loosely, defined in most of the literature—

don't require an attribute to be irreducibly (or "non-fully") dependent on each key if the attribute in question is itself a component of some key of the relvar. (Most people don't know this!) Thus, for example, the fact that DEPT isn't irreducibly dependent on {MNGR,ACC#} doesn't of itself mean that R isn't in 2NF or 3NF. See my book AN INTRODUCTION TO DATABASE SYSTEMS.

 

So R is in 2NF, and indeed 3NF also; however, it is of course not in BCNF, and so the discipline of BCNF would suggest that it be decomposed into projections appropriately. Note: You actually ask about EKNF, not BCNF (EKNF being a sort of halfway house between 3NF and BCNF), but I think BCNF is the real issue here. It's my impression that nobody bothers much with EKNF these days. I could be wrong.

 

Turning to your second question: You give a relvar (again I'll call it R) with attributes IDADDRESS, CIT¥, STREET, STREET#, AND ZIPCODE, in which the following FD holds:

 

{ CITY, STREET, STREET# } -> { ZIPCODE }

 

The sole key is { IDADDRESS}. And you go on to say (paraphrasing): The discipline of 3NF would suggest that R be decomposed into its projections on { IDADDRESS, CITY, STREET, STREET# } and { CITY, STREET, STREET#, ZIPCODE}. But should we really perform this decomposition "in order to [achieve] 3NF"-since, as you observe, "the redundancy will be enormous" if we do?

 

By the way, you say this FD is "transitive," but your use of  this term is incorrect. If A -> B and B -> C, then it's the FD A -> C that's transitive, not the FD B -> C. There are several things I want to say here:

 

·   First, I hope you understand-I'm sure you do-that the existing relvar R without this decomposition (but with the stated FD) certainly isn't in 3NF. So if you want to achieve 3NF, you certainly have to do something.

 

·   Second, the point doesn't seem to be widely appreciated that normal forms are defined only with respect to a certain given set of dependencies (see AN INTRODUCTION TO DATABASE SYSTEMS, page 361, footnote); in other words, normalization should be carried out with respect to all relevant dependencies, which isn't necessarily the same thing as all dependencies. You don't have to state a given FD if you really don't care about it very much (in particular, if you don't care whether it's ever violated, or if you have some other mechanism in place for guaranteeing that it won't be). In the case at hand, it might be perfectly reasonable not to specify the FD

 

{ CITY, STREET, STREET# } -> { ZIPCODE }

 

if you don't want to. In which case, R will be in 3NF after all (absent any other explicitly specified FDs).

 

·   Third, if you do care about the FD, it might well be a good idea to introduce a surrogate, CSS# say, to stand for the combination of City, street, and Street#. Then your 3NF design might consist of three relvars, thus:

 

Rl { idaddress, css# } KEY { idaddress }

R2 { css#, city, street, street# } KEY { css# }

KEY { city, street, street# }

R3 { city, street, street#, zipcode } KEY { city, street, street# }

 

Now the redundancy goes away. But of course normalization as such has nothing to say about the use of surrogates to reduce redundancy in such a manner. While it's true that normalization is the one major piece of science available to us in the endeavor of database design, the sad fact is that it doesn't even begin to address the majority of design issues.

 

As I've written elsewhere: We need more science!

 

 

From: Dan-Andrei Sitar-Tăut

 

I read the 8th edition of your book. It is remarkable. I used the Romanian version whose preface was written by Prof. Marin Fotache, a big theorist and also practitioner in relational database field. Because I did not have the original version I could not find exactly the paragraph you mentioned. Anyway, in normalization process the functional dependencies have to not be studied when the determined members could be the candidate keys. This is briefly what you wanted to say, isn’t it? On the other hand, there are serious problems related to 2NF in literature. I think 90% of papers debate the particular case, that one when we have a single candidate key (the primary one). The definitions do not entirely cover what 2NF really is.

 

Related to the second issue, I had already ignored those functional dependencies which have multi-determinants. I ignored them, as also my scientific advisor – Prof. Ştefan Niţchi – suggested before. I apologize for the lack in expression when I wrote “transitive dependency”. I know exactly what it is. 

 

The thesis is entitled Database Distributed Systems. A part of its content is based on 16 published papers and a book. Also, the thesis is now already published. Briefly, it states that the relational model is the most reliable for supporting a distributed system. I took a very “sick” older database containing the students from our faculty. There were a lot of inconsistencies there. Thus, I had to normalize it (not in all 11 NFs mentioned in thesis). Even if I had enriched its (database’s) semantic content, bringing it into 3NF caused a more than 23 times decrease of database size. Also the main causes for update anomalies disappeared. Of course, according to physical implementation, the normalized tables do not have only (straight or indirect) advantages. Even if normalization is a theoretical concept I had to intercalate some pragmatic aspects. Resulted tables were then nominated for distribution due to some certain facts. I made a mixed fragmentation and a replicated allocation design according to some operational criteria. The distributed database systems are not, of course, the universal solution. In my opinion, the current trend is developing federative database systems. For now I have considered that in our case a distributed/federated Web integrated system would be OK. As future directions I have proposed some warehousing, OLAP, or data mining extensions.

 

Even if we pleaded for relational databases I have also talked about OO systems, and also about so-called object-relational ones. I pointed out the good and also the bad sides of all these systems.

 

From: C. J. Date

 

In my previous letter, I referred you to my book AN INTRODUCTION TO DATABASE SYSTEMS, 8th edition, page 369.  As it turns out, the text I was referring to is also on page 369 of the Romanian translation; it's the paragraph beginning just above the middle of the page. 

 

You say:  "[The] functional dependencies have to not be studied when the determined members could be the candidate keys."  If my understanding of this sentence is accurate, it's not quite correct.  What I think it should say is:  "2NF and 3NF do not exclude FDs in which attributes on the right hand side [which is what I think you mean when you refer to determined members] are components of a candidate key."  But you're quite right in saying there are serious problems related to 2NF in the literature.  Of course, I don't really think 2NF is very interesting, anyway (and the same goes for 3NF); in fact, I now believe the only important normal forms are BCNF and 6NF (and/or possibly 5NF). 

 

Turning to what you call "the second issue":  I don't really understand what you mean by "functional dependencies which have multi-determinants"—but I'm quite happy to ignore the matter if you're comfortable with it. 

 

From: Dan-Andrei Sitar-Tăut

 

You are right about the paragraph from page #369. Is that note that stipulates the existence of overridden candidate keys? Those keys contain replaceable (interchangeable) attributes – such DEPT and MNGR – and the other attributes from these keys are identical in each case. I have had to admit that I did not count this rule and I have been also convinced that 90% from literature did the same thing. Generally it is told that the problem of candidate keys is brought into discussion for the first time only in the case of BCNF study. Maybe I have exaggerated a little bit that 90%. Anyway, I’m surprised how these people could arrive to test the 3NF, sometimes EKNF, and BCNF compliance if they do not know this rule. Without considering this, a lot of relvars would be restructured when they are brought in the 2NF and further normalization process becomes objectless. It is another reason for clarifying the 2NF’s definition. Of course it is my opinion and probably I’m wrong.

 

I noticed that you rated the same normal forms – BCNF and 5/6NF – as Mr. Darwen did. Of course, this fact has not to surprise me because you are “old army comrades”. When your new co-production book will see the sunlight? For my work the 1NF, 2NF, and 3NF provided pretty good results. That 23:1 reduction could have been more important if I had not been also interested enriching the informational content of the database.

 

Sorry for the unconventional expression “functional dependencies which have multi-determinants”. Remember I warned you in regard with my “over 10,000 miles far” English.;-) I think my “term” can be better explained with this example:

 

Attributei, Attributek, Attributej à Attributel

 

Attributei, Attributek, Attributej is the “multi-determinant” and I debated the case when it is neither the primary nor a candidate key.

 

What is your opinion related to the federated systems? I know that the distributed systems have stronger theoretical foundations but there are many cases when they cannot be practically applied or they are proved as inefficient. That’s why we are looking for alternative solutions.

 

 

Posted 2/17/06

© Fabian Pascal 2006 All Rights Reserved