From: Lajos Nagy
Date: 20 Jul 2005
Have you ever heard of Dr. Halassy?
He's a Hungarian and used to argue a lot with E.F. Codd about
databases back in the 70s. I read his
books on databases, which are considered the best in the field in Hungary, and
I came upon an interesting idea of his concerning relational normal forms. In a
nutshell, he argues that the only thing we should look out for in our relations
are non-trivial transitive functional dependencies. These transitive dependencies are exactly the ones that cause all the well-known problems of not-normalized relations. Basically he generalizes the notions of 2NF, 3NF, BCNF, and replaces them with a single check for transitive dependencies. In this approach 2NF, 3NF, BCNF are not successive (or higher) normal forms that define a hierarchy but rather
different names for the same problem: transitive dependencies.
More formally:
Let R be a relation schema and A, B, C be subsets of
attributes, not necessarily disjoint, of R.
Furthermore, let A be a candidate key of R. There's a transitive
functional dependency (TFD) on R if the FDs: A -> B, B -> C, and A ->
C all hold on R.
Each FD in the TFD can be either trivial or non-trivial
FD. (Just for completeness: the FD A -> B is trivial if B is a subset of A, otherwise it's non-trivial.) This gives rise to eight possible situations that we have to consider.
(There are three FDs and each can be either trivial or non-trivial so that's 2^3 = 8.)
Let's "T" mark trivial and "N" non-trivial. This way a situation
can be described with a three letter sequence: "TNN" for example describes the situation
where A -> B is trivial, B -> C is non-trivial, and A -> C is
non-trivial.
Let's analyze each case in turn:
TTT - only trivial FDs, no problem.
TTN - impossible
TNT - not in BCNF
TNN - not in 2NF
NTT - equivalent to a single FD, A -> (B - C),
so no real transitivity
NTN - equivalent to a single FD, A -> B, so no
real transitivity
NNT - not in 3NF
NNN - not in 3NF
Basically, if we can find a TFD of type TNT, TNN, NNT, or NNN
in our schema then it is not normalized.
Using this generalized notion we can succinctly define the
"traditional" normal forms:
2NF = no TNN
3NF = 2NF and no NNT and no NNN
BCNF = 3NF and no TNT
One advantage of Dr. Halassy's approach is that it does away
with the clumsy 'key', 'non-key' distinction between attributes in the
definitions of the normal forms, presents them in a systematic and terse way,
and emphasizes the fact that the very existence of non-trivial transitive
functional dependencies lies at the heart of the problems with non-normalized
relations.
So, what do you think?
C. J. Date Responds: I would like to offer a few observations
on the proposed alternative perspective on normalization.
First an obvious point: Whatever the advantages and
disadvantages of that alternative perspective might be, it appears to be of no
help in connection with normal forms higher than BCNF (4NF, 5NF, 6NF), nor with
JDs or MVDs that are not FDs.
Second, I agree
that the classical definitions of 2NF and 3NF involve a "clumsy"
distinction between key and nonkey attributes, but I don't agree that the same
criticism applies to the classical definition of BCNF. BCNF is the normal form with respect
to functional dependencies, and the classical definition is as follows:
Relvar* R is in BCNF if and only if, for
every nontrivial FD A -> B satisfied by R, A is
a superkey for R.
(See my recent book DATABASE IN DEPTH: RELATIONAL THEORY
FOR PRACTITIONERS for further discussion.) This definition is simple,
straightforward, and elegant, and it
makes no mention of key vs. nonkey attributes; in fact, it also makes no
mention of those other "clumsy" concepts, reducible FDs and
transitive FDs, either. Thus, while I
do support the goal of making life easier for the database designer by reducing
the number of concepts he or she has to deal with—which Halassy's idea of
"nontrivial transitive FDs" seems to be all about—I think the
classical definition of BCNF does the job pretty well, too.
Third, I have some
specific issues with the proposed formalism.
Here's Halassy's definition of transitive FD, quoted verbatim:
Let R be a relation schema and A, B, C be subsets
of attributes, not necessarily disjoint, of R. Furthermore, let A be a candidate key of R. There's a transitive FD (TFD) on R if
the FDs: A -> B, B -> C, and A -> C
all hold on R.
First issue:
If A is a key, the FDs A -> B and A ->
C hold necessarily! Thus, the
definition reduces to saying there's a TFD if B -> C
holds. But B and C are
arbitrary subsets of the attributes of R, so the definition apparently
says that if R satisfies any FD at all, then it satisfies a TFD—which in
turn implies that every "relation schema" R satisfies a
TFD, since R certainly satisfies the FD H -> H, where H
is the set of all of the attributes of R. (Actually I suppose that's strictly correct. If A is any subset of the set of
attributes of R, we certainly have A -> A and A
-> A, and so A -> A is a TFD. But there must be more to the concept than
that. In other words, my first issue
is: I think the definition of TFD needs
to be tightened up.)
Second issue:
The analysis on the second page of Nagy's communication does not seem to
be completely correct. Consider the
usual suppliers-and-parts database.
Suppose supplier numbers and supplier names are both "unique,"
and consider the relvar
S { S#, SNAME, CITY }
Now take R as S and A, B, C as S#, SNAME, CITY,
respectively (note that A is indeed a key, as required). Then A -> B, B ->
C, and A -> C are all nontrivial; yet the relvar is in
3NF (and in fact BCNF), contrary to the claim that "all three FDs
nontrivial implies the relvar is not in 3NF."
Third issue:
The analysis on the second page of Nagy's communication seems to be
incorrect in another respect also.
Consider the relvar SJT with attributes S (student), J (subject), and T
(teacher), with predicate: Student S
is taught subject J by teacher T.
Assume that:
·
For each subject, each student of that subject is
taught by only one teacher.
·
Each teacher teaches only one subject (but each subject
is taught by several teachers).
This relvar satisfies the FDs {S,J} -> {T} and {T} ->
{J}. Now take R as SJT and B,
C, A as T, J, and the combination of S and J, respectively (note that A
is indeed a key, as required). Then A
-> B and B -> C are nontrivial while A -> C
is trivial; yet the relvar is in 3NF (though not BCNF), contrary to the claim
that "A -> B and B -> C nontrivial and A
-> C trivial implies the relvar is not in 3NF."
Fourth issue:
It's claimed that "if we can find a TFD of type TNT, TNN, NNT, or
NNN in our schema then it [i.e., the schema, not the TFD] is not
normalized." Two points:
·
The earlier "definition" of TFD doesn't
actually say what a TFD is! It says a
TFD exists if certain conditions obtain, but it doesn't say what it is that
constitutes the TFD in question. So,
strictly speaking, we haven't been told what "a TFD of type TNT"
(etc.) actually means. Nor (with
reference to the earlier claim regarding "nontrivial TFDs") do we
know, strictly speaking, what it means for a TFD to be trivial or
otherwise.
·
Assuming the definition of TFD is tightened up
appropriately and the analysis corrected in the light of (at least) the second
and third issues identified above, I now observe that the term
"normalized" as used in the quoted claim refers specifically to BCNF. Since there are so many distinct normal
forms (at least six of them, not counting 1NF), I think it's advisable to avoid
use of the unqualified term "normalized," at least in formal
contexts.
Posted 9/23/05