As a featured expert on relational/database fundamentals at searchdatabase.techtarget.com,I received the following question:
“Data Normalization. Can it be automated? Two questions, really:
·
Are you aware of any algorithm or utility that can be
used to verify whether my database model is in 3NF?
·
Is there an algorithm that can be used to convert a
bunch of records into a set of normalized tables?”
I asked Chris Date to respond because he published on this
very subject. He has provided a detailed and rigorous response, a short version
of which I posted on that site. His complete answer follows; my comments are
prefixed with FP).
Chris Date Responds: The short answer to both
questions is yes. For a much
longer answer, please read on!
I think the first thing I need to do is examine--and, I hope,
clarify--some of the terminology used in the two questions.
Ø
"Database
model": The term model is
grotesquely overused, and abused, in the IT world (see Models,
Models, Everywhere, Nor Any Time to Think. The term database model
in particular is usually used to refer to a generic model that's applicable to
databases in general (the relational model is the obvious example). I think database model in the
first question
would be better as database design (and the design in question must be understood
to be a logical one, not a physical one).
And I think it would be better still as simply database.
Ø "In
3NF": Strictly speaking, we can't
say a given database is or is not in 3NF--3NF is something that is or is not
satisfied by a given table. A given
database might include a mixture of 3NF and non3NF tables. I suppose you might say a given database
is
in 3NF if every table it contains is in 3NF, but it's a bit of a stretch, and
I'd prefer not to do it. (FP: I don’t
see a problem with this, provided the meaning is understood and properly used).
Ø "Bunch
of records": The process of
normalization to 3NF is a process that takes a given table as input and
produces a set of tables as output, such that every table in the output set is
in 3NF. If that process is carried out
on every table in turn in a given database as input, the various sets of tables
produced as output will together form a database in which every table is indeed
in 3NF. So I think bunch of records
should be set of tables. (I remark in passing that the writer is presumably
using the term record to mean record type, whereas the more usual
interpretation of the term is record occurrence. I note further that the writer seems to be
thinking that record
type and table are synonymous. I note
still further that relational tables do not contain records, they contain rows
or, better, tuples.)
Ø
"Normalized
tables": From the context, by
normalized here the writer presumably means 3NF. This usage is common but sloppy; strictly
speaking, normalized
just means first normal form (1NF)--see AN INTRODUCTION
TO DATABASE SYSTEMS. And you
might be surprised to learn that, by definition, all tables are in 1NF!
(equivalently, all tables are normalized).
I think normalized tables would be better as 3NF tables. (I'm ignoring throughout the possibility
that the table might contain either duplicates or nulls. Tables that contain duplicates or nulls do
not represent relations, and normalization theory has nothing to say about
them.)
Ø
"Tables": I've explained in many places and on many
occasions that the term table is unsatisfactory for a variety of reasons (see,
e.g., the book just mentioned). At
least we should be clear as to whether we're talking about table values or
table variables--or, as I would prefer to say, relation values or relation
variables. Somewhat against my own
better judgment, however, for the remainder of this response I'll stick with
the term table, which for present purposes I'll define to mean a table variable
specifically. Also against my own better judgment, I'll also take the term
table to mean a base table specifically (as I believe the writer does).
Ø
"3NF": I wonder why the writer focuses on third
normal form specifically, when there are at least four known and useful higher
levels of normalization. (The four are,
in sequence, Boyce/Codd, fourth, fifth, and sixth normal form. The last of these is defined in a
book
that's due to appear from Morgan Kaufmann in the fourth quarter this year: TEMPORAL DATA
AND THE RELATIONAL MODEL, by C. J. Date, Hugh Darwen, and Nikos A.
Lorentzos. It's pertinent only in the
context of temporal databases.) In
other words, database designers should always aim for tables that are in at
least fifth normal form. For
simplicity, however, I'll stick with 3NF for the remainder of this response,
since that's apparently what the writer wants.
Here then are the original two questions, now reworded to
take the foregoing terminological issues into account:
1.
Is there an algorithm that can be used to verify whether a
given table is in 3NF?
2.
Is there an algorithm that can be used to convert a given
table into a set of 3NF tables?
And now we come to an overriding point: Namely, it's meaningless to say simply
that
some table T either is or is not in 3NF--rather, we can say only that T
is or is not in 3NF with respect to a certain set of functional dependencies
(once again, see the above mentioned book).
Here's a definition (a precise definition!) of 3NF that makes
this point clear.
Let T
be a table and let S be the set of functional dependencies (FDs) that
are satisfied by T; let X be any subset of the columns of T,
and let C be any single column of T. Then T is in 3NF with respect to
S if and
only if, for every FD X _ C in S, at least one of the
following is true:
- X contains C;
- X is a superkey for T;
- C is contained within a
candidate key of T.
Note: Condition a.
means C is one of the columns in X (and the FD X _ C is said to be trivial in
such a case). Condition b. means T
never has a value in which two distinct rows have the same value for X (in
other words, X is a superset of some candidate key for T). Condition c. should be clear (though the
rationale for it might not be!).
So--turning to the two questions above--we have the
following:
1. Is there an algorithm that can be used to verify whether a
given table is in 3NF?
In order to discover whether table T is in 3NF, we
need to know the FDs that T satisfies.
Then we have to check whether every such FD satisfies either condition
a. or condition b. or condition c. above.
These checks can clearly be automated. (FP:
This is a formal way of saying that whether or not a table is in 3NF depends
on the business rules underlying the table representation, which
must be declared to a DBMS that understands them and uses them for
verification; see my article on this subject forthcoming at the Journal of Conceptual Modeling.) Given such a set
of FDs/business rules
and a DBMS, the answer to both questions is yes. For further details see AN INTRODUCTION TO
DATABASE SYSTEMS.
(Note that if we know the FDs, we can in principle infer the
candidate keys, though there are known to be limits on the degree to which such
inferencing might be feasible in practice--see Catriel Beeri and Philip A.
Bernstein:Computational Problems Related to the Design of
Normal Form Relational Schemas, ACM TODS 4, No. 1, March 1979.)
2. Is there an algorithm that can be used to convert a given
table into a set of 3NF tables?
So long as you have at least a basic understanding of what
the normalization process entails, the broad outlines of such an algorithm
should be clear from the foregoing discussion.
For further details, see AN INTRODUCTION TO
DATABASE SYSTEMS:
·
Page 361 gives an informal algorithm for converting a
1NF table to a set of 2NF tables;
·
Page 363 gives an informal algorithm for converting a
2NF table to a set of 3NF tables;
·
Page 384 gives a formal algorithm for converting a 1NF
table to a set of 3NF tables.
Posted
09/20/02