ON AUTOMATION OF NORMALIZATION VALIDATION
with Chris Date

 

 

 

As a featured expert on relational/database fundamentals at searchdatabase.techtarget.com,I received the following question:

 

“Data Normalization. Can it be automated?  Two questions, really:

 

·   Are you aware of any algorithm or utility that can be used to verify whether my database model is in 3NF?

 

·   Is there an algorithm that can be used to convert a bunch of records into a set of normalized tables?”

 

I asked Chris Date to respond because he published on this very subject. He has provided a detailed and rigorous response, a short version of which I posted on that site. His complete answer follows; my comments are prefixed with FP).

 

 

Chris Date Responds: The short answer to both questions is yes.  For a much longer answer, please read on! 

 

I think the first thing I need to do is examine--and, I hope, clarify--some of the terminology used in the two questions. 

 

Ø       "Database model":  The term model is grotesquely overused, and abused, in the IT world (see Models, Models, Everywhere, Nor Any Time to Think. The term database model in particular is usually used to refer to a generic model that's applicable to databases in general (the relational model is the obvious example).  I think database model in the first question would be better as database design (and the design in question must be understood to be a logical one, not a physical one).  And I think it would be better still as simply database.

 

Ø       "In 3NF":  Strictly speaking, we can't say a given database is or is not in 3NF--3NF is something that is or is not satisfied by a given table.  A given database might include a mixture of 3NF and non3NF tables.  I suppose you might say a given database is in 3NF if every table it contains is in 3NF, but it's a bit of a stretch, and I'd prefer not to do it.  (FP: I don’t see a problem with this, provided the meaning is understood and properly used).

 

Ø       "Bunch of records":  The process of normalization to 3NF is a process that takes a given table as input and produces a set of tables as output, such that every table in the output set is in 3NF.  If that process is carried out on every table in turn in a given database as input, the various sets of tables produced as output will together form a database in which every table is indeed in 3NF.  So I think bunch of records should be set of tables. (I remark in passing that the writer is presumably using the term record to mean record type, whereas the more usual interpretation of the term is record occurrence.  I note further that the writer seems to be thinking that record type and table are synonymous.  I note still further that relational tables do not contain records, they contain rows or, better, tuples.)

 

Ø       "Normalized tables":  From the context, by normalized here the writer presumably means 3NF.  This usage is common but sloppy; strictly speaking, normalized just means first normal form (1NF)--see AN INTRODUCTION TO DATABASE SYSTEMS.  And you might be surprised to learn that, by definition, all tables are in 1NF! (equivalently, all tables are normalized).  I think normalized tables would be better as 3NF tables.  (I'm ignoring throughout the possibility that the table might contain either duplicates or nulls.  Tables that contain duplicates or nulls do not represent relations, and normalization theory has nothing to say about them.)

 

Ø       "Tables":  I've explained in many places and on many occasions that the term table is unsatisfactory for a variety of reasons (see, e.g., the book just mentioned).  At least we should be clear as to whether we're talking about table values or table variables--or, as I would prefer to say, relation values or relation variables.  Somewhat against my own better judgment, however, for the remainder of this response I'll stick with the term table, which for present purposes I'll define to mean a table variable specifically. Also against my own better judgment, I'll also take the term table to mean a base table specifically (as I believe the writer does).

 

Ø       "3NF":  I wonder why the writer focuses on third normal form specifically, when there are at least four known and useful higher levels of normalization.  (The four are, in sequence, Boyce/Codd, fourth, fifth, and sixth normal form.  The last of these is defined in a book that's due to appear from Morgan Kaufmann in the fourth quarter this year: TEMPORAL DATA AND THE RELATIONAL MODEL, by C. J. Date, Hugh Darwen, and Nikos A. Lorentzos.  It's pertinent only in the context of temporal databases.)  In other words, database designers should always aim for tables that are in at least fifth normal form.  For simplicity, however, I'll stick with 3NF for the remainder of this response, since that's apparently what the writer wants. 

 

Here then are the original two questions, now reworded to take the foregoing terminological issues into account: 

 

1.      Is there an algorithm that can be used to verify whether a given table is in 3NF? 

2.      Is there an algorithm that can be used to convert a given table into a set of 3NF tables? 

 

And now we come to an overriding point:  Namely, it's meaningless to say simply that some table T either is or is not in 3NF--rather, we can say only that T is or is not in 3NF with respect to a certain set of functional dependencies (once again, see the above mentioned book).  Here's a definition (a precise definition!) of 3NF that makes this point clear.

 

Let T be a table and let S be the set of functional dependencies (FDs) that are satisfied by T; let X be any subset of the columns of T, and let C be any single column of T.  Then T is in 3NF with respect to S if and only if, for every FD X _ C in S, at least one of the following is true: 

 

  1. X contains C;
  2. X is a superkey for T;  
  3. C is contained within a candidate key of T

 

Note:  Condition a. means C is one of the columns in X (and the FD X _ C is said to be trivial in such a case).  Condition b. means T never has a value in which two distinct rows have the same value for X (in other words, X is a superset of some candidate key for T).  Condition c. should be clear (though the rationale for it might not be!). 

 

So--turning to the two questions above--we have the following: 

 

1. Is there an algorithm that can be used to verify whether a given table is in 3NF? 

 

In order to discover whether table T is in 3NF, we need to know the FDs that T satisfies.  Then we have to check whether every such FD satisfies either condition a. or condition b. or condition c. above.  These checks can clearly be automated. (FP: This is a formal way of saying that whether or not a table is in 3NF depends on the business rules underlying the table representation, which must be declared to a DBMS that understands them and uses them for verification; see my article on this subject forthcoming at the Journal of Conceptual Modeling.)  Given such a set of FDs/business rules and a DBMS, the answer to both questions is yes. For further details see AN INTRODUCTION TO DATABASE SYSTEMS.

 

(Note that if we know the FDs, we can in principle infer the candidate keys, though there are known to be limits on the degree to which such inferencing might be feasible in practice--see Catriel Beeri and Philip A. Bernstein:Computational Problems Related to the Design of Normal Form Relational Schemas, ACM TODS 4, No. 1, March 1979.) 

 

2. Is there an algorithm that can be used to convert a given table into a set of 3NF tables? 

 

So long as you have at least a basic understanding of what the normalization process entails, the broad outlines of such an algorithm should be clear from the foregoing discussion.  For further details, see AN INTRODUCTION TO DATABASE SYSTEMS

 

·   Page 361 gives an informal algorithm for converting a 1NF table to a set of 2NF tables;

·   Page 363 gives an informal algorithm for converting a 2NF table to a set of 3NF tables;

·   Page 384 gives a formal algorithm for converting a 1NF table to a set of 3NF tables.

 

 

Posted 09/20/02