Monday, October 1, 2012

Normalization, Further Normalization, Ease of Use, Integrity and Performance

Revised: 10/15/16
"Normalization was invented in the 70's as a way to put some structure around how developers were storing data in a database, in addition to trying to save disk space. You need to remember this was a time when 1MB was billions of dollars and a comput er needing 1GB of data storage was inconceivable. It was a way to squeeze as much data into as small a space as possible." --Tom Phillips,
Perhaps the lack of understanding of the relational model was so acute at the time when it was first published (1969-70), that it would not surprise me if a belief existed then that normalization would save storage space, even if I don't understand in what this belief was grounded. But there is no justification for such a belief to persist in 2012, no matter what else one thinks of normalization.

For the multiple advantages from full normalization (5NF) -- chief among them semantic correctness of query results (i.e., no anomalous side-effects)--see the just published THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, available via the BOOKS page).

As I explained so many times, normalization is a purely logical concept that has nothing to do with physical storage, which exclusively determines performance.

Initially, Codd thought that relations with attributes defined on non-simple domains that have meaningful components would require second- rather than first-order logic as a basis for the RDM, which would have robbed the data model of some of its core advantages and would have complicated it. If such domains are relation-valued (RVD) and have union-compatible values (the same number and types of attributes), they can be eliminated by "flattening" to a set of relations whose attributes are defined only on simple domains, whose values are treated as atomic by the data language. The resulting relations are in their simpler normal form, for which first order logic is sufficient. The normal form became first normal form (1NF) when, subsequently, dependency theory and further normalization (to 5NF) were introduced.

Note: RVDs/RVAs are consistent with the RDM if they are defined as simple and the data language treats their values as atomic, which means it has no access to value components--the attributes and tuples of the nested relations--and, therefore, nesting and nested relations cannot be jointly referenced in relational constraint and query expressions. This is why it is more useful to simplify by normalization (un-nesting) to 1NF.

"... most developers are not database designers ... create tables out of convenience for their application and how the application works. The word "normalization" usually does not enter the design discussion. Database design is an "art". This is why I have pushed developers to use stored procs and views for all access to the database layer. This insulates the developer from the physical database structure. It allows someone to "fix" the physical structure later and as long as the stored procs and views return the same values, the developer does not need to know the physical design."
First, dependency theory and the higher normal forms (2NF-5NF) are the science in database design. So why should we not require developers to familiarize themselves with it? After all, poor design will affect not just application programmers, but end-users and the enterprise as a whole.

  • Normalization repairs relations that are not in 1NF due to fact-nesting design (i.e., they have attributes defined on RVDs with union-compatible values that are not simple domains);
  • Further normalization (to 5NF) repairs 1NF relations that have certain drawbacks due to fact-merging design;
Please note very carefully that if developers were knowledgeable about proper database design, neither would be necessary  (as I demonstrate in paper #2). Phillips admits they don't have such knowledge, so what is the solution--accept denormalized designs, or education on the advantages of full normalization?

Third, Phillips is guilty of the all too common logical-physical confusion (LPC). Views provide logical independence (LI): it insulate applications from logical--not physical--reorganizations of the database. SQL DBMSs, of course, fail to support the full LI that the relational model permits (e.g. multi-relation views are not updatable).

The questions to be asked are:

  • Is it easier to work with denormalized databases? This is a function of data language (SQL) syntax with respect to constraint and query formulation and the answer is no. My paper #2 documents several other drawbacks (e.g., complexification of queries and harder to understand databases) caused by denormalization.
  • Does denormalization improve performance? Performance is an exclusive function of physical implementation and has nothing to do with logical design. Denormalization can sometimes create the illusion of better performance only because it ignores the integrity implications. It introduces redundancy that, unless controlled by the DBMS with special integrity constraints, is an integrity risk. They are are practically never declared and enforced, which is what causes the illusion--performance gains, if any, come not from denormalization, but from trading off integrity for them.
"I can argue if you have a table called Person with a FirstName and LastName, you could have a Person table with pointers to a FirstName and LastName table. You have multiple “Johns” in your database. Does anyone do this? No, because it is a nightmare to implement and use."
It is discouraging to see such poor grasp of further normalization by experienced practitioners. Such examples of "overnormalization" are, of course, strawmen.

No comments:

Post a Comment

View My Stats