Monday, October 1, 2012

Database Management, Application Development and Normalization

Tom Phillips "start[s] the conversation at social.technet.microsoft.com with [his] "normalization" rant":
Normalization was invented in the 70's as a way to put some structure around how developers were storing data in a database, in addition to trying to save disk space. You need to remember this was a time when 1MB was billions of dollars and a comput er needing 1GB of data storage was inconceivable. It was a way to squeeze as much data into as small a space as possible.
Perhaps the lack of understanding of the relational model was so acute at the time when it was first published (1969-70) that it would not surprise me at all if a belief existed then that normalization would save storage space, even if I don't understand on what this belief was grounded. But there is no justification for such a belief to persist in 2012, no matter what else one thinks of normalization.

As I wrote many times, normalization is a purely logical concept and has nothing to do with physical storage. When Codd invented the relational model, hierarchic and CODASYL (network) DBMSs reigned supreme. Repeating groups--groups of values that repeat--were a characteristic of hierarchic DBMSs. A relational representation of data in such systems was viewed as "flattening" the hierarchy to R-tables. Here is what I wrote in Truly Relational: What It Really Means, Codd’s 1969-70 Papers (available from this site):
First Normal Form (1NF)
In his 1969 paper Codd allowed for domains that had relations as values, or what we refer to as relation-valued domains/columns (RVD/RVC), “nested relations” for short. He considered RVDs the relational equivalent of “repeating groups” in hierarchic/network systems, which he deemed unnecessary complications. At the time Codd thought that RVDs would necessitate second- rather than first-order logic as a basis for RDBMSs, which (a) is problematic, and (b) would complicate implementation of relational systems. So in his 1970 paper he introduced the idea of eliminating nested relations through a process of normalization, for which he lists some benefits. The result is a collection of relations in their normal form, what we today call R-tables. Otherwise put, Codd’s initial position was that R-tables are by definition normalized (otherwise put: there is no such thing as an unnormalized R-table). 
It later turned out that RVDs/RVCs can be supported within first-order logic and that, consequently, 1NF essentially means that at every intersection of a row and column there is exactly one value, which can be anything, including an R-table. [Indeed, Tutorial D, Date and Darwen's relational language supports RVD/RVA explicitly].
Note:

(a) 1NF is only one requirement of a table to be an R-table (for the others see my recently published Business Modeling for Database Design).

(b) Data structure is not an end in itself, it is a target for manipulation and integrity enforcement and violations of relational principles should be judged on their implications for those two factors.

(c) Even though it later turned out that RVD/RVC do not require second-order logic, in the context of (b) they complicate the data language but do not provide a significant benefit except convenience in some rare cases.

Be that as it may, if properly supported--with necessary operators--RVD/RVC do not violate relational principles.

Phillips continues:
That been said, I do not believe it is all bad. However, most developers are not database designers. Most developers create tables out of convenience for their application and how the application works. The word "normalization" usually does not enter the design discussion. Database design is an "art". This is why I have pushed developers to use stored procs and views for all access to the database layer. This insulates the developer from the physical database structure. It allows someone to "fix" the physical structure later and as long as the stored procs and views return the same values, the developer does not need to know the physical design.
There are certain contradictions here. First, while it is true that many aspects of logical database design are art rather than science, dependencies and normalization rules are the little science we do have. And if that is sound and useful, why should we not require developers to familiarize themselves with it? After all, poor design will affect not just the programmer, but users and the enterprise.

Second, we need to be precise about terminology:
  • Normalization is a process of repairing tables that are not relational because they contain multi-valued columns--the table equivalent of repeating groups--and rendering them R-tables in 1NF;
  • Further normalization is a process of repairing R-tables that are in 1NF, but have certain drawbacks due to poor design; by raising their normalization level to 5NF (or possibly 6NF), those deficiencies are eliminated.
Please note very carefully that if developers were knowledgeable about proper database design, neither of repairs would not be necessary very often (as I demonstrate in the above mentioned paper). But Phillips admits they don't have such knowledge. So if fully normalized designs are a good thing, what is the solution: educating developers on the advantages of good design, or ignoring that altogether?

Third, using views to access tables is certainly the correct approach, but if the base R-tables are poorly designed, that will not fully insulate developers (and users) from the drawbacks of that poor design. Incidentally, I am somewhat uncomfortable with with Phillips' notion of "physical design", by which he means base tables. I suspect that SQL DBMSs, which tend to implement mirror images of tables contribute to the common logical physical confusion. While base tables are physically stored, from an application/user perspective they are logical tables. Views provide not physical data independence--as Phillips seems to think--but logical data independence (physical design involves files, indexes, hashes, not tables). Furthermore, neither does SQL provide the full logical data independence that the relational model permits, e.g. certain views that are theoretically updatable cannot be updated in SQL.
The questions that need to be asked are:
  • Is it easier to access and manipulate as one table?
  • Is it fast (enough) as one table or multiple tables?
These questions need to be evaluated on a case by case basis.
The former one is a function of the data language (SQL) and one could hazard a guess that the answer is probably no. The latter is a function of physical-level factors that have nothing to do with the data model and the answer would probably be also no if the necessary redundancy control is taken into account. By redundancy control I mean the additional integrity constraints that must be defined and declared to the DBMS to ensure that the redundancy caused by denormalized designs does not cause inconsistencies. In my paper on normalization I demonstrate that the addition of those constraints--of which many practitioners are unaware--is a lose-lose proposition: they impose an integrity burden on the one hand and defeat any performance gains that might be achieved by denormalization without the constraints. IOW, the performance gains, if any, come from trading off integrity for it. The paper describes several other drawbacks of denormalized designs.

Note that there are other drawbacks to denormalization, such as complexification of queries and harder to understand databases.
I can argue if you have a table called Person with a FirstName and LastName, you could have a Person table with pointers to a FirstName and LastName table. You have multiple “Johns” in your database. Does anyone do this? No, because it is a nightmare to implement and use.
Phillips is an experienced practitioner, so I am surprised that he uses such a strawman as an example of "overnormalization". That is certainly not what further normalization is for.

It would not hurt database practitioners to know more about application development and programmers to know more about database management. That is not easy given current demands from both, but it must not be lost on them that they serve somewhat different purposes, and one cannot do the job of the other with disregard for that difference.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment