Monday, March 18, 2013

Normal Forms: Dependent on Dependencies

by Erwin Smout and Fabian Pascal

The recently published rewrites of papers #1, Business Modeling for Database Design, and #2, The Costly Illusion: Normalization, Integrity and Performance, reflect a search over the years for the most effective way to dispel the many persistent myths and misconceptions common in the industry about conceptual (business) modeling and logical database design. One intention of splitting the related topics into two papers was to make an important dual point:

(1) Business modeling and database design done "right" implicitly yield R-tables in 5NF in most cases (see paper #1).

(2) Explicit further normalization of R-tables is necessary only to repair "poorly" designed R-tables (1NF-4NF), in order to eliminate certain drawbacks (see paper #2).

The papers explain what is meant by "right" and "poorly" and what those drawbacks are.

Note: For the purposes of this discussion we exclude recently proposed normal forms higher than 5NF, which are targeted at specific narrow circumstances and have not yet attained general consensus. We deliberately overlook the distinction between BCNF and 5NF, because cases where there actually is a distinction are fairly rare in practice and the distinction does not materially affect our argument. If and when we refer to 'full normalization' we mean 5NF.

We have written and spoken frequently about the "denormalization for performance" misconception, rooted in logical-physical confusion (LPC), but there is another misconception, caused by what can be called conceptual-logical conflation (CLC). You encounter it whenever a bunch of table schemas is specified, accompanied by the question whether they are fully normalized and, if not, how to fully normalize them, without reference to the conceptual model they are supposed to represent.

We regularly come across online exchanges exhibiting the LPC. One was initiated by the question "When was the last time you were able to build a relational model without a non-DB person insisting on the 'need' for denormalizing?" But one comment brought the CLC to the fore. Before we delve into it, the basic terminology:

1. An R-table is normalized (in 1NF) by definition; historically, this means that it does not contain multivalued columns.

2. An R-table is commonly considered fully normalized (in 5NF) if and only if every join dependency that holds in it is exclusively on the key; informally, this means that it represents a single entity-type.

3. An R-table that is in a lower normal form--2NF, 3NF, BCNF and 4NF--can be further normalized to 5NF; informally, this means that it violates 2 above by representing multiple entity types, which can be "unbundled", each into its own 5NF R-table.

4. R-tables in 2NF-5NF can be denormalized to a lower normal form by "bundling" the entity types they separately represent in one R-table.
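As a toy illustration (our own, not from the papers, with hypothetical column names), "bundling" and "unbundling" can be sketched with dict-based stand-ins for R-tables:

```python
# Toy illustration (our own names and data) of "bundling" vs "unbundling".
# Bundled (< 5NF): one table mixes two entity types, repeating customer
# data on every invoice row.
bundled = [
    {"customer": "C1", "city": "Ghent", "invoice": 1, "amount": 100},
    {"customer": "C1", "city": "Ghent", "invoice": 2, "amount": 250},
]

# Unbundled (5NF): each entity type in its own table; customer data
# is recorded exactly once.
customers = {"C1": {"city": "Ghent"}}
invoices = [
    {"customer": "C1", "invoice": 1, "amount": 100},
    {"customer": "C1", "invoice": 2, "amount": 250},
]

# Joining the two tables reproduces the bundled view, so unbundling
# loses no information while eliminating the stored redundancy.
rejoined = [
    {"customer": inv["customer"], **customers[inv["customer"]],
     "invoice": inv["invoice"], "amount": inv["amount"]}
    for inv in invoices
]
assert rejoined == bundled
```

The point of the sketch is only structural: the unbundled design records each customer fact once, and the bundled view is derivable from it by a join.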

And now to the comment:
The primary purpose of normalization in a logical entity-relationship model is to identify a single point of definition (or a single point of update for the ACID-oriented modellers) for any given data-item (or groups of data-items in the higher normal forms). Unfortunately just about all discussions of “normal forms” are very simplistic and nearly all the examples of the “normalization process” deal with very simple examples focussed on the values assigned to data-items rather than the purpose of the data-item itself.
Here we run into conceptual-logical conflation: an entity-relationship model (ERM) of a specific business is conceptual and informal (we note, in passing, The E/R Model Considered Harmful). As defined by Codd and others, normal forms are properties of logical R-tables. "Logical ERM", in this sense, confuses levels of representation (the commenter expressed a personal preference for object-role modeling over ERM, but an ORM is also conceptual).

Regarding the objectives of normalization, some, but not all, practitioners know one objective: eliminating a certain type of data redundancy from databases that contain <5NF R-tables. Fewer realize that uncontrolled redundancy increases the risk of the database being rendered inconsistent by updates. Fewer still know that redundancy control is via additional integrity constraints, and would not know how to formulate them; but even if they undertook this serious control burden, it would all be for naught, because it would defeat the very purpose of denormalization. And very few are aware of the other drawbacks of denormalization (see paper #2).
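A minimal sketch (names and data are ours, purely illustrative) of the kind of extra integrity constraint a denormalized design requires: rows that agree on certain columns must also agree on the redundantly repeated ones.

```python
# Sketch (hypothetical names) of a redundancy-control constraint for a
# denormalized table: rows agreeing on the `key` columns must also agree
# on the redundantly repeated `dependent` columns.
def redundancy_controlled(rows, key, dependent):
    seen = {}
    for row in rows:
        k = tuple(row[c] for c in key)
        d = tuple(row[c] for c in dependent)
        if seen.setdefault(k, d) != d:
            return False  # same key, two different repeated values
    return True

# Customer address repeated per invoice row (the redundancy).
invoices = [
    {"customer": "C1", "address": "12 Oak St", "invoice": 1},
    {"customer": "C1", "address": "12 Oak St", "invoice": 2},
]
assert redundancy_controlled(invoices, ["customer"], ["address"])

# One careless update silently renders the database inconsistent
# unless the constraint above is actually enforced on every update:
invoices[1]["address"] = "99 Elm Rd"
assert not redundancy_controlled(invoices, ["customer"], ["address"])
```

Enforcing such a check on every update is precisely the control burden the text describes, and the overhead it adds is what defeats the purpose of denormalizing in the first place.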

The comment offers a very unusual formulation of this objective, one that very few would recognize as a logical implication of redundancy elimination (we did not). With considerable slack in the interpretation of terms, it goes like this: without redundancy, every data item appears only once in the logical structure and, thus, provides its own "unique single point" to which updates must be applied ("ACID modelers" is an equally odd term and we would not use it in this context).

We also pondered the meaning of "focussed on the values assigned to data-items rather than the purpose of the data-item itself." It could be a reference to the practice of showing an R-table value (of an R-table variable) and then asking whether it "is in 3NF", failing to realize that normal forms apply to R-table schemas, not values. If this interpretation is right, then it is another case of poor use of language.

Simple examples are not necessarily simplistic. Carefully devised simple examples are essential to convey general principles, which can then be applied to more complex realities. The CLC is particularly insidious in this respect.

In fact, the commenter himself uses a simple example to demonstrate "overnormalization":
A particular bug-bear and a mistake that +90% of “data modelers” make, is analyzing “point in time” views of the business data and “normalizing” those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
 Customer Name,
Nearly every time, through the blind application of normalization we get this (you’ll need to draw the diagram to get the picture):
  REFERENCES Customer-Account,

 Invoice-Address REFERENCES Address

To ascertain the normal form a table is in, we must know whether the column dependencies, which reflect the corresponding attribute dependencies in the real world, satisfy the 5NF criterion. But we are not given the attribute dependencies: we are expected to detect a "normalization mistake"--a violation of 5NF--by inferring/assuming/guessing those dependencies based solely on the column names. Hence Erwin's response:
Your example has nothing to do with normalization. To the extent that any "mistake" has been made, it's probably the failure to recognize that "address line as it was at the time of placing the order" is not necessarily the same as "address line as it is currently known right now, 10 years later". You might be interested to know that there are database design schemes that address this problem--correctly--without "denormalizing the details back into the invoice".
that evoked the following reply:
If it’s not a mistake then why do so many “data modellers” make that mistake? I’ve lost count of the times that someone, when converting my information mode into a database schema, has done that type of conversion “to normalize the data” and then I have to argue with them to explain the error of their ways whilst they accuse me of “denormalizing” their data. There’s even a term for it – it’s called “over-normalization”.
The mistake is due to CLC. What is called "overnormalization" here is a poor logical design induced by a conceptual modeling error, causing the modellers to wrongly deem the correct logical design "denormalized". Skipping the spelling out of attribute dependencies facilitates such mistakes. The one made here is that in the real world the functional dependency (FD) that holds is not
{customer} -> {address line}
but
{customer, invoice date} -> {address line}
In other words, had the conceptual model spelled out the correct FDs, it would probably have been clear that the single-table design is not denormalized. But if we are given either of the two tables and asked what NF it is in, without the FDs spelled out, it is impossible to be sure.
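To see why inspection of a table value cannot settle the question, consider this sketch (the data are ours, purely illustrative): a value can refute an FD, but a value that satisfies both candidate FDs tells us nothing about which one is the real-world axiom.

```python
# Check whether an FD lhs -> rhs is *satisfied by this table value* --
# a property of the value only, never of the schema.
def fd_holds_in(rows, lhs, rhs):
    image = {}
    for row in rows:
        k = tuple(row[c] for c in lhs)
        v = tuple(row[c] for c in rhs)
        if image.setdefault(k, v) != v:
            return False  # two rows agree on lhs but differ on rhs
    return True

# Illustrative data (ours): each customer appears once, so...
rows = [
    {"customer": "C1", "invoice date": "2013-01-05", "address line": "12 Oak St"},
    {"customer": "C2", "invoice date": "2013-02-11", "address line": "7 Pine Ave"},
]

# ...this value satisfies BOTH candidate FDs, yet at most one of them
# is the axiom that actually holds in the real world.
assert fd_holds_in(rows, ["customer"], ["address line"])
assert fd_holds_in(rows, ["customer", "invoice date"], ["address line"])
```

Which FD holds must come from the conceptual model; no amount of staring at rows and column names can decide it.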

Note, incidentally, that "blind application of normalization" would not have left NET_AMOUNT, VAT and TOTAL_AMOUNT in the SALES_INVOICE table, because of the real-world FD
{net-amount, vat} -> {total-amount}
(unless VAT denotes a percentage and not an amount, which we are not told either). If this FD is taken into account, then TOTAL_AMOUNT would be separated into its own R-table with {NET_AMOUNT, VAT} as the key. But, in fact, that R-table is redundant because the column is calculated, and should not be defined. On the other hand, if the calculated column is left in SALES_INVOICE, an integrity constraint must be enforced to guarantee that at all times

TOTAL_AMOUNT = NET_AMOUNT + VAT


This is an example of what Codd referred to as the redundancy and derivability underlying relational theory.
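The compensating constraint can be sketched as a row-level check (assuming, as noted, that VAT denotes an amount rather than a percentage; the column names are from the example, the data are ours):

```python
# Row-level sketch of the constraint that must hold at all times if the
# derivable TOTAL_AMOUNT column is kept in SALES_INVOICE (assuming VAT
# is an amount, which the example does not actually specify).
def total_is_derivable(inv):
    return inv["total_amount"] == inv["net_amount"] + inv["vat"]

assert total_is_derivable({"net_amount": 100, "vat": 21, "total_amount": 121})
assert not total_is_derivable({"net_amount": 100, "vat": 21, "total_amount": 120})
```

Keeping the derived column means enforcing this check on every insert and update; dropping it and computing the total on retrieval avoids the constraint altogether.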

In conclusion:
  • Distortions or misuse of normalization should not be confused with the real thing. Often normalization terminology is applied to unrelated design aspects (e.g. designs to avoid unknown data or represent entity supertype-subtypes). "Overnormalization" is a misleading label for what is essentially poor conceptual design (it is often applied, also erroneously, to further normalization beyond 3NF).
  • Full normalization alone is not sufficient to achieve "good" designs. For example, it allows a (poor) design with "details of historical customer data moved into invoice tables".
  • Attribute dependencies are axioms about the real world; normalization theory is logical design reasoning based on those axioms (e.g. identification of the key, assessing the normal form of a table). Different dependencies identify different keys and normal forms. Applying normalization without spelling out attribute dependencies--which a conceptual model should capture--is like reasoning without spelling out the axioms: any conclusion is possible.
  • Guessing normal forms correctly by inspecting R-tables and column names, without proper conceptual specifications, is a fool's errand. As we have seen, this can be tricky even with simple tables. It is practically impossible with realistically complex tables, particularly beyond BCNF. The best way to address poor understanding of normalization is to ensure well-defined conceptual models and to spell out attribute dependencies clearly and completely (see paper #1). Then further normalization is not necessary.
