Monday, March 18, 2013

Depends on the Dependencies: Normal Forms and the Conceptual-Logical Conflation

by Erwin Smout and Fabian Pascal

REVISED: 10/19/16

Confusion of levels of representation is one of the most common and entrenched fallacies in data management [1]. We have written and spoken frequently about "denormalization for performance" [2,3,4,5,6], which is rooted in logical-physical confusion (LPC), but there is another fallacy, caused by conceptual-logical conflation (CLC). There is even less awareness of the CLC than of the LPC.


  • We distinguish between normalization (to 1NF) and further normalization (beyond 1NF up to 5NF), which are qualitatively different. When we mean the latter, we spell it out.
  • For the purposes of this discussion we exclude recently proposed normal forms higher than 5NF, which are targeted at specific narrow circumstances and have not yet attained general consensus. If and when we refer to 'full normalization' we mean 5NF.
  • We deliberately overlook the distinction between BCNF and 5NF, because cases where there actually is a distinction are fairly rare in practice and it does not materially affect our argument.
You encounter one form of CLC when one or more tables are accompanied by the question "Are these fully normalized and, if not, how do I fully normalize them?" without any reference to the conceptual model that they are supposed to represent. Another appears in comments such as the following:
"The primary purpose of normalization in a logical entity-relationship model is to identify a single point of definition (or a single point of update for the ACID-oriented modellers) for any given data-item (or groups of data-items in the higher normal forms). Unfortunately just about all discussions of “normal forms” are very simplistic and nearly all the examples of the “normalization process” deal with very simple examples focussed on the values assigned to data-items rather than the purpose of the data-item itself." --When was the last time you were able to build a relational model without a non-DB person insisting on the 'need' for denormalizing?
An enterprise-specific entity-relationship model (ERM) is at the conceptual level (about the real world). As defined by Codd and others, normal forms of relations are at the logical level (database representation). "Logical ERM" conflates the two levels. CLC is a frequent factor in poor database design: normalization is not "in an E/RM", but in a logical model.

Note: The comment offers a very unusual formulation of the objective of full normalization that very few would recognize as a result of redundancy elimination (we did not). With considerable slack in the interpretation of terms, it goes something like this: without redundancy, every data item appears only once in the logical structure and, thus, provides its own "unique single point" to which updates must be applied ("ACID modelers" is an equally odd term and we would not use it in this context). We also pondered the meaning of "focused on the values assigned to data-items rather than the purpose of the data-item itself": full normalization applies to relations, not to the attribute values within a relation.
First, the fundamentals:

  • A relation is normalized (in 1NF) if all its attributes are defined on simple domains without meaningful components, whose values are treated atomically by the data language;
  • A relation is fully normalized (in 5NF) if all dependencies that hold in it are functional dependencies (FDs) of the non-key attributes on the key (informally, this means that the relation represents facts about a single type of entity).
  • A relation that is not fully normalized (i.e., represents facts about entities of multiple types) can be further normalized by replacing it, without loss of information, with its 5NF projections, each of which represents facts about a single entity type (see the sketch following this list).
  • 2NF-5NF relations can be denormalized to a lower normal form by replacing them with their join. [2]
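As a minimal illustration of further normalization (the relation, attribute names and dependencies below are ours, purely hypothetical): suppose a single relation records facts about both invoices and customers, so that {customer_no} -> {customer_name} holds in addition to the dependencies on the key {invoice_no}. Further normalization replaces it with projections, each representing facts about a single entity type, and their join loses no information. A rough SQL sketch:

-- Hypothetical relation that is not fully normalized: it records facts about
-- two entity types (invoices and customers), so {customer_no} -> {customer_name}
-- holds in addition to the dependencies on the key {invoice_no}.
CREATE TABLE invoice_with_customer (
  invoice_no    INTEGER       PRIMARY KEY,
  invoice_date  DATE          NOT NULL,
  customer_no   INTEGER       NOT NULL,
  customer_name VARCHAR(100)  NOT NULL  -- determined by customer_no, not by the key alone
);

-- Its further-normalized replacement: one projection per entity type.
CREATE TABLE customer (
  customer_no   INTEGER      PRIMARY KEY,
  customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE invoice (
  invoice_no   INTEGER PRIMARY KEY,
  invoice_date DATE    NOT NULL,
  customer_no  INTEGER NOT NULL REFERENCES customer
);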
Simple examples are not necessarily simplistic. Carefully selected simple examples are essential for understanding general principles, which can then be applied to more complex realities. The CLC is particularly insidious in this respect. In fact, the commenter himself uses a simple example to demonstrate "overnormalization":
"A particular bug-bear and a mistake that +90% of “data modelers” make, is analyzing “point in time” views of the business data and “normalizing” those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
 Customer Name,
 [...]
Nearly every time, through the blind application of [full] normalization we get this (you’ll need to draw the diagram to get the picture):
 [...] REFERENCES Customer-Account,
 Invoice-Address REFERENCES Address"

To ascertain the normal form a relation (pictured by an R-table) is in, we must know the attribute dependencies -- which reflect entity property dependencies in the real world, specified by business rules in the conceptual model -- that are represented by integrity constraints in the database. But we are asked to detect a normalization mistake -- a violation of 5NF -- in the absence of both the rules and the constraints (i.e., to infer/assume/guess the dependencies based solely on the column names). Hence Erwin's response:
"Your example has nothing to do with full normalization. To the extent that any "mistake" has been made, it's probably the failure to recognize that "address line as it was at the time of placing the order" is not necessarily the same as "address line as it is currently known right now, 10 years later". You might be interested to know that there are database design schemes that address this problem--correctly--without "denormalizing the details back into the invoice".
that evoked the following reply:
"If it’s not a mistake then why do so many “data modellers” make that mistake? I’ve lost count of the times that someone, when converting my information mode into a database schema, has done that type of conversion “to normalize the data” and then I have to argue with them to explain the error of their ways whilst they accuse me of “denormalizing” their data. There’s even a term for it – it’s called “over-normalization”."
The database field is riddled with misconceptions and errors due to poor grasp of data fundamentals, CLC being one of them. "Overnormalization" is a label attached to all sorts of poor logical designs that have nothing to do with normalization. Skipping the spelling out of the conceptual model (the business rules) and starting with logical database design is a sure way to risk such mistakes.

In the real world the dependency that holds is not

{customer} -> {address line}

but

{customer, invoice date} -> {address line}
Had the correct dependency been spelled out in the conceptual model, it would probably have been clear that the single relation is not denormalized. But if we are given either of the two relations and asked what normal form it is in without the dependencies, it is impossible to say much.
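A minimal sketch of a design that honors this dependency (the table and column names are ours, purely illustrative; it assumes the invoice must reproduce the address line as it was on the invoice date, and it reuses the illustrative customer table sketched earlier):

-- The address line the invoice must reproduce is determined by
-- {customer, invoice date}, not by {customer} alone, so it stays with the
-- invoice facts instead of being replaced by a reference to the customer's
-- current address.
CREATE TABLE sales_invoice (
  invoice_no      INTEGER       PRIMARY KEY,
  customer_no     INTEGER       NOT NULL REFERENCES customer,
  invoice_date    DATE          NOT NULL,
  invoice_address VARCHAR(200)  NOT NULL,  -- address line as of invoice_date
  net_amount      DECIMAL(12,2) NOT NULL,
  vat             DECIMAL(12,2) NOT NULL,
  total_amount    DECIMAL(12,2) NOT NULL
);

An equivalent alternative is a dated customer-address history relation keyed on {customer_no, effective_date}; either way, keeping the invoice-time address with the invoice is not "denormalization".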

Note: Incidentally, "blind application of normalization" would not have left NET_AMOUNT, VAT and TOTAL_AMOUNT in the SALES_INVOICE relation, because of the real world dependency

{net-amount, vat} -> {total-amount}
(unless VAT denotes a percentage and not an amount, which we are not told either). If this dependency is taken into account, then TOTAL_AMOUNT should be separated into its own relation with {NET_AMOUNT,VAT} as key. However, such a base relation would violate the Principle of Representational Minimality (PORM) [7] because it is redundant: it can be derived from the other base relations. If the calculated attribute is left in SALES_INVOICE, an integrity constraint must be enforced to guarantee that at all times TOTAL_AMOUNT = NET_AMOUNT + VAT.
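Continuing the illustrative sales_invoice sketch above (and assuming, as the dependency does, that VAT is an amount rather than a percentage), the constraint could be expressed roughly as follows; alternatively, the derivable attribute can be dropped from the base relation and derived on demand:

-- Controlled redundancy: if total_amount is stored, the DBMS must guarantee
-- that it always equals net_amount + vat.
ALTER TABLE sales_invoice
  ADD CONSTRAINT total_is_net_plus_vat CHECK (total_amount = net_amount + vat);

-- Alternative consistent with the PORM: drop total_amount from the base
-- relation and derive it when needed.
CREATE VIEW sales_invoice_totaled AS
  SELECT invoice_no, customer_no, invoice_date, invoice_address,
         net_amount, vat, net_amount + vat AS total_amount
  FROM   sales_invoice;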
In conclusion:
  • Distortions or misuse of full normalization should not be confused with the real thing. Often normalization terminology is applied to unrelated design aspects (e.g. designs that avoid missing data, include calculated data or aggregation, or represent entity supertype-subtypes). "Overnormalization" is a misleading term used for any poor logical design.
  • Full normalization alone is not sufficient to achieve designs devoid of drawbacks. For example, "details of historical customer data moved into invoice tables" is bad design, but is not inconsistent with 5NF.
  • Property dependencies are axioms about the real world, represented in the database by attribute dependencies expressed as integrity constraints. Normalization theory is logical design reasoning based on those axioms (e.g., identification of the key, assessing the normal form of a relation). Different dependencies identify different keys and normal forms. Applying normalization without spelling out property dependencies specified in the conceptual model is like reasoning without axioms: any conclusion is possible.
  • Guessing normal forms by inspecting R-tables that picture relations, based on column names alone, is a fool's errand: it can be tricky even with simple relations [7], and it is practically impossible with realistically complex ones, particularly beyond BCNF.
  • With knowledge of data and relational fundamentals, a well-defined, complete and documented conceptual model and adherence to the Principle of Full Normalization (POFN) should implicitly yield a 5NF database. Explicit normalization and further normalization are necessary only to repair designs that did not adhere to the POFN.
  • Practitioners may be aware of one drawback of denormalization -- redundancy -- but not of all the others [6]. Few realize that the redundancy must be controlled by the DBMS to ensure database consistency, which requires the enforcement of special integrity constraints (sketched below), and fewer still know how to formulate them (if SQL supports them at all). What is more, if they undertook this serious constraint-formulation burden, it would all be for naught, because the constraints reintroduce the very joins denormalization is intended to avoid [2].
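To illustrate that last point (hypothetical names, building on the earlier sketches): suppose customer_name has been copied into sales_invoice "for performance". Controlling the redundancy requires something along the lines of the standard-SQL assertion below; few SQL DBMSs support assertions, so in practice it degenerates into triggers or application code, and the consistency check itself performs the very join that denormalization was supposed to avoid.

-- Controlled redundancy for a denormalized copy of customer_name:
-- the check joins the two tables on every relevant update.
CREATE ASSERTION invoice_customer_name_consistent CHECK (
  NOT EXISTS (
    SELECT *
    FROM   sales_invoice AS i
    JOIN   customer      AS c ON i.customer_no = c.customer_no
    WHERE  i.customer_name <> c.customer_name
  )
);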


[2] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.
[3] Pascal, F., Normalization, Further Normalization, Ease of Use, Integrity and Performance.
[4] Pascal, F., Tables, Full Normalization and Business Rules.
[5] Pascal, F., Denormalization for Performance: Don't Blame the Relational Model.
[6] Pascal, F., The Costly Illusion of Denormalization for Performance, forthcoming, All Analytics.
[7] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming (draft chapters).

