Monday, March 18, 2013

Depends on the Dependencies: Normal Forms and the Conceptual-Logical Conflation

by Fabian Pascal and Erwin Smout

REVISED: 10/19/16


Confusion of levels of representation is one of the most common and entrenched fallacies in data management [1]. We have written and spoken frequently about "denormalization for performance" [2,3,4,5,6], which is rooted in logical-physical confusion (LPC), but there is another fallacy, caused by conceptual-logical conflation (CLC), of which there is even less awareness.

Note:

  • We distinguish between normalization (to 1NF) and further normalization (beyond 1NF up to 5NF), which are qualitatively different. When we mean the latter, we spell it out.
  • For the purposes of this discussion we exclude recently proposed normal forms higher than 5NF, which are targeted at specific narrow circumstances and have not yet attained general consensus. If and when we refer to 'full normalization' we mean 5NF.
  • We deliberately overlook the distinction between BCNF and 5NF, because cases where there actually is a distinction are fairly rare in practice and it does not materially affect our argument.
You encounter CLC when one or more tables are accompanied by the question "Are these fully normalized and, if not, how do I fully normalize them?" without any reference to the conceptual model that they are supposed to represent. Or the following:
"The primary purpose of normalization in a logical entity-relationship model is to identify a single point of definition (or a single point of update for the ACID-oriented modellers) for any given data-item (or groups of data-items in the higher normal forms). Unfortunately just about all discussions of “normal forms” are very simplistic and nearly all the examples of the “normalization process” deal with very simple examples focussed on the values assigned to data-items rather than the purpose of the data-item itself." --When was the last time you were able to build a relational model without a non-DB person insisting on the 'need' for denormalizing?
An enterprise-specific entity-relationship model (ERM) is at the conceptual level (it is about the real world). Normal forms of relations, as defined by Codd and others, are at the logical level (the database representation). "Logical entity-relationship model" conflates the two levels. CLC is a frequent factor in poor database design: normal forms are properties not of an ERM, but of a logical model.

Note: The comment offers a very unusual formulation of the objective of full normalization, which few would recognize as a description of redundancy elimination (we did not). With considerable slack in the interpretation of terms, it amounts to: without redundancy, every data item appears only once in the logical structure and thus provides its own "single point" to which updates must be applied ("ACID-oriented modellers" is an equally odd term that we would not use in this context). We also pondered the meaning of "focussed on the values assigned to data-items rather than the purpose of the data-item itself": full normalization applies to relations, not to the attribute values within a relation.
 
First, the fundamentals:

  • A relation is normalized (in 1NF) if all its attributes are defined on simple domains without meaningful components, whose values are treated atomically by the data language;
  • A relation is fully normalized (in 5NF) if the only dependencies that hold in it are functional dependencies (FDs) of the non-key attributes on the key (informally, this means that the relation represents facts about a single type of entity).
  • A relation that is in 2NF-4NF (i.e., represents facts about entities of multiple types) can be further normalized, without loss of information, by replacing it with its 5NF projections, each of which represents facts about a single type of entity (see the sketch after this list).
  • 3NF-5NF relations can be denormalized down to 2NF by replacing them with their joins, without loss of information [2].
  • Relations in 1NF can be denormalized by introducing non-simple attributes (e.g., attributes defined on relation-valued domains with meaningful components), whose values are treated as non-atomic by the data language.
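To make the further normalization and denormalization points concrete, here is a minimal SQL sketch, with hypothetical employee/department names that are not drawn from the example discussed below:

-- Hypothetical example: EMPLOYEE_DEPT bundles facts about two types of
-- entities (employees and departments); DEPT_NAME depends on the key EMP_ID
-- only transitively, via {EMP_ID} -> {DEPT_ID} -> {DEPT_NAME}, so the
-- relation is not fully normalized.
CREATE TABLE EMPLOYEE_DEPT (
  EMP_ID    INTEGER     NOT NULL PRIMARY KEY,
  EMP_NAME  VARCHAR(50) NOT NULL,
  DEPT_ID   INTEGER     NOT NULL,
  DEPT_NAME VARCHAR(50) NOT NULL
);

-- Its 5NF projections, each representing facts about a single type of
-- entity; EMPLOYEE_DEPT is recoverable, without loss of information, as
-- their join on DEPT_ID (which is also how it would be "denormalized").
CREATE TABLE DEPARTMENT (
  DEPT_ID   INTEGER     NOT NULL PRIMARY KEY,
  DEPT_NAME VARCHAR(50) NOT NULL
);

CREATE TABLE EMPLOYEE (
  EMP_ID   INTEGER     NOT NULL PRIMARY KEY,
  EMP_NAME VARCHAR(50) NOT NULL,
  DEPT_ID  INTEGER     NOT NULL REFERENCES DEPARTMENT (DEPT_ID)
);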
Simple examples used for teaching purposes are not necessarily simplistic. Carefully selected simple examples are often essential for understanding general principles, which can then be applied to more complex realities. In fact, the commenter himself uses a simple example to demonstrate what he calls "overnormalization":
"A particular bug-bear and a mistake that +90% of “data modelers” make, is analyzing “point in time” views of the business data and “normalizing” those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
{Invoice-Date,
 Customer-Account-ID,
 Customer Name,
 Invoice-Address-Line-1,
 Invoice-Address-Line-2,
 Invoice-Address-Line-3,
 Invoice-Address-Line-4,
 Invoice-Address-Postcode,
 Net-Amount,
 VAT,
 Total-Amount
};
Nearly every time, through the blind application of [full] normalization we get this (you’ll need to draw the diagram to get the picture):
SALES-INVOICE
{Invoice-Date,
 Customer-Account-Id
 REFERENCES Customer-Account,
 Net-Amount,
 Vat,
 Total-Amount
};
CUSTOMER-ACCOUNT
{Customer-Account-Id,
 Customer-Name,
 Invoice-Address REFERENCES Address
};

ADDRESS
{Address-Line-1,
 Address-Line-2,
 Address-Line-3,
 Address-Line-4,
 Postcode
};"
To ascertain the normal form a relation (pictured by an R-table) is in, we must know the attribute dependencies that hold in it, which reflect property dependencies in the real world. These dependencies are specified by business rules in the conceptual model and are represented by integrity constraints in the database. But we are asked to detect a normalization mistake -- a violation of 5NF -- in the absence of both the rules and the constraints (i.e., to infer/assume/guess the dependencies based solely on the column names). Hence Erwin's response:
"Your example has nothing to do with full normalization. To the extent that any "mistake" has been made, it's probably the failure to recognize that "address line as it was at the time of placing the order" is not necessarily the same as "address line as it is currently known right now, 10 years later". You might be interested to know that there are database design schemes that address this problem--correctly--without "denormalizing the details back into the invoice".
that evoked the following reply:
"If it’s not a mistake then why do so many “data modellers” make that mistake? I’ve lost count of the times that someone, when converting my information mode into a database schema, has done that type of conversion “to normalize the data” and then I have to argue with them to explain the error of their ways whilst they accuse me of “denormalizing” their data. There’s even a term for it – it’s called “over-normalization”."
The database field is riddled with misconceptions and errors due to a poor grasp of data fundamentals, CLC being one of them. "Overnormalization" is a label attached to all sorts of poor logical designs that have nothing to do with normalization/further normalization. Failure to spell out the conceptual model (business rules) before starting logical database design (including confusing tables with relations) is a certain recipe for poor design and all its costly implications.

In the real world the dependency that holds is not

{customer} -> {address line}
but
{customer, invoice date} -> {address line}
Had the correct dependency been spelled out in the conceptual model, it would have been clear that the single relation is not denormalized. But if we are given either of the two relations without the dependencies and asked what normal form it is in, it is impossible to say with any certainty.
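A minimal SQL sketch of a design consistent with the second dependency follows. The key of {CUSTOMER_ACCOUNT_ID, INVOICE_DATE} and the data types are assumptions made for illustration only (in practice an invoice number would likely serve as the key); the point is that the invoicing address, being determined by the invoice and not by the customer alone, stays with SALES_INVOICE:

-- Hypothetical sketch: the address at invoicing time is a fact about the
-- invoice, so it remains in SALES_INVOICE; CUSTOMER_ACCOUNT holds only the
-- currently known customer facts.
CREATE TABLE CUSTOMER_ACCOUNT (
  CUSTOMER_ACCOUNT_ID INTEGER     NOT NULL PRIMARY KEY,
  CUSTOMER_NAME       VARCHAR(50) NOT NULL
);

CREATE TABLE SALES_INVOICE (
  CUSTOMER_ACCOUNT_ID      INTEGER       NOT NULL
    REFERENCES CUSTOMER_ACCOUNT (CUSTOMER_ACCOUNT_ID),
  INVOICE_DATE             DATE          NOT NULL,
  INVOICE_ADDRESS_LINE_1   VARCHAR(60)   NOT NULL,
  -- INVOICE_ADDRESS_LINE_2 through _4 are analogous and omitted for brevity
  INVOICE_ADDRESS_POSTCODE VARCHAR(10)   NOT NULL,
  NET_AMOUNT               DECIMAL(12,2) NOT NULL,
  VAT                      DECIMAL(12,2) NOT NULL,
  TOTAL_AMOUNT             DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (CUSTOMER_ACCOUNT_ID, INVOICE_DATE)
);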

Note: Incidentally, "blind application of normalization" would not have left NET_AMOUNT, VAT and TOTAL_AMOUNT in the SALES_INVOICE relation, because of the real world dependency

{net-amount, vat} -> {total-amount}
(unless VAT denotes a percentage and not an amount, which we are not told either). If this dependency is taken into account, then TOTAL_AMOUNT should be separated into its own relation with {NET_AMOUNT,VAT} as key. However, such a base relation would violate the Principle of Representational Minimality (PORM) [7] because it is redundant: it can be derived from the other base relations. If the calculated attribute is left in SALES_INVOICE, an integrity constraint must be enforced to guarantee that at all times
NET_AMOUNT + VAT = TOTAL_AMOUNT
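Such a constraint could be declared in SQL, for example, as a table CHECK constraint -- a minimal sketch, assuming the SALES_INVOICE table and column names used above:

-- Keep the derived TOTAL_AMOUNT, but have the DBMS guarantee it is always
-- consistent with NET_AMOUNT and VAT.
ALTER TABLE SALES_INVOICE
  ADD CONSTRAINT TOTAL_IS_NET_PLUS_VAT
  CHECK (NET_AMOUNT + VAT = TOTAL_AMOUNT);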
In conclusion:
  • Distortions or misuse of normalization/further normalization should not be confused with the real thing. Often normalization terminology is applied to unrelated design aspects (e.g. designs that avoid missing data, include calculated data or aggregation, or represent entity supertype-subtypes). "Overnormalization" is a misleading term used for any poor logical design.
  • Full normalization alone is not sufficient to achieve designs devoid of drawbacks. For example, "details of historical customer data moved into invoice tables" is not inconsistent with 5NF, yet is bad design.
  • Property dependencies are axioms about the real world, represented in the database by attribute dependencies expressed as integrity constraints. Normalization theory is logical design reasoning based on those axioms (e.g., identification of the key, assessing the normal form of a relation). Different dependencies identify different keys and normal forms. Applying normalization/further normalization without spelling out property dependencies specified in the conceptual model is like reasoning without axioms: any conclusion is possible.
  • Guessing the normal form of a relation by inspecting the R-table that pictures it, based on column names alone, is a fool's errand: it can be tricky even with simple relations [7], and it is practically impossible with realistically complex ones, particularly beyond BCNF.
  • With knowledge of data and relational fundamentals, a well defined, complete and documented conceptual model and adherence to the Principle of Full Normalization (POFN) should implicitly yield a 5NF database. Explicit normalization/further normalization is necessary only for repair of designs that did not adhere to the POFN. 
  • Practitioners may be aware of one drawback of denormalization -- redundancy -- but not of all of them [6]. What is more, few realize that the redundancy introduced by denormalizing further-normalized relations (denormalizing merely normalized, i.e., 1NF, relations does not introduce redundancy) must be controlled by the DBMS to ensure database consistency, via enforcement of special integrity constraints, and fewer still know how to formulate them, where SQL supports them at all (see the sketch following this list). Unfortunately, even if such constraints were added -- a serious development burden -- it would all be for naught, because enforcing them executes the very joins denormalization is intended to avoid [2].
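For illustration, here is a minimal sketch of the kind of constraint that controlling such redundancy requires, assuming a hypothetical denormalized INVOICE_DENORM table into which CUSTOMER_NAME has been copied from CUSTOMER_ACCOUNT. Standard SQL expresses it as an assertion, which few SQL DBMSs actually support; note that enforcing it performs exactly the join the denormalization was supposed to avoid:

-- Hypothetical assertion: every CUSTOMER_NAME copied into INVOICE_DENORM
-- must agree with the corresponding row in CUSTOMER_ACCOUNT. Checking this
-- requires joining the two tables on CUSTOMER_ACCOUNT_ID.
CREATE ASSERTION CUSTOMER_NAME_COPIES_CONSISTENT
  CHECK (NOT EXISTS (
    SELECT *
    FROM   INVOICE_DENORM AS I
    JOIN   CUSTOMER_ACCOUNT AS C
      ON   I.CUSTOMER_ACCOUNT_ID = C.CUSTOMER_ACCOUNT_ID
    WHERE  I.CUSTOMER_NAME <> C.CUSTOMER_NAME
  ));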


References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS.
[2] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.
[3] Pascal, F., Normalization, Further Normalization, Ease of Use, Integrity and Performance.
[4] Pascal, F., Tables, Full Normalization and Business Rules.
[5] Pascal, F., Denormalization for Performance: Don't Blame the Relational Model.
[6] Pascal, F., The Costly Illusion of Denormalization for Performance, forthcoming, All Analytics.
[7] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming (draft chapters).






