Monday, March 18, 2013

Normal Forms: Dependent on Dependencies

by Erwin Smout and Fabian Pascal

The recently published rewrites of papers #1, Business Modeling for Database Design, and #2, The Costly Illusion: Normalization, Integrity and Performance, reflect our search over the years for the most effective way to dispel the many persistent myths and misconceptions common in the industry about conceptual (business) modeling and logical database design. One intention of splitting the related topics into two papers was to make an important dual point:

(1) Business modeling and database design done "right" implicitly yield R-tables in 5NF in most cases (see paper #1).

(2) Explicit further normalization of R-tables is necessary only to repair "poorly" designed R-tables (1NF-4NF), in order to eliminate certain drawbacks (see paper #2).

The papers explain what is meant by "right" and "poorly" and what those drawbacks are.

Note: For the purposes of this discussion we exclude recently proposed normal forms higher than 5NF, which are targeted at specific narrow circumstances and have not yet attained general consensus. We deliberately overlook the distinction between BCNF and 5NF, because cases where there actually is a distinction are fairly rare in practice and the distinction does not materially affect our argument. If and when we refer to 'full normalization' we mean 5NF.

We have written and spoken frequently about the "denormalization for performance" misconception, rooted in logical-physical confusion (LPC), but there is another misconception, caused by what can be called conceptual-logical conflation (CLC). You encounter it whenever a bunch of table schemas is presented, accompanied by the question of whether they are fully normalized and, if not, how to fully normalize them, without any reference to the conceptual model they are supposed to represent.



We regularly come across online exchanges exhibiting the LPC. One was initiated by the question "When was the last time you were able to build a relational model without a non-DB person insisting on the 'need' for denormalizing?" But one comment in that exchange brought the CLC to the fore. Before we delve into it, some basic terminology:

1. An R-table is normalized (in 1NF) by definition; historically, this means that it does not contain multivalued columns.

2. An R-table is commonly considered fully normalized (in 5NF) if and only if every join dependency that holds in it is exclusively on the key; informally, this means that it represents a single entity type (see the sketch after this list).

3. An R-table that is in a lower normal form--2NF, 3NF, BCNF or 4NF--can be further normalized to 5NF; informally, this means that it violates (2) by representing multiple entity types, which can be "unbundled", each into its own 5NF R-table.

4. R-tables in 2NF-5NF can be denormalized to a lower normal form by "bundling" the entity types they separately represent in one R-table.
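To make points 2 and 3 concrete, here is a minimal Python sketch, not from the original post, that applies the simpler BCNF test which, per the note above, we treat as a stand-in for full normalization: it computes attribute closures under a given set of functional dependencies (FDs) and flags any FD whose determinant is not a key. The table, attribute names and FDs are all invented for illustration.

# A minimal closure-based BCNF check. The attributes and FDs are invented;
# they bundle employee facts and a department fact in one table.
ATTRS = {"emp_id", "emp_name", "dept_id", "dept_name"}
FDS = [
    ({"emp_id"}, {"emp_name", "dept_id"}),  # facts about an employee
    ({"dept_id"}, {"dept_name"}),           # a department fact bundled in
]

def closure(attrs, fds):
    """All attributes functionally determined by `attrs` under `fds`."""
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(all_attrs, fds):
    """Return the FDs whose determinant does not determine the whole table."""
    return [(lhs, rhs) for lhs, rhs in fds if closure(lhs, fds) != all_attrs]

for lhs, rhs in bcnf_violations(ATTRS, FDS):
    print(f"violation: {sorted(lhs)} -> {sorted(rhs)}")
# Prints the dept_id -> dept_name violation: the table bundles two entity
# types and can be further normalized ("unbundled") as in point 3 above.

Note that the verdict is entirely driven by the FDs fed to the check; the column names themselves play no role. That is the point we return to below.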

And now to the comment:
The primary purpose of normalization in a logical entity-relationship model is to identify a single point of definition (or a single point of update for the ACID-oriented modellers) for any given data-item (or groups of data-items in the higher normal forms). Unfortunately just about all discussions of “normal forms” are very simplistic and nearly all the examples of the “normalization process” deal with very simple examples focussed on the values assigned to data-items rather than the purpose of the data-item itself.
Here we run into the CLC: an entity-relationship model (ERM) of a specific business is conceptual and informal (we note, in passing, The E/R Model Considered Harmful). As defined by Codd and others, normal forms are properties of logical R-tables. A "logical entity-relationship model", in this sense, confuses levels of representation (the commenter expressed a personal preference for object-role modeling over ERM, but an ORM model is also conceptual).

Regarding the objectives of normalization, some but not all practitioners may know one of them: eliminating a certain type of data redundancy from databases that contain <5NF R-tables. Fewer realize that uncontrolled redundancy increases the risk of the database being rendered inconsistent by updates. Fewer still know that redundancy is controlled via additional integrity constraints, or how to formulate them; and even if they undertook that serious control burden, it would all be for naught, because enforcing the constraints would defeat the very purpose of denormalizing for performance. And very few are aware of the other drawbacks of denormalization (see paper #2).
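To see what that control burden looks like, consider a hypothetical denormalized design in which the customer name is repeated on every invoice row (all names below are invented for this sketch). Controlling the redundancy means enforcing, on top of the key constraint, a rule that every row for the same customer carries the same name. A minimal Python sketch of such a check:

# Hypothetical denormalized invoice rows: customer_name is repeated and
# must be kept consistent across all rows for the same customer.
rows = [
    {"invoice_no": 1, "customer_id": "C1", "customer_name": "Acme Ltd"},
    {"invoice_no": 2, "customer_id": "C1", "customer_name": "Acme Ltd"},
    {"invoice_no": 3, "customer_id": "C2", "customer_name": "Bolt plc"},
]

def redundancy_violations(rows):
    """Customer ids whose repeated name is not the same on every row."""
    seen, bad = {}, set()
    for r in rows:
        cid, name = r["customer_id"], r["customer_name"]
        if cid in seen and seen[cid] != name:
            bad.add(cid)
        seen.setdefault(cid, name)
    return bad

print(redundancy_violations(rows))  # set(): consistent, for now

In a SQL DBMS such a check would typically take the form of triggers or application code, and it must run on every relevant update--precisely the kind of overhead that denormalization was supposed to remove.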

The comment offers a very unusual formulation of this objective, one that very few would recognize as a logical implication of redundancy elimination (we did not). With considerable slack in the interpretation of terms, it goes like this: without redundancy, every data item appears only once in the logical structure and thus provides its own unique "single point" to which updates must be applied ("ACID-oriented modellers" is an equally odd term and we would not use it in this context).

We also pondered the meaning of "focussed on the values assigned to data-items rather than the purpose of the data-item itself." It could be a reference to the practice of showing an R-table value (a value of an R-table variable) and then asking whether it "is in 3NF", failing to realize that normal forms apply to R-table schemas, not values. If this interpretation is right, it is another case of poor use of language.

Simple examples are not necessarily simplistic. Carefully devised simple examples are essential to convey general principles, which can then be applied to more complex realities. The CLC is particularly insidious in this respect.

In fact, the commenter himself uses a simple example to demonstrate "overnormalization":
A particular bug-bear and a mistake that +90% of “data modelers” make, is analyzing “point in time” views of the business data and “normalizing” those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
{Invoice-Date,
 Customer-Account-ID,
 Customer Name,
 Invoice-Address-Line-1,
 Invoice-Address-Line-2,
 Invoice-Address-Line-3,
 Invoice-Address-Line-4,
 Invoice-Address-Postcode,
 Net-Amount,
 VAT,
 Total-Amount
};
Nearly every time, through the blind application of normalization we get this (you’ll need to draw the diagram to get the picture):
SALES-INVOICE
{Invoice-Date,
 Customer-Account-Id
  REFERENCES Customer-Account,
  Net-Amount,
  Vat,
  Total-Amount
};

CUSTOMER-ACCOUNT
{Customer-Account-Id,
 Customer-Name,
 Invoice-Address REFERENCES Address
};

ADDRESS
{Address-Line-1,
 Address-Line-2,
 Address-Line-3,
 Address-Line-4,
 Postcode
};
To ascertain the normal form a table is in, we must know whether the column dependencies, which reflect the corresponding attribute dependencies in the real world, satisfy the 5NF criterion. But we are not given the attribute dependencies: we are expected to detect a "normalization mistake"--a violation of 5NF--by inferring/assuming/guessing those dependencies based solely on the column names. Hence Erwin's response:
Your example has nothing to do with normalization. To the extent that any "mistake" has been made, it's probably the failure to recognize that "address line as it was at the time of placing the order" is not necessarily the same as "address line as it is currently known right now, 10 years later". You might be interested to know that there are database design schemes that address this problem--correctly--without "denormalizing the details back into the invoice".
which evoked the following reply:
If it’s not a mistake then why do so many “data modellers” make that mistake? I’ve lost count of the times that someone, when converting my information mode into a database schema, has done that type of conversion “to normalize the data” and then I have to argue with them to explain the error of their ways whilst they accuse me of “denormalizing” their data. There’s even a term for it – it’s called “over-normalization”.
The mistake is due to the CLC. What is called "overnormalization" here is a poor logical design induced by a conceptual modeling error, one that causes the modellers to wrongly deem the correct logical design "denormalized". Skipping the spelling out of attribute dependencies facilitates such mistakes. The one made here is failing to see that in the real world the functional dependency (FD) that holds is not
{customer} -> {address line}
but
{customer, invoice date} -> {address line}
In other words, had the conceptual model spelled out the correct FDs, it would probably have been clear that the single-table design is not denormalized. But given either of the two designs and asked what normal form its tables are in, without the FDs spelled out, it is impossible to be sure.
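To make the dependence on the assumed FDs explicit, here is a small Python sketch (attribute names abbreviated and invented, the same closure idea as in the earlier sketch, and again using the BCNF test as a stand-in for full normalization). For simplicity it treats {customer, invoice date} as the identifier of an invoice and leaves the derived-total FD discussed below out of the picture. Under the correct FD the single-table design passes the test; under the assumed one it does not.

def closure(attrs, fds):
    """Attributes functionally determined by `attrs` under `fds`."""
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def passes_bcnf(all_attrs, fds):
    """True if the determinant of every FD is a superkey."""
    return all(closure(lhs, fds) == all_attrs for lhs, _ in fds)

# Single-table SALES-INVOICE design, attribute names abbreviated/invented.
ATTRS = {"cust", "inv_date", "cust_name", "addr", "net", "vat", "total"}

# The FD the conceptual model presumably intends: the address is the one
# that applied on the invoice date (assuming one invoice per customer per date).
CORRECT = [({"cust", "inv_date"}, {"cust_name", "addr", "net", "vat", "total"})]

# The FD the "over-normalizers" implicitly assume: the address depends on
# the customer alone, regardless of time.
ASSUMED = CORRECT + [({"cust"}, {"cust_name", "addr"})]

print(passes_bcnf(ATTRS, CORRECT))  # True:  the single table is fully normalized
print(passes_bcnf(ATTRS, ASSUMED))  # False: {cust} is not a superkey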

Note, incidentally, that "blind application of normalization" would not have left NET_AMOUNT, VAT and TOTAL_AMOUNT in the SALES_INVOICE table, because of the real world FD
{net-amount, vat} -> {total-amount}
(unless VAT denotes a percentage rather than an amount, which we are not told either). If this FD is taken into account, TOTAL_AMOUNT will be separated into its own R-table with {NET_AMOUNT,VAT} as key. But, in fact, that R-table is redundant--TOTAL_AMOUNT is a calculated column--and should not be defined at all. On the other hand, if the calculated column is left in SALES_INVOICE, an integrity constraint must be enforced to guarantee that at all times

NET_AMOUNT + VAT = TOTAL_AMOUNT

This is an example of what Codd referred to as redundancy and derivability in relational theory.
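For completeness, a minimal sketch of the two options in Python (column names and sample values invented): either do not store TOTAL_AMOUNT and derive it on demand, or store it and enforce the constraint on every insert and update--in a SQL DBMS typically via a CHECK constraint.

from decimal import Decimal

# Option 1: do not store the derived column; compute it on demand.
def total_amount(net: Decimal, vat: Decimal) -> Decimal:
    return net + vat

# Option 2: store TOTAL_AMOUNT, but then every insert/update must enforce
# NET_AMOUNT + VAT = TOTAL_AMOUNT (shown here as an application-level check).
def check_invoice(row: dict) -> None:
    if row["net_amount"] + row["vat"] != row["total_amount"]:
        raise ValueError(f"constraint violated for invoice {row['invoice_no']}")

check_invoice({
    "invoice_no": 42,  # invented sample values
    "net_amount": Decimal("100.00"),
    "vat": Decimal("20.00"),
    "total_amount": Decimal("120.00"),
})  # passes silently; change any amount and it raises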

In conclusion:
  • Distortions or misuse of normalization should not be confused with the real thing. Often normalization terminology is applied to unrelated design aspects (e.g. designs to avoid unknown data or represent entity supertype-subtypes). "Overnormalization" is a misleading label for what is essentially poor conceptual design (it is often applied, also erroneously, to further normalization beyond 3NF).
  • Full normalization alone is not sufficient to achieve "good" designs. For example, it allows a (poor) design with "details of historical customer data moved into invoice tables".
  • Attribute dependencies are axioms about the real world; normalization theory is logical design reasoning based on those axioms (e.g. identifying the keys, assessing the normal form of a table). Different dependencies yield different keys and normal forms. Applying normalization without spelling out attribute dependencies--which a conceptual model should capture--is like reasoning without spelling out the axioms: any conclusion is possible.
  • Guessing normal forms correctly by inspecting R-tables and column names, without proper conceptual specifications, is a fool's errand. As we have seen, this can be tricky even with simple tables; it is practically impossible with realistically complex ones, particularly beyond BCNF. The best way to address poor understanding of normalization is to ensure well-defined conceptual models and to spell out attribute dependencies clearly and completely (see paper #1). Then explicit further normalization is not necessary.



5 comments:

  1. I often use a "logical model" which is defined as the relational schema free from constraints and the "physical model" which is the relational schema as to be implemented in a DBMS accounting for constraints (the biggest of which is always political). The goal being the LM is "fully normalized". As you point out this approach is confused. My "logical" is really a conceptual, and my "physical" is really a logical - the physical really being the storage representation behind the scenes. If I adjusted terms, I could use the conceptual as a place to argue over correct dependencies, and the logical as the place to argue over alternatives to try as much as possible to stay true to the conceptual given the constraints. Unfortunately DAMA's data management framework seems to me to be confused as well. They define a conceptual, logical, and physical model, and then a database design to represent the DBMS internal design that does not impact the application. As I am bound to use this framework I remain in a state of confusion. It also seems to me the Zachman Framework is confused as well. It has a conceptual, logical, physical, and "detailed" layer. Do you consider both of these frameworks, which are about as gospel to data management as it gets, confused as well?

    1. >I often use a "logical model" which is defined as the relational schema free from constraints and the "physical model" which is the relational schema as to be implemented in a DBMS accounting for constraints (the biggest of which is always political). The goal being the LM is "fully normalized".

      The word "constraints" here risks confusion with integrity constraints, which is not what you mean; "implementation limitations" would be a better term.

      >As you point out this approach is confused. My "logical" is really a conceptual, and my "physical" is really a logical - the physical really being the storage representation behind the scenes.

      No, your "logical" is indeed logical, but your "physical" is not physical--it is also logical: the true physical level is not the schema denormalized due to implementation limitations, but the storage structures and access methods, hardware resources, network load and so on.

      >If I adjusted terms, I could use the conceptual as a place to argue over correct dependencies, and the logical as the place to argue over alternatives to try as much as possible to stay true to the conceptual given the constraints.

      Yes, the conceptual is the real world--the business--and dependencies are relationships in the real world. Once you establish what they are, which determines the entity types and attributes, then if you represent each entity type by one R-table, you end up with 5NF logical designs/schemas. In such a schema every join dependency that holds in each table is on the key, which is another way of saying that the attributes are of entities of one type.

      Now, if the DBMS performs poorly with the schema (and that is the correct perspective: it is the DBMS, not the schema, that performs), you should first try to address this at the physical level (hardware resources, network load, indexing and so on); if you exhaust that and the product still fails to perform, you may have to denormalize. But that is still a change at the logical level, and performance gains, if any, come from trading integrity for them. In other words, they don't come from denormalization per se, but from accepting corruption risks.

      >Unfortunately DAMA's data management framework seems to me to be confused as well... the Zachman Framework is confused as well. Do you consider both of these frameworks, which are about as gospel to data management as it gets, confused as well?

      I would not expect the Zachman framework to rely on correct notions of levels of representation. I would expect it from DAMA, but it does not surprise me that there is confusion there too (I recently offered to teach my seminars on database design and normalization at DAMA DC and was refused).

      But this is why it is important to be careful about sources of learning. The insidious aspect of all this is that many use language, concepts and terminology without a correct grasp of them, but they are unaware of it. So any attempt to educate is considered unnecessary. Then when you correct the mistakes, you end up accused of being offensive.

  2. Thanks for these clarifications. Sloppy terminology is a pet peeve of mine and your comments will help me be a little less sloppy in the future. I was completely oblivious to using the word "constraint" to mean both database and project constraints and will adopt your suggestion. I am less likely to attend a DAMA conference knowing they refused you, and it makes me wonder who else they might refuse who could lend a valuable perspective to the discipline. The last paragraph of your reply hit home when I read your latest post. It seems too many people are concerned only with defending the correctness of their current understanding instead of seeking to continually improve.

  3. This comment has been removed by the author.

    1. Groups and organizations are not purely selfless--there are always incentives and interests.

      Because it is possible to (1) have a good career without an education in the fundamentals, or (2) be hired without being required to know them, a considerable number of professionals are unaware that they need such knowledge, or that the notions they hold of it are inaccurate. So it's not entirely their fault, but a failing of the educational and employment systems. This is a systemic problem, not an individual one, which is why it is so difficult to solve.
