Sunday, June 2, 2019

Normalization and Further Normalization Part 2: If You Need Them, You're Doing It Wrong




In Part 1 we outlined some fundamentals of database design, namely the distinction between normalization to 1NF, and further normalization (to "full" 5NF), and explained that they are necessary only to repair poor designs -- if you (1) develop a complete conceptual model and (2) formalize it properly using the RDM, (3) adhering to the three core principles of database design, you should end up with a relational database in both 1NF and 5NF.

Here we apply this knowledge to the typical request for "normalization" help we presented in Part 1.


------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you. 

NEW

  • 05/21/19: Updated the LINKS page.
  • 04/20/19: Added POSTS page with links to all site posts, to be updated monthly.

LATEST PUBLICATIONS (order PAPERS and BOOKS) 


USING THIS SITE 

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics may work.

SOCIAL MEDIA 

I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.
------------------------------------------------------------------------------------------------------------------

Full Normalization, Not "Overnormalization"


For the "overnormalization" claim to hold, SALES-INVOICE must be a R-table (i.e., visualize a relation in 1NF and 5NF that, informally, represents a group of entities of a single type). Groups are specified in the conceptual model and the author of the example provides none (he too starts "arbitrarily" with the SALES-INVOICE table, the very "bug-bear and mistake that +90% of 'data modelers' make" he deplores). On what grounds, then, does ADDRESS constitute "overnormalization"?

Consider Erwin Smout's response:

“Your example has nothing to do with full normalization. To the extent that any "mistake" has been made, it's probably the failure to recognize that "address line as it was at the time of placing the order" is not necessarily the same as "address line as it is currently known right now, 10 years later". You might be interested to know that there are database design schemes that address this problem -- correctly -- without "denormalizing the details back into the invoice".”
Formally, for SALES-INVOICE to be a R-table, Address would have to be functionally dependent on the Customer-Account-Id PK (i.e., a 1:1 relationship between addresses and customers ), but Erwin infers from the help request that due to the temporal factor the relationship is M:1 -- addresses and customers are entities of distinct types. If so, then SALES-INVOICE is not a R-table -- it "bundles" data about two entity types/groups. Bundling is the kind of design flaw that further normalization (not normalization!) repairs. Unbundling address data to a separate relation visualized by ADRRESS is, thus, not "overnormalization", but actually further normalization to 5NF, and rendering the table a R-table.

Note very carefully that this could be determined only by inferring conceptual information from the example! ("analyzing "point in time" views of the business data ... [and] failing to consider change over time and the need to reproduce historic viewpoints”). The three-table design as is, of course, is an oversimplification -- the temporal design scheme to which Erwin alludes is beyond the scope of this post.

Note: Total-Amount is not functionally dependent on the PK either, but on {Net-Amount,VAT} (unless VAT denotes a percentage and not an amount, conceptual information that is also missing), and introduces redundancy that would violate the Principle of Representational Parsimony (PORP) -- one of the three formal design principles (see Conclusion). If it is left in SALES-INVOICE, an integrity constraint must be declared and enforced to guarantee that at all times Total-Amount is consistent with Net-Amount and VAT.

CLC, 1NF, and 5NF


“The primary purpose of normalization in a logical entity-relationship model is to identify a single point of definition (or a single point of update for the ACID-oriented modellers) for any given data-item (or groups of data-items in the higher normal forms). Unfortunately just about all discussions of “normal forms” are very simplistic and nearly all the examples of the "normalization process" deal with very simple examples focussed on the values assigned to data-items rather than the purpose of the data-item itself.”

Points arising:
  • "Logical entity-relationship model" is CLC: E/RM is a conceptual (i.e., reality), not logical (i.e., data) modeling approach[1,2];
  • Normalization (to 1NF) and further normalization (to 5NF) are two distinct repairs of different design flaws, and the common reference to both as "normalization" obscures the difference. Elimination of redundancy is one objective of repair by further normalization to 5NF (not by normalization to 1NF, which has a different objective -- simple domains with atomic values). The example, however, has a very unusual formulation of this objective, which, with considerable slack in interpretation, seems to go like this: without redundancy, every data item appears only once in the logical structure and, thus, provides its own "unique single point" to which updates must be applied;
  • Simple examples are an effective, if not the only, way to convey general principles that can be applied to complex cases. While conceptual information can be inferred from tables in such examples, many help requests are are complex enough to render this practice risky, resulting in flawed advice and poorly designed databases;
  • "Focussed on the values assigned to data-items rather than the purpose of the data-item itself" is another example of unusual language, and it is probably a stretch to interpret it as failure to recognize that data modeling ("values assigned to data-items) derives from conceptual modeling ("purpose of the data-item itself") -- had this been the meaning, it would have obviated the help request[3,4,5].


Conclusion


  • There are a lot of misconceptions and confusion about database design due to practitioners' lack of foundation knowledge, particularly that data modeling is formalization of conceptual models using a formal data model.
  • Relations are by definition in both 1NF and 5NF, otherwise they are not relations, and all bets are off.
  • Proper database design produces relations that do not require normalization to 1NF, or further normalization to 5NF, which are distinct repairs of non-relational designs with different objectives that should not be confused.
  • Full normalization is not sufficient for database design: all three formal design principles must be adhered to because, while they jointly imply POFN, POFN does not imply them (for example, POFN applies to single relations, PORP to multiple relations.
  • The only dependencies that hold in a relation are FDs of the non-key attributes on the PKs. They represent at the logical level relationships in the real world among all the members of an entity group that are collective properties (3OPs) of the groups. It follows that whether a table visualizes a relation can only be determined with reference to the conceptual model that specifies those relationships. Database design without complete conceptual models is like reasoning without axioms: any conclusion is possible. Guessing dependencies by inspecting tables is an upside down and backwards fool's errand.
  • Practitioners are, at best, cognizant of just one drawback of not fully normalized (non-relational) databases -- redundancy -- and are unaware of others that defeat the beneficial purposes of the RDM. 

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored. 


References

[1] Pascal, F., ERM Models Reality, RDM Models Data.

[2] Pascal, F., Understanding Data Modeling: E/RM and RDM.


[3] Pascal, F., What Meaning Means Business Rules, Predicates, Integrity Constraints and Database Consistency.

[4] Pascal, F., Data Meaning: Analytics vs. Data Mining.

[5] Pascal, F.,  Data and Meaning Parts 1-4.



No comments:

Post a Comment

View My Stats