Saturday, October 8, 2022

NOBODY UNDERSTANDS NORMALIZATION 1 (sms)



Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

What's right/wrong with this database picture?

“Normalization in relational databases is a design process that minimizes data redundancy and avoids update anomalies. Basically, you want each piece of information to be stored exactly once; if the information changes, you only have to update it in one place. The theory of normal forms gives rigorous meaning to these informal concepts. There are many normal forms. In this article, we’ll review the most basic:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
There are normal forms higher than 3NF, but in practice you usually normalize your database to the third normal form or to the Boyce-Codd normal form, which we won’t cover here.”

                                                                     --Vertabelo.com

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

10/08 NEW "DATA MODELS" PART 1 (t&n)

09/12 DATABASE DESIGN: THE STATE OF KNOWLEDGE IN THE INDUSTRY

08/28 NOBODY UNDERSTANDS DATABASE DESIGN (sms)

UPDATES

08/20 Added Logic and databases course to LINKS page.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I re-published only the 2017 columns @dbdebunk, and within them links to sources external to AllAnalytics may or may not work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
------------------------------------------------------------------------------------------------------------------
 

Fallacies

  • Neither normalization (to 1NF), nor further normalization (to 5NF) are part of proper database design.
  • "Update anomalies" is a misnomer -- they (and logical redundancy) are a consequence of a fundamental problem with current poor database design practices.
  • Stored redundancy is part of physical implementation orthogonal to logical database design.
  • None of the five normal forms (plus a special one) is "more basic" than the others -- relations that are in 3NF but not in 5NF are not database relations.

Fundamentals

Contrary to industry wisdom, a database relation is by definition in 5NF: the only dependencies that hold in it are functional dependencies (FD) of non-key attributes on the PK (i.e., for each PK value there is exactly one value of every non-key attribute value, but not vice-versa). This is a formal way of saying that each relation represents at the logical level a group of entities of a single type at the conceptual level (why?). Then only a PK constraint is sufficient for the DBMS to enforce the FDs for semantic consistency (i.e., faithful representation of the entity group) and, thus, correctness of the database and inferences from it.

Note: Non-5NF relations are, mathematically, relations, but not database relations.

Three relational design principles, if jointly adhered to, produce database relations in 5NF. But poor conceptual modeling and/or failure to adhere to the principles produces relations that bundle multiple entity types. Four ways to bundle each produces relations in a lower NF (1NF-4NF). Because each represents multiple entity types, more dependencies beyond those on the PKs hold. For semantic consistency, additional constraints must be expressed in the data sublanguage, declared to and enforced by the DBMS, some of which control redundancies. These can be complex and due to poor knowledge and because this is a burden (likely to also affect performance), it is usually not done, which results in semantic inconsistencies, some of which are misleadingly labeled "update anomalies"). Proper design of 5NF relations would obviate any explicit design repair (if enforced, the constraints would control the redundancy, but 5NF design is much simpler and, thus, safer and obviates any explicit repairs). Otherwise put, it's not the redundancies per se that is the problem, but the failure to control it).

Note: Database relations (in 5NF) are generally considered fully normalized. A 6NF was discovered by Date in the context of temporal databases, although the principle is slighly broader. This is beyond the scope of this discussion, except to say:

“If an attribute has multiple versions correlated by ANY well-ordered parameter (whether time, space, ratings, etc.), the necessity of 6NF (as more generally defined) becomes a possibility and maybe even a likelihood. One problem with it is that 6NF relations breaks up relations even further than does 5NF (what do they represent? not groups). Constraints that ensure you can properly correlate those attribute values related by the order parameter (e.g., having the same time stamp or being within a time interval) must be defined. If data is correlated with multiple well-ordered parameters (which you want to be able to select at will), these constraints become pretty complicated. It's not as somehow of a different category than 5NF, but just that it's really very specialized.”   --David McGoveran

If and only if a database is poorly designed, it is necessary to repair it:

  • Non-1NF relations are normalized to 1NF (no non-1NF relations are designed in practice).
  • 1NF relations that are not in 5NF (NF-4NF) are further normalized to 5NF.

Note very carefully:

  • The distinction between normalization and further normalization is important: the latter should not, as is common, be lumped together with the former. Neither is part of database design, in the sense that proper design (5NF) renders both unnecessary.
  • Industry's misleading teaching notwithstanding, repairs are not done stepwise from 1NF through 2NF, 3NF and 4NF to 5NF, but from any of the lower NFs directly to 5NF (see part 2)

 Setting Matters Straight

In light of the above, we revise the above comment as follows:

There are five normal forms (and a specialized one), none of which are more "basic" than the others. Normalization to 1NF is a repair procedures of poor designs to eliminate nested relations and ensure that data sublanguages are FOPL based -- decidable and declarative. Further normalization from 2NF, 3NF and 4NF directly to 5NF is a repair procedures of poor designs to reduce the constraints necessary to ensure semantic consistency and, thus, avoid incorrect databases and inferences from them, including data corruption due to logical redundancy uncontrolled by the DBMS. 

(Continued in Part 2)

 


 

 

 

 

No comments:

Post a Comment

View My Stats