Friday, December 2, 2022

NOBODY UNDERSTANDS FURTHER NORMALIZATION 4 (sms)



Note: In "Setting Matters Straight" posts I debunk online pronouncements that involve fundamentals which I first post on LinkedIn. The purpose is to induce practitioners to test their foundation knowledge against our debunking, where we explain what is correct and what is fallacious. For in-depth treatments check out the POSTS and our PAPERS, LINKS and BOOKS (or organize one of our on-site/online SEMINARS, which can be customized to specific needs). Questions and comments are welcome here and on LinkedIn.

In Part 3 we set the matter straight about normalization to 1NF. In this part we do it wit respect to further normalization to 5NF. Non-1NF relations (i.e., with relation-valued attributes) are no longer part of industry practice, so we focus on 2NF-5NF violations. The term further normalization originates with Codd, who initially thought 1NF was sufficient and 2NF-5NF were discovered later (hence, further = beyond 1NF). The industry lumps both under normalization, but the two are distinct (e.g., only further normalization involves redundancy).

What's right/wrong with the following?

“So, what is this theory of normal forms? It deals with the mathematical construct of relations (which are a little bit different from relational database tables). First, second, and third normal forms are the basic normal forms in database normalization. Normalization in relational databases is a design process that minimizes data redundancy and avoids update anomalies. Basically, you want each piece of information to be stored exactly once; if the information changes, you only have to update it in one place. The normalization process consists of modifying the design through different stages, going from an unnormalized set of relations (tables), to the first normal form, then to the second normal form, and then to the third normal form.”
--Vertabelo.com

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

11/10 NEW "DATA MODELS" 4 (t&n)

11/03 NOBODY UNDERSTANDS NORMALIZATION 3 (sms)

10/29 NEW "DATA MODELS" 3 (t&n)

UPDATES

08/20 Added Logic and databases course to LINKS page.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I re-published only the 2017 columns @dbdebunk, and within them links to sources external to AllAnalytics may or may not work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
------------------------------------------------------------------------------------------------------------------

It reflects an industry consensus that, due to poor understanding of data and relational fundamentals, exhibits multiple misconceptions.

Fallacies

1. Relations are not just "a little bit different from relational database tables" (which play no part in the theory).
2. There are not just three, but five equally basic normal forms (1NF-5NF) and a specialized sixth (6NF).
3. Database relations are not just mathematical relations, but are specially adapted for database management. Each is constrained to represent a group of entities of a single type and, thus, by design in 5NF (and implicitly in all lower normal forms).
4. Poorly designed non-5NF relations -- base and derived -- are, mathematically, relations, but not database relations and require repair:

  • Non-1NF (unnormalized) relations are normalized to 1NF;
  • 1NF (normalized) but non-2NF/3NF/4NF/5NF relations are further normalized to 5NF.

We have debunked these misconceptions repeatedly and extensively -- see our publications and posts. Here we focus on two others of which many (if not most) practitioners are unaware:

5. "Update anomalies" is a misnomer that inhibits understanding of the real issue: semantic loss.
6. The design repair process of further normalization does not usually proceed "in stages" (i.e., from non-1NF through 2NF, 3NF and 4NF).

Update Anomalies Are Semantic Loss

Database relations are mathematical relations specially adapted for database management. Each represents a group of entities of a single type and, thus:

  • Is by design in 5NF;
  • Is semantically constrained for faithful group representation.

Because the only attribute dependencies that hold in them are functional dependencies (FD) of every non-key attribute on the primary key (PK) (i.e., for each PK value there is exactly one value of every non-key attribute), one major advantage of 5NF relations is that the PK uniqueness constraint is sufficient to guarantee semantic consistency with respect to attribute dependencies.

If, however, database design "bundles" multiple entity groups in single relations, other more complex dependencies are introduced:

  • Functional dependencies on part of the PK (2NF), or on non-key attributes (3NF);
  • Non-functional dependencies (4NF,5NF).

which necessitates declaration by users and enforcement by DBMS of the corresponding additional constraints. It is such poor designs that produce redundancy and it is the purpose of those additional constraints to control the redundancy and prevent update anomalies. While theoretically possible, in practice not only would that be an unnecessary burden, but:

  • Practitioners are unfamiliar with those dependencies and do not know how to formulate the complex corresponding constraints;
  • Data sublanguages (SQL) do not express all of them.

It is reasonable to expect that even if these were not issues, these constraints would be disregarded (semantic loss), generating update anomalies. Indeed, the complexity of non-functional dependencies is the very reason further normalization in the industry stops at 3NF!

Note very carefully, therefore, that the real issue is semantic consistency -- faithful database representation of the conceptual model. The objective of 5NF design is to minimize the burden of guaranteeing it -- amount and complexity of constraints -- that would be otherwise necessary to enforce dependencies, the likely skipping of which would be semantic loss. In other words, the issue is not redundancy per se, which could be controlled, but the considerable yet unnecessary burden that control would take, which is unlikely to be undertaken, resulting in semantic loss.

Depends on Dependencies

A database relation is by design in 5NF. There are five distinct ways to bundle multiple entity groups into a single relation, each of which gives rise to a distinct violation of one of the five normal forms (1NF-5NF) and, thus, also violates 5NF. Relations that violate any of the normal forms are, mathematically, relations, but not database relations and must be repaired:

  • Non-1NF relations by normalization;
  • Non-2NF/3NF/4NF/5NF relations by further normalization.

To debunk the misleading claim of further normalization "in stages" we shall use an example. Consider the table:

 BIKEPARTS
--------------------------------------------
 PART        SUPPLIER     QUANTITY  SCOUNTRY
-======================---------------------
 Saddle       Bikeraft         10   USA
 Brake lever  Tripebike         5   Italy
 Top tube     UpBike            3   Canada
 Saddle       Tripebike         8   Italy
--------------------------------------------
We are told that:
“The set {PART,SUPPLIER} is the only candidate key of this relation. The value of supplier country is functionally dependent on supplier. Supplier country is not part of the candidate key, so it is ... functionally dependent on part of the candidate key, not the entire candidate key.”
In other words, it is not a R-table: its body visualizes a relation that bundles two entity groups, bike parts and suppliers, in a specific way such that it violates 2NF: aside from the FD on the (composite) PK:
{PART,SUPPLIER} --> QUANTITY
it introduces a FD of a non-key on part of the PK:
SUPPLIER --> SCOUNTRY.
which is why the PK constraint is not sufficient to enforce dependencies.

Further normalization unbundles the two groups to their own 2NF relations:

 PARTS
----------------------------------
 PART         SUPPLIER   QUANTITY
-====-----------------------------
 Saddle       Bikeraft         10
 Brake lever  Tripebike         5
 Top tube     UpBike            3
 Saddle       Tripebike         8
---------------------------------- 

 SUPPLIERS
--------------------
 SUPPLIER   COUNTRY
-========-----------
 Bikeraft   USA
 Tripebike  Italy
 UpBike     Canada
--------------------

But they are also in 5NF!! So what about the industry notion that further normalization to 5NF proceeds "in stages", in this case from 2NF through 3NF and 4NF?

Well, if violation of a specific normal form is due to a distinct type of bundling that introduces a distinct type of dependency, you would have to start further normalization with a 1NF relation incorporating all four types of bundling -- a design messup that even the sorry industry practice rarely engages in (not the least reasons being the need to know all four types of bundling and the huge difficulty to understand and disentangle relations so bundled to the hilt). If, like in most cases, there there is just one type of bundling, repair will have one stage stage -- in the example further normalization to 2NF also yields 5NF.

Setting Matters Straight 

We revise the paragraph as follows:

“So, what is this theory of normal forms? It deals with the construct of database relations (distinct from mathematical relations and not to be confused with tables). First, second, third, fourth and fifth normal forms are the basic normal forms in database normalization (to 1NF) and further normalization (up to 5NF). Further normalization in relational databases is a design repair process that simplifies the prevention of semantic loss by avoiding instead of producing and controlling logical redundancy (avoiding stored redundancy is an implementation, not database design issue). Depending on the dependencies introduced by bundling, further normalization takes non-2NF/3NF/4NF/5NF relations usually directly to 5NF, or rarely through intermediate normal forms.”

Compare.

 

 

 

 

 

No comments:

Post a Comment

View My Stats