Sunday, September 19, 2021

TYFK: Calculated Attributes -- Redundancy, Full Normalization and Relational Theory



Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“If you have shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory...”

“Having a "TOTAL" field in your "order" table *might* violate relational theory, but if you make it so that only a trigger can update it based on what's in your "order_item" table, then I think it's fine. You still get data integrity and that is what matters.”

“I still fail to see what you mean by the "calculated TOTALS field" (attribute, really) violates the Relational Model.”

“The result of having the field ... is what is called a DELETE ANOMALY.”

“Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed.”

“There are four practical problems with a fully normalized database, three of which I have listed before. I will list them all here for completeness:
* No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. The burden of providing calculated values must be taken up by somebody somehow. Denormalization is one approach to this, though there are others.
--Database Programmer blog

“...I'm now working with IT to normalize part of the database to remove calculated fields...:
`lineitems`.`extended total` = `lineitems`.`units` * `biditems`.`price`.
`jobs`.`jobvalue` = the sum of related `lineitems`.`extended total` records
`orders`.`ordervalue` = the sum of related `jobs`.`jobvalue` records.”
--mySQL.com

Do calculated attributes (not fields!) violate relational theory and must be "normalized" out of them? Determining that requires foundation knowledge that is scarce in the industry, which has a poor and outdated understanding of the RDM.

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

- 09/11 OBG: Data Warehouses Are Non-Relational Application Views

- 09/04 Understanding Relational Constraints

- 08/31 TYFK: Normalized, Fully Normalized, Non-Normalized, Denormalized -- Clearing the Mess

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- The PostWest blog for monthly samples of global Antisemitism – the only universally acceptable hatred left – as the (traditional) response to the existential crisis of decadence and decline of Western  civilization (including the US).
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

Fundamentals

A database relation is in 5NF (fully normalized) if it represents a group of entities of a single type. Formally:

  • It is in 1NF; and,
  • The only attribute dependencies that hold in it are functional dependencies of all the non-PK attributes on the PK (i.e., for each PK value there is exactly one value of every non-PK value, but not vice-versa).

Note: A relation is in 1NF if all its attributes are defined on simple domains with values treated as atomic by the data sublanguage.

A non-5NF relation "bundles" multiple entity groups and exhibits other dependencies, due to which it exhibits redundancy that carries risk of "update anomalies" -- corruption due to partial update of redundant data. Ever since the introduction of the RDM in 1969, the industry perceives full normalization as intended to eliminate the redundancy and avoid the associated risk of update anomalies: database design is explicit stepwise further normalization of non-5NF relations until they are fully normalized (unless performance dictates denormalization -- a fallacy that disregards that risk altogether). If you believe with the majority that full normalization is for elimination of redundancy in order to prevent update anomalies, you might also think that other redundancies, such as calculated attributes ought also to be further "normalized away".

First, note that redundancy is a database design/integrity, not relational issue -- can you specify exactly on what RDM tenets it infringes? As the second commenter points out, update anomalies can be prevented even in the presence of the redundancy in non-5NF relations (although that defeats the purpose of "denormalization for performance").

Second, further normalization is geared to eliminate only redundancy due to "group bundling" and attribute dependencies by applying relational projections (the reverse of joins) to non-5NF relations. Per a chapter in my book, there are redundancies not due to bundling -- duplicate tuples and calculated attributes are examples -- which cannot be eliminated by projections and, thus, are not further "normalizable away".

Third, and most important, the industry is unaware that elimination of redundancy, while an added practical benefit of full normalization, is not its primary purpose, which is more fundamental: we now contend that non-5NF relations violate relational theory. Without going into mathematics that is beyond the scope of this post, when Codd introduced the RDM in 1969, the initial version of the relational algebra (RA) was different than today's and its join operator was defined to work with relations in what Codd defined then as their normal form (NF) -- this is the origin of the earlier mandate that database relations are at least in 1NF (even though today's 1NF is different than the NF). That caused some problems which pushed Codd into the "higher" normal forms culminating with 5NF and later the revision of join to the currently used version.

While reviewing Codd's work, McGoveran had the insight that 5NF was to the new join what NF was to the old: having found the NF-old join combination problematic, Codd had to revise both to 5NF-new join for the RA to work properly. Thus, if applied to non-5NF relations, the current join can produce anomalous results, which turn out to be the source of problems with view updating.

Note: For correctnes, the SST foundation of RDM mandates three database design principles:

  • The Principle of Expressive Completeness (POEC);
  • The Principle of Representational Parsimony (PORP);
  • Principle of Orthogonal Design (POOD);

McGoveran conjectures (but has not yet proved) that they jointly imply a fourth Principle of Full Normalization (PoFN), though not the other way around. If so, joint adherence to the three principles implicitly produces fully normalized databases, which:

  • Obviates explicit stepwise further normalization;
  • Enables view updating and logical independence.

which reinforces the insight, and it's why we now contend that relational theory mandates full normalization. Had Codd had the insight himself, 2NF-4NF (and other questionable normal forms floating in the industry) wouldn't have probably come up.

Conclusion

The primary purpose of full normalization is to avoid RA anomalies -- redundancy is a database design/integrity, not relational issue. While update anomalies can be avoided for non-5NF relations, they impair the RA and, thus, violate relational theory.

Calculated attributes are a type of redundancy that is not due to attribute dependencies and, thus, cannot be further normalized away, and do not violate PoFN and relational theory. They are simply poor database design induced by a combination of lack of foundation knowledge and poor DBMS implementations.

 

 

 

 

No comments:

Post a Comment

View My Stats