Saturday, March 4, 2023

ON NORMALIZATION AND THE SCIENTIFIC METHOD (t&n)



(originally published August 2002)

Note: "Then & Now" (T&N) is a new version of what used to be the "Oldies but Goodies" (OBG) series. To demonstrate the superiority of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, as well as the evolution/progress of RDM, I am re-visiting my 2000-06 debunkings, bringing them up to my with my knowledge and understanding of today. This will enable you to judge how well my arguments have held up and appreciate the increasing gap between scientific progress and the industry’s stagnation, if not outright regress.

Then ...

Email exchange with a reader:
“I find [your article in DM Review] to contradict your stated devotion to scientific methods and the value of theory. You present a single example of denormalization, then proceed to draw a conclusion about denormalization in general. In addition, the example chosen is not typical of real world denormalizations.” In order to be half-way consistent with your own ideals, you would need to present at a minimum an exhaustive list of the types of denormalizations used in practice, along with an objective list of the pros and cons of each.  I would expect that if this were undertaken, you would end up with a more balanced view, and some exceptions to your black-and-white conclusions. Of course, to prove your point scientifically would require far more effort than this, if indeed it were at all possible to prove or disprove your statements. This brings me to my key point: if your contention is not falsifiable, it does not belong in the realm of true science at all, instead it belongs in the domain of mere opinion and belief. Please tell us how you have proved your propositions, or else refrain from claiming that you are working from a sound scientific foundation and everyone else is somehow misguided. Relational algebra has nothing to say about real-world performance.”

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

02/04 CONCEPTUAL MODELING, LOGICAL DATABASE DESIGN AND PHYSICAL IMPLEMENTATION (sms)

01/22 CONCEPTUAL BUSINESS RULES AND LOGICAL CONSTRAINTS (sms)

01/02 NEW "DATA MODELS" 5.2 (t&n)

UPDATES

12/22 Added Finitary Relation to LINKS page

08/20 Added Logic and databases course to LINKS page.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I re-published only the 2017 columns @dbdebunk, and within them links to sources external to AllAnalytics may or may not work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
------------------------------------------------------------------------------------------------------------------

First, you are confusing formal theory with empirical theory. RDM (and normalization) is the former. Second, please provide one example of denormalization for which the logic of my arguments does not apply.

“I want users to be able to quickly retrieve total monthly sales for product A. They do this hundreds of times a month.  I create a table keyed on Year, Month and Product, to hold the total sales.  I then update the total as orders are processed.  In a completely normalized database, the query to get the total would have to read thousands of rows of order lines, and would be orders of magnitude slower. Also, I would appreciate it if you could explain how my arguments do not apply to formal theory.”

As I guessed, like so many practitioners, you do not understand what normalization is. Your example is one of storing derived (i.e., calculated) data -- a form of redundancy different from that due to denormalization. The chapter on redundancy in my book has separate sections for denormalization and for derived data (your example seems to refer to historic data, which are not updated and hence redundancy does not constitute a problem).

My arguments apply to any redundancy of data that is being updated. The only reason you may get better performance is because you trade integrity for it -- you ignore the risk of inconsistency. Now, if practitioners knew and understood this and consciously decided to give up integrity, I would still worry, but if that's their choice, fine. The problem is that the vast majority are oblivious to the integrity implications of denormalization.

Regarding your claim that "Relational algebra has nothing to say about real-world performance", [indeed, that's precisely what my article says -- physical independence is a core objective of RDM]: normalization/denormalization [is logical database design] and cannot possibly affect performance, which is determined [exclusively] by physical implementation. What this means is that if you get poor performance, it is [because the physical implementation] (database and DBMS, network, hardware, load) is not up to par with respect to the he fully normalized (i.e., correctly designed] database; so do not mistakenly blame the RDM. Like so many, you confuse logical and physical levels, which is so entrenched that even an article which makes every effort to disabuse of such confusion, does not get thru.

The distinction between empirical and formal theory is much beyond databases and computing -- it requires an understanding of science; the difference between the two kinds of theory is not something that can be explained and learned via email. If this is of interest to you, I suggest you educate yourself on the subject, particularly if you want to engage in public discussion on it.

... and Now

According to industry "understanding" -- to the extent there is any -- of RDM:

  • Database relations must be in 1NF and preferably also in 5NF.
  • The purpose of full normalization is avoidance of redundancy and "update anomalies".

But according to proper understanding, which is absent, RDM mandates database relations to be in 5NF to prevent semantic loss.

If performance with a 5NF database is poor and denormalization improves it, it only means that the implementation does not support physical independence.  Users should not have to trade semantic consistency for performance by denormalizing, but ensure maximum optimization of the the implementation and/or expect/demand better RDBMSs from vendors.

For the difference between formal and empirical theory see https://en.wikipedia.org/wiki/Formal_science.



No comments:

Post a Comment

View My Stats