Friday, May 31, 2019

Normalization and Further Normalization Part 1: Databases Representing ... What?




Note: This is a re-write of older posts (which now link here), to bring them into line with the McGoveran formalization, re-interpretation, and extension[1] of Codd's RDM.
“A particular bug-bear and a mistake that +90% of "data modelers" make, is analyzing "point in time" views of the business data and "normalizing" those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-ID,
  Customer Name,
  Invoice-Address-Line-1,
  Invoice-Address-Line-2,
  Invoice-Address-Line-3,
  Invoice-Address-Line-4,
  Invoice-Address-Postcode,
  Net-Amount,
  VAT,
  Total-Amount
 };
Nearly every time, through the blind application of normalization we get this ... there’s even a term for it -- it’s called "over-normalization":
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-Id
   REFERENCES Customer-Account,
  Net-Amount,
  VAT,
  Total-Amount
 };

CUSTOMER-ACCOUNT
 {Customer-Account-Id,
  Customer-Name,
  Invoice-Address
   REFERENCES Address
 };

ADDRESS
 {Address-Line-1,
  Address-Line-2,
  Address-Line-3,
  Address-Line-4,
  Postcode
 };”
A measure of scarcity of foundation knowledge in the industry are the attempts to correct a plethora of common misconceptions[2] that suffer from the very misconceptions they aim to correct. One of the most common fallacies is confusion of levels of representation[3] that takes two forms[4]. We have written extensively about the logical-physical confusion (LPC)[5,6,7,8] underlying "denormalization for performance"[9], and the conceptual-logical conflation (CLC) that lumps conceptual with data modeling[10,11,12], inhibiting understanding that the latter is formalization of the former. 


------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you. 

NEW

  • 05/21/19: Updated the LINKS page.
  • 04/20/19: Added POSTS page with links to all site posts, to be updated monthly.

LATEST PUBLICATIONS (order PAPERS and BOOKS) 

USING THIS SITE 

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics may work.
SOCIAL MEDIA 

I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

Note: Here's an example of both:
“I am working on Data Modelling for a database (DB2). I need help on deciding keys for the table.
Table T1 has columns Col1 Col2 Col3 Col4 Col5, etc. The natural key is a composite key (Col1, Col2, Col3, Col4). In most of the scenarios, Col1,2 and 3 can be the same, and they can be distinguished apart by Col4. So in order to normalize the table, I split the tables as
Table T1 --> Col1,Col2, Col3 P.Key-->Surrogate Key, make the combination unique(Unique Key/index)
Table T2--> T1.Surrogate KEy, Col4, Col5, etc P.Key-->(T1.SK, Col4)
My questions are
1: Is Surrogate Keys better [sic] than Natural PK(Composite PK) in this scenario?
2: Will index rebuilding/maintenance become a major issue, because this is a transactional table, and will be hit a lot of times during the day?
3: This table will be joined with multiple tables, so will Surrogate keys cause overheads while hashing?
4: Which one is better for joins in this case? Surrogate Key as a PK or composite natural key as PK? Please advise which is a better approach. Composite PK or Surrogate Key for this scenario.
A database relation is by definition in both 1NF and 5NF. Whether a table visualizes a relation can only be determined with reference to the conceptual model which the relation represents, yet due to CLC requests for design help do not provide one.

Fundamentals


A relational database is a formal representation of a conceptual model of reality consisting of business rules (BR) expressed in specialized natural language that specify facts about groups of entities -- properties of individual group members, and their collective properties as groups, some of which arise from relationships[13]. In a relational database, each relation represents a set of such facts about a group of entities of a single type. There are three formal principles of database design (POOD, POEC, PORP) that (according to a McGoveran as yet unproven conjecture) jointly imply the Principle of Full Normalization (POFN) (but not vice-versa)[1,14] -- all of which apply to relations, not tabular visualizations thereof[15].

According to traditional understanding, a database relation is by definition in first normal form (1NF)[16]: all attributes are defined on simple domains without meaningful components, the values of which are treated as atomic by a relational data sublanguage[17]. Preferably -- to avoid redundancy and update anomalies -- it should also be in fifth normal form (5NF): the only dependencies that hold in it are functional dependencies (FD) of the non-key attributes on the primary key (PK) (i.e., for each PK value there is exactly one value of every non-key attribute)[18]. This reflects the fact that a relation represents a group of entities of a single type (why?)

However, based on his work formalizing, re-interpreting, and extending Codd's RDM[1], McGoveran has concluded (which is not universally accepted yet) that semantic correctness and system-guaranteed logical validity require that relations are also in 5NF by definition[19],  otherwise they are not relations, the database is not relational, and all bets are off.

It follows that explicit normalization to 1NF and further normalization to 5NF are necessary only to repair flawed designs. But instead of developing complete conceptual models and formalizing them as logical models using the RDM while adhering to the three principles -- which would produce 1NF and 5NF relational databases -- they "start with a list of data-items" as columns in some ad-hoc tables, try to [further] normalize them stepwise through 2NF-4NF (often only to 3NF), get stuck, and request "normalization" help. Here are two examples:

"I have a invoice that I am trying to normalize to 3NF. I got it to 2NF but am stuck on 3NF because I am not sure what to do with the Order Number attribute.”
“The single most important thing, by far, is to do a data design before you do a database design. That is, a true logical data design before a physical db design. You do that by ... Going thru the normalization steps, in order: 1NF, 2NF, 3NF ...”

Instead of asking for the model, those who respond, in turn, make conceptual guesses, which may be possible in very simple cases, but many of the help requests involve cases complex enough to render such guesses highly questionable, which results in bad advice, and the proliferation of flawed designs.

Note: Data structures 

  • Not in 1NF require normalization;
  • In 1NF but not in 5NF (in 2NF-4NF that "bundle" entity groups) require further normalization: replacement with their 5NF projections without loss of information[20].

Continued in Part 2




Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters),
forthcoming.

[2] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS.

[3] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation.

[4] Pascal, F., The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[5] Pascal, F., Data Warehouses and the Logical-Physical Confusion.

[6] Pascal, F., Physical Independence Parts 1,2.

[7] Pascal, F., Don't Conflate/Confuse Primary Keys, PK Constraints, and Indexes.

[8]  Pascal, F., No Such Thing As Primary Key Tuning.

[9] Pascal, F., Denormalization for Performance: Don't Blame the Relational Model.

[10] Pascal, F., Conceptual Modeling Is Not Data Modeling.

[11] Pascal, F., Don't Conflate Reality and Data.

[12] Understanding Conceptual vs. Data Modeling Parts 1-3.

[13] Pascal, F., Relationships and the RDM Parts 1-3.

[14] Pascal, F., What Relations Really Are and Why They Are Important.

[15] Pascal, F., Tables? So What?

[16] Pascal, F., First Normal Form in Theory and Practice Parts 1-3.

[17] Pascal, F., Simple Domains and Value Atomicity.

[18] Pascal, F., The Key to Relational Keys: A New Understanding.

[19] Pascal, F., Relational Database Design, Logical Validity and Semantic Correctness.

[20] Pascal, F., The Costly Ilusion: Normalization, Integrity, and Performance.

No comments:

Post a Comment

View My Stats