EVERYBODY THINK THEY KNOW FIRST NORMAL FORM, BUT NOBODY DOES

Follow @DBDebunk Follow @ThePostWest

“I have read this article in an effort to boost my academic knowledge on data modeling a bit and still have no idea what this academic author wanted to say. Apparently First Normal Form (1NF) doesn't get enough respect and then proceeds to talk about Non-First Normal Form (NFNF). But what about First Normal Form (1NF) damnit.”

By sheer chance this was posted on LinkedIn just after I published my new paper The First Normal Form: A Definitive Guide.

PRACTICAL DATABASE FOUNDATIONS

FIRST NORMAL FORM

A DEFINITIVE GUIDE

(September 2023)

Fabian Pascal

Table of Contents

Introduction

1. The Normal Form

2. The First Normal Form

3. Domain Decomposability & Atomicity

4. 1NF & Tables

5. SQL & 1NF

5.1. Repeating Groups & Repeated Attributes

5.2. Information Principle & SQL

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

09/17 DBDEBUNK REFRESH

08/17ENTITIES, PROPERTIES AND CODD'S SLEIGHT OF HAND

07/31ON RELATIONAL KEYS (& DOMAINS) (t&n)

UPDATES

08/13/23 Added Good explanation of 'class' and 'type' to the LINKS page

LATEST PUBLICATIONS (order from PUBS pages)

08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the SEARCH page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I re-published only the 2017 columns @dbdebunk, and within them links to sources external to AllAnalytics may or may not work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me @DBDdebunk on Twitter

------------------------------------------------------------------------------------------------------------------

The reason I wrote it is in the title of this post. To be fair, some of the problem is due to the changing terminology originally used by E.F. Codd, but most is due to the disregard for and lack of foundation knowledge in general and RDM in particular in the industry.

When I checked the article I instantly understood why the reader still "has no idea what this academic author wanted to say"--this is typical of the author (who, by the way, is not an academic). I stopped debunking his output decades ago and I would have not bothered with it now, but I could not resist the opportunity to promote my paper.

The following problem motivated the article:

"...storing International Classification of Diseases (ICD) values. Alternatively, to put it more accurately ... The non-table he was creating had a single text column that concatenated exactly four ICD codes ... ICD-10-CM codes consist of 3 to 7 alpha-numeric characters (case-insensitive), and codes longer than three characters have a decimal point between the third and the fourth character, e.g., “E11.9” represents Type 2 diabetes mellitus without complications... First Normal Form (1NF) says that this concatenated string is a repeated group, and we need to replace it with a proper relational construct. However, instead of getting some help on how to do it right, people posted various bits of code to slice up the original string!"

This is a common database design flaw, known in the industry as "overloading attributes" that I wrote about in 2017, specifically the relational principle it violates (Outsmarting the DBMS--Analysts Should Beware.) Those interested in what 1NF means will have to read my paper. Here I will only address issues with the article.

“Repeating groups originate with the pre-DBMS physical records of application program files (as in COBOL) having a variable number of fields, which could be indicated by a special field count in the "record header" by a special end of "group" indicator value in the last field, or—worse—recorded only in the program code that accessed those records. This same method was used at the field level: some fields—like one-dimensional arrays, or lists—contained multiple values or "subfields" and determining where the list ended could be tricky.”
--David McGoveran

Note: Yes, I know, practitioners use this term to refer to what should be "replaced with a relational construct", but 1NF should be defined relationally, not by the absence of a pre-database implementation feature.) The article gets closer to this.

"Since 1NF is the basis for RDBMS, it was the one academics played with first. And we have real multi-valued databases to see if it works. Most of the academic work was done by Jaeschke and Schek at IBM and Roth, Korth, and Silberschatz at the University of Texas, Austin. They added new operators to the relational algebra and calculus to handle “nested relations” while keeping the relational model’s abstract set-oriented nature. 1NF is inconvenient for handling data with complex internal structures, such as computer-aided design and manufacturing (CAD/CAM). These applications have to handle structured entities, while the 1NF table only allows atomic values for attributes."

"The basis of RDBMS" is somewhat vague. For the purposes of this discussion let's just say that 1NF is a required, if insufficient property of database relations. The article claims that if the concatenated strings
(whether they are repeating groups or not) are replaced with "nested relations" the 1NF mandate is satisfied.

Multivalued databases (MVDB) permit such relations--formally relations defined on Relation Valued Domains/Attributes (RVD/RVA). In other words, such relations are the relational construct to replace the concatenated string (whether they are a repeating group) and the solution to the design problem raised above. But is it?

Again, the reader will have to read my paper to decide whether RVDs/RVAs are consistent with 1NF. But while the IBM work cited in the article was successful, there is this:

"We discuss the issues encountered in the extended algebra and calculus languages for nested relations defined by Roth, Korth, and Silberschatz. Their equivalence proof between algebra and calculus fails because of the keying problems and the use of extended set operations [which] also have unintended side effects. Furthermore, their calculus seems to allow the generation of power sets, thus making it more powerful than their algebra" [and, thus, undecidability risk?] --On Roth, Korth, and Silberschatz's extended algebra and calculus for nested relational databases, ACM Transactions on Database Systems

The advantage of nesting/unnesting of relations is beneficial relative to relational DBMSs that do not permit RVDs/RVAs is unclear. The article admits a practical drawback--complexity:

"...the first problem of Non-First Normal Form (NFNF) data; it is simply complicated. In the SQL standards, a “field” is defined as a part of a column that has some meaning by itself but is not a complete attribute. The classic example is that a date breaks down into (year, month, day) fields, and a timestamp breaks down into (year, month, day, hour, minute, seconds). You get the fields out of temporal datatypes with a statement like EXTRACT(<field name> FROM <temporal expression>), or in SQL Server DATEPART. Unfortunately, if you’re going to split up a NFNF column, you have to do all the work yourself."

While it alludes to MVDBs inconsistency with the RDM, the article does not detail the relational principles they and SQL violate and uses the eufemism "non-first normal form (NFNF)" databases for what are, essentially, non-relational databases.

"Non-first normal form (NFNF) databases allow a column in a table to hold nested relations and break the rule about a column only containing scalar values drawn from a known domain. In addition to NFNF, these databases are also called 2NF, NF2, and ¬NF in the literature. Since they are not part of the ANSI/ISO standards, you will find different proprietary implementations and academic notations for their operations."

What is more, first, as we stressed so many times, tables and columns are partial depictions of database relations and attributes, but play no role in RDM. And second, 1NF does not mandate "attributes with only scalar values", this is one of the misunderstandings of what 1NF means debunked in my paper.

I am not going to bother with the rest of the article, but if you want to bother and see why the reader could not understand what the author was talking about, go ahead--it will make clear the superiority of 1NF ove NFNF on sheer simplicity grounds.

POSTS

Friday, October 13, 2023

EVERYBODY THINK THEY KNOW FIRST NORMAL FORM, BUT NOBODY DOES

09/17 DBDEBUNK REFRESH

No comments:

Post a Comment