Tuesday, March 1, 2016

First Normal Form in Theory and Practice Part 1

09/19/23: For the latest on this subject see: FIRST NORMAL FORM - A DEFINITIVE GUIDE


Note: This is a 11/23/17 revision of Part 1 of a three-part series that replaced all of my previous posts on the subject (pages of which redirect here), in order to further tighten integration with the McGoveran formalization and interpretation [1] of Codd's true RDM.

On the one hand:

"... there is no generally accepted definition of 1NF ... the word that you see most often is 'atomic'. It is common to say that a relation is in 1NF if all its attributes [sic] are atomic ... Does 1NF equate to “atomic attribute [values]”? ... what [do] people have in mind [when they claim] atomicity? ... the [meaning] behind definitions is that you should rarely need to extract information from a value of an attribute ... But that explains why one cannot decide, depending on theory only, whether a relation is in 1NF ... it is a habitual use of data that makes attributes atomic, not theory. No wonder, there is so much mess in theory about what 1NF should be."
--What is the actual definition of First Normal Form, Vertabelo.com

On another, according to a DBDebunk reader:
"Codd in 1969/70 (and RM V/2 20 years later) gave a precise, theory-based definition of "atomic" aka "simple" aka "non-decomposable" (later aka non-"compound" aka non-"structured"): not relation-valued. And he gave a precise definition of "normalized" (1NF): relations free of relation-valued-domains (RVD)!"
 All sorts of other definitions proliferate, for example:
 "First normal form enforces these criteria:
- Eliminate repeating groups in individual tables.
- Create a separate table for each set of related data.
-Identify each set of related data with a primary key."
--First normal form, Wikipedia
Note: A relational databases consists of relations that can be visualized as R-tables. Normal forms are a property of relations, not R-tables -- a "R-table in 1NF" is shorthand for consistency with the underlying relation. 
The redefinition of join in 1970 substituted a single normal form with five (1NF-5NF). It is commonly accepted in the industry that
  • 1NF: simple domains with values that are treated as atomic by the data sublanguage;
  • Relations are (at least in) 1NF by definition (otherwise they are not relations);
  • The original normal form is not the same as the current 1NF, except to say that the former is to the original join what the 5NF is to the current join;
  • For logical validity, semantic correctness and data independence 1NF is insufficient -- relations must also be in 5NF by definition.

Simple Domains with Atomic Values

"So far, we have discussed examples of relations which are defined on simple domains -- domains whose elements are atomic (non-decomposable) values. Non-atomic values can be discussed within the relational framework. Thus, some domains may have relations as elements."

--E. F. Codd (1969)
If we add the 1NF requirement, then according to Codd's definition, a relation has
  • Unique, unordered n-valued tuples with atomic, non-missing values;
  • Uniquely named, unordered attributes defined on simple domains;
Also keep in mind the Information Principle -- Codd's Rule 1 -- that mandates all information in a database be represented explicitly and in exactly one way -- as values of attributes defined on domains.

Consider now the following table -- is it a R-table (i.e., does it visualize a relation, which must be in at least 1NF)?

 ID  NAME                 ADDRESS
  1  Mark Tomers      56 Tomato Road
  2  Fred Askalong    3277 Hadley Drive
  3  May Anne Brice   225 Century Avenue
The answer depends on how the domains from which NAME and ADDRESS draw their values are defined by the database designer, which should be consistent with the intended use of the data by applications (and cannot be determined from just inspecting the table). Are last name, first name, street  individually meaningful to users, or not?
(1) If yes, the IP requires that each should be an attribute defined on a simple domain;
(2) If not, applications will not access them, so NAME and ADDRESS can each be defined on simple domains with (first,last) and (#,street) combinations as atomic values;
If not sure, (1) is the cautious option.

If the designer, based on user requirements analysis, opts for (2) and any application needs to access attribute components, the domain definitions are inconsistent with the intended use. NAME and ADDRESS are defined on domains, but their components are not, so allowing such subversion of atomicity violates the IP -- information meaningful to applications is represented implicitly and is, thus, "hidden" from the DBMS  and, therefore, the RDM. In fact, as practitioners should know, SQL has functionality that can be used to subvert atomicity -- in effect on-the-fly creation of new domains and relations "behind the back" of the DBMS.

"For a while scientists thought that atoms were literally indivisible. Any representation of them would then be "atomic" (no pun intended). Then we discovered electrons and the nucleus and they were understood as atomic, at least until we discovered protons and neutrons. Now we talk about quarks and so on. Clearly, atomicity is not an intrinsic concept and it must be defined in the context of the relevant knowledge —and how the value will be used."

--David McGoveran
In other words, atomicity is a database design choice for consistency with the intended data use. 

Repeating Groups

Before I continue, let me get this out of the way.
"Repeating groups (RG) originate with physical records in application files having no fixed number of fields (as in COBOL). The number of fields could be indicated by a special field count in the "record header", by a special “end of group” indicator value in the last field, or -- worse -- recorded only in the code that accessed those records. This same method was used at the field level: some fields -- like one-dimensional arrays, or lists -- contained multiple values or "sub-fields" and determining where the list ended could be tricky."
--David McGoveran
Some such structure was replicated in hierarchic databases that preceded the RDM and SQL: a physical “parent” segment pointed to a variable number of “child” segments. These, in turn, might consist of a variable number of repeated fields -- a group -- for every child segment of the corresponding parent segment; one of those fields indicated the number of the other fields. In other words, each parent segment in the hierarchy was linked via physical pointers to a set of child segments and those pointers were exposed to applications and queries and had to be explicitly navigated from the parent through the hierarchy to access the children.

The RDM introduced physical independence (PI), in part by replacing physical segments and pointer navigation with logical tuples in relations and joins by values. RGs have no place in a relational database and, therefore, they cannot be eliminated from them. But in conversions of hierarchic databases to SQL there was an "instinct" to emulate RGs in those databases with relation-valued attributes (RVA). So, normalization to 1NF eliminates RVAs, not RGs. 

(Continued in Part 2)


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

No comments:

Post a Comment

View My Stats