Tuesday, March 1, 2016

First Normal Form (1NF) in Theory and Practice (UPDATED), Part 1

This series is a full rewrite of my last post on the normal form (1NF). It replaces all the previous posts (redirected here) and all my writings on the subject. I would like to thank David McGoveran for reviewing drafts, offering important comments and sharing some details on the genesis and history of the 1NF and Codd's thinking on the subject.

UPDATE (4/25/16): Minor refinements and clarifications.

First Normal Form Misunderstood

On the one hand,
"... there is no generally accepted definition of 1NF ... the word that you see most often is 'atomic'. It is common to say that a relation is in 1NF if all its attributes [sic] are atomic ... Does 1NF equate to “atomic attribute [values]”? ... what [do] people have in mind [when they claim] atomicity? ... the [meaning] behind definitions is that you should rarely need to extract information from a value of an attribute ... But that explains why one cannot decide, depending on theory only, whether a relation is in 1NF ... it is a habitual use of data that makes attributes atomic, not theory. No wonder, there is so much mess in theory about what 1NF should be." --What is the actual definition of First Normal Form (1NF)

On the other,
"Codd in 1969/70 (and RM V/2 20 years later) gave a precise, theory-based definition of "atomic" aka "simple" aka "non-decomposable" (later aka non-"compound" aka non-"structured"): not relation-valued. And he gave a precise definition of "normalized" (1NF): relations free of relation-valued-domains (RVD)!" --DBDebunk reader

The definitions are theoretically grounded, but were not well explained and theory is disregarded in the industry anyway, so they are misunderstood. For example, an old Wikipedia definition was:
"... a property of a table in a relational database with the domain of each attribute
  • contains no repeating groups, only atomic values;
  • every attribute has a single value from the corresponding domain in every row."
The current page is somewhat better, but still includes
"First normal form enforces these criteria:
  • Eliminate repeating groups in individual tables.
  • Create a separate table for each set of related data.
  • Identify each set of related data with a primary key." --First normal form
Note: A relational databases consists of relations, not R-tables. It's relations that are either in their normal form (later 1NF), or not. R-tables, not any tables, are one way to "picture" relations--visualize them on some physical medium, paper, screen and so on. A "R-table in 1NF" is a shorthand for "the relation pictured by the R-table is in its normal form".

Repeating Groups

Repeating groups (RG) originate with physical records having no fixed number of fields (as in COBOL). The number of fields could be indicated by a special field count in the "record header", by a special “end of group” indicator value in the last field, or--worse--recorded only in the code that accessed those records. This same method was used at the field level: some fields--like one-dimensional arrays, or lists--contained multiple values or "sub-fields" and determining where the list ended could be tricky. --David McGoveran

In hierarchical databases that preceded the RDM and SQL, a “parent” segment pointed physically to a variable number of “child” segments. These in turn might consist of a variable number of repeated fields--a group--for every child segment of the corresponding parent segment; one of those fields indicated the number of the other fields. In other words, each parent segment in the hierarchy was linked via physical pointers to a set of child segments and those pointers were exposed to applications and queries and had to be explicitly navigated from the parent through the hierarchy to access the children.

The RDM introduced physical data independence (PDI), in part by replacing physical segments and pointer navigation with logical tuples in relations and joins by value. The core Information Principle--Rule 0 of the famous E. F. Codd (EFC) rules--mandates that all information in a relational database be represented explicitly and in exactly one way: as domain values in tuples in relations. Thus 

  • relations are the exclusive logical data structure in relational databases;
  • tuples are sets of attribute values drawn from domains.
Since RG'’s are not logical, let alone relations, 1NF has nothing to do with them.

Value Atomicity: Simple Domains as a Database Design Choice

"For a while scientists thought that atoms were literally indivisible. Any representation of them would then be "atomic" (no pun intended). Then we discovered electrons and the nucleus and they were understood as atomic, at least until we discovered protons and neutrons. Now we talk about quarks ... and so on. Clearly, atomicity is not an intrinsic concept and it must be defined in the context of the relevant knowledge—and how the value will be used." --David McGoveran
When E. F. Codd (EFC) wrote:
"So far, we have discussed examples of relations which are defined on simple domains--domains whose elements are atomic (non-decomposable) values. Non-atomic values can be discussed within the relational framework. Thus, some domains may have relations as elements."
in his initial 1969 paper, he distinguished between simple domains, defined to the DBMS as having no components that are meaningful (of interest) to users and, therefore, there is no need for applications to access them and non-simple domains defined as having components. EFC gave several examples of non-simple domains, with special attention to relation-valued domains (RVD): domains that have relations as values, with tuples and attributes as components.

Consider the table

 ID  NAME             ADDRESS
  1  Mark Tomers      56 Tomato Road
  2  Fred Askalong    3277 Hadley Drive
  3  May Anne Brice   225 Century Avenue
If it is properly constrained as a R-table (see Business Modeling for Database Design)--so it pictures a relation. The database designer can define NAME:
  • on a simple domain with combinations of first and last name as atomic values;  
  • on a non-simple domain with values that have first and last name as components.
The choice should be consistent with the use of NAME by intended applications i.e., if first name and last name are meaningful entity properties for users, they should be represented as explicit attributes defined on simple domains, not as implicit components of non-simple domains.

The first choice informs the DBMS that insofar as applications are concerned, they will not invoke query or integrity relational expressions in the data language that refer to value components, the second choice that they may. If the designer chooses the simple domain and applications try to access components, they are subverting the value atomicity of the domain (as defined by the database designer to the DBMS) and the relation is in a form that is inconsistent with its intended use. If applications need access to the components, then in the logical design those values belong to an attribute defined on a domain declared to the DBMS. In fact, implicit components violate the IP: if name components are meaningful to applications, but this information is not represented explicitly as attribute values drawn from  a domain, but is "hidden" from the DBMS into the NAME domain. To comply with the IP, the designer should create two domains, LAST and FIRST and define corresponding attributes on them.

Note very carefully, therefore, that value atomicity is not an intrinsic property of data. Rather, it is a specific domain definition choice by the database designer consistent with the intended data use: either first name and last name are not meaningful individually to users, in which case NAME is a simple domain with atomic values; or they are, in which case FNAME and LNAME are simple domains with atomic values.

(Cont'd in Part 2.)

No comments:

Post a Comment