Sunday, April 10, 2016

First Normal Form (1NF) in Theory and Practice, Part 2 (UPDATED)

(Cont'd from Part 1)

UPDATE (4/25/16): Minor refinements and clarifications.


In 1969 Codd explicitly cited second order predicate logic (SOPL) as a theoretical foundation for the RDM. Because it permits predicates over predicates (and queries over relations within relations), data languages based on SOPL are expressively more powerful than those based on first order predicate logic (FOPL). For example, SOPL languages give relational operations access to the components--tuples and attributes--of the "inner" relations--the values of relation-valued domains (RVD), which means that both outer and inner relations can be restricted/projected/joined/etc. within the same expression. 

But, first, this means that the implied domain conversion, expansion and normalization need to be done automatically and on the fly, which might not be theoretically possible. For non-simple domains that, unlike RVD's, are not set-valued, a general approach to the problem is probably not even definable.

Second, recall that one of the fundamental motivations for the RDM was physical data independence (PDI): queries and applications should not have to be rewritten just because some change is made to physical storage (memory, disk, screen). PDI requires a declarative query language that expresses logical data relationships and values in queries/constraints, while hiding physical implementation details.

"From the theory of formal languages and logic it is known that implementing a declarative data language (on computers) requires an algorithm that will correctly and automatically process any query whatsoever (corresponding to every syntactically correct logical formula), regardless of what data are in the database. The algorithm must evaluate the query in light of the data in the database and terminate with a proper answer. In the case of RDM, it should produce a relation containing zero or more tuples that make the query "true"--no other termination is acceptable.

If the data language has only the power of FOPL and (as is always the case in practice) the database has a finite size, such an algorithm for query evaluation can be found. But if it has the power of SOPL, expressions are possible that cannot be evaluated (for example, self-referencing expressions) and the formal language is then undecidable, an algorithm to implement a declarative query language is impossible and all hope of PDI is lost." --David McGoveran
In other words, the choice for RDM is between  
  1. More powerful SOPL, but complexity, undecidability, no declarative data language and no PDI;
  2. Less powerful FOPL, but simplicity, decidability, declarative data language and PDI.
which should be obvious. By 1970 Codd abandoned SOPL in favor of FOPL and stated that "the possibility of eliminating non-simple domains [such as RVD's] appears worth investigating".

The Normal Form

Although he did not explain, the evidence suggests that Codd realized that union-compatible inner relations (RVD values)--same number of attributes defined on same domains--allow the "flattening" of outer relations in a uniform way. In other words, relations with such RVD's have an alternative RVDless form--a normal form. If all relations are in the "preferred" normal form, FOPL is sufficient. He named the process of eliminating non-simple domains (including RVD's) normalization and the resulting relations normalized.

Note: Later, the analysis of attribute dependencies gave rise to further normalization and the normal form became first normal form (1NF). You 

  • normalize to 1NF by eliminating all non-simple domains from relations; 
  • further normalize to 5NF a relation that is in 1NF, but "bundles" multiple entity classes--is in 2NF-4NF--by separating the classes each to its own relation.

For an intuitive understanding of the normal form, recall that tuples in the database represent facts (i.e., logical propositions, or declarative sentences) about real world entities, whose truth is asserted by some trusted authority. Every relational operation (query) derives a new relation from database relations, whose tuples represent facts that are logical implications of the database facts. Relations with RVD's represent c
ompound facts--facts within facts--which make constraints and queries much more complex and their formulation prone to error. Relations that represent non-compound facts are in a less "convoluted" form--their  normal form--that avoid these complications, so that SOPL becomes unnecessary.

Thus, for key FOPL advantages of the RDM--declarative languages, PDI and simplicity--to materialize, non-simple domains must be treated as atomic by constraint and query relational expressions in the data language.

Part of the genius of the RDM is that any complexity that is not meaningful to users can be encapsulated in domains--their values can have ANY internal structure as long as it is "hidden" from relational operations--that is the point of having user definable domains. Thus, a domain can be defined as "set-valued" (SVD), or "tree-valued" (TVD), or "graph-valued" (GVD), or relation-valued, or whatever, as long as those values are treated as atomic by constraint and query relational expressions in the data language, which means that, for example and with few exceptions, no join/restrict/project/etc. is applied to a substring of a "text-valued" domain, a member of set belonging to a SVD, a leaf or subtree of a TVD, a node/arc/subgraph of a GVD and so on. Such violations of atomicity create new domains and, hence, new relations unknown to the DBMS, in violation of the IP. --David McGoveran
Unfortunately, this tends to induce designer laziness: often information that is meaningful to users is left hidden in the internal structure of the values of non-simple domains and inaccessible to applications via FOPL. So by 1979 EFC was effectively requiring all domains to be simple and focusing on normalized relations. Non-simple domains (including RVD's) were never prohibited by the RDM, but should generally be avoided without a strong specific justification and confidence that atomicity will not be subverted and undesirable non-relational consequences introduced.

Had Codd based RDM on SOPL, it is doubtful that we would have had even SQL. We would probably still be messing with prohibitively costly hierarchical and CODASYL DBMS's. Sadly, SQL's poor relational fidelity has been responsible for the regress back to those archaic products. But the evolution in EFC's thinking from from SOPL to FOPL in the context of industry's instinct to disregard theory has contributed to confusion.

"All data types NUMERIC, TEXT, DATE or even BLOB can be treated as atomic. It should be only sets and relations which are not." --What is the actual definition of First Normal Form (1NF)
Well, maybe, but only if the power of SOPL justifies its complications (the notion that sets and relations are the only exceptions reveals that relational atomicity is not understood).
"... Like character strings, sets do have some internal structure [that's sometimes] convenient to ignore for certain purposes ... if character strings are atomic then sets must be so, too." --C. J. Date
Well, treating values of non-simple domains as atomic without taking into account meaningful internal components hidden in them incurs the integrity complications of SOPL, without the benefit of its manipulative power.

1NF Defined

The last quote views atomicity as an intrinsic property of data to be "discovered" when, as we have seen in Part 1, it is a database design choice enshrined in domain definitions, which should ensure consistency with data use by applications. So here is a precise 1NF definition:

A relation is in 1NF if all entity properties that are meaningful to users are represented by attributes defined on simple domains.
In other words, if some meaningful attribute(s) is(are) "hidden" as implicit component(s) of domain(s), the relation is not in 1NF and its design is not consistent with intended data use.

For how to assess if given relations are normalized and how a true RDBMS enforces 1NF, stay tuned.

(Cont'd in Part 3)

No comments:

Post a Comment