Sunday, April 10, 2016

First Normal Form in Theory and Practice Part 2



09/19/23: For the latest on this subject see: FIRST NORMAL FORM - A DEFINITIVE GUIDE


Note: This is a 11/23/17 revision of Part 2 of a three-part series that replaced all of my previous posts on the subject (pages of which redirect here), in order to further tighten integration with the McGoveran formalization and interpretation [1] of Codd's true RDM.
(Cont'd from Part 1)
 

Part 1 raised the issue of poor understanding of Codd's concept of a simple domain with atomic values underlying 1NF. In Part 2 I clarify Codd's definition of 1NF and its correct interpretation.


From SOPL to FOPL


In 1969 Codd explicitly cited second order predicate logic (SOPL) as a theoretical foundation for the RDM [2]. Because it permits predicates over predicates (and queries over relations within relations), SOPL-based data languages are expressively more powerful than those based on first order predicate logic (FOPL). For example, given relation-valued attributes (RVA) defined on relation-valued domains (RVD), they allow access to the RVA components -- tuples and attributes -- of the "inner" relations (i.e., the values of the RVAs): both outer and inner relations can be manipulated via data language (e.g., joined) expressions.

But, first, this means that the implied domain conversion, expansion and normalization need to be done automatically and on the fly, which might not be theoretically possible. For attributes defined on non-simple domains that, unlike RVAs, are not set-valued, a general approach to the problem is probably not even definable.

Second, recall that one of the fundamental motivations for the RDM was data independence, physical (PI) [3] -- insulation of queries and applications from physical changes -- and logical (LI), which require a declarative, decidable data language that expresses data manipulation and integrity, independently from physical implementation details.

"From the theory of formal languages and logic it is known that implementing a declarative data language (on computers) requires an algorithm that will correctly and automatically process any query whatsoever (corresponding to every syntactically correct logical formula), regardless of what data are in the database. The algorithm must evaluate the query in light of the data in the database and terminate with a proper answer. In the case of RDM, it should produce a relation containing zero or more tuples that make the query "true" -- no other termination is acceptable. If the data language has only the power of FOPL and (as is always the case in practice) the database has a finite size, such an algorithm for query evaluation can be found. But if it has the power of SOPL, expressions are possible that cannot be evaluated (for example, self-referencing expressions) and the formal language is then undecidable, an algorithm to implement a declarative query language is impossible and all hope of PI is lost." --David McGoveran
In other words, the choice for RDM was between 
  • More powerful SOPL, but no declarative, decidable data language and no PI and complexity;
  • Less powerful FOPL, but declarative, decidable data language, PI and simplicity;
a no-brainer. By 1970 Codd abandoned SOPL in favor of FOPL and stated that "the possibility of eliminating non-simple attributes [e.g., RVAs] appears worth investigating" [3].

Normal Form and Normalization, Further Normalization and 1NF


Although he did not explain, the evidence suggests that Codd realized that if the values of RVAs are union-compatible -- same number of attributes defined on same domains -- it is possible to "flatten" outer relations in a uniform way. In other words, relations with such RVAs have an alternative RVA-less form -- a normal form -- which is preferred, because if all relations are in it, FOPL is sufficient. He named the process of eliminating RVAs normalization and the resulting relations normalized [3].

The original single normal form was tied to the 1969-70 definition of join. After Codd re-defined join in 1970, analysis of attribute dependencies gave rise to multiple normal forms (1NF-5NF) and further normalization [4]. While the original normal form and the 1NF are not comparable (see Note in Part 1), the normalization method is elimination of non-simple RVAs [5].

For an intuitive understanding of 1NF, recall that tuples represent facts (formally, propositions) about real world objects, the truth of which has been asserted by some trusted authorized user. Every relational operation derives a new relation from database relations, the tuples of which represent facts that are logical implications of (inferences from) database facts. Relations with RVAs represent compound facts -- facts within facts -- which make manipulation and integrity expressions more complex. Relations that represent non-compound facts are in a "less convoluted" form -- the 1NF -- that avoids the complications. Thus, 1NF is required for the key advantages of the RDM:

  • FOPL-based declarative, decidable data sub-languages;
  • PI;
  • Simplicity;
to materialize. If databases are not designed accordingly, there are no relations, and the database is not relational, and normalization to repair the design.

Note very carefully that domains can be of arbitrary complexity, as long as they are defined as simple and have no meaningful components.
"Part of the genius of the RDM is that any complexity that is not meaningful to users can be encapsulated in domains -- their values can have ANY internal structure as long as it is "hidden" from relational operations -- that is the point of having user-definable domains. Thus, a domain/attribute can be defined as "set-valued" (SVD), or "tree-valued" (TVD), or "graph-valued" (GVD), or relation-valued, or whatever, as long as those values are treated as atomic by relational expressions in the data sub-language, which means that, for example and with few exceptions, no join/restrict/project/etc. is applied to a substring of a "text-valued" domain, a member of set belonging to a SVD, a leaf or subtree of a TVD, a node/arc/subgraph of a GVD and so on. Such violations of atomicity create new domains and, hence, new relations unknown to the DBMS, in violation of the IP." --David McGoveran
Unfortunately, this tends to induce designer laziness: often information  meaningful to users is left hidden from the DBMS in the internal structure of  non-simple attributes, in violation of the IP and RDM.

By 1979 Codd was effectively requiring all attributes to be defined on simple domains and focusing on 1NF relations. Attributes defined on non-simple RVDs were never prohibited by the RDM, but should generally be avoided without a strong specific justification and confidence that atomicity will not be subverted and undesirable non-relational consequences introduced.

Had Codd based RDM on SOPL, it is doubtful that we would have had even SQL -- we would have probably still been messing with prohibitively costly hierarchical and CODASYL (network) DBMSs. Sadly, SQL's poor relational fidelity is responsible for the regress back to those archaic products. But the evolution in Codd's thinking from from SOPL to FOPL in the context of industry's instinct to disregard theory has contributed to confusion.

"All data types NUMERIC, TEXT, DATE or even BLOB can be treated as atomic. It should be only sets and relations which are not." --What is the actual definition of First Normal Form (1NF)
Well, maybe, but only if the power of SOPL justifies its complications (the misconceptions that sets/relations are the only exception reveals that relational atomicity is not understood).
"... Like character strings, sets do have some internal structure [that's sometimes] convenient to ignore for certain purposes ... if character strings are atomic then sets must be so, too." --C. J. Date
Well, treating values of non-simple domains as atomic without taking into account components meaningful to users renders FOPL insufficient and incurs the integrity complications of SOPL, without the benefit of its expressive power.

1NF Defined


The last quote views atomicity as an intrinsic property of data to be "discovered", when, as we have seen in Part 1, it is a database design choice enshrined in domain definitions that should be consistent with data use by applications. So here is a precise 1NF definition:

A relation is in 1NF if all attributes are defined on simple domains that have no components meaningful to users.
If any is implicit and hidden from the DBMS in violation of the IP, the design is not consistent with the intended data use by applications and there is no 1NF and, therefore, no relation.

Note: As already mentioned, 1NF is insufficient -- a relation must be in 5NF by definition.

For 1NF enforcement, see Part 3.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Codd, E. F., Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks. IBM Research Report, San Jose, California RJ599 (1969).

[3] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970)

[4] Codd, E. F., Further Normalization of the Data Base Relational Model, IBM Research Report, San Jose, California RJ909 (1971).

[5] Codd, E. F., Normalized Data Base Structure: A Brief Tutorial, IBM Research Report, San Jose, California RJ935 (1971).

[6] Pascal, F., The Interpretation and Representation of Database Relations V.1 (April 2017).





No comments:

Post a Comment

View My Stats