Tuesday, April 26, 2016

Data Fundamentals for Analysts: Nested Facts and the (First) Normal Form

My April column @All Analytics.

A R-table with attributes defined only on simple domains takes a less convoluted form -- a normal form -- devoid of nesting. If R-tables are in the preferred normal form i.e., components meaningful to applications (here, employee attributes) are simple domains in their own right and a true RDBMS enforces value atomicity -- first order logic is sufficient. This imposes some limitations on the expressive power of data languages, but they are declarative and PDI and simplicity are preserved. A true RDBMS enforces atomicity via a data language that does not allow applications to access attribute components not explicitly defined on their own domains.

Read it all. (Please comment there, not here)

FYI: I have revised all three parts of the series on 1NF -- mainly refinements and clarifications.









Tuesday, April 19, 2016

First Normal Form in Theory and Practice Part 3


Note: This is a 11/23/17 revision of Part 3 of a three-part series that replaced all of my previous posts on the subject (pages of which redirect here), in order to further tighten integration with the formalization and interpretation [1] of McGoveran's formalization and interpretation [1] of Codd's true RDM.

(Continued from Part 2)

 
"Is this table in 1NF?" is a common question in database practice. On the other hand, "What problems are solved by splitting street addresses into individual columns?", or  
What's the best way to store an array in a relational database does not seem to evoke associations with 1NF. This reveals poor foundation knowledge.


Part 1 introduced the poor understanding of 1NF and Part 2 provided a correct definition and explanation. Part 3 explains how 1NF can be enforced by the data language, which SQL does not.

Database Design Consistent with Intended Data Use


The Information Principle (IP) -- Rule 0 of the RDM [2] -- mandates all information in a relational database be represented in relations, explicitly and in exactly one way -- as values of attributes defined on domains. Database relations must be in 5NF -- defined in terms of attribute dependencies [3] -- which requires that they are in 1NF -- defined in terms of simple domains and value atomicity [4]. 1NF ensures that FOPL is sufficient for data sub-languages and they are declarative and decidable [5] and physical independence (PI) [6], 5NF guarantees logical validity, logical independence (LI) and semantic correctness [7].

No 1NF means no 5NF and no relations. Some information -- namely, attribute components meaningful to users -- is represented implicitly and, thus, hidden from the DBMS, in violation of the IP and the RDM. Application access to that information requires a SOPL based data language, with loss of the relational advantages.

One implication is that database designers should make sure that all object properties of interest to users are represented by attributes defined on simple domains and are not implicit in components thereof. Another implication is that it is impossible to ascertain whether a table is a R-table (i.e., if it displays a relation -- which must be in 1NF) by its sheer inspection. Going back to the table in Part 1,

=========================================
ID  NAME             ADDRESS
-----------------------------------------
1  Mark Tomers      56 Tomato Road
2  Fred Askalong    3277 Hadley Drive
3  May Anne Brice   225 Century Avenue
...
=========================================
(1) it has uniquely named columns and assuming that (2) ID represents the PK (3) no information is encoded in the order of rows and columns and (4) No values are missing, it is a R-table if and only if (5) NAME and ADDRESS are defined on simple domains and their components are not meaningful to users.

Note: Tables such as

===========================================
LAST    FIRST  PHONE1        PHONE2     
-------------------------------------------
Smith   John   303-123-4567        
Doe     Mary   303-456-9933  303-456-9944
Johnson Bill   303-987-6543
...
===========================================

are often given as examples of non-R-tables on 1NF grounds. But while the PHONE1, PHONE2 design is poor, depending on how the domains of NAME and ADDRESS are defined, it may not violate 1NF. If they are both simple and have no components meaningful to users, the relation is in 1NF, otherwise it isn't. It's actually the missing values (and possibly the lack of a PK) that disqualify the table. Although Codd supported missing value "marks" -- not to be confused with SQL NULLs -- that is now considered a mistake (for a relational solution to missing data, see [8]).
 

Enforcing 1NF


1NF has very limited meaning outside true RDBMSs. In SQL -- erroneously accepted in the industry as a relational data sub-language (it is neither relational, nor just a data sub-language), 1NF is a heuristic rule that is easy to treat inconsistently.

First, SQL DBMSs do not support true domains. Their built-in system data types (SDT) -- VARCHAR(n), CHAR(n), INTEGER, MONEY, DATE -- are essentially primitive domains with vendor-defined value ranges. Database designers cannot create custom domains, or derive such from SDTs.

Note: Primitive domains represent in the database abstract properties that are not observed directly, but are inferred from observed object properties. For example, people have names and addresses, from which a primitive domain VARCHAR(n), or more structured versions can be inferred, where the maximum n is vendor-defined.
Second, there is no CREATE DOMAIN SIMPLE option in SQL. Moreover, if the designer defines, for example, an attribute on a SDT, SQL has built-in functions

that can be used to subvert atomicity (e.g., YEAR() or MONTH()applicable to the primitive SDT DATE).

Note: According to McGoveran, such functions have two possible uses: (1) as a shorthand for a disjunction of values and (2) extraction of internal components to treat them as attributes in their own right.Both are useful, but we are sloppy about distinguishing between them. If (1) is intended, then it is consistent with 1NF.
"Consider the restriction name LIKE “"Jo%"”. If, for example, the permissible values of the NAME domain that begin with “Jo” are “John Smith” and “Joseph Jones”, this restriction can be understood as just shorthand for name = "”John Smith"” OR name = "“Joseph Jones”". This interpretation is reasonable even when we don'’t explicitly know all the permissible values of the NAME domain. If (2) is intended, expressions get more verbose. For example, consider the comparison name = "John Smith". If both first and last name are independently meaningful and 1NF is enforced, we have to convert NAME to (FIRST, LAST) and the comparison becomes first = "John" AND last = "Smith".

When 1NF is not enforced, if a FOPL-based data sub-language allows application access to components, it is tantamount to design changes that can have broad and unintended effects. If, for example, NAME should suddenly be treated in some relation as (FIRST LAST), NAME should be split into two attributes. But 
What does that mean for the functional dependency (FD) between the primary key (PK) and NAME?
What if it turns out that the FD was really LAST and the PK?
What if NAME were referenced by a foreign key (FK) in some relation representing employee children?" --David McGoveran

Single-valued Attributes?


C. J. Date argues that value atomicity is not precisely definable and that a relation is in 1NF if it has single-valued attributes (SVA) (i.e., every attribute has a single domain value in every tuple). Thus, the table,

=============================================
LAST     FIRST  PHONE
---------------------------------------------
Smith    John   303-123-4567
Doe      Mary   303-456-9933,303-456-9944
Johnson  Bill   303-987-6543
...
=============================================
is purportedly not a R-table by Date's definition because PHONE is defined on a multi-valued attribute. To be a R-table, it would have to be defined on a SVA -- a simple RVD, with sets as values treated as atomic (e.g., {303-456-9933,303-456-9944}) by the data language.

Note: In which case PHONE values would not be union-compatible and the relation would not be amenable to normalization.
 

But "single-valuedness" is not an inherent property of the data, as is implicit in Date's position. Regardless of the corresponding domain's data type defined by the designer (e.g., RVD values that are sets of phone numbers, or a VARCHAR(n) domain's values that are concatenations of phone numbers, or values that are commalists of phone numbers), whether or not PHONE is a SVA is a function of whether it is consistent with its intended use (i.e., whether it has meaningful components) and if it does, it is not single-valued. That's how Codd's 1NF is defined.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., Interpreting Codd: The 12 Rules.

[3] Pascal, F., Depends on the Dependencies: Normal Forms and the Conceptual-Logical Conflation.

[4] Pascal, F., Simple Domains and Value Atomicity.

[5] Pascal, F., The Interpretation and Representation of Database Relations V.1 (April 2017).

[6] Pascal, F., Physical Data Independence.

[7] Pascal, F., Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness.

[8] Pascal, F., The Final NULL in the Coffin: A Relational Solution to Missing Data.




Sunday, April 17, 2016

This Week

1. What's wrong with this picture?

NoSQL database management systems give us the opportunity to store our data according to more than one data storage model, but our entity-relationship data modeling notations are stuck in SQL land. Is there any need to model schema-less databases, and is it even possible? --Theodore Hills, The Hybrid Data Model, Dataversity.net

Sunday, April 10, 2016

First Normal Form in Theory and Practice Part 2


Note: This is a 11/23/17 revision of Part 2 of a three-part series that replaced all of my previous posts on the subject (pages of which redirect here), in order to further tighten integration with the McGoveran formalization and interpretation [1] of Codd's true RDM.
(Cont'd from Part 1)
 

Part 1 raised the issue of poor understanding of Codd's concept of a simple domain with atomic values underlying 1NF. In Part 2 I clarify Codd's definition of 1NF and its correct interpretation.

From SOPL to FOPL


In 1969 Codd explicitly cited second order predicate logic (SOPL) as a theoretical foundation for the RDM [2]. Because it permits predicates over predicates (and queries over relations within relations), SOPL-based data languages are expressively more powerful than those based on first order predicate logic (FOPL). For example, given relation-valued attributes (RVA) defined on relation-valued domains (RVD), they allow access to the RVA components -- tuples and attributes -- of the "inner" relations (i.e., the values of the RVAs): both outer and inner relations can be manipulated via data language (e.g., joined) expressions.

But, first, this means that the implied domain conversion, expansion and normalization need to be done automatically and on the fly, which might not be theoretically possible. For attributes defined on non-simple domains that, unlike RVAs, are not set-valued, a general approach to the problem is probably not even definable.

Second, recall that one of the fundamental motivations for the RDM was data independence, physical (PI) [3] -- insulation of queries and applications from physical changes -- and logical (LI), which require a declarative, decidable data language that expresses data manipulation and integrity, independently from physical implementation details.

"From the theory of formal languages and logic it is known that implementing a declarative data language (on computers) requires an algorithm that will correctly and automatically process any query whatsoever (corresponding to every syntactically correct logical formula), regardless of what data are in the database. The algorithm must evaluate the query in light of the data in the database and terminate with a proper answer. In the case of RDM, it should produce a relation containing zero or more tuples that make the query "true" -- no other termination is acceptable. If the data language has only the power of FOPL and (as is always the case in practice) the database has a finite size, such an algorithm for query evaluation can be found. But if it has the power of SOPL, expressions are possible that cannot be evaluated (for example, self-referencing expressions) and the formal language is then undecidable, an algorithm to implement a declarative query language is impossible and all hope of PI is lost." --David McGoveran
In other words, the choice for RDM was between 
  • More powerful SOPL, but no declarative, decidable data language and no PI and complexity;
  • Less powerful FOPL, but declarative, decidable data language, PI and simplicity;
a no-brainer. By 1970 Codd abandoned SOPL in favor of FOPL and stated that "the possibility of eliminating non-simple attributes [e.g., RVAs] appears worth investigating" [3].


Normal Form and Normalization, Further Normalization and 1NF


Although he did not explain, the evidence suggests that Codd realized that if the values of RVAs are union-compatible -- same number of attributes defined on same domains -- it is possible to "flatten" outer relations in a uniform way. In other words, relations with such RVAs have an alternative RVA-less form -- a normal form -- which is preferred, because if all relations are in it, FOPL is sufficient. He named the process of eliminating RVAs normalization and the resulting relations normalized [3].

The original single normal form was tied to the 1969-70 definition of join. After Codd re-defined join in 1970, analysis of attribute dependencies gave rise to multiple normal forms (1NF-5NF) and further normalization [4]. While the original normal form and the 1NF are not comparable (see Note in Part 1), the normalization method is elimination of non-simple RVAs [5].

For an intuitive understanding of 1NF, recall that tuples represent facts (formally, propositions) about real world objects, the truth of which has been asserted by some trusted authorized user. Every relational operation derives a new relation from database relations, the tuples of which represent facts that are logical implications of (inferences from) database facts. Relations with RVAs represent compound facts -- facts within facts -- which make manipulation and integrity expressions more complex. Relations that represent non-compound facts are in a "less convoluted" form -- the 1NF -- that avoids the complications. Thus, 1NF is required for the key advantages of the RDM:

  • FOPL-based declarative, decidable data sub-languages;
  • PI;
  • Simplicity;
to materialize. If databases are not designed accordingly, there are no relations, and the database is not relational, and normalization to repair the design.

Note very carefully that domains can be of arbitrary complexity, as long as they are defined as simple and have no meaningful components.
"Part of the genius of the RDM is that any complexity that is not meaningful to users can be encapsulated in domains -- their values can have ANY internal structure as long as it is "hidden" from relational operations -- that is the point of having user-definable domains. Thus, a domain/attribute can be defined as "set-valued" (SVD), or "tree-valued" (TVD), or "graph-valued" (GVD), or relation-valued, or whatever, as long as those values are treated as atomic by relational expressions in the data sub-language, which means that, for example and with few exceptions, no join/restrict/project/etc. is applied to a substring of a "text-valued" domain, a member of set belonging to a SVD, a leaf or subtree of a TVD, a node/arc/subgraph of a GVD and so on. Such violations of atomicity create new domains and, hence, new relations unknown to the DBMS, in violation of the IP." --David McGoveran
Unfortunately, this tends to induce designer laziness: often information  meaningful to users is left hidden from the DBMS in the internal structure of  non-simple attributes, in violation of the IP and RDM.

By 1979 Codd was effectively requiring all attributes to be defined on simple domains and focusing on 1NF relations. Attributes defined on non-simple RVDs were never prohibited by the RDM, but should generally be avoided without a strong specific justification and confidence that atomicity will not be subverted and undesirable non-relational consequences introduced.

Had Codd based RDM on SOPL, it is doubtful that we would have had even SQL -- we would have probably still been messing with prohibitively costly hierarchical and CODASYL (network) DBMSs. Sadly, SQL's poor relational fidelity is responsible for the regress back to those archaic products. But the evolution in Codd's thinking from from SOPL to FOPL in the context of industry's instinct to disregard theory has contributed to confusion.

"All data types NUMERIC, TEXT, DATE or even BLOB can be treated as atomic. It should be only sets and relations which are not." --What is the actual definition of First Normal Form (1NF)
Well, maybe, but only if the power of SOPL justifies its complications (the misconceptions that sets/relations are the only exception reveals that relational atomicity is not understood).
"... Like character strings, sets do have some internal structure [that's sometimes] convenient to ignore for certain purposes ... if character strings are atomic then sets must be so, too." --C. J. Date
Well, treating values of non-simple domains as atomic without taking into account components meaningful to users renders FOPL insufficient and incurs the integrity complications of SOPL, without the benefit of its expressive power.

1NF Defined


The last quote views atomicity as an intrinsic property of data to be "discovered", when, as we have seen in Part 1, it is a database design choice enshrined in domain definitions that should be consistent with data use by applications. So here is a precise 1NF definition:

A relation is in 1NF if all attributes are defined on simple domains that have no components meaningful to users.
If any is implicit and hidden from the DBMS in violation of the IP, the design is not consistent with the intended data use by applications and there is no 1NF and, therefore, no relation.

Note: As already mentioned, 1NF is insufficient -- a relation must be in 5NF by definition.

For 1NF enforcement, see Part 3.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Codd, E. F., Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks. IBM Research Report, San Jose, California RJ599 (1969).

[3] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970)

[4] Codd, E. F., Further Normalization of the Data Base Relational Model, IBM Research Report, San Jose, California RJ909 (1971).

[5] Codd, E. F., Normalized Data Base Structure: A Brief Tutorial, IBM Research Report, San Jose, California RJ935 (1971).

[6] Pascal, F., The Interpretation and Representation of Database Relations V.1 (April 2017).