Tuesday, April 19, 2016

First Normal Form in Theory and Practice Part 3

Note: This is a 11/23/17 revision of Part 3 of a three-part series that replaced all of my previous posts on the subject (pages of which redirect here), in order to further tighten integration with the formalization and interpretation [1] of McGoveran's formalization and interpretation [1] of Codd's true RDM.

(Continued from Part 2)

"Is this table in 1NF?" is a common question in database practice. On the other hand, "What problems are solved by splitting street addresses into individual columns?", or  
What's the best way to store an array in a relational database does not seem to evoke associations with 1NF. This reveals poor foundation knowledge.

Part 1 introduced the poor understanding of 1NF and Part 2 provided a correct definition and explanation. Part 3 explains how 1NF can be enforced by the data language, which SQL does not.

Database Design Consistent with Intended Data Use

The Information Principle (IP) -- Rule 0 of the RDM [2] -- mandates all information in a relational database be represented in relations, explicitly and in exactly one way -- as values of attributes defined on domains. Database relations must be in 5NF -- defined in terms of attribute dependencies [3] -- which requires that they are in 1NF -- defined in terms of simple domains and value atomicity [4]. 1NF ensures that FOPL is sufficient for data sub-languages and they are declarative and decidable [5] and physical independence (PI) [6], 5NF guarantees logical validity, logical independence (LI) and semantic correctness [7].

No 1NF means no 5NF and no relations. Some information -- namely, attribute components meaningful to users -- is represented implicitly and, thus, hidden from the DBMS, in violation of the IP and the RDM. Application access to that information requires a SOPL based data language, with loss of the relational advantages.

One implication is that database designers should make sure that all object properties of interest to users are represented by attributes defined on simple domains and are not implicit in components thereof. Another implication is that it is impossible to ascertain whether a table is a R-table (i.e., if it displays a relation -- which must be in 1NF) by its sheer inspection. Going back to the table in Part 1,

ID  NAME             ADDRESS
1  Mark Tomers      56 Tomato Road
2  Fred Askalong    3277 Hadley Drive
3  May Anne Brice   225 Century Avenue
(1) it has uniquely named columns and assuming that (2) ID represents the PK (3) no information is encoded in the order of rows and columns and (4) No values are missing, it is a R-table if and only if (5) NAME and ADDRESS are defined on simple domains and their components are not meaningful to users.

Note: Tables such as

LAST    FIRST  PHONE1        PHONE2     
Smith   John   303-123-4567        
Doe     Mary   303-456-9933  303-456-9944
Johnson Bill   303-987-6543

are often given as examples of non-R-tables on 1NF grounds. But while the PHONE1, PHONE2 design is poor, depending on how the domains of NAME and ADDRESS are defined, it may not violate 1NF. If they are both simple and have no components meaningful to users, the relation is in 1NF, otherwise it isn't. It's actually the missing values (and possibly the lack of a PK) that disqualify the table. Although Codd supported missing value "marks" -- not to be confused with SQL NULLs -- that is now considered a mistake (for a relational solution to missing data, see [8]).

Enforcing 1NF

1NF has very limited meaning outside true RDBMSs. In SQL -- erroneously accepted in the industry as a relational data sub-language (it is neither relational, nor just a data sub-language), 1NF is a heuristic rule that is easy to treat inconsistently.

First, SQL DBMSs do not support true domains. Their built-in system data types (SDT) -- VARCHAR(n), CHAR(n), INTEGER, MONEY, DATE -- are essentially primitive domains with vendor-defined value ranges. Database designers cannot create custom domains, or derive such from SDTs.

Note: Primitive domains represent in the database abstract properties that are not observed directly, but are inferred from observed object properties. For example, people have names and addresses, from which a primitive domain VARCHAR(n), or more structured versions can be inferred, where the maximum n is vendor-defined.
Second, there is no CREATE DOMAIN SIMPLE option in SQL. Moreover, if the designer defines, for example, an attribute on a SDT, SQL has built-in functions

that can be used to subvert atomicity (e.g., YEAR() or MONTH()applicable to the primitive SDT DATE).

Note: According to McGoveran, such functions have two possible uses: (1) as a shorthand for a disjunction of values and (2) extraction of internal components to treat them as attributes in their own right.Both are useful, but we are sloppy about distinguishing between them. If (1) is intended, then it is consistent with 1NF.
"Consider the restriction name LIKE “"Jo%"”. If, for example, the permissible values of the NAME domain that begin with “Jo” are “John Smith” and “Joseph Jones”, this restriction can be understood as just shorthand for name = "”John Smith"” OR name = "“Joseph Jones”". This interpretation is reasonable even when we don'’t explicitly know all the permissible values of the NAME domain. If (2) is intended, expressions get more verbose. For example, consider the comparison name = "John Smith". If both first and last name are independently meaningful and 1NF is enforced, we have to convert NAME to (FIRST, LAST) and the comparison becomes first = "John" AND last = "Smith".

When 1NF is not enforced, if a FOPL-based data sub-language allows application access to components, it is tantamount to design changes that can have broad and unintended effects. If, for example, NAME should suddenly be treated in some relation as (FIRST LAST), NAME should be split into two attributes. But 
What does that mean for the functional dependency (FD) between the primary key (PK) and NAME?
What if it turns out that the FD was really LAST and the PK?
What if NAME were referenced by a foreign key (FK) in some relation representing employee children?" --David McGoveran

Single-valued Attributes?

C. J. Date argues that value atomicity is not precisely definable and that a relation is in 1NF if it has single-valued attributes (SVA) (i.e., every attribute has a single domain value in every tuple). Thus, the table,

Smith    John   303-123-4567
Doe      Mary   303-456-9933,303-456-9944
Johnson  Bill   303-987-6543
is purportedly not a R-table by Date's definition because PHONE is defined on a multi-valued attribute. To be a R-table, it would have to be defined on a SVA -- a simple RVD, with sets as values treated as atomic (e.g., {303-456-9933,303-456-9944}) by the data language.

Note: In which case PHONE values would not be union-compatible and the relation would not be amenable to normalization.

But "single-valuedness" is not an inherent property of the data, as is implicit in Date's position. Regardless of the corresponding domain's data type defined by the designer (e.g., RVD values that are sets of phone numbers, or a VARCHAR(n) domain's values that are concatenations of phone numbers, or values that are commalists of phone numbers), whether or not PHONE is a SVA is a function of whether it is consistent with its intended use (i.e., whether it has meaningful components) and if it does, it is not single-valued. That's how Codd's 1NF is defined.


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., Interpreting Codd: The 12 Rules.

[3] Pascal, F., Depends on the Dependencies: Normal Forms and the Conceptual-Logical Conflation.

[4] Pascal, F., Simple Domains and Value Atomicity.

[5] Pascal, F., The Interpretation and Representation of Database Relations V.1 (April 2017).

[6] Pascal, F., Physical Data Independence.

[7] Pascal, F., Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness.

[8] Pascal, F., The Final NULL in the Coffin: A Relational Solution to Missing Data.

No comments:

Post a Comment