Tuesday, April 19, 2016

First Normal Form (1NF) in Theory and Practice, Part 3 (UPDATED)


(Cont'd from Part 2)



UPDATE (4/25/16): Minor refinements and clarifications.
 
"Is this table in 1NF?" is a common question in database practice. On the other hand, "What problems are solved by splitting street addresses into individual columns?", or  
What's the best way to store an array in a relational database does not seem to evoke connections to 1NF. This reveals poor foundation knowledge.

Database Design Consistent with Data Use


The Information Principle (IP) mandates that all information in a relational database is represented explicitly and in exactly one way--as values of relation attributes defined on domains. We have seen that if domains are simple--have no meaningful components--FOPL is sufficient, relational data languages are declarative and support physical data independence (PDI). If the database designer defines all relation attributes meaningful to users on simple domains, the relation is in its normal form (1NF). 


If relations are not in 1NF, applications will require SOPL access to attributes that are implicit components of domains. Such subversion of the value atomicity defined into the simple domains by the designer essentially creates new domains and relations on the fly, "behind the back" of the DBMS, in violation of the IP. The implication is that designers should make sure that all entity properties of interest to users are represented by attributes defined on simple domains and not  implicit components of domains.


It is impossible, therefore, to ascertain whether a relation is in 1NF by sheer inspection of a table purported to picture it. Consider, for example
=========================================
 ID  NAME             ADDRESS
-----------------------------------------
  1  Mark Tomers      56 Tomato Road
  2  Fred Askalong    3277 Hadley Drive
  3  May Anne Brice   225 Century Avenue
   ...
=========================================
First, the normal form is a way to express relations, so does the table picture a relation i.e., is it a R-table? It is if and only if 
  • it is subject to at least one uniqueness constraint (i.e., has a key);
  • it has unique column names; 
  • it has no information encoded in the order of rows and columns;
  • it has no missing values.

Note: Tables such as
===========================================
 LAST    FIRST  PHONE1        PHONE2      
-------------------------------------------
 Smith   John   303-123-4567         
 Doe     Mary   303-456-9933  303-456-9944
 Johnson Bill   303-987-6543 
===========================================

are often given as examples of not being in 1NF. But a table with missing column values is not a R-table, because relations do not have missing attribute values. Even without the PHONE2 column, the table would be a R-table only if it had at least one key, that is, a uniqueness constraint were enforced on either (LAST,FIRST) or PHONE1. Although EFC supported the idea of representing missing data with special "marks"--not to be confused with SQL NULL's--that is now considered a mistake (for a relational solution to missing data, see The Final NULL in the Coffin). End Note

Second, what domains are NAME and ADDRESS defined on? If they are both simple and have no components that are individually meaningful to users, the relation is in 1NF, otherwise it isn't. Note that, like all normal forms, 1NF is established only with reference to the business model/rules.

 

Enforcing 1NF: RDBMS vs. SQL


1NF has very limited meaning outside true RDBMS's. In SQL 1NF is a heuristic rule that is easy to treat inconsistently. SQL tables that contain duplicate rows, NULL's and nameless columns are not relations. 
 
A true RDBMS would enforce 1NF designs. For all domains defined by the database designer as simple e.g.,
CREATE DOMAIN SIMPLE ...
the data language would not allow relational constraint and query expressions that reference  components thereof, in effect disallowing subversion of value atomicity and enforcing 1NF.

SQL does not support true (user defined) domains. Their built-in system data types (SDT) such as VARCHAR(n), CHAR(n), INTEGER, MONEY and DATE are essentially primitive domains with vendor-defined value ranges.  Primitive domains represent in the database abstract properties not directly observed, but inferred from observed entity properties. For example, people have names and addresses, from which an abstract property TEXT(n) can be inferred. 


Database designers cannot create custom domains, or derive such from SDT's. The SQL supported SDT's are simple, but if designers define NAME and ADDRESS on domains VARCHAR(x) and VARCHAR(y) and their components are meaningful to users, SQL DBMS's implement built-in language functions such as SUBSTRING(), or DAY(), YEAR(), MONTH() that subvert their atomicity

As McGoveran points out, such functions have two possible uses
"(1) as a shorthand for a disjunction of values and (2) extraction of internal components to treat them as attributes in their own right. Both are useful, but we are sloppy about distinguishing between them. If (1) is intended, then 1NF is consistent with this usage. For example, consider the restriction:
name LIKE “"Jo%"”
If, for example, the permissible values of the NAME domain that begin with “Jo” are “John Smith” and “Joseph Jones”, this restriction can be understood as just shorthand for:
name = "”John Smith"” OR name = "“Joseph Jones”"
This interpretation is reasonable even when we don'’t explicitly know all the permissible values of the NAME domain. If (2) is intended, expressions get more verbose. For example, consider the comparison:
name = "John Smith"
If both first and last name are independently meaningful and 1NF is enforced, we have to convert NAME to (FIRST, LAST) and the comparison becomes:
first = "John" AND last = "Smith" 
When 1NF is not enforced, if a FOPL-based data language allows application access to components, it is tantamount to design changes that can have broad and unintended effects. If, for example, NAME should suddenly be treated in some relation as (FIRST LAST), NAME should be split into two attributes. But
  • What does that mean for the functional dependency (FD) between the primary key (PK) and NAME?
  • What if it turns out that the FD was really LAST and the PK?
  • What if NAME were referenced by a foreign key (FK) in some relation representing employee children?
  

Single-valued attributes


C. J. Date (CJD) argues that value atomicity is not precisely definable and has redefined 1NF: a relation is in 1NF if it has single-valued attributes (SVA): every attribute has a single domain value in every tuple. Thus, tables such as

=============================================
 LAST     FIRST  PHONE
---------------------------------------------
 Smith    John   303-123-4567
 Doe      Mary   303-456-9933,303-456-9944
 Johnson  Bill   303-987-6543
=============================================

are sometimes given as examples of (pictures of) relations that are not in 1NF by CJD's definition, because PHONE is defined on a multi-valued attribute (MVA)--if it were a SVA, its values would be sets enclosed in {}.

But whatever domain it is defined on--a RVD whose values are relations with sets of phone numbers as tuples, or a VARCHAR(n) domain whose values are concatenations of phone numbers, or a domain whose values are commalists of phone numbers--PHONE will always have a single value of the corresponding type in every tuple. In other words, by CJD's definition all relations are always in 1NF by definition. That's due to viewing value atomicity as an intrinsic property of the data to be discovered. 

EFC's definition, on the other hand, views value atomicity as a database designer choice enshrined in the domain definition consistent with data use by applications--a precise definition.  The relation is in 1NF only if PHONE is defined on a simple domain that has no meaningful components.



No comments:

Post a Comment