Tuesday, April 26, 2016

Data Fundamentals for Analysts: Nested Facts and the (First) Normal Form

My April column @All Analytics.

A R-table with attributes defined only on simple domains takes a less convoluted form -- a normal form -- devoid of nesting. If R-tables are in the preferred normal form i.e., components meaningful to applications (here, employee attributes) are simple domains in their own right and a true RDBMS enforces value atomicity -- first order logic is sufficient. This imposes some limitations on the expressive power of data languages, but they are declarative and PDI and simplicity are preserved. A true RDBMS enforces atomicity via a data language that does not allow applications to access attribute components not explicitly defined on their own domains.

Read it all. (Please comment there, not here)

FYI: I have revised all three parts of the series on 1NF -- mainly refinements and clarifications.

Tuesday, April 19, 2016

First Normal Form in Theory and Practice Part 3

(Cont'd from Part 2)

UPDATE (4/25/16): Minor refinements and clarifications.
"Is this table in 1NF?" is a common question in database practice. On the other hand, "What problems are solved by splitting street addresses into individual columns?", or  
What's the best way to store an array in a relational database does not seem to evoke connections to 1NF. This reveals poor foundation knowledge.

Database Design Consistent with Data Use

The Information Principle (IP) mandates that all information in a relational database is represented explicitly and in exactly one way--as values of relation attributes defined on domains. We have seen that if domains are simple--have no meaningful components--FOPL is sufficient, relational data languages are declarative and support physical data independence (PDI). If the database designer defines all relation attributes meaningful to users on simple domains, the relation is in its normal form (1NF). 

If relations are not in 1NF, applications will require SOPL access to attributes that are implicit components of domains. Such subversion of the value atomicity defined into the simple domains by the designer essentially creates new domains and relations on the fly, "behind the back" of the DBMS, in violation of the IP. The implication is that designers should make sure that all entity properties of interest to users are represented by attributes defined on simple domains and not  implicit components of domains.

Sunday, April 17, 2016

This Week

1. What's wrong with this picture?

NoSQL database management systems give us the opportunity to store our data according to more than one data storage model, but our entity-relationship data modeling notations are stuck in SQL land. Is there any need to model schema-less databases, and is it even possible? --Theodore Hills, The Hybrid Data Model, Dataversity.net

Sunday, April 10, 2016

First Normal Form in Theory and Practice Part 2

(Cont'd from Part 1)

UPDATE (4/25/16): Minor refinements and clarifications.


In 1969 Codd explicitly cited second order predicate logic (SOPL) as a theoretical foundation for the RDM. Because it permits predicates over predicates (and queries over relations within relations), data languages based on SOPL are expressively more powerful than those based on first order predicate logic (FOPL). For example, SOPL languages give relational operations access to the components--tuples and attributes--of the "inner" relations--the values of relation-valued domains (RVD), which means that both outer and inner relations can be restricted/projected/joined/etc. within the same expression.