Tuesday, April 26, 2016

Data Fundamentals for Analysts: Nested Facts and the (First) Normal Form

My April column @All Analytics:

A R-table with attributes defined only on simple domains takes a less convoluted form -- a normal form -- devoid of nesting. If R-tables are in the preferred normal form i.e., components meaningful to applications (here, employee attributes) are simple domains in their own right and a true RDBMS enforces value atomicity -- first order logic is sufficient. This imposes some limitations on the expressive power of data languages, but they are declarative and PDI and simplicity are preserved. A true RDBMS enforces atomicity via a data language that does not allow applications to access attribute components not explicitly defined on their own domains.

Read it all (and comment there, please).

I have revised all three parts of the series on 1NF -- mainly refinements and clarifications.

Tuesday, April 19, 2016

First Normal Form (1NF) in Theory and Practice, Part 3 (UPDATED)

(Cont'd from Part 2)

UPDATE (4/25/16): Minor refinements and clarifications.
"Is this table in 1NF?" is a common question in database practice. On the other hand, "What problems are solved by splitting street addresses into individual columns?", or  
What's the best way to store an array in a relational database does not seem to evoke connections to 1NF. This reveals poor foundation knowledge.

Database Design Consistent with Data Use

The Information Principle (IP) mandates that all information in a relational database is represented explicitly and in exactly one way--as values of relation attributes defined on domains. We have seen that if domains are simple--have no meaningful components--FOPL is sufficient, relational data languages are declarative and support physical data independence (PDI). If the database designer defines all relation attributes meaningful to users on simple domains, the relation is in its normal form (1NF). 

If relations are not in 1NF, applications will require SOPL access to attributes that are implicit components of domains. Such subversion of the value atomicity defined into the simple domains by the designer essentially creates new domains and relations on the fly, "behind the back" of the DBMS, in violation of the IP. The implication is that designers should make sure that all entity properties of interest to users are represented by attributes defined on simple domains and not  implicit components of domains.

Sunday, April 17, 2016

This Week

1. What's wrong with this picture?

NoSQL database management systems give us the opportunity to store our data according to more than one data storage model, but our entity-relationship data modeling notations are stuck in SQL land. Is there any need to model schema-less databases, and is it even possible? --Theodore Hills, The Hybrid Data Model, Dataversity.net

Sunday, April 10, 2016

First Normal Form (1NF) in Theory and Practice, Part 2 (UPDATED)

(Cont'd from Part 1)

UPDATE (4/25/16): Minor refinements and clarifications.


In 1969 Codd explicitly cited second order predicate logic (SOPL) as a theoretical foundation for the RDM. Because it permits predicates over predicates (and queries over relations within relations), data languages based on SOPL are expressively more powerful than those based on first order predicate logic (FOPL). For example, SOPL languages give relational operations access to the components--tuples and attributes--of the "inner" relations--the values of relation-valued domains (RVD), which means that both outer and inner relations can be restricted/projected/joined/etc. within the same expression. 

Friday, March 25, 2016

Data Fundamentals for Analysts: Not Worth Repeating - Duplicates

“Stating the same fact more than once, does not make it truer, only redundant,” as E. F. Codd used to say. 

Frequent hits @dbdebunk.com are driven by the question “Are keys mandatory?” Puzzlingly, many data professionals do not seem to understand why duplicates should be prohibited. This should worry analysts.

But the absence of an identifier means that individual entities are not meaningful, so this representation contradicts the real world. Contradictions produce problems. First, a DBMS is incapable of “visually” discerning a data entry duplication error from "valid" duplicates, which means high risk of inconsistent databases and wrong counts and other query results.

Read it all.

Sunday, March 20, 2016

This Week

Q: I am trying to remove duplicate rows from a database containing 4446 tables and when finished, merge data from that database to an existing database with the same structure. Is there an easy way to remove duplicate rows from all those 4446 tables?

A1: Please tell us "what is a 4446 table?" And, please give examples of duplicate records.

A2: Recreate your database with unique primary keys. 2) Copy records from original database to recreated database, one table at a time. The duplicate records will drop on the floor. 3) Fix the thousands of bugs in your application code that created the duplicate rows in the first place.

A3: I'm creating temporary tables + I add unique index to that table and insert data into and then I rename it to the original one.