Wednesday, May 25, 2016

Why Data Scientists Must Understand Normalization

My May post @All Analytics:

We are constantly told how data scientists must be “jacks of many skills”, but one of the most important is rarely included in the list.

Very few databases are properly designed. Many SQL databases are denormalized inadvertently, or intentionally (and erroneously) "for performance". They require special constraints to control data redundancy and prevent inconsistencies, which are practically never enforced. Analysts cannot, therefore, take database consistency for granted. Furthermore, to issue sensible queries and ensure correct results and interpretation thereof, it’s not enough for analysts to know the types of fact represented in the database, but also whether and how the database designer has chosen to bundle -- nest or merge -- those facts and how to disentangle them for analysis.

Read it all (and comment there, not here, please). 

Saturday, May 21, 2016

This Week (& a Change in Format)

I am changing the format of the posts on DBDebunk.

"This Week" posts every other week will continue to be followed by posts on data and relational fundamentals, but with a twist: each of the latter posts will explain the preceding week's "What's wrong with this picture?" and with "Quote of the Week".

Here's the first post in the new format

1. What's wrong with this picture?

"Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables. This usually leads to data storage becoming redundant, and also difficult to maintain on account of addition and deletion anomalies. This is only the case if we do not normalize the data. Keys are fields or combinations of fields used to identify records." --Nigel Peck, Denormalization Summary

2. Quote of the Week

"Relations are multidimensional. They are not flat. They are not two dimensional. Don't let the term table mislead you." I read the above statement on the back cover of CJ Date book on relational theory "Database in depth". Can anyone help how to visualize this multidimensional nature of relations?"

Sunday, May 15, 2016

REWRITTEN: Weak Entities, Referential Constraints and the Conceptual-Logical Conflation

I rewrote the earlier post Weak Entities, Referential Constraints and Normalization.

Data model agnosticism has nothing to do with the “emergence/popularity of non-relational technologies” (NoSQL, NewSQL did not exist when Chen introduced the E/RM in the 70's), it is  is inherent in conceptual modeling--which is what E/RM is used for--by definition. Conceptual models are expressed in real world terms--facts about classes of property-sharing entities. To be represented in a database, a conceptual model must be formalized as a logical model expressed in database terms. That's when a data model comes into play, to provide abstract data structure, integrity and manipulation (relation, relational constraints and relational algebra in the relational case). The conceptual and  logical were always separate, but as Everest comments show, levels of representation are constantly confused. 

Read it all

Monday, May 2, 2016

This Week

1. What's wrong with this picture?
The query framework, as it exists in modern day MongoDB, supports the following functionality:
  • Filtering. The rough equivalent of the WHERE clause in SQL.
  • Paging. The rough equivalent of LIMIT and OFFSET in SQL.
  • Sorting. The rough equivalent of ORDER BY in SQL.
--John De Goes, MongoDB: The Frankenstein Monster of NoSQL Databases,

Tuesday, April 26, 2016

Data Fundamentals for Analysts: Nested Facts and the (First) Normal Form

My April column @All Analytics:

A R-table with attributes defined only on simple domains takes a less convoluted form -- a normal form -- devoid of nesting. If R-tables are in the preferred normal form i.e., components meaningful to applications (here, employee attributes) are simple domains in their own right and a true RDBMS enforces value atomicity -- first order logic is sufficient. This imposes some limitations on the expressive power of data languages, but they are declarative and PDI and simplicity are preserved. A true RDBMS enforces atomicity via a data language that does not allow applications to access attribute components not explicitly defined on their own domains.

Read it all (and comment there, please).

I have revised all three parts of the series on 1NF -- mainly refinements and clarifications.

Tuesday, April 19, 2016

First Normal Form (1NF) in Theory and Practice, Part 3 (UPDATED)

(Cont'd from Part 2)

UPDATE (4/25/16): Minor refinements and clarifications.
"Is this table in 1NF?" is a common question in database practice. On the other hand, "What problems are solved by splitting street addresses into individual columns?", or  
What's the best way to store an array in a relational database does not seem to evoke connections to 1NF. This reveals poor foundation knowledge.

Database Design Consistent with Data Use

The Information Principle (IP) mandates that all information in a relational database is represented explicitly and in exactly one way--as values of relation attributes defined on domains. We have seen that if domains are simple--have no meaningful components--FOPL is sufficient, relational data languages are declarative and support physical data independence (PDI). If the database designer defines all relation attributes meaningful to users on simple domains, the relation is in its normal form (1NF). 

If relations are not in 1NF, applications will require SOPL access to attributes that are implicit components of domains. Such subversion of the value atomicity defined into the simple domains by the designer essentially creates new domains and relations on the fly, "behind the back" of the DBMS, in violation of the IP. The implication is that designers should make sure that all entity properties of interest to users are represented by attributes defined on simple domains and not  implicit components of domains.

Sunday, April 17, 2016

This Week

1. What's wrong with this picture?

NoSQL database management systems give us the opportunity to store our data according to more than one data storage model, but our entity-relationship data modeling notations are stuck in SQL land. Is there any need to model schema-less databases, and is it even possible? --Theodore Hills, The Hybrid Data Model,