Friday, March 25, 2016

Data Fundamentals for Analysts: Not Worth Repeating - Duplicates

“Stating the same fact more than once, does not make it truer, only redundant,” as E. F. Codd used to say. 

Frequent hits are driven by the question “Are keys mandatory?” Puzzlingly, many data professionals do not seem to understand why duplicates should be prohibited. This should worry analysts.

But the absence of an identifier means that individual entities are not meaningful, so this representation contradicts the real world. Contradictions produce problems. First, a DBMS is incapable of “visually” discerning a data entry duplication error from "valid" duplicates, which means high risk of inconsistent databases and wrong counts and other query results.

Read it all.

Sunday, March 20, 2016

This Week

Q: I am trying to remove duplicate rows from a database containing 4446 tables and when finished, merge data from that database to an existing database with the same structure. Is there an easy way to remove duplicate rows from all those 4446 tables?

A1: Please tell us "what is a 4446 table?" And, please give examples of duplicate records.

A2: Recreate your database with unique primary keys. 2) Copy records from original database to recreated database, one table at a time. The duplicate records will drop on the floor. 3) Fix the thousands of bugs in your application code that created the duplicate rows in the first place.

A3: I'm creating temporary tables + I add unique index to that table and insert data into and then I rename it to the original one.

Wednesday, March 9, 2016

Data Fundamentals for Analysts: The Fourth V -- Veracity

Veracity is about truth, which is distinct from meaning. It’s not that technology alone will not solve the “fourth V problem”-- technology cannot play any part in its solution. It’s different for meaning and relational databases, but not with big data technologies and data science.

Read it all

Sunday, March 6, 2016

This Week (UPDATED)

1. Quote of the Week
[With] a declarative language and you have no real control over the execution plan. Heck, Oracle 12c can use multiple different execution plans for the same query depending on bind parameters, statistics not matching reality, and the phase of the moon.

No declarative language will ever be a first-class programming citizen in my eyes. Too much magic. Too non-linear. Too hard to debug for those reasons. They are great for when you have to do really simple things but once you step off the well worn path, you are in the thicket.

Tuesday, March 1, 2016

First Normal Form (1NF) in Theory and Practice (UPDATED), Part 1

This series is a full rewrite of my last post on the normal form (1NF). It replaces all the previous posts (redirected here) and all my writings on the subject. I would like to thank David McGoveran for reviewing drafts, offering important comments and sharing some details on the genesis and history of the 1NF and Codd's thinking on the subject.

UPDATE (4/25/16): Minor refinements and clarifications.

First Normal Form Misunderstood

On the one hand,
"... there is no generally accepted definition of 1NF ... the word that you see most often is 'atomic'. It is common to say that a relation is in 1NF if all its attributes [sic] are atomic ... Does 1NF equate to “atomic attribute [values]”? ... what [do] people have in mind [when they claim] atomicity? ... the [meaning] behind definitions is that you should rarely need to extract information from a value of an attribute ... But that explains why one cannot decide, depending on theory only, whether a relation is in 1NF ... it is a habitual use of data that makes attributes atomic, not theory. No wonder, there is so much mess in theory about what 1NF should be." --What is the actual definition of First Normal Form (1NF)

On the other,
"Codd in 1969/70 (and RM V/2 20 years later) gave a precise, theory-based definition of "atomic" aka "simple" aka "non-decomposable" (later aka non-"compound" aka non-"structured"): not relation-valued. And he gave a precise definition of "normalized" (1NF): relations free of relation-valued-domains (RVD)!" --DBDebunk reader