ON DIRTY DATA AND CLEANING
with Fabian Pascal

 

 

 

From: Karen Simmons

Date: Sep 15 2005

 

I came across this paragraph in an online article, "12 Tips for Generating Rich Data," at destinationcrm.com (emphasis mine):

 

3) Clean your data regularly.  There are many kinds of dirty data. Some of the most basic--having multiple entries for the same customer or misspellings--can be the most labor-intensive to remove. Other cleansing issues stem from organizational problems. Your marketing department might classify data one way with one naming convention, while your sales department uses another. But it all goes back to policies: Require all users to input data the same way, and clean data often, deleting mistakes and duplicates.

 

Am I wrong in thinking that if your database is set up properly, your users can't classify data willy-nilly, and that you therefore won't have to clean your data, deleting mistakes and duplicates?  From what I've seen, policies (in and of themselves) don't ensure data integrity.

 

There's more, but the above blurb was what jumped out at me.

 

 

From: Fabian Pascal

 

But of course. But that requires thinking up front and doing design. And that's hard for the ignorami who can't think [and for whom it is easier to believe that you can do away with design].
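A minimal sketch of what "set up properly" can mean in practice, using Python's standard sqlite3 module. The table names, column names, and sample values are hypothetical and not from the article, and the sketch assumes that an email address identifies a customer. The point is only that declared constraints, unlike policies, make the DBMS itself reject a duplicate entry or an unrecognized classification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE segment (
    name TEXT PRIMARY KEY              -- the one agreed list of classifications
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,  -- assumption: email identifies a customer
    segment     TEXT NOT NULL REFERENCES segment(name)
);
INSERT INTO segment (name) VALUES ('retail'), ('wholesale');
""")

def try_insert(email, segment):
    """Attempt an insert and report whether the DBMS accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (email, segment) VALUES (?, ?)",
                (email, segment),
            )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert("a@example.com", "retail"),      # valid row
    try_insert("a@example.com", "wholesale"),   # second entry for the same customer
    try_insert("b@example.com", "Retail-ish"),  # classification nobody agreed on
]
print(results)
```

Only the first row is stored; the other two never reach the table, so there is nothing to clean afterwards. A uniqueness constraint does not catch a misspelled name entered under a different key, so this illustrates the principle rather than a complete solution.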

 

 

Ed. Note: In practice, there is more to it. First, most SQL products have poor integrity support, which means that many constraints must be implemented in application code. This is error-prone and often prohibitively costly. Second, the problem is exacerbated by poor (undernormalized) design, which introduces redundancy and makes the integrity burden considerably heavier: every redundantly stored fact requires additional constraints to keep its copies consistent. That burden is almost never addressed. In fact, most practitioners are oblivious to the increased integrity risks of denormalized designs.
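The redundancy point can be illustrated with a small sketch, again in Python with the standard sqlite3 module. All table and column names here are hypothetical. In the undernormalized table the customer's city is repeated on every order row, so a partial update leaves the database contradicting itself unless extra constraints or application code prevent it; in the normalized pair of tables the city is stated once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Undernormalized: the customer's city is repeated on every order row.
CREATE TABLE order_flat (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL,
    customer_city TEXT NOT NULL
);
INSERT INTO order_flat VALUES (1, 10, 'Boston'), (2, 10, 'Boston');

-- Normalized: the city is recorded once per customer.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    city        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
INSERT INTO customer VALUES (10, 'Boston');
INSERT INTO customer_order VALUES (1, 10), (2, 10);
""")

# An application changes the city on one order row and misses the other.
# Nothing declared in the schema stops it.
conn.execute("UPDATE order_flat SET customer_city = 'Chicago' WHERE order_id = 1")
flat_cities = conn.execute(
    "SELECT COUNT(DISTINCT customer_city) FROM order_flat WHERE customer_id = 10"
).fetchone()[0]

# In the normalized design the same change is a single-row update.
conn.execute("UPDATE customer SET city = 'Chicago' WHERE customer_id = 10")
normalized_cities = conn.execute(
    "SELECT COUNT(DISTINCT c.city) "
    "FROM customer_order o JOIN customer c USING (customer_id)"
).fetchone()[0]

print(flat_cities, normalized_cities)
```

After the updates, customer 10 has two different cities in the flat table and exactly one in the normalized design. Keeping the flat table consistent would take an additional constraint for each redundantly stored fact, which is the extra integrity burden the note describes.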

 

 

Posted 10/28/05