Thursday, March 2, 2017

The Trouble with Data Warehouse Analytics

You've probably heard the frequent argument that relational databases (which, unfortunately, in practice, means SQL ones) do not serve the performance, flexibility, and temporalization needs of analytical applications satisfactorily. Indeed, Anchor, Data Vault, and Dimensional Modeling techniques are promoted as solutions to the "problems" due to normalized databases. All this is rooted in certain fundamental misconceptions that can be costly for business intelligence, analytics, and data science.

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

What the industry calls a warehouse is essentially a materialized view of an operational database that typically doesn't get updated in full sync with the underlying database -- i.e., a time-specific snapshot database copy. Such views are produced by data manipulation -- applying some operations to the database. In effect they are query results. Many if not most data professionals overlook a unique and crucial feature of truly relational DBMSs and databases -- their formal dual theoretical foundation -- simple set theory (SST) and first order predicate logic (FOPL). If relational set operations are properly applied to fully normalized database relations, results are guaranteed to be logically and semantically correct.

Thus, (1) given a true RDBMS and a fully normalized relational database (2) a "transformation process" is triggered (by documented circumstances) that applies relational algebra operations to the database to produce the desired view, which (3) is materialized by persisting it in storage. If either the design of the database or the transformation applied by the DBMS violates the theory, correctness is no longer guaranteed.

While warehousing is often undertaken to improve performance, the transformation is purely logical and performance is determined exclusively at the physical implementation level -- how the data is represented and accessed in storage. The only physical implementation aspect of data warehousing is the materialization of the view, which is independent of warehousing techniques.
Warehouses are essentially databases biased for some data applications (and against others) and are rooted in poor database foundation knowledge and logical-physical confusion. Even when warehouses consist of relations, warehouse developers often do not understand their precise interpretation. The design is based on various unwarranted or false assumptions about what the data means. The above modeling methods do not allow documenting the transformation itself -- the relational algebra operations that comprise the transformation. But more often than not warehouses do not consist of relations, which are minimally required to be in first normal form (1NF), and are, therefore, not just denormalized, but non-relational. Consequently, all bets are off; sound derivations of correct analytical results are not guaranteed.

Neutral databases that serve different data views to multiple applications were introduced as a solution to the prohibitive problems caused by application-specific biased files. Application programs needing different views of the same data required differently structured files, proliferating redundancy and inconsistencies. Each and every application program had to enforce data integrity and security, and optimize performance -- functions now centralized in the DBMS. Application-based integrity enforcement created a redundant, complex, and error-prone maintenance burden that was so prohibitive it was mostly foregone. Due to lack of familiarity with history, application-biased databases are bringing those problems back. Those who forget the past are doomed to repeat it.

Because warehouses are read-only, the risks to data integrity may be smaller. They are confined and exclusive to the developer responsible for the transformation and the warehouse load procedure, an environment that is more controlled than the shared operational database that are usually updated by many applications/users. But warehouses are populated by SQL DBMSs that are not truly relational, from poorly designed operational databases, so all bets are off.

It sometimes makes sense to offload data and analytics from operational databases. But without the relational guarantee, caveat emptor.

1 comment:

  1. Today people have problem understandig science (based on mathematics - logic, sets) and some srbitrary concepts that are based on someone's authority or 'industry standard' devoid of any logical framewor. (Hello Object oriented and data warehousing ;-) )

    Data Warehousing 'science' suffers from lack of logic. It contradicts itself: it is supposed to be used for ad hoc querying, yet it requires aggregation and assumption about intended use. It is built for 'speed' yet cube queries take hours to execute - no kidding - this is from a Microsoft sponsored course. Data cleaning is the most laughable requirement - isn't data already clean if it comes from a relational database? What is the purpose of keeping data somewhere if it is not clean? I it was not clean at the moment of entry into database/file/spreadsheet, how are we going to clean i? Connecting data from heterogeneous sources. Like SQL databases, Access files, Excel tables and Word documents? Thank you, but no thank you. Analytic for business? Good luck with present skill level of executives and decision makers. Since executives are not capable doing simplest things in spreadsheets, we must invent mambo-jumbo arbitrary concepts, and make it look serious and scientific, to keep our jobs? Again, thank you but no thank you.


View My Stats