Sunday, December 3, 2017

DBMS for Analytics: Risky Business Without Foundation Knowledge (Part 1)

Note: This was originally a post at AllAnalytics, which no longer exists, so some links to other posts there no longer work; I left them in to alert the reader that I have written on those specific subjects. Other links work.

A new study finding that "non-relational database management systems now comprising 70% of analytics data sources" attributes their popularity to "superiority" over RDBMSs in satisfying analytics needs. There are good reasons to be skeptical of such findings, but even if this one were true, the arguments advanced in support of the claim are rooted in the usual misconceptions, due to poor foundation knowledge, that this blog debunks. Let's see.

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

Note: The RDM assumes conceptual models (i.e., that object groups, their properties and their relationships have been identified). Some analytical systems are designed to 'discover' the models (from data value relationships, sequence analysis, co-occurrences, etc.). These are two distinct endeavors that are commonly confused, but should not be, an issue I addressed in Data Meaning: Analytics vs. Data Mining and Data, Information, Knowledge Discovery, and Knowledge Representation.

"Only 30% of data analytics is still performed against traditional relational database management systems" while "approximately 70% are modern non-RDBMS sources like Hadoop, NoSQL, in-memory, search, columnar/MPP analytic and cloud native databases."
As my readers should know, there is little to no understanding in the industry of what an RDBMS is (What Is a True Relational System and What Is Not). In fact, there are 'no' true RDBMSs, only SQL DBMSs wrongly alleged to be relational. They have limited relational fidelity and nowhere near the capabilities and advantages conferred by the RDM. Classifications of DBMSs as relational cannot and should not be trusted. Some of the criticisms of "relational" systems thus apply to SQL DBMSs, not RDBMSs, and in what follows I will, therefore, substitute [SQL] for 'relational'.

Note: Even criticisms of SQL DBMSs cannot be of their "analytics capabilities" (analytics is an application function), but at best only of their data retrieval capabilities (data integrity and manipulation, which are data management functions of the DBMS). There is little recognition of this distinction in the industry (Understanding the Division of Labor between Analytics Applications and DBMS).
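The division of labor can be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite (via Python's standard sqlite3 module) stands in for a SQL DBMS, not a true RDBMS, and the orders table, its columns and its values are hypothetical. The DBMS enforces an integrity constraint and retrieves data; the median, an analytics computation, is done by the application.

```python
import sqlite3
import statistics

# SQLite is a stand-in SQL DBMS here; table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " amount INTEGER NOT NULL CHECK (amount > 0))"
)
conn.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
    [(1, 40), (2, 10), (3, 25)],
)

# DBMS function: integrity enforcement. The constraint is declared once
# and the DBMS rejects violating data regardless of which application sent it.
try:
    conn.execute("INSERT INTO orders (order_id, amount) VALUES (4, -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# DBMS function: manipulation (here, retrieval).
amounts = [row[0] for row in conn.execute(
    "SELECT amount FROM orders ORDER BY order_id"
)]

# Application function: analytics computed over the retrieved data.
median_amount = statistics.median(amounts)

print(rejected, amounts, median_amount)
```

A weakness in the last step would be a criticism of the application, not of the DBMS; only the first two steps are data management.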

"For example, analytics users that understand how to leverage graph queries can derive deep network structure insight and wide relationship analysis over graphed data that simply can't be computed on relational schema structured data."
There are applications for which directed graph data structures are suitable. But such applications are much rarer than proponents would have you believe, and the structures have serious drawbacks outside that context, not the least of which are prohibitive complexity and inflexibility. This is a core problem that the RDM was introduced to address and is a major reason for hierarchic and network DBMSs having been effectively dropped more than four decades ago in favor of even weakly relational SQL (so much for "modern" -- them who forget the past …).
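As a hedged aside, the quoted claim that network analysis "simply can't be computed" over tabular data looks too strong even for SQL, let alone a true RDBMS. The sketch below is an illustration, not something taken from the study: it assumes SQLite (3.8.3 or later, for recursive common table expressions) as a stand-in SQL DBMS, and the follows table and its rows are hypothetical. It computes everything reachable from one node, a typical "graph query", over a plain two-column table.

```python
import sqlite3

# Hypothetical edge data held as an ordinary table: one row per directed link.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE follows ("
    " src TEXT NOT NULL,"
    " dst TEXT NOT NULL,"
    " PRIMARY KEY (src, dst))"
)
conn.executemany(
    "INSERT INTO follows (src, dst) VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("eve", "ann")],
)

# Transitive closure from a starting node via a recursive query.
# UNION (rather than UNION ALL) discards duplicates, so cycles terminate.
reachable = [row[0] for row in conn.execute(
    """
    WITH RECURSIVE reach(node) AS (
        SELECT dst FROM follows WHERE src = ?
        UNION
        SELECT f.dst FROM follows AS f JOIN reach AS r ON f.src = r.node
    )
    SELECT node FROM reach ORDER BY node
    """,
    ("ann",),
)]

print(reachable)
```

Whether such a query performs well is an implementation matter, separate from whether it can be expressed.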

To the extent that there is anything to the often repeated claim that "[SQL] databases have a fraught relationship with applications written in object-oriented programming languages like Java, PHP and Python", that is intentional, to get away from the latter's affinity to the directed graph structures and avoid their problems. 

But (1) this has nothing to do with analytics per se and (2) a 'careful' separation between computationally complete programming languages (CCL) and data languages is necessary to guarantee relational advantages (Data Sublanguages, Programming, and Data Integrity).

Logical-physical Confusion

In-memory, columnar, and cloud native DBMSs are DBMS implementations that say nothing about their underlying data model, which is ignored in DBMS reviews, evaluations, or comparisons (Structure, Integrity, Manipulation: How to Compare Data Models). No wonder that misconceptions -- rather than true RDBMSs that would address the valid criticisms of SQL -- proliferate (Database Management: No Progress Without Data Fundamentals). This logical-physical confusion (LPC) is rampant (Don't Mix Model with Implementation) and underlies most of the non-RDBMS superiority arguments:

  • "… scale horizontally";
  • "processing huge amounts of data in the cloud";
  • "allowing relatively low-cost servers to be combined into a single, powerful cluster";
  • "solve great performance challenges, tackle huge scales of data, help mine value from a wider variety of data".

But true RDBMSs support physical independence (PI), which means that any such implementations are possible for RDBMSs too.
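Physical independence can be sketched in a few lines. This is a minimal illustration, not a DBMS design: the RowStore and ColumnStore classes, the total_by_region function and the sample data are all hypothetical names introduced here. The same logical request is answered identically over a row-oriented and a column-oriented layout, because the request refers only to attributes, never to how they are stored.

```python
from typing import Dict, Iterable, List, Protocol, Tuple

Row = Tuple[str, int]  # (region, amount): hypothetical attributes


class Storage(Protocol):
    """Anything that can yield the rows of the logical relation."""

    def scan(self) -> Iterable[Row]: ...


class RowStore:
    """Physical layout 1: whole rows stored one after another."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)

    def scan(self) -> Iterable[Row]:
        return iter(self._rows)


class ColumnStore:
    """Physical layout 2: one array per attribute (columnar)."""

    def __init__(self, rows: List[Row]) -> None:
        self._region = [region for region, _ in rows]
        self._amount = [amount for _, amount in rows]

    def scan(self) -> Iterable[Row]:
        return zip(self._region, self._amount)


def total_by_region(store: Storage) -> Dict[str, int]:
    """Logical request: total amount per region. It knows nothing about layout."""
    totals: Dict[str, int] = {}
    for region, amount in store.scan():
        totals[region] = totals.get(region, 0) + amount
    return totals


SALES = [("east", 10), ("west", 5), ("east", 7)]
row_result = total_by_region(RowStore(SALES))
col_result = total_by_region(ColumnStore(SALES))
print(row_result == col_result)
```

Swapping the layout changes performance characteristics, not the request or its result, which is the sense in which columnar, in-memory or clustered storage says nothing about the data model.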

There's more. Stay tuned for Part 2.

