Monday, October 2, 2017

Understanding the Division of Labor between Analytics Applications and DBMS

I am coming across, on the one hand, instructions on how to do "analytics with SQL" and, on the other, tools purporting to enable "analytics without SQL." They are an umpteenth iteration of essentially similar ideas during my 30-plus years in data management and reflect common and entrenched fundamental misconceptions that I have documented and analyzed the costly consequences of in my writings and teachings. They will keep repeating, inhibiting genuine progress, as long as data fundamentals are ignored or dismissed. One of the least understood is the distinction between DBMS and application functions.

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

DBMS functions are the responsibility of and performed by the DBMS 'for all applications'. There are several categories thereof, but here we will focus on the one including the three functions defined by a data model such as the RDM: structure, integrity and manipulation. A true RDBMS has a relational 'data sub-language' with which (1) relations and (2) integrity constraints are defined and data can be retrieved by (3) manipulating relations mathematically as sets -- i.e., applying the set operations of the relational algebra on them. Restricting the sub-language to the three data management functions makes first order predicate logic (FOPL) sufficient, with the following advantages:
Note: Other DBMS functions are storage management, performance optimization, concurrency control, transaction management. They are not the purview of the data model, are performed by the DBMS, but not via the data sub-language, because they require higher logic than FOPL (see next).

Application-specific functions are the responsibility of and performed by each application. They include (1) result preparation and (2) presentation and (3) communication with the DBMS. They are not limited to data management, so they require higher logic than FOPL and, thus, 'computationally complete' languages (CCL). A CCL (e.g., programming language), is expressively more powerful than a relational data sub-language, but:
  • Is imperative (procedural), prone to undecidability and significantly more complex;
  • Logical validity and semantic correctness are not system-guaranteed;
  • Physical and logical independence are not supported.
'Proper hosting' of relational data sub-languages in CCLs is the optimal division of labor between the DBMS and applications: relational advantages are preserved for data management functions performed by the DBMS and only application-specific functions are relegated to applications. Violating this arrangement by extending either (a) the data sub-language with application functions, or (b) the host CCL with data management functions destroys optimality and causes a plethora of costly complications.

Consider now the above two "solutions" for analytics. SQL was intended to be a relational data sub-language and, as such, should have expressed strictly the three data management DBMS functions. So, first, forcing SQL to perform analytics functions is a type (a) violation. Second, analytic functions are application-specific and, therefore, tools that purport to "perform analytics without SQL" is a trivial promise that should be the case 'by definition.' The correct perspective for 'databased analytics' is, of course, that applications use CCLs that (1) host SQL for data management DBMS functions -- data retrieval -- and (2) perform analytics functions that do not require, or rely on SQL.

This should not be taken as a SQL recommendation for data management functions. It is neither truly and fully relational, nor a well designed language, nor a strictly data sub-language, as it haphazardly includes some application-specific functions. But unfortunately, SQL is the industry's standard database language that has no superior alternative. This is in itself due to its authors lacking a good grasp of data fundamentals and the RDM and contributes to the persistent failure to understand and appreciate the optimal division of labor and to provide tools that enforce it.

Therefore, to avoid the traps that industry practices and products impose, it behooves the analyst to acquire foundation knowledge.

No comments:

Post a Comment

View My Stats