Friday, December 29, 2017

DBMS for Analytics: Risky Business Without Foundation Knowledge, Part 2

Note: This was originally posted at AllAnalytics, which no longer exists, so some links to other posts there no longer work, but I left them in to alert the reader that I have written on those specific subjects. Other links work.  

In this two-part series I alert analysts that correct interpretation and assessment of media/industry claims without being misled requires a good grasp of data fundamentals. In Part 1, I discussed the logical-physical confusion and the erroneous missclassification of DBMSs as relational and non-relational underlying the argument that the latter are superior to the former for analytics applications. In Part 2, I discuss a third misconception behind the claim.

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. If you deem this site worthy of continuing, please support its upkeep. A generous reader has offered to match all contributions, so please take advantage of his generosity. Regular monthly contributions and books and papers purchases can ensure this material unavailable anywhere else will continue to be free. Thanks.

Consistency and Integrity

"RDBMS's greatest strength -- data integrity -- has now become its biggest weakness. To ensure consistent entry of data, (RDBMS) requires a strict data model enforced by tons of referential data relationship constraints ... [a] data model burden emerges as a serious limitation in cases where an organization requires flexibility ... enter the non-relational database."
Data practitioners who perceive integrity as a weakness are in the wrong field -- they do not understand what integrity and consistency are and, therefore, the meaning of data (What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency, Redundancy, Consistency, and Integrity Derivable Data). By enforcing integrity constraints, a DBMS ensures consistency of the database with the business rules that denote the meaning of the data (i.e., the faithfulness of the database to the conceptual model of the segment of the real world it represents). No analyst should accept data that does not represent reality accurately, yet they do without being aware of it (Relational Fidelity & Analytics Integrity). A DBMS that doesn't enforce integrity cannot be considered a true DBMS. SQL DBMSs do a rather poor job of that (To Really Understand Integrity, Don't Start with SQL), non-SQL and other non-RDBMSs give it up altogether -- their data structures make it both theoretically and practically unfeasible. Trading off integrity for "flexibility" is a bad analytics strategy, but how can technology that had already been dropped as inflexible in favor of SQL several decades ago be a better option today? They who forget the past ...

Giving up integrity is sometimes acceptable for data warehouses and other read-only databases, but, first, this is somewhat of an illusion: they must be populated and without constraint enforcement, what ensures consistency for all applications (including analytics) using them, particularly if the data come from SQL, or non-relational databases where constraints are not enforced either? (The Trouble with Data Warehouse Analytics). Second, beyond integrity, the theoretical foundation of the RDM guarantees logical validity of the data sets retrieved for analysis; and adherence to relational design principles (Database Design: What It Is and Isn't) ensures their semantic correctness (Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness). Neither "modern" non-RDBMSs, nor "traditional" SQL DBMSs guarantee them.

"Unstructured Data"

I used quotes because it is a contradiction in terms ('Unstructured Data': Why This Popular Term Is Really a Contradiction, Structuring the World With 'Unstructured Data'). All data is structured by definition -- if they weren't, they would be random noise, not data.

" ... if you're dealing with massive amounts of unstructured data, you may not have the luxury of developing a relational database with clearly defined schema ... related information of all types. Instead of relying on tables, non-RDBMS databases are document-oriented. This way, non-structured data -- such as articles, photos, social media data, videos, or content within a blog post -- can be stored in a single document that can be easily found, but isn't necessarily categorized into fields like a relational database does."
More fundamental misunderstanding. All data management involves manipulation of some data structure (Analytics = Manipulation of Data Structure), which determines the kind of manipulation applicable to it (i.e., what kind of questions can be asked of the data and what kind of answers can be obtained). All the enumerated elements are structured, it's just that the structures are complex and, thus, so is their manipulation (All Structures Aren't Created Equal) and integrity protection. Documents are multi-structured (Documents and Databases). Data are not "inherently" relational or non-relational -- it is a matter of modeling and design choices. Relational structure has advantages that are critical -- decidable data languages, guaranteed logical validity, semantic correctness, physical and logical independence -- that no non-RDBMS can deliver (Data Sublanguages, Programming, and Data Integrity, Physical Data Independence). The upfront effort is, thus, not a luxury, but it is not done either because the value is unappreciated, modeling and design skills are lacking, or out of expediency and laziness. Problem is, if you don't understand reality sufficiently to model it and design a relational database, you have no hope of trustworthy databased analytics.

No comments:

Post a Comment

View My Stats