Tuesday, August 1, 2017

Structure, Integrity, Manipulation: How to Compare Data Models




The IT industry operates like the fashion industry: every few years -- and the number keeps getting smaller -- a "new" data technology pops up, with vendors, the trade media and various "experts" all stepping over each other to claim that it'll "revolutionize your business" and unless you jump on the bandwagon, you'll be "left behind." But time and again these prove to be fads lacking a sound foundation. Huge resources are invested in migrations from fad to fad, rather than in productive work (Don't believe the hype about Hadoop usage, Basta, Big Data It's Time to Say Arrivederci). Remember?
"Hadoop seems to take over relational database, as Hbase can store even unstructured data whereas relational data warehouse limits to structured data ... handles traditional structured data just fine, albeit in a different way than a RDBMS ... EDW vendors [will] incorporate Hadoop framework into their core architectures to enable advanced and high performance analytics."


--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
---------------------------------------------------------------------------------

First, "unstructured data" is a contradiction in terms: all data is structured 'by definition' -- if they weren't, they would be meaningless random noise and, therefore, not data. Second, performance is determined exclusively by physical implementation (hardware, OS, DBMS and its optimizer, network load, concurrent access, and so on). Associating it with the data model, which is purely logical, is 'logical-physical confusion' (LPC). Third, practically all references to 'relational' are actually to SQL systems, which are far from relational. Fourth, data warehouses are non-relationally designed databases.
But, fifth and most important, "handling data in a different way" is so much handwaving. Data management IS (1) manipulation of some (2) data structure with 3) protection of data integrity, the three components of a data model -- i.e., there is no data management without one. So any claim for a data management technology of equivalence to, or superiority over an alternative involves a comparison of the formal data structure, manipulation and integrity components of the corresponding data models and their pros and cons.
The three components of the RDM can be specified precisely and succinctly:
  • Structure = relation, represents object group:
- attributes represent properties
- tuples represent (facts about) objects
- domain and attribute constraints represent property rules
- tuple constraints represent object rules (relationships among properties)
- multi-tuple constraints represent multi-object rules (relationships among objects)
- database (multi-relation) constraints represent multi-group rules (relationships among groups)
  • Manipulation = Relational algebra, represents inferencing--i.e., derivation of new facts from database facts
For a true RDBMS and fully normalized databases the exclusive advantages are:
  • Declarative, decidable data language -- i.e., queries are guaranteed to terminate;
  • Physical and logical independence -- i.e., queries and applications are insulated from physical and some logical reorganizations of the database;
  • System-guaranteed
- database consistency (with the business rules in effect)
- logical and semantic correctness of query results;

all due to the RDM's formal foundation in first order logic and set mathematics.
What exactly is there to gain in return for opting for a "different way" and giving them up? Not only don't you get an answer when you ask proponents. They cannot even specify the underlying data model. Instead, you get ramblings without any specifics e.g.,
"You can and do model data in Hadoop. Indeed, the first thing you do with HBase is decide how to organize your data into column families. Kind of hard to argue that doesn't happen because we and lots of others do exactly that. Ditto with Hive. Heck, even the choice of how you write data into HDFS requires some choice on organization even if it "just' mimics the what it was modeled the application/source it is being loaded from. Am I saying that is the same type and depth of modeling as an RDBMS, no. It is still data modeling, yes. The fact it isn't the same as a RDBMS is the whole point, there are design considerations where you want the HBase model instead of a RDBMS."
"Modeling columns" and "expressing relationships" is too little and too vague.
When somebody pushes an alternative to the RDM (including BigData/Hadoop), ask what
  • Its three formal components are (exactly, please!)?
  • In the real world they represent?
  • Is the underlying theoretical foundation? and
  • Advantages they offer to compensate for the loss of those of the RDM?
Let me know if you ever get a satisfactory answer.



No comments:

Post a Comment

View My Stats