Sunday, February 23, 2014

Thinking Logically: SQL, NoSQL and the Relational Model

I very much doubt that somebody who does not think logically can be a fully competent database professional. From a LinkedIn exchange, an attempt to "address some of my points by calling them out":
JL: You said: "... much of the underlying motivation of NoSQL stuff is anti-relational ..." Okay, so? Data management in graph/document/columnar dbs is not possible because they are "anti-relational"?
The point I made was that, name notwithstanding, NoSQL vendors/proponents are not just anti-SQL, they are actually anti-relational, an important difference.
  • What exactly does this have to do with whether graph/documents/columnar database systems "are possible or not"?
  • Columnar DBMS's can be relational and are not considered NoSQL products.
  • At issue is not the possibility of graph/document systems, but what they are appropriate for.

A document management system serves a different purpose than that of a relational database system, which is a deductive logic system: it infers facts that are logical implications of the facts recorded in the database. Problems materialize when document systems are presented as alternatives to relational or SQL systems, or are mistaken for such. Graph systems are a different story. They are a re-labeled version of the hierarchic and network database systems that were discarded decades ago as inferior to even SQL
JL: You said: "... they just think that SQL is relational". Not really. Do you have evidence that? On contrary, there is an Apache project, Hive, atop Hadoop providing SQL like capabilities. If "they" (whoever "they" are) would "think SQL is relational" then clearly using SQL on top of Hadoop wouldn't have been conceived off. Likewise, Aster (graph db), OrientDB (graph db and document store), have SQL parsers as optional access layers on top of their kernels. And, Cassandra, HBase (columnare db) also have SQL access layers.
More faulty logic.
  • Regarding evidence, JL has got to be kidding.
  • That SQL is considered relational does not mean that it is. In fact, it is far from being a true and complete concretization of the relational model, so being used as a data language by non-relational systems does not necessarily negate my argument. Whether they are functional equivalents or superior to native SQL systems, however, is another matter altogether.
  • If a data model (in the Codd sense in which the relational model is one) underlies HADOOP, I am unaware of a definition, formal or otherwise--what exactly are its structure, integrity and manipulation features? Until it is specified, the fact that a SQL interface has been implemented suggests that it essentially constitutes some physical implementation scheme, none being ruled out, as long as it is not exposed to users in applications.
JL: You said: "Moreover, most NoSQL products violate the relational model without any support for SQL." Well, Fabian, ALL NoSQL products are NOT designed for relational modeling, and are not suppose to be Relational theory compliant. Yes, data will be duplicated; yes, there will be no 3rd or higher normal forms. And guess what, that is by design.
You said: "All NoSQL products (and even SQL products) violate it." Of course, NoSQL was never intended, by design, to support RM.
Which validates my contention that the motive behind NoSQL is anti-relational. But to the extent that they reveal this motive, proponents do state or suggest that relational and SQL systems "have limitations" that NoSQL systems overcome.
JL: The respective design/architecture is around CAP theorem.

It is, indeed, the case that NoSQL products trade off consistency for scalability/performance, but that does not mean that the CAP theorem is the only or even main fundamental difference between them and relational systems, as JL implies. Moreover, I reiterate that as long as the former are deployed consciously, with full knowledge of the tradeoff and its implications, it's OK. That, however, is far from being the case in reality (see example in next).
You said: "All NoSQL products were motivated by the illusion that you can achieve the same informational purposes as RDBMS's without the upfront effort of structuring the data in R-tables." This statement in totality is conjecture.
No, it is not. Here's just one example as indisputable evidence: Why You Should Never Use MongoDB and my Anatomy of a Data Management Project posts @All Analytics.
JL: You said: "... an absurdity: data is inherently structured, otherwise it is random noise." This is use-case dependent. If data is being collected through a predesigned form, and stored in RM, sure, there is "inherent structure". However, this is not always the case. Example, data generated via PLC on factory floor, data generated by various sensors in an ER room, data generated by various circuitry in Mars Rover, data generated on various web pages throughout the internet, data generated on blog sites. I think you got my point. If I want, I can premise: data is in fact inherently noisy because data itself does NOT have relationship constructs. Only inherent aspect is the semantics expressed in the content, and of course, respective metadata.
This betrays very basic lack of understanding of data fundamentals.
  • Factory floors do not generate random noise--that would be informationally useless--and meaning is carried by structure. It's just that a particular informational objective requires a different structure than the one generated by the factory floor. There are essentially two options:
  1. Determine the useful structure via conceptual modeling and logical database design ("defining the relationships", which is what the RM is for);
  2. Dump the data into some arbitrary structure e.g. blobs or documents and "embed" the model in applications;
1 requires a thorough knowledge of the business, informational objectives and data fundamentals, which is
scarce and requires thinking and time upfront. This makes 2 expediently attractive by comparison, hence the illusion.
JL: It seems that you are confusing metadata as "inherent structure"; where in-fact, the inferences (cognizing relationship) made off data provide a perception of inherent structure. From the semantics inherent in data, a contextual interpretation of concept/entity/object interrelationships yields a domain model. Furthermore, data is just instances of this domain model, which can be persisted in ANY database.
No, I am not. Metadata are, ultimately, data and are often generated precisely because the "regular" data are not organized in a useful structure via modeling. However, metadata is not a full substitute for data.
JL: You said: "Sure, they pay lip service to "for different purposes than RDBMS", but read their technical and promotional material and that illusion is what they use to attract customers." lol. Honestly, not. Forget vendors, lets look at equivalent Apache projects and open source projects by community, by companies like Facebook, Google, Amazon. They have no promotional material and they not at all claim anything to do with RM.
Open source projects are not fundamentally different than commercial vendors: there is an interest in growing the user base and, while there is no profit from selling the product, there is usually from support and services, which justifies the endeavor. More importantly, foundation knowledge is not more common in the OS than it is in the commercial segment, I am sorry to say. If they do not always make references to RM, it is quite likely that they do not know or understand it.  I refer JL and to this site and my writings elsewhere, which document cases where they do and present products as superior alternatives to RM. What they don't do is provide a clear, if any, definition of their data models, how they are different than the RM, and with what practical implications. This is very costly for uneducated users.

No comments:

Post a Comment

View My Stats