NoSQL and SQL: A Plague on Both Their Houses

Follow @DBDebunk Follow @ThePostWest

Oracle Defends Relational DBs Against NoSQL Competitors prompted Does Oracle Really Understand NoSQL?, which was shared on LinkedIn and triggered a LinkedIn exchange in which I participated. The following comment is an adequate summary of the second article:

JN: It's an unfortunate bit of propaganda. Some truth mixed in with distractions and irrelevant comparison. I've met the DataStax team. They're smart people with a solid understanding of their space. I'm disappointed to see them mix good and bad information into something that looks like objective truth.

and the first is not much better. The interested reader can visit all three links. What I want to do here is amplify on some of my LinkedIn comments and add some.

DF: While this article is technically accurate (unlike many others these days), it misses many points. SQL and NoSQL people talk cross each other and the dialogue becomes a deaf conversation, hence a little boring. SQL people refuse to learn and understand NoSQL's requirements and NoSQL people usually have no clue how relational databases work. But put it more simply: why the heck are those guys even fighting? SQL and NoSQL don't share the same market space--even if it appears so. So relax, and enjoy your market segment...

Both SQL and NoSQL people think that SQL databases and DBMS's are relational--clear evidence that neither know and understand the relational model. Otherwise we would have had true RDBMS's, not SQL DBMS's and without those probably no NoSQL systems either. The latter owes a lot to the former.

Note: My colleague David McGoveran is currently writing a book completing, expanding and clarifying what he believes is the RDM that Codd had in mind and would have produced, had not illness intervened [1]. From what I've seen, practically nobody knows how true relational databases work. None of the current views of the RDM and interpretations does Codd's intentions justice. Watch for revisions of my papers that will reflect it.

DF: Relational model is the socialism of queries. NOSQL/JSON model is the wild capitalism of queries. In relational db the normalization of data ensures that all queries have equal chances--as a result, overall, they all perform equally (more or less bad). In NoSQL, data being denormalized, certain queries are certainly VIPs and have MUCH better chances then the rest of the queries. Not only that, but the rest of the unfortunate queries, are heavily penalized. The problem of NoSQL is to understand which queries will be accelerated via denormalization vs. which ones will be slowed down. And as a user, I can see use cases on both sides of the fence (relational, noSQL)--the question is to understand the pros and cons.

Points arising:

Anybody who associates normalization with performance does not understand the RDM and data fundamentals. "Which queries will be accelerated via denormalization vs. which ones will be slowed down" is an all too common example of logical-physical confusion (LPC). It's the physical implementation of both SQL DBMS's and databases, not the level of logical normalization, that determines performance.

Note: For this response I was accused of being a "fanatic unable to discuss abstract issues". Aside from the strange definition of a fanatic, I will leave it to the reader to consider the irony of this accusation. Incidentally, my colleague Erwin Smout comments: "Sneakily introducing the tacit suggestion that the "pros and cons" are on an equal level. Of course, without any mention of how those levels were measured, making the claim gratuitious and handwavy. I don't know the word for that technique, but DF seems to resort to it fairly often."

As I have argued more than once, the name notwithstanding, NoSQL is a rejection of the RDM, not of SQL [2]. In relational databases, further normalization (from 2NF-4NF to 5NF) and denormalization convert one set of logical R-tables to another set of logical R-tables. But NoSQL databases do not consist of R-tables and are, therefore, non-relational. What exactly does denormalization mean in this context?

Without the dual theoretical foundation of the RDM (FOPL & RA), what guarantees data integrity and logically correct query results? Denormalized databases are riddled with redundancy that, to prevent corruption, must be controlled. But the constraints necessary to control it induce the very same joins that denormalization intends to eliminate, defeating its purpose [3].

It is not by chance that Codd included "shared data banks" in the title of his first technical paper on the RDM, by which he meant multi-application, multiuser, multi-query/update databases with concurrent, different and varying patterns of data access. Hence his notion of data in their "natural structure", i.e., databases unbiased towards/against any of those, particularly the not yet envisioned. As its name indicates, a RDBMS optimizer optimizes the collective performance of all, rather than maximize that of some at the expense of others, that comes close to a fool's errand.

DF: Denormalization comes in various ways, as you should probably know by now after studying the database modeling for 35 years, and not all of them mean redundancy. There are several normal forms and some are of interest in my opinion and have nothing to do with data redundancy. And BTW, normalization is broader then only data dependency. 1st normal form has nothing to do with data dependency.

Indeed, there are five normal formal forms (six, if one counts 6NF). It is important to distinguish between normalization (to 1NF) and further normalization (to 5NF), both of which are relational design repair methodologies [3]. The only one that that does not involve redundancy and dependency theory is 1NF. According to the original and correct definition by Codd, a R-table is in 1NF (or normalized) if it is not defined on any Relation-valued domains (RVD) i.e., domains with relations as values [4]. In other words, 1NF is a property of R-tables in a relational database, not in NoSQL (why are rejecters of the RDM so eager to use relational terms?).

FH: In his original paper Codd specifically talks about *formatted data*, and provides a formal model whereby he separates the data from its implementation and the software that will eventually use it. We should select the *kind of database* based on the structure of the data that we intend to store, and what we intend to do with it, rather than the *kind of application* on hand ... it is foolhardy to say that a particular database model is inappropriate for a given application without mentioning the structure of the data to be stored, its intended use, volume of data, integrity/security issues, etc.

Precisely because Codd "separated the data from its implementation", by "formatted data" he meant logical, not storage structure. It is equally foolhardy to assume that data in any logical format necessarily "does not fit" (i.e., cannot be structured as relations that represent sets of facts). Database management is concerned with knowledge representation and knowledge consists of facts.

KC: One of the biggest significant distinction between graph and relational databases is the fact that the schema for a graph database is itself data that is available to the query. This makes it possible for a logical model to be provided that can be made explicit as a physical model, and it also makes it possible to provide conditional logic upon that model (and from that to its associated data) in ways that relational data models simply cannot handle.

There are places where each is needed. XML databases work very well for... tree adjoining grammars (TAGs) most closely match directed acyclic graphs. JSON databases work in those situations where you have a persistable document object model, something that is harder to model with relational databases ... SQL databases work very well with fixed structure, relational datasets and represent an optimization in that direction. Graph databases give you flexibility, but at the expensive of performance.

Points arising:

The schema for a relational database is documented in the catalog as data and is available for interrogation.

"...logical model to be provided that can be made explicit as a physical model" is precisely what physical data independence (PDI) does away with.

The RDM intentionally is set oriented and based on first order predicate logic (FOPL) and relational algebra (RA) for declarativity and simplicity. Graph database are based on higher order, more powerful logic, but are record oriented, procedural, more complex and allow self-reference, which introduces undecidability. A correct implementation of the hosting of a truly relational (not SQL) data sublanguage in a computationally complete programming language can achieve the power without loss of the relational advantages.

I am not aware of a formal, well defined and complete "document object model" in the sense in which RDM is a data model. Its absence is precisely what led XML standardization to drop the document for a "sequence" abstraction, characterized by the very prohibitive complexity that the RDM was intended to avoid [5].

SQL databases (which are not relational) do not have a "fixed structure" and as we know from pre-SQL experience with hierarchic and network databases, graph databases' prohibitive inflexibility and complexity led to their abandonment several decades ago in favor of SQL.

DF: God bless Codd. What he did is very useful and interesting, but I never agreed to take Codd as my God, and his word as my Gospel. There are other models and theories that are equally useful.

Oh, bollocks.

Note: Erwin: "That sneaky technique again. Works because this is what lots of people want to read. And those people surely won't wonder what objective measurement technique was used to assess the usefulness of all models (which ????) DF had in mind here".

References

[1] LOGIC FOR DATABASE FOLKS, forthcoming.

[2] Moving in Circles: SQL for NoSQL

[3] The Costly Illusion: Normalization, Integrity & performance

[4] Interpreting Codd: 1NF in Theory and Practice

[5] Documents and Databases

POSTS

Sunday, January 3, 2016

NoSQL and SQL: A Plague on Both Their Houses

No comments:

Post a Comment