Saturday, September 22, 2012

Davide and David on NoSQL

After publishing my 3-parter on NoSQL I came across a post by Davide Mauri and an interview with David McGoveran on the subject, to which I would like to add some clarifications (this is not a debunking, as they hardly justify such).

Mauri asks: "Do NoSQL people really want to drop the relational model?"

The answer, of course, is yes, as I demonstrated in my article.
First important thing that immediately shows up, is that there a lot of confusion  between the Relational Model – the mathematical fundaments of RDBMS – and the physical implementation of it, which is the RDBMS itself. I’m sure this is nothing new, but it’s an important point to keep in mind. This is the key to understand the NoSQL idea, and to give answer to it.
This confusion exists because practitioners don't know and understand data fundamentals. When they say "Don't bother me with all this theory, I have practical things to do", what they mean by theory is "not product-specific". So if your perspective is exclusively product oriented, you are unaware of levels of representation--conceptual, logical and physical--or of the distinction between model and implementation. This, of course, is the source of the common logical-physical confusion.
Developers complain that RDBMS does not scale. While, for my experience, this is not totally true, I can pretend to agree, but even in this case, the problem is in the implementation not in the model. Is not that leaving the relational model behind will make databases more scalable.
Davide means, of course, SQL DBMSs. The relational model is logic and mathematics applied to database management. How can logic and mathematics have anything to do with scalability, which is a physical implementation issue? In fact, via physical data independence--the detachment of the logical representation from physical implementation details--the relational model gives implementers complete freedom to do physically whatever they darn please to optimize performance, as long as they don't expose it to users in applications.
Other complaints regard the fact that SQL is too verbose and quite frustrating sometimes because ask you to write too much “ugly” code. I agree with that. Writing queries to join many table on many columns is just boring. But again I don’t really see how dropping the relational model could help on this side.
I am not going to defend SQL. Oracle and IBM did serious damage when they decided to release what was an IBM research prototype query language in commercial products. It is neither a truly/fully relational data language, nor well designed as a language. Be that as it may, there is the issue of poor understanding of SQL and its misuse,  like the string manipulation in a previous To Laugh or Cry? post, which is also related to lack of foundation knowledge.

In general SQL is learned syntactically, without much semantic context: "this is the SELECT clause, this is the FROM clause, this is the WHERE" clause" and so on. In the absence of relational context, SQL statements are written with blinders on. If, on the other hand, you know that the SELECT clause is projection, the WHERE clause is restrict and the FROM clause with more than one R-table is Cartesian product, you understand both the semantics underlying SQL and its limitations: it is syntax intended to express a join implicitly, without actually writing JOIN. There are relational operations, however, that could not be expressed implicitly with the SELECT FROM WHERE syntax--e.g. UNION--and the rest of relational operations were added later explicitly anyway, including JOIN, making SQL a verbose, very redundant language, much harder to optimize.

So the problem with SQL is not that it's relational, but that it is not and is poorly designed to boot. And that is due to the fact that neither vendors nor users know or understand sufficiently the relational model.
One last big complaint is that if a developer has an object – an instantiated class – and he only needs to persist it, he can’t. He has to map object attributes to tables and columns, which is, again, time consuming and boring. And in addition this doesn’t offer optimal performance since such mapping, of course, consumes resources. I agree also on this. But I know that in “The Third Manifesto” this problem is totally solved, just making RDBMS aware of Object-Oriented principles. Now, if Microsoft can just integrate more and more SQL Server with .NET, the problem will be solved. I know that Oracle is way more integrated with Java, so we can ask for the same thing. But, hey, even here I don’t see how eradicating the relational model can help. On the contrary the proposed improvement of it can make our life easier and still safer (in terms of data quality).
I get uncomfortable when objects come into play in database management. First, object language is fuzzy. Second, object orientation is at best a programming/application development paradigm and has little to offer database management, except perhaps type theory, which is orthogonal to the relational model (and is what Davide refers to having been solved in THE THIRD MANIFESTO). There is also a significant difference regarding encapsulation: where OO wants it, relational does not. In a comment to my previous post Object Orientation, Logic and Database Management, Eric Kaun refers to a paper on the formal foundations of OO, but the paper explicitly refers to programming and type theory; as far as I know, there is no explicit, formal OO data model, in large part precisely because it lacks non-encapsulating relations. But I am willing to be persuaded otherwise by evidence.
So, at the end, my idea is that NoSQL doesn’t really mean “Throw relational databases away”, but more something like “Give me correct implementation of the relational model, make it capable to handle objects (in Object Oriented Principles way), make SQL less verbose and more integrated with .NET and give me something that I can scale automatically.”
Davide's conclusion is a valid one by somebody appreciating data fundamentals, even if significant care must be exercised insofar as what the "handling of objects" means. But a careful inspection reveals that that is not what the NoSQL is about: they are clearly anti-relational. While it is true that the NoSQL "movement" confuses SQL with relational, the problem is more fundamental than what the name implies and here I will refer to David McGoveran's interview which, I think, is consistent with my own article. I have highlighted what I consider critical.
The NoSQL community can be seen as an outgrowth of the anti-relational movement that came about with object-oriented programming. They share lots of the same issues and concerns ... The NoSQL community is concerned with being able to cheaply (e.g., with little up-front design effort) scale up to thousands of servers and nodes, perhaps geographically distributed, or thousands of control processors, often maintaining some partitioning or “sharding” of the data. Support for MapReduce algorithms is often considered essential, but its proper use can be tricky ... Some DBMS solutions that would be classed as NoSQL allow programmers to access the data stored in a way that will best meet their application needs [i.e. application-biased databases]. Their physical organizations are application-specific. Among them are graph (linked) databases, text databases, columnar databases, stream-oriented databases, and key-value (aka associative) databases ... I suggest considering a NoSQL solution if any of the following are true:

First, when discovery of relationships is more important than consistent processing and specific data results.
Second, if the data processing is meant to be inductive (e.g., suggestive) rather than deductive (i.e., precise).
Third, when the application is changing very fast, data complexity is great (variety or amount).
Fourth, if physical issues, like big data or a high degree of parallelism, are more crucial than data integrity. You must be willing to throw away data consistency in favor of performance and scalability.
Fifth, if you have a mission-critical one-off application for which a fixed data organization is ideal, in which case the costs and risks may be lower than licensing a vendor’s RDBMS or trying to force an open-source RDBMS to fit the need.
The core concept in database management is organization/structure, which implies upfront design. That is what produces the benefits of consistency, logical inferencing and precise, provably correct answers to queries among many others. The context of discovery for which NoSQL products are proposed solutions serves different informational purposes. As long as the distinction is properly understood, it's fine . But, unfortunately, I don't think it is (remember my comment above that the industry extends anything to everything, which is the source of fads?).

First, there is the illusion that by skipping design effort upfront you will still somehow get the same benefits of relational database management, not to mention that NoSQL products are promoted as "superior to and replacements not of just SQL products, but of relational technology". Second, what do you do with your discoveries, once you got them? How do you exploit them with NoSQL and at what cost? (see my post on Why I Migrated from MongoDB).

And third, I still ask: what does physical distribution and performance has to do with the data model?

No comments:

Post a Comment

View My Stats