Sunday, November 15, 2015

Moving in Circles: SQL for NoSQL



We've been there, done that.  

In Coming Full Circle: Why SQL now powers the NoSQL Craze Ryan Betts, CTO at VoltDB, argues that NoSQL products should adopt SQL for queries. I don't know about you, but to me it looks like a contradiction. Let me make it clear that my intention here is neither to defend SQL, nor to criticize it--I sure have done enough of that during the years--but rather strictly debunk the notion that its use with NoSQL systems is a good idea.

For decades, SQL was the established language used to interact with databases. Then data management requirements necessitated by cloud environments and big data use cases led to new systems. These new systems, collectively termed “NoSQL,” focused first on scalability and availability. They often sacrificed the interactivity and queryability of SQL as a false shortcut to scale.
SQL was not created for interaction with any databases. At least in intent it was to be a set-oriented relational data language. Unfortunately, poor grasp and appreciation of the RDM produced a data language that was worse than it could and should have been, both relationally and as a language. Be that as it may, it sure was not intended for anything other than relational databases.
Giving up SQL was not a necessary trade-off. In fact, this design choice created systems that were harder to use, narrowed their use case profile, and forced users to write complex programs to replicate what would have been simple SQL statements. For example, if you talk to enterprises that were early Hadoop adopters, many complain that using the new technology required the expertise of exclusive, highly skilled programmers, while legacy technologies that were SQL based were more accessible to analysts, data scientists, managers and marketers. The short-term choice to give up SQL caused a lot of pain for users, led to incompatible and proprietary query languages, and in some cases resulted in systems that were severely limited by weak interactivity and queryability.
To the extent that SQL systems are, relatively, easier to use than their predecessors, it is not just due to SQL per se. Rather, SQL takes (albeit, partial) advantage of a simple and well-defined formal structure to exploit the power of the dual theoretical foundation of the RDM--first order predicate logic (FOPL) and set theory. In fact, the effects of the structure and theory are so robust, that even the poor relational fidelity of SQL and the complexification of databases denormalized for performance" could not rob SQL systems from their power and ease of use relative to alternatives. Take that structure away--which is exactly what NoSQL systems did--and, SQL or no SQL (pun intended), you lose that combination of power and simplicity, including the guarantee of provably logically correct query results. The problematics of NoSQL systems that Betts mentions were, thus, unavoidable and actually predicted by relational proponents. Codd invented the RDM precisely to avoid them.
Understanding the modern database landscape is difficult. There are a large variety of systems, many vendors, and many different and overlapping technologies. It is interesting, though, that across this landscape, almost all have learned that SQL is table stakes. The original NoSQL systems posited that SQL didn’t scale, that SQL didn’t work without relational (table-based) schema, and that SQL interfaces were inappropriate to “modern” use cases. All of these claims have been debunked. Users are driving vendors to embrace SQL. We now see NoSQL vendors advertising their SQL capabilities as competitive differentiators. It is easy to predict that over the next 12 to 18 months we will begin to see marketing literature comparing SQL compatibility across different NoSQL systems. Try explaining that one to your manager.
These were also the circumstances preceding the RDM and SQL and why the consequences are similar. In fact, just as SQL is promoted now for non-relational NoSQL databases, a SQL interface  to non-relational hierarchic and network databases was attempted (efforts notwithstanding, it failed). Then, as now, the importance and value of the RDM was missed and dismissed. Consider: Every SQL update can be expressed as a query that return the after-update table. It follows that the update anomalies that result from updates to an unnormalized database may affect queries as well, yielding misleading data. So what then "SQL works without relational (table-based) schema" mean? SQL access to a "normalized" big data mashup (like Google's databases), or a "unstructured schema-less" blob is, at best, something akin to accessing an collection of unnormalized tables. It may be easy, may scale, etc. but results are ambiguous at best, or meaningless as often as not. Unless a relational language applies relational operations to sets, say goodbye to the guarantee of provably logically correct and sensible results (and I am not even mentioning data integrity).
“The system must provide full SQL query support and other functionality users expect from a SQL database. Features like indexes and ad hoc query are not just nice to have, but absolute requirements for our business.” (Google F1 paper). Google implemented a SQL system (F1) to manage its core adwords business. From 2012, F1 has been stated to manage 100s of TB and processes 100,000s of queries per second--a clear demonstration that SQL scales and that SQL features are critical to satisfying business users.
Oh, wow, Google has discovered that ad-hoc querying is an essential database function--about time. At a heavy price they seem to have discovered that many of the "SQL can't" claims are false, but I doubt that have yet realized that those that aren't are due to SQL's own poor relational fidelity and commercial implementations (which is, by the way, where indexes belong).
The headline feature of Couchbase 4.0, long a vocal proponent of NoSQL, is a SQL interface. Couchbase brands its version of SQL “N1QL.” But N1QL is nearly indistinguishable from SQL. Couchbase touts SQL benefits alongside its NoSQL messaging: N1QL enables you to query a document database without limitations and without compromise – sort, filter, transform, group, and combine data with a query. That’s right. You can combine data from multiple documents with a JOIN. That flexible data model you were promised? This is it. You’re no longer limited to “single table” and “table per query” data models.

Apache Cassandra is another prominent “NoSQL” system that has been pushed to deliver SQL as the primary database interface. The project brands its flavor of SQL “CQL” but goes to considerable length to explain that CQL is really just SQL. “CQL v3 offers a model very close to SQL in the sense that data is put in tables containing rows of columns. For that reason, when used in this document, these terms (tables, rows and columns) have the same definition than (sic) they have in SQL.”
In the 70's non-relational DBMS names were suffixed with /R and presto, they became relational. The QL suffix is not different. One can consider CQL and N1QL "versions of SQL" only based strictly on syntax (CQL--This is not the SQL you are loooking for). But that's not where SQL ease of use and power, such as they are, come from.  (Incidentally, Betts uses data model but means logical model).
Google first birthed the MapReduce model into the marketplace. Behind the scenes the company quickly realized the shortcomings of MapReduce as a query model and built a large-scale analytic SQL-based system named “Dremel.” Dremel maps a SQL query model onto a document-based storage model at massive scale. Again, Google proves that SQL scales across volume, velocity and problem space. The Hadoop ecosystem has taken up the path, launching multiple SQL-on-Hadoop projects, including Impala, Presto, SparkSQL and Drill.
Had implementers understood the relational concept of physical data independence (PDI), they would have not required any proof and would have saved the cost of big experiments. SQL scales in the sense that its syntax is more or less independent of the amount of data being processed. More importantly, relational proponents have for years explained that there is nothing to prevent properly implemented truly relational systems from supporting the needs that have been claimed to be impossible relationally. This was ignored (at a heavy cost for everybody) and still with the wrong conclusion: it's not SQL, but the RDM that counts.  Inability to learn from mistakes is a serious progress inhibitor.
Why SQL? Data management systems must make data interactive. Interactivity is core to deriving value from the data that is being stored. All of the critical design choices made by database designers are motivated by interactivity. Vendors that put other priorities first are quickly realizing the corner they’ve backed into and are grafting SQL query processors into their core products.
It's better I don't comment on this.

"If you have to support queries, why use SQL?" Betts asks.

  • SQL is a Standard: For relational databases;
  • SQL is Declarative: Well, more or less, but this is due to its partial reliance on FOPL, which is equivalent to set theory;
  • SQL Scales: As syntax, it does by definition; it's its implementations that don't;
  • SQL is Flexible: To the extent it is, it is due to whatever relational fidelity it has, but it's not flexible enough to cover up the problems of NoSQL, no matter how much those products stretch the terminology to appear relational.
NoSQL is not about no SQL and not even about not only SQL, it's about no RDM. No matter what suffixes are attached to product , as long as disregard for data and relational fundamentals continues, there will be movement in circles.

 

Note: Thanks to David McGoveran for reviewing a draft of this post and adding an important point.

No comments:

Post a Comment

View My Stats