Monday, October 29, 2012

The Cloud and SQL



In Why SQL Is Not Suited for the Cloud and What Is NuoDB, one Aswin writes:
In the data world, SQL is a relation-based, data model, and the most common means used to retrieve and manipulate data. Predicates, clauses, expressions and queries that are all used in SQL data management system have made it abundantly popular and user friendly. But this traditional data management system suffers serious of limitations...


A data language is not a data model, it is based on, or concretizes one. I agree, of course, that SQL has problems and that many of them stem from its data model. But, as the author reveals, SQL's data model is confused with the relational data model and that is, of course, a major mistake. While SQL has some relational roots, its data model is not relation-based--which would require full and correct support of set-processing and two-valued logic. Rather, SQL supports bag-processing and a poorly implemented three-valued logic, among other violations of relational principles.

The implication is that SQL's problems do not stem from its being traditional, but rather from failing to implement the relational model (and poor language design). I would, therefore, advise caution using the term "traditional" in the context of this confusion. It is often interpreted to mean obsolescence of relational technology and since the relational model is predicate logic and set theory applied to database management, this is tantamount to declaring logic and mathematics obsolete. I keep asking those using the term in support of "new" database technologies--which more often than not are actually older--what do those technologies replace logic and math with and I have yet to get an answer. In fact, they are unaware that a data model must be explicitly defined for data management

The author lists several purported SQL-based product limitations (numbered for convenience):

1. Process becomes tedious when large volume of data is used.
2. Often incorrect SQLs are created for multi table data graph.
3. Data repetition occurs that occupies much of your storage.
4. Cannot use DML operations.
5. The absence of table makes the view inactive.
6. This traditional data management system uses much of resources.
7. You don’t get the feature of elasticity.
8. Needs considerable growth in the area of geo-distribution.

Points arising:

1. I assume "tedious process" means slow performance here. This is a trivial statement at best--more data takes more time--and inaccurate at worst--poor performance is more often than not due to incorrect or poor implementations use of products and SQL, rather than the technology itself.

2. I guess the author complains here about poor SQL support of hierarchies (graphs). I sympathize, because as a relational proponent I have been arguing for a long time that the relational model can support tree structures well and it's just that SQL and its implementations that do not. But we must be careful here not to infer from this that the solution is a regress back to hierarchic DBMSs that we discarded decades ago because they proved too complex, rigid and costly. We should keep in mind that while true RDBMSs can handle hierarchies in the real world, hierarchic DBMSs force hierarchies on all reality, whether it is inherently so or not.

Incidentally, Aswin seems bothered by the fact that a tree structure is represented by multiple R-tables. He probably was not around to experience IBM's IMS and compare it with relational multi-tables.

3. It is not clear to me what kind of "repetition"--the correct term is redundancy--is meant here. First, one of the core objectives of relational design--fully normalized databases--is precisely to eliminate unnecessary redundancies (see The Costly Illusion: Normalization, Integrity and Performance). Second, the purpose of keys is to prevent another type of unnecessary redundancy: duplicate rows. Third, if Aswin means foreign keys, they are neither redundancy, nor unnecessary. Fourth, if he means indexes, a correct implementation of a RDBMS may well do away with them (e.g. the one invented by (now defunct) Required Technology enabled just that). Fifth, if he means data warehouses and data marts, they are due to poor design practices (see Is there a 3NF to denormalized tool), which I have long criticized. Storage is hardly the problem here, as larger storage devices get increasingly cheaper. (backups and snapshots are examples of necessary redundancy).

4. I am wondering if the author knows what DML means. In a SQL context it means Data Manipulation Language or, in other words, querying based on the operations of relational algebra. So what exactly prevents them in SQL, which would render it useless?

5. If I understand this correctly, it is ridiculous. Views are the relational mechanism for logical data independence; they are essentially queries that define "windows" into base R-tables, through which applications can access the data in the latter. In this way the relational model was the first and only database approach to offer logical data independence: preventing changes in logical database design from disrupting applications and imposing an application maintenance burden. But, of course, if database data that is accessed via those windows is deleted, there is no way in which any DBMS can preserve access to that data. Non-SQL products have no views, so any and all data loss alterations invalidate applications. Is that better?

6. This does not warrant a comment.

7. By 'elasticity' Aswin probably means schema flexibility. That's bunk and I refer the reader to my several posts on NoSQL.

8. Geo-distribution is part and parcel of physical implementation. In fact, it was demonstrated decades ago that a true RDBMS would facilitate database and/or DBMS distribution (two different things), so in this sense SQL products' poor relational fidelity contributed to their weak distributability. However, it also became clear after several attempts by vendors to implement distributed database management that the real difficulties lie in human, organizational and political factors, not the database technology per se. I don't think that proponents of NoSQL really understand what they're getting, and they will soon find out. But then, that's standard operating procedure in the industry: somebody will come up with yet another ad-hoc "new technology" to clean up the mess.

But what has all this to do with the Cloud and why does Aswin think that SQL "is not suited for it"?
SQL is more than 30 years old and it was not designed with the intent of remote, or “cloud” computing. It was solely built to leverage server architecture. Cloud-based computing requires things to be easily added and removed to meet huge demands, and it’s a bit too complex in these traditional database designs.
As I already explained, that's not historically true. Not much later than Codd's invention of the relational model and SQL implementations serious attempts were made to develop distributed databases and DBMSs. In fact, RTI, the vendor of Ingres, a DBMS which was based on QUEL, a data language that had some advantages over SQL, developed a distributed DBMS version of the product, but as far as I know not technological factors inhibited its adoption, but rather those I mentioned above.
"Enter, NoSQL" writes Aswin, and given the general lack of data fundamentals, including the history of the field, its proponents will rediscover them at some point. I am fairly certain that if RTI's technical staff, with much better such knowledge and understanding, did not appreciate those problems, it will take more time for NoSQL proponents to realize them.
When we deal with a huge quantity of data we can’t always manipulate such large data with the traditional structured query language and hence forth the “Not Only SQL” came into existence. NoSQL mainly concentrates on retrieval, not on the relation, between data. Big companies that deal with huge quantities of data such as Google and Amazon started using these noSQL databases. The cloud-friendly way to approach large data is noSQL. For key value storage, big tables, huge document storage, graphical databases, peer stores, the XML database, noSQL design is the more suited one.
First, data volume does not affect or determine suitability of a purely logical data model or language. Both "small" and "BigData" can be structured at will and queried by the same data language. What counts is the informational purpose. SQL deficiencies notwithstanding, for inferencing needs of the kind that RDBMSs fulfill, NoSQL is certainly worse, not better.

Aswin admits that "NoSQL does [sic: do, there are many disparate products under the NoSQLumbrella] have its [their] own set of limitations. NoSQL fails in giving ACID guarantees".

Which together with lack of integrity raises the question to what extent these products are really DBMSs. They supposedly ensure "eventual consistency" and, therefore, require tolerance of information that is either not up to date, or inconsistent. But it surely explains how these products manage to perform and distribute better: they give up the most critical function that DBMSs were invented to provide. It is essentially the same principle underlying denormalization: trading integrity for performance.
But "NoSQL mainly concentrates on retrieval, not on the relation between data" [sic]--and mostly text messages, exchange threads and the like, at that--so guaranteed immediate consistency is supposedly not a critical need. Which validates my point that is the nature of the data and informational needs that are at issue here, not the size of the data or the Cloud.
NoSQL has other limitations, as well. The JOIN clause which helps in combining 2 or more tables from 1 or more database is not available in these NoSQL format databases. Additionally, a lot of skill and effort are required to maintain these databases. Case in point, it’s been more than 25 years since the concept came into existence, but finding a skilled professional to maintain a NoSQL database is very difficult.
No kidding! I wonder why.

The description of NuoDB is such that I won't bother with it. It's precisely what is expected from a failure to know and understand what a well-defined and sound data model is and how crucial it is.


No comments:

Post a Comment

View My Stats