Sunday, April 26, 2015

Comments on Stonebraker Interview

Revised: 12/2017

Interviewed about his Turing Award, Michael Stonebraker is "modest" about his jointly-with-others contribution:

"... the Ingres database [sic] brought Codd’s lofty relational ideas into the realm of ordinary individuals ... turned [them] into constructs that could be manipulated by ordinary people ... it was argued at the time that RDBMS couldn’t perform, but we showed it could be efficient."
and gives most of the credit to "Ted" Codd:
"What Ted proposed was radical ... a complete change from how things were being done in database [sic] ... he turned the problem of data management into one of relations. That dramatically simplified things ... The conventional wisdom was that you should build for the particulars of how the data is stored. He saw that made no sense ... he [moved] the actual manipulation of data away from assembly language programming of the time to higher levels of abstraction that would later become structured query language, or SQL ... He brought principles of encapsulation and abstraction to programming databases, like with a high-level-language in programming."
Quite. Except that Ted was vehemently critical of SQL as a botched concretization of the RDM which, as it turned out, ensured that his ideas would never be truly and fully implemented (one of which, incidentally, was a relational declarative data sublanguage that would replace programming for data management DBMS functions). On the one hand SQL, whatever its flaws, was much superior to the database technologies that preceded it; on the other it has been forever identified with the RDM, to the point where the chance for true RDBMSs was lost (the assembly language statement is not quite accurate -- COBOL, FORTRAN and special purpose languages were used at the time -- assembly language was used for writing access methods at the I/O level, but even that wasn't pure).
Be that as it may, now that MS finally deems Oracle’s database [sic], IBM’s DB2, and Microsoft’s SQL Server obsolete "legacy code", you would expect at least some of his criticism to focus on their poor relational fidelity and on how his own "fundamental research into database theory" leads to true RDBMSs that would have made Ted proud, particularly as he was one of the authors of the Ingres DBMS's QUEL data language which, in some ways, was relationally superior to SQL.

But you get neither (with the caveat that I know from personal experience what happens when you humor journalists by giving interviews -- what is published often bears little resemblance to what was actually said -- although, based on his past statements, this does not seem to be a major problem here).

"A third of the database market -- Oracle and SQL Server and DB2 -- is legacy code that will be replaced by things such as VoltDB ... there is two orders of magnitude performance difference to be had ... sooner or later that will be significant ... if you want to do 50 transactions per second, it doesn’t matter what technology you use, you can use whatever you want. But if you want to run 50,000 transactions per second, your current implementation is simply not going to do it. Sooner or later, you are going to be up against a technology wall that will force you to move to new technology, and it will be completely based on return on investment."
50 trans/sec was once considered a challenge to relational OLTP (I am not sure how Codd would have taken that). I am the last to defend SQL and, to the extent that performance problems exist, criticism is, of course, fair game. However, given the state of foundation -- and particularly RDM -- knowledge in the industry, the Turing prestige gives one the opportunity and responsibility to at least make clear that many of SQL products' deficiencies -- as important as performance, if not more so -- are due to their poor relational fidelity (and performance is not excluded). Don't get me wrong: performance is terribly important. One does not need to spend years, or even months, in the IT industry in general and the database field in particular to realize that it is as close to the 'be all and end all' in database management as you can get. So, understandably, MS designed and promoted his products as superior performers -- that is what you do as a vendor. But as a scientist and educator who appreciates the value of Codd's contribution, should the performance and promotion of one's products be the exclusive focus? I was always very suspicious of merging science with commerce, and the current utter corruption of academia (science, research and education) by commercial interests proves me right. The two are not compatible and science cannot win.
"Another third of the market, focused on “data warehousing,” is moving from row-stores to “column stores,” which can be far more efficient."
The concept of data independence and, specifically, physical independence (PI) was from the very start a core objective of the RDM -- a diversity of storage and access methods satisfying a broad range of applications, with the applications insulated from the details of those methods and from changes in them -- which SQL authors and implementers failed to support. No SQL DBMS gave any ability to change the physical store, most let it affect logical issues, and all implemented a relation as a "row-store" -- the direct image representation. Many even insisted that the RDM required row-images as records, and that nonsense has carried forward to the present day.

Columnar storage -- the idea is to store domain values and a bit map of them to represent the relation physically, constructing the logical relation for all users on the fly -- is nothing new: products actually existed at least as early as 1989. Be that as it may, it is not the columnar storage per se that's important, but that it is one of many storage options and the RDM gives implementors freedom to implement multiple such options and change them transparently whenever necessary for performance! This is what PI is and when it comes to SQL DBMS performance, this is what MS should stress!
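To make the point about physical independence concrete, here is a minimal Python sketch. It is not how any actual product works, and every name in it (`RowStore`, `BitmapColumnStore`, `select`) is hypothetical. It assumes a relation small enough to hold in memory and uses Python integers as bit maps. The same logical relation is kept in two physical forms, and a restriction written only against the logical relation gives the same result over either.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[str, ...]


class RowStore:
    """Direct-image storage: each tuple is kept as one record."""

    def __init__(self, attributes: Iterable[str], rows: Iterable[Row]) -> None:
        self.attributes = tuple(attributes)
        self._records: List[Row] = [tuple(r) for r in rows]

    def relation(self) -> Set[Row]:
        # The logical relation is simply the set of stored records.
        return set(self._records)


class BitmapColumnStore:
    """Columnar storage: per attribute, each distinct value maps to a bit map
    (a Python int) whose bit i is set when stored row i carries that value."""

    def __init__(self, attributes: Iterable[str], rows: Iterable[Row]) -> None:
        self.attributes = tuple(attributes)
        stored = [tuple(r) for r in rows]
        self._count = len(stored)
        self._columns: List[Dict[str, int]] = []
        for pos in range(len(self.attributes)):
            bitmaps: Dict[str, int] = {}
            for i, row in enumerate(stored):
                bitmaps[row[pos]] = bitmaps.get(row[pos], 0) | (1 << i)
            self._columns.append(bitmaps)

    def relation(self) -> Set[Row]:
        # Rebuild the logical relation on the fly from values and bit maps.
        rebuilt = set()
        for i in range(self._count):
            rebuilt.add(tuple(
                next(v for v, bits in column.items() if (bits >> i) & 1)
                for column in self._columns
            ))
        return rebuilt


def select(store, attribute: str, value: str) -> Set[Row]:
    """A logical restriction that knows nothing about the physical layout."""
    pos = store.attributes.index(attribute)
    return {t for t in store.relation() if t[pos] == value}


rows = [("e1", "sales"), ("e2", "eng"), ("e3", "sales")]
by_row = RowStore(("emp", "dept"), rows)
by_column = BitmapColumnStore(("emp", "dept"), rows)
same_result = select(by_row, "dept", "sales") == select(by_column, "dept", "sales")
```

A real implementation would evaluate the restriction directly on the bit maps for speed; the sketch only shows that the choice of storage need not be visible at the logical level.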

"A third is “everything else” ... 100 or more of these NoSQL companies ... “NoSQL” databases [sic] and Hadoop ... started out, NoSQL meant, ‘Not SQL,’ then it became ‘Not only SQL,’ and now I think it means “Not-yet-SQL" ... NoSQL proposes low-level languages, and they are betting against the compiler, and that’s an incredibly dangerous thing to do ... Hadoop, it will take on SQL aspects and merge with data warehousing ... Cloudera released the Impala system ... a SQL column-store engine. MapReduce is nowhere to be found ... The historical Hadooop stack was Hive on top of MapReduce, on top of HDFS ... MapReduce will atrophy and be replaced by SQL ... Hadoop will look like the data warehouse market, and NoSQL will look like the SQL market."
Welcome to the past. The situation is almost identical to the one preceding the RDM, which was intended to get rid of it. The failure to learn from past mistakes and the disregard for the theoretical foundation of the RDM -- not just poor SQL performance -- are main inhibitors of progress in database management, and not just because of "betting against the compiler". A relational proponent with MS's longevity should remind the industry that those who forget the past are doomed to repeat it.

Codd realized, for example, that documents mix data with presentation, producing complexity that he strived to eschew. Here, for example, are some observations I posted last week on DocumentDB, Microsoft's NoSQL product:

  • Polyglot persistence: Wasn't this a problem the RDM was supposed to address?
  • Hierarchy: Ditto.
  • NoSQL: No SQL, but a "SQL-like" language (now it's used for what it was supposed to be eliminating).
  • No integrity or data independence.
  • Cloud: At least mainframes were under each company's control.
"NoSQL guys will drift toward looking at SQL ... They will move to higher-level languages and the only game in town is SQL. VoltDB and other approaches can fix the problems brought about by legacy RDBMSs."
This has already happened, but the very problems that Codd tried to avoid with the RDM are also coming home to roost [1,2,3]. Besides, SQL and the DBMSs based on it are not relational (the average data professional is unaware of that, but MS knows it and should not reinforce the misconceptions and ignorance), and even with the little relational capability it has, applying SQL to utterly non-relational data structures is simply nonsense. The future of database management is not promising.
"Facebook is one giant social graph, with the problem of how to find the average distance from anyone to anyone ... implemented on [thousands of] MySQL instances ... You can simulate a graph as an edge matrix, a connectivity matrix in an array-based system, or a table system, or you build a special-purpose engine to implement the graph directly. All three are being prototyped and commercialized, and the jury is out whether there is room for a new graph engine ... the answer to graph problems is it will be done by either an array or a table DBMS."
David McGoveran points out that graph databases tend to involve lots of relationships and few instances -- like a relational database in which every relation has only a few tuples and many multi-relation constraints, and with every access those constraints are likely to change and new relations are likely to be defined. The "graph data model" (to the extent that there is one in the sense that the RDM is) requires a computationally complete language (CCL) based on a logic of higher order than the RDM's first order predicate logic (FOPL), and such higher-order logic is prone to undecidability. The relational algebra (RA) avoids this by being intentionally less powerful than CCLs. One consequence is the inability to compute transitive closure (TC), a trivial graph theory problem, directly in the algebra. The RDM can handle an important subset of graph theory, but it requires either special graph operators as domain operators (which might be made efficient, but it's a very difficult problem) or doing certain graph operations in a host CCL. Be that as it may, not everybody is Facebook and, quite characteristic of industry fads, if Facebook implements a graph system everybody will emulate them, even if they don't know what TC is, regressing to the good old IMS and Codasyl days. Those who forget the past...
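The TC point can be illustrated with a minimal Python sketch, in which Python stands in for the host CCL. It assumes the graph is held as a binary relation of (source, target) pairs; the function names `compose` and `transitive_closure` are hypothetical. Each step inside the loop is an ordinary join followed by a projection, which the algebra can express. What the algebra lacks is the "repeat until nothing new appears" loop, and that is the part supplied by the host language.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def compose(r: Set[Edge], s: Set[Edge]) -> Set[Edge]:
    """Join r and s where r's target equals s's source, then project
    the outer attributes: one relational join-and-project step."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Iterate the join step to a fixpoint. The number of iterations
    depends on the data, which is why a loop in the host language is needed."""
    closure = set(edges)
    while True:
        new_pairs = compose(closure, edges) - closure
        if not new_pairs:
            return closure
        closure |= new_pairs


edges = {("a", "b"), ("b", "c"), ("c", "d")}
reachable = transitive_closure(edges)
```

For this three-edge chain, `reachable` also contains the derived pairs ("a", "c"), ("b", "d") and ("a", "d").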

Codd, of course, was a Turing recipient himself, "for his fundamental and continuing contributions to the theory and practice of database management systems". He went against the grain, as MS recognizes. MS got his award "for fundamental contributions to the concepts and practices underlying modern database systems". The Turing gives him an excellent opportunity to follow in Codd's steps, go against the grain and promote true relational technology as both a theoretically sound and practical database foundation for a vast range of application needs. But in his interview he spends almost no time on the RDM and relational fidelity, or on correcting -- rather than reinforcing -- misconceptions about them. Pity.


[1] Pascal, F., The SQL and NoSQL Effects: Will They Ever Learn?

[2] Pascal, F., Thinking Logically: SQL, NoSQL and the Relational Model

[3] Pascal, F., NoSQL and SQL: A Plague on Both Their Houses
