Sunday, April 26, 2015

Comments on Stonebraker Interview (UPDATED)

UPDATE: My paraphrasing of David McGoveran was not entirely accurate and the paragraph was revised.

Interviewed about his Turing Award, Michael Stonebraker is "modest" about his jointly-with-others contribution:
... the Ingres database [sic] brought Codd’s lofty relational ideas into the realm of ordinary individuals ... turned [them] into constructs that could be manipulated by ordinary people ... it was argued at the time that RDBMS couldn’t perform, but we showed it could be efficient.
He gives most of the credit to "Ted" Codd:
What Ted proposed was radical ... a complete change from how things were being done in database [sic] ... he turned the problem of data management into one of relations. That dramatically simplified things ... The conventional wisdom was that you should build for the particulars of how the data is stored. He saw that made no sense ... he [moved] the actual manipulation of data away from assembly language programming of the time to higher levels of abstraction that would later become structured query language, or SQL ... He brought principles of encapsulation and abstraction to programming databases, like with a high-level-language in programming.

Quite. Except that Ted was vehemently critical of SQL as a botched concretization of the RDM which, as it turned out, ensured that his ideas were never truly and fully implemented (one of which was to eliminate programming altogether, by substituting a declarative data sublanguage).

On the one hand SQL, whatever its flaws, was much superior to what preceded it and on the other it has been forever identified with the RDM, to the point where the chance for a true implementation was lost. (The "assembly language" statement is not quite accurate--COBOL, FORTRAN and special purpose languages were used at the time. Assembly language was used for writing access methods at the I/O level, but even that wasn't pure.)

Be that as it may, now that MS finally deems Oracle’s database [sic], IBM’s DB2, and Microsoft’s SQL Server obsolete "legacy code", you would expect at least some of the criticism to focus on their poor relational fidelity and how exactly his "fundamental research into database theory" leads to TRDBMS's that would have made Ted proud. Particularly as an author of Ingres's QUEL which, in some ways, was relationally superior to SQL.

But that's not the impression you get (with the caveat that I know from personal experience what happens when you humor journalists by giving interviews--what is published often has little resemblance to what was actually said--but based on his past statements, this does not seem to be a major problem here).

A third of the database market--Oracle and SQL Server and DB2--is legacy code that will be replaced by things such as VoltDB ... there is two orders of magnitude performance difference to be had ... sooner or later that will be significant ... if you want to do 50 transactions per second, it doesn’t matter what technology you use, you can use whatever you want. But if you want to run 50,000 transactions per second, your current implementation is simply not going to do it. Sooner or later, you are going to be up against a technology wall that will force you to move to new technology, and it will be completely based on return on investment.
50 trans/sec was once considered a challenge to relational OLTP by IMS. I am the last to defend SQL and, to the extent that performance problems exist, criticism is, of course, fair game. However, given the state of foundation knowledge and particularly of the RDM in the industry, the Turing prestige gives one the opportunity and responsibility to at least make clear that many of SQL products' deficiencies--which are as important as performance, if not more so, and which include performance itself--are due to their poor relational fidelity.

I am not sure how Codd would have taken "if you want to do 50 transactions per second, it doesn’t matter what technology you use, you can use whatever you want." Don't get me wrong: performance is terribly important. One does not need to spend years, or even months in the IT industry in general and the database field in particular, to realize that it is as close to the 'be all and end all' in database management as you can get. MS designed and promoted his products as superior performers--that is what you do as a vendor. But as a scientist and educator who appreciates the value of Codd's contribution, should the exclusive emphasis be on performance and promotion of one's products?

Another third of the market, focused on “data warehousing,” is moving from row-stores to “column stores,” which can be far more efficient.
The concept of data independence and, specifically, physical data independence (PDI) was from the very start a core objective of the RDM--a diversity of storage and access methods satisfying a broad range of applications, with those applications insulated from the details of, and changes in, those methods--which SQL authors and implementers failed to support. No SQL DBMS gave any ability to change the physical store, most let it affect logical issues and all implemented a relation as a "row-store"--the direct image representation. Many even insisted that the RDM required row-images as records and that nonsense has carried forward to the present day.

Columnar storage is nothing new. The idea is to store domain values and a bit map of them to represent the relation physically, constructing the logical relation for all users on the fly. Products actually existed at least as early as 1989. Be that as it may, it is not the columnar storage per se that's important, but that it is one of many storage options and the RDM gives implementors freedom to implement multiple such options and change them transparently whenever necessary for performance! This is what PDI is and when it comes to SQL DBMS performance, this is what MS should stress!
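To make the point concrete, here is a minimal sketch of my own (not any product's actual design) of two interchangeable physical stores behind one logical relation. The names RowStore, ColumnStore, tuples and select are hypothetical, and the "bit map" is simply a Python integer per distinct value, with bit i set when stored row i carries that value.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[str, ...]


class RowStore:
    """Direct-image physical store: keeps each row as one record."""

    def __init__(self, attributes: Iterable[str], rows: Iterable[Row]) -> None:
        self.attributes = tuple(attributes)
        self._rows: List[Row] = list(rows)

    def tuples(self) -> Set[Row]:
        # The logical relation is a set of tuples, whatever the storage.
        return set(self._rows)


class ColumnStore:
    """Columnar physical store: per attribute, one bitmap per distinct value."""

    def __init__(self, attributes: Iterable[str], rows: Iterable[Row]) -> None:
        self.attributes = tuple(attributes)
        rows = list(rows)
        self._count = len(rows)
        # _bitmaps[col][value] is an int whose bit i is set if row i has value.
        self._bitmaps: List[Dict[str, int]] = [dict() for _ in self.attributes]
        for i, row in enumerate(rows):
            for col, value in enumerate(row):
                bitmap = self._bitmaps[col]
                bitmap[value] = bitmap.get(value, 0) | (1 << i)

    def tuples(self) -> Set[Row]:
        # Reconstruct the logical relation on the fly from the bitmaps.
        result: Set[Row] = set()
        for i in range(self._count):
            row = tuple(
                next(v for v, bits in bitmap.items() if (bits >> i) & 1)
                for bitmap in self._bitmaps
            )
            result.add(row)
        return result


def select(store, attribute: str, value: str) -> Set[Row]:
    """Logical restriction: written once, unaware of the physical store."""
    idx = store.attributes.index(attribute)
    return {t for t in store.tuples() if t[idx] == value}
```

The same select call works unchanged against either store, which is the whole point of PDI: the application sees only the relation. A real DBMS would of course evaluate the restriction directly on the bitmaps rather than rebuild every tuple; the sketch deliberately ignores performance to show only the separation of logical from physical.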
A third is “everything else” ... 100 or more of these NoSQL companies ... “NoSQL” databases [sic] and Hadoop ... started out, NoSQL meant, ‘Not SQL,’ then it became ‘Not only SQL,’ and now I think it means “Not-yet-SQL" ... NoSQL proposes low-level languages, and they are betting against the compiler, and that’s an incredibly dangerous thing to do ... Hadoop, it will take on SQL aspects and merge with data warehousing ... Cloudera released the Impala system ... a SQL column-store engine. MapReduce is nowhere to be found ... The historical Hadoop stack was Hive on top of MapReduce, on top of HDFS ... MapReduce will atrophy and be replaced by SQL ... Hadoop will look like the data warehouse market, and NoSQL will look like the SQL market.
Welcome to the past. The situation is almost identical to the one preceding the RDM, which was intended to fix it. The failure to learn from past mistakes and disregard for the RDM--not just poor SQL performance--is a main progress inhibitor in database management and not just because of "betting against the compiler". A relational proponent with MS's longevity should remind the industry that those who forget the past are doomed to repeat it.

Codd realized, for example, that documents mix data with presentation, producing complexity that he strived to eschew. Here, for example, is a revised note I posted last week on DocumentDB, Microsoft's NoSQL product:
  • Polyglot persistence: Wasn't this a problem the RDM was supposed to address?
  • Hierarchy: Ditto.
  • NoSQL: No SQL, but a "SQL-like" language (now it's used for what it was supposed to be eliminating).
  • No integrity or data independence.
  • Cloud: At least mainframes were under each company's control. 
NoSQL guys will drift toward looking at SQL ... They will move to higher-level languages and the only game in town is SQL. VoltDB and other approaches can fix the problems brought about by legacy RDBMS's.
Looks like avoiding modeling and database design upfront without loss of relational capabilities really is the illusion I have often warned it is, and it is coming home to roost. Since SQL is not the true solution and, as far as I know, VoltDB is not a true-to-Codd TRDBMS (unless somebody can correct me), the future of database management does not look promising.
Facebook is one giant social graph, with the problem of how to find the average distance from anyone to anyone ... implemented on [thousands of] MySQL instances ... You can simulate a graph as an edge matrix, a connectivity matrix in an array-based system, or a table system, or you build a special-purpose engine to implement the graph directly. All three are being prototyped and commercialized, and the jury is out whether there is room for a new graph engine ... the answer to graph problems is it will be done by either an array or a table DBMS.

David McGoveran points out that graph databases tend to involve lots of relationships and few instances--like a relational database in which every table has only a few rows and many multi-table constraints, and with every access those constraints are likely to change and new tables are likely to be defined. Also, (1) graph structures have inherent order, which adds considerable complexity to manipulation and integrity enforcement, and (2) graph theory is computationally complete and, thus, prone to undecidability.
The relational algebra avoids this by being intentionally less powerful than the language of graph theory. One consequence is the inability to compute transitive closure directly in the algebra, a trivial graph theory problem.
The RDM can handle an important subset of graph theory: you have to implement special graph operators as domain operators (which might be made efficient, but it's a very difficult problem) or else do certain graph operations in a host language. Be that as it may, not everybody is Facebook and, quite characteristic of industry fads, if Facebook implements a graph system everybody will emulate it, even if they don't know what transitive closure is, regressing to the good old IMS and CODASYL days. Those who forget the past...
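For those who don't know what transitive closure is, here is a minimal sketch of the host-language route, assuming the graph is held as a binary edge relation (a set of pairs). The function names compose and transitive_closure are mine, for illustration only. Each step is an ordinary join, projection and union; what the algebra itself lacks is the loop that repeats them until nothing new appears.

```python
from typing import Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def compose(r: Set[Edge], s: Set[Edge]) -> Set[Edge]:
    """Join r and s on r's target = s's source, then project the end points."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Repeat join-and-union in the host language until a fixpoint is reached."""
    closure = set(edges)
    while True:
        extended = closure | compose(closure, edges)
        if extended == closure:
            return closure
        closure = extended
```

For the edges (a, b), (b, c) and (c, d), the result also contains (a, c), (b, d) and (a, d). The loop always terminates because the closure can only grow and is bounded by the finite set of node pairs.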

Codd, of course, was a Turing recipient himself, "for fundamental contributions to the concepts and practices underlying modern database systems". He went against the grain, as MS recognizes. MS got his award "for fundamental and continuing contributions to the theory and practice of database management systems, esp. relational databases". The Turing gives him an excellent opportunity to follow in Codd's steps, go against the grain and promote true relational technology as both a practically and theoretically sound database foundation for a vast range of application needs.
