Sunday, February 24, 2013

Language Redundancy and DBMS Performance: A SQL Story



Recently I came across SQL: The way you write your query matters by Iggy Fernandez that refers to an old article of mine in which I compared the performance of five PC DBMSs executing seven different syntactic SQL formulations of the same query. I got wildly different timings, ranging from 15 seconds to 2500 seconds!

Over the years I used that example to explain the practical importance of relational fidelity and how disregard for it, as well as for principles of good language design affect database practice. Enough time has passed since then and younger generations of practitioners have joined the field by learning SQL syntactically and confusing it with the relational model. So perhaps the time has come to retell the story for their benefit.

Consider the basic SQL syntax SELECT ...FROM ... WHERE. What exactly is it intended to express? When I ask this question, I usually get blank looks. Here's a hint:

SELECT columns
FROM table(s)
WHERE row condition;

Of course, it is a description of some table result, but there's more to it than that. When I used to teach SQL, I made sure first to introduce the audience to the operations of the relational algebra. Only then I would ask the question, at which point it was realized that:
  • the SELECT clause is a projection
  • the WHERE clause is a restriction
  • if the FROM clause specifies a single table, then the result is a table with the projected columns and the rows to which it is restricted;
  • if the FROM clause specifies two or more tables, it's a Cartesian product
Now, if you were taught right, you should know that some of the table operations are primitive e.g. projection, restriction, product, while others are derived as combinations of the basic operations e.g., join, which consists of a projection of a product restricted to only the rows which satisfy the condition that their join columns values match. In other words, the SELECT statement was designed to facilitate expressing a join.

But, you ask, SQL has a JOIN syntax, so why was this necessary? To understand that you need to know some history.

Keep in mind that SQL was a first attempt to give end-users access to databases which, at the time, were accessible only to highly skilled programmers. Therefore, when SQL was developed at IBM as a prototype query language for System R, IBM's research project testing the feasibility of RDBMS, it was thought that relational theory--predicate logic and set mathematics--terms such as join, difference, intersect--are not familiar to users and would defeat the purpose. This is not unreasonable per se, but the way SQL was designed to address this issue created more serious problems than the one it actually intended to solve.

A syntax was required that would express relational operations indirectly, without using their actual names. That was achieved by the basic syntax to express joins, while some of the other operations could be expressed with nested SELECTs i.e. sub-queries. There are three problems with SQL sub-queries.
  • Some of the expressions relying on them are actually much more complex than the explicit relational expressions would be.
  • The sub-query implementation is responsible for SQL's redundancy--you can express the same query in many different ways. This makes SQL a very difficult target language for performance optimization. The user ends up burdened with figuring out how to formulate a query for best performance, precisely what the relational model was invented to prevent.
  • Some operations e.g. union could not be expressed with sub-queries and had to be stated explicitly anyway.
During SQL's design Codd was still at IBM, where he invented the relational model, and I've been told that he warned IBM about these problems and advised that, since the research had demonstrated that RDBMS was feasible, an effort to design a truly relational serious data language should be undertaken. That was not done, Oracle implemented SQL as it was, IBM followed suit and imposed it on the ANSI committee and we've been living with the consequences ever since. Of course, language redundancy is hardly the only SQL deficiency.

My article demonstrated the practical implications of disregard for soundness. The only product among those I tested that systematically delivered the same performance for all queries was Ingres. What distinguished it from the other products was its different native relational data language, QUEL, which did not have sub-queries and, therefore, was not redundant. Ironically, IBM market dominance forced its vendor, RTI, to add a SQL interface to the QUEL engine. But no matter how you expressed a query in SQL, it could be converted to only one QUEL expression, hence the homogenous performance. (If I am not mistaken, the SQL-to-QUEL conversion was designed by Chris Date).

This story ended even more ironically. Ultimately, the direct relational operators were all added to SQL anyway, so now SQL is doubly redundant!

What conclusions can be drawn from this story regarding the claims (1) the superior product always wins in a free market and (2) don't bother me with theory, I have practical things to do?



No comments:

Post a Comment

View My Stats