Sunday, February 24, 2013

Language Redundancy and DBMS Performance: A SQL Story

Recently I came across SQL: The way you write your query matters by Iggy Fernandez that refers to an old article of mine in which I compared the performance of five PC DBMSs executing seven different syntactic SQL formulations of the same query. I got wildly different timings, ranging from 15 seconds to 2500 seconds!

Over the years I used that example to explain the practical importance of relational fidelity and how disregard for it, as well as for principles of good language design affect database practice. Enough time has passed since then and younger generations of practitioners have joined the field by learning SQL syntactically and confusing it with the relational model. So perhaps the time has come to retell the story for their benefit.

Consider the basic SQL syntax SELECT ...FROM ... WHERE. What exactly is it intended to express? When I ask this question, I usually get blank looks. Here's a hint:

SELECT columns
FROM table(s)
WHERE row condition;

Of course, it is a description of some table result, but there's more to it than that. When I used to teach SQL, I made sure first to introduce the audience to the operations of the relational algebra. Only then I would ask the question, at which point it was realized that:
  • the SELECT clause is a projection
  • the WHERE clause is a restriction
  • if the FROM clause specifies a single table, then the result is a table with the projected columns and the rows to which it is restricted;
  • if the FROM clause specifies two or more tables, it's a Cartesian product
Now, if you were taught right, you should know that some of the table operations are primitive e.g. projection, restriction, product, while others are derived as combinations of the basic operations e.g., join, which consists of a projection of a product restricted to only the rows which satisfy the condition that their join columns values match. In other words, the SELECT statement was designed to facilitate expressing a join.

But, you ask, SQL has a JOIN syntax, so why was this necessary? To understand that you need to know some history.

Keep in mind that SQL was a first attempt to give end-users access to databases which, at the time, were accessible only to highly skilled programmers. Therefore, when SQL was developed at IBM as a prototype query language for System R, IBM's research project testing the feasibility of RDBMS, it was thought that relational theory--predicate logic and set mathematics--terms such as join, difference, intersect--are not familiar to users and would defeat the purpose. This is not unreasonable per se, but the way SQL was designed to address this issue created more serious problems than the one it actually intended to solve.

A syntax was required that would express relational operations indirectly, without using their actual names. That was achieved by the basic syntax to express joins, while some of the other operations could be expressed with nested SELECTs i.e. sub-queries. There are three problems with SQL sub-queries.
  • Some of the expressions relying on them are actually much more complex than the explicit relational expressions would be.
  • The sub-query implementation is responsible for SQL's redundancy--you can express the same query in many different ways. This makes SQL a very difficult target language for performance optimization. The user ends up burdened with figuring out how to formulate a query for best performance, precisely what the relational model was invented to prevent.
  • Some operations e.g. union could not be expressed with sub-queries and had to be stated explicitly anyway.
During SQL's design Codd was still at IBM, where he invented the relational model, and I've been told that he warned IBM about these problems and advised that, since the research had demonstrated that RDBMS was feasible, an effort to design a truly relational serious data language should be undertaken. That was not done, Oracle implemented SQL as it was, IBM followed suit and imposed it on the ANSI committee and we've been living with the consequences ever since. Of course, language redundancy is hardly the only SQL deficiency.

My article demonstrated the practical implications of disregard for soundness. The only product among those I tested that systematically delivered the same performance for all queries was Ingres. What distinguished it from the other products was its different native relational data language, QUEL, which did not have sub-queries and, therefore, was not redundant. Ironically, IBM market dominance forced its vendor, RTI, to add a SQL interface to the QUEL engine. But no matter how you expressed a query in SQL, it could be converted to only one QUEL expression, hence the homogenous performance. (If I am not mistaken, the SQL-to-QUEL conversion was designed by Chris Date).

This story ended even more ironically. Ultimately, the direct relational operators were all added to SQL anyway, so now SQL is doubly redundant!

What conclusions can be drawn from this story regarding the claims (1) the superior product always wins in a free market and (2) don't bother me with theory, I have practical things to do?



Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

17 comments:

  1. The superior product always wins, for one particular meaning of "superior"

    ReplyDelete
    Replies
    1. Of course. But it certainly does not match the criteria I formulated in my paper "Business Modeling for Database Design". far from it.

      Delete
  2. I think one of the consequences of pragmatism is the tendency to attempt to patch up fundamentally flawed products rather than addressing the root problems. Thus another layer is added, increasing complexity and making the product even more unreliable.

    As the saying goes; "big bugs have little bugs upon their backs to bite'em and little bugs have smaller bugs and so ad infinitum".

    Or to put it another way pragmatism is better in theory than it is in practice.

    On the subject of the best product winning I noticed that a certain large software company with a SQL-DBMS product spent 20% of its operating costs on sales and marketing and only 13% on research and development.

    ReplyDelete
    Replies
    1. That is actually a very profitable business model.

      As I so often said, you come up with some ad-hoc, not thoroughly thought out technology, tool or approach, present it as a magic wand, it accumulates problems and you come up with another that will solve them, but create others which require yet another and so on.

      As long as users tolerate it and vendors profit, I see no reason this will stop.

      Delete
  3. Check out: http://martinfowler.com/bliki/NoDBA.html

    Part of the 'database' problem is incessant wail of "master-programmers" that they really know better than everyone else and if left on their own they can solve every problem with some 'modern development techniques.' Look at his diagram and tell me where the 'users' fit in and what 'ceremony' will be necessary to access that 'application' data. As usual, facts suffer and nonsense rules.

    ReplyDelete
    Replies
    1. I've been deploring this problem for ever since I can remember.

      I am commenting on an article on NoSQL and dropping ACID which I will post later which correctly states that the real implication of this approach is that the consistency of the database is left to the application developers.

      How many "master programmers" would you rely on to implement the ACID functionality correctly and completely for every application?

      Delete
    2. Martin Fowler's comment about adding a single column to the database taking weeks is a consequence of the poor integration of "modern" application languages and tools with DBMSs.

      Adding a new column to the database is trivial, working out the consequences for all the applications that use the database can be very difficult. This is a problem that application developers and language vendors have created for themselves. The problem is of course a data management one - there is a relation between applications and the tables and columns they use. Where and how is this data represented?

      Delete
    3. IMHO the Fowler article fails to highlight the real problem, which is that SQL DBMSes lack adequate support for change & configuration management of schemas. This is the reason for the "ceremony" which he describes and leads to the Them & Us mentality between DBAs & developers.

      DBMS vendors should provide a versioned data dictionary capability that is integral to their products, thus facilitating deployment of schema & reference data baselines along the same lines as other software. As it is, this activity is all too easy to get wrong, and is compounded by the mix of physical and logical when it comes to DDL statements -- table spaces, table partitioning, etc, along with table structure.

      Delete
    4. This issue was raised in earlier exchanges here and certainly is a factor. However, even if such tools were provided, there would still remain the rest of the factors discussed here.

      BTW, vendors will not deleiver what users do not demand. And practioners do not necessarily appreciate the real causes of various problems and don't demand the right solutions, so they don't get them. There is a question whether the DBMS designers have better knowledge.

      Delete
    5. What exactly is "adequate support for change & configuration management of schemas" ??? I'm a bit inclined to conjecture that it will never exist. Support for this that is perceived by the user as "adequate", that is.

      Delete
  4. That Martin Fowler essay describes my world at this very moment.

    I can't say I was excluded from the design process because there wasn't one. We are entering user-acceptance testing TODAY and they are still making database design changes. I suppose that's agile. Or it's improvising, and improvising improvisations upon improvisation.

    I cannot fathom their logical model. The little of it that I do understand uses synthetic keys for everything, all are called id in one table and something_id where they appear as a foreign key. Natural keys are never used as foreign keys so you can never find anything without doing at least two joins, usually three, and sometimes more.

    All (and I mean ALL) values are encoded. To give one example there a table that translates ISO 3166-1 alpha-2 country codes to their own code which they proceeed to use everywhere they need a country code.

    But the worst of it is that their enterprise of interest is not the business but their Java classes. It is an object repository in a database. And of course even in their own object world they've abstracted everything into grey goo.

    I can't imagine even the programmers are pleased with what they've ended up with. I know the customer won't be.

    ReplyDelete
    Replies
    1. If, at the very least, the consequences of such endeavors would be used to learn from mistakes. The worse problem, however, is that lessons are not learned because:

      * whatever the costly implications, they are considered normal and the maintenance burden means long employment/contract.

      * If they do realize there are problems, they are unable to associate them with the real causes and, therefore, they cannot be fixed.

      If you have no concept of the right way, there is no learning curve.

      Delete
    2. "If you look at the classes...what strikes you as missing?

      You guessed it! No identification schemas are provided for the classes. In object oriented programming objects may be identified by their memory addresses or internal object
      identifiers...so UML does not require that you provide a value-based identification scheme for use by humans who are communicating about the objects. For conceptual analysis, however, such human-oriented reference schemes must be supplied."

      Terry Halpin "Information Modelling and Relational Databases"

      As we have previously discussed clear communication is one of the most difficult things to achieve when designing systems. With OO you have lost that game right at the start.

      Delete
    3. Well, if your perspective is that of a programmed application (and the DBMS reduced to a "storage persister"),
      then everything the users need is perfectly encapsulated in the application, then data values are irrelevant.

      This is, of course, what we used to do before we had databases.

      The Java kids whine about the trouble changing the schema while in reality it's because of the OO development approach that is causing the bottlenecks. A great illustration of inability to solve the problems.

      Delete
  5. Will,

    Exactly what I've just answered anonymous. If you have no clue what database management really is, figuring out how and why of problems is not possible.

    Here's more from Fowler:
    http://typicalprogrammer.com/?p=101#more-101

    ReplyDelete
    Replies
    1. To Anonymous: note that most proposed solutions to poor planning involve less planning. 'Fail faster' is now a goal! ('Succeed just once' would be nice!)

      What kills me about these 'Fowlers' is that they explicitly reject using the database as an interface only to reference it from code as a physical structure and then they complain that their applications are 'tied' too closely to the database! You can however write code to a database interface (a logical structure.) It does require actual design work. (I sometimes question that Mr. Pascal attributes this to a 'lack of knowledge' rather than a 'lack of effort!' I've told him before that he is nicer than me!)

      The only quibble I have with the "Typical Programmer" is when he says, "Large enterprise databases are usually implemented like that." They are not implemented like that in my experience. Most 'enterprise databases' I have seen are just landfills. The smell alone should drive most reasonable people away!

      Delete
    2. >You can however write code to a database interface (a logical structure.) It does require actual design work.

      If you refuse to accept the exclusivity of the R-table structure and insist that ANY structure is acceptable and embed all those inessential structures in your types--because that's all you have--then you indeed will complain about an "impedance mismatch.

      Lack of knowledge and lack of effort are not mutually exclusive. In fact they are correlated. I do not often come across instances of professionals who know, understand and appreciate sound practices, yet are too lazy to employ them, unless the system somehow punishes instead of rewarding it. In a vast majority of the cases knowledge is lacking.

      As to being nice, have you ever been labeled "the Taliban of database management"?

      Delete