A CURE FOR MADNESS
by C. J. Date

 

 

 

Though this be madness, yet there is method in't

--William Shakespeare

 

THE ORIGINAL QUESTION

 

I wrote this short note in response to a paper titled Subquery Madness, in which the author, Jonathan Gennick, raises a question about subqueries in SQL and their implementation in Oracle.  The following is an edited version of his question. 

 

We're given a table called SUBTEST with two columns, FLAG and NUM, both containing character data, with the constraint that if FLAG = 'N' then NUM represents a numeric value (Fig. 1 shows some sample values). 

 

SUBTEST

+------------+

¦ FLAG ¦ NUM ¦

+------+-----¦

¦ N    ¦ 123 ¦

¦ X    ¦ 123 ¦

¦ Y    ¦ pqr ¦

¦ N    ¦ 456 ¦

¦ Z    ¦ ijk ¦

+------------+

          Fig. 1: Table SUBTEST

 

Now consider this SQL query: 

                                            

SELECT FLAG, TO_NUMBER ( NUM ) NUM

FROM   SUBTEST

WHERE  FLAG = 'N' ;

 

Given the sample data of Fig. 1, this query returns the result table (let's call it R1) shown in Fig. 2. 

 

R1

+------------+

¦ FLAG ¦ NUM ¦

+------+-----¦

¦ N    ¦ 123 ¦

¦ N    ¦ 456 ¦

+------------+

            Fig. 2: Table R1

 

Now suppose we use the original query as a subquery within the FROM clause of another query: 

 

SELECT *

FROM ( SELECT FLAG, TO_NUMBER ( NUM ) NUM

       FROM   SUBTEST

       WHERE  FLAG = 'N' )

WHERE  NUM > 0 ;

 

According to Gennick, this query fails with the following error message: 

 

ERROR:

ORA-01722: invalid number

 

Gennick suggests in his paper that the failure occurs because the outer query attempts to evaluate the condition NUM > 0 on one of the nonnumeric values in column NUM.  To quote:  "[The] optimizer is free to test ... rows against [the] NUM > 0 predicate first, if that's deemed more efficient." 

 

As Gennick subsequently points out, however, the foregoing explanation cannot be correct, because the condition NUM > 0 involves a type error (one that should be caught at compile time, moreover)--NUM is of type character string and 0 is of type numeric.  What has really happened is that the optimizer has rewritten the original query as follows: 

 

SELECT FLAG, TO_NUMBER ( NUM ) NUM

FROM   SUBTEST

WHERE  TO_NUMBER ( NUM ) > 0

AND    FLAG = 'N' ;

 

Now it's clear that the failure occurs as soon as the system tries (in the WHERE clause) to convert some nonnumeric NUM value--say 'pqr'--to a number.  The question is, then:  Is this rewriting on the part of the optimizer valid?  The short answer is no.  But the question does merit a slightly lengthier response and discussion, and such is the intent of the rest of this note. 

 

 

THE SYNTAX AND SEMANTICS OF SELECT - FROM - WHERE

 

First I want to make an obvious point about SELECT - FROM - WHERE expressions in SQL.  To do so, I don't need to consider such expressions in all of their full generality and complexity--the following simple form will suffice: 

 

SELECT expression, expression, ..., expression

FROM   table

WHERE  condition

 

Note in particular that I've mentioned just one table, T say, in the FROM clause.  I also want to impose the limitation that there are no subqueries in either the SELECT clause or the WHERE clause (this query is really simple).  Note carefully, then, that every column mentioned in the SELECT clause or the WHERE clause must be a column of table T specifically. 

 

Observe now that I deliberately didn't rule out the possibility of subqueries in the FROM clause.  The fact is, the foregoing syntax rule regarding columns in the SELECT and WHERE clauses applies no matter how table T is specified.  Here again is Gennick's original query: 

 

SELECT *

FROM ( SELECT FLAG, TO_NUMBER ( NUM ) NUM

       FROM   SUBTEST

       WHERE  FLAG = 'N' )

WHERE  NUM > 0 ;

 

This query is logically equivalent to: 

 

SELECT *

FROM   R1

WHERE  NUM > 0 ;

 

(you will recall that R1 is the result of the subquery in the FROM clause).  It should be clear, then, that the reference to column NUM in the WHERE clause is a reference to a column of table R1, not a reference to the column of the same name in table SUBTEST.  (In fact, the two NUMs even have different data types--SUBTEST.NUM is of type character string, while R1.NUM is numeric.) 

 

Note:  The foregoing point might be easier to understand if we had introduced a different name, XYZ say, in the original query, thus: 

 

SELECT *

FROM ( SELECT FLAG, TO_NUMBER ( NUM ) XYZ

       FROM   SUBTEST

       WHERE  FLAG = 'N' )

WHERE  XYZ > 0 ;

 

(boldface for emphasis).  Using XYZ instead of NUM makes no logical difference to the query, but I think it does make a psychological difference--it's obvious, now, that XYZ is a column of R1, and SUBTEST has no such column at all. 

 

Now, the SQL standard makes it perfectly clear that the result of the query

 

SELECT *

FROM   R1

WHERE  NUM > 0 ;

 

is defined as follows: 

 

1.      Evaluate R1. 

2.      Restrict the result of the previous step to just those rows satisfying NUM > 0. 

3.      Project the result of the previous step over all of its columns (which is effectively a no op, of course). 

 

In other words, the inner subquery must be evaluated before the outer WHERE and SELECT clauses are executed (hence my unequivocal no to the question "Is this rewriting on the part of the optimizer valid?").  But there's still a little more to be said. 

 

 

MODEL vs. IMPLEMENTATION

 

Part of the reason I wanted to discuss Gennick's question is that it illustrates very well my thesis that model vs. implementation is one of the great logical differences.  I have argued this point in many places and on many occasions; in fact, a paper with the title On the Logical Difference Between Model and Implementation appeared recently on this very website, as part of a series On Logical Differences 1-4.  In most of my discussions of this topic, however, I have concentrated on the relational model specifically (contrasting it with implementations--perhaps I should say would-be implementations--of that model).  In the case of Gennick's question, however, the relational model is involved only somewhat incidentally; rather, the model we're dealing with is a formal language definition (of the language SQL, to be precise), and the implementation is what Gennick refers to in his paper as the database engine (also known as the optimizer).  In other words, Gennick's question is a language question, not a relational one. 

 

Any formal language definition, if it's worth its salt, will specify the precise syntax and semantics of the language in question.  The pertinent portion of the SQL language definition reads as follows (I'm quoting here from the SQL:1992 version of that standard): 

 

<table expression> ::=

    <from clause>

    [ <where clause> ]

    [ <group by clause> ]

    [ <having clause> ]

 

The result of a <table expression> is a derived table in which the descriptor of the i-th column is the same as the descriptor of the i-th column of the table specified by the <from clause> ... If all optional clauses are omitted, then the result of the <table expression> is the same as the result of the <from clause>.  Otherwise, each specified clause is applied to the result of the previously specified clause and the result of the <table expression> is the result of the application of the last specified clause. 

 

In the case at hand, therefore (in which there is a WHERE clause but no GROUP BY or HAVING clause), we see that (a) we have the syntax correct* and (b) the semantics are as previously stated:  First execute the FROM clause, then execute the WHERE clause.  That's the formal model.  That's what the query means

 

----------

* Actually, another portion of the SQL standard, not quoted here, would require the explicit introduction of a name to refer to the result of the subquery in the FROM clause, like this: SELECT * FROM (...) AS R1 WHERE ... (italics for emphasis)--even though that name R1 isn't referenced anywhere else in the query!  Oracle, not unreasonably, appears not to enforce this requirement. 

----------

 

Now, the algorithm given in the formal definition--first execute the FROM clause, then execute the WHERE clause, in the case at hand--is purely conceptual in nature.  In effect, the standard is just saying "If you execute this algorithm, then you'll get the right answer."  But that if is a pretty big if!  There's no implication that the implementation (the "database engine") has to do exactly what the formal definition says.  In fact, the implementation is free to use any algorithm it chooses, just so long as whatever algorithm it does use is guaranteed to give the same answer as the specified conceptual algorithm.  And, of course, there are often very good reasons--usually performance reasons--for using a different algorithm, thereby (for example) executing clauses in a different order or otherwise rewriting the original query.  But the implementation is free to do such things only if it can be proved that the algorithm used is logically equivalent to the conceptual one specified in the formal definition.  In Gennick's example, however, the optimizer has clearly used an algorithm that is not logically equivalent to the conceptual algorithm specified in the SQL standard, and the implementation is incorrect. 

 

 

NESTED FUNCTIONS

 

There's another way to state the foregoing argument, and some readers might like this alternative formulation better.  In general, we can regard an SQL SELECT - FROM - WHERE expression as a nested expression, or nested set of function invocations, of the form

 

s ( w ( f ( x ) ) )

 

Explanation: 

 

·   The expression f(x)--f for FROM--is a function invocation that returns a table (x here stands for anything that's legal in a FROM clause). 

 

·         w (WHERE) is a function that operates on a table and returns a table. 

 

·         s (SELECT) is a function that operates on a table and returns a table. 

 

This hypothetical notation of nested function invocations makes it very clear that, at least conceptually, s can't be invoked before w (because s's input is w's output), and w can't be invoked before f (because w's input is f's output).  Thus, if the implementation does want to invoke the functions in some different sequence (perhaps in some kind of coroutine fashion), then it must guarantee that the semantics are preserved; in other words, it must guarantee that it produces the result that would have been produced if the original nested sequence had been adhered to.  And again we see, in the example at hand, that this requirement is not met and the implementation is in error.

 

 

Posted 08/27/04