Though this be madness, yet there is method in't
--William Shakespeare
THE ORIGINAL QUESTION
I wrote this short note in response to a paper titled Subquery
Madness, in which the author, Jonathan Gennick, raises a
question about subqueries in SQL and their implementation in Oracle. The following is an edited version of his
question.
We're given a table called SUBTEST with two columns, FLAG and
NUM, both containing character data, with the constraint that if FLAG = 'N'
then NUM represents a numeric value (Fig. 1 shows some sample values).
SUBTEST
+------------+
¦ FLAG ¦ NUM ¦
+------+-----¦
¦ N ¦ 123 ¦
¦ X ¦ 123 ¦
¦ Y ¦ pqr ¦
¦ N ¦ 456 ¦
¦ Z ¦ ijk ¦
+------------+
Fig. 1:
Table SUBTEST
Now consider this SQL query:
SELECT FLAG, TO_NUMBER ( NUM ) NUM
FROM SUBTEST
WHERE FLAG = 'N' ;
Given the sample data of Fig. 1, this query returns the
result table (let's call it R1) shown in Fig. 2.
R1
+------------+
¦ FLAG ¦ NUM ¦
+------+-----¦
¦ N ¦ 123 ¦
¦ N ¦ 456 ¦
+------------+
Fig.
2: Table R1
Now suppose we use the original query as a subquery within
the FROM clause of another query:
SELECT *
FROM ( SELECT FLAG, TO_NUMBER ( NUM ) NUM
FROM SUBTEST
WHERE FLAG = 'N' )
WHERE NUM > 0 ;
According to Gennick, this query fails with the following
error message:
ERROR:
ORA-01722: invalid number
Gennick suggests in his paper that the failure occurs because
the outer query attempts to evaluate the condition NUM > 0 on one of the
nonnumeric values in column NUM. To
quote: "[The] optimizer is free to
test ... rows against [the] NUM > 0 predicate first, if that's deemed more
efficient."
As Gennick subsequently points out, however, the foregoing
explanation cannot be correct, because the condition NUM > 0 involves a type
error (one that should be caught at compile time, moreover)--NUM is of type
character string and 0 is of type numeric.
What has really happened is that the optimizer has rewritten the
original query as follows:
SELECT FLAG, TO_NUMBER ( NUM ) NUM
FROM SUBTEST
WHERE TO_NUMBER ( NUM )
> 0
AND FLAG = 'N' ;
Now it's clear that the failure occurs as soon as the system
tries (in the WHERE clause) to convert some nonnumeric NUM value--say 'pqr'--to
a number. The question is, then: Is this rewriting on the part of the
optimizer valid? The short answer is
no. But the question does merit a
slightly lengthier response and discussion, and such is the intent of the rest
of this note.
THE SYNTAX AND SEMANTICS OF SELECT - FROM - WHERE
First I want to make an obvious point about SELECT - FROM -
WHERE expressions in SQL. To do so, I
don't need to consider such expressions in all of their full generality and
complexity--the following simple form will suffice:
SELECT expression, expression, ..., expression
FROM table
WHERE condition
Note in particular that I've mentioned just one table, T say,
in the FROM clause. I also want to
impose the limitation that there are no subqueries in either the SELECT clause or
the WHERE clause (this query is really simple). Note carefully, then, that every column mentioned in the SELECT
clause or the WHERE clause must be a column of table T specifically.
Observe now that I deliberately didn't rule out the
possibility of subqueries in the FROM clause.
The fact is, the foregoing syntax rule regarding columns in the SELECT
and WHERE clauses applies no matter how table T is specified. Here again is Gennick's original query:
SELECT *
FROM ( SELECT FLAG, TO_NUMBER ( NUM ) NUM
FROM SUBTEST
WHERE FLAG = 'N' )
WHERE NUM > 0 ;
This query is logically equivalent to:
SELECT *
FROM R1
WHERE NUM > 0 ;
(you will recall that R1 is the result of the subquery in the
FROM clause). It should be clear, then,
that the reference to column NUM in the WHERE clause is a reference to a column
of table R1, not a reference to the column of the same name in table
SUBTEST. (In fact, the two NUMs even
have different data types--SUBTEST.NUM is of type character string, while
R1.NUM is numeric.)
Note: The foregoing
point might be easier to understand if we had introduced a different name, XYZ
say, in the original query, thus:
SELECT *
FROM ( SELECT FLAG, TO_NUMBER ( NUM ) XYZ
FROM SUBTEST
WHERE FLAG = 'N' )
WHERE XYZ > 0 ;
(boldface for emphasis).
Using XYZ instead of NUM makes no logical difference to the query, but I
think it does make a psychological difference--it's obvious, now, that XYZ is a
column of R1, and SUBTEST has no such column at all.
Now, the SQL standard makes it perfectly clear that the
result of the query
SELECT *
FROM R1
WHERE NUM > 0 ;
is defined as follows:
1.
Evaluate R1.
2.
Restrict the result of the previous step to just those rows
satisfying NUM > 0.
3.
Project the result of the previous step over all of its
columns (which is effectively a no op, of course).
In other words, the inner subquery must be evaluated before
the outer WHERE and SELECT clauses are executed (hence my unequivocal no to the
question "Is this rewriting on the part of the optimizer
valid?"). But there's still a
little more to be said.
MODEL vs. IMPLEMENTATION
Part of the
reason I wanted to discuss Gennick's question is that it illustrates very well
my thesis that model vs. implementation is one of the great logical
differences. I have argued this
point in many places and on many occasions; in fact, a paper with the title On
the Logical Difference Between Model and Implementation appeared recently
on this very website, as part of a series On Logical Differences 1-4. In most of my discussions of this topic,
however, I have concentrated on the relational model specifically (contrasting
it with implementations--perhaps I should say would-be implementations--of that
model). In the case of Gennick's
question, however, the relational model is involved only somewhat incidentally;
rather, the model we're dealing with is a formal language definition (of
the language SQL, to be precise), and the implementation is what Gennick refers
to in his paper as the database engine (also known as the
optimizer). In other words, Gennick's
question is a language question, not a relational one.
Any formal
language definition, if it's worth its salt, will specify the precise syntax
and semantics of the language in question.
The pertinent portion of the SQL language definition reads as follows
(I'm quoting here from the SQL:1992 version of that standard):
<table expression> ::=
<from clause>
[ <where clause>
]
[ <group by
clause> ]
[ <having
clause> ]
The result of a <table expression> is a derived table in
which the descriptor of the i-th column is the same as the descriptor of
the i-th column of the table specified by the <from clause> ... If
all optional clauses are omitted, then the result of the <table
expression> is the same as the result of the <from clause>. Otherwise, each specified clause is applied
to the result of the previously specified clause and the result of the
<table expression> is the result of the application of the last specified
clause.
In the case at hand, therefore (in which there is a WHERE
clause but no GROUP BY or HAVING clause), we see that (a) we have the syntax
correct* and (b) the semantics are as
previously stated: First execute the
FROM clause, then execute the WHERE clause.
That's the formal model. That's
what the query means.
----------
* Actually, another
portion of the SQL standard, not quoted here, would require the explicit
introduction of a name to refer to the result of the subquery in the FROM
clause, like this: SELECT * FROM (...) AS R1 WHERE ... (italics for
emphasis)--even though that name R1 isn't referenced anywhere else in
the query! Oracle, not unreasonably, appears
not to enforce this requirement.
----------
Now, the algorithm given in the formal definition--first
execute the FROM clause, then execute the WHERE clause, in the case at hand--is
purely conceptual in nature. In effect,
the standard is just saying "If you execute this algorithm, then
you'll get the right answer." But
that if is a pretty big if!
There's no implication that the implementation (the "database
engine") has to do exactly what the formal definition says. In fact, the implementation is free to use
any algorithm it chooses, just so long as whatever algorithm it does use is
guaranteed to give the same answer as the specified conceptual algorithm. And, of course, there are often very good
reasons--usually performance reasons--for using a different algorithm, thereby
(for example) executing clauses in a different order or otherwise rewriting the
original query. But the implementation
is free to do such things only if it can be proved that the algorithm used
is logically equivalent to the conceptual one specified in the formal
definition. In Gennick's example,
however, the optimizer has clearly used an algorithm that is not
logically equivalent to the conceptual algorithm specified in the SQL standard,
and the implementation is incorrect.
NESTED FUNCTIONS
There's another way to state the foregoing argument, and some
readers might like this alternative formulation better. In general, we can regard an SQL SELECT -
FROM - WHERE expression as a nested expression, or nested set of
function invocations, of the form
s ( w ( f ( x ) ) )
Explanation:
·
The expression f(x)--f for
FROM--is a function invocation that returns a table (x here stands for
anything that's legal in a FROM clause).
·
w (WHERE) is a function that operates on a table
and returns a table.
·
s (SELECT) is a function that operates on a
table and returns a table.
This hypothetical notation of nested function invocations
makes it very clear that, at least conceptually, s can't be invoked
before w (because s's input is w's output), and w
can't be invoked before f (because w's input is f's
output). Thus, if the implementation
does want to invoke the functions in some different sequence (perhaps in some
kind of coroutine fashion), then it must guarantee that the semantics are
preserved; in other words, it must guarantee that it produces the result that
would have been produced if the original nested sequence had been adhered
to. And again we see, in the example at
hand, that this requirement is not met and the implementation is in error.
Posted
08/27/04