MORE ON CURE FOR MADNESS
by C. J. Date

 

 

 

FURTHER THOUGHTS

 

Further reflection on Gennick's problem made me realize there was more that could usefully be said on this matter; hence the present section. 

 

Let's agree to ignore SQL as such until further notice and switch to relational algebra (and relational terminology) instead.  In relational algebra, if r and p are a relational expression and a predicate (i.e., a truth-valued expression), respectively, then the expression

 

r WHERE p

 

evaluates to a relation with the same heading--i.e., the same attributes--as r and with a body containing just those tuples of r for which p evaluates to TRUE.*  That result relation is said to be a restriction of r, and the operation that computes it is called a restriction operation.  Loosely, what the restriction operation does is filter out certain tuples from its operand relation

 

*Strictly speaking, p must be quantifier-free and its parameters must be some subset of the attributes of r, but there's no need to get into the fine detail of such matters here. 

 

Now suppose the expression r itself denotes another restriction operation, so that the overall expression takes the form, say,

 

( s WHERE q ) WHERE p

 

for some relational expression s and some predicate q.  Then it's well known that this overall expression can be legitimately transformed into

 

s WHERE ( q ) AND ( p )

 

And this transformation is useful for optimization purposes, because the original expression implies that the implementation must perform two passes over the s-relation while the transformed version shows that one pass suffices.  Moreover, the original expression suggests that the implementation must check q first when filtering out tuples, but the transformed version shows it can check p first if it likes, because AND is commutative.  (And it presumably would like to check p first if, for example, p is easier to evaluate than q, or if it's likely that most tuples satisfy q but very few satisfy p.) 

 

Suppose now, however, that we have a situation in which the q-filtering must be done before the p-filtering (as it were).  Then the relational expression that represents this state of affairs is not

 

( s WHERE q ) WHERE p

 

--because this expression does not capture the requirement that q must be applied first, as we have just seen.  So what is to be done?  Well, in fact a situation of this same general nature occurs in connection with the issue of type inheritance.   Let's look at an example.  Note:  The discussion that follows is based on one given in the book FOUNDATION FOR FUTURE DATABASE SYSTEMS: THE THIRD MANIFESTO (2nd edition), by Hugh Darwen and myself, where further details can be found. 

 

By way of example, then, let CIRCLE be a subtype of type ELLIPSE, and let relation s have an attribute E that's declared to be of type ELLIPSE.  In general, then, some tuples of s will have a value for attribute E that is in fact a circle instead of just a (noncircular) ellipse.  Now consider the following query: 

 

Get tuples of s where the E value is in fact a circle and the radius of that circle is greater than two. 

 

    Observe immediately that the following "obvious" formulation of the query doesn't work: 

 

( s WHERE IS_CIRCLE ( E ) )

    WHERE THE_R ( E ) > LENGTH ( 2.0 )

 

(I'm assuming here that (a) for a given tuple t of s, IS_CIRCLE(E) returns TRUE if the value of E in t is a circle and FALSE otherwise, and (b) THE_R is an operator that takes a circle as its operand and returns the corresponding radius, a value of type LENGTH.)  The reason the formulation shown above doesn't work is because it's logically equivalent to the following: 

 

s WHERE THE_R ( E ) > LENGTH ( 2.0 ) AND IS_CIRCLE ( E )

 

And this expression will fail as soon as it tries to invoke the operator THE_R on a value of E (in some tuple t of s) that's just an ellipse and not a circle, because THE_R isn't defined for ellipses that aren't circles; in other words, there's no guarantee that the system will apply the IS_CIRCLE filtering before attempting to check the radius.  Note:  In fact, if static type checking is performed, then the expression will fail at compile time, because the compiler will know that the argument E to THE_R is of type ELLIPSE but the corresponding parameter is of type CIRCLE. 

 

Clearly, then, what we need to do, for any given tuple t of s, is make sure the E value is a circle before we check to see whether the radius of that circle is greater than two (indeed, we mustn't even perform this latter check if the E value isn't a circle).  To that end, Hugh Darwen and I (in the book mentioned previously) invented some new syntax, such that the query under discussion can be correctly formulated as: 

 

( s : IS_CIRCLE ( E ) ) WHERE THE_R ( E ) > LENGTH ( 2.0 )

 

The expression s:IS_CIRCLE(E) is defined to return a relation r that's identical to s except that: 

 

·         The body of r contains just those tuples of s for which the E value is a circle. 

 

·   Attribute E in the heading of r is of type CIRCLE, not type ELLIPSE (so the compile-time type checking on THE_R(E) will now succeed). 

 

Moreover, the operator precedence rules are defined in such a way that--even if we don't enclose it in parentheses as above--the expression s:IS_CIRCLE(E) is guaranteed to be evaluated before the restriction specified in the WHERE clause is performed.  The overall effect is thus as desired. 

 

Abstracting somewhat from the foregoing example, then, we can say that the relational expression

 

( s : q ) WHERE p

 

is such that the q-filtering will definitely be done before the p-filtering (even if the parentheses are omitted); that is, the expression cannot legally be transformed into

 

( s WHERE p ) : q

 

or

 

s WHERE ( p ) AND ( q )

 

(and not into (s:p):q either, come to that). 

 

The relevance of the foregoing discussion to Gennick's example should be obvious, but let me spell the details out.  (So now I'm switching back to SQL--but I want to simplify the original example and abstract from it, somewhat.)  Consider the following SQL expression: 

 

SELECT * FROM ( SELECT * FROM s WHERE q ) WHERE p

 

Rather counter-intuitively (at least, so it seems to me), then, this expression is not just SQL syntax for

 

( s WHERE q ) WHERE p

 

(!)--precisely because, as we saw in the body of this note, it can't always be legally transformed into the SQL expression

 

SELECT * FROM s WHERE q AND p

 

In fact, the SQL standard makes it clear (albeit not in these terms, of course) that the expression under discussion--

 

SELECT * FROM ( SELECT * FROM s WHERE q ) WHERE p

 

--is really SQL syntax for

 

( s : q ) : p

 

So the problem with SQL (and I regard this state of affairs as yet another SQL deficiency) is that it provides syntax for (s:q):p but not for (s WHERE q) WHERE p.  Thus, users are effectively forced to write (s:q):p even when what they really mean is (s WHERE q) WHERE p.  And in Oracle, at least, the optimizer apparently assumes that users always mean the latter and therefore performs the logically incorrect transformation--which happens to work, most of the time, because most of the time users really do mean (s WHERE q) WHERE p!  A somewhat topsy-turvy situation, you might be forgiven for thinking. 

 

 

A FINAL OBSERVATION

 

It's worth noting that SQL does in fact support something along the lines of the s:IS_T(A) operator described briefly in the previous section.  Here's the query from that section expressed in SQL ("Get s rows where the E value is in fact a circle and the radius of that circle is greater than two"): 

 

SELECT *

FROM ( SELECT A, ..., TREAT E AS CIRCLE AS E, ..., Z

       FROM   s

       WHERE  TYPE ( E ) IS OF ( CIRCLE ) )

WHERE  THE_R ( E ) > LENGTH ( 2.0 )

 

Note:  I'm assuming here that s has attributes A, ..., Z in addition to the specific attribute E that's of interest to this particular query. 

 

Do note, however, how clumsy the SQL support is!  In particular, the operations of (a) picking out just the tuples of s in which the E value is a circle and (b) changing the type of attribute E in the result from ELLIPSE to CIRCLE--I'm speaking pretty loosely here--are expressed separately (in the WHERE clause and the SELECT clause, respectively).  By contrast, these two operations are bundled together in our syntax s:IS_CIRCLE(E) and cannot be separated.  As a consequence, there's scope for additional errors in SQL ... For example, the expression

 

SELECT *

FROM ( SELECT *

       FROM   s

       WHERE  TYPE ( E ) IS OF ( CIRCLE ) )

WHERE  THE_R ( E ) > LENGTH ( 2.0 )

 

fails on a type error (how, exactly?).

 

 

Posted 09/24/04