Sunday, November 30, 2014

SQL's Incomplete Set-lization, Part 2




by Erwin Smout


[FP: Two weeks ago I posted a debunking of an article blaming some SQL sins. Erwin has some additional comments.]

1. Multisets


From the original article:
It is beyond any doubt that set is the basis of mass data computation. Although SQL has the concept of set, it is limited to describing simple result set, and it does not take the set as a basic data type to enlarge its application scope.
Sidestepping several possible nitpicks here, such as e.g., that SQL allows duplicate rows and thus, in its basic form, has bag, not set algebra, the intention behind the complaint here is mostly accurate.


The problem in the case presented is "Employees in the company whose birthday are the same as those of others", which can be rephrased as "Get the groups of people that are born on the same day-of-year, wherethe size of the group is >1 (i.e. filter singleton groups out) and keep the grouping intact". The SQL presentedby the author to address the problem as best as possible is as follows:







The intent is obvious:

(1) Get the set of MMDD dates that have >1 person born on that day;
(2) Get the set of persons whose MMDD birthday is in that set.

The author takes issue with the fact that this SQL query returns the table
+----+------+
+ ID + MMDD +
+----+------+
! P1 ! 1231 !
! P2 ! 1231 !
! P3 ! 0401 !
! P4 ! 0401 !
! P5 ! 0401 !
+----+------+
instead of the table
+--------+------+
+ IDS    + MMDD +
+--------+------+
! +----+ ! 1231 !
! ! ID ! !      !
! +----+ !      !
! ! P1 ! !      !
! ! P2 ! !      !
! +----+ !      !
! +----+ ! 0401 !
! ! ID ! !      !
! +----+ !      !
! ! P3 ! !      !
! ! P4 ! !      !
! ! P5 ! !      !
! +----+ !      !
+--------+------+
and remarks that "SQL cannot describe this kind of "set consisting of sets".

It is true that no existing SQL implementation supports this query, but SQL can and the standard does. Here is a formulation for getting the latter, 2-row, result table :
SELECT TO_CHAR(birthDay,'mmdd') as mmdd,
 CAST (COLLECT(ROW(id))
 AS ROW (id INTEGER) MULTISET) AS IDS
FROM exam_mark
GROUP BY TO_CHAR(birthday,'mmdd');
For more detail, readers are referred to "Grouping and Ungrouping in SQL", section 5.7 of Hugh Darwen's free downloadable book SQL: A COMPARATIVE SURVEY.

(Disclaimer: Since there does not exist any SQL compiler to feed this expression to, the syntactic correctness of the example cannot be tested. The only point is that for SQL as defined by ISO 9075 the author's claim is false. Incidentally, the author used Oracle SQL "because it did such a good job of complying with SQL:2003". We find, therefore, the following historical note about COLLECT/UNNEST in the above-mentioned book quite ironical:
UNNEST first appeared in SQL:1999, though in that edition it was used only with arrays, not with multisets. COLLECT and FUSION first appeared in SQL:2003.
The author concludes this first part of her issues with SQL's "incomplete set-lization" with:
At this time, it is necessary to use from the source set the condition obtained from grouping to query again, so sub-query appears again unavoidably.
We feel the author's pain and sympathize, but had the author known the international standard form of the data manipulation language she uses every day (and had this been true for the majority of the users of said language), then the user community might have demanded that vendors implement this feature [FP: See my note at the end].


2. Quota Queries


The second part of the author's remarks regarding incomplete set-lization is (a.o.) on the subject of quota queries.

The statement of the original problem is "Find out students whose scores ranks in top 10 for all subjects". That looks like synonymous to "Find out students who didn't score any ranking >10 in any subject". Observe that there's also a restricting predicate in there that involves a "for all" term. In other words: this is not only a problem of quota queries/ranking, it also involves universal quantification, which smacks of relational division.

Anyway, the set-lization comes from the fact that the ranking is to be done on a subject-per-subject basis: each individual subject has its own nr. 1 student, its own nr. 2, etc. So this is a combined problem of GROUPing with R-table-valued attributes as in the foregoing sense, ranking those R-table-valued attributes, and then filtering out all the students who had any rank anywhere >10.

Given the level of support in SQL for any one of these three aspects, it should not come as a surprise that the query is hard to write, or, once written, hard to decrypt because the "train of thought" has been broken. That said, I do think that the SQL standard does have a lot to offer to address (at least most of) the concerns of the article's author. Except the ranking/quota queries part, that is.

So let's first delve into that. Quota queries and ranking problems have long been an issue of thorough investigation. And even so, it remains a thorny problem. Especially the draws. If there are two equal winners, do we both place them at place 1? Or at place 2?  Should we give the user the option to specify "no draws allowed" and if draws do occur, fail the query? And so on. More than enough papers to read about the details [FP: See the chapter on the subject in my PRACTICAL ISSUES IN DATABASE MANAGEMENT, available via the HOME page). And it is true that SQL still hasn't got any form of direct support for doing quota queries. Major part of the reason is, no doubt, that the vendors beat the committee to the spec. Oracle apparently has a RANK() function, SQL Server has a TOP n SELECT clause--syntactic ambiguity prevails.

While these problems are posed under the heading of "incomplete set-lization", ranking has in fact very little to do with set-lization: sets have no inherent order and to pretend that they do, is to flout the basic underpinnings of the relational model! (Note that using an attribute reflect a ranking does not fall in the "pretend that they do" category. The sets {(A 1)(B 2)} and {(B 2)(A 1)} are still the same set.

So, assuming the author's main issue is with how the actual formulation of the final solution fails to reflect "train of thought" (that is, to be sufficiently self-documenting), the problem can be addressed using the standard's WITH clause (the part that does the ranking is, of necessity, vendor-specific). The standard goes to great lengths trying to pin down precisely what is possible and what is not. "Great lengths" in this case is :
  • 8 pages of detailed syntax rules
  • 4 more pages of "general rules"
The 12 pages include a.o. the following gem:
"WQEi shall not contain a <query expression> TSQ that contains a <query name> referencing WQNj, unless TSQ is simply contained in a <derived table> that is immediately contained in a <table primary> that is immediately contained in a <table reference> that is immediately contained in a <from clause> that is immediately contained in a <table expression> that is immediately contained in a <query specification> that constitutes a <simple table> that constitutes a <query primary> that constitutes a <query term> that is immediately contained in a <query expression body> that is WQEi."
Got that? If not, the gist of all that is "once a table is defined by a WITH clause, it is available for reference in whatever follows in the rest of the SQL statement".

So:
WITH ranked_per_subject
 AS SELECT name,subject,RANK() OVER (PARTITION BY ...) AS ranking
    FROM scores,
stud_with_any_rank_gt_10
 AS SELECT DISTINCT name
    FROM ranked_per_subject
    WHERE ranking > 10,
 SELECT DISTINCT name
 FROM ranked_per_subject
 WHERE name NOT IN stud_with_any_rank_gt_10;
where:
  • SCORES is a database R-table;
  • RANKED_PER_SUBJECT is a name for a R-table defined by the 1st SELECT expression;
  • STUD_WITH_ANY_RANK_GT_10 is the name for a R-table defined by the 2nd SELECT expression;
  • The 3rd SELECT expression is the query using the two introduced table names.
And here's how the query follows the train-of-thought :
  • Lines 1-3: Construct table (1) with students' rankings per subject;
  • Lines 4-7: Construct table (2) of all students with any ranking >10;
  • Lines 8-10: SELECT the names that are only in (1) and not in (2).
[FP: I can only reiterate the conclusion of the previous installment:

This exposes one of the consequences of product use and quality absent foundation knowledge. Such knowledge would enable data professionals--both users and vendor personnel--to distinguish between:
  • The relational model 
  • SQL as a (poor) concretization of the model; 
  • The SQL standard;
  • SQL implementations.
which would be conducive to better tools and their improvement via user demand. As it is, confusion and misplaced blame reign and there are no incentives for improvement.




No comments:

Post a Comment

View My Stats