Sunday, October 19, 2014

Precision, Procedurality and SQL, Part 1



 by Erwin Smout and Fabian Pascal

"To be as precise as we possibly can is not a luxurious mannerism that the academic prig can afford himself in his (supposedly!) sheltered environment; for people facing the problems of "the real world" it is a Must." --E.W. Dijkstra, An Open Letter to L. Bass


From In Some Cases illustrating drawbacks of SQL in data computing and analytics
The computing power of SQL for mass structured data is complete, that is to say, it is impossible to find anything that SQL cannot compute. But its support layer is too low, which can lead to over-elaborate operation in practical application.
One of the four aspects of this "over-elaboration" is "computation without substep", but before we comment on it, the article glosses over an important matter.


Implementing data computation process based on a type of computation system is in fact the process of translating business problem into formalized computation syntax ...
Before the informal "business problem" can be formalized into data language syntax, business concepts must be mapped to database concepts--the table operation(s) that will yield the desired solution must be identified. The importance of this explicit step cannot be stressed enough, for without it exactly what will be expressed in formal syntax?

Consider the business problem "I need customer and order information". The operation to be applied depends on the logical design of the database. If, for example, there are two 5NF tables, CUSTOMERS and ORDERS, a join is in order; if, on the other hand,  there is a single denormalized table, a projection on specific columns must be applied. Furthermore, whether the specific operation makes sense depends on what the table(s) mean--their interpretation given by the business rules--which is not in the table(s) (see Data Analysts: Know Your Business Rules). What syntax we use to express, for example, the projection

SQL:
SELECT DISTINCT CustName,BirthDate
FROM Customers;
Tutorial D:
Customers {CustName,BirthDate};

SIRA_PRISE:
PROJECT(Customers,(CustName,BirthDate));
Hypothetical OO-style:
Customers.project(new List<String>("CustName","BirthDate"));
is really no big deal. What is a big deal in terms of "how to address/solve that problem" is the understanding that
  • A specific relational operation is applied;
  • The relational argument to this operation is R-table CUSTOMERS;
  • The projection is on specific columns of CUSTOMERS.
which is independent of the final language syntax.

Note: The reader is reminded that SQL tables may not be R-tables and SQL operations may not be true relational operations.
... [SQL's] model system is inconsistent with people’s natural thinking habit. It causes a great barrier in translating problems, leading to the case that the difficulty to formalize the problem solving method into computation syntax is much greater than to find the solution of the problem.
The real question here is whether users' procedural "natural thinking habit" is something to be encouraged, or "step-by-step" thinking should be, as much as possible, eliminated--at least when it comes to the formulation of the final solution. In this respect, relational algebra and SQL can be viewed as little more than an early incarnation of the idea of functional programming promoting the latter. Is functional programming a bad idea?

Be that as it may, the mathematical property of relational closure--R-tables are closed to relational algebra as numbers are closed to arithmetic--permits the nesting of operations: the result of any relational operation on R-table(s) is an R-table which can be operated upon further with the same operations. Thus, it's not SQL's non-procedurality that is deficient, but rather its somewhat cumbersome support of nesting, via explicit INSERT INTO...SELECT FROM... syntax.

Note: Procedurality is not absolute, but a matter of degree. The Tutorial D syntax above,for example, is slightly less procedural than SQL's.
SQL requires computation to be written out in one statement, and it is necessary to adopt storage procedure to implement computation step by step. No sub-step not only causes difficulty in thinking, but also makes it difficult to use intermediate result ... Carrying out complex computation step by step can reduce the difficulty of the problem to a great extent, conversely, collecting a multi-step computation into one to be completed in just one step increases the complexity of the problem.
A not so unimportant point to note here is that what is presented as a "difficulty of the problem", actually is the "difficulty of the solution"!!! That is, the difficulty for a user to understand the structure/nature of the solution once it has been spelled out in a piece of code. If the user is not capable of disambiguating between the problem and the solution, when such disambiguation is warranted, then any solution he/she devises will just become part of the problem, thus enlarging it.
The number of persons of the sales department, where the number of persons whose native place is NY, and where the number of female employees?
Aside: The author is not a native English speaker and this is not good English. Unfortunately, as Dijkstra, Date and others stressed more than once, clear and succint expression of problems is critical. Understanding a problem should include ability to express or communicate its nature clearly and precisely. If you understand a problem, but yet cannot do that, your mastery of (natural) language is poor, and that will equally cause problems when you are devising and communicating solutions to the problem you understand.

Here is the three-step SQL solution given by author:

Conventional thought ... The query each time is based on the existing result last time ... (snipped)
This simply isn't true. The three queries are separate and none is "based on the result last time". As confirmed by "... it is impossible to reuse the preceding result in answering the next question and it is only possible to copy the query condition once more." Contradictions make it hard to get the exact point.

So, first, to rephrase the issue targeted here: The three queries will issue three separate requests to the DBMS, each of which will process the same set of rows, only to apply a different filtering to them. That's redundant and will degrade performance.

Most, if not all commercial SQL systems "will cause three separate requests", but it does not have to be like that. The client library/programming API in SIRA_PRISE, for example, has an interface method that allows to ask .execQueries(query1, query2, query3, ...). Part of the reason that this is possible is because SIRA_PRISE also addresses and solves that other issue--"setlization is not complete"--one of the three others that the author identified (and which this response is not addressing). No reason why a SQL system could not offer a similar facility. Precompilers processing embedded SQL could detect three consecutive queries and issue one request. JDBC-like libraries could perfectly expose such a facility to the user as an invokable method much like SIRA_PRISE's.

Next, to "reuse the preceding result in answering the next question", the Tutorial D language allows to do exactly that (this may also be true of SQL/PSM, though I am not sure):
VAR onlysales INIT (employee WHERE department = 'sales');
VAR onlysalesny INIT (onlysales WHERE native_place = 'NY');
VAR onlysalesnyf INIT (onlysalesny WHERE gender = 'female');
? COUNT(onlysales);
? COUNT(onlysalesny);
? COUNT(onlysalesnyf);
Unfortunately, this is premature optimization, which is what usually happens when users get involved in it. A specific optimization--storing and re-using the result of a previous, similar query--is chosen as the best possible. But it will still do three separate loops (and I won't even mention the space complexity that stems from storing local copies of subsets of the database table), when a better optimization might perhaps be to do just a single loop and to do the three COUNT's during that in a single pass. In other words:

(a) For each row, first determine its eligibility for the three distinct counts. SQL has CASE expressions for this (syntax may not be fully accurate):
SELECT CASE WHEN department = 'sales'
        THEN 1 ELSE 0
       END AS countfordept,
       CASE WHEN department = 'sales' AND native_place = 'NY'
        THEN 1 ELSE 0
       END AS countfornatp,
       ...
(b) Then sum up the three COUNTFOR columns.

Variations on the theme are obviously possible, e.g. the department = 'sales' predicate can be moved to the WHERE clause to suggest to the system trying to exploit the possible existence of an index on this column. The main point here is that it is perfectly doable to achieve "single pass, multiple counts" in SQL.

To the objection that this "causes serious deformation in the train of thought", we have two replies. First, the "train of thought" that precedes the construction of a SQL query can still perfectly be documented in prose. In fact, this is highly desirable, because it is unrealistic to hope that complex, that is, long queries can be understood by later users from just the SQL syntax itself. It is desirable for any code snippet to be as self-documenting as reasonably possible (avoiding extra documentary prose that must then also be maintained alongside the actual code), but the key word there is 'reasonable', and somewhere somehow, one is always doomed to bump into limits of reason with this.

Second, standard SQL has FILTER clauses with aggregate functions such as COUNT to achieve exactly what is sought here. Here's the single-pass-three-counts query in standard SQL:
SELECT COUNT(*) AS DEPT_COUNT,
       COUNT(*) FILTER (WHERE native_place = 'NY') AS nycount,
       COUNT(*) FILTER (WHERE native_place = 'NY' AND gender = 'female')AS nyfemale count
FROM employee
WHERE department = 'sales';
Very often complaints that "SQL cannot do <X>" (substitute anything for <X>), really mean that  a particular DBMS"cannot do <X>", because it has (deliberately, most of the time--and underlying reasons can vary greatly) not implemented some feature that does exist in the standard. Or, worse, even if the DBMS is capable of doing <X> the problem is poor mastery of it, caused either by not keeping up with its evolutions, or by not bothering to even once to inspect the "what's new" section in the manual of a new release, and so on.

To illustrate, the author used Oracle-based SQL syntax "because it does a good job of complying with SQL:2003". 2003 is 11--eleven!--years ago. I'm actually curious to know whether she can tell, right off the bat, how many officially released versions of the standard there have been since 2003, and what were the main areas/issues addressed in each such version, let alone which features Oracle has implemented, if any. We fear that question is rhetorical. In other words, cases of "They don't know how to dance and blame the crooked floor".

This also exposes one of the consequences of product use absent foundation knowledge. Such knowledge enables a data professional to distinguish between
  1. the relational model 
  2. SQL as a (poor) concretization of the model 
  3. the SQL standard 
  4. a specific implementation that is neither complete, nor fully compatible with it
and inhibits user-driven product improvements and technology progress.




No comments:

Post a Comment

View My Stats