Sunday, August 13, 2017

Relational Fidelity, Cursors and ORDER BY

Here's what's wrong with last database picture, namely:
"In a book I am reading (QUERYING SQL SERVER 2012) the author talks about theory of how databases work. He mentions relations, attributes and tuples etc. He frequently stresses the fact that some aspect of T-SQL is not relational. Like in the following excerpt:
"T-SQL also supports an object called a cursor that is defined based on a result of a query, and that allows fetching rows one at a time in a specified order. You might care about returning the result of a query in a specific order for presentation purposes or if the caller needs to consume the result in that manner through some cursor mechanism that fetches the rows one at a time. But remember that such processing isn’t relational. If you need to process the query result in a relational manner--for example, define a table expression like a view based on the query--the result will need to be relational. Also, sorting data can add cost to the query processing. If you don’t care about the order in which the result rows are returned, you can avoid this unnecessary cost by not adding an ORDER BY clause."
I would like to know, since every implementation of SQL pretty much has an ORDER BY clause which makes it non-relational, why does it even matter that (the set after ORDER BY is used) its not relational anymore since its like that everywhere? I can understand if he said it was non-standard, for example using != instead of <> for inequality because that affects portability etc., but I do not understand why something is better being relational. Please enlighten." --stackoverflow.com

Inevitably, logical-physical confusion (LP) contaminated the issues with physical implementation considerations:
"And cursors and row-by-row processing are much slower than set-based operations."  
"The author IS saying that ORDER BY makes it non-relational. He has mentioned this many times. It seems the reason the author keeps mentioning it is probably because he wants the readers to differentiate and later understand performance implications."
implying that relational fidelity must be compromised for satisfactory performance--a canard that has been amply debunked and is dismissable out of hand.

One response got it some of it right:

"It sounds like the author is referring to the use of set theory (which is in his terms "relational") vs. processing row by row with a cursor or similar method. Accessing the data in a relational way allows the database engine to perform selects/joins/sorts/orders/etc. on entire sets of data where as a cursor will only process one row at a time. As far as I know, adding an ORDER BY clause does not make a query "non-relational", the author is just noting that if you do not care about the order of result set, you can leave that clause out of your query and the database engine will return the data in whatever order it ended up processing it in."
But there is much more to relational fidelity than set-at-a-time operation; and a RDBMS performs restricts, joins and other RA operations, but sorts/orders are not such. 

Relational Fidelity


The proper interpretation of Codd's RDM is as follows (I highlight the differences from the current understanding).

A relational database consists of related relations the design of which adheres to

  • The Principle of Orthogonal Design (POOD);
  • The Principle of Representational Minimality (PORM);
  • The Principle of Expressive Completeness (POEC);
There is a (so far unproven) McGoveran conjecture that these principles subsume the Principle of Full Normalization (POFN).

A relation is

  • A set of n-valued keyed unordered tuples, which are
- sets of values of uniquely named unordered attributes defined on simple domains;
- without missing values;
  • Fully normalized (5NF) by definition--i.e., represents a single object group;
A RDBMS
  • Has a FOPL-based relationally complete data sub-language;
  • Constrains relations and the database
- to obey the above relational discipline;
- to be consistent with the captured business rules;
  • Applies the set-at-a-time operations of the relational algebra (RA) to derive new relations;
Codd's Rule 0--the Information Principle (IP)--mandates that all information in a relational database is represented explicitly and in only one way--as values of attributes in relations--mndate which, if satisfied, yields core RDM advantages:
  • Physical and logical independence;
  • FOPL-based data sub-languages that are
- declarative;
- decidable;
- expressively complete;
- simpler (than higher order logic);
  • System-guaranteed logical and semantic correctness of query results;
RA operations on relations produce--by definition--relations as results.

Non-relational Operations and Results


  • Non-RA operations are non-relational;
  • Non-relational results are not relations--i.e.,
- Are not properly constrained
- do not obey relational discipline--i.e., have duplicates, attributes with duplicate names, or nameless, meaningful ordering of tuples, or attributes, or missing values;
- are not consistent with the business rules;
- Are not fully normalized;
It follows that non-relational results are produced by
  • RA operations on non-relations;
  • Non-RA operations;
that violate the RDM and defeat relational advantages. 


Guaranteed Order


Meaningful order of tuples or attributes conveys information and is subject to the IP. To comply, information implicit in the order should appear explicitly as attribute values in tuples instead. Databases that encode information in the order of tuples or attributes violate the IP and the RDM:
  • Do not consist exclusively of relations;
  • Information is
- "hidden" from the DBMS and the data sub-language and inaccessible to RA operations;
- lost it if the database is reorganized in storage, or during result processing;
- PI is defeated and performance optimization inhibited if the order is to be preserved;

DBMS vs. Application Functions


At the logical level a RDBMS is responsible, via the FOPL-based relational data sub-language, for the data management DBMS functions--data definition (structure), retrieval (manipulation) and constraint enforcement (integrity). Result preparation and presentation is the responsibility of applications via computationally complete host languages not limited to FOPL. The DBMS retrieves via the RA relations as sets for applications, which rely on a host language for any non-RA operations to process and present them to users.


Cursors


Cursor operations are tuple-at-a-time and rely an application-imposed tuple ordering. As such, they are not RA operations--non-relational--and not expressible in a FOPL-based relational data sub-language. They and the ordered sets they operate on belong in the application and its interface to the RDBMS, not in the RDBMS. An application can, if necessary, use the cursor mechanism to step tuple-at-a-time through a set and present it to the user in the desired order, with the DBMS retrieval function preserving relational advantages.


ORDER BY


One reply stated categorically that "ORDER BY absolutely makes a result set NON-RELATIONAL", while another argued the opposite:
"That's simply ridiculous. Let's say that for a big fluke the query engine returns the table already sorted, this would be relational. Instead, if you explicitly ask for such sorting (obtaining the same result) it's not relational anymore. It doesn't make any sense. If the order of the rows is irrelevant in the relational model then: why having a particular order makes it not-relational? Please help me understand."
Both lack the proper perspective. It is important not to confuse the relational fidelity of an operation and a result.

On the one hand, if and when SQL produces a relation as a result--not always  the case--ORDER BY only orders it for presentation purposes--think of it as a report--and presentation being an application, not DBMS function, is not required to conform to relational discipline. This is why even SQL allows an ORDER BY only on the "outermost" SELECT--the final result that will be presented to the user. In other words, a non-relational result presentation does not render the relational result itself non-relational--if the results needs to be operated upon, the DBMS operates on the set, not on the ordered presentation (see below).

On the other hand, ORDER BY is not a RA operation, for which reason it should not be included in a relational data sub-language as it is in SQL, but left to computationally complete application languages, or to a result/report formatting language supported by the DBMS.



Relational Closure


It is critical that query results--distinct from their presentation to applications--be relations.

Arithmetic operators operate on numbers and produce numbers, which can, in turn, be further operated upon by arithmetic operators. In other words, arithmetic operations are closed over the set of numbers and, therefore, can be nested. Consider what would happen if arithmetic operators produced anything other than numbers as results: arithmetic closure and nestability would break down and all bets would be off computationally.

Similarly, RA operators operate on relations and produce relations, which can, in turn be operated on by RA operators--i.e., RA operations are closed over relations and nestable. Nestability allows us to apply sequences of RA operations, build sophisticated RA expressions, narrow down results and so on. Relational closure and nestability break down if results are not relations and all bets are off.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment