Wednesday, August 29, 2018

DISTINCT and ORDER BY Are Not Relational




“One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query ... most people quickly understand:
   
SELECT DISTINCT length
FROM film

[that] returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates) ... Most people also understand:
   
SELECT length
FROM film
ORDER BY length

[that] will give us duplicates, but in order ... And, of course, we can combine the two:
   
SELECT DISTINCT length
FROM film
ORDER BY length

[But if] somewhat intuitively, we may want to order the lengths differently, e.g. by title:
   
SELECT DISTINCT length
FROM film
ORDER BY title

[m]ost databases [sic] fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight ... this

SELECT length
FROM film
ORDER BY title

works after all ... So, how are these different? We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.”
--How SQL DISTINCT and ORDER BY are Related, Jooq.org


------------------------------------------------------------------------------------------------------------------
SUPPORT THIS SITE 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 


NEW PUBLICATIONS 

NEW: The Key to Relational Keys: A New Perspective

NEW: SOCIAL MEDIA 

I deleted my Facebook account. You can follow me on Twitter:

  • @dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back.
  • @ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.
HOUSEKEEPING

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
------------------------------------------------------------------------------------------------------------------
  
Insofar as the IT industry is concerned, SQL DBMSs are relational -- claim otherwise and you will be dismissed as a "purist" or "zealot", and subjected to ridicule. But that's because neither SQL authors, nor users have had a sufficient understanding of the RDM and its advantages[1]. A good indication is that even though SQL is claimed to be the "de-facto relational data language", it is taught and used in a purely syntactic manner, devoid of relational context, which obscures its relational infidelities, and inhibits understanding of their deleterious consequences, which are absurdly blamed on the RDM.

Case in point: you would not know it from the article, but neither DISTINCT, nor ORDER BY would exist in a relational data sub-language[2] -- that is their most significant common property.


Relational Closure


Similar to numbers being closed under arithmetic -- arithmetic operations applied to numbers produce numbers -- database relations are closed under the relational algebra (RA): RA operations applied to relations produce relations[3]. And just like arithmetic breaks down when applied to anything other than numbers, RA breaks down when applied to anything other than relations, in which case, all bets are off:
  • Relational closure is not preserved;
  • RA operations are not nestable (i.e., results are not amenable to re-operation);
  • Logical validity and semantic correctness cannot be system-guaranteed[4];
  • Information can be lost.

Of the several defining (i.e., required) properties of relations[5], two are:

  • No duplicates[6] -- every relation has a primary key (PK)[7];
  • No significant order of tuples and attributes -- database design does not encode information in either of the two orders[8].

They are implicitly enshrined in two of Codd's famous 12 rules (there are actually 13)[9]:

  • The Guaranteed Logical Access rule requires that every data value in a relation is logically accessible by relation name + PK value + attribute name;
  • The Information Rule -- now known as as the Information Principle (IP) -- requires that all information in a relational database be represented explicitly and exclusively as attribute values in relations.

DISTINCT: Non-relational Operation


The SQL statements above query the FILM SQL table of the Sakila database for film lengths -- they execute a projection of FILM on LENGTH.  The meaning of the result depends on the relationship between films and lengths:
  • If every film has a unique length, the relationship is 1:1 (functional dependency): each result tuple represents the length of one film;
  • If there are multiple films with the same length, the relationship is M:1, and each result tuple represents the length for a group (set) of films (some of which may consist of a single film).

But in both cases a relational projection on FILM would produce a relation that has a PK and is devoid of duplicates.


Note: In case 1, the relationship between LENGTH and FILM.FILM_ID is the same functional dependency of FILM.LENGTH on FILM.FILM_ID, so a true RDBMS would infer that result.LENGTH is PK, and is functionally dependent on the FILM.FILM_ID projected away (i.e., a multi-relation constraint). No SQL DBMS makes such inferences, or supports constraint inheritance, necessary for semantic correctness[10].

Given that RA operations produce relations (closure), DISTINCT is superfluous in a relational system (i.e., a true RDBMS and relational databases). It exists in SQL precisely because SQL DBMSs violate closure by producing results such as the second SQL statement -- bags with duplicates, rather than relations -- one of their many relational infidelities[11].
“Relations are sets. In set theory, the members of a set (tuples in a relation) are not treated as individual things, but as symbols (i.e., names) of those things (PK values in relations). If we name a tuple for inclusion more than once, the result is the same as naming it once -- there cannot be duplicates to be eliminated. A RA operation on one or more relations does not create a bag, and then reduces it to a relation by eliminating the duplicates.”  --David McGoveran
In this sense, DISTINCT is non-relational.

Note: The execution of operations in SQL that create and remove duplicates (e.g., "database can and might apply hashing rather than ordering to remove duplicates") is typically an incorrect implementation due to an accident of history.
“Computing {1} union {1}, for example,  can be executed without creating any "intermediate" duplicates that would then have to be eliminated -- when "adding" the set member 1 to the result, the "addition" is modulo existence, so if 1 is already a member of the set, the "addition" is a no-op -- it is skipped.” --David McGoveran

ORDER BY: Application Operation


The order of the result of the first SQL statement is arbitrary in compliance with the IP prohibition of no significant order (implementation details such as hashing are irrelevant as long as they are not exposed to users and application[12]). But missed here entirely is what I explained just a week ago: ORDER BY is about display order (distinct from storage order), which is not significant, in the sense that if it changes, no information is lost.
“The row-column arrangement [of a R-table] on [some] the display medium [paper or screen] is a property of the tabular presentation of relations, and presentation is an application, not DBMS function, outside the purview of the RDM.”[8]

Note: Insignificant display order should not be confused with the significant order of duplicates in bags produces by some proprietary (e.g., Oracle and DB2) non-relational SQL operations that violate closure and the IP and, thus, lose both information if results are reordered (including by ORDER BY), and nestability. Moreover, in SQL, the ordering of attributes is significant.[8]

The author urges users to focus on "the logical order of SQL operations (as opposed to the syntactic order) [and] the actual order executed by the optimiser". Unfortunately, this requires understanding data and relational fundamentals, which is precisely what data practitioners -- be they DBMS designers or users -- lack -- SQL is taught and used syntactically and in the implementation, not relational context -- hence a discussion of DISTINCT and ORDER BY that fails to indicate they are not relational/DBMS operations, the mix of which is responsible for quirks and confusion. Any wonder that SQL problems are blamed on the RDM?


References

[1] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[2] Pascal, F., Natural, Programming and Data Language.

[3] Pascal, F., Relational Fidelity, Cursors and ORDER BY.

[4] Pascal, F., Object Orientation, Relational Database Design, Logical Validity, and Semantic Correctness.


[5] Pascal, F., What Relations Really Are and Why They Are Important.

[6] Pascal, F., Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant.

[7] Pascal, F., The Key to Relational Keys: A New Understanding.

[8] Pascal, F., Order Is for Society, Not Databases.

[9] Pascal, F., Interpreting Codd: The 12 Rules.

[10] Pascal, F., The 'Real World' and Database Design.

[11] Pascal, F., SQL Sins.

[12] Pascal, F., Physical Independence Parts 1,2,3.



Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.



No comments:

Post a Comment

View My Stats