ON DUPLICATES
by C. J. Date

 

 

 

From: LP 
To: Editor

 

I have tried to review your article Double Trouble, Double Trouble as carefully as I can.  Here are some questions, remarks, and general thoughts: 

 

3) I fail to understand how "R join S" gives a different answer form "S join R" in SQL. You might mean the possibility to specify "SELECT *" but that is hardly a major defect and be easily avoided?

 

4) On rereading Codd's "V2" book I noticed quite a lot on duplicates.  There was some discussion on the need of a statistical function (DOD) when relations are projected out of sensitive data (sections 23.2.2 and 23.2.4).  Should this be discussed in the article?  Perhaps it would just add unnecessary clutter?

 

5) I would think that the consensus among "serious minded" DB-designers is that tables should always have a primary key.  To me it is perfectly incomprehensible that somebody should argue otherwise.  I think the biggest concern is the (alleged!) performance implications of duplicate removal.  It came to my mind that one way to look at the issue is to compare it to the GOTO less programming episode that started with Dijkstra's article in 1968. Replace "programmer" with "Query Issuer" and "Goto's" with "Duplicate rows" in the following discussion:

 

"Programmers used to use GOTOs, because that seemed like a natural way to transfer control to another place in the program. Dijksra declared GOTOs harmful in his landmark article of 1968, giving ways to eliminate them. Programmers resisted, because they felt that they could not do all necessary things without GOTOs.  They felt something had been taken away from them. However, programs with GOTOs were much harder to understand, and on top of it all the compilers where bigger, buggier and had a harder time optimizing the code. Little by little new programming methodologies appeared that did away with GOTOs and some newer programming languages did not even have GOTO's in the first place (Pascal, ADA, ...). The old programmers had a hard time not using GOTOs, but the new ones where not even told about GOTOs and they managed to make perfectly decent programs without them (to say the least!)."

 

Of course there are some differences:

 

·   Duplicates where known to be harmful from the very start--I suspect that Ted Codd did not even dream that he would encounter them in products. 

·   No new languages--SQL still dominates. 
 

I myself have now tried to orient myself into the new way of thinking (not that I had previously relied on duplicates on any way) and I have noticed that the duplicate elimination problem goes away as long as I make semantically sensible queries. The view of databases as big barrels from which data is scooped with a mug is so pervasive.  The right analogy would be an oracle (!) to which questions are asked - ask a stupid question and you get a stupid answer!

 

6) I think the problems with defining a Cartesian product on tuple-bags was quite interesting - I had not been aware of it before.

 

7) As for the final plea ... I think it is naïve, to say the least. I think it should be acknowledged that SQL is a lost cause, and there is no power in the world that can fix it. Even very slight changes cause lots of pain for users and this would mean big investments and tricky product updating for companies that don't have duplicate removal in their interest at all. I would be happy if at least we could have a consensus on the matter, in a way everybody admitting "Smoking is harmful", but not necessarily assuming that people will quit smoking or tobacco companies stopping to make tobacco. 
 
 

From: Chris Date
To: LP 

 

The following responses are keyed to your numbering. 

 

1.      Done. Thank you.

 

2.      I think my point (actually Fabian's point) is still valid, even if what you say is true.

 

3.      R JOIN S and S JOIN R give different results in SQL because the left-to-right column orderings are different. The implications can be quite nasty.

 

4.      Regarding Codd's book on RM/V2: First of all, I have to say that, with all due respect to Ted Codd, I don't think very much of his book. In fact I wrote a detailed and rather negative review of it--see A Critical Review of the Relational Model Version 2 (RM/V2), in my book RELATIONAL DATABASE WRITINGS 1989-1991. You're quite right in saying the book had a lot on duplicates!--in the review just mentioned, I said "I lost count of the number of times [the book] told me that duplication was not permitted" (this was meant to be a joke, though I suppose it isn't very funny). As for the DOD function, I could never really see the point. After all, it's only shorthand. For example:

 

WITH (s RENAME CITY AS c) AS x :
EXTEND s {CITY} ADD COUNT (x WHERE c = city) AS dod;

 

I definitely don't want to get into a discussion of what seems to me to be rather a small point in my article.

 

5.      I like this parallel. With your permission, I'd like to include an edited version of it in my article. Would you mind?

 

6.      The "tuple-bag Cartesian product problem" comes from Appendix D (Some Outstanding Issues) of C. J. Date and Hugh Darwen: A GUIDE TO THE SQL STANDARD. No apologist for SQL has ever refuted our argument (at least, not to us, anyway).

 

7.      I don't think the plea is all that naïve. See the conclusion to Part 1 of the article.

  
 

From: LP 

I realized that you where not very impressed by Ted's V2-book, but I did not know you thought so little of it. It's sad, really...

 

On point 3:  I would just like to ask, if the problem appears only if one uses "SELECT *" or are there other circumstances where the ordering is a problem?

 

On point 7:  Ah, well... that's only my opinion. I hope you are right!

 

One more comment:  Maybe a reference could be added to the article The Role of Functional Dependence in Query Decomposition by Hugh Darwen.  I think it is an excellent piece of work!  (I don't know if it logically fits in the article). 
 
 

From: Chris Date

A few responses: 

 

Re the RM/V2 book: Yes, it is sad.

 

Re left-to-right column ordering in SQL: The problem is much worse than just "SELECT *" (though that problem is bad enough). It's because there's a left-to-right column ordering that SQL assumed from the outset that columns were identified by ordinal position. That assumption, though never spelled out explicitly perhaps, led directly to the phenomena of tables with columns that had non-unique names and tables with columns that had no names at all!--a monstrous violation of Chris Date's Naming Principle.

 

I agree with you on Hugh's functional dependence paper but don't see the direct relevance to the article.

 

 

From: LP 
 

Re left-to-right column ordering:  OK, I get the picture.

 

Obviously, what would also be of immense interest (at least for me) would be a (personal?) account of the very beginnings of the RM-development e.g. how you were introduced to it, what did you think at first, how things started rolling etc. Also a personal account of "The Great Debate" would be interesting.
 
 

From: Chris Date
 

Further to my last reply: I've thought of something else I should have said, but didn't, in answer to your question as to why left-to-right column ordering in SQL is such bad news. The point is, left-to-right column ordering implies that, e.g.,

 

RELATION (A AT,B BT,C CT) 

and

 

RELATION (B BT,C CT,A AT)

 

(to use Tutorial D-like notation) are different relation types! Consider the effects of this fact on, e.g., the rules regarding union and intersection! Consider also the effects on relation type inference; e.g., when you've figured out how to state the rules regarding union on relations of the two types shown above, what type exactly does the result of such a union have? (I.e., what left-to-right order do the columns of that result appear in?) Etc., etc., etc.


 

From: LP 
 

I finally received the second draft of the duplicates article, thanks!

 

I am flattered to have my quote published (the GOTO stuff) but at the same time a bit wary of my incomplete command of the English language.  English was my mother tongue, but I haven’t been speaking it regularly since childhood.

 

Maybe one thing that might come to mind by somebody reading the article:  If the duplicates are so hard to count, as mentioned, how does SQL manage to all the same count duplicate rows in e.g. SELECT COUNT(*) FROM t1? It would seem that problems arise when you try to remove or update one of the duplicates and, of course, join them to other tables etc...

 

Have you had time to take a look at the Approximate Join stuff or the Logic paper? (Still holding my breath!)

 

 

Posted 02/23/02

 

 

 

[ABOUT] [QUOTES] [LINKS]