From: LP
To: Editor
I have tried to review your article Double
Trouble, Double Trouble as carefully as I can. Here are some
questions, remarks, and general thoughts:
…
3) I fail to understand how "R join S" gives a
different answer form "S join R" in SQL. You might mean the
possibility to specify "SELECT *" but that is hardly a major defect
and be easily avoided?
4) On rereading Codd's "V2" book I noticed quite a
lot on duplicates. There was some discussion on the need of a statistical
function (DOD) when relations are projected out of sensitive data (sections
23.2.2 and 23.2.4). Should this be discussed in the article?
Perhaps it would just add unnecessary clutter?
5) I would think that the consensus among "serious
minded" DB-designers is that tables should always have a primary
key. To me it is perfectly incomprehensible that somebody should argue
otherwise. I think the biggest concern is the (alleged!) performance
implications of duplicate removal. It came to my mind that one way to
look at the issue is to compare it to the GOTO less programming episode that
started with Dijkstra's article in 1968. Replace "programmer" with
"Query Issuer" and "Goto's" with "Duplicate rows"
in the following discussion:
"Programmers used to use GOTOs, because that seemed
like a natural way to transfer control to another place in the program. Dijksra
declared GOTOs harmful in his landmark article of 1968, giving ways to
eliminate them. Programmers resisted, because they felt that they could
not do all necessary things without GOTOs. They felt something had been
taken away from them. However, programs with GOTOs were much harder to
understand, and on top of it all the compilers where bigger, buggier and had a
harder time optimizing the code. Little by little new programming methodologies
appeared that did away with GOTOs and some newer programming languages did not
even have GOTO's in the first place (Pascal, ADA, ...). The old programmers had
a hard time not using GOTOs, but the new ones where not even told about
GOTOs and they managed to make perfectly decent programs without them (to say
the least!)."
Of course there are some differences:
·
Duplicates where known to be harmful from the very
start--I suspect that Ted Codd did not even dream that he would encounter them
in products.
·
No new languages--SQL still dominates.
I myself have now tried to orient myself into the new way of
thinking (not that I had previously relied on duplicates on any way) and I have
noticed that the duplicate elimination problem goes away as long as I make
semantically sensible queries. The view of databases as big barrels from which
data is scooped with a mug is so pervasive. The right analogy would be an
oracle (!) to which questions are asked - ask a stupid question and you get a
stupid answer!
6) I think the problems with defining a Cartesian product on
tuple-bags was quite interesting - I had not been aware of it before.
7) As for the final plea ... I think it is naïve, to say the
least. I think it should be acknowledged that SQL is a lost cause, and there is
no power in the world that can fix it. Even very slight changes cause lots of
pain for users and this would mean big investments and tricky product updating
for companies that don't have duplicate removal in their interest at all. I
would be happy if at least we could have a consensus on the matter, in a way
everybody admitting "Smoking is harmful", but not necessarily
assuming that people will quit smoking or tobacco companies stopping to make
tobacco.
From: Chris Date
To: LP
The following responses are keyed to your numbering.
1.
Done. Thank you.
2.
I think my point (actually Fabian's point) is still valid,
even if what you say is true.
3.
R JOIN S and S JOIN R give different results in SQL because
the left-to-right column orderings are different. The implications can be quite
nasty.
4.
Regarding Codd's book on RM/V2: First of all, I have to say
that, with all due respect to Ted Codd, I don't think very much of his book. In
fact I wrote a detailed and rather negative review of it--see A Critical
Review of the Relational Model Version 2 (RM/V2), in my book RELATIONAL DATABASE
WRITINGS 1989-1991. You're quite right in saying the book
had a lot on duplicates!--in the review just mentioned, I said "I lost
count of the number of times [the book] told me that duplication was not
permitted" (this was meant to be a joke, though I suppose it isn't very
funny). As for the DOD function, I could never really see the point. After all,
it's only shorthand. For example:
WITH (s RENAME CITY AS c) AS x :
EXTEND s {CITY} ADD COUNT (x WHERE c = city) AS dod;
I definitely don't want to get into
a discussion of what seems to me to be rather a small point in my article.
5.
I like this parallel. With your permission, I'd like to
include an edited version of it in my article. Would you mind?
6.
The "tuple-bag Cartesian product problem" comes from
Appendix D (Some Outstanding Issues) of C. J. Date and Hugh Darwen: A GUIDE TO THE SQL
STANDARD. No apologist for SQL has ever refuted our argument
(at least, not to us, anyway).
7.
I don't think the plea is all that naïve. See the conclusion
to Part 1
of the article.
From: LP
I realized that you where not very impressed by Ted's V2-book,
but I did not know you thought so little of it. It's sad, really...
On point 3: I would just like to ask, if the problem
appears only if one uses "SELECT *" or are there other
circumstances where the ordering is a problem?
On point 7: Ah, well... that's only my opinion. I hope
you are right!
One more comment: Maybe a reference could be added to
the article The Role of Functional Dependence in Query Decomposition by
Hugh Darwen. I think it is an excellent piece of work! (I don't
know if it logically fits in the article).
From: Chris Date
A few responses:
Re the RM/V2 book: Yes, it is sad.
Re left-to-right column ordering in SQL: The problem is much
worse than just "SELECT *" (though that problem is bad enough). It's
because there's a left-to-right column ordering that SQL assumed from the
outset that columns were identified by ordinal position. That assumption,
though never spelled out explicitly perhaps, led directly to the phenomena of
tables with columns that had non-unique names and tables with columns that had
no names at all!--a monstrous violation of Chris Date's Naming Principle.
I agree with you on Hugh's functional dependence paper but
don't see the direct relevance to the article.
From: LP
Re left-to-right column ordering: OK, I get the
picture.
Obviously, what would also be of immense interest (at least
for me) would be a (personal?) account of the very beginnings of the
RM-development e.g. how you were introduced to it, what did you think at first,
how things started rolling etc. Also a personal account of "The Great
Debate" would be interesting.
From: Chris Date
Further to my last reply: I've thought of something else I
should have said, but didn't, in answer to your question as to why left-to-right
column ordering in SQL is such bad news. The point is, left-to-right column
ordering implies that, e.g.,
RELATION (A AT,B BT,C CT)
and
RELATION (B BT,C CT,A AT)
(to use Tutorial D-like notation) are different
relation types! Consider the effects of this fact on, e.g., the rules
regarding union and intersection! Consider also the effects on relation type
inference; e.g., when you've figured out how to state the rules regarding union
on relations of the two types shown above, what type exactly does the
result of such a union have? (I.e., what left-to-right order do the columns of
that result appear in?) Etc., etc., etc.
From: LP
I finally received the second draft of the duplicates
article, thanks!
I am flattered to have my quote published (the GOTO stuff)
but at the same time a bit wary of my incomplete command of the English
language. English was my mother tongue, but I haven’t been speaking it
regularly since childhood.
Maybe one thing that might come to mind by somebody reading
the article: If the duplicates are so hard to count, as mentioned, how
does SQL manage to all the same count duplicate rows in e.g. SELECT COUNT(*)
FROM t1? It would seem that problems arise when you try to remove or
update one of the duplicates and, of course, join them to other tables
etc...
Have you had time to take a look at the Approximate Join
stuff or the Logic paper? (Still holding my breath!)
Posted
02/23/02
[ABOUT]
[QUOTES]
[LINKS]