“What I tell you three times is true.”
--Lewis Carroll
(Continued from Part 1)
Introduction
In Part 1 of this article, I tried to explain why
duplicate rows (hereinafter abbreviated to just duplicates) are prohibited in
the relational model. When that
material was first published, I received a long letter from a reader, Robert A.
Alps, which raised questions about some of the arguments I'd been making. And
it seemed to me that it might be worth airing some of the comments from Alps'
letter, and my responses to them, before a wider audience. Hence Part 2 of the present article (I
should make it clear that Alps wasn't saying duplicates were a good thing. To quote his letter: "I am not arguing
in favor of duplicates, rather against some of your arguments opposing
[them]. I only want to be sure the
arguments are sound before concluding 'Out, damned duplicate.'"
The Pennies Example
In Part 1 I said that individual objects must be
identifiable--in other words, distinguishable from all other objects--because
if an object isn't identifiable, it's impossible even to talk about it, let
alone perform any kind of operation upon it, or use it for any sensible
purpose. Alps countered with the
following example:
“Imagine a bag of 100 newly minted pennies. You can reach into the bag and pull one out
and while it is in your hand it has a special identity (of being the penny in
your hand). But as soon as you place
that penny back in the bag, it loses that special identity and becomes, once
again, indistinguishable from the others.
Is there some useful operation we can perform on indistinguishable
objects? Certainly. To the extent that we have information
regarding their number, we may perform mathematical operations using that
information. If we know that one bag
contains 100 pennies and another contains 200 pennies then we know that the
combination of the two bags holds 300 pennies.
This is a useful operation.”
Now, obviously I agree that the individual pennies in a bag
of 100 pennies can be regarded as indistinguishable from one another at a
certain level of discourse (in other words, at a certain level of
abstraction). Surely, however, the
point is that the entities under discussion here aren't individual pennies;
rather, they're the collective entities "bags of
pennies." Those collective
entities in turn have a certain property (namely, cardinality), with
values 100, 200, and so on.
Furthermore, those collective entities have some identity,
because otherwise (as I said before) we can't even refer to them; at the very
least, we have to say something like "This is this bag of pennies
and that is that one." [Ed. Comment:
In other words, to point at them.]
But suppose we ask ourselves the question "How are
values of the cardinality property determined?" Then I contend that we have to descend to a lower level of
discourse and consider individual pennies as entities. And at this level we
must be able to distinguish those individual pennies in order to be able
to count them! As I put it in Part 1:
“[A] common reaction to this argument is 'But I really don't
need to distinguish among the duplicates--all I want to do is to be able to
count them.' The point I'm trying to
make is that you do need to distinguish them, even just to count them (for
otherwise how do you know which ones you've already counted?). This point is crucial, of course, and I
really don't know how to make it any more strongly than I already
have." [Ed. Comment: Try to count a pile of pennies by throwing
each back into the pile after you count it; this is the equivalent of what
duplicate proponents suggest, without
realizing it. See last paragraph in this section]
I further contend that Alps' example of a "useful
operation"--namely, the operation of combining the bag of 100 pennies and
the bag of 200 pennies to obtain a bag of 300 pennies, which is a union
operation, of course*--is self-evidently an operation on bags
of pennies, not an operation on individual pennies per se. (A set union, that is, not a bag
union (pun intended).
To continue with what I said inPart 1, I went
on to observe that the way we distinguish duplicates in general is by their
relative position; in a bag (aka multiset) containing two 6's, for
example, we say something like "This 6 is here and that 6 is there,"
or "This is the first 6 and that 6 is the second." Alps countered by claiming that "the
suggestion that we place a positional ordering on duplicates in all situations
appears highly artificial and patently false." Well, perhaps I should turn the argument around at this point;
specifically, if you're one of those people who think that duplicates
"truly exist" in some sense, then please tell me exactly--exactly,
please!--how you propose to count those 100 "duplicate" pennies. I do
think that anyone who advocates the position that duplicates are a good idea
needs to provide a good, convincing answer to this question. It's nearly ten
years since I first issued this challenge, and nobody has yet come up with a
cogent response to it
Bag Theory
Recall that, loosely speaking, a bag in mathematics is
a set that permits duplicates. Duplicate advocates thus sometimes claim that
their Part 1 approach is just as respectable as the relational approach
because it too is based on solid theory (bag theory instead of set theory). In Part 1 of this article, however, I
claimed that mathematical "bag theory" treatments usually start by
assuming that there's a way to count duplicates! I further claimed that that assumption effectively means that bags
are defined in terms of sets.
Alps replied by constructing an outline bag theory that--he
claimed--didn't start with such an assumption. But I think it does! One of the
three primitives of his theory is an expression of the form "count xA"
which (to quote) "is intended to denote the number of times the [object] x
occurs in the bag A." Now,
this quote might have been meant only as an informal, intuitive remark, not as
part of the theory as such, but I'm still suspicious. In any case, the question
of whether such an assumption is necessary is perhaps a red herring (and I was
perhaps wrong to raise it in Part 1).
The point rather is that bag theory:
Ø
Is necessarily more complex than set theory;
Ø
Includes set theory as a proper subset (perhaps I
should say subbag?);
Ø
Is reducible to set theory.
Occam's Razor would thus clearly suggest that we stay with
sets and not get into the unnecessary complexities of bags. (As so often in a database context, Occam's
Razor is peculiarly apt here. One
simple way of stating it is:
"Entities should not be multiplied beyond necessity." A simpler way still is just: "No unnecessary entities.")
The Performance Issue
In Part 1 I recommended that users should always
ensure that query results contain no duplicates--for instance, by always
specifying DISTINCT at appropriate points in the query--and thus simply forget
about the whole problem. However, I
also pointed out that SELECT DISTINCT might take longer to execute than SELECT
ALL, in general, even if the DISTINCT is effectively a "no-op." Alps commented that the performance problem
doesn't go away if the system is (re)defined to eliminate duplicates
automatically:
“Imagine an address table containing street, city, state, and
zip data. If a query is performed
selecting city from the table under a system that [eliminates duplicates
automatically], the system is going to have to go through all the same checking
for duplicates that would be required if [it didn't eliminate duplicates
automatically but] the query asked for DISTINCT cities.”
Here I think Alps missed my point slightly. What I meant was
that, because it appears that always eliminating duplicates implies performance
problems, there was a strong motivation for the SQL language designers to make not
eliminating duplicates the default. (In
fact, I very specifically pointed out in Part 1 of this article that
support for duplicates can actually have a negative impact on
performance, and my reason for doing so was precisely to provide a
counterbalance to this familiar argument.)
But the full implications, psychological and otherwise, of this very bad
language design decision weren't thought through at the time. It's my belief that the occasional
overhead that might be incurred (in a well-implemented system) in always
eliminating duplicates would be vastly outweighed by the benefits--including
performance benefits--that such a system would provide.
Of course I understand that checking for duplicates is
required regardless of whether (a) the system prohibits them or (b) the system
permits them, but the user specifies DISTINCT.
But note that if the language is defined always to eliminate
duplicates, at least conceptually, then the system will sometimes be able not
to eliminate them--for intermediate results, at least, though not of course for
final results--as an optimization, where appropriate.
Some More SQL Problems
At this juncture I'd like to digress for a moment and talk
about a couple of problems that are caused by duplicates in the SQL standard
specifically. The following discussions
are based on material from Appendix D (Some Outstanding Issues) from the
book by Hugh Darwen and myself A GUIDE TO THE SQL
STANDARD.
The first problem concerns Cartesian product. Part of the standard's explanation of the
SQL FROM clause reads as follows:
“[The] result of the
<from clause> is the ... Cartesian product of the tables identified by
[the] <table reference>s [in that <from clause>]. The ... Cartesian product, CP, is the
multiset of all rows R such that R is the concatenation of a row from each of
the identified tables...”
Note, therefore, that CP is not well defined!--the
fact that the standard goes on to say that "The cardinality of CP
is the product of the cardinalities of the identified tables"
notwithstanding. Consider the tables T1
and T2 shown below:
T1 T2
+----+ +----+
¦ C1 ¦ ¦ C2 ¦
+----¦ +----¦
¦ 0 ¦
¦ 1 ¦
¦ 0 ¦
¦ 2 ¦
+----+ +----+
Either of the following fits the above definition for
"the" Cartesian product CP of T1 and T2 (that is, either one
could be "the" multiset referred to):
CP1 CP2
+---------+ +---------+
¦ C1 ¦ C2 ¦ ¦ C1 ¦ C2 ¦
+----+----¦ +----+----¦
¦ 0 ¦
1 ¦ ¦ 0 ¦
1 ¦
¦ 0 ¦
1 ¦ ¦ 0 ¦
2 ¦
¦ 0 ¦
2 ¦ ¦ 0 ¦ 2 ¦
¦ 0 ¦
2 ¦ ¦ 0 ¦
2 ¦
+---------+ +---------+
As an exercise, I suggest you try your hand at fixing up the
wording of the standard appropriately.
If you do try this exercise, I believe you'll find you're inevitably led
into using the language of sets, not bags, in order to get around the errors
and ambiguities.
Here's the second SQL problem. Consider the following cursor definition, which is
intended to apply to the usual suppliers-and-parts database:
DECLARE x CURSOR
FOR SELECT sp.s#,sp.qty
FROM sp
Note that:
a.
Cursor X permits updates;
b.
The table that is visible through cursor X permits duplicates;
c.
The underlying table (table SP) does not permit
duplicates.
Now suppose a "positioned UPDATE or DELETE"
operation is executed via cursor X (UPDATE or DELETE ... WHERE CURRENT OF
x). Then there's no way, in general, of
saying precisely which row of table SP is being updated or deleted by that
operation. How would you fix this
problem?
(After you've given solutions to these two SQL problems,
please write out one googol times "There's no such thing as a
duplicate.")
Concluding Remarks
The letter from Alps concluded: "Whether or not to allow
duplicates in SQL and relational databases seems to me to be more a practical
question than a theoretical one."
Well, regular readers of my database writings will know that I don't
subscribe to the notion that "theoretical" and "practical"
issues are at odds with one another. Au contraire, it's my strong
position that theory--meaning relational theory specifically--is a highly
practical matter. Thus, I want the system to be built on a solid,
well-established theoretical foundation, for all the obvious good practical
reasons. I further contend that bag theory, even if it could be made
independent and respectable, simply doesn't enjoy the same long pedigree and
high level of acceptance that set theory does. Thus, I want to build on set
theory, not bag theory--and I don't want duplicates!
In this connection, I'd like to draw a possibly enlightening
parallel ... The following remarks are due to Lauri Pietarinen (see Part 1
of this article). In his review of an
earlier draft, Lauri wrote:
“It came to my mind that one way to look at the issue is to
compare it to the GOTOless programming episode that started with Dijkstra's
[letter] in 1968 ... Replace "programmer" by "query issuer"
and "GOTOs" by "duplicates" in the following
discussion: Programmers used to use
GOTOs because that seemed like a natural way to transfer control to another
place in the program. Dijkstra declared
GOTOs harmful in his landmark [letter], giving ways to eliminate them. Programmers resisted, because they felt they
could not do all necessary things without GOTOs. They felt something had been taken away from them. However, programs with GOTOs were much
harder to understand, and ... the compilers were bigger, buggier, and had a
harder time optimizing the code. Little
by little, new programming methodologies appeared that did away with GOTOs, and
some newer programming languages did not even have GOTOs in the first place ...
The old programmers had a hard time not using GOTOs, but the new ones
were not even told about GOTOs and they managed to make perfectly decent
programs without them (to say the least!).”
I'd like to close with the following argument (it's based on
some remarks made by Hugh Darwen some years ago in a private letter to myself,
and it effectively sums up my position on this whole business of
duplicates):
1.
We all agree that databases are made up of collections of
"records"--or "rows," or "tuples," or
whatever--of similar format, where each record has exactly one value for each
item in the format (aka 1NF).
2.
We all agree that different kinds of collections are useful
for different purposes. Sometimes lists
are wonderful, sometimes arrays, sometimes queues or stacks,
sometimes sets, sometimes bags, etc.
3.
Of these different kinds of collections, only one is known
to provide, singlehanded, a basis for a complete model of data. And all of the others can be mapped, if need
be, to this one.
4.
To insist on supporting bags is equivalent to at least one of
the following (probably more than one):
·
To insist on using at least two kinds of collection in
one's model of data;
·
To fall back on one of those kinds of collection for
which no complete model of data has yet been developed, and therefore to have
to go to the trouble of developing that theory (if it is available to be
developed);
·
To be like SQL and attempt to apply the theory of
relations to bags of tuples without fully checking its applicability,
consequently making mistakes (like taking relational theorems to hold equally
well for such bags, or misdefining UNION), failing to provide interpretations
for operations whose relational interpretations don't hold for such bags (for
example, Cartesian product is no longer just AND, and projection is no longer
existential quantification), and delivering inferior products to the folks who
are sensible and knowledgeable enough to stay with relations only.
We don't need to argue against the (occasional) utility of
bags of tuples, any more than we need to argue against the occasional utility
of lists of tuples (why, those are what SQL cursors deliver!), or even stacks
of tuples, queues of tuples, and arrays of tuples for that matter. In fact, why do the duplicate advocates
specifically single out bags of tuples (only) for inclusion in their
approach?
Appendix
By way of a kind of appendix to the foregoing, I'd like to
offer a few comments on the tuple-bag algebra presented in the book by Hector
Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom, DATABASE SYSTEM
IMPLEMENTATION (Prentice-Hall, 2000).
That algebra, you might recall, was referenced in the letter from Lauri
Pietarinen that I quoted in the opening to Part 1 of this article.
Here first is an extract from page 237 of the book in
question (Most if not all of the quotes in this appendix are paraphrased just
slightly from the original, mainly for typographical reasons. Of course, I've tried not to change the
sense of any of them.)
“[The relational] algebra involves operations on relations ...
However, SQL uses a bag (multiset) model, rather than a set model. Also, there are operations in SQL, such as
aggregation, grouping, and ordering (sorting), that are not part of the
classical relational algebra. Thus, we
need to reconsider this algebra in the light of its role as a representation
for SQL queries.”
A few comments right away:
Ø
So SQL uses "a bag (multiset) model," does
it? OK: Where's that model defined?
(Perhaps this question is unfair.
Probably all that Garcia-Molina et al. mean is that the basic SQL
construct is a bag, not a set, of rows.
Well, OK again; but then I wish they wouldn't use such high-flown
language to make such a simple point.
As I've written elsewhere--see my article Models, Models,
Everywhere, Nor Any Time to Think [cjd3a.htm]
on this website--I think the word "model" is one of the most
grotesquely overused in the whole IT field.)
Ø
I don't quite know what Garcia-Molina et al.
mean when they talk about "the classical relational algebra," but
aggregation and grouping have been part of the algebra ever since the early
1970s (and possibly earlier); that seems classical enough to me! (In case
you're interested, the reference I have in mind here is the paper by Patrick
Hall, Peter Hitchcook, and Stephen Todd entitled An Algebra of Relations for
Machine Computation, which appeared in the Conference Record of the 2nd
ACM Symposium on Principles of Programming Languages, Palo Alto, CA.,
January 1973. See also Stephen Todd’s paper The Peterlee Relational Test
Vehicle – A System Overview, IBM System Journal 15, No. 4, 1976.)
Ø
Ordering, by contrast, is not part of the
relational algebra; nor can it be, because its result isn't a relation. This doesn't mean you can't have an ORDER BY
operator, of course--it just means that operator isn't part of the algebra as
such, and it can't be used in an expression that's nested inside some other
(relational) expression, or more generally in any context where the result is
indeed required to be a relation.
That's why you can't use ORDER BY in a view definition, for
example.
Ø
"Reconsider this algebra"? No, "this algebra" is fine. What the authors need to do instead is
define a whole new algebra that works on tuple-bags instead of
relations.
Ø
In fact, SQL isn't even based on a tuple-bag algebra,
because SQL's "tuples" aren't tuples!--they have a left-to-right
ordering to their components. As a
consequence, for example, R JOIN S and S JOIN R
give different results, in SQL. (I
don't mean to suggest that R JOIN S is valid SQL syntax, of
course though you might be surprised to learn that R NATURAL JOIN S
is.) Since my primary focus in the
present article is on duplicates specifically, I won't make a big deal of this
left-to-right ordering business here, but it is a significant issue in
its own right.
Anyway, a few pages later, in Section 6.1 of their book (the
section is entitled An Algebra for Queries), Garcia-Molina et al.
proceed to present their tuple-bag algebra.
Another quote:
“[Relational] algebra was originally defined as if
relations were sets [sic!--italics added].
Yet relations in SQL are really bags ... Thus, we shall introduce
relational algebra as an algebra on bags.“
Well, all right ... I think it was Chesterton who said you
can call black white if you want to, and furthermore I would defend your right
to do so--but I think I'd have to question your wisdom and your sanity if you
actually did so. The plain fact is,
relations simply are sets, not bags, and I don't think you do the cause
of communication and understanding much of a service if you insist on
pretending otherwise. (More precisely, the body of a relation is a
set.)
Furthermore, I don't think it's a good idea to try to define
a theory to fit (force-fit?) bad established practice. Rather, we should define a good
theory--the relational model suggests itself as a candidate--and then try to
get practice to follow that theory. (A
heavy-handed attempt at humor on my part.)
Be that as it may, it's an unfortunate fact that--so far as I
can tell--the tuple-bag algebra presented by Garcia-Molina et al.
completely ignores the relational algebra as currently understood in the
relational world. Indeed, it
reinvents--not always terribly well, either--a variety of concepts that have
been thoroughly investigated in the relational literature; the
"aggregation and grouping" stuff is a case in point. It also bundles together certain constructs
that, speaking personally, I would have much preferred to keep apart. Here are some specific quotes, with
commentary:
“Projection ... We shall extend this operator beyond classical
relational algebra to allow renaming of attributes and construction of
attributes by calculation ... just as the SQL SELECT clause does.”
I think it would be cleaner to keep "projection" to
mean what it always did mean, and to use separate RENAME and EXTEND operators
in order to achieve "renaming ... and construction of attributes,"
just as the relational algebra does.
Another example of force-fitting the theory to bad established
practice?
“Product. This operator
is the set-theoretic Cartesian product ...”
No, it isn't!--it's the bag-theoretic Cartesian
product. And as I pointed out in the
body of this article, it's hard to define, too. Indeed, Garcia-Molina et al. don't define it, they
simply assume we all know what it is.
What's more, they go on to say:
“If R and S are relations, the product R TIMES S is a relation
whose schema consists of the attributes of R and the attributes of S. Should there be an attribute name, say a,
found in both schemas, then we use R.a and S.a as the names of the two
attributes in the product schema.”
The proposed naming scheme is seriously flawed (and in any
case it's unnecessary). As one example
of the kind of problem it leads to, what if R and S are the same
relation? More important, what if the
input relations R and S are derived relations (in other
words, what if they're represented by subexpressions that are more complex than
just simple "relation names," and thus have no names of their
own)? Indeed, the naming mechanism
proposed by Garcia-Molina et al. here serves as strong evidence to
support my earlier claim that the designers of this algebra cannot have paid
much attention to the existing literature in this area. In particular, the naming problem they're
wrestling with here was solved as far back as 1975--over a quarter of a century
ago!
“Union ... For R UNION S,
a tuple t is in the result as many times as the number of times it is in R plus
the number of times it is in S.”
Here I just note that the operator defined isn't the usual
bag union; it isn't the regular SQL UNION, either (in fact it's the SQL
"UNION ALL" operator). The
second of these points is acknowledged by Garcia-Molina et al., but the
first--which is more serious, in a way--isn't.
I could say quite a lot more regarding deficiencies in the
proposed tuple-bag algebra, but we're supposed to be concentrating here
specifically on why duplicates are a bad idea, so let me move on ... Now, I
claimed in Part 1 of this article that duplicates are an optimization
inhibitor. And, of course,
Garcia-Molina et al. do recognize this fact. In Section 7.2 (Algebraic Laws for Improving Query Plans),
there are several points where they explicitly admit that dealing with bags
instead of sets can get in the way of optimization, in one way or another. Some quotes:
“R JOIN S = S JOIN R ... We might imagine that the order of
[attributes] will be different [for the results of] the left and right
[expressions here], but formally, tuples in relational algebra have no ...
order of attributes.”
Actually, the problem here is not that we're dealing with
bags instead of sets, but rather that the bags or sets in question don't
contain tuples as such (see my earlier remarks in this appendix regarding
left-to-right ordering of tuple components).
But note the sneaky appeal to relational algebra in the middle of
what's supposed to be an exposition of tuple-bag algebra! The sorry truth is that--as noted earlier--R
JOIN S and S JOIN R are not equivalent, in
SQL.
“For instance, you may have learned set-theoretic laws such as A
INTERSECT (B UNION C) = (A INTERSECT B) UNION (A INTERSECT C), which is
formally the "distributive law of intersection over union." This law holds for sets, but not for
bags.”
Quite right. And this
is just a "for instance"! It
would be much more interesting--not to say more useful--to give an exhaustive
list of such laws that hold for sets but not for bags.
“R WHERE c1 OR c2 = (R WHERE c1) UNION (R WHERE c2) ... [This]
law ... works only if the relation R is a set [sic!].”
Yes. See my previous
comment.
“[Projection] keeps the number of tuples the same and only
reduces the length of the tuples. In
fact, ... sometimes a projection actually increases the length of tuples.”
True relational projection doesn't "keep the
number of tuples the same," it reduces the number of tuples (in
general); thus, "doing projections early" is a good optimization
tactic--in the relational world, but not in the tuple-bag world. As for
projection "increasing tuple length": Well, no, true relational projection never does that, it's
EXTEND that does that. As I said
earlier, I think it would be better to treat these two as separate operators
(and the confusions involved in the foregoing claim by Garcia-Molina et al.
show why, in part).
I don't think I want to comment any further on this
particular tuple-bag algebra here; I think I've made my point. That point might be summed up as
follows:
·
It doesn't matter whether an algebra can be defined for
tuple bags.
·
It doesn't matter whether expression transformation
laws can be defined for such an algebra.
·
What does matter is that the range of possible results
that can be obtained from any given database is vastly expanded, though not
usefully so.
·
What matters further is that the range of logically
distinct query expressions that can be formulated is vastly expanded, too,
though again not usefully so. (Quite
apart from anything else, think of the increased costs in terms of user
education here!)
·
What matters still further is that numerous logically
distinct expressions are logically distinct only inasmuch as they
produce different degrees of duplication in their results; those expressions
thus can't be transformed into one another, even when it would be advantageous
from a performance point of view to do so, and even when the user doesn't
really care about the differences in the degree of duplication in their
results. Note that even
"law-abiding citizens"--that is, users who make sure always to avoid
duplicates entirely--suffer here; that is, they still pay some price, simply
because the system permits duplicates, even though they themselves never
"take advantage" of this particular "feature."
·
What also matters is that the optimizer itself is
harder to write and harder to debug.
DBMS product releases are thus delayed, and cost more, and are buggier,
when they do appear. Again, law-abiding
citizens pay a price for this.
·
What also matters is that extensions to the
language--probably SQL--are harder to define. Indeed, every time the SQL
standards bodies consider the addition of some new construct to the SQL
language, they have to consider the impact of duplicates on that new construct
(view updating provides a good example of this problem). Again, we all pay a price for this ... The
standard takes longer to appear, and is buggier when it does appear, than would
otherwise have been the case. I say
again: We all pay for this. (Somebody has to pay the
standardizers for the extra time involved, too!)
·
Finally, from my own current research interests, I know
that all of the foregoing problems are going to be exacerbated yet again when
the standard and the products start including support for temporal data.
So I make my plea one final* time: Let's drop duplicates, once and for all! ("Final" here is wishful thinking,
I expect.)
Posted
04/12/02