DOUBLE TROUBLE, DOUBLE TROUBLE PART 2
by Chris Date

 

 

 

“What I tell you three times is true.”

--Lewis Carroll

 

 

(Continued from Part 1)

 

 

 

Introduction

 

In Part 1 of this article, I tried to explain why duplicate rows (hereinafter abbreviated to just duplicates) are prohibited in the relational model.  When that material was first published, I received a long letter from a reader, Robert A. Alps, which raised questions about some of the arguments I'd been making. And it seemed to me that it might be worth airing some of the comments from Alps' letter, and my responses to them, before a wider audience.  Hence Part 2 of the present article (I should make it clear that Alps wasn't saying duplicates were a good thing.  To quote his letter: "I am not arguing in favor of duplicates, rather against some of your arguments opposing [them].  I only want to be sure the arguments are sound before concluding 'Out, damned duplicate.'" 

 

The Pennies Example

 

In Part 1 I said that individual objects must be identifiable--in other words, distinguishable from all other objects--because if an object isn't identifiable, it's impossible even to talk about it, let alone perform any kind of operation upon it, or use it for any sensible purpose.  Alps countered with the following example: 

 

“Imagine a bag of 100 newly minted pennies.  You can reach into the bag and pull one out and while it is in your hand it has a special identity (of being the penny in your hand).  But as soon as you place that penny back in the bag, it loses that special identity and becomes, once again, indistinguishable from the others.  Is there some useful operation we can perform on indistinguishable objects?  Certainly.  To the extent that we have information regarding their number, we may perform mathematical operations using that information.  If we know that one bag contains 100 pennies and another contains 200 pennies then we know that the combination of the two bags holds 300 pennies.  This is a useful operation.” 

 

Now, obviously I agree that the individual pennies in a bag of 100 pennies can be regarded as indistinguishable from one another at a certain level of discourse (in other words, at a certain level of abstraction).  Surely, however, the point is that the entities under discussion here aren't individual pennies; rather, they're the collective entities "bags of pennies."  Those collective entities in turn have a certain property (namely, cardinality), with values 100, 200, and so on.  Furthermore, those collective entities have some identity, because otherwise (as I said before) we can't even refer to them; at the very least, we have to say something like "This is this bag of pennies and that is that one." [Ed. Comment: In other words, to point at them.]

 

But suppose we ask ourselves the question "How are values of the cardinality property determined?"  Then I contend that we have to descend to a lower level of discourse and consider individual pennies as entities. And at this level we must be able to distinguish those individual pennies in order to be able to count them!  As I put it in Part 1:

 

“[A] common reaction to this argument is 'But I really don't need to distinguish among the duplicates--all I want to do is to be able to count them.'  The point I'm trying to make is that you do need to distinguish them, even just to count them (for otherwise how do you know which ones you've already counted?).  This point is crucial, of course, and I really don't know how to make it any more strongly than I already have."  [Ed. Comment: Try to count a pile of pennies by throwing each back into the pile after you count it; this is the equivalent of what duplicate proponents suggest, without  realizing it. See last paragraph in this section]

 

I further contend that Alps' example of a "useful operation"--namely, the operation of combining the bag of 100 pennies and the bag of 200 pennies to obtain a bag of 300 pennies, which is a union operation, of course*--is self-evidently an operation on bags of pennies, not an operation on individual pennies per se.  (A set union, that is, not a bag union (pun intended). 

 

To continue with what I said inPart 1, I went on to observe that the way we distinguish duplicates in general is by their relative position; in a bag (aka multiset) containing two 6's, for example, we say something like "This 6 is here and that 6 is there," or "This is the first 6 and that 6 is the second."  Alps countered by claiming that "the suggestion that we place a positional ordering on duplicates in all situations appears highly artificial and patently false."  Well, perhaps I should turn the argument around at this point; specifically, if you're one of those people who think that duplicates "truly exist" in some sense, then please tell me exactly--exactly, please!--how you propose to count those 100 "duplicate" pennies. I do think that anyone who advocates the position that duplicates are a good idea needs to provide a good, convincing answer to this question. It's nearly ten years since I first issued this challenge, and nobody has yet come up with a cogent response to it

 

 

Bag Theory

 

Recall that, loosely speaking, a bag in mathematics is a set that permits duplicates. Duplicate advocates thus sometimes claim that their Part 1 approach is just as respectable as the relational approach because it too is based on solid theory (bag theory instead of set theory).  In Part 1 of this article, however, I claimed that mathematical "bag theory" treatments usually start by assuming that there's a way to count duplicates!  I further claimed that that assumption effectively means that bags are defined in terms of sets

 

Alps replied by constructing an outline bag theory that--he claimed--didn't start with such an assumption. But I think it does! One of the three primitives of his theory is an expression of the form "count xA" which (to quote) "is intended to denote the number of times the [object] x occurs in the bag A."  Now, this quote might have been meant only as an informal, intuitive remark, not as part of the theory as such, but I'm still suspicious. In any case, the question of whether such an assumption is necessary is perhaps a red herring (and I was perhaps wrong to raise it in Part 1).  The point rather is that bag theory:

 

Ø       Is necessarily more complex than set theory;

Ø       Includes set theory as a proper subset (perhaps I should say subbag?);

Ø       Is reducible to set theory. 

 

Occam's Razor would thus clearly suggest that we stay with sets and not get into the unnecessary complexities of bags.  (As so often in a database context, Occam's Razor is peculiarly apt here.  One simple way of stating it is:  "Entities should not be multiplied beyond necessity."  A simpler way still is just:  "No unnecessary entities.") 

 

 

The Performance Issue

 

In Part 1 I recommended that users should always ensure that query results contain no duplicates--for instance, by always specifying DISTINCT at appropriate points in the query--and thus simply forget about the whole problem.  However, I also pointed out that SELECT DISTINCT might take longer to execute than SELECT ALL, in general, even if the DISTINCT is effectively a "no-op."  Alps commented that the performance problem doesn't go away if the system is (re)defined to eliminate duplicates automatically:  

 

“Imagine an address table containing street, city, state, and zip data.  If a query is performed selecting city from the table under a system that [eliminates duplicates automatically], the system is going to have to go through all the same checking for duplicates that would be required if [it didn't eliminate duplicates automatically but] the query asked for DISTINCT cities.” 

 

Here I think Alps missed my point slightly. What I meant was that, because it appears that always eliminating duplicates implies performance problems, there was a strong motivation for the SQL language designers to make not eliminating duplicates the default.  (In fact, I very specifically pointed out in Part 1 of this article that support for duplicates can actually have a negative impact on performance, and my reason for doing so was precisely to provide a counterbalance to this familiar argument.)  But the full implications, psychological and otherwise, of this very bad language design decision weren't thought through at the time.  It's my belief that the occasional overhead that might be incurred (in a well-implemented system) in always eliminating duplicates would be vastly outweighed by the benefits--including performance benefits--that such a system would provide. 

 

Of course I understand that checking for duplicates is required regardless of whether (a) the system prohibits them or (b) the system permits them, but the user specifies DISTINCT.  But note that if the language is defined always to eliminate duplicates, at least conceptually, then the system will sometimes be able not to eliminate them--for intermediate results, at least, though not of course for final results--as an optimization, where appropriate.

 

 

Some More SQL Problems

 

At this juncture I'd like to digress for a moment and talk about a couple of problems that are caused by duplicates in the SQL standard specifically.  The following discussions are based on material from Appendix D (Some Outstanding Issues) from the book by Hugh Darwen and myself A GUIDE TO THE SQL STANDARD.

 

The first problem concerns Cartesian product.  Part of the standard's explanation of the SQL FROM clause reads as follows: 

 

“[The] result of the <from clause> is the ... Cartesian product of the tables identified by [the] <table reference>s [in that <from clause>].  The ... Cartesian product, CP, is the multiset of all rows R such that R is the concatenation of a row from each of the identified tables...” 

 

Note, therefore, that CP is not well defined!--the fact that the standard goes on to say that "The cardinality of CP is the product of the cardinalities of the identified tables" notwithstanding.  Consider the tables T1 and T2 shown below: 

 

T1              T2

+----+          +----+

¦ C1 ¦          ¦ C2 ¦

+----¦          +----¦

¦  0 ¦          ¦  1 ¦

¦  0 ¦          ¦  2 ¦

+----+          +----+

 

Either of the following fits the above definition for "the" Cartesian product CP of T1 and T2 (that is, either one could be "the" multiset referred to): 

        

CP1                   CP2

+---------+           +---------+

¦ C1 ¦ C2 ¦           ¦ C1 ¦ C2 ¦

+----+----¦           +----+----¦

¦  0 ¦  1 ¦           ¦  0 ¦  1 ¦

¦  0 ¦  1 ¦           ¦  0 ¦  2 ¦

¦  0 ¦  2 ¦           ¦  0 ¦  2 ¦

¦  0 ¦  2 ¦           ¦  0 ¦  2 ¦

+---------+           +---------+

 

As an exercise, I suggest you try your hand at fixing up the wording of the standard appropriately.  If you do try this exercise, I believe you'll find you're inevitably led into using the language of sets, not bags, in order to get around the errors and ambiguities. 

 

Here's the second SQL problem.  Consider the following cursor definition, which is intended to apply to the usual suppliers-and-parts database: 

 

DECLARE x CURSOR

FOR SELECT sp.s#,sp.qty

    FROM sp

 

Note that: 

 

a.      Cursor X permits updates;

b.      The table that is visible through cursor X permits duplicates;

c.      The underlying table (table SP) does not permit duplicates

 

Now suppose a "positioned UPDATE or DELETE" operation is executed via cursor X (UPDATE or DELETE ... WHERE CURRENT OF x).  Then there's no way, in general, of saying precisely which row of table SP is being updated or deleted by that operation.  How would you fix this problem? 

 

(After you've given solutions to these two SQL problems, please write out one googol times "There's no such thing as a duplicate.")

 

 

Concluding Remarks

 

The letter from Alps concluded: "Whether or not to allow duplicates in SQL and relational databases seems to me to be more a practical question than a theoretical one."  Well, regular readers of my database writings will know that I don't subscribe to the notion that "theoretical" and "practical" issues are at odds with one another. Au contraire, it's my strong position that theory--meaning relational theory specifically--is a highly practical matter. Thus, I want the system to be built on a solid, well-established theoretical foundation, for all the obvious good practical reasons. I further contend that bag theory, even if it could be made independent and respectable, simply doesn't enjoy the same long pedigree and high level of acceptance that set theory does. Thus, I want to build on set theory, not bag theory--and I don't want duplicates! 

 

In this connection, I'd like to draw a possibly enlightening parallel ... The following remarks are due to Lauri Pietarinen (see Part 1 of this article).  In his review of an earlier draft, Lauri wrote: 

 

“It came to my mind that one way to look at the issue is to compare it to the GOTOless programming episode that started with Dijkstra's [letter] in 1968 ... Replace "programmer" by "query issuer" and "GOTOs" by "duplicates" in the following discussion:  Programmers used to use GOTOs because that seemed like a natural way to transfer control to another place in the program.  Dijkstra declared GOTOs harmful in his landmark [letter], giving ways to eliminate them.  Programmers resisted, because they felt they could not do all necessary things without GOTOs.  They felt something had been taken away from them.  However, programs with GOTOs were much harder to understand, and ... the compilers were bigger, buggier, and had a harder time optimizing the code.  Little by little, new programming methodologies appeared that did away with GOTOs, and some newer programming languages did not even have GOTOs in the first place ... The old programmers had a hard time not using GOTOs, but the new ones were not even told about GOTOs and they managed to make perfectly decent programs without them (to say the least!).” 

 

I'd like to close with the following argument (it's based on some remarks made by Hugh Darwen some years ago in a private letter to myself, and it effectively sums up my position on this whole business of duplicates): 

 

1.      We all agree that databases are made up of collections of "records"--or "rows," or "tuples," or whatever--of similar format, where each record has exactly one value for each item in the format (aka 1NF). 

2.      We all agree that different kinds of collections are useful for different purposes.  Sometimes lists are wonderful, sometimes arrays, sometimes queues or stacks, sometimes sets, sometimes bags, etc. 

3.      Of these different kinds of collections, only one is known to provide, singlehanded, a basis for a complete model of data.  And all of the others can be mapped, if need be, to this one. 

4.      To insist on supporting bags is equivalent to at least one of the following (probably more than one): 

·      To insist on using at least two kinds of collection in one's model of data;

·      To fall back on one of those kinds of collection for which no complete model of data has yet been developed, and therefore to have to go to the trouble of developing that theory (if it is available to be developed);

·      To be like SQL and attempt to apply the theory of relations to bags of tuples without fully checking its applicability, consequently making mistakes (like taking relational theorems to hold equally well for such bags, or misdefining UNION), failing to provide interpretations for operations whose relational interpretations don't hold for such bags (for example, Cartesian product is no longer just AND, and projection is no longer existential quantification), and delivering inferior products to the folks who are sensible and knowledgeable enough to stay with relations only. 

 

We don't need to argue against the (occasional) utility of bags of tuples, any more than we need to argue against the occasional utility of lists of tuples (why, those are what SQL cursors deliver!), or even stacks of tuples, queues of tuples, and arrays of tuples for that matter.  In fact, why do the duplicate advocates specifically single out bags of tuples (only) for inclusion in their approach?

 

 

Appendix

 

By way of a kind of appendix to the foregoing, I'd like to offer a few comments on the tuple-bag algebra presented in the book by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom, DATABASE SYSTEM IMPLEMENTATION (Prentice-Hall, 2000).  That algebra, you might recall, was referenced in the letter from Lauri Pietarinen that I quoted in the opening to Part 1 of this article. 

 

Here first is an extract from page 237 of the book in question (Most if not all of the quotes in this appendix are paraphrased just slightly from the original, mainly for typographical reasons.  Of course, I've tried not to change the sense of any of them.) 

 

“[The relational] algebra involves operations on relations ... However, SQL uses a bag (multiset) model, rather than a set model.  Also, there are operations in SQL, such as aggregation, grouping, and ordering (sorting), that are not part of the classical relational algebra.  Thus, we need to reconsider this algebra in the light of its role as a representation for SQL queries.” 

 

A few comments right away: 

 

Ø       So SQL uses "a bag (multiset) model," does it?  OK:  Where's that model defined?  (Perhaps this question is unfair.  Probably all that Garcia-Molina et al. mean is that the basic SQL construct is a bag, not a set, of rows.  Well, OK again; but then I wish they wouldn't use such high-flown language to make such a simple point.  As I've written elsewhere--see my article Models, Models, Everywhere, Nor Any Time to Think [cjd3a.htm] on this website--I think the word "model" is one of the most grotesquely overused in the whole IT field.) 

 

Ø       I don't quite know what Garcia-Molina et al. mean when they talk about "the classical relational algebra," but aggregation and grouping have been part of the algebra ever since the early 1970s (and possibly earlier); that seems classical enough to me! (In case you're interested, the reference I have in mind here is the paper by Patrick Hall, Peter Hitchcook, and Stephen Todd entitled An Algebra of Relations for Machine Computation, which appeared in the Conference Record of the 2nd ACM Symposium on Principles of Programming Languages, Palo Alto, CA., January 1973. See also Stephen Todd’s paper The Peterlee Relational Test Vehicle – A System Overview, IBM System Journal 15, No. 4, 1976.)

 

Ø       Ordering, by contrast, is not part of the relational algebra; nor can it be, because its result isn't a relation.  This doesn't mean you can't have an ORDER BY operator, of course--it just means that operator isn't part of the algebra as such, and it can't be used in an expression that's nested inside some other (relational) expression, or more generally in any context where the result is indeed required to be a relation.  That's why you can't use ORDER BY in a view definition, for example. 

 

Ø       "Reconsider this algebra"?  No, "this algebra" is fine.  What the authors need to do instead is define a whole new algebra that works on tuple-bags instead of relations. 

 

Ø       In fact, SQL isn't even based on a tuple-bag algebra, because SQL's "tuples" aren't tuples!--they have a left-to-right ordering to their components.  As a consequence, for example, R JOIN S and S JOIN R give different results, in SQL.  (I don't mean to suggest that R JOIN S is valid SQL syntax, of course though you might be surprised to learn that R NATURAL JOIN S is.)  Since my primary focus in the present article is on duplicates specifically, I won't make a big deal of this left-to-right ordering business here, but it is a significant issue in its own right. 

 

Anyway, a few pages later, in Section 6.1 of their book (the section is entitled An Algebra for Queries), Garcia-Molina et al. proceed to present their tuple-bag algebra.  Another quote: 

 

“[Relational] algebra was originally defined as if relations were sets [sic!--italics added].  Yet relations in SQL are really bags ... Thus, we shall introduce relational algebra as an algebra on bags.“

 

Well, all right ... I think it was Chesterton who said you can call black white if you want to, and furthermore I would defend your right to do so--but I think I'd have to question your wisdom and your sanity if you actually did so.  The plain fact is, relations simply are sets, not bags, and I don't think you do the cause of communication and understanding much of a service if you insist on pretending otherwise. (More precisely, the body of a relation is a set.) 

 

Furthermore, I don't think it's a good idea to try to define a theory to fit (force-fit?) bad established practice.  Rather, we should define a good theory--the relational model suggests itself as a candidate--and then try to get practice to follow that theory.  (A heavy-handed attempt at humor on my part.) 

 

Be that as it may, it's an unfortunate fact that--so far as I can tell--the tuple-bag algebra presented by Garcia-Molina et al. completely ignores the relational algebra as currently understood in the relational world.  Indeed, it reinvents--not always terribly well, either--a variety of concepts that have been thoroughly investigated in the relational literature; the "aggregation and grouping" stuff is a case in point.  It also bundles together certain constructs that, speaking personally, I would have much preferred to keep apart.  Here are some specific quotes, with commentary: 

 

“Projection ... We shall extend this operator beyond classical relational algebra to allow renaming of attributes and construction of attributes by calculation ... just as the SQL SELECT clause does.” 

 

I think it would be cleaner to keep "projection" to mean what it always did mean, and to use separate RENAME and EXTEND operators in order to achieve "renaming ... and construction of attributes," just as the relational algebra does.  Another example of force-fitting the theory to bad established practice? 

 

“Product.  This operator is the set-theoretic Cartesian product ...”

 

No, it isn't!--it's the bag-theoretic Cartesian product.  And as I pointed out in the body of this article, it's hard to define, too.  Indeed, Garcia-Molina et al. don't define it, they simply assume we all know what it is.  What's more, they go on to say: 

 

“If R and S are relations, the product R TIMES S is a relation whose schema consists of the attributes of R and the attributes of S.  Should there be an attribute name, say a, found in both schemas, then we use R.a and S.a as the names of the two attributes in the product schema.” 

 

The proposed naming scheme is seriously flawed (and in any case it's unnecessary).  As one example of the kind of problem it leads to, what if R and S are the same relation?  More important, what if the input relations R and S are derived relations (in other words, what if they're represented by subexpressions that are more complex than just simple "relation names," and thus have no names of their own)?  Indeed, the naming mechanism proposed by Garcia-Molina et al. here serves as strong evidence to support my earlier claim that the designers of this algebra cannot have paid much attention to the existing literature in this area.  In particular, the naming problem they're wrestling with here was solved as far back as 1975--over a quarter of a century ago! 

 

“Union ...  For R UNION S, a tuple t is in the result as many times as the number of times it is in R plus the number of times it is in S.” 

 

Here I just note that the operator defined isn't the usual bag union; it isn't the regular SQL UNION, either (in fact it's the SQL "UNION ALL" operator).  The second of these points is acknowledged by Garcia-Molina et al., but the first--which is more serious, in a way--isn't. 

 

I could say quite a lot more regarding deficiencies in the proposed tuple-bag algebra, but we're supposed to be concentrating here specifically on why duplicates are a bad idea, so let me move on ... Now, I claimed in Part 1 of this article that duplicates are an optimization inhibitor.  And, of course, Garcia-Molina et al. do recognize this fact.  In Section 7.2 (Algebraic Laws for Improving Query Plans), there are several points where they explicitly admit that dealing with bags instead of sets can get in the way of optimization, in one way or another.  Some quotes: 

 

“R JOIN S = S JOIN R ... We might imagine that the order of [attributes] will be different [for the results of] the left and right [expressions here], but formally, tuples in relational algebra have no ... order of attributes.” 

 

Actually, the problem here is not that we're dealing with bags instead of sets, but rather that the bags or sets in question don't contain tuples as such (see my earlier remarks in this appendix regarding left-to-right ordering of tuple components).  But note the sneaky appeal to relational algebra in the middle of what's supposed to be an exposition of tuple-bag algebra!  The sorry truth is that--as noted earlier--R JOIN S and S JOIN R are not equivalent, in SQL. 

 

“For instance, you may have learned set-theoretic laws such as A INTERSECT (B UNION C) = (A INTERSECT B) UNION (A INTERSECT C), which is formally the "distributive law of intersection over union."  This law holds for sets, but not for bags.” 

 

Quite right.  And this is just a "for instance"!  It would be much more interesting--not to say more useful--to give an exhaustive list of such laws that hold for sets but not for bags. 

 

“R WHERE c1 OR c2 = (R WHERE c1) UNION (R WHERE c2) ... [This] law ... works only if the relation R is a set [sic!].” 

 

Yes.  See my previous comment. 

 

“[Projection] keeps the number of tuples the same and only reduces the length of the tuples.  In fact, ... sometimes a projection actually increases the length of tuples.” 

 

True relational projection doesn't "keep the number of tuples the same," it reduces the number of tuples (in general); thus, "doing projections early" is a good optimization tactic--in the relational world, but not in the tuple-bag world. As for projection "increasing tuple length":  Well, no, true relational projection never does that, it's EXTEND that does that.  As I said earlier, I think it would be better to treat these two as separate operators (and the confusions involved in the foregoing claim by Garcia-Molina et al. show why, in part). 

 

I don't think I want to comment any further on this particular tuple-bag algebra here; I think I've made my point.  That point might be summed up as follows: 

 

·   It doesn't matter whether an algebra can be defined for tuple bags. 

 

·   It doesn't matter whether expression transformation laws can be defined for such an algebra. 

 

·   What does matter is that the range of possible results that can be obtained from any given database is vastly expanded, though not usefully so. 

 

·   What matters further is that the range of logically distinct query expressions that can be formulated is vastly expanded, too, though again not usefully so.  (Quite apart from anything else, think of the increased costs in terms of user education here!) 

 

·   What matters still further is that numerous logically distinct expressions are logically distinct only inasmuch as they produce different degrees of duplication in their results; those expressions thus can't be transformed into one another, even when it would be advantageous from a performance point of view to do so, and even when the user doesn't really care about the differences in the degree of duplication in their results.  Note that even "law-abiding citizens"--that is, users who make sure always to avoid duplicates entirely--suffer here; that is, they still pay some price, simply because the system permits duplicates, even though they themselves never "take advantage" of this particular "feature." 

 

·   What also matters is that the optimizer itself is harder to write and harder to debug.  DBMS product releases are thus delayed, and cost more, and are buggier, when they do appear.  Again, law-abiding citizens pay a price for this. 

 

·   What also matters is that extensions to the language--probably SQL--are harder to define. Indeed, every time the SQL standards bodies consider the addition of some new construct to the SQL language, they have to consider the impact of duplicates on that new construct (view updating provides a good example of this problem).  Again, we all pay a price for this ... The standard takes longer to appear, and is buggier when it does appear, than would otherwise have been the case.  I say again:  We all pay for this.  (Somebody has to pay the standardizers for the extra time involved, too!) 

 

·   Finally, from my own current research interests, I know that all of the foregoing problems are going to be exacerbated yet again when the standard and the products start including support for temporal data. 

 

So I make my plea one final* time:  Let's drop duplicates, once and for all!  ("Final" here is wishful thinking, I expect.) 

 

 

Posted 04/12/02

 

 

 

[ABOUT] [QUOTES] [LINKS]