ON PRIMARY KEYS
with Chris Date

 

 

 

From: MB

To: Editor

 

[Note: I originally addressed this directly to C.J. Date, as the mail was inspired by a comment in his book.  However, I am interested in a discussion on the subject with any interested party.]

 

I am something of a neophyte in the database management field, insofar as I have only just started delving into any disciplined study of the subject, although I have some 'practical' experience in managing database systems.  As such, I apologize in advance for any misuse of terminology you may encounter, though I hope, at least, my understanding of the concepts is sound.

 

In your book, AN INTRODUCTION TO DATABASE SYSTEMS you state: "If the set of candidate keys actually does include more than one member, then the choice of which is to be primary is essentially arbitrary."  I don't believe this to be strictly true, even in the absence of a formal system of determining the 'right' primary key.

 

As an example, let's say we're running a bowling league, and we decide to set up a database.  The following relation may well appear in our database:

 

TEAMS (TNAME, STANDING)

 

Our rules for this league state that all teams must have a unique name, any new team gets assigned the lowest standing, and any ties are resolved by a new round. Given this, we have a choice of two candidate keys - TNAME and STANDING.

 

Both attributes are unique, irreducible, and satisfy the entity integrity rule. However, I would expect anyone to intuitively choose TNAME.

 

For the purposes of illustrating the problems of our alternative, let's go ahead and treat STANDING as a primary key anyways, and add another relation that references it:

 

MEMBERS (MNAME, STANDING)

 

The problem here should be self-evident - any change to STANDINGS needs to be cascaded to MEMBERS, despite the fact that nothing has altered in terms of what the table is supposed to represent.  This anomaly stems from the fact that we chose as an identifier a transient, though unique, attribute.

 

Taking this as a starting point, I would propose that any given attributes be separated into two distinct categories - transient and immutable, the later being those attributes which are time-dependent.  I would stipulate as a general rule that no candidate key should be considered an acceptable primary key unless it satisfies the condition of immutability.

 

An obvious issue is that there may not exist an inherent attribute of a given entity that also satisfies the requirements of a candidate key.  This is easily illustrated by provisioning for the possibility of changing the name of a given team.

 

In practice, when setting up a database, I've always avoided this by creating an assigned attribute that satisfies all conditions.  This appears to be performed at the DBMS level in some cases, such as Postgres, where every tuple has an implicit attribute (OID) that uniquely identifies it.

 

I'm interested in hearing any feedback you may have on this subject, as I am not aware of whether this idea has received previous treatment, and because I am certain my presentation of it still suffers from ambiguities and unforeseen side issues.

 

 

Chris Date Responds:  Thank you for your thoughts on primary keys. You might well be right in accusing me of overstatement when I say "the choice of which [candidate key] is to be primary is essentially arbitrary"; I mean, "essentially arbitrary" might be a little too strong. What I meant was--exactly as you suggest--that there is, in general, no formal basis for making the choice.  I certainly agree that choosing a "transient" key ("volatile" might be a better adjective), as primary will often be a bad idea.  I also agree that introducing an artificial, surrogate, nonvolatile key will often be a good idea.  However: 

 

Ø       Both of these precepts are indeed only informal in nature.  (Define "volatile"!  Define "often"!)

 

Ø       What if there are several equally nonvolatile candidate keys?  Consider, e.g., the table of chemical elements, in which each element has a unique name, a unique symbol (e.g., tin = Sn), a unique atomic number, and (I believe, though I'm not a chemist and I could be wrong here) a unique atomic weight--all of them nonvolatile.

 

Ø       I came across a real-world example a few years back in which there were two nonvolatile candidate keys, and which of the two was chosen as primary could legitimately claim to depend on your point of view.  The example involved a 1:1 relationship between shipments and invoices, and so there was a relvar (table) that looked like this: 

 

IS {SHIP#, INV#,...} 

CANDIDATE KEY {SHIP#}

CANDIDATE KEY {INV#}

 

Some users were primarily (I choose my words carefully) interested in shipments and thus thought of SHIP# as primary key; others were primarily interested in invoices and thus thought of INV# as primary key.  What would you do?  (One possible approach would be to define two views, both isomorphic to the base relvar but with different primary keys.  But that suggestion raises the question of primary keys for views ... See below.)

 

Ø       What do you do if you have (say) two candidate keys, one involving one attribute (column) but volatile and the other involving two or more attributes but nonvolatile?  Multi-attribute primary keys suffer from problems of their own (see my article Composite Keys in C. J. Date and Hugh Darwen, RELATIONAL DATABASE WRITINGS 1989-1991.  In other words, nonvolatility cannot be the sole criterion for making the choice, in general.

 

Ø       You note that "[surrogate key support seems to be provided] at the DBMS level in some cases, such as Postgres, where every tuple has an implicit attribute (OID) that uniquely identifies it."  Three points here: 

 

·   I'm not familiar with Postgres as such, but I strongly doubt whether every tuple in Postgres has an "OID."  I strongly suspect that what you mean is that every base tuple has such an ID (i.e., a tuple has such an ID if and only if it's a tuple in a base relation).  But the primary key notion, if it's to mean anything at all formally, has to apply to all relations, not just to base ones!  The Principle of Database Relativity dictates that there must be no arbitrary differences (such as having vs. not having a primary key) between base relations and other kinds of relations such as views. 

 

Note:  SQL, of course, doesn't support primary keys for views; in fact, it doesn't really support any kind of explicit integrity constraints on views, except for the rather bizarre WITH CHECK OPTION option.  But then, that's SQL for you.  What can I say --except to note that this is just another piece of evidence (Exhibit No. 500 or so) that SQL is not now and never was truly relational? 

 

·   Many products do indeed automatically provide some kind of "row ID" for base tuples (only).  But those row IDs typically constitute a violation of theInformation Principle, in that they are typically not represented by regular relational attributes and cannot be accessed as such.  Ted Codd calls theInformation Principle the most fundamental relational principle of all, and I tend to agree with him.  So those products aren't truly relational, either. 

 

·   The terminology of "OID" (= object ID) makes me extremely suspicious.  It strongly suggests that Postgres--or somebody-- thinks that tuples are objects and relations are object classes.  Hugh Darwen and I regard the equating of relations and object classes as The First Great Blunder.  We have written extensively about that blunder--showing in particular just why it is a blunder--in several places, including our book FOUNDATION FOR FUTURE DATABASE SYSTEMS: THE THIRD MANIFESTOin particular.  See also the final chapter in my book AN INTRODUCTION TO DATABASE SYSTEMS

 

By the way, The First Great Blunder is called "first" because there's a second one, too: viz., mixing pointers (OIDs) and relations.  This one is discussed in my articles Don't Mix Pointers and Relations! and Don't Mix Pointers and Relations--Please in C. J. Date, RELATIONAL DATABASE WRITINGS 1994-1997.  If you commit The First Great Blunder you will be inevitably sucked into committing the second one too; however, you can commit The Second Great Blunder without committing the first.  SQL (I mean the current standard, SQL:1999, as well as SQL as implemented in most if not all of the current SQL products) commits The Second Great Blunder, though not the first.  Thus, not only was SQL never truly relational, it's actually become less relational over the years! 

 

Back to the main thread of our discussion.  All in all, I stand by my position that the concept of primary key --unlike the concept of candidate key – is not part of the formal relational model.  Of course, this is not to say that primary keys aren't useful; nor is it to say that we can't find useful rules of thumb to guide us in our practical database design activities.  However, those rules of thumb will usually be just that, rules of thumb, not precise and formal statements. 

 

PS: I apologize--but only slightly!--for my insistence on formal terms (relvar, relation, tuple, attribute).  I used to think that more user-friendly terms (table, row, column) were preferable, precisely because they were less intimidating.  However, I now think those latter terms were a mistake; they have done the cause of genuine understanding a grave disservice, precisely because they are only informal and drag a lot of semantic baggage along with them.  In particular, I think they've led to some of the grievous errors we observe in SQL--duplicate rows and left-to-right column ordering, to name two particularly obvious ones.

 

 

Posted 09/27/02

 

 

 

[ABOUT] [QUOTES] [LINKS]