From: MB
To: Editor
[Note: I originally addressed this directly to C.J. Date, as
the mail was inspired by a comment in his book. However, I am interested in a discussion on the
subject with any
interested party.]
I am something of a neophyte in the database management
field, insofar as I have only just started delving into any disciplined study
of the subject, although I have some 'practical' experience in managing
database systems. As such, I apologize
in advance for any misuse of terminology you may encounter, though I hope, at
least, my understanding of the concepts is sound.
In your book, AN INTRODUCTION TO DATABASE SYSTEMS you
state: "If the set of candidate keys actually does include more than one
member, then the choice of which is to be primary is essentially
arbitrary." I don't believe this
to be strictly true, even in the absence of a formal system of determining the
'right' primary key.
As an example, let's say we're running a bowling league, and
we decide to set up a database. The
following relation may well appear in our database:
TEAMS (TNAME, STANDING)
Our rules for this league state that all teams must have a
unique name, any new team gets assigned the lowest standing, and any ties are
resolved by a new round. Given this, we have a choice of two candidate keys -
TNAME and STANDING.
Both attributes are unique, irreducible, and satisfy the
entity integrity rule. However, I would expect anyone to intuitively choose
TNAME.
For the purposes of illustrating the problems of our
alternative, let's go ahead and treat STANDING as a primary key anyways, and
add another relation that references it:
MEMBERS (MNAME, STANDING)
The problem here should be self-evident - any change to
STANDINGS needs to be cascaded to MEMBERS, despite the fact that nothing has
altered in terms of what the table is supposed to represent. This anomaly stems from the fact that
we
chose as an identifier a transient, though unique, attribute.
Taking this as a starting point, I would propose that any
given attributes be separated into two distinct categories - transient and
immutable, the later being those attributes which are time-dependent. I would stipulate as a
general rule that no
candidate key should be considered an acceptable primary key unless it
satisfies the condition of immutability.
An obvious issue is that there may not exist an inherent
attribute of a given entity that also satisfies the requirements of a candidate
key. This is easily illustrated by
provisioning for the possibility of changing the name of a given team.
In practice, when setting up a database, I've always avoided
this by creating an assigned attribute that satisfies all conditions. This appears to be performed
at the DBMS
level in some cases, such as Postgres, where every tuple has an implicit
attribute (OID) that uniquely identifies it.
I'm interested in hearing any feedback you may have on this
subject, as I am not aware of whether this idea has received previous
treatment, and because I am certain my presentation of it still suffers from
ambiguities and unforeseen side issues.
Chris Date Responds:
Thank you for your thoughts on primary keys. You might well be right in
accusing me of overstatement when I say "the choice of which [candidate
key] is to be primary is essentially arbitrary"; I mean, "essentially
arbitrary" might be a little too strong. What I meant was--exactly as you
suggest--that there is, in general, no formal basis for making the
choice. I certainly agree that choosing
a "transient" key ("volatile" might be a better adjective),
as primary will often be a bad idea. I
also agree that introducing an artificial, surrogate, nonvolatile key will
often be a good idea. However:
Ø Both
of these precepts are indeed only informal in nature. (Define "volatile"!
Define "often"!)
Ø What
if there are several equally nonvolatile candidate keys? Consider, e.g., the table of chemical
elements, in which each element has a unique name, a unique symbol (e.g., tin =
Sn), a unique atomic number, and (I believe, though I'm not a chemist and I
could be wrong here) a unique atomic weight--all of them nonvolatile.
Ø I
came across a real-world example a few years back in which there were two
nonvolatile candidate keys, and which of the two was chosen as primary could
legitimately claim to depend on your point of view. The example involved a 1:1 relationship between
shipments and
invoices, and so there was a relvar (table) that looked like this:
IS {SHIP#, INV#,...}
CANDIDATE KEY {SHIP#}
CANDIDATE KEY {INV#}
Some users were primarily
(I choose my words carefully) interested in shipments and thus thought of SHIP#
as primary key; others were primarily interested in invoices and thus thought
of INV# as primary key. What would you
do? (One possible approach would be to
define two views, both isomorphic to the base relvar but with different primary
keys. But that suggestion raises the
question of primary keys for views ... See below.)
Ø What
do you do if you have (say) two candidate keys, one involving one attribute
(column) but volatile and the other involving two or more attributes but
nonvolatile? Multi-attribute primary
keys suffer from problems of their own (see my article Composite Keys in C. J.
Date and Hugh Darwen, RELATIONAL
DATABASE WRITINGS 1989-1991.
In other words, nonvolatility cannot be the sole criterion for making
the choice, in general.
Ø You
note that "[surrogate key support seems to be provided] at the DBMS level
in some cases, such as Postgres, where every tuple has an implicit attribute
(OID) that uniquely identifies it."
Three points here:
·
I'm not familiar with Postgres as such, but I strongly
doubt whether every tuple in Postgres has an "OID." I strongly suspect that what
you mean is
that every base tuple has such an ID (i.e., a tuple has such an ID if
and only if it's a tuple in a base relation).
But the primary key notion, if it's to mean anything at all formally,
has to apply to all relations, not just to base ones! The Principle of Database Relativity
dictates that there
must be no arbitrary differences (such as having vs. not having a
primary key) between base relations and other kinds of relations such as
views.
Note: SQL, of course, doesn't support primary keys
for views; in fact, it doesn't really support any kind of explicit integrity
constraints on views, except for the rather bizarre WITH CHECK OPTION
option. But then, that's SQL for you. What can I say --except to note that this is
just another piece of evidence (Exhibit No. 500 or so) that SQL is not now and
never was truly relational?
·
Many products do indeed automatically provide some kind
of "row ID" for base tuples (only).
But those row IDs typically constitute a violation of theInformation
Principle, in that they are typically not represented by
regular relational attributes and cannot be accessed as such. Ted Codd calls theInformation
Principle the most fundamental relational principle of all, and I tend to
agree with him. So those products
aren't truly relational, either.
·
The terminology of "OID" (= object ID) makes
me extremely suspicious. It strongly
suggests that Postgres--or somebody-- thinks that tuples are objects and
relations are object classes. Hugh
Darwen and I regard the equating of relations and object classes as The
First Great Blunder. We have
written extensively about that blunder--showing in particular just why it is
a blunder--in several places, including our book FOUNDATION FOR
FUTURE DATABASE SYSTEMS: THE THIRD MANIFESTOin
particular. See also the final chapter
in my book AN
INTRODUCTION TO DATABASE SYSTEMS.
By the way, The First Great Blunder is called
"first" because there's a second one, too: viz., mixing
pointers (OIDs) and relations. This one
is discussed in my articles Don't Mix Pointers and Relations! and Don't
Mix Pointers and Relations--Please in C. J. Date, RELATIONAL DATABASE
WRITINGS 1994-1997. If you
commit The First Great Blunder you will be inevitably sucked into
committing the second one too; however, you can commit The Second Great
Blunder without committing the first.
SQL (I mean the current standard, SQL:1999, as well as SQL as
implemented in most if not all of the current SQL products) commits The
Second Great Blunder, though not the first. Thus, not only was SQL never truly relational, it's
actually
become less relational over the years!
Back to the main thread of our discussion. All in all, I stand by my position
that the
concept of primary key --unlike the concept of candidate key – is
not part of the formal relational model.
Of course, this is not to say that primary keys aren't useful; nor is it
to say that we can't find useful rules of thumb to guide us in our
practical database design activities.
However, those rules of thumb will usually be just that, rules of thumb,
not precise and formal statements.
PS: I apologize--but only slightly!--for my insistence on
formal terms (relvar, relation, tuple, attribute). I used to think that more user-friendly terms
(table, row,
column) were preferable, precisely because they were less intimidating. However, I now think those
latter terms were
a mistake; they have done the cause of genuine understanding a grave
disservice, precisely because they are only informal and drag a lot of semantic
baggage along with them. In particular,
I think they've led to some of the grievous errors we observe in SQL--duplicate
rows and left-to-right column ordering, to name two particularly obvious ones.
Posted
09/27/02
[ABOUT]
[QUOTES]
[LINKS]