PREAMBLE
I first read the subject
paper some years ago--I forget exactly when, but I know it was sometime in
the early 1990s. At Fabian Pascal's
suggestion, I recently read it again, and that rereading prompted me to write
the commentary that follows. Please
note immediately that my comments are NOT intended for general publication or
consumption in their present form or at the present time.
I might be persuaded to edit them later to
make them suitable for wider distribution.
I knew Adrian
Larner personally, and I can vouch for the fact that he thought about database
matters much more deeply and carefully than most of us usually do.
His opinions on such matters are thus worthy
of attention and respect. Having said
that, however, I have to say also that his writing was not always as clear as
it might have been. He was good with
words, and his words were always well-chosen and clear, but--how to say this
politely?--I fear the same cannot be said of his sentences. In other words,
I at least often had
difficulty in following the flow of his argument, and I regret to have to say
that these remarks do apply to some extent to the subject paper. As a result,
some of my criticisms of that
paper might be invalid, in the sense that they might be based on some
misunderstanding on my part. If so, I
apologize--but I do not think the fault, if fault there is, is all mine.
THE E/R "MODEL"
The subject paper proposes a formal interpretation for the
so-called entity/relationship (E/R) model.
In a private note on that paper, however, Fabian Pascal wrote:
"[I] don't exactly see how [the
proposed E/R interpretation] differs from relational interpretation."
Well, me either. I frankly see no need to drag the E/R model into the discussion
at all; if the subject paper has anything of substance to offer, I believe it
could be applied with very little change to the relational model equally
well--especially since the data structures and operators discussed in the paper
are just the data structures and operators of the relational model, no more and
no less.
In any case, I
don't even care to dignify the E/R ideas by referring to them as a model in the
first place. In my book AN INTRODUCTION TO DATABASE SYSTEMS
I wrote:
[It] is not even clear that the E/R "model" is
truly a data model at all, at least in the sense in which we have been using
that term in this book so far (i.e., as a formal system involving structural,
integrity, and manipulative aspects).
Certainly the term "E/R modeling" is usually taken to mean the
process of deciding the structure (only) of the database, although [it
does deal with] certain integrity aspects also, mostly having to do with keys
... However, a charitable reading of [Chen's original E/R paper] would suggest
that the E/R model is indeed a data model, but one that is essentially just a
thin layer on top of the relational model (it is certainly not a candidate
for replacing the relational model, as some have suggested).
And I went on to justify these claims in detail. I omit
the specifics here, except to note that:
Ø
The
"thin layer on top of the relational model" consisted of certain
referential
actions--cascade delete, etc.--that can certainly be (and usually are)
implemented in today's SQL systems but are not part of the relational model as
originally formulated.
Ø
The
detailed arguments were first written down in a much earlier paper
(Entity/Relationship
Modeling and the Relational Model, in InfoDB 5, No. 2, Summer 1990,
republished in my book RELATIONAL
DATABASE WRITINGS 1989-1991)—so they are hardly new, and were almost
certainly in print at the time the subject paper was first written.
"THE JOIN TRAP"
Quite frankly, I find the subject paper's remarks on what it
calls "the classic join trap" more than a little annoying.
Here's the example from the paper, quoted verbatim:
Consider the
relations:
SUPPLY: SUPPLIER PART
S1 P1
S2 P1
S2 P2
... ...
USE: PART PROJECT
P1 J1
P2
J1
P1 J2
... ...
These have interpretations ... :
(SUPPLY) x supplies y.
(USE) y' is used in z.
We form their composition, according to the standard
interpretation:
EXISTS y
EXISTS y' x supplies y and y = y'
and y' is used in z.
or EXISTS y
x supplies y and y is used in z.
In English, one instance is:
There is something supplied by S1 and used in J1; or, S1 supplies
something used in J1. But this does not
follow from our base data; it could be false while the base data (shown above)
was true (if none of the P1s supplied by S1 were used in J1): the classic join
trap.
Aside: A few brief comments on the foregoing quote:
First, the variables x, y, y', and z
ought by rights to have appropriate ranges defined for them (however, we can
presumably take those ranges as read for present purposes). Second,
it is not clear why Adrian uses the
names x, y, y', and z anyway, in place of the more obvious
(possibly dot-qualified) names SUPPLIER, PART, and PROJECT. Third,
I should explain, in case you are not
familiar with the term, that the composition of SUPPLY and USE is the
join of SUPPLY and USE (on PART), projected on SUPPLIER and PROJECT.
Finally, by the term instance here
Adrian means the interpretation of one particular tuple in the result of the
composition. End of aside.
Anyway, my overall
response to the example is: Stuff and
nonsense! Values of attribute SUPPLIER
(x) are supplier numbers; they identify specific suppliers.
Likewise, values of attribute PROJECT (z)
are project numbers, and they identify specific projects.
But values of attribute PART (y and y'),
though we do call them part numbers, do not identify specific
parts--rather, they identify specific kinds of parts.*
The interpretation as given in Adrian's
sentence beginning "In English" should thus more correctly be:
There is some kind of part supplied by S1 and used in J1; or,
S1 supplies some kind of part used in J1.
As far as I'm concerned, this interpretation is 100%
correct.
----------
* I should know!
I've been using this example since 1972, if not earlier.
Though in fairness I should admit that my
explanations of the example might not always have been as clear as they should
have been.
----------
It seems to me
that Adrian's "join trap" is identical to what James Martin, in his
book COMPUTER DATA-BASE ORGANIZATION (2nd ed., Prentice-Hall, 1977) and
elsewhere, has called "semantic disintegrity."
I wrote a short paper over 20 years ago—The
Relational Model and its Interpretation, published in my book RELATIONAL DATABASE: SELECTED
WRITINGS—that (among other things) tried to debunk that notion.
What follows is a lightly edited version of
the relevant portion of that paper:
A good illustration of lack of clear thinking—arising
presumably from an inadequate understanding of the relational model and its
interpretation—is provided by what is sometimes referred to as "semantic
disintegrity." Here is an example,
taken from a book by James Martin ... Consider the relations:
EMP { EMP#, DEPT# }
DEPT { DEPT#, LOCATION }
Suppose a given department can have any number of locations.
Now consider the relation:
RESULT { EMP#, DEPT#, LOCATION }
(the natural join of EMP and DEPT, on DEPT#).
Martin says this join is invalid, because if
employee e works in department d and department d has
locations x and y, it certainly does not follow that e is
located in both x and y.
The assumption seems to be that relation RESULT states otherwise.
But of course it does not; it merely states
what we already know—namely, that e works in d and d has
locations x and y.
Relation RESULT does represent the answer to a certain query, but that
query is not "Find employee locations."
Thus it is definitely wrong to say that the
join is invalid (though it may be legitimate to warn against incorrect
interpretation of that join).
I also want to
comment on the term "classic join trap."
First, I'm not aware of that term appearing anywhere else than in
the subject paper, so I'm not at all sure that the epithet "classic"
is warranted. More to the point, I
think what Adrian is calling "the join trap" is essentially identical
to what Codd, in his famous 1970 paper A Relational Model of Data for Large
Shared Data Banks (CACM 13, No. 6, June 1970), called the connection
trap. In that paper, Codd was I think
arguing against precisely the kind of confusion that Adrian, in the subject
paper, and James Martin are both guilty of!
I commented on this issue as follows in my book THE DATABASE RELATIONAL MODEL: A
RETROSPECTIVE REVIEW AND ANALYSIS:
Codd wrote:
A lack of understanding of [the semantics of the relational
operators] has led several systems designers into what may be called the connection
trap. [For example, suppose we have
a nonrelational system in which] each supplier description is linked by pointers
to the descriptions of each part supplied by that supplier, and each part
description is similarly linked to the descriptions of each project which uses
that part. A conclusion is now drawn
which is, in general, erroneous: namely that, if all possible paths are
followed from a given supplier via the [corresponding] parts ... to the
projects using those parts, one will obtain a valid set of all projects
supplied by that supplier.
[To the
foregoing I added:] Of course, we
don't have to be following pointers in order to fall into the connection
trap--the very same logical error can unfortunately be made in a purely
relational system too. Indeed, some
writers have criticized relational systems on exactly these grounds ... I hope
it's obvious, however, that such criticisms are invalid, betraying as they do a
sad lack of understanding of the relational model.
The point is—at
least, so it seems to me—we can never stop users from falling into errors of
interpretation; but in a relational context, at least, the errors rise to the
surface (as it were) and can be clearly identified and perhaps avoided.
So Adrian is blaming the relational model
for a problem that, at its worst, is more easily recognized and fixed in a
relational system than in other kinds.
This criticism seems to me less than fair.
"CRITERIA OF IDENTITY"
I agree with the subject paper that "criteria of
identity" are needed. To
quote: "[The] criterion of
identity [is what] tells us when we have got one (i.e., one and not two);
it removes, not vagueness, but ambiguity." But the paper goes on to say:
"[The proposed interpretation] avoids the use of absolute
identity"—and from the context, it is clear that this state of affairs is
to be contrasted with that found in connection with the relational model.
Now, I freely admit that I have no training
in formal logic, but I am certainly not aware of any respectable relational
publication that relies on any notion of "absolute identity."
It is true that in our work on THE THIRD MANIFESTO, Hugh
Darwen and I do require "identity" (we call it equality) to be
defined for every type, and we explain exactly what we mean by that
requirement; as far as I can tell, however, our "equality" is the
same as (is identical to?) Adrian's "criteria of identity."
A REMARK ON SET THEORY
The subject paper includes, just prior to its
"Conclusion" section, a sideswipe against set theory.
I'd like to close this commentary by quoting
the relevant paragraph in its entirety, if only because the paragraph in
question is one of those I find almost completely unintelligible and the
sideswipe thus, even if justifiable, not in fact justified by the subject
paper.
Finally, we can even use the [proposed interpretation] to
distinguish "types" from "instances" without resort to the
complexities of set theory. A person
instance is something that falls under the criterion of application, "... is
a person," and has the criterion of identity, "... is the same person
as something" (or even, "as itself").
The person type is something that falls under the same criterion
of application, but has the criterion of identity, "... is a person, and
so is ..."; the type is an entity with a very coarse classification (a single
equivalence class into which each person falls).
Ed. Comment: The logical
model underlying a database represents something about the real world.
Therefore, to know whether a logical model is an accurate representation, we
must know exactly what it represents. Errors like the connection trap can
only occur if the knowledge or understanding of what is being represented is
poor.
In fact, a logical model is a formal representation of
a conceptual model, which is an informal (ad hoc) representation
of the real world based on subjective perceptions of that world. The conceptual
model is the user-understood interpretation of enterprise-specific data; the
logical model is the DBMS-“understood” meaning of the data. And a data model
e.g. the relational model is, so to speak, a “translation language” via which
the former is mapped to the latter, so that the DBMS can make mechanized
inferences and guarantee their correctness. A clearly and carefully
specified, consistent conceptual model is a pre-requisite for logical
database representation.
Unfortunately, because the conceptual level is subjective and
informal, it is not given adequate attention, with confused, vague, ambiguous,
under-specified, or inconsistent results; confusion of the two levels of
representation also occurs frequently. Mapping such models logically more often
than not yields questionable database designs, erroneous queries, and erroneous
results. I believe that E/R is a poor conceptual modeling methodology, which
should not be confused, as it usually is, with logical design.
For an
attempt to develop a rigorous conceptual modeling methodology—with tighter
logical mapping—which also keeps the two levels more distinct see Conceptual Modeling and
Database Design: A Foundation Framework for Data Management), and the seminar by same name.
Posted 11/18/05