COMMENTS ON LARNER’S “NEW FOUNDATION FOR THE E/R MODEL”
by C. J. Date

 

 

 

PREAMBLE

 

I first read the subject paper some years ago--I forget exactly when, but I know it was sometime in the early 1990s.  At Fabian Pascal's suggestion, I recently read it again, and that rereading prompted me to write the commentary that follows.  Please note immediately that my comments are NOT intended for general publication or consumption in their present form or at the present time.  I might be persuaded to edit them later to make them suitable for wider distribution. 

 

    I knew Adrian Larner personally, and I can vouch for the fact that he thought about database matters much more deeply and carefully than most of us usually do.  His opinions on such matters are thus worthy of attention and respect.  Having said that, however, I have to say also that his writing was not always as clear as it might have been.  He was good with words, and his words were always well-chosen and clear, but--how to say this politely?--I fear the same cannot be said of his sentences.  In other words, I at least often had difficulty in following the flow of his argument, and I regret to have to say that these remarks do apply to some extent to the subject paper.  As a result, some of my criticisms of that paper might be invalid, in the sense that they might be based on some misunderstanding on my part.  If so, I apologize--but I do not think the fault, if fault there is, is all mine. 

 

 

THE E/R "MODEL"

 

The subject paper proposes a formal interpretation for the so-called entity/relationship (E/R) model.  In a private note on that paper, however, Fabian Pascal wrote:  "[I] don't exactly see how [the proposed E/R interpretation] differs from relational interpretation."  Well, me either.  I frankly see no need to drag the E/R model into the discussion at all; if the subject paper has anything of substance to offer, I believe it could be applied with very little change to the relational model equally well--especially since the data structures and operators discussed in the paper are just the data structures and operators of the relational model, no more and no less. 

 

    In any case, I don't even care to dignify the E/R ideas by referring to them as a model in the first place.  In my book AN INTRODUCTION TO DATABASE SYSTEMS I wrote: 

 

[It] is not even clear that the E/R "model" is truly a data model at all, at least in the sense in which we have been using that term in this book so far (i.e., as a formal system involving structural, integrity, and manipulative aspects).  Certainly the term "E/R modeling" is usually taken to mean the process of deciding the structure (only) of the database, although [it does deal with] certain integrity aspects also, mostly having to do with keys ... However, a charitable reading of [Chen's original E/R paper] would suggest that the E/R model is indeed a data model, but one that is essentially just a thin layer on top of the relational model (it is certainly not a candidate for replacing the relational model, as some have suggested). 

 

    And I went on to justify these claims in detail.  I omit the specifics here, except to note that: 

 

Ø       The "thin layer on top of the relational model" consisted of certain referential actions--cascade delete, etc.--that can certainly be (and usually are) implemented in today's SQL systems but are not part of the relational model as originally formulated. 

 

Ø       The detailed arguments were first written down in a much earlier paper (Entity/Relationship Modeling and the Relational Model, in InfoDB 5, No. 2, Summer 1990, republished in my book RELATIONAL DATABASE WRITINGS 1989-1991)—so they are hardly new, and were almost certainly in print at the time the subject paper was first written. 

 

 

"THE JOIN TRAP"

 

Quite frankly, I find the subject paper's remarks on what it calls "the classic join trap" more than a little annoying.  Here's the example from the paper, quoted verbatim: 

 

    Consider the relations: 

 

SUPPLY:  SUPPLIER  PART

         S1        P1

         S2        P1

         S2        P2

         ...       ...

 

USE:     PART      PROJECT

         P1        J1

         P2        J1

         P1        J2

         ...       ...

 

    These have interpretations ... :

 

       (SUPPLY)  x supplies y

 

       (USE)  y' is used in z

 

We form their composition, according to the standard interpretation: 

 

       EXISTS y EXISTS y' x supplies y and y = y'

                                       and y' is used in z.

 

    or EXISTS y x supplies y and y is used in z

 

In English, one instance is:  There is something supplied by S1 and used in J1; or, S1 supplies something used in J1.  But this does not follow from our base data; it could be false while the base data (shown above) was true (if none of the P1s supplied by S1 were used in J1): the classic join trap. 

 

    Aside:  A few brief comments on the foregoing quote:  First, the variables x, y, y', and z ought by rights to have appropriate ranges defined for them (however, we can presumably take those ranges as read for present purposes).  Second, it is not clear why Adrian uses the names x, y, y', and z anyway, in place of the more obvious (possibly dot-qualified) names SUPPLIER, PART, and PROJECT.  Third, I should explain, in case you are not familiar with the term, that the composition of SUPPLY and USE is the join of SUPPLY and USE (on PART), projected on SUPPLIER and PROJECT.  Finally, by the term instance here Adrian means the interpretation of one particular tuple in the result of the composition.  End of aside. 

 

    Anyway, my overall response to the example is:  Stuff and nonsense!  Values of attribute SUPPLIER (x) are supplier numbers; they identify specific suppliers.  Likewise, values of attribute PROJECT (z) are project numbers, and they identify specific projects.  But values of attribute PART (y and y'), though we do call them part numbers, do not identify specific parts--rather, they identify specific kinds of parts.*  The interpretation as given in Adrian's sentence beginning "In English" should thus more correctly be: 

 

There is some kind of part supplied by S1 and used in J1; or, S1 supplies some kind of part used in J1. 

 

As far as I'm concerned, this interpretation is 100% correct. 

 

----------

* I should know!  I've been using this example since 1972, if not earlier.  Though in fairness I should admit that my explanations of the example might not always have been as clear as they should have been. 

----------

 

    It seems to me that Adrian's "join trap" is identical to what James Martin, in his book COMPUTER DATA-BASE ORGANIZATION (2nd ed., Prentice-Hall, 1977) and elsewhere, has called "semantic disintegrity."  I wrote a short paper over 20 years ago—The Relational Model and its Interpretation, published in my book RELATIONAL DATABASE: SELECTED WRITINGS—that (among other things) tried to debunk that notion.  What follows is a lightly edited version of the relevant portion of that paper: 

 

A good illustration of lack of clear thinking—arising presumably from an inadequate understanding of the relational model and its interpretation—is provided by what is sometimes referred to as "semantic disintegrity."  Here is an example, taken from a book by James Martin ... Consider the relations: 

 

EMP  { EMP#, DEPT# }

DEPT { DEPT#, LOCATION }

 

Suppose a given department can have any number of locations.   Now consider the relation: 

 

RESULT { EMP#, DEPT#, LOCATION }

 

(the natural join of EMP and DEPT, on DEPT#).  Martin says this join is invalid, because if employee e works in department d and department d has locations x and y, it certainly does not follow that e is located in both x and y.  The assumption seems to be that relation RESULT states otherwise.  But of course it does not; it merely states what we already know—namely, that e works in d and d has locations x and y.  Relation RESULT does represent the answer to a certain query, but that query is not "Find employee locations."  Thus it is definitely wrong to say that the join is invalid (though it may be legitimate to warn against incorrect interpretation of that join). 

 

    I also want to comment on the term "classic join trap."  First, I'm not aware of that term appearing anywhere else than in the subject paper, so I'm not at all sure that the epithet "classic" is warranted.  More to the point, I think what Adrian is calling "the join trap" is essentially identical to what Codd, in his famous 1970 paper A Relational Model of Data for Large Shared Data Banks (CACM 13, No. 6, June 1970), called the connection trap.  In that paper, Codd was I think arguing against precisely the kind of confusion that Adrian, in the subject paper, and James Martin are both guilty of!  I commented on this issue as follows in my book THE DATABASE RELATIONAL MODEL: A RETROSPECTIVE REVIEW AND ANALYSIS

 

Codd wrote: 

 

A lack of understanding of [the semantics of the relational operators] has led several systems designers into what may be called the connection trap.  [For example, suppose we have a nonrelational system in which] each supplier description is linked by pointers to the descriptions of each part supplied by that supplier, and each part description is similarly linked to the descriptions of each project which uses that part.  A conclusion is now drawn which is, in general, erroneous: namely that, if all possible paths are followed from a given supplier via the [corresponding] parts ... to the projects using those parts, one will obtain a valid set of all projects supplied by that supplier. 

 

       [To the foregoing I added:]  Of course, we don't have to be following pointers in order to fall into the connection trap--the very same logical error can unfortunately be made in a purely relational system too.  Indeed, some writers have criticized relational systems on exactly these grounds ... I hope it's obvious, however, that such criticisms are invalid, betraying as they do a sad lack of understanding of the relational model.

 

    The point is—at least, so it seems to me—we can never stop users from falling into errors of interpretation; but in a relational context, at least, the errors rise to the surface (as it were) and can be clearly identified and perhaps avoided.  So Adrian is blaming the relational model for a problem that, at its worst, is more easily recognized and fixed in a relational system than in other kinds.  This criticism seems to me less than fair. 

 

 

"CRITERIA OF IDENTITY"

 

I agree with the subject paper that "criteria of identity" are needed.  To quote:  "[The] criterion of identity [is what] tells us when we have got one (i.e., one and not two); it removes, not vagueness, but ambiguity."  But the paper goes on to say:  "[The proposed interpretation] avoids the use of absolute identity"—and from the context, it is clear that this state of affairs is to be contrasted with that found in connection with the relational model.  Now, I freely admit that I have no training in formal logic, but I am certainly not aware of any respectable relational publication that relies on any notion of "absolute identity."  It is true that in our work on THE THIRD MANIFESTO, Hugh Darwen and I do require "identity" (we call it equality) to be defined for every type, and we explain exactly what we mean by that requirement; as far as I can tell, however, our "equality" is the same as (is identical to?) Adrian's "criteria of identity." 

 

 

A REMARK ON SET THEORY

 

The subject paper includes, just prior to its "Conclusion" section, a sideswipe against set theory.  I'd like to close this commentary by quoting the relevant paragraph in its entirety, if only because the paragraph in question is one of those I find almost completely unintelligible and the sideswipe thus, even if justifiable, not in fact justified by the subject paper. 

 

Finally, we can even use the [proposed interpretation] to distinguish "types" from "instances" without resort to the complexities of set theory.  A person instance is something that falls under the criterion of application, "... is a person," and has the criterion of identity, "... is the same person as something" (or even, "as itself").  The person type is something that falls under the same criterion of application, but has the criterion of identity, "... is a person, and so is ..."; the type is an entity with a very coarse classification (a single equivalence class into which each person falls).

 

 

Ed. Comment: The logical model underlying a database represents something about the real world. Therefore, to know whether a logical model is an accurate representation, we must know exactly what it represents. Errors like the connection trap can only occur if the knowledge or understanding of what is being represented is poor.

 

In fact, a logical model is a formal representation of a conceptual model, which is an informal (ad hoc) representation of the real world based on subjective perceptions of that world. The conceptual model is the user-understood interpretation of enterprise-specific data; the logical model is the DBMS-“understood” meaning of the data. And a data model e.g. the relational model is, so to speak, a “translation language” via which the former is mapped to the latter, so that the DBMS can make mechanized inferences and guarantee their correctness. A clearly and carefully specified, consistent conceptual model is a pre-requisite for logical database representation.

 

Unfortunately, because the conceptual level is subjective and informal, it is not given adequate attention, with confused, vague, ambiguous, under-specified, or inconsistent results; confusion of the two levels of representation also occurs frequently. Mapping such models logically more often than not yields questionable database designs, erroneous queries, and erroneous results. I believe that E/R is a poor conceptual modeling methodology, which should not be confused, as it usually is, with logical design.

 

For an attempt to develop a rigorous conceptual modeling methodology—with tighter logical mapping—which also keeps the two levels more distinct see Conceptual Modeling and Database Design: A Foundation Framework for Data Management), and the seminar by same name.

 

 

Posted 11/18/05