Wednesday, November 8, 2017

Understanding Conceptual vs. Data Modeling Part 1: Data Model - The RDM Is, the E/RM Isn't



Re-write 10/16/18
“E/RM is a data model -- So says Date, Chen, etc. So says the majority of current industry experts ... With very strong references to Codd (who he worked with), Date elegantly explains the differences between RM and E/RM -- but clearly believes both are data models (even allowing for the charitable comment). If we take a RDB as the ultimate target implementation of data, and an E/RM (or extended) can correctly design all the artifacts that are implemented, this means it is modeling the data. Granted, an E/RM does not explicitly model some of the non-structural aspects of the original Codd definition.”

“Out of interest, is there a common Relational Modeling tool, that is not also an E/RM tool and models the full Codd definition? There are also several other methods of modeling data -- E/RM is more a mechanism to represent the data. If E/RMs are used by IT professionals across the world to direct the design and build of the majority of applications guided by standard methodologies, is the view of this argument that these were all build wrongly? Regardless of success? Is the inferred conclusion that only the RM models data, and ERM, [or] any other techniques do not? [If so] that is a little limiting.”

Objects, Properties, and Ontological Commitment


We are culturally and linguistically conditioned to conceptualize the world as objects with properties. Objects in a universe thereof that share common properties are of the same type and form a class, distinguishing them from objects that are not and do not. Applying a class definition to the universe  selects out the group of objects of that type from the universe.

Philosophical ontology is the study of being, existence, reality, as well as the basic categories of being, and their relationships -- what entities exist or may be said to exist, and how they may be grouped, related, and subdivided according to similarities and differences. 

Note: 'Object' is used in the general, not OO sense. Ontology, as used herein, should not be confused with "computer science ontology", whereby the term ontology was usurped, and is understood by programmers as meaning a conceptual graph of directed semantic relationships among objects (and only sometimes among object types).

Conceptual modeling (1) identifies types of objects of interest, and (2) formulates business rules (BR) that specify their properties and relationships and, as such, makes an ontological commitment. Any approach to conceptual modeling must consider the ontological commitment upon which it is based, which has major implications for the data model used to formalize conceptual models as logical models for computable database representation -- it must be consistent with that commitment.

Unfortunately, due to lack of foundation knowledge in the industry[1], practitioners -- both vendors and users -- are largely unaware of, and oblivious to ontological underpinning and their implications for database technology and practice, one reason why they not only stagnated, but regressed in the last five decades. In this multipart series we explain the important distinction between conceptual, and data modeling (aka logical database design), which requires a formal data model. The E/RM is not, and while it can be used for conceptual modeling of reality, not data, we outline a new conceptual modeling approach that makes a different ontological commitment and requires adjustments to the RDM, both necessary for genuine progress.



------------------------------------------------------------------------------------------------------------------
SUPPORT THIS SITE


I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 

NEW PUBLICATIONS 

New: The Key to Relational Keys: A New Perspective

New: SOCIAL MEDIA 

I deleted my Facebook account. You can follow me on Twitter: 

@dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back. 

@ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order. 


HOUSEKEEPING


  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
------------------------------------------------------------------------------------------------------------------

In the Beginning

  • In 1975, ANSI sanctioned the 3-schema model, labeling the three levels external, conceptual, and physical, all of which describe aspects of the computer architecture. The conceptual level was also referred to as the conceptual model[2].
  • In 1981, the IDEF standard was introduced, which elaborated on this and IDEF1 focused on "information modeling" -- the term du jour for the process of representing semantic information formally -- to produce the conceptual model or conceptual schema, defined as a single (i.e., integrated and application independent) semantic definition of the data resources[3].
Codd, Chen, Bachman, and Nijssen were all involved in these efforts, and the 3-schema architecture was described in Date's early books.

In the context of at least ANSI 3-schema and IDEF, a data model was the formal theory underlying (a) a conceptual model/level, or (b) a logical model/level used interchangeably, and the conceptual level/model was always a specific implementation of the chosen data model (confusing? par for the industry). These notions of the conceptual were more or less formal, not the informal, user focused notion of today. Over time, due in part to the work of Nijssen, Halpin, and colleagues[4,5], and in part to database designers' need to understand the non-technical (application, technology, and formalism independent) terminology of users, and map it to something formal, the conceptual has come to refer to something less formal and detailed[6]. It's not clear when exactly the three-fold conceptual-logical-physical distinction came into being. Thus:

“Between 1975 and 1981 no one considered a "conceptual level" that was both technology and formalism (relational, OO, set theoretic, etc.) independent -- no one strove to have an informal description of the business as needed by the application environment that would be familiar and understandable to non-technical users. So Chen and Codd, coming from formal mathematics, did not recognize the need for a distinct conceptual level as currently understood, with its own terminology. To the degree that they maintained any distinction, it was accidental, and driven by the need to explain what they thought of as conceptual, but which we would now consider logical.”
--David McGoveran

The Relational Data Model


The RDM was introduced in 1969-70[7], but Codd defined the formal data model as understood today only a decade after, in 1980, by identifying its necessary components: a combination of (1) data structures, (2) operators applicable to them to retrieve data, and (3) integrity rules (known as constraints) that keep the database consistent when it is updated[8]. The RDM's are:

  • Structure = relation defined on domains;
  • Integrity = relational constraints;
  • Manipulation = relational algebra (RA).

all derived from simple set theory (SST) and first order predicate logic (FOPL), and adjusted for database management. This has become the logical in the current trinity of representation levels[9].

One single sentence in the 1969-70 paper used terminology considered conceptual today:

“In previous work there has been a strong tendency to treat the data in a data bank as consisting of two parts, one part consisting of entity descriptions (for example, descriptions of suppliers) and the other part consisting of relations[hips] between the various entities or types of entities (for example, the supply relation).”
Note that it's the data (logical) that consists of "entity descriptions" and "relations[hips] between entities/entity types (conceptual), where descriptions are specifications of entity properties (e.g., part number, part name, part description, quantity-on-hand, and quantity-on-order for parts). In other words, the conceptual (objects and properties) and the logical are not distinct levels in Codd's RDM.

The Entity-Relationship Model


In 1976, six years after the RDM, but four years before the definition of a formal data model, Chen introduced the E/RM as a superior alternative data model as understood at the time[10]:
“The logical view of data has been an important issue in recent years. Three major data models have been proposed: the network model, the relational model, and the entity set model ... This paper presents the entity-relationship model, which has most of the advantages of the above three models ... adopts the more natural view that the real world consists of entities and relationships. It incorporates some of the important semantic information about the real world ... The model can achieve a high degree of data independence and is based on set theory and relation theory.”
Logical view and set and relation theory notwithstanding, the E/RM mixes conceptual and logical terminology as they are currently understood:
“...we consider entities and relationships ... There is a predicate associated with each entity set to test whether an entity belongs to it ... Among these [common] properties is the aforementioned test predicate ... In this case, MALE-PERSON is a subset of PERSON ... A relationship set, Ri, is a mathematical relation among n entities ... An attribute can be formally defined as a function which maps from an entity set or a relationship set into a value set or a Cartesian product of value sets ... Also note that an attribute is defined as a function. Therefore, it maps a given entity to a single value (or a single tuple of values in the case of a Cartesian product of value sets).”
Entities, properties, and relationships (conceptual) are intermingled with sets, predicates, mathematical relations, Cartesian product, functions, and tuples (logical) (e.g. entity sets, predicates as properties Today we would deem this conceptual-logical conflation (CLC)[11], but as McGoveran points out:
“Chen thought, like Codd, that set theory (he didn't specify any particular one) was formal enough to count as logical, and that his description of the model (see Part 2) was all one needed for the conceptual level. This doesn't count as CLC: to have conflation you need to have something to conflate. I don't think he saw his logical (set theoretical) terminology as needing to be distinct from the conceptual terminology. Chen's position was essentially consistent with the inadequate treatment of what we consider conceptual modeling at the time, and in fact prior to Codd's 1969 paper.”
In other words, neither saw the need for a conceptual level distinct from the logical. In the RDM this is reflected in the almost total absence of conceptual terminology, while the E/RM mixes both terminologies.

While Chen cannot be accused of CLC, current data professionals ought to be familiar with the three distinct levels of representation, and know that a data model (1) should satisfy Codd's definition, and (2) is used to formalize informal conceptual models of reality as logical models for computable database representation[12]. By that definition, is the E/RM a data model, as many practitioners believe? And if not, what can it be used for?

Continued in Part 2.



References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.


[2] Three-schema approach, Wikipedia.

[3] IDEF1x, Wikipedia.

[4] Nijssen, S., and Halpin, T. A., CONCEPTUAL SCHEMA AND RELATIONAL DATABASE DESIGN.

[5] Object Role Modeling.

[6] Conceptual schema, Wikipedia.

[7] Codd, E. F., A Relational Model of Data for Large Shared Data Banks.

[8] Codd, E. F., Data Models in Database Management.

[9] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation.

[10] Chen, P., The Entity-Relationship Model - Toward a Unified View of Data.

[11] Pascal, F., The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[12] Pascal, F., Data Model: Neither Conceptual, Nor Logical, Nor Physical Model.


Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.




Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.






No comments:

Post a Comment