Sunday, October 28, 2018

Understanding Conceptual vs. Data Modeling Part 3: Don't Conflate Reality and Data




In Part 1 and Part 2  we explained that between 1975-81, when the E/RM and RDM were introduced, there was no distinction between an informal conceptual and a formal logical level. In 1980, however, Codd defined a formal data model and in the later 80s the conceptual-logical-physical levels of representation emerged. If applied to the two models:

  • Only the RDM satisfies the definition;
  • The E/RM can be used at the conceptual level to model reality, the latter can be used to model data at the logical level (i.e., formalize conceptual models as logical models for database representation).
Current practitioners, however, continue to confuse levels of representation and confuse/conflate types of model. So much so, that in my presentations I used to draw an imaginary line dividing the room into two sections, and move to the right section to discuss one level/model, and to the left section to discuss another.

Consider the question "does data modeling slow down an application development process?". I will set aside the notion of "speeding up" application development by skipping altogether "data modeling" (whichever way it is meant), and focus on the response.


------------------------------------------------------------------------------------------------------------------
SUPPORT THIS SITE 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 

NEW PUBLICATIONS 

NEW: The Key to Relational Keys: A New Perspective

NEW: SOCIAL MEDIA 

I deleted my Facebook account. You can follow me on Twitter:
  • @dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back.
  • @ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.

HOUSEKEEPING
  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
------------------------------------------------------------------------------------------------------------------

Reality: Facts and Business Rules


“There are different modeling techniques ... The experience displayed here is with fact based modeling, FCO-IM to be precise ... the first step in the modeling procedure is to collect concrete examples ... Suppose we want to model cars, we start with collecting examples of (real world) cars, we start with collecting examples of (real world) cars (OK, maybe pictures of them) ... The next step is to verbalize the collected examples. Verbalization is the expression of facts that are represented by the examples in natural language sentences. Now, to be precise, it is required to derive elementary fact expressions. These are sentences that express exactly one fact, no less, no more. So, this could look something like this:

The car with license number 62-JZK-6 has the color aquamarine blue.”

                   --The power of the data model, LinkedIn.com
 

Data modeling is viewed as consisting of two steps:
  • "Collect concrete examples"; and,
  • "Verbalize the collected examples" as facts.
for which the Fully Communication Oriented Information Modeling (FCO-IM) approach is advocated.

A fact is a statement in natural language about a real world object of some type (car) identified by a name (license number)  -- an assigned shorthand for a collection of properties (color) of objects of that type -- that is unequivocally true or false (for the distinction between names and properties, see[1]).

The two steps comprise conceptual modeling of reality, not data modeling. Indeed, the author concludes his response by stating explicitly that "...a FCO-IM model is fully conceptual". If facts are necessary to "make sense" out of data, then they are not data, but what the data means[2] (i.e., its interpretation).

Note: According to Wikipedia,

“The distinguishing feature of FCO-IM is that it models the communication about a certain Universe of Discourse (UoD) completely and exclusively ... (the facts users exchange when they communicate about the UoD). FCO-IM is therefore a member of the family of information modeling techniques known as fact-oriented modeling (FOM), as are Object-Role Modeling (ORM), predicator set model (PSM), and natural language information analysis method (NIAM)[3].”

UoD is a reality, not data concept. I am not familiar with the FCO-IM specifically, but I am familiar with NIAM -- with which most fact-oriented modeling originates -- and its author made it clear it is a conceptual -- distinct from data -- modeling approach in the very title of his original book[3]. As already mentioned, fact-based modeling  is superior to E/R modeling, but its superiority is lost on non-relational SQL technology and practice.

 

Conceptual modeling of reality consists of identifying types of objects, and formulating business rules (BR) for all. For each object type:
  • There is a BR that generalizes the expression of a fact about an individual object to all objects of that type. For example, the BR generalizing the above car fact to all cars would be:
Forall cars, a car with license number (LICENSE#) has color (COLOR).
where LICENSE# and COLOR are names and colors of specific cars.
  • There is a group of objects of the same type and a set of BRs that specifies relationships among member properties and among members, the latter being collective properties of the group[4].


Data and Constraints


Data modeling is using a formal data model that satisfies Codd's definition to put conceptual models -- BRs and facts -- in a computable form for database representation[5]. If the RDM is used, each group of objects of a type formalizes as a relation, with each fact formalizing as a tuple, and each BR expressing a relationship formalizes as an integrity constraint. For example, the car fact above would be represented by the tuple:

(62-JZK-6,blue)
Integrity constraints are predicates -- formal versions of BRs in first order logic -- expressed in a specific data language such that the DBMS can constrain the data to be consistent with the conceptual model of the real world[6] (i.e., enforce the relationships on the data).

Models, Models Everywhere


“But the true power [of a data model] is in the ability to generate other models that serve other (and mostly) implementation purposes. Apart from deriving a normalized logical data model, other possibilities contain physical data models for different kinds of database platforms, XML schema definitions and JSON-structures to name a few. Depending on the implementation model chosen, nearly all the modeled constraints are transformed to physical constraints that can be applied and enforced operationally. Moreover, the structure and naming of elements of either implementation model is fully consistent with the original conceptual information model.”

It's easier to rephrase the paragraph than debunk it, and leave it to the reader to judge the difference.

The true power of a formal data model that satisfies Codd's definition (e.g., RDM) is in the ability to generate enterprise-specific logical models implementable in hardware and software from enterprise-specific conceptual models produced by a conceptual modeling approach (e.g., FCO-IM). Depending on the approach, the data model, and the hardware and software platform chosen, the elements of a conceptual model (objects, properties, and relationships) are formalized as elements of the logical model (structure and constraints), and implemented physically in the platform. What’s more important, by having [FCO-IM's] step-by-step approach and following it, you can be sure that the information [i.e., conceptual] model you’re creating adheres to a high level of quality and contains necessary BRs.

How consistent an implementation is with the conceptual model depends on a multitude of factors, including:

  • What the conceptual modeling approach captures for database representation;
  • The formal data model used to map the conceptual to the logical model;
  • The foundation knowledge and skills of the modeler, the database designer, and the implementer;
  • How correctly and completely the chosen DBMS supports the formal data model.
Practitioners have poor foundation knowledge, do not practice proper fact-based modeling, do not use a formal data model, confuse levels of representation, and use SQL technology for implementation. For which reasons consistency with conceptual models is not guaranteed[7]. Note, for example, that:
  • A conceptual model contains informal BRs, not formal constraints (and the classification of constraints is based on the types of relationships they formalize[4], not the arbitrary ones mentioned).
  • Relational data modeling produces "[fully] normalized logical models"[8];
  • "XML schemas and JSON structures" correspond to what are claimed to be data models (hierarchic for XML), but which do not satisfy Codd's definition, and which, for a variety of reasons, would be inferior to the RDM even if they did (hence the evaporation of the much hyped "XML DBMSs"). 


Conclusion


When the RDM and the E/RM were introduced there was no distinction between conceptual and logical levels of representation, and no definition of a formal data model. Despite the subsequent emergence of the conceptual-logical-physical distinction, and the formal definition of a data model, use of terms such as "logical data models" and "physical data models" usually reflect -- and certainly induce -- confusion of  levels of representation[9], as well as of types of models[10]. That is why I recommend the three-fold terminology conceptual modeling, logical database design, and physical implementation to preempt the confusion[11]. While Chen in 1976 cannot be accused of it, this does not apply to current practitioners who treat conceptual modeling and data modeling (aka logical database design) as if they are one and the same. They are distinct, the former precedes the latter, and confusing or conflating them leads to serious problems in database practice. The author's heart is in the right place, but industry misconceptions are quite entrenched and difficult to escape without sufficient foundation knowledge[12].

A formal data model must be ontologically consistent with the conceptual modeling approach it formalizes for database representation. While the E/RM can be used for conceptual modeling, and the RDM is ontologically consistent with it, in Part 4 we outline an alternative approach that makes a different commitment which, coupled with revisions of the RDM, offers advantages, and may constitute the first genuine progress in database technology since the RDM and NIAM/ORM[13].


References

[1] Pascal, F., The Key to Relational Keys: A New Understanding.

[2] Pascal, F., What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency.

[3] Nijssen, S., and Halpin, T. A., CONCEPTUAL SCHEMA AND RELATIONAL DATABASE DESIGN.

[4] Pascal, F., Relationships and the RDM Part 1 - Kinds of Relationships.

[5] Pascal, F., Conceptual Modeling for Database Design - Formalizing the Informal.

[6] Pascal, F., Business Rules, Predicates, Integrity Constraints and Database Consistency.


[7] Pascal, F., Relational Database Design, Logical Validity, and Semantic Correctness. 

[8] Pascal, F., Database Design: What It Is and Isn't.

[9] Pascal, F., The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[10] Pascal, F., Data Model: Neither Conceptual, Nor Logical, Nor Physical Model.

[11] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation.

[12] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[13] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.







Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.





No comments:

Post a Comment