Sunday, December 2, 2018

What Is a Data Model, and What It Is Not

“The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.”
--Data Model, Wikipedia

What a True Data Model Is

Few practitioners realize that Codd invented the Relational Data Model (RDM) as the first exemplar of a data model, a concept that he formalized in 1980 as follows:


I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 


NEW: The Key to Relational Keys: A New Perspective


I deleted my Facebook account. You can follow me on Twitter:

  • @dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back.
  • @ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

“A combination of three components:
1) a collection of data structure types (the building blocks of any database that conforms to the model);
2) a collection of operators or inferencing rules, which can be applied to any valid instances of the data types listed in (1), to retrieve or derive data from any parts of those structures in any combinations desired;
3) a collection of general integrity rules, which implicitly or explicitly define the set of consistent database states or changes of state or both—these rules may sometimes be expressed as insert-update-delete rules.”
Note: We prefer integrity constraints, to avoid confusion with business rules.

A contender that does not satisfy this definition is not a data model. The hierarchic and network models were contenders, but are incomplete, lack the RDM advantages, and -- their current revival notwithstanding[2] -- had been abandoned due to prohibitive rigidity and complexity. The entity-relationship model (E/RM) can be used for modeling of information about reality (its entities and attributes are conceptual, distinct from RDM's relations  -- not tables! -- which are logical[3]). Confusing or conflating conceptual with logical terminology constitutes conceptual-logical conflation (CLC) of levels of representation[4,5,6,7].

As far as we know, currently only the RDM satisfies Codd's definition, its components being:
  • Data structure type = relations on domains;
  • Data manipulation = relational algebra (RA);
  • Data integrity = several types of constraints.
and also benefits from a sound formal theoretical foundation -- simple set theory (SST) and first order predicate logic (FOPL) -- that confers unique advantages, including:
  • Declarative, decidable data languages;
  • Physical (PI)[8,9.10] and logical independence (LI)[11];
  • System-guaranteed logical validity, and semantic correctness guaranteed by design[12];
  • Simplicity[13].
Note: See the series on Data and Meaning for the significance of semantic correctness by design.

If reality is conceptualized as:
  • Objects: entities, entity groups, and multigroups;
  • Entities with first order properties (1OP), and second order properties (2OP) that are relationships among 1OPs;
  • Groups with third order properties (3OP) that are relationships among all their entity members;
  • Multigroups with fourth order properties (4OP) that are relationships among their group members.
then using the RDM to formalize conceptual models of reality as computable logical models for database representation[14]:
  • Properties formalize as domains;
  • Entity groups formalize as relations;
  • Entities formalize as tuples;
  • 1OPs in context formalize as attributes;
  • 2OPs and 3OPs formalize as constraints;
  • Multigroups formalize as databases;
  • 4OPs formalize as constraints;
  • Queries formalize as RA operations.[15]
Note that objects have properties (of which 2OPs, 3OPs, and 4OPs are relationships[16]).

What a Data Model Is Not

If a data model is used to formalize conceptual models as logical models (i.e., is a general formal theory of data), then neither conceptual, nor logical models are data models. Yet the three types of model are constantly confused/conflated in the industry. 

Note: Data theory rather than model would have prevented the widespread confusion, but for the unfortunate hostile attitude in the industry to the term.

Conceptual, logical, and physical models are commonly referred to in the industry as data models, which reinforces/induces confusion, and inhibits understanding.Some examples:

“I understand that the database design process occurs in the following sequence:

Conceptual Data Model -> Logical Data Model -> Physical Data Model

Can someone please clarify how the relational model applies to this process?”
 “A data model explicitly determines the structure of data. Data models are specified in a data modeling notation, which is often graphical in form.”
“In conceptual data modeling we conceptualize the problem.
In logical data modeling we decide if a relational module will be suitable, efficient and effective to the problem's domain.
In physical data modeling we choose a specific RDBMS.”

This is what happens in Wikipedia. The first usage mentioned in Wikipedia, "An abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization", is that of a specific logical model corresponding to a specific conceptual model, neither of which is the data model used to formalize the later as the former.

  • Conceptual models are models of information about the real world, not of data, so "conceptual data model" does not make sense. 
  • While logical and physical models are models of specific data at distinct levels of abstraction, they are not general data models in the Codd/RDM sense (i.e., theories) either, and referring to them as "logical data models" and "physical data models" not only obscures this important distinction, but is also reflects and induces confusion of levels of representation (for the avoidance of which we recommend the three-fold terminology conceptual modeling, logical database design, and physical implementation[17,18]). 

We, thus, correct Wikipedia as follows:
“A formal data model satisfying the Codd definition is a general theory of data used to formalize information about objects with properties/relationships in conceptual models of any domain (for example, customers, products, and orders in manufacturing organizations) as structural, manipulative, and integrity elements of computable logical models for database representation. So, for example, the RDM may be used to formalize an E/RM conceptual model of some aspect of a banking business as a logical model required by applications.”


[1] Codd, E. F., Data Models in Database Management, Workshop on Data Abstraction, Databases and Conceptual Modelling 1980: 112-114.

[2] Pascal, F., Structure, Integrity, Manipulation How to Compare Data Models.

[3] Pascal, F., Understanding Conceptual vs. Data Modeling Part 1-4.

[4] Pascal, F., Conceptual Modeling Is Not Data Modeling.

[5] Pascal, F., Weak Entities, Referential Constraints and the Conceptual-Logical Conflation.

[6] Pascal, F., The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[7] Depends on the Dependencies: Normal Forms and the Conceptual-Logical Conflation.

[8] Pascal, F., Physical Independence Part 1: Don't Mix Model with Implementation.

[9] Pascal, F., Don't Conflate/Confuse Primary Keys, PK Constraints, and Indexes.

[10] Pascal, F., Moving in Circles: RDBMS-SQL Conflation & Logical-Physical Confusion.

[11] Pascal, F., On View Updating (C. J. Date and D. McGoveran Exchange).

[12] Pascal, F., Logical Validity and Semantic Correctness.

[13] Pascal, F., Simplicity: Forgotten, Misunderstood, Underrated Relational Objective.

[14] Pascal, F., Data Model Neither Conceptual, Nor Logical, Nor Physical Model.

[15] Pascal, F., Relationships and the RDM Parts 1,2,3.

[16] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming (draft chapters).

[17] Pascal, F., Levels of Representation Conceptual Modeling, Logical Design and Physical Implementation.

[18] Pascal, F., Levels of Representation Relationships, Rules, Relations and Constraints.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

No comments:

Post a Comment

View My Stats