Sunday, July 13, 2014

Data Model: Neither Conceptual, Nor Logical, Nor Physical

Chris Date once titled an article Models, Models Everywhere, Nor Any Time to Think and I wrote about A Model to Call One's Own. We referred to the continuous proliferation of new "models" at the expense of mastering the fundamental and sound ones we have and the vagueness, confusion, misuse and abuse of the term 'model' (see, for example, Canonical data model implicitly managed through Enterprise Data Modeling). That is why, more than four decades after the concept of data model was introduced by Codd, we still have questions like:
Q: I understand that the database design process occurs in the following sequence:
Conceptual Data Model -> Logical Data Model -> Physical Data Model
Can someone please clarify how the relational model applies to this process?
Can you?

Here's one reply:
A: Conceptual Data Model: ... formed after a conceptualization process in the mind ... represents human intentions or semantics ... from observation of physical existence ... necessary means human employ to think and solve problems.
Logical Data Model: ... the abstract structure of a domain of information(conceptual model)... anticipates implementation on a specific computing system.

Physical data model(or database design): ... representation of a data design which takes into account the facilities and constraints of a given database management system.

So in conceptual data modeling we conceptualize the problem.
In logical data modeling we decide if a relational module will be suitable, efficient and effective to the problem's domain.
In physical data modeling we choose a specific RDBMS.
I note, in passing, that
  • "A conceptual model (CM) is a conceptualization" is a tautology;
  • It's the logical model (LM) that we usually refer to as database design; the physical (PM) is implementation, and the LM is independent of PM and of a specific DBMS (see below);
  • Whatever the relational "module" is, efficiency has nothing to do with relational--that's an implementation issue;
  • As to suitability/effectiveness, for what "problem's domain" involving facts and inferencing from them--therefore requiring a database--the relational model is not the best solution and why?
My colleague Erwin Smout correctly explained:
The term "data model" as originally coined has been hijacked by the industry to come to mean something completely different. Originally, it meant "a model telling us which structures (i.e. mathematical concepts !!!) we use to manage our data with (whatever those data are)". In the industry, it almost exclusively means "a model of which pieces of information are important for some business, and how those pieces of information relate to one another".

All of the three models (some more some less, but not that much) are models in the latter meaning. Which is why I personally always prefer to call them information models, not data models.

The relational model of data is a "genuine" data model in the former meaning of the term. It could be summarized in one sentence as "all information in the system is represented using only relations". Other data models (in the former sense) could, e.g., say "all information in the system is represented using only graphs consisting of nodes and edges".

So now to answer your question, the relational model "fits in" in that it offers one possible way to organize your logical information models. Starting from the same conceptual information model, you could also work out a graph-based logical information model, and it would look radically different from the one you'd get using the relational model.
Let let me see if I can clarify further (see Business Modeling for Database Design for a more detailed discussion).

CM: An informal representation
  • understood semantically by users
  • consisting of sets of facts about classes of entities with attributes defined by business rules.
LM: A formal external representation of a CM
  • understood algorithmically by a DBMS 
  • consisting of data structure(s) subject to integrity constraints.   
PM: An internal representation in specific hardware/software of a LM
  • consisting of storage/access/distribution methods.   

How do you get to the LM of a CM? You must formalize the informal--use some formal (abstract) data structure(s) to represent in the database the real world facts--data--about the CM entities. Whichever structure you choose determines the mathematical operations a DBMS can apply to it for purposes of data manipulation--querying--and constraint formulation.

A data model (DM) is to CM's what a programming language is to programs. It consists of
  1. Formal data structure(s)
  2. Operations for data manipulation 
  3. Integrity constraints.
While three DM's have been defined--hierarchic, network and relational--the former two where abstracted post-hoc from existing products and proved problematic in practice. The structure of relational data model (RDM) is the R-table, to which the relational algebra operations are applicable. It has a sound dual theoretical foundation in predicate logic and set theory [1] and proved superior to boot.

Note very carefully that
  • PM's, LM's and PM's are all enterprise-specific--they are models of some enterprise; a DM is not--it is an abstract model of data in general with which to map CM's to LM's;
  • There is a 1:1 correspondence between CM and LM constructs, but there does not have to be between LM and PM constructs--the DBMS internally maps back and forth between them [2].

[1] See Truly Relational: What it Really Means for the criteria  a data model must satisfy to prove superior to the relational model.

[2] Hence physical data independence (PDI), one of the relational advantages on which SQL DBMS's do very poorly. See Go Faster! The TransRelational Approach to DBMS Implementation for one example of true PDI.

No comments:

Post a Comment