Tuesday, June 25, 2013

Data Model: Neither Business, Nor Logical, Nor Physical Model

Note: For a more in-depth discussion see Business Modeling for Database Design.

Chris Date once wrote a paper titled Models, Models Everywhere, Nor Any Time to Think, deploring the confused and distorted way in which fundamental concepts and terminology in general and relational ones in particular, are used in the industry. But no matter how many times a misconception is debunked, the abuse continues and will do so given educational failure and disregard for precision. Data model is a case in point (see Unmuddling Modeling, Parts 1,2) and What Is a Data Model?)

I've read a blog post about what really is a data model, as used in the term "relational data model" (RM). It made the following points:
1. The implementation of a data model is a programming language.
2. The RM is not necessary ... for developing software solutions, maintaining large shared databases, or any other purpose in the world of software development. Any software solutions that can be developed while employing the RM could be written without it, using other data models.
I'm not sure if data languages (e.g., SQL and Tutorial D) qualify as programming languages. But maybe in a broader sense, we can say that data languages are also programming languages in the sense that we use them  to "program" (i.e., declare and manipulate) our data. So if only the relational data model had been implemented correctly, then the industry would have produced better data languages (i.e., the D languages). Am I right?
1 is an inference from Date's definition of a data model:
A data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data. The operators allow us to model its behavior.
I would emphasize abstract: a data model is a formal abstraction that, per-se, has no meaning. It can be applied to database management only if it can be given useful interpretation--some relationship to the real world.

Database management consists of operations on some data structure for information retrieval and integrity enforcement. The specific data structure, together with the operations applicable to and the integrity constraints enforcible on it form a data model. Every DBMS is based on one!

The DBMS must offer some means for users to define instances of the structure and express operations and constraints to which those instances are amenable. In the case of the relational model
  • structure: R-table
  • operations: relational algebra (restrict, project, join and so on)
  • constraints: domain, column, identity (PK), referential (FK) and arbitrary
The means can take various forms: a computationally complete programming language e.g. Alphora's D4, or Date & Darwen's Tutorial D; a data sublanguage embeddable in a programming language e.g. SQL or QUEL, an end-user tool e.g, QBE/QBF (query-by-example/form), or any combination thereof. Whatever the form taken, each must support the three components of some data model.

Note: Codd preferred an embeddable data sublanguage for two reasons. First, to avoid the complications of convincing the programming language standard committees to extend them with relational capabilities. And second, to base it on First Order Predicate Logic—which is "declarative"/"non-procedural"--while existing programming languages were not. He objected vehemently to SQL and designed his own sublanguage Alpha, which was relationally complete. But IBM implemented SQL and the rest is (a rather sad) history.

Strictly speaking, argument 2 is correct in the sense that a DBMS can rely on different data structures and operations. But, given an informational objective, the choice of a model must be grounded in some sound criteria to assess the superiority of one alternative over the others. I offer such criteria in the above mentioned paper (see also Hadoop vs SQL: How to Compare Data Models). As far as I know they are rarely employed in practice, with costly consequences.

There is, however, a sense in which a data model can metaphorically be compared to a programming language: it is to logical models what a programming language is to specific programs written in it: it is used to design enterprise-specific formal logical models that represent enterprise-specific informal business models in databases. Note very carefully that a data model
  • is neither a business, nor a logical model;
  • is not enterprise-specific;
  • it is formal (it formalizes informal business models).
Prior to Codd's definition of a data model data management relied implicitly on elements of hierarchic and network data models. While they do have a formal foundation--graph theory--commercial DBMS's were not fully grounded in it, but abstracted ad-hoc from existing user practice. Subsequent efforts to postfit the theory to practice failed (predictably), in large part due to complexity and rigidity. Codd invented the relational model for the precise purpose of overcoming those deficiencies.

The current attempts to return to hierarchic/network-based products (Switching from relational to the graph model, Towards a Standard Graph-Based Data Model), or the introduction of so-called "schema-less databases" (Schema, NoSQL and the Relational Model, Part 1-3), as if database management does not require a well-defined, formally sound data model are, in part, due to confusion over business, logical and physical models-- what I refer to as the logical-physical confusion (LPC) and the conceptual-logical conflation (CLC)--on the one hand, and over them and the data model on the other.

No comments:

Post a Comment

View My Stats