Sunday, January 17, 2016

Conceptual (Business) Modeling, Logical Database Design and Physical Implementation

A serious problem in the database field is not just that many data professionals do not know and understand the RDM, but also that they believe they do and criticize it. It is relatively easy to detect such critics. In "Recognizing and Treating Tableitis" Gordon Everest tries to be humorous, but it ends up more sad than funny.

Table (n.) – a collection of information (data?) describing a population of entities which possess some common characteristics, called attributes.
-itis – “suffix denoting diseases characterized by inflammation, itself often caused by an infection.”  ---------- from the Wikipedia Wiktionary.

Tables are the building block of relational databases. Tables must generally be “normalized,” at least to 1NF. That may be an appropriate way to think of databases when implemented in a modern day DBMS. However, it is not the way the world thinks logically. People have no problem with commonly occurring phenomena such as:

  • A multi-valued attribute, e.g., an Employee possesses multiple Skills.
  • Many-to-many (M:N) relationships, e.g., as between Employees and Projects
  • A relationship with attributes
even though our systems may. None of these situations can be handled directly in a relational database.
The building blocks of relational databases are, of course, relations--database abstract representations of sets of facts about classes of property-sharing entities. Table are only a way to present relations visually, so when we say R-tables, we mean relations presented as a special kind of tables.

Relations are defined by the relationships among their components--tuples and attributes--presented as rows and columns to users. It's those relationships that are important for database management, so if "the world thought logically", it would focus on them.

"Logical refers to the relationships among the components of the relation, not to any arrangement of the components of a relation. Any presentation that preserves those relationships and adds no extra ones is acceptable. A R-table is one possible such presentation. The problem is that people fixate on this one presentation, identifying it with relation. They then go even further and force the physical implementation of a relation to be table-like". --David McGoveran
Thing is, they are not in the tables! Can GE (as well as the reader) think logically and specify the relationships underlying relations, why they are important and where they are in a database, if not in tables?

Incidentally, there is some conceptual-logical conflation (CLC) here: entities are elements of the real world that share properties (at the conceptual level), attributes are components of database relations (at the logical level).

Whenever I come across relational systems "can't handle something", it's a dead giveaway for failure to grasp the RDM. To GE's specific somethings:

  • Skills can be viewed either as multiple properties, or as one multi-component property--these are two purely subjective perceptions of reality. The former can be represented by a 1NF relation, the latter by a relation with an attribute defined on a relation-valued domain (RVD)[1].
  • A M:N relationship between two classes of (facts about) entities is modeled as a class of associative entities and represented in a relational database by an associating relation. Each tuple of such a relation represents (a fact about) an associative entity that has properties of its own. Et. voila: relationships with attributes.
Critics of 1NF such as GE are concerned with the joins induced by normalized tables. But has GE considered the complexifying effects of RVD's on guaranteed logical access GLA (one of Codd's 12 rules), or on constraint and query expressions? Databases with non-1NF relations must contend with two types of relations--nesting and nested--and this requires additional "handling". At least joins are relational operators expressible declaratively in first order predicate logic (FOPL). An un-nesting operation, however, requires higher ordered logic than FOPL and, therefore, a computationally complete procedural language. And all this complexity for what advantage, exactly? Do the the extremely rare (if not unimportant) special circumstances in which RVD's are useful justify complicating database life for everybody? The answer was given to me by Toon Koppelaars: "Oracle supports nested tables, but I hardly ever see them used, however."
Tableitis (tay-bul-eye’-tus) – a mental condition found in professional data modelers, that is, thinking about tables too early in the process of database design. Some people popularly refer to this condition as “Table Think,” a phrase coined by Dr. Everest in the early 21st century. It is manifest when the data modeler begins to build a data model by thinking about or drawing icons to represent entities. Some call this a high-level conceptual model. Even though their model only contains major entities, we all know that there are attributes lurking in the back of their mind. Thus, the next step in the modeling process is to add attributes to the model. This can be particularly troublesome when the domain being modeled is complex. Sometimes we don’t put the attributes in the right place. The conventional test is to apply the rules of normalization. Correcting a violation of the rules involves decomposition of a data model which can further degrade the performance of a database system.
"Table Think" ought to be criticized as "non-relational think". Those who know and understand the RDM don't think tables, but about the relationships among properties and entities that define entity classes and relationhips between classes in the real world, represented in relational databases by integrity constraints. Anybody who criticizes the RDM for tables, normalized or not, suffers from tableitis too.

As to normalization, GE clearly suffers from LPCitis too (logical-physical confusion). Normalization and denormalization change the number of logical relations, which cannot possibly affect performance. To the extent that performance is affected, it is due to the implementation of SQL DBMS's.

In a LinkedIn exchange David Hay comments:

The problem with "logical" modeling, as I've witnessed it, is that it is too biased towards relational database design. It is removed, to be sure from the tablespaces of the database design language, but as long as it portrays "foreign keys" and 'Primary keys", it cannot be used for object oriented design, XML design, or anything else non-relational.

My preference is for what I've come to call "essential" data modeling. This is a representation of the underlying "essence" of a domain, such as an organization. It is in terms not of data technology objects, but rather in terms of the "things of significance" to the domain. Thus, no "interfaces", "windows", or the like. Just Person, Activity, Contract, etc. Also, the approach I learned from Richard Barker was that relationship names are very important. They portray the facts that link these "things" together. In each direction you have an assertion, such as "each ORDER must be *composed of* one or more LINE ITEMs."

Keys are a logical feature. The problem of FK's and PK's at the conceptual level is not relational bias, but CLC. Can DH provide a formal justification for OO, XML, or other non-relational designs, other than their being used in the industry?

Years ago Chris Date wrote Models, Models Everywhere, Nor Any Time to Think and I wrote A Model to Call One's Own where we deplored the proliferation of model types due, essentially, to ignorance of data and relational fundamentals. It may come as a shock to DH, but what he calls "essential modeling" is exactly what drives relational database design [2];
  • It is intentionally and completely  devoid of technology;
  • The "things of significance"--Person, Activity, Contract--are types of entities;
  • The relationships "that links these things together" are those that define relations and relationships among relations;
  • Assertions such as "each order must be composed of of one or more line items" is an informal version of a formal predicate in logic. First-order predicate logic (FOPL) is one half of the theoretical foundation of the RDM, the other half being set theory, that guarantees query results to be correct in relational databases.
Indeed, that's what makes such modeling "essential". Like so many others, DH simply does not know that and thinks he invented something new and special.


[1] Codd introduced relations in normal form (which became 1NF) because he was initially concerned that "nesting" relations would require second order logic (SOL) and complicate the RDM. It later turned out that that is not the case if relations are finite. So RVD's are currently compatible with RDM.

[2]  See Formalizing the Informal: Business Modeling for Database Design, The Real Data Science Series, paper #1 (revision forthcoming).

No comments:

Post a Comment

View My Stats