Sunday, August 27, 2017

Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness



Note: This is a 8/24/17 rewrite of a 5/20/13 post to bring it in line with McGoveran's formal exposition of Codd's RDM [1] and its correct interpretation.

08/25/17: I have added formal definitions of logical validity and semantic correctness. 
09/01/17: Minor revisions. 
09/02/17: Added references.
03/15/18: Minor revisions.


Here's what's wrong with last week's picture, namely:
"In my experience, using an object model in both the application layer and in the database layer results in an inefficient system. This are my personal design goals:
- Use a relational data model for storage
- Design the database tables using relational rules including 3rd normal form
- Tables should mirror logical objects, but any object may encompass multiple tables
- Application objects, whether you are using an OO language or a traditional language using structured programming techniques should parallel application needs which most closely correspond to individual SQL statements than to tables or "objects". --LinkedIn.com

--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
--------------------------------------------------------------------------------- 

The RDM put database management on a formal, scientific foot. Consequently, tool experience and relational jargon are insufficient -- foundation knowledge is necessary. Unfortunately, most data professionals do not possess it, in part because they have been misled by the industry and in part because few go through education -- as distinct from training -- a program that teaches the RDM and teaches it correctly. Consequently, even those with the heart in the right place who defend the RDM do it without sufficient understanding, their views distorted conventional wisdom[2] (stay tuned for a debunking of such a recent example).

 

Levels of Representation and Types of Model


There are four distinct types of model in database management:
  • Conceptual
  • Logical
  • Physical
  • Data
The first three types are models of some portion of reality of interest at different levels of abstraction (i.e., are enterprise-specific). A data model is a general formal theory of data. The RDM is grounded in simple set theory (SST) expressible in first order predicate logic (FOPL) and is used to formalize informal conceptual models of reality as "computable" formal logical models database representation.

Models of the four types are routinely confused and used interchangeably (e.g., the first three are referred to as data models, which they are not[3] and there is conceptual-logical conflation (CLC) and logical-physical confusion (LPC))[4]. "Use a relational data model for storage" is an example of both:

  • Use the [not "a"] RDM for logical database design, not "for storage";
  • Use a physical [not "data"] model for storage;

I have long advocated terminology that helps avoid such mistakes[5]:
  • Conceptual modeling;
  • Logical database design;
  • Physical implementation;

Relational Data Management, OO Application Development


A data model covers only data management DBMS functions -- data definition, data integrity enforcement, and data manipulation. In the case of the RDM, they are expressed in a FOPL-based relational data language. Application functions -- preparation and presentation of results -- are expressed in computationally complete  (e.g., programming) languages that host the data language.

There is no formal "object [data] model" -- object orientation (OO) is a set of informal programming guidelines which, if followed, yield programs with useful properties. It belongs, therefore, in the application programming languages: developers can follow the guidelines to program applications, but extending it to database management is unsound and loses the advantages of the RDM:

  • Simpler, declarative, decidable data sub-languages;
  • Physical (PI) and logical independence (LI);
  • System-guaranteed logical validity and, with proper database design, semantic correctness;
Thus, a relation is an element of a logical model that represents a real world object group, where an object is an element (entity, if you will) of the conceptual model. An OO object is an element of an application that, if it accesses a relational database, encapsulates one or more relations, as well as relational operations on them. 


Normalization, Logical Validity and Semantic Correctness


1NF is necessary for logical validity: a 1NF database relation preserves the properties of a mathematical relation, such that the set operations of the relational algebra (RA) can be applied for provably valid query and update results. 
A query result is said to be logically valid if and only if it is derived by any sequence of RA operations on one or more relations.

1NF is insufficient, however, for mutual consistency of base relations, which requires database design joint adherence to
  • The Principle of Orthogonal Design (POOD)
  • The Principle of Representational Parsimony (PORP)
  • The Principle of Expressive Completeness (POEC)
There is an (as yet unproven) McGoveran hypothesis that the three principles jointly imply the Principle of Full Normalization (POFN), but not vice-versa. Full normalization -- a property of individual relations -- requires knowledge of semantic relationships (dependencies) that the DBMS must enforce via constraints and is necessary for database consistency with the conceptual model of reality that the database represents, but is insufficient for semantic correctness -- a multi-relation property -- for which design adherence to the POOD, PORM and POEC is required. 
A query result is said to be semantically correct if an only if for every assignment of meaning to RA relation operands under which their tuples represent true facts, the tuples of the result relations also represent true facts[6].

5NF means a 1:1 relationship between real world object groups and relations: one relation per object group and one object group per relation. While an OO application object can and usually does encompass multiple relations, an object in conceptual models is represented by a tuple of one relation.

Note: An object group in a conceptual model is roughly akin to a “class” in OO programming.


Thus,
  • According to the common understanding of the RDM, database relations are in 1NF by definition, which ensures logical validity; 
  • According to a correct understanding of Codd's RDM, relations are in 5NF by definition, which, together with adherence to the three principles ensures both logical validity and semantic correctness;

Note: This is a huge topic and the basis of McGoveran's forthcoming proposed semantic improvements to RDM[1]. The POOD is just the tip of the iceberg -- it was never (1) intended to be considered alone and in isolation of the PORM and POEC (2) motivated strictly by view updating, which is just one aspect of semantic correctness -- avoiding loss of information by data manipulation.


OO Classes and Relations



"Application objects ... should parallel application needs which most closely correspond to individual SQL statements than to tables or "objects"" argues against considering relations in the logical model as the analog of the application's object classes. 
"I’ve preached this idea since the mid-1980s. Forcing each application object class to be represented as one or more tables is the biggest single mistake the ODBMS vendors and proponents made. On the other hand, the object-relational mapping (ORM) advocates had no discipline that guaranteed the information in the "object model" would map correctly to the relational logical model. They most often mapped it directly to the relational physical model, ignoring requirements of the logical model altogether."
                        --David McGoveran

Here's a correct reformulation of the recommendations:

  • Use the RDM to formalize conceptual models of reality to logical database models;
  • Design databases that
  • consist of relations;
  • are constrained to be consistent with the object groups they represent;
  • satisfy the POOD, PORM and POEC (and, thus, with POFN); application objects may encompass multiple relations;
  • R-tables should mirror (visualize) relations;
  • Application object classes should not be forced to be represented by one or more (stored) relations;
Given the above, can you figure out what's wrong with the following coments?
"When you design a relationally modeled database (better?) there are certain rules that you follow. Among these are rules describing at least six levels of normalization. I design to 3NF as a design goal, yes. Why not 4NF or BNF [sic] or others? Sometimes I will go further, however, as a general rule, 3NF is the minimal level required for efficient storage and processing. Sometimes higher normal forms add unnecessary overhead to a system without adding data security or other benefits. This results in application data structures that are efficient for the application and efficient for communication with the database. If the queries that are modeled are designed for efficiency in the database as well, then all layers are optimized."
"Some of the objects we model as database designers are physical, some are logical,and some are some combination of both. The models, I agree are all 'logical'."
"OO is both [a programming and data paradigm]. I do agree that it is a set of guidelines ... The problem I have with OO databases and data modeling and applications designed around the traditional OO model is that they all assume that you will always fetch and update entire objects which is incredibly inefficient for most of the operations that an application performs. Example: A natural object in a banking application is to model an account as an object. Such an object will naturally have hundreds of elements in it many with a one to many relationship to the account. Most applications will be involved in updating the account's balance which is only a single element. Getting the account information you need to process that update entails examining a few dozen elements at most. To fetch the entire account object into the application in this case multiplies the communications and IO loads by one or two orders of magnitude which I find completely unnecessary."
The upshot of that ill-fated and ill-conceived idea of an object store was to reverse this idea and design applications to pull the stored data objects into the application whole and push them to disk whole when the stored objects represented problem space objects beyond the scope of the application's needs of the moment."

References 

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER.

[3] Pascal, F., Data Model: Neither Conceptual, Nor Logical, Nor Physical.

[4] Pascal, F.,  The Conceptual-Logical Conflation and the Logical-Physical Confusion

[5] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation

[6] Pascal, F., The Principle of Orthogonal Database Design Part I






2 comments:

View My Stats