Monday, May 20, 2013

Object Orientation, the Data Model and the "Tyranny of Theory"

Data professionals tend to dismiss as pedantry the insistence by relational proponents on rigor and precision in the use of terms and concepts. The relational model (RM) put database management on a formal footing, and that explains some of the hostility to it: an industry used to operating in an ad-hoc, trial-and-error manner, based solely on product training, resents what I refer to as the "tyranny of theory".

In a discussion about different types of DBMSs, AK states:
In my experience, using an object model in both the application layer and in the database layer results in an inefficient system. These are my personal design goals:
- Use a relational data model for storage.
- Design the database tables using relational rules, including 3rd normal form.
- Tables should mirror logical objects, but any object may encompass multiple tables.
- Application objects, whether you are using an OO language or a traditional language using structured programming techniques, should parallel application needs, which correspond more closely to individual SQL statements than to tables or "objects".
Now, the trained eye can discern that AK's heart is in the right place, but his language is not as precise as it should be, which can mislead the formally uninitiated.

First, I can't help but recall again the title of an article Date wrote for my old site, Models, Models Everywhere, Nor Any Time to Think. In fact, I published Business Modeling for Database Design precisely to dispel the confusion caused by their proliferation. What is--exactly, please--"the object model"? Is it a data model in the same sense in which the RM is? If it is, what are--exactly, please--its formal structural, manipulative and integrity components? And if it is not, how can it be "used in the database layer"?

Second, the (not "a") relational data model is logical, storage is physical; hence physical data independence (PDI), a crucial relational benefit. You don't "use the data model for storage" (see below), but for logical database design: the mapping of an informal conceptual model, consisting of business rules that only users understand semantically, to a formal logical model for database representation that a DBMS "understands" algorithmically--what I call formalizing the informal (see the above-mentioned paper).
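
To make "formalizing the informal" concrete, here is a minimal sketch (Python with SQLite; the tables, columns and business rule are illustrative assumptions, not from AK's system): an informal rule that users understand semantically is mapped to a formal logical design whose constraints the DBMS enforces algorithmically.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

    # Informal business rule (users understand it): "Every employee
    # works in exactly one existing department."
    conn.execute("""
        CREATE TABLE department (
            dept_id   INTEGER PRIMARY KEY,
            dept_name TEXT NOT NULL UNIQUE
        )""")
    conn.execute("""
        CREATE TABLE employee (
            emp_id  INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            dept_id INTEGER NOT NULL REFERENCES department(dept_id)
        )""")

    # The DBMS now enforces the rule algorithmically: this insert fails
    # because department 99 does not exist.
    try:
        conn.execute("INSERT INTO employee VALUES (1, 'Smith', 99)")
    except sqlite3.IntegrityError as e:
        print("Rejected by the logical design:", e)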

Third, the RM is a formal data theory and OO is a set of informal programming guidelines which, if adhered to, have some practical advantages for application development (hence an old paper I wrote, Orientation Is Not the Same as Theory). An 'object' in the OO sense is an application element that encapsulates data (not just multiple tables, not even just R-tables!), as well as manipulation, integrity constraints and even non-data components. R-tables don't "mirror" logical objects; they are logical data objects.
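
The contrast can be sketched in code (a hypothetical Python class, purely for illustration): an OO 'object' bundles data, manipulation and constraint enforcement into one application element, whereas an R-table is purely a logical data object, with manipulation and integrity left to the DBMS.

    class Account:                       # an application element, not a data model
        def __init__(self, balance: float) -> None:
            self.balance = balance       # data ...
            self._log: list[str] = []    # ... and non-data components

        def withdraw(self, amount: float) -> None:   # manipulation ...
            if amount > self.balance:                # ... and an integrity constraint,
                raise ValueError("overdraft")        # all encapsulated together
            self.balance -= amount
            self._log.append(f"withdrew {amount}")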

I don't follow AK's last paragraph (can you?), but this is often the result of extending informal, fuzzy OO language to database management.
When you design a relationally modeled database (better?) there are certain rules that you follow. Among these are rules describing at least six levels of normalization. I design to 3NF as a design goal, yes. Why not 4NF or BNF [sic] or others? Sometimes I will go further; however, as a general rule, 3NF is the minimal level required for efficient storage and processing. Sometimes higher normal forms add unnecessary overhead to a system without adding data security or other benefits. This results in application data structures that are efficient for the application and efficient for communication with the database. If the queries that are modeled are designed for efficiency in the database as well, then all layers are optimized.
If, given a well-defined conceptual model, you map every class of entities to exactly one R-table, the table will implicitly be not just in 1NF--the minimal relational requirement--but in 5NF too. Explicit normalization is necessary only to repair already designed tables that "bundle" more than one class of entities. Such tables exhibit certain drawbacks that can be eliminated by further normalization up to 5NF (see The Costly Illusion: Normalization, Integrity and Performance).
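
Here is a hedged sketch of such a repair (Python with SQLite; names and columns are illustrative assumptions): a table that bundles two classes of entities--customers and orders--is decomposed so that each class maps to exactly one table.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Bundled design: customer facts repeat on every order row, so a
    # change of city must be applied to many rows (update anomaly).
    conn.execute("""
        CREATE TABLE customer_order (
            order_id      INTEGER PRIMARY KEY,
            customer_id   INTEGER NOT NULL,
            customer_city TEXT NOT NULL,   -- fact about the customer, not the order
            amount        REAL NOT NULL
        )""")

    # Repaired design: one table per class of entities; each customer
    # fact is recorded exactly once.
    conn.execute("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            city        TEXT NOT NULL
        )""")
    conn.execute("""
        CREATE TABLE "order" (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            amount      REAL NOT NULL
        )""")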

Most 3NF R-tables are also in 5NF. But if an R-table is in 3NF and not in 5NF, the drawbacks are no different than if it were not in 3NF. The claim that "higher normal forms add unnecessary overhead to a system without adding data security or other benefits" can only be made if one is unaware of those drawbacks.
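
One such drawback can be demonstrated with the classic supplier-part-project example (plain Python; the data and the cyclical business rule are illustrative assumptions). The relation is all-key, hence in 3NF, yet under the rule it is the join of its three binary projections--a join dependency--so storing it unprojected repeats facts that three binary (5NF) tables would record once.

    # Illustrative data: supplier s supplies part p to project j.
    spj = {("s1", "p1", "j1"), ("s1", "p2", "j1"), ("s2", "p1", "j1")}

    # Its three binary projections:
    sp = {(s, p) for (s, p, j) in spj}   # s supplies p
    pj = {(p, j) for (s, p, j) in spj}   # p is used in j
    sj = {(s, j) for (s, p, j) in spj}   # s supplies j

    # Under the cyclical rule ("if s supplies p, p is used in j, and s
    # supplies j, then s supplies p to j"), rejoining the projections
    # reconstructs the relation exactly:
    rejoined = {(s, p, j)
                for (s, p) in sp
                for (p2, j) in pj if p2 == p
                for (s2, j2) in sj if s2 == s and j2 == j}
    print(rejoined == spj)  # True: the join dependency holds, so the
                            # three binary tables are the 5NF design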
Logical design has nothing to do with efficiency (or security, for that matter).

If you (1) exhaust all the numerous implementation factors affecting performance, (2) denormalize the database, and (3) performance improves, the gains, if any, come not from the denormalization per se, but from trading data integrity and other benefits for it (see The Costly Illusion). Confusing solutions with the problem is not only costly, but also guarantees that the real solution will not materialize. Database practice and DBMS software would be significantly better if practitioners recognized the real source of problems and refused to tolerate such tradeoffs.
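
A hedged illustration of the tradeoff (Python with SQLite; names and data are illustrative assumptions): once a customer fact is denormalized into the orders table, the DBMS can no longer guarantee its consistency; every application must now police it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE order_denorm (
            order_id      INTEGER PRIMARY KEY,
            customer_id   INTEGER NOT NULL,
            customer_city TEXT NOT NULL   -- customer fact repeated per order
        )""")
    conn.executemany("INSERT INTO order_denorm VALUES (?, ?, ?)",
                     [(1, 7, "Oslo"), (2, 7, "Oslo")])

    # An application "moves" the customer but updates only one row;
    # nothing in the logical design stops it:
    conn.execute("UPDATE order_denorm SET customer_city = 'Bergen' WHERE order_id = 2")

    # The database now asserts two contradictory facts about customer 7:
    print(conn.execute("SELECT DISTINCT customer_city FROM order_denorm "
                       "WHERE customer_id = 7").fetchall())
    # -> two rows, e.g. [('Oslo',), ('Bergen',)]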
Some of the objects we model as database designers are physical, some are logical, and some are some combination of both. The models, I agree, are all 'logical'.
Since objects are application, not database elements, if they are physical it is a violation of PDI: the DBMS does not insulate applications from implementation details (storage and access methods).

It is unclear what AK means by "I agree all models are logical". If he means "all logical models are logical", it's a tautology. Otherwise, not all models are logical: conceptual and physical models are not (by a physical model I mean the internal representation of a logical model in hardware).

To my contention that OO is a programming, not a data, paradigm, AK responded that
... it is both. I do agree that it is a set of guidelines ... The problem I have with OO databases and data modeling and applications designed around the traditional OO model is that they all assume that you will always fetch and update entire objects, which is incredibly inefficient for most of the operations that an application performs. Example: a natural object in a banking application is to model an account as an object. Such an object will naturally have hundreds of elements in it, many with a one-to-many relationship to the account. Most applications will be involved in updating the account's balance, which is only a single element. Getting the account information you need to process that update entails examining a few dozen elements at most. To fetch the entire account object into the application in this case multiplies the communications and IO loads by one or two orders of magnitude, which I find completely unnecessary.

The paradigm in database management is the data model. Since AK agrees that OO is a set of programming "guidelines", how can it be a data paradigm too? In fact, the "problems with OO databases" AK describes are due precisely to the fact that it is not. They are not "designed around a traditional OO model", because none has ever been formally defined. Whatever OO proponents mean by "the object model", it is closer to the older hierarchic/network models and, like them, closer to storage than a data model should be.
The upshot of that ill-fated and ill-conceived idea of an object store was to reverse this idea and design applications to pull the stored data objects into the application whole and push them to disk whole when the stored objects represented problem space objects beyond the scope of the application's needs of the moment.
See what I mean? Contamination of conceptual and logical design with efficiency considerations, in violation of PDI, is a very bad old idea, a mistake that NoSQL products like MongoDB actually repeat and which the relational model avoids.
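
For contrast, AK's banking example can be sketched as follows (Python with SQLite; the schema is an illustrative assumption): the relational operation touches only the attribute it needs, while the whole-object style drags the entire account across the wire in both directions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE account (
            account_id INTEGER PRIMARY KEY,
            balance    REAL NOT NULL
            -- in AK's scenario, hundreds of further elements and
            -- one-to-many detail tables would hang off the account
        )""")
    conn.execute("INSERT INTO account VALUES (42, 100.0)")

    # Relational style: a set-level update of exactly the one attribute needed.
    conn.execute("UPDATE account SET balance = balance + 25.0 WHERE account_id = 42")

    # Whole-object style (what AK objects to): fetch everything, mutate
    # in application memory, then write everything back.
    row = conn.execute("SELECT * FROM account WHERE account_id = 42").fetchone()
    # ... reconstruct the full object graph, change one field, push it all back ...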

Prior to the RM there was no concept of a formal data model--database management was practiced in an ad-hoc manner. The hierarchic and network DBMSs that preceded the RM were abstractions from those existing practices (an effort that, predictably, failed). To be a data professional, you trained on tools. The RM correctly reversed this process: DBMSs are implemented to a formal specification, not the other way around. Consequently, some formal education must precede tool training if you are to know and understand the pros and cons of the tools, how to exploit the former and work around the latter--"the tyranny of theory".


