Saturday, December 1, 2012

Data Warehouses and the Logical-Physical Confusion

(Erwin Smout is co-author of this post.)

In Implementation Data Modeling Styles Martijn Evers writes:
Business Intelligence specialists are often on the lookout for better ways to solve their data modeling issues. This is especially true for data warehouse initiatives, where performance, flexibility and temporalization are primary concerns. They often wonder which approach to use: should it be Anchor Modeling, Data Vault, Dimensional or still Normalized (or NoSQL solutions, which we will not cover here)? These modeling techniques focus on implementation considerations for information system development. They usually come packaged with an approach to designing certain classes of information systems (like data warehouses), or are used in very specific OLTP system design. The techniques focus on physical design issues like performance and data model management, sometimes together with logical/conceptual design issues like standardization, temporalization and inheritance/subtyping.

Implementation data modeling techniques (also called physical data modeling techniques) come in a variety of forms. What connects them is a desire to impose modeling directives on the implemented data model to overcome several limitations of current SQL DBMSes. While they may also address logical/conceptual considerations, they should not be treated like a conceptual or logical data model: their concern is implementation. Albeit often abstracted from specific SQL DBMS platforms, they nonetheless need to concern themselves with implementation considerations on the main SQL platforms like Oracle and Microsoft SQL Server. These techniques can be thought of as a set of transformations from a more conceptual model (usually envisaged as an ER diagram at a certain 'logical/conceptual' level, but see this post for more info on "logical" data models).


Performance and flexibility considerations are common to database management in general. There is nothing special about warehouses--any database is a data warehouse. What the industry calls "warehouses" are just materialized views of an "operational database" that typically are not updated "in full sync" with the "underlying" database. In other words, they are database snapshots at specific points in time.

A unique and crucial practical value of the relational model is that it allows informal business models to be formalized and represented in databases. Thus, relationally, (1) given a database with a certain formal logical structure, (2) a "transformation process" is triggered (by documented circumstances) that applies relational algebra operations to the database to produce the view desired for the warehouse, which (3) is materialized by persisting it physically (storage and access methods). If either the logical structure of the database or the transformation producing the view violates relational principles, consistency/soundness can no longer be guaranteed.
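
To make the three steps concrete, here is a minimal sketch. The table, column and view names are invented for illustration, and the materialized-view syntax is PostgreSQL/Oracle-style (SQL Server would use an indexed view instead):

    -- (1) Logical structure: base tables with declared keys and constraints
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        region      VARCHAR(30) NOT NULL
    );
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        sale_date   DATE NOT NULL,
        amount      DECIMAL(10,2) NOT NULL
    );

    -- (2) The transformation: a view defined by relational algebra
    --     operations (here a join and a projection/aggregation)
    CREATE VIEW sales_by_region AS
        SELECT c.region, s.sale_date, SUM(s.amount) AS total_amount
        FROM sale s JOIN customer c ON c.customer_id = s.customer_id
        GROUP BY c.region, s.sale_date;

    -- (3) Materialization: the only physical step; persist the view as a
    --     snapshot refreshed on demand, not in sync with the base tables
    CREATE MATERIALIZED VIEW sales_by_region_snapshot AS
        SELECT * FROM sales_by_region;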

Unfortunately, industry practices flout the formal foundation:
  1. Warehouse developers usually do not make sure they understand the precise interpretation (meaning)--the exact predicates--of the database tables. This results in warehouses that are based on various unwarranted or false assumptions about what the data means.
  2. The enumerated modeling methods make no provision for documenting the transformation itself--the relational algebra expression that constitutes it. Instead, the focus is on the result, which is not guaranteed to be logically sound, because it is usually unlikely that the underlying database is well-designed or consistent, or that the transformation is logically correct.
The materialization of the view is the only physical implementation aspect of data warehousing, and it should be obvious that this is not what the enumerated modeling techniques are about, as there is no generally accepted diagram or modeling language for documenting physical models. In other words, those techniques are not at the physical/implementation level. Hence "implementation data modeling" reinforces, if not induces, the logical-physical confusion. We prefer to reserve 'data modeling' for logical database design and 'physical database design' for the storage and access level.
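
By way of contrast, here is a rough sketch of what design decisions at the storage and access level actually look like, assuming PostgreSQL-style DDL and reusing the illustrative snapshot above:

    -- Physical design proper: storage and access methods only; nothing here
    -- changes what the data means or how it is logically defined
    CREATE INDEX idx_snapshot_region_date
        ON sales_by_region_snapshot (region, sale_date);

    -- The warehouse "load" is likewise purely physical: it re-materializes
    -- the unchanged logical transformation
    REFRESH MATERIALIZED VIEW sales_by_region_snapshot;
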
It would be good if we could analyze, compare and classify these techniques. This way we can assess their usefulness and understand their mechanics. Apart from the methodology/architecture discussion around information system design and modeling (e.g. for data warehousing: Kimball, Inmon, Linstedt or a combination), there is not a lot of choice. If we ignore abstractions like Entity/Attribute/Value patterns, grouping (normalization/denormalization), key refactoring, versioning, timestamping and other specific transformation techniques, we end up with nominally three styles of modeling in use right now: Anchor Style, Normalized and Dimensional. All three relate to an overall model transformation strategy (from a logical/conceptual data model), and all three can come in highly normalized varieties, some even in strongly "denormalized" ones (better to speak of grouping, as in FOM/NIAM/FCO-IM), and anything in between. This means that the definition of each style lies in a basic transformation strategy, which is then further customized using additional transformations, especially grouping.

Normalization can be seen from two sides: as an algorithm to decrease unwanted dependencies, going from 1NF up to 6NF, or as a grouping strategy on a highly normalized design (6NF or ONF), grouping down to 5NF or lower. Besides being a model transformation process, it is also used as the name for a family of related data models using ONLY this model transformation process. Standard normalized data can be anywhere from 1NF to 6NF (the highest normal form). Normalization is the simplest and most fundamental family of transformation styles. It is currently mainly used for OLTP systems.
Apart from the logical-physical confusion, we would not group normalization together with "key refactoring, versioning, timestamping and other transformation techniques", not even with denormalization. While normalization as a design method is a "transformation", it has the important distinction of being formal, rather than arbitrary/ad hoc like the others.

We must stress again that normalization is a design repair procedure: it is necessary only if and when poor business modeling and/or a poor logical mapping to the database introduces what Martijn refers to as "unwanted dependencies" (see The Costly Illusion: Normalization, Integrity and Performance). Otherwise, relational databases will be implicitly fully normalized: given a set of well-defined business rules, a good modeler will not have to bother with explicit normalization.
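
A minimal illustration of such a repair, with invented table and column names: the dependency from dept_id to dept_name is "unwanted" only because the initial design bundled two distinct predicates into one table.

    -- Poor logical mapping: two predicates ("employee E works in department D"
    -- and "department D is named N") bundled into one table, creating a
    -- transitive dependency from dept_id to dept_name
    CREATE TABLE employee_bad (
        emp_id    INTEGER PRIMARY KEY,
        dept_id   INTEGER NOT NULL,
        dept_name VARCHAR(50) NOT NULL
    );

    -- The repair: one table per predicate. Had the business rules been mapped
    -- correctly to begin with, this is the design a good modeler would have
    -- produced without invoking "normalization" at all.
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name VARCHAR(50) NOT NULL
    );
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        dept_id   INTEGER NOT NULL REFERENCES department (dept_id)
    );
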
What I call Anchor style modeling (also called key table modeling) is becoming more and more accepted, especially for data warehousing. Data Vault and Anchor Modeling are both methods that rely on this family of techniques. The basic idea is to split (groups of) keys into their own entities (called hubs, anchors or key tables) and to split off all other information into additional entities (non-key-table entities called leaves, satellites, etc.). Another property of this family is that temporal registration is never done on key tables, only on the non-key-table entities. From there the different techniques and standards, as well as goals, diverge, leading to several related techniques that are used to design data warehouses as well as OLTP systems in different fashions.
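
A rough sketch of the idea, with invented names (a Data Vault-style hub and satellite; details vary considerably by method):

    -- Key table ("hub"/"anchor"): only the business key, never temporalized
    CREATE TABLE hub_customer (
        customer_hk INTEGER PRIMARY KEY,          -- surrogate key
        customer_no VARCHAR(20) NOT NULL UNIQUE,  -- business key
        load_date   TIMESTAMP NOT NULL
    );

    -- Non-key table ("satellite"/"leaf"): all descriptive attributes,
    -- with the temporal registration carried here only
    CREATE TABLE sat_customer_details (
        customer_hk INTEGER   NOT NULL REFERENCES hub_customer (customer_hk),
        load_date   TIMESTAMP NOT NULL,
        name        VARCHAR(100),
        region      VARCHAR(30),
        PRIMARY KEY (customer_hk, load_date)
    );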

Apart from normalization, implementation modeling styles have mainly been invented to overcome issues in database software, e.g. current mainstream DBMSes. They make (sometimes serious) compromises, trading aspects like normalization and integrity against performance and maintenance. Dimensional modeling overcomes some performance and some simple temporal limitations, and also works around some limitations of BI tools (mainly pattern-based query generation), while not hampering usage and understanding too much. Anchor style techniques help out with serious temporalization inside DBMSes while at the same time adding extra flexibility for mutations of the data model schema.
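
For contrast, a bare-bones dimensional (star schema) sketch, again with invented names; the dimension deliberately repeats ("groups") attributes that a normalized design would keep in separate tables:

    -- Dimension: wide and deliberately denormalized for BI tool consumption
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,  -- surrogate key
        product_code  VARCHAR(20) NOT NULL,
        product_name  VARCHAR(100),
        category_name VARCHAR(50),          -- separate tables if normalized
        brand_name    VARCHAR(50)
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date DATE NOT NULL,
        year      INTEGER NOT NULL,
        month     INTEGER NOT NULL
    );

    -- Fact: additive measures keyed by the dimensions
    CREATE TABLE fact_sales (
        date_key    INTEGER NOT NULL REFERENCES dim_date (date_key),
        product_key INTEGER NOT NULL REFERENCES dim_product (product_key),
        quantity    INTEGER NOT NULL,
        amount      DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (date_key, product_key)
    );
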
The reality is that these techniques do not yield fully normalized warehouses. In fact, while denormalization may preserve 1NF, the other techniques are not guaranteed to preserve even that. This is usually dismissed on the grounds that warehouses are query-only. But since neither the operational database is guaranteed to be well designed or consistent, nor the transformation logically sound, warehouses are not guaranteed to be consistent, nor queries to yield logically correct, or easy to interpret, results (see Business Modeling for Database Design). The risks may be smaller, but only because they are confined to the developer responsible for the transformation and the warehouse load procedure, an environment more controlled than the shared operational database, which is usually updated by any application by anyone on a larger team (assuming, of course, that the developer possesses sufficient foundation knowledge, which is not always warranted).

Anchor modeling and dimensional modeling are not "implementation modeling styles" (which do not exist). Warehouses are essentially application-specific database views--logical transformations that bias databases toward particular applications. The flaws of SQL and its implementations notwithstanding, the poor state of foundation knowledge in the industry makes it difficult to ascertain that renouncing logical soundness is necessary for satisfactory performance. If and when that is indeed the case, it is a trade-down--and we dispute the claim that these techniques don't "hamper understanding too much". They do little more than introduce new graphical languages with new syntactical constructs not much better than E/R diagrams. We contend that introducing new graphical languages where none are needed is a very serious obstacle to understanding: it imposes unnecessary learning curves on users, thus violating both the letter--formality--and the spirit--parsimony and simplicity--achieved by the relational model (see Forward to the Past: Physical Data Independence).

Because (1) there are no commonly agreed-upon modeling techniques for documenting physical structures and (2) physical design has always been complex and is becoming ever more so, practitioners "look for lost keys not where they lost them, but where there is light": they abuse the simpler modeling techniques at the informal-business or formal-logical level and then claim, without proof, that they have documented "aspects of physical design".


