Saturday, December 1, 2012

Data Warehouses and the Logical-Physical Confusion

(Erwin Smout is co-author of this post.)

Revised 8/26/18

In Implementation Data Modeling Styles Martijn Evers writes:
"Business Intelligence specialists are often on the lookout for better way to solve their data modeling issues. This is especially true for Data warehouse initiatives where performance, flexibility and temporalization are primary concerns. They often wonder which approach to use, should it be Anchor Modeling, Data Vault, Dimensional or still Normalized (or NoSQL solutions, which we will not cover here)? These are modeling techniques focus around implementation considerations for Information system development. They are usually packed with an approach to design certain classes of information systems (like Data warehouses) or are being used in very specific OLTP system design. The techniques focus around physical design issues like performance and data model management sometimes together with logical/conceptual design issues like standardization, temporalization and inheritance/subtyping."

"Implementation Data Modeling techniques (also called physical data modeling techniques) come in a variety of forms. Their connection is a desire to pose modeling directives on the implemented data model to overcome several limitations of current SQL DBMSes. While they also might address logical/conceptual considerations, they should not be treated like a conceptual or logical data model. Their concern is implementation. Albeit often abstracted from specific SQL DBMS platforms they nonetheless need to concern themselves with implementation considerations on the main SQL platforms like Oracle and Microsoft SQL Server. These techniques can be thought of as a set of transformations from a more conceptual model (usually envisaged as an ER diagram on a certain 'logical/conceptual' level but see this post for more info on "logical" data models)."


I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 


NEW: The Key to Relational Keys: A New Perspective

I deleted my Facebook account. You can follow me on Twitter:
@dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back.
@ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

Performance and flexibility considerations are common to all database management in general. There is nothing special about warehouses -- any database is a data warehouse. What the industry calls "data warehouses" (DWH) are just materialized views of an "operational database" that typically don't get updated "in full sync" with the "underlying" database. In other words, they are database snapshots at specific points in time.

A unique and crucial practical value of the RDM is that it allows formalization of informal business models to be represented in databases that is rooted in mathematics and logic. Thus, (1) given a database with a relationally designed logical structure (2) a "transformation process" is triggered (by documented circumstances) that applies relational algebra (RA) operations to the database to produce a desired warehouse view, which (3) is materialized by persisting it physically in storage. If either the logical structure of the database or the transformation producing the view violates relational principles, even a true RDBMS (which does not exist) cannot guarantee logical validity and semantic correctness[1].

Unfortunately, industry practices flout the formal theoretical RDM foundation:
  1. Warehouse developers often do not understand the precise interpretation (meaning) -- the exact predicates -- of the base relations. This results in warehouses that are based on various unwarranted or false assumptions about what the data means.
  2. The enumerated modeling methods do not allow to document the transformation itself -- the RA formulae that constitute the transformation. Instead, the focus is on the result, which is not guaranteed to be correct because it is usually unlikely that the underlying database is well-designed or consistent, or that the transformation is correct.
The materialization of the view is the only physical implementation aspect of data warehousing, and it should be obvious that this is not what the enumerated modeling techniques are about, as there is no generally accepted diagram or modeling language to document physical models. In other words, those techniques are not at the physical/implementation level. Hence, "implementation data modeling" reinforces, if not induces the logical-physical confusion (LPC)[2]. We prefer to reserve data modeling for logical database design and use physical implementation at the storage level[3]. 

"It would be good if we could analyze, compare and classify these techniques. This way we can assess their usefulness and understand their mechanics. Apart from the methodology/architecture discussion around information system design and modeling (e.g. for Data warehousing: Kimball, Inmon, Linstedt or a combination) there is not a lot of choice. If we ignore abstractions like Entity/Attribute/Value patterns, grouping (normalization/denormalization), key refactoring, versioning, timestamping and other specific transformation techniques we end up with nominally 3 styles of modeling that are being used right now: Anchor Style, Normalized and Dimensional. All three relate to an overall model transformation strategy (from a logical/conceptual data model), and all three can come in highly normalized varieties and some even in strongly "denormalized" (better is to talk about grouping like in FOM/NIAM/FCO-IM)  ones (and anything in between). This means that the definition of each style lies in a basic transformation strategy which then is further customized using additional transformations, especially grouping."

"Normalization can be seen from 2 sides. As an algorithm to decrease unwanted dependencies going from 1NF up to 6NF, or as a grouping strategy on a highly normalized design (6NF or ONF) grouping down to 5NF or lower. Besides being a model transformation process it is also used as a name for a family of related data models using ONLY this model transformation process. Standard normalized data can be in either 1NF to 6NF (which is the highest normal form). Normalization is the simplest and most fundamental family of transformation styles. It is currently mainly used for OLTP systems."

Apart from the LPC, I would not group normalization together with "key refactoring, versioning, timestamping and other transformation techniques", not even with denormalization. While normalization as a design method is a "transformation", it has the important distinction that it is formal theoretical, rather than arbitrary/ad-hoc like the others[4].

Note: We must stress again that normalization is a design repair procedure: it is necessary only if and when, due to poor modeling and/or logical mapping to the database -- what Martijn refers to as "unwanted dependencies". Relationally designed databases are implicitly fully normalized (in 5NF): given a set of well-defined business rules, a conceptual model will map to a 5NF database[5,6]. 

"What I call Anchor style modeling (also called key table modeling) is becoming more and more accepted, especially with Data warehousing. Data Vault and Anchor Modeling are both methods that rely in this family of techniques. The basic idea is to split (groups of) keys into their own entities (called hubs, anchors or key tables) and split of all other info in additional entities (non key table entities called leafs, satellites etc.). Another property of this family is that temporal registration is never done on key tables but only on the non key table entities. From there the different techniques and standards, as well as goals, diverge leading to several related techniques that are used to design Data warehouses as well as OLTP systems in different fashions."

"Apart from Normalization, implementation modeling styles have been mainly invented to overcome issues in database software, e.g. current mainstream DBMSes. They make (sometimes serious) compromises on aspects like normalization and integrity against performance and maintenance. Dimensional modeling overcomes some performance and some simple temporal limitations, and also covers some limitations of BI tools (mainly pattern based query generation), while at the same time isn't hampering usage and understanding too much. Anchor Style techniques help out with serious temporalization inside DBMSes while at the same time add extra flexibility to mutations of the data model schema."

The reality is that these techniques do not yield fully normalized warehouses. In fact, while denormalization may preserve 1NF, the other techniques are not guaranteed to preserve even that. This is usually dismissed as a concern for query-only warehouses. But since the operational database is not guaranteed to be relational, nor the transformation based on RA, logical validity and semantic correctness are not guaranteed. The risks may be smaller, but only when they are confined/exclusive to the developer responsible for the transformation and the warehouse load procedure, an environment that is more controlled than the shared operational database that is usually updated by any application by anyone in a larger team (assuming, of course that the developer possesses sufficient foundation knowledge, which is not always warranted).

Anchor and dimensional modeling are not "implementation modeling styles" (such do not exist). Warehouses are essentially application-specific database views -- logical transformations that bias databases for particular applications. The flaws of SQL and its implementations notwithstanding, the poor state of foundation knowledge in the industry makes it difficult to ascertain that renouncing logical soundness is necessary for satisfactory performance. If and when that is indeed the case, it is usually a trade-down -- we dispute the claim that they don't "hamper understanding too much". They do little more than introduce new graphical languages with new syntactical constructs not much better than E/R diagrams. We contend that introducing new graphical languages where none are needed is a very serious obstacle to understanding.  It imposes unnecessary learning curves on users, thus violating both the letter (correctness) and the spirit (simplicity and parsimony) of the RDM.

Because (1) there are no commonly agreed-upon formal techniques for physical structures and (2) physical design has always been and is becoming ever more complex, practitioners "look for lost keys not where they lost them, but where there is light": they try to overcome what are performance deficiencies due to physical DBMS and database implementations by abusing logical database design, thus trading correctness for performance.


[1] Pascal, F.,  Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness.

[2] Pascal, F., The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[3]  Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation.

[4] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[5] Pascal, F., Database Design: What It Is and Isn't.

[6] Pascal, F., Don't Design Databases Without Foundation Knowledge and Conceptual Models.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

1 comment:

  1. Data modeling issues are often faced by the various individuals/Organizations and there is a specialist to handle all problemetic things. But I must say the post above have the ability to make every engaged individual a pro and contribute to business in every manner.


View My Stats