Monday, February 10, 2014

The Costly "Denormalization for Performance" Illusion


Certain common misconceptions are excellent indicators of poor understanding, if any, of  the relational model (RM). One of the most common is what I call "the costly illusion": "denormalization for performance".

I will not get into the first four of the 5 Claims About SQL, Explained. I do not disagree with the facts, except to point out that such problems are not due the relational nature of SQL and its implementations, quite the opposite: it is their weak relational fidelity and poor language design that are problematic.

I will focus on the fifth, "Should everything be normalized?"

I do not know about "everything", but yes, all database tables should be normalized (in 1NF). A table that is not in 1NF is not a R-table and all bets are off: a RDBMS can no longer guarantee that query results are provably logically correct with respect to the real world. But by "normalized" Bolenok means, of course, full normalization (5NF)--do not confuse the two. In other words, "should everything be fully normalized?".

To answer the question, one must know what are both the costs and benefits of 5NF relative to 1NF and that require a good grasp of the RM.
Database normalization, to put it simply, is designing a database foolproof to incorrect input of data ... Basically, if you notice that a change you make to a database requires more than one operation, your database is not normalized. Say, a person is promoted and you need to update their position in more than one table. Or you want to insert a record for a new accounting year and find that you need to update the closing date of the previous one as well. Or before updating a field you have to parse its previous contents. If it's possible to break your database consistency by "forgetting" to do something, the database is not normalized (in some sense).
Everything should be as simple as possible, but not simpler. This is a good example of an oversimplification that is vague and, therefore, not very enlightening. It would be more informative to say that a fully normalized (5NF) design
  1. Maximizes consistency with minimal integrity burden on users;
  2. Makes desirable data operations that would not be possible with undernormalized tables (>5NF), possible;
  3. Makes the database easier to understand and query results easier to interpret.
(details in my paper The Costly Illusion: Normalization, Integrity and Performance).
The drawback of normalization is, of course, performance. Denormalization allows you to store copies of data in the places where they are more readily available to the queries. Instead of joining two tables in a business-critical query you run often, you might want to consider just storing the joined records together in a third table. This comes at a price of course: you should pay attention that the data are consistent between the three tables, however, this might be worth it, provided that benefits from making the query faster would justify additional development costs.
As my paper shows, the level of normalization is a strictly logical property of database design. Without knowledge of implementation details--physical storage and access methods, hardware, network, concurrent access and so on--the number of logical tables says nothing about performance. So why is "denormalization for performance" so popular in the industry?

Aside from permitting tables that are not in 1NF (see my paper Business Modeling for Database Design), SQL DBMS's usually have mirror-image implementations: each logical table is stored in its own physical file--and that is why, all else being equal, undernormalized join tables sometimes (but hardly always!) yield performance gains relative to their fully normalized equivalents. Note very carefully, though, that it is this type of implementation, not the number of tables that is responsible!

Practitioners often complain that the RM is "not practical" because it "does not "deal with" implementation", but that is actually a major advantage: physical data independence. Being purely logical, RM leaves DBMS designers complete freedom of implementation, as long as it is not exposed to users and applications. They can do whatever they deem appropriate at the physical level, including changes, to maximize performance, without invalidating existing applications and queries. The freedom includes "storing the joined records together" in the same file. Some SQL DBMS's do, but those that do not have only themselves, not normalization to blame.

Suppose now that you have two 5NF SQL base tables and you exhausted all the implementation options available to maximize performance (rather unrealistic) and it is still not satisfactory. Would you then follow Bolenok's suggestion of a third join base table? Why/why not?

Denormalization, of course, does not mean three tables, but just the <5NF join base table. Bolenok recognizes only one of its three drawbacks, redundancy: you must now ensure that updates are guaranteed to apply changes to all the redundant data values, otherwise the database will be rendered inconsistent. Bolenok deems this "development burden" sometimes justified by performance gains. Let's assume that you opt for the join base table: how would you guarantee consistency?

There are, of course, only two ways to do that: either you inform the DBMS of the redundancy is and have it control it, or have each and every application that updates the table do it. The latter option is, of course, a violation of the relational integrity independence principle and it should be obvious why it is not cost-effective. You are left with the former: the addition of integrity constraints that control the redundancy which, if enforced by the DBMS, ensure that any and all updates will not corrupt the database.

It is here that I encounter evidence of poor understanding of the RM. Most practitioners are either unaware of the need of such constrains altogether and would not know how to formulate them; or think, like Bolenok, that if they are added, denormalization for performance is justifiable.

If you don't know why the latter is a costly illusion, may I recommend my papers?



Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment