Tuesday, November 20, 2012

Forward to the Past: Physical Data Independence

In the third and last installment of Schema, NoSQL and the Relational Model I pointed out, yet again, the persistence of the logical-physical confusion permeating the IT industry. Here are some factors that induce/reinforce it, and some of its practical consequences.

Consider, first, the following question by a reader of my old site (note the date).
GA: I've been reading the articles posted by Messrs. Pascal and Date with great interest and relish. I am one of those "whose heart is in the right place", but whose grasp is still in development. I have a single, if not simplistic, question:
You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)? --dbedunk.com 01/03/03

Second, here's an extract from my reply to a critic of one of my articles on this subject:
A common criticism of the relational model is that "it is not practical because it does not address" physical implementation, while in reality this is an intentional, huge practical advantage.

Codd's Rule 8, Physical Data Independence, states "Interactive applications and application programs should not have to be modified whenever [physical] changes in internal storage structures and access methods are made to the database". Note that how products should achieve this is not specified, and no specific physical implementations are either imposed or prohibited. The relational model is nothing but logic applied to facts [to infer derived facts] and logic has nothing to say about physical implementation. Moreover, as I explained in my article, this is a major relational advantage, because it leaves implementers free to do whatever they darn please at the physical level to maximize performance, provided nothing is exposed to users in applications. Any performance problem encountered in practice cannot possibly be due to the relational model, or to the relational fidelity of [SQL] DBMSs (which is pretty low anyway), but to physical factors by definition.
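To make Rule 8 concrete, here is a minimal T-SQL sketch (the table, columns and index are hypothetical, purely for illustration). The query is written against the logical table only; the index can be created or dropped later as a purely physical tuning decision, and the query text never changes--only the access path the optimizer chooses does.

-- Logical level: the application knows only the table and its columns.
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = 42;

-- Physical level: storage/access tuning, decided and changed independently.
-- Creating (or later dropping) this index requires no change to any query.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID) INCLUDE (OrderDate);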

The number and complexity of those factors--a myriad of hardware and software configurations, settings, options and loads--has become overwhelming over time: hardware, storage and access methods, OS, applications, DBMS version and optimizer, database growth and physical design, concurrent access, application and network loads and so on; and this does not include the various configurations and settings for each of these factors. That despite all these, practice so often focuses on logical design and/or integrity constraints as performance inhibitors is testament to the sorry state of foundation knowledge (see The Costly Illusion: Normalization, Integrity and Performance). This brings to mind the famous anecdote about looking for keys where there is light, rather than where they were lost: it is easier to merge tables or renounce integrity constraints than to figure out the fine-tuning of complex physical details. Which demonstrates Codd's genius in realizing the importance of detaching the logical level from the physical level and simplifying it. Unfortunately, the industry has not absorbed its practical value.
Third, in A SQL Azure tip a day - Create a table, Håkan Winther writes:
When you design your database tables there are some important things to think of. Some of them are important in a normal database also, but they are much more important in SQL Azure, like:

- Normalize to remove redundant data
- Use the smallest datatype possible
- Create as few indexes as possible, but not too few
- Avoid redundant indexes
- Every table must have a clustered index

... because you have limited space for every database, your connections may be terminated due to heavy use of resources, and you pay for what you use and for the data that you transfer. You can use the SQL Azure management portal to design your tables or write the T-SQL statement yourself. The syntax to create a table in SQL Azure is the same as in SQL Server 2005/2008, but not all options are supported in SQL Azure.
Now, "designing tables" is, to my mind, what logical database design is all about. If the business modeling and R-table design process is properly done, the database will be implicitly fully normalized (see Business Modeling for Database Design). Explicit further normalization beyond 1NF (2NF-5/6NF) is a procedure necessary only when databases are poorly designed and need "repair".

But logical design should not be contaminated by physical considerations. Only after the proper logical design is completed should the physical implementation (storage and access methods) optimal for performance be determined.
Below you'll find a sample of how to create a table named Table1

CREATE TABLE [dbo].[Table1]
 ( [ID] [int] IDENTITY(1,1) NOT NULL,
   [Column1] [nvarchar](50) NOT NULL,
   [Column2] [nvarchar](15) NOT NULL,
   [Column3] [tinyint] NULL,
   CONSTRAINT [PK_Table1] PRIMARY KEY CLUSTERED ([ID] ASC) );

Wait a minute! Some important options of the clustered index can't be controlled, like FILLFACTOR, ALLOW_xxx_LOCKS and DATA_COMPRESSION. What does that mean? Well, if you execute the following code, you can see that FILLFACTOR is 0 and the ALLOW_xxx_LOCKS options are 1:

SELECT name, fill_factor, allow_row_locks, allow_page_locks
FROM sys.indexes

That means that every page of the index will be filled to 100%, leaving no room for new records. Well, that's fine for indexes that are always increasing, but if not, you'll get extra IO operations caused by page splits and you'll get fragmented indexes, causing performance issues. To remove the fragmentation you need to rebuild your indexes regularly to increase the performance, but that is a subject for later blog posts. In the SQL Azure platform, you shouldn't need to care about the infrastructure, like files or file groups, and it isn't possible to place tables or indexes in different file groups.
Note how the DDL is mixed with physical implementation and how complex the latter already is for just a basic table definition and index. This not only induces/reinforces the physical-logical confusion, but also the instinct to blame the relational model and logical design for performance problems, despite the plethora of tunable physical factors that determine performance and that could and should be considered first: it is much easier to merge tables than to master the physical complexity.
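To give just a flavor of that physical machinery (a sketch of mine, not Winther's, reusing his Table1): checking for and repairing index fragmentation in SQL Server is a purely physical maintenance task that never touches the logical definition of the table or any query against it.

-- Purely physical: inspect fragmentation, then rebuild or reorganize.
-- Nothing here alters the logical table definition.
SELECT i.name, ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Table1'),
                                    NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

ALTER INDEX ALL ON dbo.Table1 REBUILD;  -- or REORGANIZE for light fragmentation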

Codd did not envision that his simplification of the logical level would prompt practitioners to look for good performance not at the complex physical level where it is determined, but at the logical level, which is much simpler (akin to looking for lost keys not where they were lost, but where there is light).

Most practitioners are not aware (a) that any performance gains obtained this way come not from denormalization per se, but from trading off integrity for performance, and (b) of the other costs of denormalization (see The Costly Illusion: Normalization, Integrity and Performance).
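A small hypothetical illustration of that tradeoff, continuing the earlier Orders/Customers sketch: if CustomerName is copied into Orders to avoid a join, the DBMS no longer guarantees for free that it agrees with Customers. Either the redundancy goes unprotected (integrity is traded away), or extra enforcement--sketched here as a trigger--must be written and paid for on every modification, eating into whatever savings motivated the merge.

-- Assumes Orders has been denormalized to carry a redundant CustomerName.
-- Agreement with dbo.Customers must now be enforced by hand, at a cost
-- on every INSERT/UPDATE; a sketch only.
CREATE TRIGGER dbo.trg_Orders_CustomerName
ON dbo.Orders
AFTER INSERT, UPDATE
AS
BEGIN
  IF EXISTS (SELECT 1
             FROM inserted AS i
             JOIN dbo.Customers AS c ON c.CustomerID = i.CustomerID
             WHERE c.CustomerName <> i.CustomerName)
  BEGIN
    RAISERROR('Orders.CustomerName disagrees with dbo.Customers.', 16, 1);
    ROLLBACK TRANSACTION;
  END
END;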

Fourth, in an interview, Big Data Tradeoffs: What Agencies Need To Know, Peter Mell states:
First is not to get mesmerized by the marketing hype. Second, ask yourself, does my relational database give me the performance that I need with the data I have? If the answer is yes, even if it's a little slow, then they have no business looking at big data. The reason I say that so emphatically is that big data is not the next generation of database technology. There are tradeoffs to be made.
See what I mean? Whether a DBMS performs well or badly has nothing to do with its being relational (particularly since the dominant SQL DBMSs are not relational). But I agree that a lot of the BigData/Cloud/NoSQL heat is yet another marketing-promoted fad.
Many people think that big data is just a lot of data. But there are some real technological challenges to overcome with the big data paradigm. In particular we are hitting the limits of our relational databases in many areas. And relational databases often don't work well with unstructured data like English prose or even semi-structured data like an XML document where you don't know what tag is coming next.
Points arising:
  • The common confusion of RDBMS with SQL DBMS.
  • Either there is data, which is by definition structured, or there is unstructured, meaningless noise. At issue is not whether, or "how much", data is structured, but whether a given structure is the most useful one for given manipulation and integrity purposes.
More specifically, neither is text unstructured, nor is XML "semi-structured". Rather, they are both structured, but the structures do not lend themselves to the kind of manipulation and integrity that R-tables do and, therefore, they do not satisfy the same kinds of informational needs. Such satisfaction by RDBMSs comes at a price: business modeling and logical design, based on knowledge and understanding of the reality to be represented in a database. Mell is aware of that:
When you move into the big data world, you are often giving up very powerful query mechanisms like SQL. You are also giving up perfect data consistency in transactions. What you are gaining is scalability and the ability to process unstructured data.
One tradeoff involves consistency. It's not so much that your data is inconsistent. It's that it will be eventually consistent. Think of the banking system. I am on a trip to Europe, my wife deposits $100, five seconds later I check an ATM in Europe, and the $100 isn't there. Oh no, what happened? Well, it's okay. My $100 will eventually show up. It will be eventually consistent. I would argue if the banking industry can use big data technology, so can you if you need it. Ideally, big data will be consistent all the time.

In the big data world, there is now a very famous theorem called the CAP theorem: consistency, availability and partition-tolerance. I hesitate to ever mention theorems to people, but it really helps people understand the tradeoffs.
Scalability is a physical implementation aspect which has nothing to do with the relational model either (Chris Date was writing on the subject of distributed relational databases and RDBMSs--the two are not the same thing--and the difficulties of ensuring perfect consistency when most proponents of NoSQL had not yet been born).

The point is that users should not have to choose between informational needs at the logical level and good performance at the physical level. If there is no (1) need for what RDBMSs do, or (2) sufficient knowledge and understanding of reality, or (3) willingness to invest time and effort in modeling and design, then non-relational technologies might do, but you get what you pay for: you cannot satisfy the same informational needs. But in the absence of foundation knowledge, the marketing hype lures users into exactly that illusion: that NoSQL/BigData tools will satisfy the same needs without the upfront effort.

Dr. Peter G. Neumann is to security what Codd was to database management:
For many of those years, Dr. Neumann (pronounced NOY-man) has remained a voice in the wilderness, tirelessly pointing out that the computer industry has a penchant for repeating the mistakes of the past. He has long been one of the nation’s leading specialists in computer security, and early on he predicted that the security flaws that have accompanied the pell-mell explosion of the computer and Internet industries would have disastrous consequences.
--Killing the Computer to Save It
