Sunday, August 26, 2012

Schema, NoSQL and the Relational Model Part 3

I started in Part 1 with Haberman's justifiable doubts about "schema-less" NoSQL databases, using the MongoDB docubase as an example; and an exchange on the problematics of a "document data model" with Matt Rogish. In Part 2 I completed the exchange using the example of the W3C effort to define such a model for XML. I am now returning to schema and MongoDB DocBMS.

The company does not claim "schema-less-ness", but rather "dynamic schemas". Indeed, they have a document titled 'Schema Design', that states.
In relational data models, conceptually there is a "correct" design for a given entity relationship model independent of the use case. This is typically a third normal form normalization. One typically only diverges from this for performance reasons. In MongoDB, the schema design is not only a function of the data to be modeled but also of the use case. The schema design is optimized for our most common use case. This has pros and cons – that use case is then typically highly performant; however there is a bias in the schema which may make certain ad hoc queries a little less elegant than in the relational schema.
So, first, my gut sense that this is not really NoSQL, but NotR was correct. Second, by "relational data models"--plural--they mean logical models.

But third, ah, yes, the good old logical-physical confusion. Whether a logical model is "correct" for a certain business model or not, what does it have to do with performance, which is determined exclusively at the physical level?  Given so many factors there that can affect performance, (You're Doing it Wrong 5 Factors That Affect Database Performance),the propensity to blame the relational model is simply irrational.

Given a relational schema--which is logical-- what exactly prevents use-case optimization at the physical level, other than the product implementation and physical database design? Wasn't physical data independence a core idea of the relational model? If logical design must be contaminated with physical details and a logical schema must be biased for performance purposes, can this be the fault of the relational model and schema, or of the product  implementations and/or the physical design of SQL databases? Should the technology devised to obviate the need for such bias be blamed for the failure to implement it properly?

Googling "dynamic schema" I came across Curt Monash, further validation of the anti-relational nature of the NoSQL fad. 
So what do we call those DBMS — for example NoSQL, object-oriented, or XML-based systems — that bake the schema into the applications or the records themselves? In the MongoDB post I went with “schemaless,” but I wasn’t really comfortable with that, so I took the discussion to Twitter. Comments from Vlad Didenko (in particular), Ryan Prociuk, Merv Adrian, and Roland Bouman favored the idea that schemas in such systems are changeable or late-bound, rather than entirely absent. I quickly agreed.
You can always count on him to complain about the "complexity" and inflexibility of relational schemas. Sure enough, he informs us that
E. F. “Ted” Codd taught the computing world that databases should have fixed logical schemas (which protect the user from having to know about physical database organization).
Huh? Can't a logical schema change, but continue to insulate from physical database organization? But how did Monash conclude that Codd promoted a "fixed schema' in the first place?
Over the next 15 years, Codd’s thinking — and his employer IBM’s technology — evolved to the point that Codd proposed 12 rules for a relational DBMS, the three most fundamental of which are:

Foundation Rule: A relational database management system must manage its stored data using only its relational capabilities.

Information Rule: All information in the database should be represented in one and only one way — as values in a table.

Guaranteed Access Rule: Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

I.e., Codd was positively asserting that a database should have a fixed logical schema, in a tabular form. The clear implication was that programmers could or should be able to write anything they wanted to against that schema, without database performance being unduly compromised.
I note again, in passing, the logical-physical confusion in the last line. But how exactly does a "fixed logical schema" follow from the rules??!?!

Actually, Codd did the exact opposite of what Monash claims: he enhanced the flexibility of the schema precisely because he understood the importance of schema changes, and he wanted to minimize their impact on users and applications. To that end, not only does the relational model ensure physical data independence, by detaching the schema from the physical organization (which preceding DBMSs had embedded in it), but the view mechanism also ensures logical data independence--protection from some logical schema changes. What would be the point of that if schemas were fixed?

So is there anything behind these "relational schema inflexibility" claims? My friend Eric Kaun pointed to the following:
I was referring to tooling to assist the developer in keeping different software branches and releases in sync with the database schema, and managing the schema through source code control systems like CVS, Perforce, etc. One of the major complaints, and sources of home-grown scripts and ad hoc procedures, is the inability to automatically synchronize database schema with code release. I've seen countless PL/SQL script packages that migrate tables to new versions, and they all differ (and typically get some things wrong).

The "shift of base relations" was my overly-terse phrase for allowing the engine to change the base relations, transparently, in order to achieve better performance (and stave off the unfortunate urge to denormalize). The engine should be able to analyze the queries given to it, and at least suggest transformations that preserve all existing rules, but adjust the base relations ("stored tables" as opposed to "materialized views" and such) for performance. Developers need not know anything about it.
What we are really talking about here has very little to do with the relational model, or fixed or inflexible schemas. Rather, on the one hand there are SQL implementations that make it difficult to do many things that would be much easier with truly relational products, such as the ones Eric refers to; and on the other hand, poor knowledge and understanding of data fundamentals and the difference between database functions and the needs fulfilled by some NoSQL systems, such as docubases. These factors combine to inhibit the development of TRDBMSs and tools that exploit their advantages.

Codd made an additional crucial contribution. He came up with the concept of a formal data model because he realized that database management is impossible without one: it's all about structure to which operations are applied (manipulation) to ensure the integrity of the data and to make logical inferences. The R-table is a simple, versatile structure and, as a faithful representation of a relation, lends itself to theoretically well defined, sound operations and integrity constraints.

Haberman thinks that "MongoDB, could arguably have a schema". But a schema is based on a data model and as we have seen from my exchange with Matt Rogish in parts 1 and 2, there isn't a "document data model" like the relational model.From a MongoDB review:
Document-oriented stores, or document stores for short, aren't new to the world of computing ... Document stores encapsulate data into loosely defined documents, rather than tables with columns and rows. Implementations of the underlying document vary by data store, with some representing a document as XML and others as JSON, for instance.
What exactly is the structure of "a document that loosely encapsulates data", exactly what constraints and operations are applicable to it and on what theoretical foundation do they rest? Schema-less claims are quite understandable in such circumstances. And so are the consequences:
The NoSQL movement has spawned a slew of alternative data stores, all of which attempt to fill voids left by traditional relational database implementations. But while it's easy to fit the various relational databases (MySQL, Oracle, DB2, and so on) under a single categorical umbrella, the NoSQL world is much more diverse, and the NoSQL label is too general. NoSQL data stores such as MongoDB and Cassandra are so vastly different from each other that apples-to-apples comparisons are practically impossible. Thus, within the world of NoSQL, there are subcategories such as key-value stores, graph databases, and document-oriented stores.
This is the situation we had before the relational model. It was then, as it is now, a direct consequence of the lack of a sound theoretical foundation, which Codd overcame by bridging predicate logic and set theory and applying the result to database management.

Don't get me wrong. I am not saying that docubases don't have their uses and, in fact, those may well fulfill needs outside the relational scope.

What I am saying is that there is no free lunch. Claims to the effect that NoSQL products, serve the same needs as RDBMSs, only better; or that they are a solution to "problems" of the relational model; or that they are a superior technology because they are schema-less or have "dynamic schemas" are bunk.

Eric Kaun Comments (8/14/12): Schema design cannot be a function of the use case. When you try that, you necessarily reduce the expectations of the data to the lowest common denominator (subset) of the use cases. An XML schema relates to one set of relational queries; it's a specific configuration for one need. Relational supports a huge (limited, but enormous) number of queries, hence many uses cases, and hence a system (and related systems).

“Late-bound” schema is meaningless – that's a query, or at worst a regular expression match (which is what many XQuery expressions boil down to). 

Biased designs are a step back in the direction of the bad old pre-database days of application data management and it won't be surprising to encounter "application-specific databases", defeating the whole point of database management. 

No comments:

Post a Comment

View My Stats