Wednesday, August 15, 2012

Schema, NoSQL and the Relational Model Part 2

Part 1 ended with my comment to Matt Rogish on the subject of a document data model:
As Codd realized, to do database management you must have some data model, period! You cannot do it without one. Indeed, a schema is based on it.

So in order to design a document database system of the kind you envision you must first define a document data model: structure, manipulation and integrity. What exactly is it?
Part 2 continues my response to Matt, using an exchange between me and Hugh Darwen to illustrate what happened when a W3C committee attempted to come up with an XML document data model.


XQuery "Algebra"

KN: (for Hugh Darwen) In his writing, Fabian Pascal has cited comments you've made about the XML query algebra. That piqued my curiosity, but I'm unable to find anything you've written about the subject.
Now, my eyes light up at the word "algebra" ... Originally, I understood it to mean a set of operations that are closed over some type. That is, every operation in X Algebra operates on zero or more values of type X and returns a value of type X. Hence, set algebra, Boolean algebra, relational algebra and the algebra of numbers that gives us arithmetic. Over what is the XML Query Algebra closed? Nobody has ever given me an answer that makes sense (apart from the occasional, honest "I don't know") --Hugh Darwen
THE THIRD MANIFESTO was published before these W3C documents:

1. XML Query Algebra published in 2001 (superseded)
2. XQuery 1.0 and XPath 2.0 Formal Semantics (Sept. 2005)

Have you published any point-by-point analysis of the earlier query algebra or the more recent XQuery semantics documents?
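The closure property in Darwen's definition can be illustrated with a minimal Python sketch. The names BOOL_OPS, is_closed_over_bool and set_results are made up for this illustration and come from neither Darwen nor the W3C documents. The sketch only shows what "closed over some type" means for Boolean and set algebra.

```python
from itertools import product

# Boolean algebra: every operator takes bool values and returns a bool.
# (BOOL_OPS is a hypothetical name used only for this illustration.)
BOOL_OPS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
}


def is_closed_over_bool(op):
    """Return True if op maps every pair of bools back to a bool."""
    return all(
        isinstance(op(a, b), bool)
        for a, b in product([False, True], repeat=2)
    )


# Set algebra: union, intersection and difference of sets yield sets.
a = frozenset({1, 2, 3})
b = frozenset({3, 4})
set_results = [a | b, a & b, a - b]
```

Because each result has the same type as the operands, results can be fed to further operators, which is what lets expressions nest arbitrarily. Darwen's question is what the corresponding type is for the XML Query Algebra.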

Hugh Darwen: Yes, I remember writing that, but I'm not sure when. FYI, what I wrote then remains true today, but only because I stopped asking and haven't had any reason to spend time on further investigation. The people I had asked up to that point included certain members (at the time) of the W3C committee developing XQuery.

From a brief look at the September 2005 document, I can guess that every XQuery expression perhaps operates on one (zero?) or more sequences of zero or more things each of which is either an atomic value, or an element, or an attribute, or a document, or a text, or a comment, or a processing-instruction node, yielding one such sequence. I am already struck by the complexity, without delving into what any of these things might be.

It is rather revealing that the very raison d'etre of the XML idea, the document, had to be discarded in favor of the "sequence" abstraction, which says about everything you need to know about the whole endeavor.

That complexity is what Codd was smart enough to avoid with RM. Those who don't know, forget, or ignore the past are doomed to repeat it.

Hugh Darwen: Absolutely!

They ran smack into the problem I discuss in PRACTICAL ISSUES IN DATABASE MANAGEMENT: there are different types of documents, even different types of text documents (e.g., contracts, email, proposals), and each of those varies too. In all of them presentation plays a part. Will you define a different data model for each? (As we shall see, this is exactly what NoSQL products do, but without clearly defining one, which is what schema-less means.)

You can have document domains in a relational database, that is, domains that have documents as values. But there too you have to define operators and integrity for them; so there would probably have to be different document domains with different operators for different types of documents and industries, e.g., financial contracts, leases, reports and so on. This raises some non-trivial problems. Such domains would have to be designed by those with document-specific expertise and integrated into DBMSs. Is there consensus on them? Who will maintain and support a proliferation of them? The DBMS vendor? The domain designer?
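To make concrete what defining a document domain involves, here is a rough Python sketch of one hypothetical domain. The Lease type, its fields, its constraint and its term_days operator are all assumptions invented for illustration, not a design taken from any DBMS or standard. The point is only that every such domain needs its own integrity rules and operators.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Lease:
    """Hypothetical document domain: one value is one lease document."""
    tenant: str
    start: date
    end: date
    text: str

    def __post_init__(self):
        # Integrity constraint of the domain (assumed for illustration):
        # a lease cannot end before it starts.
        if self.end < self.start:
            raise ValueError("lease ends before it starts")

    def term_days(self) -> int:
        """Domain operator: length of the lease term in days."""
        return (self.end - self.start).days


# A relation with a Lease-valued attribute: a set of (property_id, lease) tuples.
lease = Lease("Acme", date(2012, 1, 1), date(2012, 12, 31), "lease text here")
leases = {(101, lease)}
```

A contract, report or email domain would need a different set of fields, constraints and operators, which is exactly the proliferation problem raised above.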

Remember blades? Whatever happened to them?

Why do you think Codd stayed away from that? And he was a very smart guy, much smarter than those who today push "solutions" without thinking of the implications.

If you have database needs, do the work upfront, model the data relationally. If you don't, docubases might do, but recognize the differences and the consequences. Fulfilling database needs with docubase tools is not a cost-effective solution.

That's what you need to tell those who challenge you for a relational Word.

In Part 3 I return to schema and docubases.  

Eric Kaun Comments (8/14/12): The reduction of XQuery expression types to a “sequence of something” is sparse even for a mainstream programming language, much less a schema or data model. If I see code of type “sequence” I immediately ask “of what?”, “are you sure it's not a set or some other collection?”, “does it relate to other collections of some kind?” etc. A sequence is an impoverished data structure with one operation: next(). If you can't even identify what's coming next, then you're left trying to apply some generic identification mechanism to a serialized (“stringified”) version of whatever's popping out of the pipe. You're doing a regular expression fit, just to determine if data meets your expectations.

There are times when you want this restriction – sometimes I write code that feeds the caller nothing more than a sequence, because that's how I intend it to be processed, and there are problems with doing otherwise (for example, the items are expensive to generate, so I can't build the entire collection all at once). But that's by design, and there are many other tools in the toolbox, so to speak. A data model requires more; sequence is about the lowest possible level of interface and collection type. It's even more spartan than an array.
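Eric's contrast can be sketched in a few lines of Python. Everything here (item_stream, the sample items, the employees and managers sets) is made up for illustration and is not XQuery. The first half shows a consumer that has only next() and must inspect each item to learn what it received; the second shows a set of uniformly typed tuples that supports set operators directly.

```python
from typing import Iterator


def item_stream() -> Iterator[object]:
    """Hypothetical 'sequence of something': items produced lazily, one at a time."""
    yield 42
    yield "<title>Lease</title>"
    yield 3.14


stream = item_stream()
first = next(stream)  # the only operation a bare sequence offers

# The consumer has to inspect every item to find out what kind of thing it is.
kinds = [type(first).__name__] + [type(x).__name__ for x in stream]

# A set of tuples of known type supports membership and set operators directly.
employees = {("E1", "Smith"), ("E2", "Jones")}
managers = {("E2", "Jones")}
non_managers = employees - managers
```

The generator is a perfectly good tool when lazy, one-pass consumption is the intent, as Eric notes. It just carries far less structure than a data model needs.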
