Sunday, August 5, 2012

Schema, NoSQL and the Relational Model, Part 1

In What's Wrong with the Schema? Stephen Haberman is
...wondering why schema-less databases are so popular lately. Most any NoSQL store is schema-less. And while perhaps schema-less-ness is an integral part of NoSQL (e.g. most NoSQL databases are just opaque key/value stores), I would assert it’s an orthogonal concern, and that document-oriented databases, e.g. MongoDB, could arguably have a schema. However, MongoDB doesn’t just say it’s schema-less for technical reasons related to being a NoSQL store, it actually touts its lack of a schema as a benefit, claiming it is “agile” and offers “simplicity and power”. I find all of this confusing, as I actually want a schema.

Now, I'll admit that I don't know much about NoSQL products. Chris Date likes the idea of no SQL, but not what products did with it. I concur with the former part and the latter part validates my suspicion that it's not really SQL that NoSQL proponents have problems with, but the relational model (just like most everybody, they confuse it with SQL).

I went to the MongoDB site and found this:
MongoDB (from "humongous") is a scalable, high-performance, open source NoSQL database. Written in C++, MongoDB features:
Document-oriented storage: JSON-style documents with dynamic schemas offer simplicity and power.
Ah, so we're talking docubases, not databases, no wonder it's NoSQL.

I'll get to the schema, but before I do, here's an exchange I once had with Matt Rogish on the subject on documents (revised with the benefit of hindsight, to improve clarity and strengthen the important points.)

Docubases vs. Databases

MR: I think for my own fun I'll play around with representation of "documents" in the RM. One thing people ask me a lot is "Well if the RM is so good tell me how to implement something like a Word document in a RDBMS!" It seems complex but not impossible. I usually mention things like "sentence is a set of words", "paragraph is a set of sentences", etc. but it's hard to visualize how a RDBMS would operate based upon a few little examples. I guess the question is how "complex" of relations do you create -- word, sentence, paragraph, document -- or do you let the type system handle something like a "document" type (perhaps using relations described earlier)?

With all the hoopla over XML-based document management (I think the OpenOffice product stores all its documents in XML, as does new versions of Microsoft Office) I think there ought to be a Relational answer presented.

Take a look at Chapter 1 in PRACTICAL ISSUES IN DATABASE MANAGEMENT (available from the Home page), it touches on the underlying problematics. For starters, a document combines data and presentation and the presentation is often meaningful and that is precisely one of the things Codd wanted to avoid. 

MR: I re-read Chapter 1 again and, if I'm understanding it correctly, it seems to indicate that the DBMS developer/database designer can either create a bunch of relations or a type—and you favor the relation-level approach since it turns it into a database design problem and not a programming problem?

Certainly a "document" is a non-trivial piece of information to model in a DBMS--but are there theoretical constraints which make it impractical to do so? I mean, provided you did it correctly I would think that you would just have data.

To quote Date: Types are things we can talk about, relations are sets of statements that we can utter about those things. OO people like types so much that they have only types and no relations. And, of course, somebody must provide operations and integrity for those types and that requires programming. That's a consequence of encapsulation.

Only the data in the document would be managed by the DBMS. But document data do not lend themselves well to relational structure, unless you model them into R-tables, so that relational operations and integrity constraints can be applied (what about document content that is not text?) 

MR: I was thinking primarily of the other benefits a DBMS offers—stuff like concurrency/security control, time-varying data, logical data independence, etc.

But I see your point. What predicates would apply to documents? I can envision how you could (simplistically) *model* a document (documents made up of paragraphs made up of sentences made up of words) but other than that what would you do with it? It's not like you could really apply any rules (aside from grammatical, perhaps) which made much sense. No new facts would need to be derived from the raw data itself (metadata, like author, subject, etc. might be useful, but that's neither here nor there).

The only thing I could think of is that it would be nice to relate different bits of documents together--kind of like hyperlinks but in a DBMS-supplied way. Also, if I'm typing a letter it would be nice to embed DBMS-supplied data into it. If I'm mailing something to Bob Smith, I might want it to pull in his address for me automatically and also update it as it changes. I might gather a chart of sales figures for a presentation -- I'd like the presentation to issue a query to the DBMS so that I get the latest figures (and avoid potential embarrassment).

However, none of that really requires that a document be stored in the DBMS itself--just that the presentation layer application have access to it to issue queries. The issues raised in DBMS document storage are non-trivial enough that it probably outweighs the other benefits. It *is* seductive, though. :)

As Codd realized, to do database management you must have some data model, period! You cannot do it without one. Indeed, a schema is based on a data model.

So in order to design a database system of the kind you envision you must first define the "document data model": structure, manipulation and integrity. What exactly is it?

A W3C committee tried for quite a while to define one for XML. For what happened, stay tuned for part 2. 

Eric Kaun Comments (8/5/12): The software engineer Michael Jackson (author of Problem Frames) calls a problem domain like word processing a “workpiece” problem. The goal is maintenance of a single artifact, and is distinct from that of a central data store. As long as users share the same tool, they can exchange this artifact (typically a single file), but given the power typically at the tool's disposal (e.g. the entirety of an increasingly powerful computer), there's little need for scaling efficiency. Even a large document can be scanned, sans indices, quickly enough to seem instantaneous, and building custom indices isn't terribly hard.

More important, this document is typically not edited simultaneously. MS Word and other word processors have revision and merge tools, but products designed to enable simultaneous editing – like Microsoft's OLE, if I recall – are just seldom used. And they're manipulated, one version at a time, in the single tool that created them (sometimes a secondary viewer, but usually the same application), by a single user.

Shared databases, on the other hand, support many users simultaneously, and more importantly, support many applications. That, I think, is the key difference. With relational (and even SQL) you're building the data for an enterprise.



But I think the "Word document" reference distracted from the fact that the subject was "docubases vs. databases" and the problematics of pinning down a "document data model". One pertinent area is probably Content Management Systems (CMS) e.g. publishers or media. These systems usually don't make logical inferences, however, and mostly fall outside the scope of the relational model. On the other hand, these systems often maintain data about the documents, which may well require relational capabilities that DocBMS's would have a hard time with.

No comments:

Post a Comment

View My Stats