Sunday, December 9, 2012

Brother, Spare Me the Paradigms

In an interview, Louis Davidson is mostly right and his heart is in the right place. A few quibbles and clarifications follow.
"Consider dimensional design and Big Data as two additional paradigms. Data warehousing has not only allowed us to take the strain of reporting off our OLTP servers (leading to better reporting capabilities), but it has also given us the ability to support larger relational databases capturing more and more business data."

Regarding warehouses, I refer him and the reader to my previous post. It is the relational model that introduced database views, and materializing them for query-only application needs is nothing special. And to reiterate yet again, an R-table is a multidimensional representation of reality. That the industry does not know, understand, appreciate, and exploit the relational paradigm does not mean that different paradigms are necessary, or that any industry fad is such. Insofar as database management is concerned, only a formal data model a paradigm makes, because, whether practitioners realize it or not, there cannot be data management without one. Whatever dimensional design and Big Data are, they certainly do not have one.
"Big Data paradigms like Hadoop and NoSQL will alleviate the temptation people have to try to use the relational database in unnatural ways. Extremely large quantities of unstructured data is not exactly relational databases’ strongpoint. And while SQL Server continually gets better at it, the massively parallel capability to deal with data that doesn’t have to be continually consistent lets you capture data volumes that would be silly in a relational database."
Neither do Hadoop and NoSQL. Using relational databases in "unnatural" ways is a terribly misleading expression. If your informational need is inferring facts that are logical implications of facts recorded in the database, then there is no data model as sound and cost-effective as the relational model.

Hadoop, as far as I can tell, is promoted for scalability, which is a physical implementation aspect that has nothing to do with the purely logical data model. There is nothing to prevent true RDBMSs from being scalable, and if SQL DBMSs are not, that has little to do with their being relational (which, actually, they are not). I very much doubt that transactional systems can afford "eventual consistency". Not everybody has Facebook, Google, or Twitter needs, and I would be very surprised if even they used Hadoop or NoSQL for their accounting or financial transaction systems. Besides, larger data volumes don't justify renouncing consistency. Without a sound OLTP database to populate warehouses, how can the integrity of the latter and its query results be protected?

SQL may not be the solution, but giving up whatever relational benefits it does provide at the logical level for what are unrelated physical implementation deficiencies is trading down. Whatever NoSQL products do--and they are all different--they cannot satisfy the same informational needs as RDBMSs, certainly not cost-effectively (see below). The solution is scalable true RDBMSs.
"By definition, normalization demands that we break down all of the data storage to the most basic bits of data possible. The problem is that the lowest form of the data that one could attain isn’t always what is needed for the application. I commonly use the example of a Microsoft Word document. If your goal is to store a Word document in a database, you would need to decide whether to store the document in one column in a table or place every character and formatting mark in a row of its own. Duh, one column, right? In 99.9% of cases, this is absolutely the case. But in that .1% case, you actually may need to break the data up into smaller units to support an operation that the user needs doable in a very natural manner. For example, if your application counts the number of uses of the individual characters in documents as its primary output, parsing the document once and storing it as one row per letter would be optimal."
"Breaking down all of the data to the most basic bits" is a very poor characterization of normalization. Normalization is a design repair procedure: it "unbundles" multiple entity types each to a separate table, to avoid certain drawbacks that bundling has. It is purely logical (and is orthogonal to physical storage).

I wonder if Davidson confuses normalization with the tabular representation, where the value in an R-table is "the most basic bit", whether the table is fully normalized (5/6NF) or not. But a value can be anything--a number, a string, an image, an R-table, or a document. What those table values are is determined by the business model, which identifies the attributes and entities of interest and which is determined, in turn, by informational needs.

For example:
  • If a document is a contract describing some business transaction, then the entity could be the transaction, one attribute of which is the document describing it. This model would map to a table whose rows represent transactions, with a CONTRACT column whose values are the text of every contract, drawn from a CONTRACT domain.
The problem that Davidson refers to, of document values being too "coarse" for the application, arises with this design: retrieving precise information from, or enforcing integrity on, raw text is prohibitive (see my exchange with reader Louis in the comments to my post). Moreover, domains encapsulate, another way of saying that all operations applicable to text must be programmed into the domain. Any operation not programmed into the domain will not be available to users. This is the approach taken by objected [oops. Freudian slip?] oriented DBMSs, which places a heavy burden on programmers, for which reason they are referred to as "DBMS building kits" (see Chapter 1 of PRACTICAL ISSUES IN DATABASE MANAGEMENT, available via this site).
  • If a document contains, say, encrypted text and the purpose is to analyze it linguistically to help decipher the code, then implicitly modeling documents as entities, with "characters and format markups" or words as their attributes, would satisfy the purpose. Documents would be represented as implicit rows in a table whose columns have characters and markups as values. An RDBMS provides the operations for these kinds of values.
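The two designs above can be sketched side by side. This is a hypothetical illustration with names of my own (again using SQLite as a stand-in): the first table models the transaction as the entity with the contract text as one attribute value; the second models the document's characters themselves, one row per character position, which makes character-frequency queries ordinary relational queries.

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")

# Design 1: the document is an attribute value of the transaction entity.
con.execute("CREATE TABLE transactions (tx_id INTEGER PRIMARY KEY, "
            "contract TEXT)")
con.execute("INSERT INTO transactions VALUES (1, 'Party A shall pay Party B')")

# Design 2: characters are the values of interest, one row per position.
con.execute("CREATE TABLE doc_chars (doc_id INTEGER, pos INTEGER, ch TEXT, "
            "PRIMARY KEY (doc_id, pos))")
text = con.execute(
    "SELECT contract FROM transactions WHERE tx_id = 1").fetchone()[0]
con.executemany("INSERT INTO doc_chars VALUES (1, ?, ?)",
                list(enumerate(text)))

# Character counts now fall out of an ordinary grouped query,
# rather than requiring operations programmed into a text domain.
counts = dict(con.execute("SELECT ch, COUNT(*) FROM doc_chars GROUP BY ch"))
assert counts == dict(Counter(text))  # agrees with a direct count
```

Which design is appropriate depends entirely on the informational need, which is the point: the business model, not the document format, determines the entities and attributes.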

Modeling documents as either entities or attributes is attractive because it is more expedient upfront and relegates problems to the future. But more often than not the documents themselves are neither the entities nor the attributes of interest; rather, the entities and attributes must be extracted from the text content (see Business Modeling for Database Design).

You can:

1. Leave data "unstructured" and either undertake complex programming burdens, or make do with "coarse" information and risk inconsistent data.


2. Invest knowledge, time and effort into business modeling upfront and let the DBMS do most of the work providing precise information while protecting the integrity of data and query results.

What you cannot do is achieve the same informational objectives cost-effectively from "unstructured data" that are possible only with relationally structured data. There is no free lunch.
