Sunday, October 7, 2012

Data Fundamentals, Fads and "Big Data"



WS writes:
It is fairly clear to me that the "NoSQL" movement (or "big data" as some people seem to be calling it nowadays) has thrown out the relational baby with the SQL bathwater.
That is exactly the point I was making in my several posts on NoSQL. Since SQL is practically the only commercially implemented data language with any relational characteristics, most practitioners confuse it with the relational model. I quoted with approval David McGoveran's observation that the NoSQL movement has its roots in the anti-relational sentiment expressed by object and XML proponents. Underlying it is a failure to appreciate the distinction between relational technology and the fads that characterize the industry's mode of operation. This failure is due to disregard for, and lack of knowledge of, data fundamentals. In fact, the relational model itself has been treated for a time as just another fad.
In particular I wanted to say that I agree very strongly with Eric Kaun's comments about the supposed inflexibility of relational schemas. In my experience it is actually very easy to change the schema, but often fiendishly difficult to assess the impact of these changes on all the applications that use the database.
For a while I just could not understand what the claims of a "fixed relational schema" meant, because it is obvious that the relational approach increased schema flexibility, via several types of data independence, relative to the products that preceded it. But recently it occurred to me that what the complainers mean is not that schemas are fixed, but rather that exclusive reliance on tables is itself the inflexibility. Why not change the structural basis of the schema "as needed"? This certainly is an explicit, fundamental rejection of the relational model (they cannot even express their complaint correctly).
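As a minimal sketch of the logical data independence mentioned above, the following uses SQLite through Python's sqlite3 module. The employee and department tables, their columns, and the APP_QUERY string are hypothetical illustrations, not anything from the original discussion. The base table is restructured, and a view with the old name keeps the application's query working unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original (hypothetical) schema: one table that the application queries directly.
conn.executescript("""
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        dept_name TEXT NOT NULL
    );
    INSERT INTO employee VALUES (1, 'Ada', 'Research'), (2, 'Bob', 'Sales');
""")

# The application's query; it is never edited below.
APP_QUERY = "SELECT name, dept_name FROM employee ORDER BY emp_id"
before = conn.execute(APP_QUERY).fetchall()

# Schema change: move departments into their own table, then expose the
# old shape through a view that reuses the old table name.
conn.executescript("""
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL UNIQUE
    );
    INSERT INTO department (dept_name)
        SELECT DISTINCT dept_name FROM employee;

    CREATE TABLE employee_base (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    );
    INSERT INTO employee_base
        SELECT e.emp_id, e.name, d.dept_id
        FROM employee e JOIN department d ON d.dept_name = e.dept_name;

    DROP TABLE employee;

    CREATE VIEW employee AS
        SELECT b.emp_id, b.name, d.dept_name
        FROM employee_base b JOIN department d ON d.dept_id = b.dept_id;
""")

after = conn.execute(APP_QUERY).fetchall()
print(before == after)
```

The sketch covers read access only; whether updates can pass through such a view depends on the particular DBMS.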
In spite of the apparent sophistication of many developer environments, none of them, as far as I am aware, do anything as simple as representing the relation between application language expressions and database attributes. This would mean a simple query could find all the dependencies between applications and the database. As is often the case application programmers are ignoring the failings of their own version and dependency management and blaming the DBMS instead.
It depends on what the meaning of "sophisticated" is. It may apply to programming languages, techniques and tools, but not necessarily to data fundamentals such as logic, the different data models with their pros and cons, their practical implications and the history of the field. That explains both why old technologies are reinvented and relabeled as new, and why there is a constant stream of fads. As to the application-DBMS functional distinction, many developers are not even aware of it and see nothing wrong with having to undertake database functions in applications. Hence the increasing number of "application-specific databases".
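WS's suggestion above, recording which application expressions depend on which database attributes so that a simple query can find the dependencies, might look roughly like the following. This is only a sketch under stated assumptions: the app_dependency table, its columns, the sample rows and the impacted_by helper are all hypothetical, and no existing development environment is claimed to work this way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical catalog: one row per (application expression, database attribute) pair.
conn.executescript("""
    CREATE TABLE app_dependency (
        app_name    TEXT NOT NULL,
        source_expr TEXT NOT NULL,
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        PRIMARY KEY (app_name, source_expr, table_name, column_name)
    );
    INSERT INTO app_dependency VALUES
        ('billing', 'Invoice.recipient', 'customer', 'email'),
        ('billing', 'Invoice.total',     'orders',   'amount'),
        ('crm',     'Contact.email',     'customer', 'email'),
        ('reports', 'Summary.total',     'orders',   'amount');
""")

def impacted_by(conn, table, column):
    """Return (app_name, source_expr) pairs that depend on table.column."""
    return conn.execute(
        """
        SELECT app_name, source_expr
        FROM app_dependency
        WHERE table_name = ? AND column_name = ?
        ORDER BY app_name, source_expr
        """,
        (table, column),
    ).fetchall()

# Which application expressions would a change to customer.email touch?
for app, expr in impacted_by(conn, "customer", "email"):
    print(app, expr)
```

The hard part, keeping such a catalog accurate as applications change, is exactly the version and dependency management WS says is missing; the query itself is trivial.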
Regarding NoSQL, I suspect there is an element of cargo-cult thinking going on here. Google have been very successful (at least from a business perspective) using distributed processing so if you use the same methods as Google then you are bound to be successful too.
Obviously, if you lack foundation knowledge you will opt for emulating others without really understanding to what extent their success (insofar as it is not short-lived) applies to your circumstances (see SQL/PostSQL/NoSQL). As an example, from my experience with Google's Blogger in particular and their other products in general (see my previous post), I would not be surprised if their poor ability to fix problems and provide expedient support (if at all!) were due, to some extent, to their reliance on ad-hoc, schema-less products. Such products have the potential to hit the same walls as those hit by pre-Codd non-relational technologies. Facebook, Amazon and other "Big Data" companies exhibit similar problems.

The size of the data has nothing to do with the data model. Distribution has nothing to do with the data model. Consider, again, David McGoveran's list of circumstances which may justify a NoSQL product:
First, when discovery of relationships is more important than consistent processing and specific data results.
Second, if the data processing is meant to be inductive (e.g., suggestive) rather than deductive (i.e., precise).
Third, when the application is changing very fast, data complexity is great (variety or amount).
Fourth, if physical issues, like big data or a high degree of parallelism, are more crucial than data integrity. You must be willing to throw away data consistency in favor of performance and scalability.
Fifth, if you have a mission-critical one-off application for which a fixed data organization is ideal, in which case the costs and risks may be lower than licensing a vendor’s RDBMS [read: SQL] or trying to force an open-source RDBMS [read: SQL] to fit the need.
With knowledge of data fundamentals, one is tempted to conclude that the intention here was to demonstrate that the set of applicable circumstances is very close to being the empty set. In the absence of such knowledge, the temptation to use these tools in order to avoid hard thinking and design effort upfront is just too strong to resist.
