Sunday, May 26, 2013

Hadoop vs SQL: How to Compare Data Models


Does Hadoop point to the demise of the relational data warehouse? asks JG. While most responders do not expect such a demise, they have different opinions on the subject.

  • RR: "...non-relational data management methods--[Hadoop being considered one, will be required only] to address all those huge volumes of unstructured data";
  • TD: "Hadoop handles traditional structured data just fine, albeit in a different way than a RDBMS ... includ[ing] some degree of SQL support [but this] does not mean that [it] is good at the same things as a well managed RDBMS", [a demise as one of] "those myths ... that seems to come up every two weeks" ... HBase has a data model and works quite well when your data/application needs fit its design center.
  • KK: "EDW vendors [will] incorporate Hadoop framework into their core architectures to enable advanced and high performance analytics";
  • MR: "all relational database vendors are already doing" just that; 
  • TE: [RDBMS will be replaced] "especially [for] transactional workloads [because] RDBMS will always perform faster do [sic] to the seek-based architecture of the RDBMS compared to the hadoop architecture which is more scanned based [that] was created to do massive batch processing of big amounts of data ...";
  • TN: the "ACID properties inherent in the RDBMS system [that is] not the case with hadoop".
  • RS: "Hadoop seems to take over relational database as Hbase can store even unstructured data whereas relational data warehouse limits to structured data."
At this point I felt compelled to point out that (1) "unstructured data" is a contradiction in terms; (2) the relational data model (RM) is predicate logic and set theory applied to databases, neither of which has anything to do with performance; and (3) there are no truly and fully relational databases and DBMSs, let alone DWHs (see Truly Relational: What It Really Means). The reader should, therefore, substitute 'SQL' for 'relational' in what follows whenever the latter term is used by anybody other than myself, and should not confuse the two.

Given that database management is but manipulation of some data structure, it is hardly surprising that HBase has one. A common problem in the industry is a fixation on the structural component of the data model, without considering its implications as a target for manipulation and integrity enforcement. Anything can be "stored"; the important question is to what informational use it can be put, and terms like "works just fine", "handles", or "addresses" obscure this core aspect.

It is in this context that replacement of one data model by another must be considered. To replace the one underlying SQL, a data model must, first, qualify as a complete and well-defined data model and, second, be superior for a given informational objective (see Business Modeling for Database Design for superiority criteria). Hence, I asked TD: "Can you specify the structural, manipulation and integrity components of the Hadoop data model and its formal basis? Precisely, please!"
TD: I don't believe anyone here said that Hadoop provides a data model that was "better" than RM. Second, what several of us have said is that there are different ways to work with your data and that Hadoop does provide the ability to manage data. In pragmatic terms - and stepping away from pure semantics into what people actually do - Hadoop allows you to:
1) Collect data
2) Store data
3) Organize the data
4) Access data
5) Manipulate data
Hard to argue those capabilities don't fall under what 99.9999% of the world would call "data management". You can't argue that Hadoop doesn't allow for those things because hundreds (thousands actually at this point) of customers do exactly those things.
Third, you can and do model data in Hadoop. The HBase example I gave was provided to show that you do model how data is related to other data. Indeed the first thing you do with HBase is decide how to organize your data into column families. Kind of hard to argue that doesn't happen because we and lots of others do exactly that. Ditto with Hive. Heck, even the choice of how you write data into HDFS requires some choice on organization even if it "just" mimics how it was modeled in the application/source it is being loaded from.
Am I saying that is the same type and depth of modeling as an RDBMS? No. It is still data modeling, yes.
The fact it isn't the same as a RDBMS is the whole point, there are design considerations where you want the HBase model instead of a RDBMS.
First, "replacement" implies superiority, i.e., it can do at least as well as SQL and/or fulfill needs not satisfied by the latter.

Second, that something is being used does not guarantee that it is relatively cost-effective--that requires proof/evidence. The DBMSs preceding SQL were abstracted from existing application development practices and were widely used, yet they failed miserably in the long run. SQL's success notwithstanding, it is hardly a true or full concretization of RM, or even the best one tried. In fact, the level of foundation knowledge in, and the history of, the industry do not justify an expectation that the best technology will always displace inferior ones.

Third, "different ways to manage data" is so much handwaving. It is how data are managed that determines the practical value of one data model relative to another. The three components of RM can be specified succinctly, simply and precisely: R-table, relational algebra and several specific kinds of integrity constraints that can be expressed with the algebra; and so can its formal foundation: predicate logic and set theory, as well as the many practical implications. Unless and until the same is provided for any other database technology, nothing can be claimed about replacement of SQL, let alone RM, and the desirability thereof. In fact, from various descriptions of Hadoop, a precise definition of its data model is hard to come by. "Modeling columns" and "expressing relationships" are too little and too vague.
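To make the three components concrete, here is a minimal Python sketch. It is not a DBMS and not anything the participants posted: relations are modeled as sets of tuples (structure), three algebra operators work on them (manipulation), and a key and a referential constraint are written purely in terms of those operators (integrity). All function names and the emp/dept sample data are hypothetical, chosen for illustration only.

```python
# Toy illustration only; every name below is hypothetical.

def relation(*rows):
    """Structure: a relation is a set of tuples. Each tuple is a frozenset of
    (attribute, value) pairs, so duplicates and ordering have no meaning."""
    return frozenset(frozenset(r.items()) for r in rows)

def restrict(rel, pred):
    """Manipulation: keep the tuples for which the predicate is true."""
    return frozenset(t for t in rel if pred(dict(t)))

def project(rel, attrs):
    """Manipulation: keep only the named attributes (duplicates collapse)."""
    return frozenset(frozenset((a, v) for a, v in t if a in attrs) for t in rel)

def join(r, s):
    """Manipulation: natural join on all attributes the two relations share."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return frozenset(out)

def key_holds(rel, key):
    """Integrity: projecting on a key must not lose any tuples."""
    return len(project(rel, key)) == len(rel)

def reference_holds(child, parent, attrs):
    """Integrity: every referencing value must appear in the parent."""
    return project(child, attrs) <= project(parent, attrs)

dept = relation({"dno": 1, "dname": "Sales"}, {"dno": 2, "dname": "R&D"})
emp = relation(
    {"eno": 10, "ename": "Ann", "dno": 1},
    {"eno": 11, "ename": "Bob", "dno": 1},
    {"eno": 12, "ename": "Cy", "dno": 2},
)

# Names of employees in Sales, expressed only with the operators above.
sales_names = project(
    restrict(join(emp, dept), lambda t: t["dname"] == "Sales"), {"ename"}
)
```

The sketch shows only that each component can be stated in a few precise lines; a proposed replacement would need a comparably precise statement of its own structure, operators and constraints.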
VR: The fact is clear that Hadoop and RDBMS were built with different use cases in mind. You can never compare apple with orange here. Having said that, layers on top of Hadoop are being added to cater to different use cases. Scalability is a key factor here; there are distributed databases in the market, but then the open source contribution is quite minimal, which makes people think Hadoop rage over traditional RDBMS.
The core idea behind the database concept is native application neutrality--a general purpose data resource with multiple "uses" (applications) and data functions centralized in a DBMS. "Layered additions" for different uses indicates that this idea was either missed or disregarded. Incidentally, why would one choose "some" rather than full SQL support?

Scalability and distributivity are properties of the physical implementation, not of the logical data model. The first ever truly distributed DBMS (DDBMS) was IngresStar, a relational system based on QUEL, rather than SQL. Date, a relational proponent, published his 12 rules for DDBMS in the 70's-80's and explained why an RDBMS lends itself to distribution better than any alternative. Few of today's proponents of BigData/NoSQL know that Codd was thinking about "eventual consistency" before they were even born (see Why Is Relational So Sacred Anyway?).
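The separation between logical model and physical implementation can be illustrated with a small Python sketch, under loudly stated assumptions: `restrict`, `partition` and `distributed_restrict` are hypothetical helpers, hash-based fragmentation is just one arbitrary physical layout, and the `orders` data is made up. The same restriction yields the same relation whether the tuples sit in one place or are spread over several fragments.

```python
# Toy illustration only; every name and all data below are hypothetical.

def restrict(tuples, pred):
    """Logical operation: keep the tuples for which the predicate is true."""
    return frozenset(t for t in tuples if pred(dict(t)))

def partition(rel, attr, n):
    """Physical layout: split a relation into n fragments by hashing one attribute."""
    fragments = [set() for _ in range(n)]
    for t in rel:
        fragments[hash(dict(t)[attr]) % n].add(t)
    return fragments

def distributed_restrict(fragments, pred):
    """Apply the restriction to each fragment, then union the partial results."""
    result = set()
    for fragment in fragments:
        result |= restrict(fragment, pred)
    return frozenset(result)

# One logical relation: tuples are frozensets of (attribute, value) pairs.
orders = frozenset(
    frozenset({"order_id": i, "amount": i * 10}.items()) for i in range(1, 9)
)

def big(t):
    return t["amount"] > 40

central = restrict(orders, big)
layouts = [
    distributed_restrict(partition(orders, "order_id", n), big) for n in (1, 3, 5)
]
```

Every entry in `layouts` equals `central`: the number of fragments changes where the work is done, not what the query means.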
GS: Here's a short, interesting take on where Hadoop and RDBMS's fit in complementarily!
I might debunk the video in the future.







3 comments:

  1. I have heard people defend the lack of constraints in "Big Data" tools like Hadoop by saying constraints don't matter if the database is for a data warehouse and therefore "read only".

    If you think about it, a genuinely read only database can only have one possible value - and the quantity of data in it is therefore anything other than big.

    Replies
    1. I heard another rationalization of Hadoop: that it is for one-off analysis of temp data that is not important after the analysis (e.g. analysis of a snapshot of Likes at a point in time).

      In fact, I was asked if I have anything against "relational functionality in apps".

  2. Great article! Very informative and simple easy to understand. Keep posting it really help for those people that needs this. I will refer your blog to my friends. Thanks a lot!
