Wednesday, February 26, 2014

Anatomy of a Data Management Project: Distribution Independence

The term "distributed" is thrown around a lot these days. Hype notwithstanding, just as with analytics and data science, distribution in data management is nothing new. In fact, SQL vendors (IBM, Sybase, Ingres, Oracle) -- frequently criticized today for non-scalability -- tackled distribution decades ago. The non-relational systems preceding SQL were not amenable to it, and SQL is the closest to the relational model the industry allows you to get.


In Anatomy of a Data Management Project, I mentioned the Diaspora project, which claimed to be a distributed alternative to Facebook. Developer Sarah Mei writes, "Once you log in, Diaspora’s interface looks structurally similar to Facebook’s... The main technical difference between Diaspora and Facebook [which, she says, runs "on a single logical server"], is invisible to end users: it's the 'distributed' part." 

I suspect the author means physical server. Remember my warning against confusing levels of representation? A clear distinction between physical and logical is particularly important insofar as distribution is concerned. 

What is of practical consequence is whether the details of the physical distribution across multiple computers are transparent to users and applications, that is, whether queries and applications can avoid referring to those details explicitly. Aside from simpler development, maintenance, and data access, transparency means that when the distribution changes (e.g., more computers are added or data is redistributed), existing queries and applications continue to work unchanged. For such distribution independence, all the necessary functionality -- transactions, consistency, recovery, integrity, concurrency control, security, performance optimization, and so on -- must be encapsulated in the DBMS, not left to developers in each and every application.
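To make the idea concrete, here is a minimal Python sketch of distribution independence. It is a toy, not a description of any real product: the `DistributedStore` class, its byte-sum placement rule, and the `lookup_user` function are all hypothetical names invented for illustration, and the sketch ignores transactions, recovery, and everything else a real DBMS must encapsulate. The only point it shows is that application code refers to data by logical key, so it keeps working when the physical layout changes.

```python
from typing import Dict, List, Optional


class DistributedStore:
    """Toy stand-in for a DBMS that hides where rows physically live."""

    def __init__(self, node_count: int) -> None:
        # Each dict plays the role of one physical computer (hypothetical).
        self._nodes: List[Dict[str, str]] = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> Dict[str, str]:
        # Placement rule is internal; callers never see node indexes.
        index = sum(key.encode("utf-8")) % len(self._nodes)
        return self._nodes[index]

    def put(self, key: str, value: str) -> None:
        self._node_for(key)[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._node_for(key).get(key)

    def redistribute(self, node_count: int) -> None:
        """Change the physical layout without changing the interface."""
        rows = [item for node in self._nodes for item in node.items()]
        self._nodes = [{} for _ in range(node_count)]
        for key, value in rows:
            self.put(key, value)

    def node_sizes(self) -> List[int]:
        # Exposed only so the layout change can be observed in a demo.
        return [len(node) for node in self._nodes]


def lookup_user(store: DistributedStore, user_id: str) -> Optional[str]:
    """Application code: refers to data by logical key only."""
    return store.get(user_id)
```

Calling `lookup_user` before and after `redistribute` returns the same values, even though the rows may now sit on different nodes. Had the application been written against node indexes instead, every such change would have meant rewriting it.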

More than a decade ago C. J. Date specified that DBMS functionality in 12 Objectives for Distributed Database Systems. Insofar as users and applications are concerned, they all boil down to a distributed DBMS behaving in all respects exactly like a non-distributed DBMS. Alas, it would be an understatement to say that this is a non-trivial task. 

Relative to a non-distributed scheme, where the DBMS, the database, and applications all reside on the same physical server, distributed schemes involve either the database, or the database and the DBMS, operating across several servers. 

Satisfaction of the 12 objectives means that the local components are treated as databases and DBMSs in their own right and are also integrated dynamically into various combinations via communication and cooperation, depending on user tasks. There is both central and local database management. The more transparent the scheme, the more demanding it is of DBMS designers, but the more flexible and easy it is for users.

What this entails for data consistency, concurrency control, performance optimization, security, data administration, and other database functions is beyond the scope of this post. My aim is just to alert you that "distributed" claims lacking details as to exactly what is distributed, and which of the 12 objectives are satisfied, do not provide the information necessary to assess a product and its usefulness.
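One way to apply that test to a product claim is a simple checklist. The sketch below is illustrative only: the objective names are given as they are commonly listed from Date's work (this post does not enumerate them), and the `unaddressed_objectives` function is a hypothetical helper, not part of any tool.

```python
from typing import Iterable, List

# The 12 objectives as commonly listed from C. J. Date's work
# (names supplied here for illustration; not enumerated in the post).
DATE_OBJECTIVES = (
    "Local autonomy",
    "No reliance on a central site",
    "Continuous operation",
    "Location independence",
    "Fragmentation independence",
    "Replication independence",
    "Distributed query processing",
    "Distributed transaction management",
    "Hardware independence",
    "Operating system independence",
    "Network independence",
    "DBMS independence",
)


def unaddressed_objectives(claimed: Iterable[str]) -> List[str]:
    """Return the objectives a product description says nothing about."""
    normalized = {name.strip().lower() for name in claimed}
    return [obj for obj in DATE_OBJECTIVES if obj.lower() not in normalized]
```

A description that says only that the distribution is "invisible" might, at best, be read as claiming location independence; passing `["Location independence"]` leaves eleven objectives unaddressed.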

The claim that Diaspora's distribution is "invisible," therefore, is not enough. Can you determine from the description above whether that claim is valid? 
