Friday, November 8, 2019

The CAP Theorem and the RDM: Consistency, Availability, Partitioning




by David McGoveran


I have recently posted on LinkedIn two "To Laugh or Cry?" quotes about advantages and disadvantages of "relational" and non-relational" DBMSs from "Choosing The Right Database" @TowardsDataScience.com (by Jun Wu, "a Content Writer for Technology, AI, Data Science, Psychology, and Parenting" with "background in programming and statistics" who "on her spare time, writes poetry and blogs on her parenting website").

I also brought the article to the attention of David McGoveran, who replied "This deserves a response -- it is simplistic, but makes the current, widespread thinking in the industry clear". His comments that follow focus on the three concepts in the title.

Although likely written with good intentions, Jun Wu's article is representative of the muddy thinking that pervades the database and "data science" communities[1,2]. Besides the rather obvious fact that one does not choose a "database" (an organized collection of data), but a DBMS (software that manages databases), it swallows whole and then spreads confused and inconsistent usage of the terms consistency, availability, and partitioning. [FP: That practitioners use the terms database and DBMS interchangeably is part and parcel of the general misuse and abuse of terminology reflecting poor foundation knowledge.]

The terms consistency, availability, and partitioning as used in Brewer's CAP Theorem are not the same terms as those traditionally used in relational database theory and even commercial SQL DBMS practice. For pragmatic reasons I'll qualify the former terms with "CAP" and the latter terms with "RDB", even though I find it misleading to refer to commercial SQL systems as "relational"[3,4,5,6]. Let's look at each term and its intended meaning.


------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you.

NEW

  • 10/26/19: The POSTS page now links to all 2012-2018 posts (to be updated annually at year-end). Except for 2017, the (italicized) links are to abstracts of my columns @All Analytics site, which was discontinued (see below).
  • 10/26/19: Updated and cleaned up the WRITINGS page.
  • 08/09/19: Following my series of posts on data sublanguage (Parts 1-4), I have revised for consistency the corresponding section of paper #2 in the Understanding the Real RDM series, Logical Access, Data Sublanguage, Kinds of Relations, and Database Redundancy and Consistency, which is available for ordering from the PAPERS page.

LATEST PUBLICATIONS (order PAPERS and BOOKS) 


USING THIS SITE 

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics site, the links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA 

I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict. 
------------------------------------------------------------------------------------------------------------------

Consistency



  • RDB consistency refers to every aspect of consistency of the data in a database, meaning that the data are not only physically consistent (the physical record is intact) and read consistent (self-consistent), but also logically consistent with all constraints on that data[7,8] -- including but not limited to domain and attribute[9], and all relationship (e.g., uniqueness) constraints[10] that partially define the meaning of the data[11,12]. Any database that violates the requirements of physical, read, or logical consistency, is -- to put it simply -- garbage, and not worth the media it is recorded on.
  • CAP consistency refers only to a particular kind of read consistency and technically should be called "copy consistency". CAP assumes that multiple copies of the data will be maintained and that at least some of those copies will reside in distinct partitions (i.e., a collection of data that can become network disconnected, possibly due to network failure, subsystem failure, etc.). The idea of "eventual consistency" is that all data copies, regardless of how many partitions they are stored in, will eventually (i.e., given sufficient time) be identical. It says absolutely nothing about RDB logical consistency.


Availability


  • RDB availability refers to availability of the entire system and the data it manages, and requires that the DBMS enforces RDB data consistency[13]. Vendors like to muddy the waters here, claiming high availability even when some required functionality is not usable from time-to-time. Nonetheless, a RDB is RDB-available only if every function necessary to the maintenance of RDB consistency, and all data accessed under operation of those functions, is available. An RDB cannot be RDB-available if it lacks RDB consistency too any degree.
  • CAP availability refers to availability of possibly replicated data. That is, even under network or subsystem failures, some copy of the data would ideally be available to every user of every subsystem. The assumption is that the system could be optimized in various ways by replicating data (i.e., the system maintains copies), each copy of the data residing in a separate "partition". These partitions can be either available or not as a unit. The question is then how much and what data to replicate so as to maximize availability of the entire system. It is intended to be a kind of controlled redundancy. However, since it takes time to keep all the copies the same, there will be periods during which they do not all contain the same values and so are copy inconsistent. Notice that this says nothing about whether or not those copies are logically consistent with constraints!


Partitioning



  • RDB partitioning refers to dividing the data in a database into disjoint subsets, and there are many strategies for doing this. This is distinct from both RDB replication (a.k.a., copy management) and RDB data distribution
  • RDB replication refers to automated means for creating and managing copies of some or all of the data in a database so that the system can automatically provide users with access to an up-to-date version of the data in an efficient manner. Ideally, that copy will be local to the user in a distributed system so that network costs are minimized, RDB replication is also supposed to act as insurance against physical loss of data due to, for example, media failures. 
  • RDB data distribution refers to there being multiple resource managers in a system. For our purposes, think of a resource manager as managing data and determining whether or not to accept data updates and make them permanent. Often, a resource manager may be aseparate DBMS and the system a coordinated collection of DBMSs, each managing a portion of the distributed database. Traditionally, the problem of multiple resource managers arose in the context of geographic distribution across networks but this is only a particular example of the problem to be addressed. In a distributed system, logical consistency requires that resource managers coordinate their data management whenever a data update involves data relationships spread across multiple resource manages.  If consistency cannot be guaranteed, updates are precluded and only consistent data may be read. Notice that this may mean that portions of the database, possibly the entirety, become unavailable until recovered. It also means that a database intended to record realtime events may develop a significant backlog and require a time-consuming recovery operation. Various techniques have been developed to address these issues and minimize their impact.
  • CAP partitioning refers to partitioning of the entire system into potentially disjoint portions, each partition containing a copy of at least some data in the database. Typically, one thinks of the system as a network of nodes which can be divided into multiple partitions. If the network connection between partition fails so that they may not communicate, then users connected to one partition do not have access to data in other partitions and the system cannot keep any data copies consistent. This idea is functionally a combination of issues pertaining to replication and the distributed database issue of having multiple resource managers. The goal is that each independent partition - given sufficient replication of data -- provide its users with copy consistent data despite being in isolation from other partitions for some period of time. Brewer's CAP Theorem shows that this ideal cannot be met. ASIDE: I note that proofs of the CAP Theorem I have investigated make assumptions about consistency that are quite different from those required by RDB.

It makes no sense (i.e., it is nonsense) to compare eventual consistency with data consistency arising from support for ACID transactions. In fact, the notion of transaction in an "eventual consistency" system is not comparable to an RDB transaction. Just because the words are the same should never be taken as cause to treat them as having the same meaning -- let alone the same costs and benefits.

While the advice in Wu's article is reasonable, I would argue it is technically trivial. The success of a project relies on the choice of the DBMS and, perhaps as important, the appropriate use of the features of that DBMS. That said, sophisticated and technical analysis of the features and limitations of a DBMS, as well as knowledge of techniques for compensating for the limitations while taking maximal advantage of the features, is often in short supply. Given that fact, even technically trivial advice has value.

[FP: Lack of foundation knowledge and of familiarity with the history of the field[14,15] prevent practitioners from understanding and appreciating the critical implications of the RDM for non-network applications and, thus, from taking into consideration relational fidelity when they compare DBMS products[16]. They employ what I refer to as a "cookbook approach"[17]. While the list of questions that Wu recommends asking when choosing a DBMS: 

  • How many relationships are in your data?
  • What is the level of complexity in your data?
  • How often do the data change?
  • How often does your application query the data?
  • How often does your application query the relationship underlying the data?
  • How often do your users update the data?
  • How often do your users update the logic in the data?
  • How critical is your Application in a disaster scenario?
should be asked, without knowledge of the fundamental differences between true relational, SQL, and NoSQL systems[18,19,20], it is not possible for the answers to lead to the optimal choice.


References

[1] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS: A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER.

[2] Pascal, F., Are You a Thinking Data Professional?

[3] Pascal, F., SQL Sins.

[4] Pascal, F., To Really Understand Integrity, Don't Start with SQL.

[5] Pascal, F., DISTINCT and ORDER BY Are Not Relational.

[6] Pascal, F., Language Redundancy and DBMS Performance: A SQL Story.

[7] Pascal, F., Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy, and Consistency.

[8] Pascal, F., Levels of Representation: Relationships, Rules, Relations and Constraints.

[9] Pascal, F., Understanding Domains and Attributes.

[10] Pascal, F., Relationships and the RDM Parts 1-3.

[11] Pascal, F., Data and Meaning Parts 1-3.

[12] Pascal, F., What Meaning Means: Business Rules, Predicates, Integrity Constraints, and Database Consistency.

[13] Pascal, F., DBMS vs Application Enforced Constraints.

[14] Pascal, F., Database Management: No Progress Without Data Fundamentals.

[15] Pascal, F., They Who Don't Know the Past ...

[16] Pascal, F., What Is a True Relational System (and What It Is Not).

[17] Pascal, F., The Cookbook Approach to Data Management.

[18] Pascal, F., Thinking Logically: SQL, NoSQL and the Relational Model.

[19] Pascal, F., NoSQL and SQL: A Plague on Both Their Houses.

[20] Pascal, F., The SQL and NoSQL Effects: Will They Ever Learn?




 

No comments:

Post a Comment

View My Stats