Monday, July 23, 2018

Industry Practice Is No Substitute for Foundation Knowledge




“A short time ago a colleague asked me where he could find a "Databases 101" guide for the non-technical professional. As it turns out, the internet is littered with information, and mis-information, regarding data and databases. This makes it difficult for someone entering the field of data and databases to understand the bigger picture. He wanted something to help make sense of data and databases, specifically relational versus NoSQL. So, that’s what I decided to work on. This is the start of my Databases 101 guide for the non-database professional; something to help anyone understand why the word “database” is an overloaded term. We use the term "database" to describe a great many things ... I use the terms database and database engine interchangeably in this article.”
                            --Databases 101, ThomasLaRock.com

To “understand the bigger picture”, and “make sense of data and databases” requires foundation knowledge. In its absence, efforts to educate defeat their own purpose, because they are grounded in the very misinformation they should be correcting. This is the mechanism by which misconceptions[1] are perpetuated and reinforced.



------------------------------------------------------------------------------------------------------------------
SUPPORT THIS SITE 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 

NEW PUBLICATIONS 

NEW: The Key to Relational Keys: A New Perspective

NEW: SOCIAL MEDIA 

I deleted my Facebook account. You can follow me on Twitter:
@dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back. 

@ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.

HOUSEKEEPING
  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
------------------------------------------------------------------------------------------------------------------

For example, if “"database" is an overloaded term”, using “database and database engine [DBMS] interchangeably”, particularly when explaining the difference between relational and non-relational databases, obscures the important, yet poorly understood fact that for relational advantages to materialize, both the DBMS and databases must be relational, which means something different for either[2].
“To me, databases only have two classifications: relational databases, and everything else. I don’t care what anyone else tells you. Those are the two buckets you start with. Either your database engine is relational, or not. As of July 1st, 2018, there are 138 relational databases (out of 343 total) listed at DB-Engines. Relational is by far the largest category. The top four ranked databases at DB-Engines are relational engines, and six of the top ten overall. Yes, Microsoft Access is there, currently ninth. I’ll let that sink in for a moment.”
There is, indeed, a very good reason to consider this distinction fundamental (though, certainly, not the only one), but the author's explanation misses it.
Not only there are not 138 relational DBMSs (not databases!) -- actually there is none! The author has obviously accepted the widespread misconception in the industry that SQL DBMSs are relational (for which reason I will substitute "SQL" in square brackets for each use of the term "relational"). I suspect he would say “Well, SQL DBMSs are considered by all relational for all practical purposes, and it is sheer "theoretical purity" to insist otherwise.” But, of course, it is precisely theory that fundamentally distinguishes true RDBMSs from non-relational ones, and gives them the practical advantages over them, and SQL DBMSs flount it and fail to provide them. If you don't appreciate this, you simply don't understand what relational DBMSs are, and why they are superior.

“Historically, what defined a database as relational (in a traditional sense) was the ability for database transactions to adhere to the ACID properties. ACID stands for Atomicity, Consistency, Isolation, and Durability. You can read more about these concepts and other database transaction concepts in this paper written by Jim Gray in 1981. This is not to say that only relational databases support ACID, but you will find that the majority of differences between relational and non-relational databases are ACID versus BASE (which we will discuss next). Relational databases leverage ACID for database transactions in order to maintain data quality and integrity. And, to some computer folk, this is a weakness in the design of a relational engine. Favoring data quality and integrity comes at a cost. The costs are performance, latency, storage, and throughput.”
What defines a relational DBMS is fidelity to the RDM, as introduced by Codd in 1969-70[3,4] (which was nothing like "traditional"). Unfortunately, the conventional interpretation of his work for the last five decades is flawed, and we have been correcting it here and elsewhere[5].

Transaction management (TM) is a general database function that all genuine DBMSs -- relational or not -- should support. The RDM facilitates TM support and, historically, SQL DBMSs were among the first to implement it, but it does not make SQL DBMSs relational, and is not the core difference between relational and non-relational DBMSs. This mistaken notion has emerged with the appearance of several types of non-relational systems jointly labeled NoSQL -- some of which, for reasons beyond the scope of this discussion[6], do not support TM and, therefore, ACID. In the absence of foundation knowledge, the industry has misinterpreted this as the primary difference between relational and non-relational systems. In fact, it is just one, albeit important, difference between SQL DBMSs and NoSQL systems.

“Computer folk [who deem] data quality and integrity a [relational] weakness” suffer from logical-physical confusion (LPC)[7]. The same is equally true with respect to scaling:

“The truth is that [SQL] databases scale up just fine when placed in the hands of people that (1) know what they are doing, (2) aren’t afraid to learn new things, and (3) don’t mind doing the work necessary. I said scale up, which is another way of saying “throw hardware at the problem”. Scaling out means you add more nodes to handle the workload. Scaling up versus scaling out is a traditional difference between relational and non-relational database engines. Often when you want more performance from a relational engine you will scale up. But for non-relational systems it is easier to scale out. Scaling out is often cheaper and easier than scaling up.”
True RDBMSs would support full physical independence (PI)[8]. Even SQL databases, with their weak PI support "scale up just fine". However, "people [who] know what they are doing" are necessary but insufficient: also required are true RDBMS with implementations that perform satisfactorily. To the extent that SQL DBMSs do not, it is due to a combination of (1) poor implementations and (2) practitioners lacking foundation knowledge.
“In contrast to [SQL] database engines years ago a handful of systems arose that marketed themselves as No SQL, as an alternative to [SQL] database engines. Over time that stance softened, and instead of the anti-SQL stance they became friendlier by labeling themselves as "Not Only SQL", or NoSQL. This is the group I classify as non-relational. DB-Engines has 205 systems listed as non-relational, but with many sub-categories.”
While SQL DBMSs have some limited relational properties, both SQL DBMSs and NoSQL systems are non-relational[9] because they are not compliant with the RDM (properly understood[6]), and fail to provide the full relational benefits.
“The most popular NoSQL system types are key-value stores (66), document stores (46), graph databases (31), time-series databases (25), and search engines (17). There are also 36 NoSQL databases listed as “multi-model”, meaning they are more than one non-[SQL] engine. An example of a multi-model engine is Cosmos DB, as it supports document, graph, key-value, table, and column-family data models. As stated earlier, NoSQL solutions favor performance over data quality through the use of BASE. BASE stands for Basic Availability, Soft-state, and Eventual consistency.”
This is a good example of how obscuring the distinction between DBMS and database can inhibit understanding. Unlike document stores and graph databases, there is nothing inherently non-relational about key-value, time series, and search databases. If relationally designed, such databases could be managed by true RDBMSs, though not optimally by SQL DBMSs, due to their own limitations, including poor relational fidelity. This would obviate "application-specific DBMSs" that defeat the whole idea of multi-application database. Aside from graph DBMSs (hierarchic and network, which have been superseded by even SQL DBMSs due to their complexity and inflexibility), the only formal data model with a sound theoretical foundation is the RDM. There is, therefore, no "multi-model" DBMSs to speak of -- there are no document, or time-series formal data models -- they are just non-relational systems that violate RDM's core Information Principle[10] and defeat its practical objectives[11,12], whether they include SQL or not.

NoSQL (and non-relational systems in general), have more serious data integrity deficiencies than just lack of ACID support: none relies on the relational algebra to manipulate relations, and most do not enforce constraints. Consequently, they do not guarantee logical validity and semantic correctness of query results[13] -- the fundamental relational distinction -- and one reason I do not consider them DBMSs.

The author claims he “offer[s] a way to help you understand the differences at a high level, so you can make an informed decision about what database engine would best suit your specific requirements.” I don't think he does, but you are free to judge for yourself.


References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[2] Pascal, F., What Is a True Relational System (and What  It Is Not).

[3] Codd, E. F., Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks, IBM Research Report, San Jose, California RJ599 (1969).

[4] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970).

[5] Pascal, F., Interpreting Codd.

[6] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[7] Pascal, F., Conceptual-logical Conflation and Logical-Physical Confusion.

[8] Pascal, F., Physical Independence Parts 1,2,3.

[9] Pascal, F., NoSQL and SQL: A Plague on Both Their Houses.

[10] Pascal, F., Interpreting Codd: The 12 Rules.

[11] Pascal, F., The Interpretation and Representation of Database Relations.

[12] Pascal, F., Logical Symmetric Access, Data Sublanguage, Kinds of Relations, Redundancy and Consistency.

[13] Pascal, Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness.


Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.



No comments:

Post a Comment

View My Stats