Wednesday, March 27, 2019

Graph Databases: They Who Forget the Past...

Out of the plethora of misconceptions common in the industry[1], quite a few are squeezed into this paragraph:
“The relational databases that emerged in the ’80s are efficient at storing and analyzing tabular data but their underlying data model makes it difficult to connect data scattered across multiple tables. The graph databases we’ve seen emerge in the recent years are designed for this purpose. Their data model is particularly well-suited to store and to organize data where connections are as important as individual data points. Connections are stored and indexed as first-class citizens, making it an interesting model for investigations in which you need to connect the dots. In this post, we review three common fraud schemes and see how a graph approach can help investigators defeat them.”

  • Relational databases did not emerge in the ’80s (SQL DBMSs did);
  • There is no "tabular data" (the relational data structure is the relation, which can be visualized as a table on a physical medium[2], and SQL tables are not relations);
  • Analysis is not a DBMS, but an application function (while database queries, as deductions, are an important aspect of analysis, and computational functions can be added to the data sublanguage (as in SQL), the primary function of a DBMS is data management)[3];
  • A data model has nothing to do with storage (storage and access methods are part of physical implementation, which determines efficiency/performance[4]).

Here, however, we will focus on the current revival (rather than emergence) of graph DBMSs claimed superior -- without any evidence or qualifications -- to SQL DBMSs (not relational, which do not exist) that purportedly "make it difficult to connect data scattered across multiple tables". This is a typical example of how lack of foundation knowledge and of familiarity with the history of the field inhibit understanding and progress[5].


Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalytics column. In 2018 that website was discontinued. The content is not available anywhere else, and if you deem it useful, particularly if you are a regular reader, please ensure its continuation and free availability by supporting it with as much as you can afford via purchases of publications, or donations. Thank you. 

  • Updated the LINKS page: online logic courses. 



  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017, are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work.


I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • REVISED! @ThePostWest on Twitter: Evidence for #AntiZionism as component of the spike in #Antisemitism -- the only universally acceptable hatred -- as the (traditional) response to the existential crisis of decadence and decline of Western (including the US) civilization (for the classic component, see ThePostWest blog).
  • REVISED! The PostWest blog: Evidence for the classic component of the spike in #Antisemitism -- the only universally acceptable hatred -- as the (traditional) response to the existential crisis of decadence and decline of Western (including the US) civilization (for the #AntiZionism component, see @ThePostWest on Twitter).

Upside Down and Backwards

In the 1969 paper that introduced the RDM Codd wrote:
“The first part of this paper is concerned with an explanation of a relational view of data. This view (or model) of data appears to be superior in several respects to the graph or network model presently in vogue. It provides a means of describing data with its natural structure only: that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level retrieval language which will yield maximal independence between programs on the one hand, and machine representation and organization of data on the other.”[6]
Although he published the three essential components of a data model as it should be (but isn't) understood today -- logical structure, integrity, and manipulation -- only 12 years later[7], the RDM was the first compliant data model: an adaptation of simple set theory (SST) expressible in first order predicate logic (FOPL) to database management, its three components being relations, constraints, and the relational algebra (RA)[8].

Although Codd referred to a "graph or network model", at the time there was no definition of its three components. While the name derives from directed graph theory (DGT)[9], the then prevalent CODASYL (or network) DBMSs (hierarchic DBMSs being a special case thereof) were not based on a graph data model (GDM) adapted from DGT, but had been partially abstracted from industry practice in ad-hoc fashion. Due to the complexity and rigidity of those old generation graph DBMSs, even SQL technology, with its poor relational fidelity, managed to render them obsolete -- the opposite of what is now often claimed for the new graph DBMSs.

Note: Be that as it may, SQL DBMSs are much inferior to true RDBMSs[10]. As we have reiterated, their deficiencies are due not to being relational, but to not being relational enough [11,12,13,14,15].

Network Applications

The RDM focuses on representation of groups of entities, treating the directed graph of relationships among individual entities as secondary. Each database relation is a set of tuples that represents either facts about (properties of) a group of entities of the same type, or a relationship among two or more groups (i.e., an association relation). As we have seen, some of the inter-group relationships are due to relationships among their member entities that give rise to fourth order properties (4OP) of the multigroup formed by the related groups[16].
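To make the group-oriented view concrete, here is a minimal sketch (with hypothetical data) of relations as sets of tuples: one relation per group of entities, and a separate association relation representing the relationship between two groups, which is itself queried like any other relation.

```python
# Hypothetical illustration: each relation is a set of tuples about one
# group of entities; the inter-group relationship is an association relation.
employees = {            # facts about the group of employees
    (1, "Alice"),
    (2, "Bob"),
}
projects = {             # facts about the group of projects
    (10, "Apollo"),
}
assignments = {          # association relation: employee-project relationship
    (1, 10),
    (2, 10),
}

# The relationship is a set of tuples too, so it is queried declaratively,
# by matching values, not by navigating links between individual records:
staff_on_apollo = {name for (eid, name) in employees
                   for (e, p) in assignments
                   if e == eid and p == 10}
# staff_on_apollo == {"Alice", "Bob"}
```

Note that nothing in the data refers to "connections" between individual entities; the relationship among groups is represented by values, which is what makes the representation independent of any particular access path.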

Unlike SST, DGT is not expressible in FOPL at the logical level, and requires second order logic (SOL). This limitation in expressiveness is the price to pay for the critical advantages of the RDM:

  • Declarative and decidable data sublanguages[17];
  • Logical (LI)[18] and physical independence (PI)[19];
  • Semantic consistency and DBMS-guaranteed logical validity[20];
  • Flexibility, simplicity[21], and, thus, ease of use.

Codd was well aware of this tradeoff, and recognized that what he called "network applications" have "special needs" not served optimally by the RDM:

“It may be argued that in some applications the problems have an immediate natural formulation in terms of networks. This is true of some applications, such as studies of transportation networks, power-line networks, computer design, and the like. We shall call these network applications and consider their special needs later. The numerous data bases which reflect the daily operations and transactions of commercial and industrial enterprises are, for the most part, concerned with non-network applications. To impose a network structure on such data bases and force all users to view the data in network terms is to burden the majority of these users with unnecessary complexity.”[22]
The RDM was, thus, intended for the majority of non-network applications, for which the then current DBMSs (only informally related to DGT) added complexity, but no functionality. This is particularly important given the tendency in industry practice, induced by tradition and programming, to view as network applications many that are not inherently so (hence the introduction of normalization to convert, in such cases, databases from graph to relational designs).
“...three important structural concepts supported by current data base systems [hierarchic, network, and cross-reference] ... provide no additional logical capability beyond that of the relational model and to question their inclusion in the data models of future data base systems ... Network applications can be supported by adding a specialized front-end to a general purpose, relational data base management system.”[22]
Thus, Codd relegated graph functionality (e.g., navigating/traversing connected elements, typing connections, searching/traversing by connection type, transitive closure, and so on) to application code and, as far as we know, never explored appropriate use of graphs within a data model. As the RDM is currently understood, an RDBMS with an SST/FOPL-based relational data sublanguage is intended to be used strictly for data management, while a front-end with a computationally complete (e.g., programming) language (CCL) based on higher logic[17] supplies the graph functionality that the RDBMS does not. Network applications (e.g., bill of materials) that use a SQL DBMS have taken this approach.
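The division of labor just described can be sketched as follows: the SQL DBMS manages a part-containment relation, while the graph functionality (here, transitive closure for a bill of materials) lives in front-end application code. The table and part names are hypothetical; Python with the standard sqlite3 module stands in for the application language.

```python
import sqlite3

# A bill-of-materials "network application" on a SQL DBMS (hypothetical data).
# The DBMS stores the containment relation; the traversal is application code.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contains (parent TEXT, child TEXT)")
con.executemany("INSERT INTO contains VALUES (?, ?)", [
    ("bike", "wheel"), ("bike", "frame"),
    ("wheel", "spoke"), ("wheel", "rim"),
])

def all_parts(root):
    """Transitive closure of 'contains', computed by repeated queries."""
    seen, frontier = set(), {root}
    while frontier:
        placeholders = ",".join("?" * len(frontier))
        rows = con.execute(
            "SELECT child FROM contains WHERE parent IN (%s)" % placeholders,
            tuple(frontier)).fetchall()
        frontier = {child for (child,) in rows} - seen
        seen |= frontier
    return seen

parts = all_parts("bike")   # {'wheel', 'frame', 'spoke', 'rim'}
```

Later SQL standards added recursive queries (WITH RECURSIVE) as a language extension, but the loop above illustrates the front-end approach the text describes: each traversal step is a separate query, and the closure logic itself is outside the DBMS.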

The result is cumbersome, error prone, code intensive, and almost none of the advantages of the RDM carry over to the "network" aspects of applications, which usually have to be rewritten when the conceptual -- and, thus, the logical -- model changes. This is due to the conflict between relational advantages, which restrict the data sublanguage to SST/FOPL, and graph functionality, which requires DGT logic higher than FOPL.

New Graph DBMSs

The old CODASYL and hierarchic graph DBMSs intermingled the logical with the physical, so even the rare practitioner sufficiently knowledgeable to not confuse them couldn't, even with great discipline, keep them separate. For example, their links behaved like physical pointers, not like logical relationships.

Having learned from the RDM and SQL, the new graph DBMSs tend to be less physical, but the problem among practitioners is much broader and more persistent: an inability to think abstractly. Graph proponents still don't realize the importance of data independence (physical and logical) and why a DBMS should enable it, and routinely confuse levels of representation even when a graph DBMS allows them to be separate. They almost always assume that abstraction should occur in application code, and not in some "utility" like a DBMS (which they mostly equate to a kind of super file management system).

The focus is on comparing specific issues like support for transitive closure (TC) and logical locality of reference (how the data is logically, not necessarily physically, organized). The relational principle supplies locality of reference by organizing tuples according to their relation type (or name). In a graph DBMS, locality of reference is provided by logical connectedness (a connectionist view) and, usually, this is propagated to the physical layer. Nonetheless, there is nothing inherent in the graph approach that prevents creating a layer (e.g., by suitable indexing) on a graph database that enables a relational view with (1) recovery of tuples from the usual key value pairs and (2) grouping of tuples by relation type.
--David McGoveran
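McGoveran's point that a relational view can be layered over a graph store can be sketched as follows. Assume (hypothetically) that the store holds nodes as (id, type, properties) records of key-value pairs; a simple index by node type then recovers one "relation" per type, as a set of tuples.

```python
from collections import defaultdict

# Hypothetical graph-store contents: nodes as (id, type, key-value pairs).
nodes = [
    (1, "Person",  {"name": "Alice"}),
    (2, "Person",  {"name": "Bob"}),
    (3, "Account", {"iban": "DE99"}),
]

def relational_view(nodes):
    """Index nodes by type: each type yields a relation (a set of tuples),
    with tuples recovered from the node's key-value pairs."""
    relations = defaultdict(set)
    for node_id, node_type, props in nodes:
        relations[node_type].add((node_id,) + tuple(sorted(props.items())))
    return dict(relations)

view = relational_view(nodes)
# view["Person"] is a set of tuples -- tuples grouped by relation type
```

This is only an illustration of the layering idea, not a claim about any particular product; a real implementation would maintain such indexes incrementally rather than rebuild the view per query.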

While current graph DBMSs provide some useful functionality for network applications, they are still mostly ad hoc: there is still no formal definition of a GDM as a DGT adaptation to database management that is compliant with the Codd definition of a data model, similar to that of the RDM as an adaptation of SST/FOPL, and no attempts to formulate one (if you believe otherwise, specify -- precisely, please! -- the structure, integrity, and manipulation components of the GDM).
A GDM would have to be based, at best, on some fragment of SOL that permits a decidable data sublanguage. While the structural and manipulative portions seem clear, integrity is a bit obscure (as it must be in anything based on SOL or higher), and I've not seen anyone bother to explain compliance with Codd's definition of a data model. Most graph "theorists" simply appeal to the lambda calculus (an HOL), then (apparently without careful investigation) design "declarative" query languages with the desired functionality. While vendors look for the "sweet spot" of network applications where they will shine both functionally and efficiently, products are mostly promoted as general purpose, superior alternatives to "relational" (i.e., SQL) DBMSs.
--David McGoveran
Without a formal specification of a GDM compliant with Codd's definition, and thus comparable to the RDM's SST/FOPL logic, neither can the soundness of graph DBMSs based on such a model be determined, nor can products of the two types be meaningfully compared[23].

Given the conflicting formal requirements, a more profound question is whether a data model compliant with Codd's definition that provides relational advantages (short of, unavoidably, some lost simplicity) and supports graph functionality is possible. This is way beyond the scope of this discussion, and we refer the reader to the relevant discussion in David McGoveran's still ongoing work on the formalization, re-interpretation, and extension of the RDM[24].

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.



[2] Pascal, F., Understanding Relations Part 1: Tables So What?

[3] Pascal, F., Understanding the Division of Labor between Analytics Applications and DBMS.

[4] Pascal, F., What Is a Data Model, and What It Is Not.

[5] Pascal, F., Database Management: No Progress Without Data Fundamentals.

[6] Codd, E. F., Derivability, Redundancy And Consistency Of Relations Stored In Large Data Banks.

[7] Codd, E. F., Data Models in Database Management, Workshop on Data Abstraction, Databases and Conceptual Modelling (1980: 112-114).

[8] Codd, E. F., A Relational Model of Data for Large Shared Data Banks.

[9] Graph Theory.

[10] Pascal, F., What Is a True Relational System (and What It Is Not).

[11] Pascal, F., SQL Sins.

[12] Pascal, F., To Really Understand Integrity, Don't Start with SQL.

[13] Pascal, F., DISTINCT and ORDER BY Are Not Relational.

[14] Pascal, F., Language Redundancy and DBMS Performance A SQL Story.

[15] Pascal, F., Precision, Procedurality and SQL.

[16] Pascal, F., Fourth Order Properties Parts 1,2.

[17] Pascal, F., Natural, Programming, and Data Language.

[18] Pascal, F., C. J. Date and D. McGoveran On View Updating.

[19] Pascal, F., Physical Independence Parts 1-3.

[20] Pascal, F., Logical Validity and Semantic Correctness.

[21] Pascal, F., Simplicity: Forgotten, Misunderstood, Underrated Relational Objective.

[22] Codd, E. F., Normalized Data Base Structure: A Brief Tutorial.

[23] Pascal, F., Structure, Integrity, Manipulation: How to Compare Data Models.

[24] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.
