Friday, November 1, 2019

Comments on a Stonebraker Article




These comments were prompted by a LinkedIn post referencing Michael Stonebraker's Those Who Forget the Past Are Doomed to Repeat It  -- something I often reiterate myself -- where he argues:
“Over the past decade, there have been a number of DBMSs introduced (typically labeled as NoSQL) which utilize a network or hierarchical data model. MongoDB and Cassandra come immediately to mind as examples. Some such systems support networks through the concepts of "links" and some support hierarchical data using a nested data model often utilizing JSON. In my opinion, these systems have not internalized lessons from history.
“At the SIGFIDET (now SIGMOD) annual conference in 1974, there was a "Great Debate" over the merits of the relational model versus the network and hierarchical models ... Basically, the argument was about which model [relational or network] was a better fit for structured data (as opposed to documents, e-mails, etc.) and boiled down to two questions:

Question 1: Are high-level data sublanguages a good idea?
Question 2: Are tables the best data structure or should one use a network or hierarchy?”

“The last 45 years have definitely affirmed Codd’s position on both issues ... The conclusion from the 1970s was that the relational model provides superior data independence, compared to the network and hierarchical [graph] models. Forty-five years later, this conclusion is still true. If you want to insulate yourself from the changes that business conditions dictate, use a relational DBMS. If you want the successor to the successor to your job to thank you for your wise decision, use a relational model.”
I couldn't agree more, having repeatedly argued this myself. But he misses some old aspects that the industry has failed to recognize, has ignored, or dismissed[1]; and some important new aspects due to a new understanding of Codd's work[2].

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you.

NEW

  • 10/26/19: The POSTS page now links to all 2012-2018 posts (to be updated annually at year-end). Except for 2017, the (italicized) links are to abstracts of my columns @All Analytics site, which was discontinued (see below).
  • 10/26/19: Updated and cleaned up the WRITINGS page.
  • 08/09/19: Following my series of posts on data sublanguage (Parts 1-4), I have revised for consistency the corresponding section of paper #2 in the Understanding the Real RDM series, Logical Access, Data Sublanguage, Kinds of Relations, and Database Redundancy and Consistency, which is available for ordering from the PAPERS page.

LATEST PUBLICATIONS (order PAPERS and BOOKS) 


USING THIS SITE 

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics site, the links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA 

I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

Note: We note, in passing:
  • "Structured data" is redundant -- there is no such thing as "unstructured data" (documents and emails included)[3];
  • The relational structure is not the table, but the relation -- confusing them has inhibited understanding of the RDM and should be dispelled, not reinforced[4].
“On question 1, the discussion at the debate centered around the appropriateness of high-level declarative languages (think SQL) versus low-level record-at-a-time notations (think the data management language for the IBM hierarchical DBMS, IMS). Obviously, a programmer writes less code in a high-level language and is insulated from many changes to the storage representation of data(i.e., high-level languages are good). Hence, the counter argument at the time was that "high-level languages could not possibly be made as efficient as lower level notations".”

“The argument about the second question revolved around "data independence". Codd mentioned to me many years ago that one of his inspirations for the relational model was his IBM programmer friends. They complained that every time the schema changed, they had to throw away their code and start anew. As such, data independence is the ability of application programs to survive changes in data storage or representation.”
I recently published a series on the poorly (if at all) understood concept of data sublanguage[5] -- the linguistic expression of a data model. I have written extensively about the multiple advantages when the model expressed is the RDM[6] -- declarativity is just one, and "record-at-a-time" is only one type of procedurality. There is much more to a relational data sublanguage than "less code" and "efficiency" aspects.

There are two kinds of data independence -- physical (PI) and logical (LI) -- that are equally important. The former insulates applications from changes in implementation (storage and access methods)[7] made possible by the theoretical foundation (SST/FOPL) of the RDM. The latter insulates them from schema changes via the relational view mechanism[8] -- we assume that's what Stonebraker means by "changes in representation", but he does not discuss LI in the article.

Consider now Stonebraker's example:

“I will illustrate with a a very simple example [that] start[s] with an entity-relationship representation of some data. Consider an entity "Employee" with a key of name and other attributes such as salary and birthdate. Consider a second entity "Department" with a key of dname, and other attributes such as floor_number and budget. There is a relationship between the two entities, "Works_in", indicating the department in which an employee works. If the company requires employees to be in a single department, then an obvious representation in tables is:

Employee (name,salary,birthdate,dname)
Department (dname,floor_number,budget)


In a hierarchical or network data model, one might construct a schema as:

Department (dname,floor_number,budget)
     Employee (name,salary,birthdate)}


This notation indicates that employees are stored "underneath" the department they work in using a hierarchical representation.”
For reasons explained elsewhere, we are proposing an alternative to E/R conceptual modeling[9] corresponding to a revised RDM better grounded in SST/FOPL, to be detailed in future papers[10]. A conceptual model consists of a multigroup -- a collection of related groups of entities, each of which is a collection of entities sharing individual and (as a group) collective properties (i.e., entities of the same type). So here we have a model with entities of two types ("Employee" and "Department", to which Stonebraker refers as entities, are actually entity types) and, thus, a multigroup of two entity groups.

We now know that primary keys (PK) should represent at the logical level neither assigned entity names, nor descriptive properties in conceptual models[11].


A schema -- be it relational or hierarchic -- is logical and says nothing about physical implementation[7]. Old generation, pre-RDM graph (hierarchic and network) DBMSs (which lacked PI) had a 1:1 relationship between the logical and physical structure, which is how it was known that "employees were stored underneath the department they work in". This may not necessarily hold for current graph DBMSs[12] (see below).

Note: Of course, just relations do not a schema make -- the semantic constraints on the relations are also part of it.
“Since we have already ascertained that high-level languages are good, here is the SQL to find the names of employees in a department on the first floor:
Select E.name
From Employee E, Department D
Where D.floor = 1 and E.dname = D.dname

In querying a hierarchical database, one must program from "outside in" to mimic the hierarchical representation. In a made-up high-level hierarchical language, here is the same query:

For all Departments where floor = 1
Select Employee.name

The above schemas can be optimized for the query we used. In a hierarchical system, one could use an index on floor, while a relational system would use indexes on D.floor and E.dname.”
Note that even though SQL is not really relational and not just a sublanguage[5], it is declarative relative to the hierarchic quasi-language.

Only a physical implementation -- not a schema -- can be optimized for performance[13] (why?), and usually not to maximize that of specific queries, but to achieve an optimal balance among multiple important queries (see below).

The example shows, correctly (if simplistically), that a conceptual model can be represented at the logical level as either a relational, or graph schema. At issue is, thus, how do you choose, and Stonebraker's lesson from the debate is to choose the former "to insulate yourself from the changes that business conditions dictate", which he illustrates as follows:

“In the real-world, workloads change on a regular basis as business conditions change. Typically, such changes occur once per quarter or more. Also, database applications often last a quarter century or more. Hence, one must plan for 100 or more such changes over the lifetime of an application, and database/application maintenance is a really big deal.”

“One very simple change is to the structure of the dominant query. Suppose it becomes "find the names of employees between 30 and 40 years old". In a relational world, one would cluster (or index) employees on age to support this query.  Note that the previous query continues to run, albeit with perhaps degraded performance. In a hierarchical system, one might (or might not) have "dependent" indexes that would allow an age index on the dependent records. Even with this capability, the employee records are not clustered on age, so query performance may be inadequate. To get superior performance, employee records cannot remain clustered with departments, but must be made a top-level object. This requires changing the schema. A side effect of this decision is that our first query fails.”
He clearly means physical changes (i.e., the choice is driven by PI): some new query does not perform satisfactorily, a change in implementation is necessary, and since RDBMSs support PI and insulate the application issuing the query, while hierarchic DBMSs may not (see below), opt for the RDM.

But two important aspects -- one unrecognized and one disregarded in the industry -- qualify the lesson.


Network Applications


The ignored one is that Codd intended the RDM for databases representing conceptual models that focus on the relationships among groups of entities (hence the pertinence of set theory[14]). He recognized that databases which serve what he called "network applications" (i.e., represent models focused on relationships among individual entities) are not served optimally by the RDM[15].

In Stonebraker's example the focus is on the relationship between the two types of entities which the two groups have as members ("Works_in"). If so, there is nothing to recommend a graph representation and the loss of relational advantages, including declarativity, PI, and LI. But what if the focus were on the relationships (typical of social networks) among individual employees and among individual departments?

This brings up the aspect disregarded by the industry.


Graph Data Model?


As outlined by Codd, a formal data model has three components -- structure, integrity, and manipulation[16] -- and is used to formalize conceptual models of reality as logical models that can be recorded in databases and "understood" algorithmically by DBMSs[17]. Consistent with the outline, the RDM has a formal theoretical foundation that is responsible for all the relational advantages provided by RDBMSs, foremost among them system-guaranteed logical validity and by-design semantic consistency[18].

There is a formal theoretical foundation -- directed graph theory (DGT) -- for a graph data model (GDM), but one has yet to be formalized (in particular, integrity and manipulation components).

As already mentioned, the old, pre-RDM generation GDBMSs were not implementations of such a GDM, but ad-hoc, abstracted from industry practices. Used for both types of relationships at the conceptual level, they proved prohibitively complex and inflexible, prompting the RDM alternative for at least inter-group relationships.

The jury is still out on whether a new generation of GDBMSs grounded in such a GDM can provide full PI and a mechanism for LI, but they require computationally complete undecidable languages (CCL), relegating correcteness  to users in applications, and are likely to be quite complex.


Conclusion


With hindsight and a better interpretation of Codd's work, the proper lesson of the great debate is:

  • Conceptual models focusing on inter-group relationships should be formalized using a revised RDM properly grounded in SST/FOPL;
  • Conceptual models focusing on relationships among individual entities should be formalized using a GDM grounded in GDT, with the realization that care should be exercised due to loss of relational advantages.

Unfortunately, there are neither true RDBMSs, nor true GDBMSs. That -- and the fact that none are being developed -- is the failure to internalize the lesson from the great debate -- always the consequence of lack of foundation knowledge and familiarity with the history of the field.



Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL.

[2] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK.

[3] Pascal, F., Structuring the World With 'Unstructured Data'.

[4] Pascal, F., Understanding Relations series.

[5] Pascal, F., Data Sublanguage series.

[6] Pascal, F., What Is a True Relational System (and What It Is Not).

[7] Pascal, F., Physical Independence series.

[8] Pascal, F., On View Updating (with D. Mcgoveran and C.J. Date).

[9] Pascal, F., Properties-Object Modeling.

[10] Pascal, F., Interpreting Codd.

[11] Pascal, F., The Key to Relational Keys: A New Understanding.

[12] Pascal, F., Graph Databases They Who Forget the Past...

[13] Pascal, F., Levels of Representation Conceptual Modeling, Logical Design and Physical Implementation.

[14] Pascal, F., Sets vs. Graphs.

[15] Codd, E.F., Normalized Data Structure: A Brief Tutorial, IBM Research Report RJ935 (1971).

[16] Pascal, F., What is a Data Model?

[17] Pascal, F., Business Modeling for Database Design: Formalizing the Informal.

[18] Pascal, F., Logical Validity and Semantic Consistency.




No comments:

Post a Comment

View My Stats