Sunday, September 22, 2019

Data Sublanguage Part 1: Relational vs. Computational Completeness




Note: I have revised the "Logical Access, Data Sublanguage, Kinds of Relations, Database Redundancy, and Consistency" paper in the "Understanding the Real RDM" series" (available from the PAPERS page) for consistency with this post.

“Recently I have read that SQL is actually a data sublanguage and not a programming language like C++ or Java or C# ... The answers ... have the pattern of "No, it is not. Because it's not Turing complete.", etc, etc. ... I am a bit confused, because since you can develop things through SQL, I thought it is similar to other programming languages ... I am curious about knowing why exactly is SQL not a programming language? Which features does it lack? (I know it can't do loops, but what else more?)”
--StackOverflow.com
“The SQL operators were meant to implement the relational algebra as proposed by Dr. Ted Codd. Unfortunately Dr. Codd based some of his ideas on a "extended set theory", which was an idea formulated and described in a 1977 paper by D. L. Childs ... But Childs’ extensions were not ideally suited, which is explained in quite some detail in [a] book ... by Professor Gary Sherman & Robin Bloor [who] argue that mainstream Zermelo-Fraenkel set theory (Cantor), would have been a better starting point. One key issue is that sets should be able to be sets of sets.”
--Dataversity.net

The concept of a sublanguge cannot be understood without foundation knowledge and familiarity with the history of the database management field, both lacking in the industry.


------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE

Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you.

NEW

  • 09/22/2019: Updated paper #2 in the Understanding the Real RDM series, Logical Access, Data Sublanguage, Kinds of Relations, and Database Redundancy and Consistency, which is available for ordering from the PAPERS page.

LATEST PUBLICATIONS (order PAPERS and BOOKS) 


USING THIS SITE 

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics may work.

SOCIAL MEDIA

I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

From Relation-Valued Domains to Normal Form


The RDM is essentially the relational algebra (RA), Codd's adaptation of mathematical set theory for, and applied to database management: a set of operations (manipulation) of semantically constrained (integrity) relations (structure) -- the three components of a data model[1]. A sublanguage that expresses the RA was first mentioned in Codd's 1969 IBM internal paper introducing the RDM:

“The adoption of a relational view of data ... permits the development of a universal retrieval sublanguage based on the second-order predicate calculus ... (rather than first-order) ... needed because the domains on which relations are defined may themselves have relations as elements.”[2]
Relations with attributes defined on relation-valued domains (RVD) (i.e., relation-valued attributes (RVA)) are sets of sets. Set theory comes in different flavors, and axiomatic set theory (AST) (Zermel-Fraenkel being one version thereof) requires sets of sets. A language must be based on second order logic (SOL) to accommodate them.

Within a year, however, in his (1970) publicly published paper, Codd abandoned AST and switched to the simple set theory (SST) of proper sets (i.e., sets that do not have sets as members), for which a sublanguage based on first order predicate logic (FOPL) is sufficient.

“The adoption of a relational model of data, as described above, permits the development of a universal data sublanguage based on an applied predicate calculus [which] suffices if the collection of relations is in normal form. Such a language ... would itself be a strong candidate for embedding (with appropriate syntactic modification) in a variety of host languages (programming, command- or problem- oriented).”[3]
Relations in normal form are devoid of RVAs[4] and, thus, are proper sets (hence, normalization to eliminate them by redesign[5]).

Note: SST allows sets of sets as long as the result of set operations is defined as a set of non-set members (e.g., the set C = A UNION B where A and B have no set members is the set having all the members of both A and B, not the set consisting of set A AND set B)[6].



From SOL to FOPL


SOL is expressively more powerful than FOPL, but it would have caused major complications, a critical one being self-referencing: the paradoxes associated with it render a language undecidable -- a DBMS would enter infinite loops evaluating the truth of self-referencing expressions. It would have, in fact, robbed the RDM of its advantages -- logical [7] and physical independence[8], by-design semantic consistency and DBMS-guaranteed logical validity[9], language declarativity, simplicity[10], and so on -- defeating its purpose. To avoid this, Codd traded off the expressive power of SOL for these database advantages, but retained it for non-database (application) purposes by hosting the FOPL-based language expressing strictly the RA as a sublanguage in programming languages based on the higher logic.

“Let us denote the retrieval sublanguage by R and the host language by H.
R permits the declaration of domains, together with relations of various degrees on those domains [and] the specification for retrieval of any subset of data from the data bank. A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations ... [d]eletions the form of removing elements from declared relations... and alteration of components of any of its existing n-tuples ... a data bank is [thus] a collection of time-varying relations;
H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage. Any arithmetic functions needed can be defined in H and invoked in R.”[2]
When the data model is the RDM, a sublanguage is the relationally complete database component of a computationally complete programming language (CCL). Codd was explicit about this essential difference between the sublanguage and programming languages:
“The universality of the data sublanguage lies in its descriptive ability (not its computing ability) ... Thus, the class of qualification expressions which can be used in a set specification must have the descriptive power of the class of well-formed formulas of an applied predicate calculus”.[2]
Note: The hosting must be implemented with extreme care. Because a declarative language cannot be mixed with a procedural one, the interface between a CCL and a relational sublanguage can only be in terms of shared data types. As the result of a RA expression is always a relation, the CCL must implement a relation data type. In the other direction, it must return a CCL expression of a relation, tuple (specifc to some relation), attribute (specific to some relation), and domain member. The left-hand side of a CCL expression can then be a relation, a tuple, an attribute, or a domain value, and that CCL expression can then appear anywhere in an RA expression that the corresponding type could appear. The CCL expression will always be evaluated in the host language and the RA expression in the sublanguage.

This brings us to an important, but largely unnoticed detail: only in his 1970 paper did Codd refer to a data sublanguage -- in 1969 he used the term retrieval sublanguage. On this in Part 2



Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


References

[1] Codd, E.F., Data models in database management.

[2] Codd, E.F., Derivability, Redundancy And Consistency Of Relations Stored In Large Data Banks.

[3] Codd, E.F., A Relational Model of Data for Large Shared Data Banks.

[4] Pascal, F., First Normal Form in Theory and Practice Parts 1-3.

[5] Pascal, F., Normalization and Further Normalization Parts 1-2.

[6] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.

[7] Pascal, F., Physical Independence Parts 1-2.

[8] C. J. Date and David McGoveran On View Updating.

[9] Pascal, F., Logical Validity and Semantic Correctness.


[10] Pascal, F., Simplicity: Forgotten, Misunderstood, Underrated Relational Objective.



No comments:

Post a Comment

View My Stats