Friday, September 27, 2019

Data Sublanguage Part 2: Data Manipulation and Definition

Revised 10/10/2019.

In Part 1 we showed that Codd intended in 1969 to base the RDM on axiomatic set theory (AST) and second order logic (SOL) to accommodate relation-valued domains (RVD) (i.e., sets of sets), but that for the benefit of relational advantages and to avoid SOL problems he had to trade off the expressive power of AST/SOL for the simple set theory (SST) of proper sets (i.e., relations in normal form) expressible in first order predicate logic (FOPL) and, thus, computational for relational completeness[1]. He retained the power of the former for applications by hosting a relationally complete FOPL-based language expressing the RA in computationally complete programming languages (CCL).

We also alerted to an important, but unnoticed detail: data sublanguage appeared in the 1970 paper -- in 1969 Codd referred to retrieval sublanguage. This can be understood only with reference to the theoretical foundation of the RDM.

Retrieval Sublanguage

Retrieval sublanguage notwithstanding, the 1969 paper states, seemingly contradictorily:

“We now assert that a data bank is a collection of time-varying relations ... of assorted degrees. As time progresses, each n-ary relation may be subject to insertion of additional n-tuples, deletion of existing ones, and alteration of components of any of its existing n-tuples.”[2]
The problem is, however, that:
“Mathematical relational theory is a particular expression of standard finite SST and FOPL (these two formal systems are compatible in the sense that SST, its axioms and its theorems can be expressed using the language of FOPL). In set theory all sets are persistent subsets of a pre-existing universe and the semantics (i.e., the valid interpretations of the theory) do not include any concept of an object that can change. There are only set transformations: sets are neither created from scratch, and a derived set does not replace the existing set from which it derives and to which it bears a relationship -- there is a mapping between them (a specification is derived from an existing specification). Thus, a relation is a specific subset of a fixed pre-existing universe and it does not have "time-varying membership", nor can any symbol in the object language be given a time-varying instantiation.”
--David McGoveran

In other words, mathematical relational theory only describes relationships between relations: every RA operation "selects" a relation bearing a relationship to existing relations. 'Creating' or 'updating' in the database sense conventionally understood in the industry as "initial" (input, old) and "final" (output, new) relations, namely an assignement of the form:

R = R-(exising tuples to be deleted) + (new tuples to be inserted) + (existing, but modified tuples)

is outside SST/FOPL (and, thus, the RDM). An "insert", for example, is not a modification of an existing relation: it does not add tuples to it, but specifies a different relation that has a relationship to it -- a union with the database of a set of zero or more tuples selected from the complement of the database relative to its fixed Universe of Discourse (UoD), that derived relation of zero or more tuples satisfying some logical expression of the relationship[3]. 

So from a strictly theoretical viewpoint a FOPL-based language expressing the RA cannot define or update relations -- it is a purely retrieval sublanguage and the applicability of a static RDM to database management would be curtailed.

1969: Updates by Host and Data Declaration

Consider now Codd's functional outline of the retrieval sublanguage:

“R permits the declaration of domains, together with relations of various degrees on those domains [and] the specification for retrieval of any subset of data from the data bank. A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations ... [d]eletions the form of removing elements from declared relations... and alteration of components of any of its existing n-tuples ... a data bank is [thus] a collection of time-varying relations.”[2]
Note, first, that in addition to "specification for retrieval of any subset of data from the data bank", it also "permits the declaration of domains, together with relations of various degrees on those domains". Declaration might not be accidental -- a declaration is not a definition, but specifies how a symbol will be used by relating it to something else. For example, CREATE RELATION x is definition (non-RA), and so is CREATE DERIVED RELATION y AS (RA expression), while RELATION x JOIN RELATION y is retrieval (RA) that includes declaration (of join structure constrained by integrity).
“It is unclear whether Codd considered data definition and data declaration distinct from each other or synonyms, and definition part of the sublanguage [but see below]. Inasmuch as he stated the sublanguage was at least a subset of first order predicate logic, it is simply not possible to express definitions in such a language (all objects -- relations and domains -- must pre-exist as parts of the static universe to which the language is being applied), I must think that his use of declaration carries meaning, but its subtlety became less important as he moved more and more towards practical concerns about implementing the RDM.”
--David McGoveran
Second, retrieved relations are only "held for possible changes", not actually changed. This indicates that initially Codd, sticking to theoretical considerations, thought that only retrieval and relational assignment were needed, and relegated updating to the host language (i.e., modifying the retrieved relation and writing it back). But he soon realized that this would be impractical (had he not, client/server would have never had a chance -- it would have been just like file servers). 

1970: Data Sublanguage

The reference to retrievals "held for possible changes" did not reappear in the 1970 paper, indicating the transfer of updating from the host to the sublanguage -- hence data instead of retrieval sublanguage[4].

Codd's own relational sublanguage ALPHA proposed in 1971 included "declaring" and "dropping" relations[5], and in 1985 one of his famous 12 rules --

 the Comprehensive Data Sublanguage rule --  dropped declaration altogether in favor of "data definition" and included other functions

As McGoveran suggests, with hindsight and today's understanding and terminology these changes can be understood in the context of increasing practical concerns about implementing the RDM.

Strictly speaking, only the retrieval sublanguage is based on relational theory: it expresses the RA in the language of FOPL. Updates do not and and cannot mean what users (and most theorists") in the industry typically understand them to mean -- Codd's "time-varying relations" is an (anti-set) gloss. The correct set-theoretic understanding is as set transformations and, so understood, the RA can provide appropriate operations that can model update, insert, delete. The details are technical and beyond the scope of this discussion -- [3] is a must read (chapter 20 in particular) to understand why any relationship between the "RDM" as understood in the industry and the real thing is purely coincidental.  

While the "old version" and the "new updated are in fact two distinct relations from the UoD having a specific relationship -- a transformation stated by a FOPL/RA expression -- by naming the derived relation same as the relation from which it derives, the complex mechanics required to mimick time-varying relations can be ignored, which would be impossible without this "renaming convention"[4]. 

Data sublanguage should be understood as short for what is known as relational data manipulation sublanguage (DML) expressing the combination of retrievals (querying) and updates as relation transformations.

What is known as data definition sublanguage (DDL) is outside the theory, but must be compatible with it, by which we mean consistent with the RDM, and care must be exercised not to include syntactic elements in the DDL that subvert or are at odds with the relational data sublanguage (DML). The power of each is needed, but they must be kept distinct and their interaction carefully controlled by the interface. The DML must not have access to the expressive power of the DDL, but must have access to its results recorded in the database catalog (i.e., defined and named abstract objects like relations, attributes, domains, etc.) by name and should be able to declare use of those names or dropping of those names (and so use) from the UoD[4].

As we shall see in Part 3, there are functions other than data definition that are outside the theory, but must be consistent with it, some mentioned by Codd.  We suspect (but cannot prove) that when he referred to a "comprehensive data sublanguage", it was a way of indicating their practical, but equally very careful combination with the relational data sublanguage (DML) into what, to avoid confusion, we prefer to call DBMS language


[1] Codd, E.F., Relational Completeness of Data Base Sublanguages.

[2] Codd, E.F., Derivability, Redundancy And Consistency Of Relations Stored In Large Data Banks.

[3] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.

[4] Codd, E.F., A Relational Model of Data for Large Shared Data Banks.

[5] Codd, E.F., A data base sublanguage founded on the relational calculus.

[6] Codd, E.F., Is Your DBMS Really Relational? ComputerWorld (14 October 1985) and Does Your DBMS Run By the Rules ComputerWorld (21 October 1985); online copy.

No comments:

Post a Comment

View My Stats