Friday, October 11, 2019

Data Sublanguage Part 3: DBMS Language

Note: 10/10/2019 significantly revised Part 2, which requires a re-read.

A formal data model consists of structure, integrity and manipulation[1,2] and so requires (1) a language that expresses data manipulation (retrievals and updates) augmented with (2) a metalanguage used to define the model's structural elements. As we showed in Part 1  and Part 2, when the data model is the RDM:

  • Data sublanguage is short for a relationally complete data manipulation sublanguage (DML) that expresses  retrievals and updates, the latter correctly understood as set-theoretic relation transformations.
  • A data definition sublanguage (DDL) is a metalanguage for DML that is outside the theory but consistent with the RDM and at least as powerful expressively as the DML (e.g., a very carefully restricted SOL to avoid self-referencing).
  • The DML and the DDL can, for practical purposes, be carefully unified into what Codd called a "comprehensive data sublanguage", but we prefer DBMS language to avoid confusion.
By carefully we mean that because only the DML is, strictly speaking, based on relational theory, the DDL must be consistent with, but differentiated from it, such that the two can't be mixed in the same expression in a way that the former subverts the latter. A DML expression can be referenced as a sub-expression by a DDL expression (e.g., as in view definitions), but only if it contains strictly retrievals (e.g., SELECT) and no updates (e.g., INSERT, UPDATE, DELETE).

Note: Data definition and manipulation are possible without a DBMS. However:

“With a relational catalog, definition can be performed via the RA, which requires physical implementation to be determined exclusively by the catalog (behind the scenes as it were) -- a kind of skeletal, primitive, or rudimentary DBMS. This is why Codd created a relational catalog that contains a description of the database and could be managed using RA-based DML. It works well unless one is allowed to mix DDL (metalanguage) with DML (language) in the same expression. Otherwise put, the database can be read to modify the catalog, but not vice-versa (as far as the DML is concerned, the catalog that describes the database does not exist).

But with a data model that, unlike the RDM, does not define a catalog such that the same language can be used for both database and it, a rudimentary DBMS must provide a workaround, and if the model is computationally complete (like CODASYL was), there must limits on how "active" the catalog is to prevent users from writing self-referencing expressions that cannot be automatically implemented because they may corrupt the database (same as would mixing data sublanguage and host language). This is one reason some of the pre-RDM directed graph DBMSs had limited notions of catalog that often required completely separate facilities to maintain.”
                                                 --David McGoveran

The DML and the DDL express two core data management functions centralized in the DBMS. When the data model is the RDM, only data management functions are permitted to access the data. At the end of Part 2 we alluded to other such functions that, like data definition, are outside relational theory, but must be consistent with the RDM.



Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you. 

  • 08/09/19: Following my series of posts on data sublanguage (Parts 1-4), I have revised for consistency the corresponding section of paper #2 in the Understanding the Real RDM series, Logical Access, Data Sublanguage, Kinds of Relations, and Database Redundancy and Consistency, which is available for ordering from the PAPERS page.


  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics may work.


I deleted my Facebook account. You can follow me:
  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

Data Management Functions

In 1971 Codd introduced ALPHA as a "language for manipulating relations" that included the ambiguous "declaring" and "dropping" of relations[3]. By 1985, however, one of his famous 12 rules mandated a Comprehensive Data Sublanguage that included not only explicit "data definition", but also "integrity constraints", "authorization", and "transaction boundaries"[4]. We attribute this "unified" DBMS language to practical  considerations of implementing the RDM.

This raises the issue of what data management functions other than manipulation and definition should be centralized in the DBMS and unified in the DBMS language. On the one hand, not every conceivable function is necessary nor, on the other hand, is any set of functions guaranteed to be sufficient. Besides integrity enforcement, data authorization, and transaction management we identify concurrency control, backup/recovery, and resource management as essential, distinct from others that, while useful, are not indispensable.

Note: Are there non-data management functions centralized in the DBMS? Starting and stopping the DBMS, for example -- that controls when transactions can be started, forcing a wait for them to end or else an abort before shutdown -- while essential for the DBMS to do all data management functions, is not one itself, as it does not require data access.

Details of these functions are beyond the scope of this discussion -- we only offer some comments in the specific context of a DBMS language.

Integrity is a data model component[1,2]. From the perspective of logic, the structural aspect of  RDM is just a way of expressing required relationships which are axioms implemented as semantic integrity constraints[5,6] (tables are just visualizations of the relationships on some physical medium -- they help conceptualize the RDM, but have no formal consequences[7]; had Codd focused on expressing relationships in FOPL, normalization and further normalization[8,9] would have never come up.

In the RDM case, relations are constrained to be consistent with the conceptual model of reality the database represents. We distinguish two kinds of constraints:

  • “Global -- anticipated by Codd, and of which practitioners are aware, but do not really understand -- take the form CREATE (RA/DML expression) AS CONSTRAINT <name> behave like axioms in logical reasoning and are a permanent part of the semantics of the database. They are expressed by the DDL that can reference DML sub-expressions;
  • Scoped constraints take the form DECLARE (RA/DML expression) AS CONSTRAINT <name> behave like hypotheses or assumptions in logical reasoning in effect for only some specific inference chain (i.e., a single DML statement or transaction). They are expressed by the DML.”
                                        --David McGoveran

Only the former were anticipated by Codd and are the ones practitioners are aware of (but have poor understanding thereof).

Data authorization (security) is not a data model component, but a collection of pragmatic constraints that have nothing to do with data consistency. They are formalizations of security rules that are expressible relationally in a form similar to integrity constraints and, if so implemented, the sublanguage expressing this function -- like DDL, a metalanguage -- is used to constrain the user ID of every DML access request to be among those authorized to perform the relational operations on the database objects referenced in the request. A relational authorization database is created containing user information and metadata about databases to be accessed (the controllable elements), and security constraint expressions are conjoined with users' DML access requests and jointly optimized (when a request is logically a contradiction no data is accessed).

While traditionally the transaction sublanguage has also been implemented as a metalanguage, because its directives allow changing transaction behavior independent of the expressions those transactions' directives control, it could in principle be implemented using a definition and a manipulation sublanguage.

Except for integrity, which is part of the DML and DDL, in general and ideally a DBMS should provide separate sublanguages to express each data management function. Like the DDL they are not part of the data sublanguage (DML), must be implemented in a manner consistent with the RDM and can, for practical purposes, be carefully unified in the DBMS language.

“If a RDBMS supports one DBMS language that incorporates sublanguages expressing these functions, each must stand on its own -- all must be mutually orthogonal in a functional sense: removing any non-DML sublanguage, or changing details of its syntax must not have any effect on the functionality of the others. No non-DML sublanguage may incorporate syntactic constructs that are inconsistent with the RA DML (e.g., an authorization language that authorizes usage over "objects" -- as in OOP -- or other non-relational data structures), or otherwise subvert the RDM. While they may, at most, have a consistent syntax, they should not be fully integrated in the sense of allowing their mixing with the DML in the same expression, for the same reason you would not mix host language and data sublanguage. Their syntax should be sufficiently distinctive that users can tell them apart and avoid entangling them with the DML. The rules for combining sublanguages must allow to determine, solely from the syntax, for any expression, whether that expression is non-DML or DML.”
                                                       --David McGoveran 

Note: The "separateness" of these sublanguages is why early DBMS vendors got away with providing the required functionality as admin utilities rather than integrating them into the DBMS itself. When data management, security, concurrency, transaction management, etc were all separate utilities, the behavior of a DBMS depended on the platform it was installed on and which utilities were used to provide the functionality. Only later were all these required capabilities integrated tightly into the DBMS, enabling them to be sold as a package and especially on multiple platforms. Except for security, most DBMSs still do a poor job of this (I used to call Interbase "Gthis, Gthat").

Stay tuned for conclusions in Part 3.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


[1] Pascal, F., What Is a Data Model, and What It Is Not.

[2] Pascal, F., What Is a Data Model.

[3] Codd, E.F., A data base sublanguage founded on the relational calculus.

[4] Pascal, F., Interpreting Codd: The 12 Rules.

[5] Pascal, F., Relationships and the RDM Parts 1-3.

[6] Pascal, F., Relationships, Rules, Relations and Constraints.

[7] Pascal, F., Tables: So What?

[8] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.

[9] Pascal, F., Normalization and Further Normalization Parts 1-3.

No comments:

Post a Comment