Friday, October 11, 2019

Data Sublanguage Part 3: DBMS Language


Note: Part 2 was significantly revised on 10/10/2019 and merits a re-read.

A formal data model consists of structure, integrity, and manipulation[1,2], and so requires (1) a language that expresses data manipulation (retrievals and updates), augmented with (2) a metalanguage used to define the model's structural elements. As we showed in Part 1 and Part 2, when the data model is the RDM:

  • Data sublanguage is short for a relationally complete data manipulation sublanguage (DML) that expresses retrievals and updates, the latter correctly understood as set-theoretic relation transformations.
  • A data definition sublanguage (DDL) is a metalanguage for the DML that is outside the theory but consistent with the RDM, and at least as expressively powerful as the DML (e.g., a very carefully restricted SOL that avoids self-referencing).
  • The DML and the DDL can, for practical purposes, be carefully unified into what Codd called a "comprehensive data sublanguage", but we prefer DBMS language to avoid confusion.
By carefully we mean that, because only the DML is, strictly speaking, based on relational theory, the DDL must be consistent with, but differentiated from, it, such that the two cannot be mixed in the same expression in a way that one subverts the other. A DML expression can be referenced as a sub-expression by a DDL expression (e.g., in view definitions), but only if it consists strictly of retrievals (e.g., SELECT) and contains no updates (e.g., INSERT, UPDATE, DELETE).
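To make this concrete, here is a minimal SQL sketch, with the caveat that SQL only approximates a relational DDL/DML and that the table and column names are illustrative assumptions:

    -- A DDL expression (CREATE VIEW) referencing a DML expression as a
    -- sub-expression: the embedded expression is strictly a retrieval (SELECT).
    CREATE VIEW well_paid_employees AS
    SELECT emp_no, emp_name, salary
    FROM employees
    WHERE salary > 100000;

    -- An update may not appear in the definition; something like
    --   CREATE VIEW purged AS DELETE FROM employees WHERE salary <= 100000;
    -- is rejected as a syntax error, keeping definition and updates
    -- unmixed in one expression.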

Note: Data definition and manipulation are possible without a DBMS. However:

“With a relational catalog, definition can be performed via the RA, which requires physical implementation to be determined exclusively by the catalog (behind the scenes as it were) -- a kind of skeletal, primitive, or rudimentary DBMS. This is why Codd created a relational catalog that contains a description of the database and could be managed using RA-based DML. It works well unless one is allowed to mix DDL (metalanguage) with DML (language) in the same expression. Otherwise put, the database can be read to modify the catalog, but not vice-versa (as far as the DML is concerned, the catalog that describes the database does not exist).

But with a data model that, unlike the RDM, does not define a catalog such that the same language can be used for both database and it, a rudimentary DBMS must provide a workaround, and if the model is computationally complete (like CODASYL was), there must be limits on how "active" the catalog is to prevent users from writing self-referencing expressions that cannot be automatically implemented because they may corrupt the database (same as would mixing data sublanguage and host language). This is one reason some of the pre-RDM directed graph DBMSs had limited notions of catalog that often required completely separate facilities to maintain.”
                                                 --David McGoveran
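SQL's catalog facilities roughly illustrate the one-way arrangement McGoveran describes, though SQL departs from it by also exposing the catalog to reads; a hedged sketch (information_schema is SQL's approximation of a relational catalog, and the table name is an assumption):

    -- The catalog describes the database and can be queried with ordinary DML:
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'employees';

    -- But the catalog cannot be updated through the DML: information_schema is
    -- read-only, so an INSERT/UPDATE/DELETE against it is rejected. Definition
    -- changes go through the DDL (e.g., ALTER TABLE), which maintains the
    -- catalog behind the scenes.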

The DML and the DDL express two core data management functions centralized in the DBMS. When the data model is the RDM, only data management functions are permitted to access the data. At the end of Part 2 we alluded to other such functions that, like data definition, are outside relational theory, but must be consistent with the RDM.

Friday, October 4, 2019

Test Your Foundation Knowledge


The Web is chock-full of unnoticed/unquestioned pronouncements by novices or "experts", many self-taught, that are (1) wrong, or (2) gobbledygook. Any attempt to demonstrate the lack of foundation knowledge underlying these misconceptions and their practical implications is usually dismissed as "theory, not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of the supporting evidence and argument logic. This is understandable: in the absence of foundation knowledge and the ability to reason, it is by definition impossible to comprehend and appreciate corrections that require them.

Practitioners who cannot detect such misconceptions and understand their practical implications and the importance thereof are insufficiently prepared for a professional career in data management. Worse, they cannot associate problems with their real causes and, thus, cannot come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads.

What about you? This is another batch in the Test Your Foundation Knowledge regular series of posts of online statements reflecting common misconceptions that are difficult to discern without foundation knowledge. You can test yours by trying to debunk them in Comments, including which category, (1) or (2), they fall into. If you can't, proper education is in order.


Friday, September 27, 2019

Data Sublanguage Part 2: Data Manipulation and Definition


Revised 10/10/2019.

In Part 1 we showed that in 1969 Codd intended to base the RDM on axiomatic set theory (AST) and second order logic (SOL) to accommodate relation-valued domains (RVDs) (i.e., sets of sets), but that to gain the relational advantages and avoid the problems of SOL he had to trade the expressive power of AST/SOL for the simple set theory (SST) of proper sets (i.e., relations in normal form) expressible in first order predicate logic (FOPL) -- and, thus, computational for relational completeness[1]. He retained the power of the former for applications by hosting a relationally complete FOPL-based language expressing the RA in computationally complete programming languages (CCLs).
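As a hedged illustration of what the trade-off means in practice (names are assumptions): a relation-valued domain would let an attribute's value itself be a relation (e.g., the set of an employee's projects), while proper sets in normal form flatten the nesting into one tuple per member of the set.

    -- Normal form: every attribute value is atomic, so an employee's set of
    -- projects becomes one (employee, project) tuple per member, not a
    -- set-valued attribute of an employees relation.
    CREATE TABLE employee_project (
      emp_no     INTEGER NOT NULL,
      project_no INTEGER NOT NULL,
      PRIMARY KEY (emp_no, project_no)
    );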

We also called attention to an important but unnoticed detail: data sublanguage appeared in the 1970 paper -- in 1969 Codd referred to a retrieval sublanguage. This can be understood only with reference to the theoretical foundation of the RDM.


Wednesday, September 25, 2019

Test Your Foundation Knowledge

The Web is chock-full of unnoticed/unquestioned pronouncements by self-taught novices or "experts" that are (1) wrong, or (2) gobbledygook. Any attempt to demonstrate the lack of foundation knowledge underlying these misconceptions and their practical implications is usually dismissed as "theory that is not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of the supporting evidence and argument logic. This is understandable: in the absence of foundation knowledge and the ability to reason, it is by definition impossible to comprehend and appreciate corrections that require them.

I have always contended that practitioners who cannot detect such misconceptions and understand their practical implications and the importance thereof are insufficiently prepared for a professional career in data management. Worse, they can neither associate problems with their real causes nor, thus, come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads.

What about you? This is another batch in the Test Your Foundation Knowledge regular series of posts of online statements reflecting common misconceptions that are difficult to discern without foundation knowledge. You can test yours by trying to debunk them in Comments -- which category, (1) or (2), do they fall into?

Sunday, September 22, 2019

Data Sublanguage Part 1: Relational vs. Computational Completeness


Note: I have revised the "Logical Access, Data Sublanguage, Kinds of Relations, Database Redundancy, and Consistency" paper in the "Understanding the Real RDM" series (available from the PAPERS page) for consistency with this post.

“Recently I have read that SQL is actually a data sublanguage and not a programming language like C++ or Java or C# ... The answers ... have the pattern of "No, it is not. Because it's not Turing complete.", etc, etc. ... I am a bit confused, because since you can develop things through SQL, I thought it is similar to other programming languages ... I am curious about knowing why exactly is SQL not a programming language? Which features does it lack? (I know it can't do loops, but what else more?)”
--StackOverflow.com
“The SQL operators were meant to implement the relational algebra as proposed by Dr. Ted Codd. Unfortunately Dr. Codd based some of his ideas on a "extended set theory", which was an idea formulated and described in a 1977 paper by D. L. Childs ... But Childs’ extensions were not ideally suited, which is explained in quite some detail in [a] book ... by Professor Gary Sherman & Robin Bloor [who] argue that mainstream Zermelo-Fraenkel set theory (Cantor), would have been a better starting point. One key issue is that sets should be able to be sets of sets.”
--Dataversity.net

The concept of a sublanguage cannot be understood without foundation knowledge and familiarity with the history of the database management field, both lacking in the industry.

Tuesday, September 17, 2019

Test Your Foundation Knowledge

The Web is chock-full of unnoticed/unquestioned pronouncements by novices or "experts", many self-taught, that are (1) wrong, or (2) gobbledygook. Any attempt to demonstrate the lack of foundation knowledge underlying these misconceptions and their practical implications is usually dismissed as "theory, not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of the supporting evidence and argument logic. This is understandable: in the absence of foundation knowledge and the ability to reason, it is by definition impossible to comprehend and appreciate corrections that require them.

Practitioners who cannot detect such misconceptions and understand their practical implications and the importance thereof are insufficiently prepared for a professional career in data management. Worse, they cannot associate problems with their real causes and, thus, cannot come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads.

What about you? This is another batch in the Test Your Foundation Knowledge regular series of posts of online statements reflecting common misconceptions that are difficult to discern without foundation knowledge. You can test yours by trying to debunk them in Comments, including which category, (1) or (2), they fall into. If you can't, proper education is in order.

“Why is it so hard to standardize a Graph Query Language? It is because graph databases are strongly dependent on the data model and the physical layer implementation. And most important currently there is a lack of a uniform representation for these two factors that vary a lot.”
“...good points, yes in principle the query language should be independent of both the data model and the database storage engine. But with Graph Databases that is not happening. See SPARQL vs (Cypher, GraphQL, Gremlin, GSQL) competitive query languages. Then when you  examine the other side, i.e. the property graph databases, more carefully you discover that each vendor has built many tweaks and features that are based on their physical layer implementation.”
“...why I am skeptical, it is because I think the real bet in database arena will be to bridge Row and Column databases, i.e. SQL databases with graph databases (triple stores, property graph stores). Can we have a better approach that covers them all independent of the data model and the physical layer implementation ? Again we have seen such efforts with SPARQL-Relational mappings but...” 
“One of the key aspects that makes graph so powerful is that you have the ability to referentially annotate, either at the nodal level or at the assertion level (by creating a structure such as ?assertion :hasStructure {:subject ?s; :predicate ?p; :object ?o} (property graphs subsume this in the predicate, while RDF graphs don't). That annotation can contain advisory schematic information, constraint modeling and so forth. This is usually missing from Codd-oriented data stores, one reason why its a relatively easy trip from relational to semantic, but a considerably more complex one in the other direction. The other aspect (and something that you can argue both sides about) is that normalization is a key requirement for any many-to-many relationship in Codd algebra, but it is not necessarily a requirement in a graph. My biggest problem with OWL is not in its existence but in the implicit requirement of internal consistency and the overall complexity of the language.  In a purely mathematical environment, this makes sense, but in a data-world sense, inconsistency is pretty much a given.”
“When I talked about an ontology being semantically neutral, my argument was that you need some kind of operational ontology to present hooks on which to lay the topical or thematic ontology, something analogous to REST publishing modes. That ontology is comparatively primitive, but it is what provides the substrate to deduce the relevant relationships (or, put another way, to build a discovery mechanism upon).”
“You are right to ask about specifications about structural components of the "graph data model". In my opinion this is a key differentiation factor. But which graph data model we are talking about? In associative, semiotic, hypergraph data model (a.k.a R3DM/S3DM), there are Entities and Attributes that cover the metadata (dictionary) TBox database component, and then you have Associative Entity Sets (ASETs) and HyperAtom collections (HACOLs) that cover the ABox component. There are well defined transparent operations, especially SET operations on ASETs and HACOLs and there is a clean, distinct separation between ABox and TBox components ... There is a huge difference between theory and practice. I am afraid many theoreticians of the past, may rest in peace, and their followers in Relational/Graph domain have failed to understand the difference between software engineering, i.e. make something that works, and pure computer science, i.e. imagine something that works.”
“A good graph query language should, in general, be independent of the data model. What is needed, though, is a mechanism for enabling the discovery of specific types of predicate relationships. SPARQL is a good start, in terms of data model independence if you have a known core ontology (OWL, SHACL, SKOS, what have you) but if you don't have any means of discovering what the foundational language is, then it breaks down. SPARQL also doesn't handle anonymous paths well. GraphQL tries to turn a JSON database into a graph database, but it also faces the limitations of predicate discovery. Most other graph query languages work upon the assumption that you have property oriented graph implementation, but these tend not to scale well.”
“A lot of my work of late focuses on building knowledge bases. Typically you can define fairly complex classes (or more properly classes with a number of properties within the knowledge base itself that becomes the conceptual model for the creator of the knowledge base, but beneath that there is a second operational model (typically OWL, RDFS or SHACL-like) that is used primarily by the query engine. That operational model is simpler, more akin to a REST interface than anything, but it makes it possible to serve and update the knowledge base model. This is what I'm referring to when I talk about being independent of the data model - you're working with the operational model (which is primarily a  publishing model) in order to facilitate a more complex model.”
      --Why is it so hard to standardize a Graph Query Language?
                                         


References

Graph Databases: They Who Forget the Past...

Sets vs. Graphs

What Is a Data Model, and What It Is Not

What Is a Data Model

Data Model Neither Business, Nor Logical, Nor Physical Model




Sunday, August 25, 2019

Meaning Criteria and Entity Supertype-Subtypes Relationships


Note: This is a re-write of a previous post.
"I have a database for a school ... [with] numerous tables obviously, but consider these:
CONTACT - all contacts (students, faculty): has fields such as LAST, FIRST, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT."
"Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical? What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows?  Or do you have just one of each and have a change log. Which is best?"
How would somebody who "does not know past, or new requirements, modeling, and database design" and messes with a working database just because "he heard something about (insert your favorite fad here)" distinguish correct answers from bad ones? Particularly if the answers suffer from the same lack of foundation knowledge as the question?