Monday, December 5, 2016

Domain vs. (Data) Type, Class vs. Relation (UPDATE)

Revised 12/8/16

What's wrong with last week's picture, namely:
"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" is a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" only in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous."
With respect to 1, well, yes, they are distinct, but not for the stated reason. With respect to 2, well, no insofar as "programming parlance" goes. The terminology introduced by Codd was explicitly intended to distinguish formal concepts from set theory and first order predicate logic from the terminology used in programming practice. 

1. Domain vs. (Data) Type

"The theory behind data types in most programming languages is based on abstract data types, but programmers hardly ever use the term in this way and languages are rarely strong in this regard. The need for a formal theory (of abstract data) and the semantics of types was not addressed by either Codd or the current RDM interpretation. Codd's treatment of types was greatly simplified and its understanding in the current interpretation of the RDM is at best simplistic. An adequate treatment of the subject is beyond the scope of this discussion and will be addressed in Part III of LOGIC FOR SERIOUS DATABASE FOLKS." --David McGoveran
For our purposes here suffice it to say that type is used in two senses:
(a) Extensionally i.e., type denotes a specific set of typed object(s), which define the type;
(b) Intensionally i.e., type defines what is and is not permissible for a typed object.
Both relational domains and programming data types are types in the (a) sense: sets of values within a specified range to which certain operations are applicable. In his book THE RELATIONAL MODEL VERSION 2, Codd lists several ways in which the former (which he called "extended types") differ from the latter: domains
  • are types with database designer-constrained value ranges;
  • represent real world entity properties;
  • are under DBMS control.
while programming types are under programmer and application control and do not necessarily represent real world properties.
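
The two senses of type can be illustrated with a minimal Python sketch. The "weekday number" type, its value range, and the names WEEKDAY_EXT and is_weekday are hypothetical, chosen only for illustration; nothing here comes from Codd or McGoveran.

```python
# Hypothetical example: a "weekday number" type, defined in both senses.

# (a) Extensional: the type is the enumerated set of its values.
WEEKDAY_EXT = frozenset({1, 2, 3, 4, 5, 6, 7})


# (b) Intensional: the type is a rule stating what is permissible.
def is_weekday(value: object) -> bool:
    """Return True for an integer (excluding bool) from 1 to 7 inclusive."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and 1 <= value <= 7
    )


# Applying the rule to a finite, well-defined universe of candidates
# yields the same set that the extensional definition enumerates.
universe = range(-10, 11)
induced = frozenset(v for v in universe if is_weekday(v))

print(induced == WEEKDAY_EXT)  # True
```

In this reading, a designer-constrained value range corresponds to the rule in (b), while the set in (a) is what that rule selects from a given universe.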

2. Relation vs. Class

"Whatever type and class are in "modern programming parlance", the meanings of class in set theory (vs. any other usages) should not be confused with how it is popularly used in programming or--for that matter--in the database literature (class vs. type is another good example of such confusion).

The distinctions between class and set vary with the specific version of set theory. To avoid problems, we will use the most broadly applicable definitions that will still apply to usages relevant to relational database theory and will try to:
1. be precise about how we use the terms;
2. identify the subject areas to which the definitions do not apply." --David McGoveran
In the real world
"...every property defines a class--namely, the set of [entities] possessing that property--whereas every class is a class simply by virtue of the fact that its members have common defining properties."--MEANING AND ARGUMENT: ELEMENTS OF LOGIC
In other words, entities are members of a class by virtue of common properties and when we say they are of the same type, we use type in the (b) sense.
"The definition of a class is intensional--it is a statement of the properties that distinguish members of the class from non-members. When applied to a particular universe of entities, a class definition selects out those that are members of the class. If the universe is well defined--a collection of entities in which each can, in principle though perhaps not in practical terms, be examined--the result is a set. Mathematicians say that a class over a universe "induces" a set. If one defines a class, one must then "compute" the set that is induced when that class definition is applied to a particular universe." --LOGIC FOR SERIOUS DATABASE FOLKS
At the class level by properties we mean:
  • Individual properties shared by entities that are class members;
  • Properties arising from relationships between individual properties;
  • Properties arising from relationships among all class members collectively.
There are also multi-class properties arising from relationships among two or more classes.

Note that while this seems to contradict "whether such a collection could actually exist", it does not because of the caveat regarding "well defined universe". If the collection could not actually exist, the universe is not well defined as required.

Conceptual modeling consists of specifying these relationships in natural language as informal business rules. Those rules correspond to a formal predicate that expresses the class i.e., they comprise the intensional definition of each class of interest. When applied to a universe of entities, the class induces a set of class members, facts about which are to be recorded in the database.

A relation is, thus, a set of tuples that represent in the database facts about the set of entities induced by the class. Every relation is associated with a relation predicate (RP)--the conjunction of integrity constraints that represent the business rules in the database. The RP represents formally in the database the intensional class definition (that was informally expressed by the business rules). When applied to a universe of entities, that RP induces the relation and serves as its membership function. The relation's tuples--its extension--satisfy that RP. This is another way of saying the tuples in a relation represent facts about a set of entities of the same type i.e., an RP is a relation type and a tuple type specification statement.
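
A rough Python sketch of an RP as a conjunction of constraints acting as a membership function follows. The Employee attributes, the three business rules, and the sample universe are all hypothetical. The sketch also assumes every constraint can be checked one tuple at a time; constraints that relate several tuples or several relations (keys, for instance) are deliberately left out, so this illustrates only part of what an RP covers.

```python
from typing import Callable, NamedTuple


class Employee(NamedTuple):
    """Hypothetical tuple type: one recorded fact about one employee."""
    emp_id: int
    salary: int
    dept: str


# Hypothetical business rules, each formalized as a tuple-level constraint.
def positive_id(t: Employee) -> bool:
    return t.emp_id > 0


def salary_in_range(t: Employee) -> bool:
    return 10_000 <= t.salary <= 500_000


def known_dept(t: Employee) -> bool:
    return t.dept in {"SALES", "ENG"}


CONSTRAINTS: tuple[Callable[[Employee], bool], ...] = (
    positive_id,
    salary_in_range,
    known_dept,
)


def relation_predicate(t: Employee) -> bool:
    """The RP in this sketch: the conjunction of all constraints."""
    return all(constraint(t) for constraint in CONSTRAINTS)


# A small, well-defined universe of candidate tuples.
universe = [
    Employee(1, 50_000, "ENG"),
    Employee(2, 5_000, "SALES"),   # violates salary_in_range
    Employee(-3, 60_000, "ENG"),   # violates positive_id
    Employee(4, 70_000, "HR"),     # violates known_dept
    Employee(5, 80_000, "SALES"),
]

# The relation induced by the RP: exactly the tuples that satisfy it.
relation = frozenset(t for t in universe if relation_predicate(t))

print(sorted(t.emp_id for t in relation))  # [1, 5]
```

Changing any constraint changes the predicate and therefore which set is induced, which is one way to read the claim that the RP is the relation's type specification.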

Note very carefully that:

"Translating business rules into a formal first order predicate (let alone expressing it as integrity constraints in any DBMS-specific data language) is a big step that casts the die. There is no way to know you've done it incorrectly, except that you decide you are unhappy with the results--that the formalism doesn't produce something you think it should produce, or produces something you think it should not (usually detected by translating the constraints backwards and comparing to reality). We can minimize the likelihood of a bad modeling effort by following a careful methodology, but we must not confuse the conceptual with its formal representation, the former being the choice of subject matter and latter being the result of a choice of formalism." --LOGIC FOR SERIOUS DATABASE FOLKS
I shudder at comparing database practice to this recommendation.

Note also that, following Codd, we refer to relations rather than relvars.

"...set semantics do not have the concept of a computer variable to which values can be destructively assigned (or "updated") ... [such] variables can be expressed in certain systems of logic, but they cannot be expressed in elementary set theory, or first order predicate logic. Other, more expressively powerful systems are required. Unfortunately, such powerful formal systems do violence to the relational data model and its intended benefits." --LOGIC FOR SERIOUS DATABASE FOLKS
which is perhaps why Codd avoided relvars by using the term "time-varying relations" instead. His choice seems to skirt the need for such powerful formal systems, while relvars--which introduce the semantics of computationally complete programming languages and the higher logic that they entail--embrace it.


  1. Codd avoided relvars by using the term "time-varying relations" instead.

    Fabian: bunkum! McG is entitled to object to relvars because programming language variables have no place in set theory. (A set does not have persistent identity.) Then equally (as he says) there can be no place for a database variable.

    Then there can be no "relation" with a persistent identity that could be "time-varying". The best we could say is: look at these two database (value)s; at two different times; they both have a relation with such-and-such a predicate (RP); then we might mentally construct a persistent entity which is time-varying. But that's (convenient) mythology, not licensed by set theory.

    Equally, we could describe that situation by naming a relation; and adopting a convenient mythology that the name identifies a programming language variable.

    The problem with identifying a relation by RP is with schema evolution: we might have two database values with two slightly different RPs. We cannot say that is one time-varying relation. (The whole idea of RPs differing "slightly" is more mental fairy-tale.)

    So I do not see why McG is so critical of relvars. (He as good as admits he's being over-precious.) We can regard a database value as a set of (name, relation) pairs, where the name fills the role of a programming variable.

  2. At this time David McGoveran offers only this reaction and defers any further discussion until his formal exposition of the RDM is published in the book he's currently working on.

    I appreciate the fact that Clayden approves of my objection to relvars on the basis of set theory. On the other hand, he seems to completely ignore the problem relvars introduce into the language vis-à-vis computational completeness when he says he doesn't understand why I am so critical. BTW, I'm not sure what word was intended where he uses "precious", but it made me smile.

    I also don't understand his comment regarding schema evolution, especially inasmuch as his example seems only to reinforce my position that relation predicates (RP) do accurately identify relations (which is not only consistent with set theory, BTW, but with EFC as early as 1969--see "set specification"). That said, I've always said that relational theory has not addressed so-called "schema evolution" from any theoretical basis. I've also said it needs to be done properly.

    As to RPs differing only slightly being "mental fairy tale", I suggest that you can only make such judgments if you both understand how to write RPs in formal detail and define what it means for them to differ or be related, slightly or otherwise. I've defined such differences in terms of the deductive apparatus of FOPL and you can't get much more mechanical than that.

    1. "over-precious" is what I intended. I mean: yes the RM needs no more than set theory and FOPL for a database value to capture a world situation as at a point in time. That does not mean we have to preciously restrict ourselves to set theory when we come to the pragmatics of a DBMS as a Management System whose role within an enterprise is to express persistence of the enterprise's assets.

      I do not understand why McG says relvars (or programming language variables in general) get in the way of computational completeness: a variable stands for a value; just replace the variable with its (current) value in any computation.

      Or isomorphically: map a named relation's schema to a schema with an extra attribute, whose value is the relvar name.

    2. I figured that's what you intended.

      >"I do not understand..." This seems to be the problem, ain't it?

      First, restriction to FOPL is only for the DATA component and USER MODEL, such that no damage is done to the benefits from the RDM. As for computational completeness (CC), that's what hosting of the data sublanguage is for, but this must be done carefully in order to avoid the damage.

      I have no idea what "relvars get in the way of CC" means--they get in the way of the RDM, not CC.

    3. David McGoveran's reply:

      1. I do not concern myself with pragmatics. If you don't have the theory right, pragmatics are premature and likely to lead you to false conclusions or worse, actions that contradict the foundations on which your system is built.

      2. I have not said that relvars get in the way of computational completeness. Computational completeness disables data independence because computationally complete systems are never decidable, and relvars have no meaning outside computationally complete systems.

      3. There is no over importance that can be asserted regarding set theory. RM is a kind of set theory, namely that portion which has a representation in first order predicate logic. If you step away from this, you are no longer talking about RM. And I am.

      4. You should expound upon your last assertion or drop it. The correspondence you suggest does not explain how it pertains to relvars - at least as you've expressed it.

    4. I'd like to unpack David McGoveran's point 2, particularly "and relvars have no meaning outside computationally complete systems."

      This seems to be a different objection to relvars from the one I've seen before, namely that they offend against the Information Principle.

      A programming model can be computationally complete without using variables. (For example the SKI Combinator Calculus.)

      A programming language can use variables without being computationally complete. (For example COBOL vintage 1980's.)

      To remain first-order, variables must range only over individuals, not over predicates or functions. Relvars range over sets of tuples. What's not first-order about them?

      If we take Codd's RA (that is, without transitive closure): that is not computationally complete; it has variables standing for relations. Are those variables not in effect relvars?

  3. For years the fuss about 'time-varying relations' has seemed incomprehensible to me. Surely Codd meant nothing more than references to different extensions at different times, without dictating any particular implementation language.

    Also, the comment that "a set does not have persistent identity" is nonsense. The situations that relational sets represent might come and go but sets of sentences about the situations are as everlasting as anything can be. Language devices that replace the sets under consideration don't change that. And the set ops MINUS and UNION used for updating make it perfectly clear which tuples/rows 'persist'.

  4. (I hope my comments won't disagree with anything David McGoveran has written, his explanations being usually much deeper/more fundamental than I could manage.)

    It should be obvious that language devices such as relvars/tables often represent more than one relation, for example a relation that is a join will often have proper subsets that are also joins. Those subsets would have distinct predicate extensions and therefore different intensions from those of the value of a relvar or table. The most obvious exception would be in systems where every relvar reference specifies a key value, so that every referenced relation is a singleton and so has no subsets. (I'd guess anybody who met Codd would know what a stickler he was for keys.)

    Beyond that, tuples of some relations such as joins can be projections of others.

    When a relvar has only one predicate (supposedly) but can represent multiple relations, each with a different predicate, it looks to be very tricky, if it's possible at all, to use relvars or tables to explain relational theory, which is to say explain database behaviour which is to say meaning of schemas. In other words using an implementation to explain everything else rather than using the theory to explain an implementation. (This hasn't stopped SQL systems from forcing users to use base tables for all updates, even when the reference manuals call such updates 'view updates'! The SQL updating situation is exactly the same as it was/is in file systems - every file update must be individually specified by users before the system can be 'correct'. So SQL systems should more accurately be called advanced file systems. It also didn't stop one of the System R developers from claiming in 1981 that data independence had been achieved!) The result is not only not relational but ignores the overall database meaning and consistency.

  5. Regarding there being no 'license' for mythology/fairy tales to vary relations, this could be a typical coder's view or even a physical view. Set operations allow differences between relations to be expressed, therefore they allow the expression of output relations that vary from the input relations. Most language implementations discard the inputs and differences after the expressions have been evaluated. McGoveran's view updating chapter gives a concise update definition/vary definition expressed as equations using set operators on inputs and differences/'transforms' and doesn't dictate anything about inputs or differences being discarded. (Any discarding would be an implementation choice, not a definition choice.)
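
The point made in the comments above, that an update can be expressed with set difference and union without destroying anything, can be sketched in Python. This is only a generic illustration with hypothetical supplier tuples; it is not the set of equations from McGoveran's view updating chapter, which are not reproduced here.

```python
# Relation values as immutable sets of tuples (hypothetical supplier data).
before = frozenset({("S1", "London"), ("S2", "Paris")})

# The "differences": tuples to remove and tuples to add.
deletes = frozenset({("S2", "Paris")})
inserts = frozenset({("S2", "Athens"), ("S3", "Rome")})

# The update expressed purely with set operators: MINUS, then UNION.
after = (before - deletes) | inserts

# Nothing was assigned destructively: the input value still exists,
# and intersection shows exactly which tuples "persist".
persisted = before & after

print(sorted(after))      # [('S1', 'London'), ('S2', 'Athens'), ('S3', 'Rome')]
print(sorted(persisted))  # [('S1', 'London')]
print(len(before))        # 2
```

Whether `before` and the differences are kept or discarded after evaluation is left open here; as the last comment notes, that would be an implementation choice rather than part of the definition.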