Sunday, October 1, 2017

Class, Type, Relation and Domain in Database Management

Revised  12/01/17.

This is a 10/01/17 re-rewrite of a 08/12/12 post revised on 12/05/16 to bring it in line with David McGoveran's formal exposition and interpretation[1] of Codd's RDM (as distinct from its common "understanding" in the industry).

Here's what's wrong with last week's picture, namely:

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
  • type = domain: I fully understand why one might equate type and domain, but ... in today's programming practice, type and domain are quite different. The word type is largely tied to system-level (or physical-level) definitions of data, while a domain is thought of as an abstract set of acceptable values.
  • class != relvar: In simple terms, the word class applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a relation is a class (and trivially also a set), which contributes to confusion.
In modern programming parlance class is generally distinguished from type only in that type refers to primitive (system-defined) data definitions while class refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, type and class are actually synonymous."
There is, indeed, a huge mess. And, as always, it is rooted in poor foundation knowledge [2], of which the comment itself suffers. 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

Class, type, and domain as used in the RDM are well defined formal concepts in set theory, mathematical relation theory and predicate logic -- the theoretical foundations of the RDM -- and were introduced by Codd in part explicitly to distinguish them from "programming parlance". Furthermore, Date's introduction of explicit relation variable (relvar) semantics in the data sub-language notwithstanding, in none of the three theoretical foundations are there variables with values that can be destructively assigned ("updated"). 

Domains and Programming (Data) Types

 "The theory behind data types in most programming languages is based on abstract data types, but programmers hardly ever use the term in this way and languages are rarely strong in this regard. The need for a formal theory (of abstract data) and the semantics of data types was not addressed by either Codd, or the current industry interpretation of the RDM. Codd's treatment of types was greatly simplified and its understanding in the current interpretation is at best simplistic." --David McGoveran
Codd introduced domains -- which he called "extended data types" -- to distinguish them from programming data types (PDT). A PDT is a named set of values application programmer defined and under application control that does not necessarily represent something in the real world. A domain is a database designer defined abstract data type under DBMS control that represents a property of a real world object. Furthermore, relational domains are simple (i.e., they do not have components meaningful to applications) and their values are treated as atomic by the data sub-language (e.g., if they are relation-valued domains (RVD), the data sub-language does not allow applications access to their attributes and tuples) [3].

Note: A relation with only simple domains is in its First normal form (1NF). It is accepted  in the industry that
a relation is at least in (1NF). However, a careful examination of Codd's early work leads to the conclusion that this is insufficient and  Fifth Normal Form (5NF) is required for both semantic correctness and system-guaranteed logical validity and to avoid anomalous results [4]. Had Codd defined the join operation as the one we use today, further normalization and 1NF-5NF would not have come up. Unfortunately, the initial join definition was different and tied to it was a single normal form that was distinct than the current 1NF. Because it was to the old join what 5NF is to current join, we believe that Codd would have eventually realized that relations should be in 5NF by definition.


Classes, Sets and Relations

"[E]very property defines a class -- namely, the group of [objects] possessing that property -- whereas every class is a class simply by virtue of the fact that its members have common defining properties."[5] --Robert Olson
Though the terms class and set are often used interchangeably, there is a subtle distinction between them which escapes many data professionals [1]. Formally:
"The definition of a class is intensional -- it is a statement of the properties that distinguish members of the class from non-members. When applied to a particular universe of objects, a class definition selects out those that are members of the class --i.e., have those properties. If the universe is well defined -- a collection of objects in which each can, in principle, though perhaps not in practical terms, be examined -- the result is a set. Mathematicians say that a class over a universe induces a set. If one defines a class, one must then compute the set of members induced when that class definition is applied to a particular universe. Thus, every set is a class, but every class is not necessarily a set. Any notion of class membership is derived (induced) rather than defined." --David McGoveran
The distinction between class and set varies with versions of set theory. We use the most broadly applicable definitions relevant to RDM and try to (1) be precise about how we use the terms and (2) identify the subject areas to which the definitions do not apply. A class can be specified
  • By the properties required for membership (i.e., its intension);
  • By the set of members induced by the application of its definition to some universe of objects (i.e., its extension);
The intension of the class -- the required properties -- can be understood as the specification of a type.

Note: Type in this sense should not be confused with domain and PDT; it and is not a term used in set theory -- we use set theory to explain it.


Object Groups and Relations

Note: We use object in the general sense of entity, not object oriented sense.

Conceptual modeling, in part, organizes some segment of reality into groups of property-sharing objects. Objects must have all the required properties specified by the rules -- be of the proper type -- to have membership in a group, including:

  • Individual properties belonging to each member and common to all;
  • Properties arising from relationships among individual properties;
  • Properties arising from relationships among all group members; and
  • Properties arising from relationships between groups;
In other words, modeling is formulation of informal business rules that jointly define each group's type -- the properties required for membership [6] (i.e., they are type specification statements). Because rules are expressed informally in natural language they are not "computable" and must be formalized symbolically as predicates such that a DBMS can enforce them as constraints to ensure database consistency with the rules. This formalization is part of database design.

An object group defined by the rules is represented formally in the database by a relation subject to several types of constraints that are formal expressions of the business rules. The constraints jointly comprise the relation predicate (RP) -- its type specification -- the criterion for tuple membership in the relation. The term (proper) class is reserved for a collection of objects that (1) have one or more definable properties in common and (2) are not sets. If the group's required properties are known, the RP can be applied to a universe of objects to select the set that satisfies it --i.e., those that have properties. If only the the set of members is known, the class definition needs to be inferred it.

A RP is, thus, the definition of a class of tuples, which, when applied to some universe of tuples, induces a relation, the set of tuples that satisfies the RP (i.e., those that represent (facts about) objects that are members of the represented group by virtue of having the required properties specified by the rules. By enforcing the corresponding constraints, the DBMS permits only tuples of the proper type that satisfy the RP), as members of the relation induced by the  class, into the relation.

No Relvars or Assignment

"Set semantics do not have the concept of a variable with values can be updated --i.e., destructively assigned. Such variables can be expressed in certain systems of logic, but they cannot be expressed in elementary set theory, or first order predicate logic (FOPL). Other, more expressively powerful formal systems are required. Unfortunately, such systems do violence to the RDM and rob it of its advantages and benefits." --David McGoveran
This is possibly why Codd did not include relvar semantics explicitly in the data sub-language, using the informal notion of "time-varying relations" instead. This skirted the more powerful formal systems that would introduce the semantics of computationally complete languages (CCL), which:
  • Are undecidable and, therefore, cannot be declarative;
  • Do not support physical and logical independence;
  • Do not guarantee logical validity and semantic correctness;
  • Are more powerful, but every algorithm must be tested for correctness and termination;
Hence FOPL-based relational data sub-languages, responsible strictly for data manipulation and integrity, properly hosted by CCLs that are not limited to data management functions[7].

Note: Properly means no violation of data sub-language decidability. Data definition is a DBMS, but host language function -- except for derived objects! -- hence the separate DDL and DML.

So, in conclusion:

  • Domain ≠ PDT, not just because the former "is tied to system-level (or physical-level) definitions of data", but because, unlike the former, the latter is not a logical term.
  • Class ≠ relvar, rather, relation = set -- the extension of a class, induced by applying the class intensional definition -- the RP -- to some universe of tuples representing facts about the object members of the class.


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.


[3] Pascal, F., First Normal Form (1NF) in Theory and Practice Parts 1-3.

[4] Pascal, F., Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness.


[6] Pascal, F., Business Modeling for Database Design.

[7] Pascal, F., Data Sub-languages, Programming, and Data Integrity.


No comments:

Post a Comment