Sunday, October 1, 2017

Class, Type, Relation and Domain in Database Management

This is a 10/01/17 re-rewrite of a 08/12/12 post revised on 12/05/16 to bring it in line with David McGoveran's formal exposition and interpretation[1] of Codd's RDM (as distinct from its common "understanding" in the industry).

Here's what's wrong with last week's picture, namely:

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
  • type = domain: I fully understand why one might equate type and domain, but ... in today's programming practice, type and domain are quite different. The word type is largely tied to system-level (or physical-level) definitions of data, while a domain is thought of as an abstract set of acceptable values.
  • class != relvar: In simple terms, the word class applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a relation is a class (and trivially also a set), which contributes to confusion.
In modern programming parlance class is generally distinguished from type only in that type refers to primitive (system-defined) data definitions while class refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, type and class are actually synonymous."
There is, indeed, a huge mess. And, as always, it is rooted in poor foundation knowledge[2], to which the comment itself is not immune. 


Class, type, and domain as used in the RDM are well defined formal concepts in set theory, mathematical relation theory and predicate logic -- the theoretical foundations of the RDM -- and were introduced by Codd in part explicitly to distinguish them from "programming parlance". Furthermore, Date's introduction of explicit relation variables (relvars) semantics in the data sub-language notwithstanding, in none of the three theoretical foundations are there variables with values that can be destructively assigned ("updated"). 


Domains and Programming (Data) Types

 "The theory behind data types in most programming languages is based on abstract data types, but programmers hardly ever use the term in this way and languages are rarely strong in this regard. The need for a formal theory (of abstract data) and the semantics of data types was not addressed by either Codd, or the current industry interpretation of the RDM. Codd's treatment of types was greatly simplified and its understanding in the current interpretation is at best simplistic." --David McGoveran
Codd introduced domains -- which he called "extended data types" -- to distinguish them from programming data types (PDT). A PDT is a named set of values under application control that does not necessarily represent something in the real world. A domain is a user defined abstract data type under DBMS control that represents a property of a real world object. Furthermore, relational domains are simple --i.e., they do not have components relied upon by applications -- and their values are treated as atomic by the data sub-language (e.g., if they are relation-valued domains (RVD), the data sub-language does not allow applications access to their attributes and tuples)[3].

Note: A relation with only simple domains is in its normal form (or normalized). The common understanding of the RDM -- such as it is -- (1) confuses the
the First Normal Form (1NF) in use today with the formal normal form borrowed by Codd from predicate logic -- they are not one and the same -- and (2) asserts that a relation is by definition in First Normal Form (1NF)[3]. However, a careful reading of Codd's early work reveals that a relation should be in Fifth Normal Form (5NF) by definition, which is required for both semantic correctness and system-guaranteed logical validity[4]. We believe Codd would have ultimately reached this conclusion -- normalization, further normalization and 1NF-5NF are an unfortunate accident of history due to Codd’s having initially defined relational algebra operations (especially join) differently from their later definitions used today. 

 

 

Classes, Types, Sets and Relations

"[E]very property defines a class -- namely, the group of [objects] possessing that property -- whereas every class is a class simply by virtue of the fact that its members have common defining properties."[5] --Robert Olson
Though the terms class and set are often used interchangeably, there is a subtle distinction between them which escapes many data professionals[1]. Formally:
"The definition of a class is intensional -- it is a statement of the properties that distinguish members of the class from non-members. When applied to a particular universe of objects, a class definition selects out those that are members of the class --i.e., have those properties. If the universe is well defined -- a collection of objects in which each can, in principle, though perhaps not in practical terms, be examined -- the result is a set. Mathematicians say that a class over a universe induces a set. If one defines a class, one must then compute the set of members induced when that class definition is applied to a particular universe. Thus, every set is a class, but every class is not necessarily a set. Any notion of class membership is derived (induced) rather than defined." --David McGoveran
The distinction between class and set varies with the specific version of set theory. We use the most broadly applicable definitions relevant to RDM and try to (1) be precise about how we use the terms and (2) identify the subject areas to which the definitions do not apply. A class can be specified
  • By the properties required for membership --i.e., its intension;
  • By the set of members induced by the application of its definition to some universe of objects --i.e., its extension;
The intension of the class -- the required properties -- can be understood as the specification of a type.

Note: Type in this sense should not be confused with domain and PDT; it and is not a term used in set theory -- we use set theory to explain it.

 

 

Object Groups and Relations


Note: We use object in the general sense of entity, not object oriented sense.

Conceptual modeling, in part, organizes some segment of reality into groups of property-sharing objects. Objects must have all the required properties specified by the rules -- be of the proper type -- to have membership in a group, including:

  • Individual properties belonging to each member;
  • Properties arising from relationships among individual properties;
  • Properties arising from relationships among all group members; and
  • Properties arising from relationships between groups;
In other words, modeling is formulation of informal business rules that jointly define each group's type -- the properties required for membership[6] --i.e., they are type specification statements. Because rules are expressed informally in natural language, they must be formalized symbolically such that a DBMS can manipulate them mechanically for information retrieval (querying). This formalization is the purpose of database design.

An object group defined by the rules is represented formally in the database by a relation subject to several types of constraints that are formal expressions of the business rules. The constraints jointly comprise the relation predicate (RP) -- its type specification -- the criterion for tuple membership in the relation. The term (proper) class is reserved for a collection of objects that (1) have one or more definable properties in common and (2) are not sets. If the group's required properties are known, the RP can be applied to a universe of objects to select the set that satisfies it --i.e., those that have properties. If only the the set of members is known, the class definition needs to be inferred it.

A RP is, thus, the definition of a class of tuples, which, when applied to some universe of tuples, induces a relation, the set of tuples that satisfies the RP --i.e., those that represent (facts about) objects that are members of the represented group by virtue of having the required properties specified by the rules. By enforcing the corresponding constraints, the DBMS permits only tuples of the proper type that satisfy the RP as members of the relation induced by the  class.



No Relvars or Assignment

"Set semantics do not have the concept of a variable with values can be updated --i.e., destructively assigned. Such variables can be expressed in certain systems of logic, but they cannot be expressed in elementary set theory, or first order predicate logic (FOPL). Other, more expressively powerful formal systems are required. Unfortunately, such systems do violence to the RDM and rob it of its advantages and benefits." --David McGoveran
This is possibly why Codd did not include relvar semantics explicitly in the data sub-language, using the informal notion of "time-varying relations" instead. This skirted the more powerful formal systems that would introduce the semantics of computationally complete languages (CCL), which:
  • Are undecidable and, therefore, cannot be declarative;
  • Do not support physical and logical independence;
  • Do not guarantee logical validity and semantic correctness;
  • Are more powerful, but every algorithm must be tested for correctness and termination;
Hence FOPL-based relational data sub-languages, responsible strictly for data manipulation and integrity, properly hosted by CCLs that are not limited to data management functions[7].

Note: Properly means no violation of data sub-language decidability. Data definition is a DBMS, but host language function -- except for derived objects! -- hence the separate DDL and DML.

So, in conclusion:

  • Domain ≠ PDT, not just because the former "is tied to system-level (or physical-level) definitions of data", but because, unlike the former, the latter is not a logical term.
  • Class ≠ relvar, rather, relation = set -- the extension of a class, induced by applying the class intensional definition -- the RP -- to some universe of tuples representing facts about the object members of the class.

References


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[3] Pascal, F., First Normal Form (1NF) in Theory and Practice Parts 1-3.

[4] Pascal, F., Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness.

[5] Olson, R., MEANING AND ARGUMENT.

[6] Pascal, F., Business Modeling for Database Design.

[7] Pascal, F., Data Sub-languages, Programming, and Data Integrity.







 

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment