Sunday, September 20, 2015

Interpreting Codd: The 12 Rules


Note: This is a re-write for consistency of both this post and the interpretation of the rules with the McGoveran formalization and interpretation [1] of Codd's true RDM.

I have recently come across an "explanation" of Codd's 12 rules for RDBMS in a book appendix posted online that is mostly a regurgitation of the rules, or incorrect -- typical for an industry lacking foundation knowledge [2].

Shortly after Codd published the RDM, vendors of hierarchic and network DBMSs that preceded it and SQL were adding the suffix /R to the names of their products and declaring them relational. Codd introduced these "quick rules of thumb" -- neither rigorous, nor systematic, nor complete, nor independent -- that identify some important specific criteria that need to be met by a RDBMS if it’s to be truly relational which, if missing from products, could disqualify relational claims.

Although they are no longer used, inquiries about them persist and with the current proliferation of non-relational products (e.g., NoSQL, graph DBMSs) there is value in understanding them. The closest the industry came to the RDM is SQL DBMSs which, despite poor relational fidelity, proved much superior relative to the complexity and inflexibility of preceding DBMSs. But the rules still expose the relational infidelity of SQL DBMSs that have not been addressed for four decades, while new RDM violations have been introduced.

We offer here our clarifications on the rules. For each rule, we:

  • Explain its intended objective;
  • Offer clarifications, some of which reflect our current understanding of the RDM -- distinct from conventional wisdom -- based on its dual theoretical foundation and a careful analysis of Codd's work;

Note: The rules are often expressed in table, column and row terms, which are elements of visualization of relations as R-tables on some physical medium. We express it correctly in terms of relations, attributes and tuples. There are, in fact, 13 rules -- there is a Rule #0 We usually refer to Rule #1 as the Information Principle (IP).

#0. Foundation Rule: For any system that is advertised as, or claimed to be, a relational database management system, that system must be able to manage databases entirely through its relational capabilities.

Rule #0 was intended to defeat relational claims for products that were not genuinely RDM compliant and only emulated some limited functionality equivalent to a RDBMS.

The rule is fundamental: while compliance with the other rules per se does not make a DBMS relational, violation of Rule #0 disqualifies it, regardless of how many of the other rules it satisfies. It should be noted that only the data management DBMS functions should be RDM-compliant, the other DBMS functions are not restricted to relational capability
[3] (see Rule 5 and McGoveran's comments below).
#1. Information Principle: All information in a relational database is represented in exactly one and only one way — by values in relations.

Rule #1 was intended to defeat relational claims for products that used data structures other than relations to represent information (e.g., trees and networks).

The relation is the only data structure consistent with the two theoretical foundations of the RDM -- relational algebra (RA) and first order predicate logic (FOPL) -- that guarantee relational advantages (logical validity, semantic correctness, physical and logical independence, decidability) [4,5]. If any information is not represented as values in relations, it is inconsistent with RA and FOPL, several other rules are violated, including Rule #0, and the advantages are lost.

#2. Guaranteed Logical Access: Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of relation name, primary key value and attribute name.

Rule 2 was intended to defeat relational claims for products that forced or enabled applications to rely on means other than the 3-part logical addressing scheme to access data (e.g., physical pointer navigation).

Relations are at least in First Normal Form (1NF) (i.e., defined on simple domains that have no components meaningful to applications, the values of which are treated as atomic by the data language) [6,7]. Thus, they provide a simple addressing scheme that guarantees DBMS logical access to each and every value, without reliance on machine internals. As one component of the scheme, a PK is not more important than the other two. SQL DBMSs permit nameless attributes, duplicate tuples and missing values (marked by NULL) that defeat the logical addressing scheme and applications must resort to physical access (e.g., Oracle's ROWID), which violates this rule (and rule 8).

#3. Systematic Treatment of NULL Values: NULL values (distinct from empty character string or a string of blank characters and distinct from zero or any other number) are supported in the fully relational RDBMS for representing missing information in a systematic way, independent of data type.

Its good intention notwithstanding, Rule 3 was a blunder [8]. Missing values violate two-valued logic (2VL) underlying FOPL and, thus, a relation cannot have missing values. NULL is not a value, but a marker for a missing value, that cannot be treated as a value by the RA. A SQL table with NULLs is, therefore, not a relation and it violates the rule, as well as Rules #0 and #1.

#4. Dynamic Online Catalog Based on the Relational Model: The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.

Rule #4 was intended to defeat relational claims for products that either had no catalog (most of them) or, if they had one, it was not queriable by users, or was structured differently than the database, requiring a different query language.

The catalog as a relational database within a relational database is a concept introduced by the RDM. Aside from relational access to meta-data, two critical functions of the relational catalog are

  • To record semantics -- the formal meaning assigned to the database -- such that it is accessible to users on demand;
  • To support constraint inheritance [9] (see Rule #6);
which are completely ignored because no DBMS supports them. While the catalog should be queriable just like the database, it should not be manipulated and particularly not updated by the DBMS jointly with the database in the same expressions (which can result in self-referencing expressions and undecidability).
#5. Comprehensive Data Sublanguage: A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose ability to support all of the following is comprehensible:
  • data definition
  • view definition
  • data manipulation (interactive and by program)
  • integrity constraints
  • authorization
  • transaction boundaries (begin, commit, and rollback).
Rule #5 was intended to defeat relational claims for products that did not have a separate data language handling all DBMS functions, with a RA-FOPL-based relational data sublanguage component.

We refer to Codd's data sublanguage as data language, responsible for all DBMS functions (e.g., data definition, manipulation, security, transactions, resource management) that must have (1) carefully separated components for each DBMS function, but (2) a unified syntax. We reserve data sublanguage for the component responsible for data manipulation that must be relational (i.e., based on RA-FOPL). SQL was developed initially as a query-only prototype data sublanguage, but it has always had very poor relational fidelity. It has become a data language over time, but is not as comprehensive as it could and should have been. A data language is hosted by computationally complete languages, which are responsible for application functions (i.e., calculations, result presentation and communication with the DBMS) [3,10].

#6. View Updating: All views that are theoretically updatable are also updatable by the system.

Rule #6 was intended to defeat relational claims for products that did not update views that are theoretically updatable.

A RDBMS supports constraint inheritance: as derived relations, views are subject to (inherit) constraints derived from (1) the constraints on the base relations from which they are derived and (2) the RA operations and any additional constraints applied to derive them. The derived constraints are recorded in the database catalog and the DBMS can rely on them to propagate view updates back to the base relations. Some views are not theoretically updatabble (e.g., that contain aggregate attributes) [11,12].

#7. High-Level Insert, Update, and Delete: The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update, and deletion of data.

Rule #7 was intended to defeat relational claims for products that operated at the set level for retrieval, but did record-at-a-time updates.

Tuples are sets of values, but are not the sets that the rule refers to. Rather, it refers to relations that the RA operates upon as sets of tuples, rather than tuple-at-a-time.

#8. Physical Data Independence: Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.

Rule #8 was intended to defeat relational claims for products that enabled or required queries and applications to rely on machine internals for data access, which impaired them when databases were reorganized in storage.

Violations of Rule 8 often resulted from failures to comply with Rule #2 (e.g., allowing duplicate tuples and using physical means to access them individually, such as Oracle's ROWID) [13].

#9. Logical Data Independence: Application programs and terminal activities remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to the base relations.

Rule #9 was intended to defeat relational claims for products that did not update views (mainly SQL DBMSs) that are theoretically updatable.

SQL DBMSs do not permit updates of multirelation views, many of which (e.g., join views) are updatable, because there is no support of constraint inheritance and the DBMS cannot guarantee correct propagation of the updates to the base relations [9,11,12]. Compliance with it is dependent on compliance with Rule #6: views are the mechanism for logical independence (LI) and if not all the updatable ones can be updated, LI is impaired, as with SQL DBMSs.

#10. Integrity Independence: Integrity constraints must be definable in the relational data sublanguage and recorded in the catalog, not in application programs.

Rule #10 was intended to defeat relational claims for products that did not enforce relational integrity constraints in the database, relegating this DBMS function to each and every application.

Under the RDM, integrity enforcement is a critical DBMS function. Constraints are formalizations of business rules that enforce database consistency with the real world meaning denoted by the rules. We are unaware of any current DBMS that supports all types of relational integrity constraint declaratively, namely:

  • Domain;
  • Attribute;
  • Tuple;
  • Multituple;
  • Database (i.e., multirelation);

Initially SQL and its implementations lacked integrity support altogether. Support of only two types of contraint (primary key (PK) and referential) was added post-hoc (always a source of complexity and limitations). SQL DBMSs have not implemented the standard ASSERTION. Some support constraints procedurally, via proprietary stored/triggered procedures, which is inferior to declarative support [14].

#11. Distribution Independence: The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.

Rule #11 was intended to defeat relational claims for products with which queries and applications were impaired when databases were first physically distributed, redistributed, or centralized.

The frequent claims in the industry that RDBMs are not distributable notwithstanding, there are reasons why they are actually the most suited for distribution, consistent with Rule #8 [15]. The claims apply, at best, to SQL DBMSs and, interestingly, the product that came closest to a true DDBMS was a RDBMS that had a data language relationally superior to SQL [17].

#12. Non-Subversion: If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity constraints expressed in the higher-level (set-at-a-time) relational language.

Rule #12 was intended to defeat relational claims for products that superimposed a relational interface on their non-relational engine, which could be bypassed and subverted by queries and applications.


David McGoveran Response to Erwin Smout's comment to this post.


When characterizing the limitation on the expressive power of a relational data language, care must be exercised to distinguish between:

  • The object language -- the abstract language being used (e.g., relational algebra (RA) or first order predicate logic (FOPL));
  • The subject language of the application -- either a natural or a formal language;
  • The meta-language:
  • the language used to implement the language used to describe interpretation (rules of correspondence between object language and subject language); or
  • the language used to prove the properties of the object language;

This implies understanding which functionality of a DBMS pertains to object language, subject language, and metalanguage.

Indeed, a DBMS that used RA for all its functionality would be very limited. Rather, we only require that the (declarative) language through which users interact with the DBMS has no more expressive power than FOPL. This implies acceptance of certain limitations on what users can do directly in the language, in return for the advantatges that the RDM confers:

  • System guaranteed logical validity;
  • Semantic correctness;
  • Physical and logical independence;
  • Decidability;

A revised, more accurate, explanation of Rule #0 would, therefore, be:

A RDBMS implements a relationally complete data language, such that it does not directly include any expressive capability beyond FOPL.

By contrast with "... does not directly include...", an indirectly included expression is one in the object language (RA), but at most is merely a primitive non-logical symbol in the object language. It can, however, be explained outside the object language (in a meta-language such as ordinary English or perhaps in a higher ordered language) as having some meaning not expressible in FOPL. It cannot be analyzed (broken into constituent expressions), or derived (via component) in the (FOPL limited) object language. When such a symbol is referenced in the object language, the DBMS may invoke an expression (consistent with the intended interpretation) in a higher ordered language (e.g., a program or function) and evaluate it (i.e., execute it), substituting the value resulting from evaluation in place of the symbol and then returning control to the object language.

Transitive closure (TC), for example, cannot be implemented via RA or any FOPL, but if a TC 'function' is implemented using a computationally complete language and returns its results in the form of a relation, then a symbol (i.e., pure syntax) of type relation can be defined in RA that references/invokes that function. From within RA, it appears to be just a relation. It is up to the user to understand what the value of the returned relation (i.e., as representing the TC) means. That understanding/interpretation is outside RA and passed to users only via documentation (i.e., via some metalanguage).

While sufficient to check for self reference for some applications, TC is not the only way. For example, a user discipline incorporating the following rules will suffice to insure that views are not self referential, even though recognition of that sufficiency is not provable in RA (and need not be):

  • Defining views only on those views that have already been defined (based on a simple check that every relation (derived or otherwise) referenced in the definition is already in the system catalog);
  • Never allowing a view definition to be redefined;

Note: View redefinition must be implemented by

  • Dropping all dependent view definitions and then the view definition;
  • Redefining the view by creating a new view;
  • Recreating all dependent views.

Notice the implied and necessary strict hierarchy among interdependent view definitions.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[3] Pascal, F., Logical Symmetric Access, Data Sub-language, Kinds of Relations, Redundancy and Consistency.

[4] Pascal, F., The Interpretation and Representation of Database Relations.

[5] Pascal, F., What Relations Really Are and Why They Are Important.

[6] First Normal Form in Theory and Practice Part 1, 2, 3.

[7] Pascal, F., Simple Domains and Value Atomicity.

[8] Pascal, F., The Final NULL in the Coffin: A Relational Solution to Missing Data.

[9] Pascal, F., The Principle of Orthogonal Database Design Part III.

[10] DBMS Functions, Data Language, Data Sublanguage and Host Language, forthcoming.

[11] On View Updating (C. J. Date and D. McGoveran).

[12] McGoveran, D., Can All Relations Be Updated? (or Can Any Relation Be Updated?)

[13] Pascal, F., Physical Data Independence.

[14] Pascal, F., To Really Understand Integrity, Don't Start with SQL.

[15] Date's Twelve Rules for a DDBMS.

[16] Ingres 10.2 Star User Guide



No comments:

Post a Comment