Saturday, October 17, 2020

Understanding Codd's 12 Rules for RDBMS



In response to an online publication of a book appendix regurgitating Codd's 12 famous rules (some of which were, typically, incorrect[1]) I posted earlier a clarification of the rules. This is a revision thereof for better consistency with the new understanding of the RDM based on McGoveran's re-interpretation, extension and formalization[2] of Codd's work.

--------------------------------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST UPDATES
-
07/22/20: LINKS update: Added “An Argument for Controlled Natural Languages in Mathematics”, “Let’s Make Set Theory Great Again”.
- 07/21/20 LINKS update: Added “How Gödel’s Proof Works”.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
- @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

----------------------------------------------------------------------------------------------------------------------------------------------


Codd published the rules shortly after his introduction of the RDM, in response to vendors of hierarchic and network DBMSs that preceded it and SQL, who were adding the suffix /R to the names of their products and declaring them relational. They were intended as quick "rules of thumb" -- neither rigorous, nor systematic, nor complete, nor independent -- to help disqualify false relational claims. Inquiries about them persist and, given the current proliferation of "new" non-relational products (e.g., NoSQL, graph DBMSs), there is still value in understanding them and the underlying fundamentals. SQL DBMSs are the closest the industry came to the RDM and, while much to superior to the complex and inflexible predecessors, they have poor relational fidelity and the 12 rules can help illustrate that and test your foundation knowledge. For each rule we:

  • Explain its intended objective;
  • Offer clarifications, some of which reflect our new understanding of the RDM -- distinct from conventional wisdom, such as it is -- based on McGoveran's forthcoming formalization, re-interpretation and extension of Codd's work;
  • Comment on SQL systems compliance.


Note: The rules are often expressed in table (column and row) terms, which are elements of visualization of relations as R-tables. We express them correctly in terms of relations (attributes and tuples). There are, in fact, 13 rules -- there is a Rule #0.

#0. Foundation Rule: For any system that is advertised as, or claimed to be, a relational database management system, that system must be able to manage databases entirely through its relational capabilities.


Intended to defeat relational claims for products that were not genuinely RDM compliant and only emulated some limited relational functionality.

It is fundamental: its violation disqualifies a product even if it complies with one or more of the other rules. Note very carefully that only the data retrieval functions need be RDM-compliant, the other DBMS functions are not so limited, but must not subvert the relational capabilities[3] (see Rule 5 and McGoveran's comments below).

#1. Information Principle: All information in a relational database is represented in exactly one and only one way — by attribute values drawn from domains in relations.

Intended to defeat relational claims for products that used data structures in addition to relations (e.g., trees and networks) to represent information.

The relation is the only data structure consistent with the dual theoretical foundation of the RDM -- relational algebra (RA) and first order predicate logic (FOPL) -- that guarantee its critical advantages -- logical validity, semantic consistency, physical and logical independence, decidability and declarativity)[4,5]. If any information is represented in any other way, they do not materialize, which defeats the purpose (other rules -- including Rule #0 -- are violated and all bets are off).

#2. Guaranteed Logical Access: Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of relation name, primary key value and attribute name.

As a corrolary of #1, intended to defeat relational claims for products that forced or enabled applications to rely on means other than the 3-part logical addressing scheme to access data (e.g., physical pointer navigation).

A relation is by definition in both (1) First Normal Form (1NF) (i.e., attributes are defined on simple domains, the values of which are treated as atomic by the data sublanguage)[6,7] and (2) Fifth Normal Form (5NF) (i.e., all dependencies that hold in it are functional dependencies of non-key attributes on the primary key -- a formal way of saying that it represents facts about entities of a single type). As only one element of the simple logical addressing scheme, a PK is not more important than the other two.

Note: SQL DBMSs permit nameless attributes, duplicates (that are not tuples, which are unique by definition) and missing values (marked as NULLs) that defeat the logical addressing scheme. Applications must resort to non-relational addressing schemes -- a violation of this rule (and rule 8) (see also next rule).

#3. Systematic Treatment of NULL Values: NULL values (distinct from empty character string or a string of blank characters and distinct from zero or any other number) are supported in the fully relational RDBMS for representing missing information in a systematic way, independent of data type.

Its good intention notwithstanding, this was a Codd blunder[8]. Missing values violate two-valued logic (2VL) underlying SST/FOPL -- a relation does not have missing values. A NULL is not a value, but a marker for an absence of value, which cannot be treated as a value by the RA -- a violation of #1.

Note: A SQL table with NULLs is not a relation.

#4. Dynamic Online Catalog Based on the Relational Model: The database description -- meta-data -- is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational data sublanguage to its interrogation as they apply to regular data.

Intended to disqualify products that either had no database catalog (most of them at the time) or, if they had one, it was not user-queriable, or was structured differently than the database, requiring different, non-relational access.

The catalog as a relational database within a relational database is a concept introduced by the RDM. Aside from relational access to meta-data, two critical functions of the relational catalog are to record (1) formal semantics -- the meaning assigned to the database by its designer -- such that it is accessible to users on demand and (2) constraint inheritance[9] (see #6), which are completely ignored because no SQL DBMS supports them. While the catalog should be queriable just like the database, it should not be updated by the DBMS jointly with the user database in the same data sublanguage expressions (which can result in self-referencing expressions and undecidability).

#5. Comprehensive Data Sublanguage: A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose ability to support all of the following is comprehensible:

  • data definition
  • view definition
  • data manipulation (interactive and by program)
  • integrity constraints
  • authorization
  • transaction boundaries (begin, commit, and rollback).

 

With hindsight we distinguish a relational data sublanguage (i.e., based on the relational algebra) for data manipulation (i.e., retrieval) and a DBMS language that subsumes the data sublanguage, as well as data management functions other than retrieval that, while not relational, must be carefully separated components, albeit with some a unified syntax, that do not subvert the relational capabilities of the sublanguage (see Rule #12).

The SST/FOPL theoretical foundation of the RDM is applicable only to data manipulation. A data sublanguage is hosted by computationally complete languages (CCL), which are responsible for application functions (i.e., computation, result presentation, and communication with the DBMS)[3,10].

Note: SQL was initially a data sublanguage prototype with poor relational fidelity. It has become more of a DBMS language(s) over time, but is not as comprehensive as it could and should have been, nor does it guarantee non-subversion.

#6. View Updating: All views that are theoretically updatable are updatable by the system.

Intended to defeat relational claims for products that did not update theoretically updatable views and, thus, lack support of logical independence.

A true RDBMS supports constraint inheritance: as derived relations, views are subject to (inherit) constraints derived from (1) the constraints on the base relations from which they are derived and (2) the RA operations and any additional constraints applied to derive them. The derived constraints are recorded in the database catalog and the DBMS can rely on them to propagate view updates back to the base relations. Some views are not theoretically updatable (e.g., that contain aggregate attributes) [11,12].

#7. High-Level Insert, Update, and Delete: The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update, and deletion of data.

Intended to defeat relational claims for products that operated at the set level for retrieval, but did tuple-at-a-time updates.

#8. Physical Data Independence: Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.

Intended to defeat relational claims for products that violated Rule #2 by subverting the logical addressing scheme[13].

Note: For example, SQL systems allow duplicates which can be accessed only non-logically via such means as Oracle's ROWID).
 

#9. Logical Data Independence: Application programs and terminal activities remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to the base relations.

 

Intended to defeat relational claims for products that did not update theoretically updatable views.

Note: Because they lack support of constraint inheritance, SQL DBMSs do not permit updates of multirelation views
(e.g., join views) even if they are theoretically updatable and, thus, correct propagation to the underlying base relations is guaranteeable[9,11,12]. Compliance with it is dependent on compliance with Rule #6 -- views are the mechanism for logical independence (LI) (see also next rule). 

 

#10. Integrity Independence: Integrity constraints must be definable in the relational data sublanguage and recorded in the catalog, not in application programs.

 

Intended to defeat relational claims for products that did not enforce relational integrity constraints in the database (and consequently lack support of constraint inheritance), relegating this function (if at all) to applications.

The RDM is theory applied to database management: database relations are semantically constrained to be consistent with the business rules comprising the conceptual model (constraints are formalizations of the rules at the logical database level).

Note: We are unaware of any current DBMS that supports declaratively all relational integrity constraints (domain, attribute, tuple, multi-tuple and database). Initially SQL and its implementations lacked any integrity support. Support of primary key (PK) and referential constraints was added post-hoc (always a source of complexity and limitations). SQL DBMSs have not implemented the standard ASSERTION. Some support constraints procedurally, via proprietary stored/triggered procedures, which is inferior to declarative support[14].

#11. Distribution Independence: The data sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.

Intended to defeat relational and distributed claims for products with which queries and applications were impaired when databases were first physically distributed, re-distributed, or centralized.

Claims in the industry notwithstanding, there are are reasons why true RDBMSs are most suited for distribution, in part due to support of physical independence (see Rule #8)[15].

Note: Indeed, the product that came closest to a true DDBMS was IngresStar, which had a data sublanguage (QUEL) superior to SQL[16].

#12. Non-Subversion: If a relational system has or supports a low-level (single-tuple-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity constraints expressed in the higher-level (set-at-a-time) relational sublanguage.

Intended to defeat relational claims for products that provided non-relational access to data.

Note: One example of subversion was the addition by vendors of a SQL interface to an existing non-relational DBMS, which could be bypassed by queries and applications.

David McGoveran's Response to a Comment on this Post

When characterizing the limitation on the expressive power of a relational data sublanguage, care must be exercised to distinguish between:

  • The object language -- the abstract language being used (e.g., relational algebra (RA) or first order predicate logic (FOPL));
  • The subject language of the application -- either a natural or a formal language;
  • The meta-language -- the language used to implement the language used to describe interpretation (rules of correspondence between object language and subject language), or the language used to prove the properties of the object language.

This implies understanding which functionality of a DBMS pertains to object language, subject language, and meta-language.

Indeed, a DBMS that used RA for all its functionality would be very limited. Rather, we only require that the (declarative) data sublanguage through which users interact with the DBMS has no more expressive power than FOPL. This implies acceptance of certain limitations on what users can do directly in the language, in return for the advantages that the RDM confers:

  • System guaranteed logical validity;
  • By design semantic consistency;
  • Physical and logical independence;
  • Declarativity;
  • Decidability.


A revised, more accurate, explanation of Rule #0 would, therefore, be: A RDBMS implements a relationally complete data sublanguage, such that it does not directly include any expressive capability beyond FOPL.

By contrast with "... does not directly include...", an indirectly included expression is one in the object language (RA), but at most is merely a primitive non-logical symbol in the object language. It can, however, be explained outside the object language (in a meta-language such as ordinary English or perhaps in a higher ordered language) as having some meaning not expressible in FOPL. It cannot be analyzed (broken into constituent expressions), or derived (via component) in the (FOPL limited) object language. When such a symbol is referenced in the object language, the DBMS may invoke an expression (consistent with the intended interpretation) in a higher ordered language (e.g., a program or function) and evaluate it (i.e., execute it), substituting the value resulting from evaluation in place of the symbol and then returning control to the object language.

Transitive closure (TC), for example, cannot be implemented via RA or any FOPL, but if a TC 'function' is implemented using a CCL and returns its results in the form of a relation, then a symbol (i.e., pure syntax) of type relation can be defined in RA that references/invokes that function. From within RA, it appears to be just a relation. It is up to the user to understand what the value of the returned relation (i.e., as representing the TC) means. That understanding/interpretation is outside RA and passed to users only via documentation (i.e., via some meta-language).

While sufficient to check for self reference for some applications, TC is not the only way. For example, a user discipline incorporating the following rules will suffice to insure that views are not self referential, even though recognition of that sufficiency is not provable in RA (and need not be):

  • Defining views only on those views that have already been defined (based on a simple check that every relation (derived or otherwise) referenced in the definition is already in the system catalog);
  • Never allowing a view definition to be re-defined.


Note: View re-definition must be implemented by:

  • Dropping all dependent view definitions and then the view definition;
  • Re-defining the view by creating a new view;
  • Re-creating all dependent views.

Notice the implied and necessary strict hierarchy among interdependent view definitions.


References

[1] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[2] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.

[3] Pascal, F., Logical Symmetric Access, Data Sub-language, Kinds of Relations, Redundancy and Consistency.

[4] Pascal, F., The Interpretation and Representation of Database Relations.

[5] Pascal, F., What Relations Really Are and Why They Are Important.

[6] First Normal Form in Theory and Practice Part 1, 2, 3.

[7] Pascal, F., Simple Domains and Value Atomicity.

[8] Pascal, F., The Final NULL in the Coffin: A Relational Solution to Missing Data.

[9] Pascal, F., The Principle of Orthogonal Database Design Part III.

[10] DBMS Functions, Data Language, Data Sublanguage and Host Language, forthcoming.

[11] On View Updating (C. J. Date and D. McGoveran).

[12] McGoveran, D., Can All Relations Be Updated? (or Can Any Relation Be Updated?)

[13] Pascal, F., Physical Data Independence.

[14] Pascal, F., To Really Understand Integrity, Don't Start with SQL.

[15] Date's Twelve Rules for a DDBMS.

[16] Ingres 10.2 Star User Guide

 

 

 

 

No comments:

Post a Comment

View My Stats