Sunday, September 20, 2015

Interpreting Codd: The 12 Rules

I have recently come across an "explanation" of Codd's 12 RDBMS rules in a book appendix posted on line that is a set of mostly rule regurgitations. While they are no longer used to assess the relational fidelity of DBMS's, inquiries about them persist, yet they are still misunderstood.

In the current context of proliferation of non-relational products e.g. NoSQL, there is value in understanding the rules' origins and they can still help expose persistent flaws of SQL implementations and the superiority of RDBMS's over non-relational products. So here are the book "explanations", followed by mine.

These rules go beyond relational theory and define more specific criteria that need to be met in an RDBMS, if it’s to be truly relational.
The rules specify certain important DBMS features deriving from the theory which, if missing from a product, disqualify claims that it is a RDBMS.

Early after Codd published the RDM, vendors of existing hierarchic and network DBMS's that preceded SQL were simply adding the suffix /R to the names of their products and declaring them relational. Codd introduced these "quick" rules of thumb--neither systematic, nor complete, nor independent--that identify important features required by the RDM which, if missing from products, could be used to defeat fake RDBMS claims.

Note: There is, in fact, a 13th rule.
  • Foundation Rule 0: For any system that is advertised as, or claimed to be, a relational database management system, that system must be able to manage databases entirely through its relational capabilities.
A truly and fully relational DBMS with a relationally complete data language does not need to rely on any capability beyond first order predicate logic (FOPL)--read: any fully computational language--to manage databases. Such languages allow self-referencing and introduce undecidability.

The closest the industry came to implementing the RDM is SQL which, despite its poor relational fidelity, proved much superior relative to the complexity and inflexibility of preceding DBMS's. But the rules still expose poor relational fidelity of SQL DBMS's.
  • Information Rule 1: All information in the relational database is represented in exactly one and only one way--by values in tables.
This rule is an informal definition of a relational database and indicates that every piece of data that we permanently store in a database is located in a table.
Core rule intended to defeat claims of relational fidelity for products that represented data in non-relational ways e.g. in hierarchic or network structures.

Codd meant R-tables i.e., faithful database representations of relations that have

  1. uniquely named, unordered single-valued columns (SVC)
  2. keyed, unordered rows
  3. no missing values
From its initial versions SQL permitted columns that were nameless or had duplicate names, for which reason it had to rely on their order. SQL DBMS's still allow non-keyed/duplicate rows and marks for missing values (NULL's).
  • Guaranteed Access Rule 2: Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
This rule stresses the importance of primary keys for locating data in the database. The table name locates the correct table, the column name finds the correct column, and the primary key value finds the row containing an individual data item of interest. In other words, each (atomic) piece of data is accessible by the combination of table name, primary key value, and column name.
Intended to defeat claims of relational fidelity for products that permitted non-relational access to data via means other than the combination of these three logical elements, particularly physical addressing schemes such as pointer navigation in hierarchic and network systems.

It stresses the importance of strictly logical access of every data value in any R-table in the database. DBMS compliance with this rule depends on compliance with rule 1.

SQL DBMS's require users to ensure database compliance with rule 1, namely to always define at least one key and name every column uniquely. Some also allow physical addressing schemes e.g. ROWID's in Oracle.

  • Systematic Treatment of NULL Values Rule 3: NULL values (distinct from empty character string or a string of blank characters and distinct from zero or any other number) are supported in the fully relational RDBMS for representing missing information in a systematic way, independent of data type.
This rule requires that the RDBMS support a distinct NULL placeholder, regardless of datatype. NULL's are distinct from an empty character string or any other number, and they are always to be considered as unknown values. NULL's must propagate through mathematic operations as well as string operations. NULL + <anything> = NULL, the logic being that NULL means “unknown.” If you add something known to something unknown, you still don’t know what you have, so it’s still unknown.
Its good intention notwithstanding, this rule was a blunder. For a relational solution to missing data without NULL's see The Last NULL in the Coffin).

The rule actually refers to nulls as in logic, not NULL's--the SQL-specific flawed implementation of missing values. Moreover, "NULL value" is a contradiction in terms: a placeholder for a value is not a value and tables with missing values/markers thereof, including NULL's, violate rule 1. They are not R-tables.

The SQL DBMS flawed implementations of NULL violate the two-valued logic (2VL) of the real world, the FOPL that is based on it and the RDM.

  • Dynamic Online Catalog Based on the RDM Rule 4: The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.
This rule requires that a relational database be self-describing. In other words, the database must contain certain system tables whose columns describe the structure of the database itself, or alternatively, the database description is contained in user-accessible tables.
Intended to defeat claims of relational fidelity for DBMS's preceding SQL which, if they had catalogs--most did not--they were not queriable by users, or structured differently than the database, requiring a different query language.

A catalog queriable by users and applications with the same data language as the database was a concept introduced by the RDM.

  • Comprehensive Data Sublanguage Rule 5: A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose ability to support all of the following is comprehensible:
  1. data definition
  2. view definition
  3. data manipulation (interactive and by program)
  4. integrity constraints
  5. authorization
  6. transaction boundaries (begin, commit, and rollback).
This rule mandates the existence of a relational database language, such as SQL, to manipulate data. SQL as such isn’t specifically required. The language must be able to support all the central functions of a DBMS: creating a database, retrieving and entering data, implementing database security, and so on.
Intended to defeat claims of relational fidelity for products with only procedural, record-at-a-time data languages.

The RDM requires a set oriented, declarative data language based on first order predicate logic (FOPL) that are relationally, but not computationally complete. Products preceding RDM and SQL had procedural, record-a-time languages, or data access was via computationally complete application development/programming languages. Based on higher order logic, they allow logic that allow self-referencing and introduce undecidability. Codd's concept of a relational data sub-language hosted by computationally complete languages that called it for data access.

SQL was developed at IBM as a query-only prototype language for the System R research project investigating the feasibility of a RDBMS. Gradually the 1,2,4,5 and 6 capabilities were added post-hoc and even elements of a fully computational language.

  • View Updating Rule 6: All views that are theoretically updatable are also updatable by the system.
This rule deals with views, which are virtual tables used to give various users of a database different views of its structure. It’s one of the most challenging rules to implement in practice, and no commercial product fully satisfies it today. A view is theoretically updatable as long as it’s made up of columns that directly correspond to real table columns.
Intended to defeat claims of relational fidelity for products that lacked full support of view updating.
Views are theoretically updatable if, based on the operations of the relational algebra that define them, a RDBMS can logically propagate updates through the views to the underlying base tables.
See also rule 9.

SQL DBMS's do not allow updates of multitable views, including those that are theoretically updatable e.g. join views. 

  • High-Level Insert, Update, and Delete Rule 7: The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update, and deletion of data.
This rule stresses the set-oriented nature of a relational database. It requires that rows [sic] be treated as sets in insert, delete, and update operations. The rule is designed to prohibit implementations that support only row-at-a-time, navigational modification of the database.
Intended to defeat claims of relational fidelity for products that applied record-at-a-time updates to tables.
  • Physical Data Independence (PDI) Rule 8: Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.
Applications must still work using the same syntax, even when changes are made to the way in which the database internally implements data storage and access methods. This rule implies that the way the data is stored physically must be independent of the logical manner in which it’s accessed. This is saying that users shouldn’t be concerned about how the data is stored, or how it’s accessed. In fact, users of the data need only be able to get the basic definition of the data they need.
Intended to defeat claims of relational fidelity for products whose physical reorganization--storage and access methods--impaired queries and applications.

Compliance with this rule is dependent on compliance with rule 2.

SQL examples of violations are Oracle ROWID's and key dependence on indexes.

  • Logical Data Independence Rule 9: Application programs and terminal activities remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
Along with rule 8, this rule insulates the user or application program from the low-level implementation of the database. Together, they specify that specific access or storage techniques used by the RDBMS—and even changes to the structure of the tables in the database—shouldn’t affect the user’s ability to work with the data. In this way, if you add a column to a table and if tables are split in a manner that doesn’t add, or subtract columns, then the application programs that call the database should be unimpaired.
Intended to defeat claims of relational fidelity for products whose non-loss logical reorganization--schema--impaired queries and applications.

Views can be viewed as windows into base R-tables and are the relational mechanism for LDI--the insulation of queries and applications from non-loss changes to logical base R-tables. Compliance with this rule requires compliance with rule 6.

SQL DBMS's do not satisfy rule 6 and rule 9.

  • Integrity Independence Rule 10: Integrity constraints specific to a particular relational database must be definable in the relational data sub-language and storable in the catalog, not in the application programs.
The database must support at a minimum entity integrity (no component of a primary key is allowed to have a NULL value; and referential integrity (for each distinct non-NULL foreign key value in a relational database, there must exist a matching primary key value from the same domain).
This rule says that the database language should support integrity constraints that restrict the data that can be entered into the database and the database modifications that can be made. In other words, the RDBMS must internally support the definition and enforcement of entity integrity (primary keys) and referential integrity (foreign keys). It requires that the database be able to implement constraints to protect the data from invalid values and that careful database design is needed to ensure that referential integrity is achieved.
Intended to defeat claims of relational fidelity for products that lacked DBMS-enforced integrity constraints in the database and relied on each and every updating application to enforce them.

The RDM introduced mandatory DBMS enforcement of integrity constraints. A 5NF R-table is subject to seven types of constraints:

  1. Domain constraints
  2. Key constraint(s)
  3. Functional dependency constraints
  4. Column constraints
  5. Referential constraint(s) (FK's)
  6. Special constraint(s) (optional, database-specific)
Initially SQL and its implementations lacked integrity support entirely. It was added post-hoc, due to which it suffers from unnecessary complexity and limitations. SQL DBMS's do not support true domains and some of them support constraints procedurally, via stored or triggered procedures, rather than declaratively.
  • Distribution Independence Rule 11: The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.
This rule says that the database language must be able to manipulate data located on other computer systems. In essence, we should be able to split the data on the RDBMS out onto multiple physical systems without the user realizing it.
Intended to defeat claims of relational fidelity for products that impaired queries and applications when databases were physically first distributed, redistributed, or undistributed.
  • Non-Subversion Rule 12: If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.
This rule requires that alternate methods of accessing the data are not able to bypass integrity constraints, which means that users can’t violate the rules of the database in any way.
Intended to defeat claims of relational fidelity for products that superimposed a relational interface on their non-relational engine, which could be subverted by non-relational direct data access.

UPDATE: This is a response via David McGoveran to Erwin Smout's comment to this post.

When characterizing the limitation on the expressive power of a relational language as an object language, or discussing such issues, care must be exercised to distinguish between:
  • the abstract language being used--object language such as relational language (RA) or first order predicate logic (FOPL);
  • the language of the application--subject language, either a natural language or a formal language;
  • the language used to implement, the language used to describe interpretation (rules of correspondence between object language and subject language), or the language used to prove the properties of the object language (all of which are metalanguages).
This implies that you understand which functionality of a DBMS pertains to object language, subject language, and metalanguage.

Indeed, a DBMS that used RA for all its functionality would be very limited. Rather, we only require that the (declarative) language through which users interact with the DBMS has no more expressive power than FOPL. This implies acceptance of certain limitations on what users can do directly in the language.

A revised, more accurate, explanation of rule 0 would, therefore, be:

A truly and fully relational DBMS implements a relationally complete data language, such that the relationally complete data language does not directly include any expressive capability beyond  FOPL.
By contrast with "... does not directly include...", an indirectly included expression is one in the object language (RA), but at most is merely a primitive non-logical symbol in the object language. It can, however, be explained outside the object language (in a metalanguage such as ordinary English or perhaps in a higher ordered language) as having some meaning not expressible in FOPL. It cannot be analyzed (broken into constituent expressions), or derived (via component) in the (FOPL limited) object language. When such a symbol is referenced in the object language, the DBMS may invoke an expression (consistent with the intended interpretation) in a higher ordered language (e.g., a program or function) and evaluate it (i.e., execute it), substituting the value resulting from evaluation in place of the symbol and then returning control to the object language.

Transitive closure (TC), for example, can't be implemented via RA or any FOPL, but if a TC 'function' is implemented using a computationally complete language and returns its results in the form of a R-table, then a symbol (i.e., pure syntax) of type R-table can be defined in RA that references/invokes that function. From within RA, it appears to be just a R-table. It is up to the user to understand what the value of the returned table (i.e., as representing the TC) means. That understanding/interpretation is outside RA and passed to users only via documentation (i.e., via some metalanguage).

While sufficient to check for self reference for some applications, TC is not the only way. For example, a user discipline incorporating the following rules will suffice to insure that views are not self referential, even though recognition of that sufficiency is not provable in RA (and need not be):

  • Defining views only on those views that have already been defined (based
    on a simple check that every R-table (derived or otherwise) referenced in
    the definition is already in the system catalog);
  • Never allowing a view definition to be redefined.
Note: View redefinition must be implemented by 
  • dropping all dependent view definitions and then the view definition;
  • redefining the view by creating a new view;
  • recreating all dependent views. 
Notice the implied and necessary strict hierarchy among interdependent view definitions.


Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment