Saturday, May 13, 2017

To Really Understand Integrity, Don't Start with SQL

Here's what's wrong with the picture of two weeks ago, namely:
"Constraints are categorized as follows:
  • Domain integrity Constraints
  • Entity integrity Constraints
  • Referential integrity Constraints
  • Not null
  • Unique
  • Foreign key
  • Check
  • Primary key
Constraints are always attached to a column not a table."
--Dayakar, SQL Constraints
Despite being a critical database function, integrity is insufficiently understood and appreciated. Few practitioners know much beyond just awareness of primary key and referential constraints and question even their necessity.

SQL inhibits understanding, so if you want to really understand integrity, don't start with SQL. Instead, educate yourself on relational integrity and put your SQL DBMS's features in that context--how correctly and completely does it support all the necessary constraints? Then you can (1) make sure that the constraints that it does support are enforced and (2) be aware of those that it does not, the potential risks thereof and the defensive actions to be taken, if necessary. A bonus is that you will finally realize one of the many important differences between SQL DBMSs and a true RDBMS, which are confused in the industry[1]

I dare you to find this information anywhere else!!!!

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

Business Rules

A conceptual model consisting of business rules expressed informally in natural language that structure some segment of reality of interest into distinct object groups[2]. Each group is defined by rules that specify the required properties that objects must satisfy to be members of the group. Per McGoveran, these group-defining properties fall into four categories[3]:
  • Individual properties shared by members--first order properties (1OP);
  • Collective properties that members must satisfy collectively as a group, which arise from within-group relationships:
- between 1OPs--second order properties (2OP);
- between all members of a group--third order properties (3OP);
  • Properties that groups must satisfy collectively as a set, which arise from group relationships, if any--fourth order properties (4OP).
and give rise to four categories of rules:
  • Property rules (1OP)
  • Group rules
  •  object rules
  •  multi-object rules (3OP)
  • Multi-group rules (4OP)

Predicates and Constraints

A DBMS can only manipulate abstract symbols mathematically. Informal groups, objects, properties and business rules must be formalized for database representation.

A relation is a formal data structure specially adapted to represent information about an object group in relational databases[4]. It is a set of tuples, each of which is a set of attribute values drawn from a domain:

  • A domain is a set of values of a type within a constrained range, which encapsulates a set of operators that, when applied to those values, produce values of the same type. Domains represent 1OPs of all the possible members of an object group;
  • An attribute is, per McGoveran, a subset of domain values that constitute the range of a simple function--a function with an inverse, a one-to-one mapping--of a domain, which can be interpreted as a representation of a domain in a specific context. Attributes represent 1OPs of the actual members of an object group;
  • A tuple is a unique set of unordered attribute values. Tuples represent facts about the actual group members at a point in time.
The specification of a relation is a symbolic expression in a formal language
of the informal business rules that jointly define the object group represented by the relation by specifying the required group properties. First order predicate logic (FOPL) is a language part of which can express business rules that define object groups and relationships between them as formal predicates.

The predicates corresponding to the conjunction of rules specifying 1OPs, 2OPs and 3OPs of an object group comprise a formal compound relation predicate (RP). The conjunction of all RPs with the predicates corresponding to the rules specifying 4OPs comprise the database predicate (DBP).

RPs and DBPs are the best formal approximation to the business rules that can be enforced by a DBMS in the database and can be used to declare base and derived relations to it. Their enforcement ensures database consistency: it constrains relations to be consistent in their representation of the object groups with the business rules defining them, in which context they are referred to as constraints. Per Codd[5], constraints add context-specific meaning to the database: business rules are the informal real world interpretation--meaning--assigned by the database designer to the relations and the database and constraints are their formal database representation. Otherwise put, they  guarantee database representation faithful to the conceptual model.

The following constraint categories correspond to the four categories of business rules:

  • Domain and attribute constraints (property rules);
  • Tuple constraints (object rules);
  • Multi-tuple constraints (multi-object rules);
  • Database constraints (multi-group rules).
Note: As you can see, the constraint categories in the comment suffer from conceptual-logical conflation (CLC) (e.g., entities are conceptual and constraints are logical) and other confusions (e.g., referential constraints are the same as FK constraints, CHECK is not a type of constraint but just a SQL syntactic implementation of several types of constraints).

A conjunction of constraints of the first three types comprise the relation constraint; a conjunction of all of four types comprise the database constraint.

Integrity Constraints, Relational Completeness and SQL

There can be a variety of FOPL-compliant data languages, each with its own syntax. We shall refer to constraints expressed in a specific FOPL-based data language native to a particular DBMS as integrity constraints, to distinguish them from the relational constraints expressible in FOPL.

A FOPL compliant relationally complete data language can support all constraints in the four categories declaratively, which SQL is not and does not. Aside from the fact that SQL tables are not necessarily relations--duplicate rows and NULLs are allowed--every relational constraint that is not expressible/enforceable poses a potential database consistency risk. SQL DBMSs may support some constraints procedurally--via, for example, stored procedures (SP)--and others not at all--requiring application code--both methods resulting in loss of certain advantages of using a DBMS, let alone a RDBMS. Imperative SPs, like computationally complete languages,

  • Require higher ordered logic than FOPL;
  • Are more complex and prone to error;
  • Do not benefit from system-guaranteed correctness;
  • Cannot be integrated into the DBMS optimization strategy;
  • Are proprietary;
  • If they allow self-referencing they
  • are usually undecidable and
  • cannot support either physical or logical independence.
These are likely reasons Codd defined relational data sub-languages as separate from, but hosted by computationally complete (e.g., programming) languages that go beyond FOPL.

We will consider here, therefore, only SQL declarative integrity constraints.

Domain and Attribute Constraints

Domains are designer-constrained strong data types and represent the 1OPs of all possible members of an object group. Domains can be  
  • Primitive, or derived from primitive domains;
  • Simple (with atomic values), or non-simple.
Attributes are further constrained domains--simple functions of domains--that represent 1OPs of actual members at a point in time in specific real world contexts and can be representations of domains[3].

For example, objects of type employee earn salaries and commissions. Assume a primitive domain NUMBER with a wide range. A domain MONEY can be derived from NUMBER by constraining it to a narrower range per the business policy that defines the valid lower of minimum of salary and commission and the higher of their maximum. Then attributes SALARY and COMMISSION can be defined by further constraining MONEY in two distinct ways--i.e., in the context of the two different roles that the 1OP represented by MONEY play in this specific context. If commissions are paid in in UK£ and MONEY is defined in US$, then there exists a simple function by which US$ amounts are  converted into UK£ amounts and vice-versa and COMMISSION is a UK£ representation of MONEY.

SQL DBMSs come with built-in system data types--e.g., INTEGER, or NUMBER (MONEY is usually among them). At best they can be viewed as primitive domains with vendor-defined ranges. SQL does not support derived domains with designer-defined ranges as needed to represent properties. 

Importantly, nor does it support simple domains. It cannot, therefore, enforce the normal form of relations, which a truly relational FOPL-compliant data language does (in fact, SQL DBMSs implement special functions that can be used to subvert atomicity of system defined data types)[6].
Attribute constraints are generally not expressible in SQL, but there are exceptions in some commercial dialects, with limited expressibility I suspect. In SQL Server, for example:

CREATE TABLE employees
  ename VARCHAR(15),
  salary MONEY CHECK (salary < 300,000));

SQL does not support attributes as representations of domains.

Note: We have changed the example from:

(no NUMBER(2),
 name VARCHAR(10),
 marks NUMBER (3) CHECK (marks > 300));
which is meaningless and lacks a primary key. Such example should never be used for instructional purposes.

Tuple (Multi-attribute) Constraints

Tuple constraints represent rules that express relationships between 1OPs--i.e., 2OPs.

For example, if there is a business rule that commission should not exceed 50% of salary, the relationship commission < 0.5*salary should be expressible as a declarative integrity constraint in the data language.

As far as I know, such constraints are not expressible in SQL.

Multi-tuple (Relation) Constraints

Multi-tuple constraints enforce rules that specify relationships among all actual members of an object group--i.e., 3OPs--on the tuples of the representing relation. There are four categories of such constraints, corresponding to four types of relationship.

Note: Actually, to be precise, there are five[6], but we discuss only four here.

Uniqueness constraints represent rules that specify distinguishability in the real world of actual objects members of a group.

Primary key (PK) constraints are an important special case of uniqueness constraints (uniqueness can also apply to non-PK attributes). They represent rules that specify unique identifiability of actual object members of a group in the real world. Relations represent facts about objects that are uniquely identifiable in the real world by at least one combination of one or more 1OPs, so by definition every relation has a PK that represents that identifier in the database.

PK and uniqueness constraints are expressible in SQL via a declarative shorthand. However, as we saw above, PKs are optional and with their necessity broadly questioned, SQL tables are often bag-tables, not R-tables, in violation of the RDM.

Tuple-attribute constraints represent, per McGoveran, rules that specify a special type of relationships that associate objects with their 1OPs[6]. Since the identifying 1PO expresses the identity of objects, its values serve as stand-ins for the objects, so these constraints are expressed as relationships between  identifying 1OP--standing-in for the objects--and each of the descriptive 1OPs.

For example, if salary is a 1OP of objects of type employee uniquely identified by employee number, the relationship that associates every employee number--standing-in for every employee--with the employee's salary, which we might call 'earns', can be expressed symbolically as the two-place formal predicate EARNS(EMP#,SALARY) and enforced as a multi-tuple (why?) integrity constraint on the EMPLOYEES relation. A similar constraint can be defined and enforced for each descriptive 1PO.

"This type of constraint has been uniformly ignored in the published literature heretofore. This failure has led to assertions that important uses of relation predicates (e.g., the Principle of Orthogonal (Database) Design and view updating algorithms)--as presented through straw man examples--do not meet expectations."--David McGoveran
Note that such constraints are named and, therefore, a semantic enrichment of RPs.

Tuple-attribute constraints are not expressible in SQL.

Functional dependency (FD) constraints represent rules that specify a direct relationship between descriptive 1POs and identifiers. For a group of objects of the same type there is exactly one value of each descriptive 1PO for every value of the identifying 1PO--another way of saying that all of the former are functionally dependent on the latter. If a relation is designed such that it represents objects of a single type, the corresponding FD constraints on EMPLOYEES enforce the 1:1 relationship between the values of each non-PK attribute and the PK values.

You should have recognized relations representing objects of a single type as being in fifth normal form (5NF), defined formally as being subject exclusively to FDs on the PK[7]. For reasons that we explain elsewhere[8,9] we now believe that relations should be in 5NF--not just in 1NF--by definition, which entirely obviates 2NF-4NF and the need for further normalization.

FD constraints, which would help enforce 5NF, are not expressible in SQL.

Aggregation constraints represent rules that specify relationships that constrain aggregate values of attributes (e.g., sum, count, average). Note that these aggregates cannot be defined in FOPL--they must be implemented as primitive functions in a host language and invoked in the data language, as recommended by Codd.

Aggregation constraints are expressible in SQL.

Database (multi-relation) constraints

Database constraints represent rules specifying inter-group relationships--i.e., 4OPs--and enforce them on the set of relations that comprise the database. There are two categories of such constraints.

Referential constraints represent rules that specify consistency of relations that reference each other. They enforce matching of referencing FK and referenced PK values[7].

Disjunctive constraints represent rules that specify relationships between an object supertype and its subtypes (OSS), when represented by properly designed relations[10].

Referential integrity constraints are expressible in SQL via a declarative shorthand. It is not clear whether disjunctive constraints are expressible, but even if they are, considerable drudgery is required to declare and enforce them, because SQL does not support OSS relationships explicitly.

NOT NULL Should Not Be an Explicit Constraint

The reasons for prohibiting NULLs have been amply documented by us and others. Instead, SQL authors (prompted by a mistake by Codd!) blundered by accepting them and offering a NOT NULL option to disallow them. Not only does this introduce numerous problems without compensating benefit, but it also obscures them--practitioners are lured into thinking that the problems can be avoided by simply applying NOT NULL as any other constraint (e.g., "NOT NULL ... constraint ... is used to avoid null values.")

Well, NULLs aren't values and the problems are not all avoided with NOT NULL. The consequences radiate throughout the system and are rather insidious, with most practitioners unaware of them. All this is unnecessary, because missing data can be handled relationally without NULLs, or any other markers for the absence of values[11].

Constraints Apply to Relations

 "We can add constraints in three ways.
Column level -- Along with the column definition
Table level -- After the table definition
Alter level -- Using alter command."
Since the relation is the most basic relational data structure--attributes and tuples are elements of relations--all constraints are applied either to relations, or to sets of relations, not to relation elements. The specific syntax of the data language is unimportant as long as it is FOPL-compliant.



[2] Pascal, F., Business Modeling for Database Design, Practical Database Foundations Series.

[3] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming.

[4] Codd, E. F., Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst. 4(4)(1979).

[5] Pascal, F., Simple Domains and Value Atomicity.

[6] Pascal, F., 
First Normal Form (1NF) in Theory and Practice Parts 1-3

[7] Pascal, F., The Interpretation and Representation of Database Relations, Understanding the Real RDM Series.

[8] Pascal, F., Kinds of Relations, Data Sub-language and Database Consistency,
Understanding the Real RDM Series.

[9] Pascal, F., The Manipulation of Database Relations, forthcoming.

[10] Pascal, F., Meaning Criteria and Object Supertype-Subtypes

[11] Pascal, F., The Final NULL in the Coffin, Practical Database Foundations Series.

No comments:

Post a Comment

View My Stats