Sunday, July 29, 2012

What Meaning Means: Business Rules, Predicates, Constraints, Integrity Constraints and Database Consistency

This is a 6/10/17 rewrite of a 7/29/12 post to bring it in line with the McGoveran interpretation of Codd's RDM[1].

To understand what's wrong with the picture of two weeks ago, namely:

"If we step back and look at what RDBMS is, we’ll no doubt be able to conclude that, as its name suggests (i.e., Relational Database Management System), it is a system that specializes in managing the data in a relational fashion. Nothing more. Folks, it’s important to keep in mind that it manages the data, not the MEANING of the data! And if you really need a parallel, RDBMS is much more akin to a word processor than to an operating system. A word processor (such as the much maligned MS Word, or a much nicer WordPress, for example) specializes in managing words. It does not specialize in managing the meaning of the words ... So who is then responsible for managing the meaning of the words? It’s the author, who else? Why should we tolerate RDBMS opinions on our data? We’re the masters, RDBMS is the servant, it should shut up and serve. End of discussion." --Alex Bunardzic, Should Database Manage The Meaning?
it helps to consider the quote in the context of the author's other article, "The Myth of Data Integrity", where he reveals that those "DBMS opinions" are integrity constraints (the article has been deleted, but a few comments remain online and are highly recommended for a feel  of the consequences of lack of foundation knowledge).

Hugh Darwen concurred:

"Couldn't agree more. The strength of the relational model lies in its total abandonment of meaning. BTW, I detest the term "semantic constraint" that some people use, imagining that some constraints are to do with meaning and others are not. They are all just constraints. The word "meaning" is bandied about sometimes in ways that make me wonder if some people don't know what it means, whether they spell it that way or the posh way, "semantics."
Now, relations can indeed represent facts about anything and a RDBMS manipulates them mathematically as sets, independently of what they mean (just like arithmetic manipulates numbers). This is due to its foundation in logic, which is[2]:
"... an analytical theory of the art of reasoning whose goal is to systematize and codify principles of valid reasoning. It has emerged from a study of the use of language in argument and persuasion and it is based on the identification and examination of those parts of language which are essential for these purposes. It is formal in the sense that it lacks reference to meaning. Thereby, it achieves versatility: it may be used to judge the correctness of a chain of reasoning (in particular, a “mathematical proof”) solely on the basis of the form (and not the content) of the sequence of statements, which make up the chain." --R. R. Stoll
It is also true that few data professionals understand what meaning in database management means—Bunardzik being an example.

But as McGoveran points out, were logic or the RDM to "totally abandon meaning", they would be unusable! As Stoll points out, it is only the formal language "essential for argument and persuasion" (i.e., deduction) that "lacks reference to meaning" and thereby "achieves versatility." It does not abandon meaning, it merely lacks reference to it. This applies only to the formal deductive part of logic, not to its equally important methods of formal interpretation (semantics)—which are essential to its application
the RDM is theory applied to database management. As I explained so many times, it is impossible to interrogate a database sensibly, ensure correct results and interpret them correctly without knowledge of the meaning assigned to relations and the database by their designer. That is why meaning should be documented within the system and accessible to users on demand (which it is not). What is more:
  • In the context of formal systems, syntax and semantics have very special meanings and the latter is not just "posh spelling" of the former: meaning is informal, semantics is formal.
  • All meaning that can be formally captured relationally is enforced as constraints.
  • The RDM can capture more of it than what the current, often erroneous, understanding would have you believe, certainly more than what SQL DBMSs--with which true RDBMSs are constantly confused—support[3]

Business Rules and Meaning


Relations are sets that represent in the database facts about object groups in the real world. Their meaning is conveyed by informal descriptions of the groups, expressed in natural language, that specify their defining properties[4]. I refer to such descriptions as business rules. I already explained[5] that, per McGoveran, there are several types of properties that rules specify to describe a group of related groups, which is what a database represents. The meaning of every relation and of the database as a whole is conveyed by the conjunction of the rules that specify all the properties of those types.

The following example is for explanatory purposes only and not intended to be realistic. Consider three object groups of interest
parts, suppliers and supplies—where a supply is a relationship between a supplier. Each group is represented by a relation, with
SUPPLIES {S#,P#,QTY}
representing the supplies group. The kind of informal English interpretation—meaning—that would usually be given for SUPPLIES is:
Supplier with supplier number S# supplies part with part number P# in quantity of QTY
which we shall refer to as the supply rule. First, it specifies only the three first order properties (1OP) shared by supply objects—supplier number, part number and quantityand the second order property (2OP) rule arising from the relationship between them. So it does not convey the full meaning of SUPPLIES, which includes rules that specify other 2OPs, if any, as well as third order properties (3OP). Second, natural language is ambiguous:
"Does this mean QTY is the packaging quantity? The only quantity ever supplied? Is the tuple recording a previously supplied quantity? Conclusion: "Supplier S# supplies part P# in quantity QTY" is ambiguous! It takes a lot more English verbiage to make the meaning clear." --David McGoveran
In short, the supply rule is a poor assignment of meaning to the relation.

Predicates and Constraints


Users familiar with the reality represented by SUPPLIES (as they well should be to use the database) believe they know what suppliers, parts and supplies are and, therefore, what the database meansi.e., what is intended to represent. They work around incompleteness and ambiguity by inferring meaning from familiarity, but their interpretation can easily be at odds of the designer's intended meaning and often is, which is the problem!

For example, all objects (including supplies) are distinguishable in the real world—i.e., uniquely identifiable—by a combination of one or more 1OPs. So if they know that for the period during which supply facts are recorded there is only one supply of a specific part by a specific supplier, users can infer the distinguishability rule:

Supplies are uniquely identified by the combination of supplier number and part number
This is a multi-object rule that specifies a 3OP arising from a relationship between all supplies in the group—uniqueness.

A DBMS does not understand informal meaning, like users do—it can only manipulate abstract symbols mathematically—so rules must be expressed formally. In the case of RDBMSs, that implies being expressed symbolically in first order predicate logic (FOPL), or an equivalently expressive formal language. The supply rule formalizes as a tertiary (3-place) FOPL predicate:

S( S#,P#,QTY )
But because the supply rule does not specify all the properties (not even all the 1OPs!), the S() predicate does not formalize the full meaning of SUPPLIES. A DBMS cannot make the kind of inferences users might and is, therefore, unaware of the uniqueness predicate corresponding to the distinguishability rule
{EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR NOT ((<S#,P#,QTY1>|SUPPLIES)
 AND (<S#,P#,QTY2>|SUPPLIES))}
where:

  • EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) is an equality binary predicate asserting two arbitrary tuples are equali.e., all their values are equal;
  • | symbolizes 'in', or 'belongs to';
  • <> set off a tuple;

Its informal English interpretation is "either two tuples are equal
i.e., the same tupleor they are not both in SUPPLIES".

It is the way to express tuple uniqueness in FOPL and the predicate can be added as a conjunct to the S() predicate:

S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR NOT [(<S#,P#,QTY1>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES)]}

Constraints and Database Consistency


Relation-specifying predicates corresponding to group-describing rules can be expressed in a form that, when declared to the DBMS in a declarative FOPL-base data sub-language, can be enforced by it to constrain relations to be consistent with the rules. For which reason we shall refer to expressions in that form constraints.

Here's the constraint expression of the conjunction of the S() with the uniqueness predicate:

FORALL (<S#,P#,QTY1>),(<S#,P#,QTY2>) 
[S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR NOT(<S#,P#,QTY2>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES))}]
the interpretation of which is "For all tuples <S#,P#,QTY1>  and all tuples <S#,P#,QTY2>, tuple <S#,P#,QTY1> is in the SUPPLIES relation and tuple <S#,P#,QTY1> satisfies S() , either tuple <S#,P#,QTY2>  equals
tuple <S#,P#,QTY1> or else it is not the case that both tuple <S#,P#,QTY1> is in  SUPPLIES and tuple <S#,P#,QTY2> is in SUPPLIES", another way of saying that there are only tuples with unique (S#,P#) values in SUPPLIES.


Note that while rules are expressed in natural language in real world terms (groups, properties, objects), constraints are expressed in FOPL in database terms (relations, domain/attributes, tuples).

The predicates corresponding to all the rules can be added as conjuncts in the same way. Thus:

  • The conjunction of constraints corresponding to all the 1OP, 2OP and 3OP rules describing an object group comprises the relation predicate (RP) of the relation, that represents in the database the formalized meaning--semantics--of the relation representing the group;
  • The conjunction of all RPs with the constraints corresponding with the 4OP rules comprises the database predicate (DBP) that represents the semantics of the database representing the group of groups.
Note: In Appendix A in [6] I provide a more detailed example of constraint formulation and verification, but the reader is warned that it is "proof of concept" rather than rigorous.


Integrity Constraints


To be enforced, FOPL constraints must be expressed and declared to the DBMS in the syntax of a data language native to it—in which case we refer to them as integrity constraints.

In the distinguishability case, an identifier is represented formally in the database by a key—a combination of one or more attributes that satisfies a uniqueness constraint in the relation. A primary key (PK) integrity constraint is both (1) a predicate that identifies the PK and (2) a uniqueness predicate satisfied by the PK, expressed in a specific data language.

Unfortunately, there is currently no declarative, relationally complete FOPL-based data language, but the shorthand in SQL is an example of possible syntax for a PK integrity constraint:

CREATE TABLE supplies
 (S# CHAR(3),
  P# CHAR(3),
  PRIMARY KEY (S#,P#),
  QTY(INTEGER));

To understand how integrity constraints lend themselves to algorithmic implementation for enforcement by a DBMS, it helps to think of the constraint


FORALL (<S#,P#,QTY1>),
FORALL (<S#,P#,QTY2>)
[S(S#,P#,QTY1) AND  {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR
NOT((<S#,P#,QTY2>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES))}]

in procedural terms as two nested loops.
"The first loop steps through all tuples in the SUPPLIES relation selecting a tuple for each pass, while the inner loop selects a second (not necessarily distinct) tuple. For each selection of specific values for the first tuple (the outer loop) and second tuple (the inner loop), the bracketed expression is evaluated. Notice that the tuple selected by the inner loop can differ from the first tuple only in the values of QTY1 and QTY2, with S# and P# being fixed until the  next iteration of the outer loop. So, this means that for a specific tuple with its values of S# and P#, there can be only one QTY value if both the first and second tuples belong in SUPPLIES." --David McGoveran
Note carefully that the constraint enforcement procedure is undertaken transparently by the DBMS and is implicit in the user's declarative syntax, such as the SQL shorthand.

Note: A PK integrity constraint can only be enforced by using such an algorithm. SQL systems implement it physically with a unique index. For each tuple candidate for insertion there are tuples already in the index (or none, in which case the first values can't be rejected)--these are the <S#,P#,QTY1> tuples--and the new tuple--<S#,P#,QTY2>. The integrity constraint is applied separately for each insert. This is yet one more reinforcement of the logical-physical confusion (LPC): the SQL user is lead to think in terms of the physical implementation rather than the logical constraint declaration. DBAs often drop the index, then wonder why enforcing uniqueness requires so much processing when they recreate it.

 
With the integrity constraint enforced, only tuples with unique (S#,P#) values can be inserted in SUPPLIES, consistent with both the supply and distinguishability rules.


Conclusion


A RDBMS enforces the integrity constraints corresponding to the RPs and DBP declared to it to constrain consistency with the business rules describing the reality represented by the database.

First, integrity constraints are not "RDBMS opinions"—that is nonsense—and the avoidance of reference to meaning in the deductive system implies neither that "integrity is a myth", nor "total abandonment of meaning". Meaning is conveyed by user business rules and integrity constraints are formal expressions thereof, which the DBMS can enforce algorithmically. As Erwin Smout commented, by enforcing them, the DBMS ensures database consistency with the meaning the users themselves—via the designer—assigned to it. And in this sense "semantic constraints" is not nonsense, even if the DBMS does not understand semantically, like users. Indeed, it is failure to appreciate the semantic function of constraints that is responsible for many misconceptions in the industry and the weak integrity support by SQL DBMSs[7], among them the mistaken illusion I often deplore that the meaning of a relation can be inferred from sheer inspection of a tabular display of a relation, or comments such as this[8]:

"I just took "Please stop insisting that Primary and Foreign keys are mandatory. They are good design habits but by no means mandatory" to be a sloppy way of saying that defined constraints are not mandatory."
Second, consistency with the full meaning of relations and the database is ensured by enforcing the integrity constraints corresponding to the RPs and DBP that correspond to all the rules and that is possible only with a declarative FOPL-based relationally complete data sub-language, which has not yet been implemented. Any relational constraint expressible in FOPL that is not expressible in the data language remains an integrity risk, unless enforcement is relegated to application or procedural code, the approach that database management was explicitly intended to end for a multiplicity of reasons.

Third,

"A name of a database object—whether of a database, relation, domain, attribute, or tuple—should never be treated as simply a reference to data to be manipulated, but as a mnemonic for its formal set definition by constraints and a reference to that set. The DBMS can then find and manipulate data members of those sets in the usual manner, but according to the expression that results after substitution of the appropriate defining integrity constraints for those names." --David McGoveran
Fourth, whether we like it or not,
"All semantics that can be formalized in FOPL--including verbs such as ‘supplies’—formalize as constraints. We use constraints to reduce the number of permissible interpretations—meanings—of the purely abstract FOPL deductive system. Codd did bring integrity to the forefront by making it an explicit component of a formal data model, but that does not mean that all the semantics that can be captured formally, have been captured in the standard description of RDM, let alone supported in products. In 1979 Codd described a way to "capture" meaning using the relational formalism beyond the then current understanding. That formalism doesn't tell you how to discover meaning, but if you have it, he shows (at least to some degree) how to express the corresponding semantics relationally." --David McGoveran
None of this is properly understood in the industry.


References



[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, (forthcoming)

[2] Stoll, R. R., SET THEORY AND LOGIC

[3] Codd, E. F., Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst. 4(4): 397-434 (1979)

[4] A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970)



[7] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment