Sunday, June 11, 2017

What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency




Note: This is a 10/23/17 rewrite of a 7/29/12 post to bring it in line with the McGoveran interpretation [1] of Codd's true RDM.

To understand what's wrong with the picture of two weeks ago, namely:
"If we step back and look at what RDBMS is, we’ll no doubt be able to conclude that, as its name suggests (i.e., Relational Database Management System), it is a system that specializes in managing the data in a relational fashion. Nothing more. Folks, it’s important to keep in mind that it manages the data, not the MEANING of the data! And if you really need a parallel, RDBMS is much more akin to a word processor than to an operating system. A word processor (such as the much maligned MS Word, or a much nicer WordPress, for example) specializes in managing words. It does not specialize in managing the meaning of the words ... So who is then responsible for managing the meaning of the words? It’s the author, who else? Why should we tolerate RDBMS opinions on our data? We’re the masters, RDBMS is the servant, it should shut up and serve. End of discussion." --Alex Bunardzic, Should Database Manage The Meaning?
it helps to consider the quote in the context of another article by the author, "The Myth of Data Integrity", where he reveals that those "DBMS opinions" are constraints (the article has been deleted, but a few comments remain online and are highly recommended for a feel  of the consequences of lack of foundation knowledge).

--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
---------------------------------------------------------------------------------

Hugh Darwen, probably unaware of that article, concurred with Bunardzik :
"Couldn't agree more. The strength of the relational model lies in its total abandonment of meaning. BTW, I detest the term "semantic constraint" that some people use, imagining that some constraints are to do with meaning and others are not. They are all just constraints. The word "meaning" is bandied about sometimes in ways that make me wonder if some people don't know what it means, whether they spell it that way or the posh way, "semantics."
Now, relations can indeed represent facts about anything and a RDBMS manipulates them mathematically as sets, (just like arithmetic manipulates numbers) independently of what they mean. This is due to its foundation in logic, which is [2]:
"... an analytical theory of the art of reasoning whose goal is to systematize and codify principles of valid reasoning. It has emerged from a study of the use of language in argument and persuasion and it is based on the identification and examination of those parts of language which are essential for these purposes. It is formal in the sense that it lacks reference to meaning. Thereby, it achieves versatility: it may be used to judge the correctness of a chain of reasoning (in particular, a “mathematical proof”) solely on the basis of the form (and not the content) of the sequence of statements, which make up the chain." --R. R. Stoll
But it is also true that few data professionals understand what meaning in database management means -- Bunardzik being one example.

As McGoveran points out, were the RDM (and logic) to "totally abandon meaning", it would be unusable! Logic is only the formal language "essential for argument and persuasion" (i.e., deduction). It merely "lacks reference to meaning" and thereby "achieves versatility", it does not abandon it. This applies only to the formal deductive part of logic, not to its equally important methods of formal interpretation (semantics) -- which are essential to its application -- the RDM is theory adapted and applied to database management. As I explained so many times, it is impossible to  design and interrogate a database sensibly, ensure semantic correctness of results and interpret them correctly without knowledge of the meaning assigned to relations and the database by their designer. That is why meaning should be documented within the system and accessible to users on demand (which it is not). What is more: In the context of formal systems, syntax and semantics have very special meanings and the latter is not just "posh spelling" of the former: meaning is informal, semantics is formal.


Meaning is denoted by business rules. Constraints are computable versions of logic predicates -- FOPL formalizations of the rules that are enforced by the DBMS. Consequently, all meaning that can be formally captured relationally is enforced as constraints. The RDM can capture more of it than what its current, often erroneous, understanding would have you believe, certainly more than what SQL DBMSs -- with which true RDBMSs are constantly confused -- support [3].


Relation Interpretation: Business Rules


A relation is a formal database representation of an object group -- a set of tuples each of which represents a fact about a group member object.

Conceptual modeling defines a group by formulating business rules of several types that specify the property requirements that objects must satisfy individually and collectively for group membership [4,5]. The rules are the interpretation of the relations -- the real world meaning that the database designer assigns to it.

The following example is for explanatory purposes and not intended to be realistic. Consider three object groups of interest, parts, suppliers and supplies, where supplies are three-way relationships between suppliers, parts and supplied quantities. The following rule specifies the three first order (individual) properties (1OP) shared by members of the supply group and the second order property (2OP) arising from the relationship between them:

Supplier with supplier number (S#) supplies part with part number (P#) in quantity of (QTY).
Rules are expressed informally in natural language, which is ambiguous (is QTY the packaging quantity? The only quantity ever supplied? Is the tuple recording a previously supplied quantity? It takes a lot more English verbiage to make the meaning clear). Users familiar with the context -- the reality represented by the database (as they well should be to use the database) -- may believe they can work around ambiguities by inferring meaning from familiarity, but their interpretations can easily be at odds of the intended meaning of the database designer and often are.


Constraints as Predicates


A DBMS cannot make such inferences. It has no context, does not understand natural language as users do (i.e., semantically) and can only manipulate abstract symbols mathematically, so informal rules must be formalized symbolically for computability (i.e., expressed in a form that can be implemented as an algorithm) [6]. This is what FOPL, or an equivalently expressive formal language, is for. The supply rule can be formalized in FOPL as the tertiary (3-place) predicate:

S(S#,P#,QTY)
where S#, P# and QTY# symbolize 1OPs and the predicate S() symbolizes the supply group as a three-way 2OP arising from their relationship.

Assume, for a moment, that the supply rule is the full interpretation of the relation

SUPPLIES {S#,P#,QTY}
that represents the supply group, defined on three domains with value ranges consistent with the three shared 1OPs of supply objects. Then the S() predicate would be the DBMS full "interpretation" of the relation, with the consequence that the DBMS would deem any set of three attribute values drawn from the domains a valid tuple representing a fact about a supply object and accept it in the database.

Obviously that wouldn't be correct. We know, for example, that all objects in the real world are distinguishable (i.e., uniquely identifiable) by a combination of one or more 1OPs (supplier number and part number in the case of supplies), yet based on the S() predicate the DBMS would accept duplicate tuples in the database.

The problem, of course, is that the supply rule is not the full meaning of the relation, which, as mentioned, is the conjunction of the rules specifying the 1OPs, 2OPs and 3OPs required for group membership, all of which must be formalized as parts of the relation predicate (RP) associated with SUPPLIES. Otherwise put, the relation defined on domains must be constrained to be consistent with its full meaning.

The collective distinguishability property can be expressed as an identity rule

Supplies are uniquely identified by supplier number and part number.
It is a collective 3OP -- uniqueness -- arising from a relationship between all supplies that is required for membership in the group. The rule formalizes in FOPL as
{EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR NOT((<S#,P#,QTY1>|SUPPLIES)
 AND (<S#,P#,QTY2>|SUPPLIES))}
where:
  • EQ() is an equality binary (two-place) predicate;
  • | symbolizes 'in', or 'belongs to';
  • <> set off a tuple;
The informal English interpretation is
Either tuples t1,t2 are equal (i.e., are the same tuple), or they are not both members of SUPPLIES.
Here's the conjunction of S() with the constraint:
S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR NOT [(<S#,P#,QTY1>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES)]}
It constrains the relation to be consistent with not just the supply, but also the identity rule: based on it, the DBMS does not permit duplicate tuples.

All the rules jointly defining an object group by specifying the required 1OPs, 2OPs and 3OPs can be similarly formalized as predicates and ANDed into a RP expressed as constraints, the enforcement of which ensures consistency with the full meaning of the relation. 


Note: More precisely, the constraint expression of the conjunction of the S() with the uniqueness predicate is:

FORALL (<S#,P#,QTY1>),(<S#,P#,QTY2>)
[S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR NOT(<S#,P#,QTY2>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES))}]
the informal English interpretation of which is
For all tuples <S#,P#,QTY1> and all tuples <S#,P#,QTY2>,
tuple <S#,P#,QTY1> is in the SUPPLIES relation and tuple <S#,P#,QTY1> satisfies S(), either tuple <S#,P#,QTY2> equals tuple <S#,P#,QTY1> or else it is not the case that both tuple <S#,P#,QTY1> is in  SUPPLIES and tuple <S#,P#,QTY2> is in SUPPLIES.
a FOPL way of saying that there are only tuples with unique (S#,P#) values in SUPPLIES.
 

Constraint Enforcement for Consistency


To be enforced, constraints must be expressed and declared to the DBMS using the Data Definition Language (DDL) component of its data language. To understand how constraints lend themselves to algorithmic implementation for enforcement by a DBMS, it helps to think of the constraint in procedural terms as two nested loops:
FORALL (<S#,P#,QTY1>),
FORALL (<S#,P#,QTY2>)
[S(S#,P#,QTY1) AND  {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR
NOT((<S#,P#,QTY2>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES))}]
"The first loop steps through all tuples in the SUPPLIES relation selecting a tuple for each pass, while the inner loop selects a second (not necessarily distinct) tuple. For each selection of specific values for the first tuple (the outer loop) and second tuple (the inner loop), the bracketed expression is evaluated. Notice that the tuple selected by the inner loop can differ from the first tuple only in the values of QTY1 and QTY2, with S# and P# being fixed until the  next iteration of the outer loop. So, this means that for a specific tuple with its values of S# and P#, there can be only one QTY value if both the first and second tuples belong in SUPPLIES." --David McGoveran
The constraint enforcement procedure can be undertaken transparently. For example, it is implicit in the SQL declarative shorthand
CREATE TABLE ... PRIMARY KEY ...
Note: A PK constraint can only be enforced by using such an algorithm. SQL systems implement it physically with a unique index. For each tuple candidate for insertion there are tuples already in the index (or none, in which case the first values can't be rejected) -- these are the <S#,P#,QTY1> tuples -- and the new tuple -- <S#,P#,QTY2>. The constraint is applied separately for each insert. This is yet one more reinforcement of the logical-physical confusion (LPC): the SQL user is led to think in terms of the physical implementation rather than the logical constraint declaration. DBAs often drop the index, then wonder why enforcing uniqueness requires so much processing when they recreate it.
 

Conclusion


A RDBMS enforces constraints declared as part of RPs and database predicate (DBP) to ensure database consistency with the business rules or, in other words, faithful representation of the conceptual model of reality.

First, constraints are not "RDBMS opinions" -- that is nonsense -- and the avoidance of reference to meaning in the deductive system implies neither that "integrity is a myth", nor "total abandonment of meaning". Meaning is conveyed by user business rules and constraints are formal expressions thereof, which the DBMS can enforce algorithmically. As Erwin Smout commented, by enforcing them, the DBMS ensures database consistency with the meaning the users themselves -- via the database designer -- assigned to it. In this sense "semantic constraints" is not nonsense, even if the DBMS does not understand semantics. Indeed, it is failure to appreciate the semantic function of constraints that is responsible for many misconceptions in the industry and the weak integrity support by SQL DBMSs [7], among them the mistaken illusion I often deplore that the meaning of a relation can be inferred from sheer inspection of its tabular display, or comments such as  [8]: "I just took "Please stop insisting that Primary and Foreign keys are mandatory. They are good design habits but by no means mandatory" to be a sloppy way of saying that defined constraints are not mandatory."

Second, consistency with the full meaning of relations and the database is ensured by enforcing the constraints comprising the RPs and DBP that correspond to all the rules and that is possible only with a declarative FOPL-based data sub-language, and a relationally complete one has not been implemented. Any relational constraint expressible in FOPL that is not expressible in the data sub-language is an integrity risk, unless enforcement is relegated to application or procedural code, the approach that database management was explicitly intended to end for a multiplicity of reasons.

Third,

"A name of a database object -- whether of a database, relation, domain, attribute, or tuple -- should never be treated as simply a reference to data to be manipulated, but as a mnemonic for its formal set definition by constraints and a reference to that set. The DBMS can then find and manipulate data members of those sets in the usual manner, but according to the expression that results after substitution of the appropriate defining integrity constraints for those names." --David McGoveran
Fourth, whether we like it or not,
"All semantics that can be formalized in FOPL -- including verbs -- formalize as constraints. We use constraints to reduce the number of permissible interpretations -- meanings -- of the purely abstract FOPL deductive system. Codd did bring integrity to the forefront by making it an explicit component of a formal data model, but that does not mean that all the semantics that can be captured formally have been captured in the standard description of RDM, let alone supported in products. In 1979 Codd described a way to "capture" meaning using the relational formalism beyond the then current understanding. That formalism doesn't tell you how to discover meaning, but if you have it, he shows (at least to some degree) how to express the corresponding semantics relationally." --David McGoveran
None of this is properly understood in the industry.

References



[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, (forthcoming)

[2] Stoll, R. R., SET THEORY AND LOGIC

[3] Codd, E. F., Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst. 4(4): 397-434 (1979)

[4] A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970)



[7] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER

No comments:

Post a Comment

View My Stats