”If we step back and look at what RDBMS is, we’ll no doubt be able to conclude that, as its name suggests (i.e., Relational Database Management System), it is a system that specializes in managing the data in a relational fashion. Nothing more. Folks, it’s important to keep in mind that it manages the data, not the MEANING of the data! And if you really need a parallel, RDBMS is much more akin to a word processor than to an operating system. A word processor (such as the much maligned MS Word, or a much nicer WordPress, for example) specializes in managing words. It does not specialize in managing the meaning of the words ... So who is then responsible for managing the meaning of the words? It’s the author, who else? Why should we tolerate RDBMS opinions on our data? We’re the masters, RDBMS is the servant, it should shut up and serve. End of discussion.” --Alex Bunardzic, Should Database Manage The Meaning?
It helps to consider this quote in the context of
another article by the author, The Myth of Data Integrity, where he
reveals that those "DBMS opinions" are constraints (the
article has been deleted, but a few comments remain online and are highly
recommended for a feel of the consequences of lack of foundation
knowledge).
Hugh Darwen, probably
unaware of that article, concurred with Bunardzik:
"Couldn't agree more. The strength of the relational model lies in its total abandonment of meaning. BTW, I detest the term "semantic constraint" that some people use, imagining that some constraints are to do with meaning and others are not. They are all just constraints. The word "meaning" is bandied about sometimes in ways that make me wonder if some people don't know what it means, whether they spell it that way or the posh way, "semantics."
Now, database relations can indeed represent facts about anything, and a proper relational algebra (RA) manipulates them mathematically as sets (just like numeric algebra manipulates numbers) without reference to meaning. This is due to a portion of simple set theory being expressible in first order predicate logic (SST/FOPL). Logic is
“... an analytical theory of the art of reasoning whose goal is to systematize and codify principles of valid reasoning. It has emerged from a study of the use of language in argument and persuasion and it is based on the identification and examination of those parts of language which are essential for these purposes. It is formal in the sense that it lacks reference to meaning. Thereby, it achieves versatility: it may be used to judge the correctness of a chain of reasoning (in particular, a “mathematical proof”) solely on the basis of the form (and not the content) of the sequence of statements, which make up the chain.” –R. R. Stoll, LOGIC
But as McGoveran points out,
“Were logic (and RDM) to totally abandon meaning, it would be unusable! Logic is only the formal language "essential for argument and persuasion" (i.e., deduction). It merely “lacks reference to meaning” and thereby "achieves versatility", it does not abandon it! This applies only to the formal deductive part of logic, not to its equally important methods of formal interpretation (semantics)—which are essential to its application—the RDM is abstract formal theory applied to database management.”
It is impossible to design and interrogate a database sensibly, and ensure semantic consistency of results with the CM it is intended to represent without users and DBMS knowledge of the meaning assigned to the database by the CM—its interpretation. But BRs expressed in natural language are not computable (i.e., accessible to a DBMS). Predicates are formal expressions of BRs in FOPL: a true RDBMS supports a FOPL-based data sublanguage, in which predicates can be expressed as constraints—computable expressions of the BRs. The name derives from constraining the data in the database to be consistent with the corresponding CM. The meaning in the BRs that can be captured in FOPL as predicates and are expressible in the data sublanguage as constraints can be enforced algorithmically by the DBMS on the data for semantic consistency—a formal way of saying faithful representation of the CM.
This is why (1) the meaning should be documented within the database and accessible to users on demand (which it is not) and (2) a proper RA should include the semantic constraints expressions (not just the data) in its operations (which it does not).
Business Rules and Predicates
Conceptual modeling formulates BRs that define object—entity, group, and multigroup—types by specifying the required properties that qualify those objects as instances of those types.
Each group of entities of a single type in a CM is represented formally at the logical level by a database relation. For each relation, BRs express in natural language the meaning assigned to the relation by the modeler (i.e., its interpretation) as understood by users.
Consider a CM consisting of a three-group multigroup—two groups of unitary entities, Parts and Suppliers, one of associative entities Supplies (associating parts and suppliers), and a three-way relationship between them. The following BR defines the Supplies group of associative entities by specifying their three first order properties in context (1OPiCs):
Supplier with supplier number (S#) supplies part with part number (P#) in quantity of (QTY).
Natural language is informal, often ambiguous (is QTY the packaging quantity? the only quantity ever supplied? is the tuple recording a previously supplied quantity?) It takes a lot more English verbiage to make meaning clear/precise. Users familiar with the context (the business reality which the database is intended to represent) may believe they can guess around ambiguities, but their interpretation can easily be at odds with that of the modeler.
A DBMS cannot make such guesses. It has no context, does not understand what parts, suppliers, and supplies are, and can only manipulate abstract symbols mathematically. Informal BRs must be formalized for computability (i.e., expressed in a form that can be implemented as an algorithm "understood" by a DBMS. FOPL formal expressions of BRs can serve as templates for constraint expressions in FOPL-based data sublanguages native to RDBMSs.
For example, the supplies BR can be formalized in FOPL as the tertiary (3-place) predicate:
S(S#,P#,QTY)
where S#, P#, and QTY symbolize 1OPiCs, and the predicate S() symbolizes the Supplies group arising from the relationship between Suppliers and Parts groups. Assume, for a moment, that the BR is the full interpretation of the SUPPLIES relation:
SUPPLIES (S#,P#,QTY)
that
represents the Supplies group, defined on three domains, with value ranges of
the three attributes consistent with the shared 1OPiCs of supplies. Then the
three-place predicate would be the DBMS full "interpretation" of the
relation, with the consequence that the DBMS would deem any set of three
attribute values drawn from the domains is a valid tuple representing a true
fact about a supply and accept it in the database. Obviously that wouldn't be
correct. We know, for example, that all entities in the real world are uniquely
identifiable by a combination of one or more 1OPiCs (S# and P# in the case of
supplies), yet based on the S() predicate the DBMS would accept duplicate
tuples in the database.
The problem, of course, is that the supply BR is not
the full meaning of the relation, which, as mentioned, is the conjunction of
BRs specifying the individual defining first order properties in (group)
context (1OPiCs) and required second order (2OPs), and collective required third order properties (3OPs) for group
membership, all of which must be formalized as parts of the relation
predicate (RP) associated with SUPPLIES relation. Otherwise put, relation
data must be semantically constrained for consistency with the
meaning assigned in the CM—its interpretation. The RP is the formal symbolic
expression of it (which corresponds to the group membership criterion).
For example, supply uniqueness is a group collective 3OP due to a relationship between all supplies that can be expressed as an identity BR in English:
Supplies are uniquely identified by (supplier number, part number).
that has a somewhat cumbersome predicate expression in FOPL:
{EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR NOT[(<S#,P#,QTY1> Î SUPPLIES)
AND (<S#,P#,QTY2>|SUPPLIES)]}
where:
· EQ() is an equality binary (two-place) predicate;
· Î symbolizes 'in', or 'belongs to';
· <> sets off a tuple;
the English interpretation of which is rather cumbersome:
Either tuple s1,s2 are equal (i.e., are the same tuple), or they are not both members of SUPPLIES.
a FOPL way of saying that there are only tuples with unique (S#,P#)
values in SUPPLIES.
The RP would have to contain the conjunction of the S() and uniqueness predicate, as follows:
S(S#,P#,QTY1) AND
{EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR NOT [(<S#,P#,QTY1>|SUPPLIES) AND (<S#,P#,QTY2> Î SUPPLIES)]}
Note: More precisely, the expression of the conjunction is:
FORALL
(<S#,P#,QTY1>),(<S#,P#,QTY2>)
[S(S#,P#,QTY1)
AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR
NOT(<S#,P#,QTY2>|SUPPLIES) AND (<S#,P#,QTY2> ÎSUPPLIES))}]
the informal English interpretation of which is
For
all tuples <S#,P#,QTY1> and all tuples <S#,P#,QTY2>,
tuple
<S#,P#,QTY1> is in the SUPPLIES relation and tuple <S#,P#,QTY1> satisfies
S(), either tuple <S#,P#,QTY2> equals tuple <S#,P#,QTY1> or else it
is not the case that both tuple <S#,P#,QTY1> is in SUPPLIES and
tuple <S#,P#,QTY2> is in SUPPLIES.
All BRs defining a group that jointly express the full interpretation of the corresponding relation (the membership criterion) by specifying the required 1OPiCs, 2OPs, and 3OPs can be similarly formalized symbolically as predicates and ANDed together into a RP.
Predicates and Constraints
True RDBMSs—unavailable today—would usually not implement FOPL directly, but rather FOPL-based data sublanguages with more "user-friendly" syntax. Predicates would be expressed in such DBMS-specific languages would be implemented algorithmically as semantic constraints that constrain database relations for consistency with the CM/BRs.
To understand how constraints lend themselves to algorithmic implementation for enforcement by a DBMS, it helps to think of the uniqueness constraint in procedural terms as two nested loops.
FORALL (<S#,P#,QTY1>),
FORALL (<S#,P#,QTY2>)
[S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR
NOT((<S#,P#,QTY2> ÎSUPPLIES) AND (<S#,P#,QTY2> Î SUPPLIES))}]
"The first loop steps through all tuples in the SUPPLIES relation selecting a tuple for each pass, while the inner loop selects a second (not necessarily distinct) tuple. For each selection of specific values for the first tuple (the outer loop) and second tuple (the inner loop), the bracketed expression is evaluated. Notice that the tuple selected by the inner loop can differ from the first tuple only in the values of QTY1 and QTY2, with S# and P# being fixed until the next iteration of the outer loop. So, this means that for a specific tuple with its values of S# and P#, there can be only one QTY value if both the first and second tuples belong in SUPPLIES." --David McGoveran
For an example of expressing a predicate in a data sublanguage as a constraint, the SQL syntax for the uniqueness predicate is:
CREATE TABLE ... PRIMARY KEY ...
notwithstanding that SQL is not relational, and not a proper data sublanguage.
Note: A PK constraint can be enforced only by using such an algorithm. SQL systems implement it physically with a unique index. For each tuple candidate for insertion there are tuples already in the index (or none, in which case the first values can't be rejected)—these are the <S#,P#,QTY1> tuples -- and the new tuple -- <S#,P#,QTY2>. The constraint is enforced separately for each insert.
This a reinforcement of the logical-physical confusion (LPC): the SQL user is led to think in terms of the physical implementation rather than the logical constraint declaration. DBAs often drop the index, then wonder why enforcing uniqueness requires so much processing when they recreate it.
Conclusion
Constraints are not "RDBMS opinions"—that is nonsense: without reference to meaning in the deductive system implies neither "total abandonment of meaning", nor "integrity is a myth". Constraints are predicates expressed in a data sublanguage, which are formal expressions of BRs—the meaning assigned by the modeler to the database as understood by users! By enforcing constraints, the DBMS ensures semantic consistency of the database with the CM/BRs—database faithful representation thereof.
It is not nonsense that all constraints are semantic: it is precisely because the DBMS does not understand semantics that BRs are formalized as constraints, so that they can be implemented algorithmically and enforced by the DBMS. Indeed, it is failure to appreciate the semantic function of constraints that is responsible for many misconceptions in the industry, and the weak integrity support by SQL DBMSs. Among them is the illusion that the meaning of a relation can be inferred from sheer inspection of its tabular display (precisely because constraints are not part of tables), or comments such as:
“Please stop insisting that Primary and Foreign keys are mandatory. They are good design habits, but by no means mandatory.”
Any BR expressible in FOPL as a predicate, but not in the DBMS data sublanguage, or not declared to the DBMS, is an integrity risk, with enforcement relegated to procedural or application code—what the RDM was explicitly intended to reduce for a multiplicity of reasons.
“A name of a database object—whether of a domain, attribute, tuple, relation, or database—should never be treated as simply a reference to data to be manipulated, but as a mnemonic for its formal set definition by constraints, and a reference to that set. The DBMS can then find and manipulate data members of those sets in the usual manner, but according to the expression that results after substitution of the appropriate defining constraints for those names.” --David McGoveran
Whether we like it or not,
“All semantics that can be formalized in FOPL—including verbs—formalize as constraints. We use constraints to reduce the number of permissible interpretations—meanings—of the purely abstract FOPL deductive system. Codd did bring integrity to the forefront by making it an explicit component of a formal data model, but that does not mean that all the semantics that can be captured formally have been captured in the standard description of RDM, let alone supported in products. In 1979 Codd described a way to "capture" more meaning using the relational formalism beyond the then current understanding. That formalism doesn't tell you how to discover meaning, but if you have it, he shows (at least to some degree) how to express the corresponding semantics relationally.” --David McGoveran
None of this, if any, is properly understood in the industry.
No comments:
Post a Comment