This is
the second part of a three-part tutorial article on the subject of integrity
constraints. In Part 1,
after a number of preliminaries, I discussed a set of six examples, giving in
each case a natural-language statement and a formal expression of the
constraint in question. Now I want to
move on to introduce some simple but fundamental concepts and terminology.
Let's
return to the first example from Part I (Every supplier status value is in the
range 1 to 100 inclusive). Here again
is the precise formulation:
FORALL s#*S#,sn*NAME,st*INTEGER,sc*CHAR
(IF {S# s#,SNAME sn,STATUS
st,CITY sc}*S
THEN st*1 AND st*100)
As I pointed out in Part I, this
constraint involves just a single relvar; in fact, it involves just one
variable of any kind--namely, the suppliers relvar S. Let me immediately qualify that remark! In
fact, of course, the constraint does also involve certain
so-called "bound" variables: namely, s#, sn, st, and sc. However, bound
variables aren't variables in
the usual programming language sense--instead, they're variables in the sense
of logic. You can think of a bound variable as a kind of "dummy"
variable, in a sense. For
consider:
·
If we replace all appearances within the formal
expression shown above of, e.g., the symbol st by any other symbol, say the
symbol xyz, the constraint remains logically unchanged.
·
By contrast, the same is certainly not true if we
replace all appearances of, e.g., the symbol S--which denotes a
"nondummy" variable, of course--by, say, the symbol SP.
From this point forward, I'll take
the term 'variable' to mean a variable in the usual programming language sense
specifically, not a variable in the sense of logic (barring explicit statements
to the contrary, of course).
To say it again, then, the formal
expression of the constraint involves a variable, S. Thus, although that
expression is indeed truth-valued, we can't say what its value is--i.e.,
we can't say what truth-value it yields--until we substitute a value for
that variable. (Indeed, different
substitutions will yield different truth-values, in general.) In other words, the expression is a
predicate. What's a predicate? In general, a predicate is a truth-valued
function (i.e., a function that, when it's invoked, returns a truth value);
it has a set of parameters (as do all functions), and an argument value
has to be supplied for each such parameter when the function is invoked
("when the predicate is instantiated," as the logicians say). In the case at hand, then,
when we want to
instantiate the predicate--which is to say, when we want to check the constraint--we
supply as argument (the sole argument, as it happens) the relation
that's the current value of relvar S, and the expression can then be
evaluated.
Now, when we do instantiate that
predicate, and in effect replace the sole parameter by some argument, we wind
up with a truth-valued expression that involves no variables at all, only
values. And, of course, analogous
remarks apply to constraints involving two, three, four, ..., or any number of
relvars; in all cases, when we need to evaluate the expression (i.e., when we
need to check the constraint), we replace each parameter by the current value
of the applicable relvar, and what we wind up with is an expression that
involves no variables at all.
In logic, a truth-valued
expression that involves no variables at all is said to be a proposition.
A proposition thus evaluates to either true or false,
unequivocally (it can be thought of as a degenerate predicate--i.e., a
predicate that involves zero variables).
Here are a few simple examples:
·
The sun is a star.
·
The moon is a star.
·
The sun is further away from us than the moon.
·
George W. Bush won the US presidential election in the
year 2000.
I'll leave it to
you to decide which of these propositions evaluate to true and which to false. Do
please note, however, that not all
propositions are ones that evaluate to true (to think otherwise is a
mistake all too easily made).
Terminology: Propositions are also said to be
closed
expressions, while expressions that involve at least one variable are said
to be open expressions. In general, therefore, propositions are closed
expressions and predicates are open ones (except for the degenerate case in
which the predicate in question is in fact a proposition).
Anyway, it should
be clear from all of the above that, logically speaking, a constraint is a
predicate--but when the constraint is checked, argument values are
substituted for the parameters, thereby reducing the predicate to a
proposition, and that proposition is then required to evaluate to true.
Of course, any
given relvar will be subject to many constraints, in general. Let R be a relvar. Then the
relvar predicate for R
is the 'logical AND' or conjunction of all of the constraints that apply
to (i.e., mention) that relvar R.
Please don't get confused here!--each individual constraint is a
predicate in its own right, as we already know, but "the" relvar
predicate is the conjunction of all of those individual predicates. For example, if we assume for
simplicity
that the six constraints discussed in Part I of this article are the only
constraints that apply to the suppliers and parts database (apart from a
priori ones), then the relvar predicate for suppliers is the conjunction of
Numbers 1, 2, 4, 5, and 6, and the relvar predicate for shipments is the
conjunction of Numbers 5 and 6. Note
that these two relvar predicates "overlap" each other, in a sense,
inasmuch as they have certain constituent constraints in common.
Note: I might possibly be accused of moving the
goal posts here, slightly. In my book AN INTRODUCTION TO DATABASE
SYSTEMS,
I defined the relvar predicate for relvar R to be the conjunction of all
relvar constraints that applied to R (where, as explained in Part
I, the term 'relvar constraint' means a constraint that refers to just one
relvar). In THE THIRD MANIFESTO, by
contrast, Hugh Darwen and I refined that definition; to be specific, we now
define the relvar predicate for relvar R to be the conjunction of all
constraints, not just all relvar constraints, that apply to R. Apologies to anyone
who might find this
shift in terminology confusing.
Now let R
be a relvar, and let RP be the relvar predicate for R. Clearly, then, R must
never be
allowed to have a value that, when it's substituted for R in RP
(and when any other necessary argument-for-parameter substitutions have also
been made in RP), causes RP to evaluate to false. In fact, I can now introduce
"The
Golden Rule" (or the first version of that rule, at any rate):
No update
operation must ever assign to any relvar a value that causes its relvar
predicate to evaluate to false.
Now let D
be a database, and let D contain relvars R1, R2, ...Rn
(only). Let the relvar predicates for
those relvars be RP1, RP2, ..., RPn, respectively. Then the database predicate for
D--DP,
say--is the conjunction of all of those relvar predicates:
DP * RP1
AND RP2 AND ... AND RPn
As an aside, let
me remind you that (as we've seen) two distinct relvar predicates RPi
and RPj might have certain constituent constraints in common. It follows that the very
same constraint
might appear many times over in the database predicate DP. From a logical point of view, of
course,
there's no harm in this state of affairs, because if c is a constraint,
then c AND c is logically equivalent to just c--but I
naturally hope the system would be smart enough in such a situation to
evaluate c once and not twice.
Here then is the
extended (more general, and final) version of the Golden Rule:
No update
operation must ever assign to any database a value that causes its database
predicate to evaluate to false.
Of course, a
database predicate will evaluate to false if and only if at least one of
its constituent relvar predicates does so too.
Likewise, a relvar predicate will evaluate to false if and only
if at least one of its constituent constraints does so too.
I'll have more to
say about The Golden Rule in a few moments. First, however, I want to consider the question
of what
everything I've said so far implies for the practical issue of actually doing
constraint checking. Once again, let's
return to our very first example (Every supplier status value is in the range 1
to 100 inclusive):
FORALL s#*S#,sn*NAME,st*INTEGER,sc*CHAR
(IF {S# s#,SNAME sn,STATUS
st,CITY sc}*S
THEN st*1 AND st*100)
As I pointed out in Part I of this
article, this constraint is effectively saying that IF a certain tuple
appears in relvar S, THEN that tuple has to satisfy a certain condition
("status in the range 1 to 100," in the case at hand). So, apparently, if we try to
insert a new
supplier tuple with status (say) 200, the sequence of events has to be:
1.
Insert the new tuple.
2.
Check the constraint.
3.
Undo the update (because the
check fails).
But this is absurd! Clearly, we would like to catch
the error before
the INSERT is done in the first place.
So what the implementation clearly has to do is use the formal
expression of the constraint to infer the appropriate check(s) to be
performed on tuples that are presented for insertion. I don't want to get into details of what's
involved in that
inference process here. However, I
think you can at least see that if the database predicate includes a constraint
of the form
IF {S# s#,SNAME sn,STATUS st,CITY sc}*S
THEN ...
-- i.e., if the antecedent (the
piece between the IF and the THEN) in some implication within the overall
predicate is of the form "some tuple appears in S"--then the
consequent (the piece after the THEN) in that implication is essentially a
constraint on tuples that are presented for insertion into relvar S.
I remark too that if the database
is designed in accordance with the Principle of Orthogonal Design--see
the book mentioned a couple of times already, AN INTRODUCTION TO
DATABASE SYSTEMS--then any tuple presented for insertion
will be a plausible INSERT candidate for at most one relvar in the
database (implying that the tuple will have to be checked against the relvar
predicate for at most one relvar in that database). But now I'm beginning to stray into an area
that deserves a
sizable discussion of its own, and I think I'd better leave it to a subsequent
article.
Back to theGolden Rule.Now, I didn't point out the fact
explicitly before, but you might have
realized for yourself that the rule as stated implies that all checking is
immediate. Why? Because it talks in terms of update
operations--in other words, INSERT, DELETE, and UPDATE operations, loosely
speaking--and not in terms of transactions (see below). In effect, therefore, theGolden
Rulerequires integrity constraints to be satisfied at statement boundaries,
and there's no notion of "deferred" or COMMIT-time integrity checking
at all.
In order to explain the foregoing
properly, I first need to say a little more about transactions. Of course, transactions are
another big
topic in their own right, even if we limit our attention (as I want to do here)
to their integrity aspects specifically, so what follows is only the briefest
of brief sketches. I plan to elaborate
on the points involved in another subsequent article.
First of all, then, I do assume
you're familiar with the transaction concept: in particular, with the so-called
"ACID properties" of transactions. ACID is an acronym for atomicity
- consistency - isolation - durability.
Atomicity means transactions are "all or nothing." Consistency means they transform a
consistent state of the database into another consistent state, without
necessarily preserving consistency at all intermediate points. Isolation means that any given
transaction's
updates are concealed from all other transactions, until the given transaction
commits. Durability means that once a
transaction commits, its updates survive in the database, even if there's a
subsequent system crash. The standard
reference on transactions is the highly recommended book TRANSACTION
PROCESSING: CONCEPTS AND TECHNIQUES, by Jim Gray and Andreas Reuter (Morgan
Kaufmann, 1993).
Back to the Golden Ruleonce
again. As you'll surely realize, the
idea that all checking must be immediate, not deferred, is a very unorthodox
one. After all, the SQL standard
certainly includes support for deferred checking, and I believe there's at
least one commercial product--namely, IBM's SQL/DS--that implements such a
capability (in fact, I believe SQL/DS actually allows checking to be deferred
past COMMIT time, which isn't allowed in the standard). Moreover, one of the arguments in favor of
the transaction concept has always been that transactions are supposed to be
(among other things) a unit of integrity: As I said a few moments ago, they're supposed to
"transform
a consistent state of the database into another consistent state, without
necessarily preserving consistency at all intermediate points." So do I contradict myself?
Very well, I contradict myself. To be specific, I no longer believe in
transactions as a "unit of integrity"; instead, I now think statements
have to be that unit.
So why have I changed my
mind? I have at least three reasons,
but easily the biggest is as follows.
As I've written elsewhere, a database can be regarded as a collection
of propositions, assumed by convention to be ones that evaluate to true. And if that
collection is ever
allowed to include any inconsistencies, then all bets are off! You
can never trust the answers you get from an inconsistent database; in fact, you
can get absolutely any answer whatsoever from such a database (indeed,
it's easy to prove that you can get absolutely any answer you like--or
don't like--from such a database).
While proper concurrency control, or in other words the isolation or
"I" property of transactions, can mean that no more than one
transaction ever sees any particular inconsistency, the fact remains
nonetheless that that particular transaction does see the inconsistency
and can thus produce wrong answers.
(Indeed, it's precisely because inconsistencies can't be tolerated, not
even if they're visible to no more than one transaction at a time, that we need
the constraints to be enforced in the first place--and that's why
integrity is such a fundamental feature of a database system.)
Second, I don't really agree with
the argument that any given inconsistency can be seen by just one transaction,
anyway. For consider:
·
Obviously, transactions must be allowed to communicate
with the outside world (where by "the outside world" I mean the world
outside the database).
·
It follows that it's only because of certain
protocols--certain un-enforced (and in fact unenforceable) protocols, I might
add--that transactions are isolated from one another. If transaction T1 obtains some incorrect
information from the
database and writes it to a file F, and transaction T2 then reads that same
information from file F, then T1 and T2 aren't really isolated from each
other. Even if they run at totally
different (i.e., non-overlapping) times!
·
In the same kind of way, a transaction can obviously
provide incorrect information to the end user.
In other words, I don't really
believe in the "I" property of transactions.
Third, semantic optimization requires
the database to be consistent at all times, not just at transaction
boundaries. (I don't really want to get
into details of semantic optimization here; suffice it to say that it's a
technique for using integrity constraints to simplify queries--in order to
improve performance, of course.
Clearly, if the constraints aren't satisfied, then the simplifications
will be invalid, and the query answers will be incorrect, in general.)
But--I hear some readers
objecting--surely some integrity checks have to be deferred, don't
they? As a trivial example, consider
the constraint "Supplier S1 and part P1 are in the same city." If supplier S1 moves, say
from London to
Paris, then part P1 must move from London to Paris as well. The conventional solution to this
problem is
to wrap the two updates up into a single transaction, like this:
BEGIN TRANSACTION ;
UPDATE s WHERE s# = s#
('S1') city := 'Paris' ;
UPDATE p WHERE P# = P#
('P1') city := 'Paris' ;
COMMIT ;
In this conventional solution, the
constraint is defined to be DEFERRED, and the checking is done at COMMIT--and
the database is inconsistent between the two UPDATE operations. Note in particular that if the same
transaction were to ask the question "Are supplier S1 and part P1 in the
same city?" between the two UPDATEs, it would get the answer no.
By contrast, THE THIRD MANIFESTO
solves this kind of problem by introducing a multiple assignment
operator, which lets us carry out several assignments as a single operation,
without any integrity checking being done until all of the assignments in question
have been executed. (INSERT, DELETE,
and UPDATE are of course just shorthand for certain assignment operations, as I
pointed out in an earlier article in this occasional series, viz., There's Only One
Relational Model!
For example:
UPDATE s WHERE s# = s# ('S1') city := 'Paris' ,
UPDATE P WHERE P# = P# ('P1') city := 'Paris' ;
Note the comma separator,
which means the two UPDATEs are both part of the same statement. (Of course, I don't mean to
suggest that the
solution to the problem under consideration can be obtained by a tiny change in
syntax! Rather, the point is that we do
need to be able to perform the two UPDATEs as part of the same statement--in
parallel, in fact--and so we need some syntax to specify the necessary
"bundling." So we use a comma
for that purpose, and no integrity checking is done "until we get to the
semicolon." Note in particular
that there's now no way for the transaction to see an inconsistent state of the
database between the two UPDATEs, because the notion of "between the two
UPDATEs" is one that has no meaning in this context.
Given the availability of such an
operator, there's now no need at all for deferred checking in the traditional
sense (i.e., checking that's deferred to end-of-transaction). As already indicated, however, this
is a
topic that deserves an article of its own, and I hope to write that article
soon.
(To be continued in Part 3
Posted 05/10/02