The work described in this
two-part article was done jointly by David McGoveran and myself during
1993.
It was the view update question
that was the overall spur to our work.
I had been developing the outlines of a new approach to view updating,
one that enjoyed several nice properties and overall looked quite promising,
but I kept running into the problem that the approach I was proposing could
occasionally lead to strange results. Then David pointed out that those strange
results occurred only if the database design was "strange" too in a
certain sense, and proposed a new design principle that said, in effect
"Don't design databases in this strange way."
And that observation led directly to the
production of the original versions of the view update articles.
Despite the foregoing, I must make
it clear that avoiding view update anomalies is far from being the sole benefit
that accrues from the new design principle.
Indeed, I would claim that, like the principles of further
normalization, the new principle injects a little rigor into a field (database
design) that as I have written elsewhere is still very much of an art, not a
science.
1. Introduction
The purpose of this paper is to
present and describe a new and, we believe, very fundamental database design
principle. Like the well-known
principles of normalization, our principle can be seen in part as a systematic
means of avoiding redundancy, and thereby avoiding certain update anomalies
that might otherwise occur. Although
the principle is very simple, we have not previously seen it articulated in
published form; in fact, the experience of one of us (McGoveran) suggests that
the principle is quite often violated in practice.
We briefly explore some of the consequences of such violations.
2. The Loves-Hates Example
We begin with a simple example.
Consider the following database:
CREATE DOMAIN PERSONS ... ;
CREATE BASE TABLE LOVES
( X DOMAIN ( PERSONS ),
Y DOMAIN ( PERSONS ) ... ) ;
CREATE BASE TABLE HATES
( X DOMAIN ( PERSONS ),
Y DOMAIN ( PERSONS ) ... ) ;
The intended semantics are, of
course, that the row <x,y> appears in LOVES only if "x
loves y" is true, and the row <x,y> appears in HATES
only if "x hates y" is true (Throughout this paper we
use a modified form of conventional SQL syntax, for reasons of simplicity and
explicitness.)
Now suppose the row
<Romeo,Juliet> is to be inserted into this database.
The user responsible for the INSERT will
presumably insert the row into base table LOVES.
Note carefully, however, at he or she could just as easily
have inserted the row into base table HATES instead.
It is only because the user had some
additional information namely, the knowledge that "Romeo loves
Juliet" is true that he or she decided to insert the row into LOVES and
not into HATES. Since that
additional information is not known to the DBMS, the "decision
procedure" for deciding whether a given row should be inserted into LOVES
or HATES is likewise not known to the DBMS.
In other words, part of the meaning of the database is concealed from
the system.
Note, moreover, that it is not
just the DBMS that is affected by this concealment.
Exactly the same information is being concealed from other
users as well. That is, given the
same row <Romeo,Juliet>, another user will be just as unable to decide
which of LOVES and HATES the row is to go into if he or she does not have the
necessary extra information (viz., that "Romeo loves Juliet"
is true). To put the point another
way: Suppose we are told that for a
certain pair of persons x and y either "x
loves y"
is true or "x hates y" is true, but we are not told
which. In general, then, we will only
be able to tell which of the two possibilities is in fact the case by looking
to see which of the two base tables the row <x,y> appears in.
Note further that even when we
have found which table the row <x,y> appears in, we still don't
really know which of the two possibilities is the casewe know only that the row
appears in LOVES or HATES, as the case may be.
And lest the reader object that we surely do now know which
possibility is the case, since the necessary information is represented by the
names
of the two base tables involved (LOVES and HATES), let us now rename those
tables ABC and XYZ, respectively. Now
can you tell which of "x loves y" and "x
hates y" is true? The
answer, of course, is that you can tell only if you know that ABC
"means" loves and XYZ "means" hates.
In fact, of course, there is nothing to stop
us renaming LOVES as HATES and HATES as LOVES ... Now can you tell?
Before attempting to draw any
conclusions from the foregoing, let us move on to examine another example.
3. The Employees Example
Suppose we have a database
concerning employees, in which every employee has a (unique) employee number
EMP#, a name ENAME, a department number DEPT#, and a salary SAL.
Further, suppose that we decide (for some
reason the precise reason is not important for the moment) to represent
employees by two base tables, EMPA and EMPB. See Fig. 1 for some sample
values where
·
EMPA contains rows for employees in department D1;
·
EMPB contains rows for employees who are either not in
department D1 or have a salary in excess of 33K.
EMPA EMPB
+----------------------------+
+----------------------------+
¦
EMP# ¦ ENAME ¦ DEPT# ¦ SAL ¦ ¦ EMP# ¦ ENAME ¦ DEPT# ¦ SAL ¦
+------+-------+-------+-----¦
+------+-------+-------+-----¦
¦
E1 ¦ Lopez ¦ D1 ¦ 25K ¦ ¦ E2 ¦ Cheng ¦ D1 ¦ 42K ¦
¦
E2 ¦ Cheng ¦ D1 ¦ 42K ¦ ¦ E3 ¦ Finzi ¦ D2 ¦ 30K ¦
+----------------------------+
¦ E4 ¦ Saito ¦ D2 ¦ 45K ¦
+----------------------------+
Fig. 1: Base tables EMPA and
EMPB (first version): sample values
The reader will surely agree that
this is a bad design. But why
exactly is it bad? Some insight
into this question can be gained by considering the following scenario.
Suppose first of all that we start off with
an empty database (i.e., base tables EMPA and EMPB both contain no rows at
all). Suppose next that we are asked to
insert information regarding employee E2 (name Cheng, department D1, salary
42K) into this database. We construct
the row
<E2,Cheng,D1,42K>
But which base table do we put it
in? The answer, obviously, has to be both
(as suggested by Fig. 1). It has to be
both, because the new row satisfies both (a) the criterion for membership in
EMPA (the department number is D1), and (b) the criterion for membership in
EMPB (the salary is greater than 33K).
After all, if the row were to be inserted into just one of the two base
tables, the question is which one? There
are no grounds, except arbitrary ones, for choosing either table over the other.
(In fact, if we put the row into
just one of the tables, say, EMPA and not EMPB, we could be accused of a
contradiction. For the appearance of the row in EMPA would
mean that Cheng works in department D1 and earns 42K, while the simultaneous
nonappearance of the row in EMPB would mean that Cheng either does not work in
department D1 or does not earn more than 33K.)
Thus we see that one reason the
design is bad is that it leads to redundancy:
The very same information is represented
twice, in two distinct base tables.
Of course, it is fairly easy to
see what causes the redundancy in this particular example.
To be specific, there is a certain lack
of independence between the two base tables EMPA and EMPB, inasmuch as
"their meanings overlap" (it is possible for the same row to satisfy
the membership criterion for both).
Lack of independence between objects is generally to be avoided if
possible, because it implies that changes to one will require changes to the
other as well. For instance, a DELETE
on one table might require a DELETE on another (as is indeed the case in our
EMPA-EMPB example, if we wish to delete the information for employee E2).
Thus it looks as if a good design
principle might be "Don't have tables whose meanings overlap"
and indeed,
so it is. Before we can make this
principle more precise, however, we need to examine the question of the
meaning
of a table in greater depth. Note in
particular that the principle as just-- very loosely! --Articulated is not
sufficient in itself to explain what is wrong with the LOVES-HATES example
discussed earlier.
4. Integrity Constraints
In order to discuss what tables
mean, we must first digress for a few moments to consider the general issue
of integrity
constraints. We classify such constraints
into four kinds, namely domain constraints,
column constraints, table constraints,
and database constraints (see Constraints and Predicates Parts 1,
2, 3).
Ø
A domain constraint is just the definition of the set
of values that go to make up the domain in question; in other words, it
effectively just enumerates the values in the domain (We mention domain
constraints merely for completeness they play no part in the database design
principle that is the primary focus of this paper.
Ø
A column constraint states that the values appearing in
a specific column must be drawn from some specific domain.
For example, consider base table EMPB from
the previous section. The columns of
that table are subject to the following column constraints:
·
e.EMP#
IN EMP#_DOM
· e.ENAME
IN NAME_DOM
· e.DEPT#
IN DEPT#_DOM
·
e.SAL
IN US_CURRENCY_DOM
Here e represents an arbitrary
row of the table and EMP#_DOM, NAME_DOM, etc., are the names of the relevant
domains.
Ø
A table constraint states that a specific table
must satisfy some specific condition, where the condition in question refers
solely
to the table under consideration i.e., it does not refer to any other table,
nor to any domain. For example, here
are two table constraints for the base table EMPB:
1.
e.DEPT# =/ 'D1' OR e.SAL > 33K
2.
IF e.EMP#
= f.EMP# THEN e.ENAME = f.ENAME
AND e.DEPT#
= f.DEPT#
AND e.SAL =
f.SAL
The first of these is self-explanatory.
The second says that if two rows e
and f have the same EMP# value, then they also have the same ENAME
value, the same DEPT# value, and the same SAL value in other words, they are
the same row. (Of course, this is just
a longwinded way of saying that EMP# is a candidate key.
Naturally we assume that all tables do have
at least one candidate key! i.e., duplicate rows are not permitted.)
Note: Observe that we talk of
table constraints in general, not just base table constraints. The point is,
all tables, base or
otherwise, are subject to table constraints, as the authors have discussed in
detail elsewhere [5]. For present
purposes, however, it is indeed base table constraints in particular that are
of primary interest; for the remainder of this paper, therefore, we will take
the unqualified term "table constraint" to mean a base table
constraint specifically, barring explicit statements to the contrary.
Ø
A database constraint states that the overall database
must satisfy some specific condition, where the condition in question can refer
to as many tables as desired. For
example, suppose the database of Fig. 1 was extended to include a departments
table DEPT. Then the referential
constraints from EMPA and EMPB to that table DEPT would both be database
constraints (they would both in fact refer to exactly two tables).
5. The Question Of Meaning
Now we can get back to our
discussion of what tables (and indeed databases) mean.
The first point is that every table, be it a
base table, a view, a query result, or whatever certainly does have an
associated meaning. And, of course,
users must be aware of those meanings if they are to use the database correctly
and effectively. For example, the
meaning of base table EMPB is something like the following:
"The employee with the
specified employee number (EMP#) has the specified name (ENAME), works in the
specified department (DEPT#), and earns the specified salary (SAL).
Furthermore, either the department number is
not D1 or the salary is greater than 33K (or both).
Also, no two employees have the same employee number."
Formally, the foregoing
"meaning" is an example of what is called a predicate, or
truth-valued function, a function of four arguments, in this particular
case. Substituting values for the
arguments is equivalent to invoking the function (or
"instantiating" the predicate), thereby yielding an expression that
evaluates to either true or false. For example, the substitution
EMP# = 'E3' ENAME =
'Finzi' DEPT# = 'D2' SAL = 30K
yields the value true. By contrast, the substitution
EMP# = 'E3' ENAME =
'Clark' DEPT# = 'D2' SAL = 25K
yields the value false.
And at any given time, of course, the table
contains exactly those rows that make the predicate evaluate to true at
that time.
It follows from the foregoing that
if (for example) a row is presented as a candidate for insertion into some
table, the DBMS should accept that row only if it does not cause the corresponding
predicate to be violated. More
generally, the predicate for a given table represents the criterion for
update acceptability for that table, that is, it constitutes the criterion
for deciding whether or not some proposed update is in fact valid (or at least
plausible) for the given table. In
other words, such a predicate corresponds to what we earlier referred to as the
membership criterion for the table in question.
In order for it to be able to
decide whether or not a proposed update is acceptable for a given table,
therefore, the DBMS needs to be aware of the predicate for that table.
Now, it is of course not possible for the
DBMS to know exactly what the predicate is for a given table.
In the case of base table EMPB, for example,
the DBMS has no way of knowing a priori that the predicate is such that
the row <E3,Finzi,D2,30K> makes it true and the row
<E3,Clark,D2,25K> does not; it also has no way of knowing exactly what
certain terms appearing in that predicate (such as "works in" or
"earns") mean. However, the
DBMS certainly does know a reasonably close approximation to that
predicate. To be specific, it knows
that, if a given row is to be deemed acceptable, all of the following must be
true:
·
The EMP# value must be a value from the domain of
employee numbers
·
The ENAME value must be a value from the domain of
names
·
The DEPT# value must be a value from the domain of
department numbers
·
The SAL value must be a value from the domain of US
currency
·
Either the DEPT# value is not D1 or the SAL value is
greater than 33K (or both)
·
The EMP# value is unique with respect to all such
values in the table
In other words, for a base table
such as EMPB, the DBMS does at least know all the integrity constraints (column
constraints and table constraints) that have been declared for that base
table. Formally, therefore, we can define
the (DBMS-understood) "meaning" of a given base table to be the
logical AND of all column constraints and table constraints that apply to that
base table (and it is this meaning that the DBMS will check whenever an update
is attempted on the base table in question).
For example, the formal meaning of base table EMPB is:
e.EMP# IN EMP#_DOM
AND
e.ENAME IN NAME_DOM AND
e.DEPT# IN DEPT#_DOM AND
e.SAL IN
US_CURRENCY_DOM AND
(e.DEPT# =/'D1' OR e.SAL > 33K) AND
(IF e.EMP# = f.EMP# THEN e.
ENAME = f.ENAME
AND
e.DEPT# = f.DEPT#
AND e.SAL = f.SAL)
We will refer to this expression;
let us call it PE as the table predicate for base table EMPB.
Incidentally, note how the
foregoing remarks serve to point up once again the fundamental importance of
the relational domain concept.
Relational vendors should be doing all within their power to incorporate
proper domain support into their DBMS products.
It is perhaps worth pointing out too that "proper domain
support" here does not mean support for the very strange construct
called "domains" in the SQL/92 standard.
To return to the main thread of
our discussion: As indicated above, in
order for the DBMS to be able to decide whether or not a given update is
acceptable on a given table, the DBMS needs to be aware of the table predicate
that applies to the table in question.
Now, the DBMS certainly is aware of the relevant predicate in the case
of a base table, as we have just seen.
But what about derived tables e.g., what about views?
What is the table predicate for a derived
table? Clearly, what we need is a set
of rules such that if the DBMS knows the table predicate(s) for the input(s) to
any relational operation, it can deduce the table predicate for the output from
that operation. Given such a set of
rules, the DBMS will then know the table predicate for all possible
tables.
It is in fact very easy to state
such a set of rules they follow immediately from the definitions of the relational
operators. For example, if A and
B are any two type-compatible tables tType-compatibility is usually
referred to as union-compatibility in the literature.
We prefer our term for reasons that are beyond the scope of the
present discussion.) and their respective table predicates are PA and PB,
then the table predicate PC for table C, where C is
defined as A INTERSECT B, is obviously (PA) AND (PB);
that is, a row r will appear in C if and only if it appears in
both A and Bi.e., if and only if PA(r) and PB(r)
are both true. So if, for example, we
define C as a view and try to insert r into that view, r
must satisfy both the table predicate for A and the table predicate for B,
or the INSERT will fail.
Here is another example:
The table predicate for the table that
results from the restriction operation
T WHERE condition
is (PT) AND (condition),
where PT is the table predicate for T.
For example, the table predicate for EMPB WHERE SAL < 40K is
(PE) AND (SAL < 40K)
where PE is the table predicate
for EMPB as defined earlier.
Stating the table predicates
corresponding to the other relational operators is left as an exercise for the
reader.
We conclude this section by
remarking that although it is somewhat irrelevant to the main theme of this paper
the overall database has a formal meaning too, just as the individual
base tables do. The meaning of the
database the database predicate for that database is essentially the
logical AND of all individual table predicates for the (named) tables in that
database, together with all database constraints that apply to that database.
Originally published in Database
Programming & Design 7, No. 6 (June 1994) and published as a two-part
article in RELATIONAL
DATABASE WRITINGS1991-94. It is republished here by permission of David
McGoveran, Miller Freeman Inc. and Pearson
Education, Inc. Research has shown that certain detail level
corrections might be needed, which we may
undertake in the future. However, we still believe strongly that the overall
approach is sound.
(Continued in
Part 2)