The following is the draft of the abstract of a planned paper
on this subject.
Database design is more art than science, that is, it is
mainly of an informal nature. Whatever formality is there, it has been limited
to what we currently call the Principle of Full Normalization (POFN),
expounded in The
Costly Illusion: Normalization, Integrity and Performance (paper #3 in
this series). In 1993, however, McGoveran and Date presented in a two-part
paper an additional formal design principle, The Principle of
Orthogonal Design (POOD). Jointly, the two principles are intended
to prevent redundancy, update anomalies,
harder to understand databases, and other practical complications in
database design.
The POOD formally defined in the original paper was
informally described as prohibition of tables with overlapping meanings.
But there is now disagreement between the two authors on the original
definition of POOD, which has been questioned by at least one (see On Non-Loss Decomposition); Date has since revised
his formal definition in Data Redundancy and
Database Design, Further Thoughts Part 1, but it is quite complex, it
does not have a very easy informal description, and we are not sure about some
of the assumptions behind it.
The purpose of this paper is to further clarify POOD, for a
better understanding that would enable adherence with relative ease by the
average practitioner.
(Continued from Part 1)
6. Tables With Overlapping Meanings
Now we can pin down what we mean
when we say that the meanings of two tables overlap. Let A and B be any two tables,
with associated table predicates PA and PB respectively. Then the meanings of A and B
are said to overlap if and only if some row r can be constructed such
that PA(r) and PB(r) are both true.
Given this definition, our new
design principle --
+----------------------------------------------------------------+
¦ Within
a given database, no two distinct base tables should ¦
¦ have
overlapping meanings. ¦
+----------------------------------------------------------------+
--is now precise.
However, two very important
corollaries of the principle are perhaps not immediately obvious, and are in
any case worth stating explicitly. The
first has to do with isomorphic tables.
For the purposes of this paper, we define two tables
A { A1, ..., An }
B { B1, ..., Bn }
to be isomorphic if and only if
there exists a one-to-one correspondence between the columns of A and
the columns of B, say A1:B1, ..., An:Bn, such that in each pair
of columns Ai:Bi (i = 1, ..., n) the two columns are
defined on the same domain (two tables that are type-compatible are
certainly isomorphic, but tables can be isomorphic without necessarily being
type-compatible. This is because
type-compatibility as we define it requires the two tables to have identical
column names.
Here then is the first corollary:
Ø
Two tables
cannot possibly have overlapping meanings if they are not isomorphic.
It follows that our design principle
applies specifically to isomorphic tables.
However, we caution the reader that the tables in question are not
necessarily base tables! Refer
to the section AN IMPORTANT CLARIFICATION later in this paper.
The second corollary is as
follows:
Ø
When a given
row r is presented for insertion into the database, the DBMS should be able to
decide for itself which table (if any) that row r belongs to.
In other words, the process of
inserting a row can be regarded as a process of inserting that row into the database
(rather than into some specific table), provided the design principle is
adhered to.
Note: On being informed of this point, the reader might very well
respond "So what?" -- relational languages always require the user to
specify the target table on an INSERT, so what is the advantage of having the
system be able to figure out for itself what the target table is? One answer to this question is that the
specified target table might be a view.
Consider an INSERT into a view V defined as the union of two
tables A and B. As the
present authors have discussed in detail elsewhere, it is very desirable that
the system be able to decide for itself which of A and B the new
row belongs to.
7. The Examples Revisited
Now let us revisit the examples
discussed in Part
1. First of all, the EMPA-EMPB
design is clearly bad, since the meanings of the two tables clearly
overlap. But suppose we were to
redefine those tables as follows (See Fig. 2 for some sample values):
· EMPA
contains rows for employees in department D1;
· EMPB
contains rows for employees not in department D1.
EMPA
EMPB
+----------------------------+
+----------------------------+
| EMP# ¦ ENAME ¦ DEPT# ¦ SAL ¦ ¦ EMP# ¦ ENAME ¦
DEPT# ¦ SAL ¦
+------+-------+-------+-----¦
+------+-------+-------+-----¦
¦ E1 ¦
Lopez ¦ D1 ¦ 25K ¦ ¦ E3 ¦ Finzi ¦ D2 ¦ 30K ¦
¦ E2 ¦
Cheng ¦ D1 ¦ 42K ¦ ¦ E4 ¦ Saito ¦ D2 ¦ 45K ¦
+----------------------------+
+----------------------------+
Fig. 2: Base tables EMPA and EMPB
(second version): sample values
The meanings of the tables now do
not overlap. However, the design is
still bad if the DBMS is not aware of that fact. That is, if the table predicates for the two tables as stated
to the DBMS do not include the terms
... AND e.DEPT# = 'D1' ... /* for EMPA */
... AND e.DEPT# =/ 'D1' ... /* for EMPB */
then the meanings still do overlap
so far as the DBMS is concerned. In
other words, the word "meaning" in our design principle refers
specifically to the "meaning" as understood by the DBMS (of
course), not necessarily to the meaning as understood by the user.
What about the LOVES-HATES
example? Well, here are the table
predicates for the design as originally given:
r.X IN PERSONS AND r.Y IN PERSONS /* for LOVES */
r.X IN PERSONS AND r.Y IN PERSONS /* for HATES */
(where r is an <x,y>
row). These two predicates are
identical, of course, and therefore most certainly do overlap! In fact, the example is not really different
in kind from the EMPA-EMPB example (second version) just discussed
Note: The same would still be
true if we renamed the X and Y columns (say) L1 and L2 in LOVES and H1 and H2
in HATES (the tables would still be isomorphic). Consider, for example, what would happen on an attempt to insert a
row into the union of the two tables.
Here by contrast is a revised design that does not violate
our design principle:
CREATE DOMAIN PERSONS ...;
CREATE DOMAIN L_OR_H VALUES {'loves','hates'};
CREATE BASE TABLE LOVES
(X DOMAIN (PERSONS),
R DOMAIN (L_OR_H),
Y DOMAIN
(PERSONS)...);
CREATE BASE TABLE HATES
(X DOMAIN (PERSONS),
R DOMAIN (L_OR_H),
Y DOMAIN
(PERSONS)...);
The table predicates are now as follows (and should be so
defined to the DBMS):
r.X IN PERSONS AND r.Y IN PERSONS
AND r.R
= 'loves' /* for LOVES */
r.X IN PERSONS AND r.Y IN PERSONS
AND r.R
= 'hates' /* for HATES */
To insert the information that Romeo
loves Juliet, it is now necessary to insert the row
<Romeo,loves,Juliet>. Note,
incidentally, that there is nothing to stop us inserting the row
<Romeo,hates,Juliet> as well! In
fact, the two three-column base tables LOVES and HATES might as well now be
replaced by a single base table that is the union of the two. (Exercise for the reader: What would the corresponding table predicate
be?) An analogous remark applies to the
EMPA-EMPB example (second version) above.
8. An Important Clarification
The message of this paper thus far
might be summed up as follows: Whenever
your database design includes two distinct base tables that are isomorphic, be
sure that the DBMS-understood meanings of those two tables do not overlap. This rule (or discipline) is easy to state
and easy to apply, and it would be nice if matters stopped right there. Unfortunately, however, there is one
important ramification that we have so far overlooked. Consider the tables EMPX and EMPY shown in
Fig. 3.
EMPX EMPY
+----------------------+ +-----------------------+
¦ EMP# ¦ ENAME ¦ DEPT# ¦ ¦ EMP# ¦ ENAME ¦ SALARY ¦
+------+-------+-------¦ +------+-------+--------¦
¦ E1 ¦
Lopez ¦ D1 ¦ ¦ E1
¦ Lopez ¦ 25K ¦
¦ E2 ¦
Cheng ¦ D1 ¦ ¦ E2
¦ Cheng ¦ 42K ¦
¦ E3 ¦
Finzi ¦ D2 ¦ ¦ E3
¦ Finzi ¦ 30K ¦
¦ E4 ¦
Saito ¦ D2 ¦ ¦ E4
¦ Saito ¦ 45K ¦
+----------------------+ +-----------------------+
Fig. 3: Base tables EMPX and EMPY:
sample values
It should be clear that the design
of that figure is once again a bad one, because of the redundancy it
implies. Here, however, the overlap in
meaning occurs, not between the two tables EMPX and EMPY, but rather between
the two projections of those tables over EMP# and ENAME. Clearly, therefore, we need to extend our
design principle to deal with such a situation, as follows:
+----------------------------------------------------------------+
¦ Let
A and B be any two base tables in the database. Then
¦
¦ there
must not exist nonloss decompositions of A and B ¦
¦ into
A1, A2, ..., Am and B1, B2, ..., Bn (respectively) ¦
¦ such
that two distinct projections in the set A1, A2, ..., ¦
¦
Am, B1, B2, ..., Bn have overlapping meanings. ¦
+----------------------------------------------------------------+
By the term "nonloss
decomposition" (of some given table), we mean, of course, a decomposition
of that table -- according to the well-known principles of normalization --
into a set of projections such that (a) the given table can be reconstructed by
joining those projections back together again, and (b) none of those
projections is redundant in that reconstruction process.
Note: This refined version of our design principle in fact subsumes the
original version, because one "nonloss decomposition" of any given
table T is the set of projections consisting of just the "identity
projection" T itself. In
other words, if we agree to refer to tables that have projections whose
meanings overlap as having meanings that partially overlap, then total
overlap is just a special case (i.e., two tables that have totally overlapping
meanings certainly have partially overlapping meanings, a fortiori).
9. Concluding Remarks
There are several remarks to be made by way of conclusion.
Ø
First, readers might be tempted to think that our new
design principle is very obvious and really just common sense. And in a way they would be right. But the principles of normalization (third
normal form, etc.) are likewise "obvious and just common sense." The point is, however, that the principles
of normalization take those common sense ideas and provide a precise,
accurate characterization of those intuitive concepts. In a similar manner, our new design
principle provides a precise, accurate characterization of certain additional
intuitive concepts.
Ø
At least one of the present authors (McGoveran) has not
only encountered genuine database designs in which the principle has been
flouted (despite the fact that it is "really just common sense"), he
has also encountered database practitioners and database "experts"
who have expressly recommended flouting that principle. Indeed, we have probably all seen designs
such as the following --
ACTIVITIES_88 {ENTRY#, DESCRIPTION,
AMOUNT, NEW_BALANCE}
ACTIVITIES_89 {ENTRY#, DESCRIPTION,
AMOUNT, NEW_BALANCE}
ACTIVITIES_90 {ENTRY#, DESCRIPTION,
AMOUNT, NEW_BALANCE}
ACTIVITIES_91 {ENTRY#, DESCRIPTION,
AMOUNT, NEW_BALANCE}
ACTIVITIES_92 {ENTRY#, DESCRIPTION,
AMOUNT, NEW_BALANCE}
ACTIVITIES_93 {ENTRY#, DESCRIPTION,
AMOUNT, NEW_BALANCE}
Ø
(etc. etc.) -- in which activities for different years
are kept in different tables.
Ø
In a design such as the one just illustrated, part of
the (informal, user-understood-but-not-DBMS-understood) meaning of the database
is encoded in the table names.
Such a design can thus be seen as violating Codd's Information
Principle, which can be stated as follows:
Ø
All information in the database must be cast
explicitly in terms of values in tables and in no other way.
Ø
Our design principle -- at least, that part of it that
recommends against the use of names to carry meaning -- can thus be seen
as a corollary (though not a very obvious one) of Codd's Information Principle
(
Ø
Note: Some readers might be aware of the fact
that in his further discussion of the Information Principle, Codd goes on to
point out that names too are "cast in terms of values": "Even ... names are represented as
character strings in [certain tables that] are normally part of the builtin
database catalog". This fact is
something of a red herring, however; in no way does it invalidate our new
design principle.
Ø
Note that adherence to our design principle has the
consequence that if A and B are any two type-compatible base
tables, then it will be true for all time that:
A UNION B is a disjoint
union
A INTERSECT B is empty
A MINUS B is equal to A
Ø
Adherence to our principle also has the very desirable
consequence that the rules for updating union, intersection, and difference
views work very well and never produce what we referred to in that paper [forthcoming]as "surprising results."
Ø
One final andvery important remark: The new design principle is equally
applicable to the design of what might be called "individual user
databases" -- that is, an individual user's perception (as defined by
views and/or base tables) of some underlying shared database. In other words, such an "individual
user database" ought not to include any views and/or base tables whose
meanings overlap (even partially), for essentially all of the same reasons that
the shared database ought not to include any base tables whose meanings overlap
(even partially).
(The authors would like to thank
Hugh Darwen, Fabian Pascal, and Paul Winsberg for their helpful comments on
earlier drafts of this paper.)
Appendix A: A Remark On Defaults
Section 8 of this paper discussed
"an important clarification" to the first version of our design
principle, having to do with projections of the given base tables. Unfortunately, another important
clarification is also required, having to do with column defaults. This second clarification was included in
the original draft of the paper but was omitted from the version published in Database
Programming & Design. We
therefore discuss it in this appendix.
First of all, relational systems
typically permit the user to insert partial rows into the database; if
the user does not provide a value for some particular column, then the system
completes the row by placing a default value (Possibly NULL in SQL. We do not wish to get sidetracked into a
discussion of the problems of NULLs here; for present purposes, it is
irrelevant whether the default is null or something else.) in that column
position. For example, suppose we are
given a variation on the usual suppliers-and-parts database, in which base
tables S (suppliers) and P (parts) look like this:
S {ID#,STATUS,CITY}
P {ID#,COLOR,WEIGHT,CITY}
Then the INSERT operation--
INSERT (ID# = 'S6',CITY = 'Tucson')
INTO S ;
-- might cause the system to
complete the row by placing the value 0 in the STATUS position (if 0 is defined
as the default for column S.STATUS) before inserting the new row into the
suppliers table S.
Now consider the partial row
<X5,Tucson>, where "X5" is an ID number and "Tucson"
is a CITY value. Unless we have
appropriate additional information, we cannot immediately tell whether this is
a partial suppliers (S) row or a partial parts (P) row. It follows that if, e.g., we define a view
as follows --
CREATE VIEW V AS (SELECT ID#,CITY FROM S)
UNION
(SELECT
ID#,CITY FROM P);
-- and we try to insert the row <X5,Tucson>
into this view, then all the system can do is attempt to insert an
appropriately completed row into both table S and table P. And if columns S.STATUS, P.COLOR, and
P.WEIGHT all do possess a corresponding default value (i.e., none of them has
"defaults not allowed"), the effect will indeed be to insert "an
appropriately completed row" into both tables.
It follows from the foregoing that our design principle
requires some slight refinement, as follows:
+----------------------------------------------------------------+
¦ Let
A' and B' be any two base tables in the database, and ¦
¦ let
A and B be the projections of A and B (respectively) ¦
¦ over
all columns that do not have a default value (i.e., ¦
¦ all
columns that have "defaults not allowed"). Then there ¦
¦ must
not exist nonloss decompositions of A and B into A1, ¦
¦
A2, ..., Am and B1, B2, ..., Bn (respectively) such that ¦
¦ two
distinct projections in the set A1, A2, ..., Am, B1, ¦
¦
B2, ..., Bn have overlapping meanings. ¦
+----------------------------------------------------------------+
Originally published in Database
Programming & Design 7, No. 6 (June 1994) and published as a two-part
article in RELATIONAL
DATABASE WRITINGS 1991-94. It is republished here by permission
of David McGoveran, Miller Freeman Inc. and Pearson Education, Inc. © All
rights reserved by C.J. Date. Research has shown that certain detail
level corrections might be needed, which we may undertake in the future.
However, we still believe strongly that the overall approach is sound.