THE PRINCIPLE OF ORTHOGONAL DESIGN PART 2
by Chris Date and David McGoveran

 

 

 

Update Notice

 

The following is the draft of the abstract of a planned paper on this subject.

 


Database design is more art than science, that is, it is mainly of an informal nature. Whatever formality is there, it has been limited to what we currently call the Principle of Full Normalization (POFN), expounded in The Costly Illusion: Normalization, Integrity and Performance (paper #3 in this series). In 1993, however, McGoveran and Date presented in a two-part paper an additional formal design principle, The Principle of Orthogonal Design (POOD). Jointly, the two principles are intended to prevent redundancy, update anomalies,  harder to understand databases, and other practical complications in database design.

 

The POOD formally defined in the original paper was informally described as prohibition of tables with overlapping meanings. But there is now disagreement between the two authors on the original definition of POOD, which has been questioned by at least one (see On Non-Loss Decomposition); Date has since revised his formal definition in Data Redundancy and Database Design, Further Thoughts Part 1, but it is quite complex, it does not have a very easy informal description, and we are not sure about some of the assumptions behind it.

 

The purpose of this paper is to further clarify POOD, for a better understanding that would enable adherence with relative ease by the average practitioner.


Posted 1/13/06

 

 

 

(Continued from Part 1)

 

 

6. Tables With Overlapping Meanings

 

Now we can pin down what we mean when we say that the meanings of two tables overlap.  Let A and B be any two tables, with associated table predicates PA and PB respectively.  Then the meanings of A and B are said to overlap if and only if some row r can be constructed such that PA(r) and PB(r) are both true.

 

Given this definition, our new design principle --

 

+----------------------------------------------------------------+

¦   Within a given database, no two distinct base tables should  ¦

¦   have overlapping meanings.                                   ¦

+----------------------------------------------------------------+

 

--is now precise. 

 

However, two very important corollaries of the principle are perhaps not immediately obvious, and are in any case worth stating explicitly.  The first has to do with isomorphic tables.  For the purposes of this paper, we define two tables

 

   A { A1, ..., An }

   B { B1, ..., Bn }

 

to be isomorphic if and only if there exists a one-to-one correspondence between the columns of A and the columns of B, say A1:B1, ..., An:Bn, such that in each pair of columns Ai:Bi (i = 1, ..., n) the two columns are defined on the same domain (two tables that are type-compatible are certainly isomorphic, but tables can be isomorphic without necessarily being type-compatible.  This is because type-compatibility as we define it requires the two tables to have identical column names. 

 

Here then is the first corollary:

 

Ø       Two tables cannot possibly have overlapping meanings if they are not isomorphic.

 

It follows that our design principle applies specifically to isomorphic tables.  However, we caution the reader that the tables in question are not necessarily base tables!  Refer to the section AN IMPORTANT CLARIFICATION later in this paper.

 

The second corollary is as follows:

 

Ø       When a given row r is presented for insertion into the database, the DBMS should be able to decide for itself which table (if any) that row r belongs to.

 

In other words, the process of inserting a row can be regarded as a process of inserting that row into the database (rather than into some specific table), provided the design principle is adhered to.

 

Note:  On being informed of this point, the reader might very well respond "So what?" -- relational languages always require the user to specify the target table on an INSERT, so what is the advantage of having the system be able to figure out for itself what the target table is?  One answer to this question is that the specified target table might be a view.  Consider an INSERT into a view V defined as the union of two tables A and B.  As the present authors have discussed in detail elsewhere, it is very desirable that the system be able to decide for itself which of A and B the new row belongs to.

 

 

7. The Examples Revisited

 

Now let us revisit the examples discussed in Part 1.  First of all, the EMPA-EMPB design is clearly bad, since the meanings of the two tables clearly overlap.  But suppose we were to redefine those tables as follows (See Fig. 2 for some sample values):

 

·  EMPA contains rows for employees in department D1;

·  EMPB contains rows for employees not in department D1.

 

EMPA                           EMPB

+----------------------------+ +----------------------------+

| EMP# ¦ ENAME ¦ DEPT# ¦ SAL ¦ ¦ EMP# ¦ ENAME ¦ DEPT# ¦ SAL ¦

+------+-------+-------+-----¦ +------+-------+-------+-----¦

¦ E1   ¦ Lopez ¦ D1    ¦ 25K ¦ ¦ E3   ¦ Finzi ¦ D2    ¦ 30K ¦

¦ E2   ¦ Cheng ¦ D1    ¦ 42K ¦ ¦ E4   ¦ Saito ¦ D2    ¦ 45K ¦

+----------------------------+ +----------------------------+

Fig. 2: Base tables EMPA and EMPB (second version): sample values

 

 

The meanings of the tables now do not overlap.  However, the design is still bad if the DBMS is not aware of that fact.  That is, if the table predicates for the two tables as stated to the DBMS do not include the terms

 

... AND e.DEPT# = 'D1' ...   /* for EMPA */

... AND e.DEPT# =/ 'D1' ...   /* for EMPB */

 

then the meanings still do overlap so far as the DBMS is concerned.  In other words, the word "meaning" in our design principle refers specifically to the "meaning" as understood by the DBMS (of course), not necessarily to the meaning as understood by the user. 

 

What about the LOVES-HATES example?  Well, here are the table predicates for the design as originally given:

 

r.X IN PERSONS AND r.Y IN PERSONS   /* for LOVES */

r.X IN PERSONS AND r.Y IN PERSONS   /* for HATES */

 

(where r is an <x,y> row).  These two predicates are identical, of course, and therefore most certainly do overlap!  In fact, the example is not really different in kind from the EMPA-EMPB example (second version) just discussed

 

Note: The same would still be true if we renamed the X and Y columns (say) L1 and L2 in LOVES and H1 and H2 in HATES (the tables would still be isomorphic).  Consider, for example, what would happen on an attempt to insert a row into the union of the two tables.

 

Here by contrast is a revised design that does not violate our design principle:

 

CREATE DOMAIN PERSONS ...;

CREATE DOMAIN L_OR_H VALUES {'loves','hates'};

 

CREATE BASE TABLE LOVES

     (X DOMAIN (PERSONS),

      R DOMAIN (L_OR_H),

      Y DOMAIN (PERSONS)...);

CREATE BASE TABLE HATES

     (X DOMAIN (PERSONS),

       R DOMAIN (L_OR_H),

       Y DOMAIN (PERSONS)...);

 

The table predicates are now as follows (and should be so defined to the DBMS):

 

 

r.X IN PERSONS AND r.Y IN PERSONS

               AND r.R = 'loves'   /* for LOVES */

 

r.X IN PERSONS AND r.Y IN PERSONS

               AND r.R = 'hates'   /* for HATES */

 

To insert the information that Romeo loves Juliet, it is now necessary to insert the row <Romeo,loves,Juliet>.  Note, incidentally, that there is nothing to stop us inserting the row <Romeo,hates,Juliet> as well!  In fact, the two three-column base tables LOVES and HATES might as well now be replaced by a single base table that is the union of the two.  (Exercise for the reader:  What would the corresponding table predicate be?)  An analogous remark applies to the EMPA-EMPB example (second version) above. 

 

 

8. An Important Clarification

 

The message of this paper thus far might be summed up as follows:  Whenever your database design includes two distinct base tables that are isomorphic, be sure that the DBMS-understood meanings of those two tables do not overlap.  This rule (or discipline) is easy to state and easy to apply, and it would be nice if matters stopped right there.  Unfortunately, however, there is one important ramification that we have so far overlooked.  Consider the tables EMPX and EMPY shown in Fig. 3.

 

EMPX                      EMPY

+----------------------+  +-----------------------+

¦ EMP# ¦ ENAME ¦ DEPT# ¦  ¦ EMP# ¦ ENAME ¦ SALARY ¦

+------+-------+-------¦  +------+-------+--------¦

¦ E1   ¦ Lopez ¦ D1    ¦  ¦ E1   ¦ Lopez ¦    25K ¦

¦ E2   ¦ Cheng ¦ D1    ¦  ¦ E2   ¦ Cheng ¦    42K ¦

¦ E3   ¦ Finzi ¦ D2    ¦  ¦ E3   ¦ Finzi ¦    30K ¦

¦ E4   ¦ Saito ¦ D2    ¦  ¦ E4   ¦ Saito ¦    45K ¦

+----------------------+  +-----------------------+

Fig. 3: Base tables EMPX and EMPY: sample values

 

 

It should be clear that the design of that figure is once again a bad one, because of the redundancy it implies.  Here, however, the overlap in meaning occurs, not between the two tables EMPX and EMPY, but rather between the two projections of those tables over EMP# and ENAME.  Clearly, therefore, we need to extend our design principle to deal with such a situation, as follows:

 

+----------------------------------------------------------------+

¦    Let A and B be any two base tables in the database.  Then   ¦

¦    there must not exist nonloss decompositions of A and B      ¦

¦    into A1, A2, ..., Am and B1, B2, ..., Bn (respectively)     ¦

¦    such that two distinct projections in the set A1, A2, ...,  ¦

¦    Am, B1, B2, ..., Bn have overlapping meanings.              ¦

+----------------------------------------------------------------+

 

By the term "nonloss decomposition" (of some given table), we mean, of course, a decomposition of that table -- according to the well-known principles of normalization -- into a set of projections such that (a) the given table can be reconstructed by joining those projections back together again, and (b) none of those projections is redundant in that reconstruction process.

 

Note:  This refined version of our design principle in fact subsumes the original version, because one "nonloss decomposition" of any given table T is the set of projections consisting of just the "identity projection" T itself.  In other words, if we agree to refer to tables that have projections whose meanings overlap as having meanings that partially overlap, then total overlap is just a special case (i.e., two tables that have totally overlapping meanings certainly have partially overlapping meanings, a fortiori).

 

 

9. Concluding Remarks

 

There are several remarks to be made by way of conclusion.

 

Ø       First, readers might be tempted to think that our new design principle is very obvious and really just common sense.  And in a way they would be right.  But the principles of normalization (third normal form, etc.) are likewise "obvious and just common sense."  The point is, however, that the principles of normalization take those common sense ideas and provide a precise, accurate characterization of those intuitive concepts.  In a similar manner, our new design principle provides a precise, accurate characterization of certain additional intuitive concepts.

 

Ø       At least one of the present authors (McGoveran) has not only encountered genuine database designs in which the principle has been flouted (despite the fact that it is "really just common sense"), he has also encountered database practitioners and database "experts" who have expressly recommended flouting that principle.  Indeed, we have probably all seen designs such as the following --

 

ACTIVITIES_88 {ENTRY#, DESCRIPTION, AMOUNT, NEW_BALANCE}

ACTIVITIES_89 {ENTRY#, DESCRIPTION, AMOUNT, NEW_BALANCE}

ACTIVITIES_90 {ENTRY#, DESCRIPTION, AMOUNT, NEW_BALANCE}

ACTIVITIES_91 {ENTRY#, DESCRIPTION, AMOUNT, NEW_BALANCE}

ACTIVITIES_92 {ENTRY#, DESCRIPTION, AMOUNT, NEW_BALANCE}

ACTIVITIES_93 {ENTRY#, DESCRIPTION, AMOUNT, NEW_BALANCE}

 

Ø       (etc. etc.) -- in which activities for different years are kept in different tables.

 

Ø       In a design such as the one just illustrated, part of the (informal, user-understood-but-not-DBMS-understood) meaning of the database is encoded in the table names.  Such a design can thus be seen as violating Codd's Information Principle, which can be stated as follows:

 

Ø       All information in the database must be cast explicitly in terms of values in tables and in no other way.

 

Ø       Our design principle -- at least, that part of it that recommends against the use of names to carry meaning -- can thus be seen as a corollary (though not a very obvious one) of Codd's Information Principle (

 

Ø       Note: Some readers might be aware of the fact that in his further discussion of the Information Principle, Codd goes on to point out that names too are "cast in terms of values":  "Even ... names are represented as character strings in [certain tables that] are normally part of the builtin database catalog".  This fact is something of a red herring, however; in no way does it invalidate our new design principle.

 

Ø       Note that adherence to our design principle has the consequence that if A and B are any two type-compatible base tables, then it will be true for all time that:

 

A UNION B is a disjoint union

A INTERSECT B is empty

A MINUS B is equal to A

 

Ø       Adherence to our principle also has the very desirable consequence that the rules for updating union, intersection, and difference views work very well and never produce what we referred to in that paper [forthcoming]as "surprising results."

 

Ø       One final andvery important remark:  The new design principle is equally applicable to the design of what might be called "individual user databases" -- that is, an individual user's perception (as defined by views and/or base tables) of some underlying shared database.  In other words, such an "individual user database" ought not to include any views and/or base tables whose meanings overlap (even partially), for essentially all of the same reasons that the shared database ought not to include any base tables whose meanings overlap (even partially).

 

(The authors would like to thank Hugh Darwen, Fabian Pascal, and Paul Winsberg for their helpful comments on earlier drafts of this paper.)

 

 

Appendix A: A Remark On Defaults

 

Section 8 of this paper discussed "an important clarification" to the first version of our design principle, having to do with projections of the given base tables.  Unfortunately, another important clarification is also required, having to do with column defaults.  This second clarification was included in the original draft of the paper but was omitted from the version published in Database Programming & Design.  We therefore discuss it in this appendix.

 

First of all, relational systems typically permit the user to insert partial rows into the database; if the user does not provide a value for some particular column, then the system completes the row by placing a default value (Possibly NULL in SQL.  We do not wish to get sidetracked into a discussion of the problems of NULLs here; for present purposes, it is irrelevant whether the default is null or something else.) in that column position.  For example, suppose we are given a variation on the usual suppliers-and-parts database, in which base tables S (suppliers) and P (parts) look like this:

 

S  {ID#,STATUS,CITY}

P  {ID#,COLOR,WEIGHT,CITY}

 

Then the INSERT operation--

 

INSERT (ID# = 'S6',CITY = 'Tucson')

INTO S ;

 

-- might cause the system to complete the row by placing the value 0 in the STATUS position (if 0 is defined as the default for column S.STATUS) before inserting the new row into the suppliers table S. 

 

Now consider the partial row <X5,Tucson>, where "X5" is an ID number and "Tucson" is a CITY value.  Unless we have appropriate additional information, we cannot immediately tell whether this is a partial suppliers (S) row or a partial parts (P) row.  It follows that if, e.g., we define a view as follows --

 

CREATE VIEW V AS (SELECT ID#,CITY FROM S)

                  UNION

                 (SELECT ID#,CITY FROM P);

 

-- and we try to insert the row <X5,Tucson> into this view, then all the system can do is attempt to insert an appropriately completed row into both table S and table P.  And if columns S.STATUS, P.COLOR, and P.WEIGHT all do possess a corresponding default value (i.e., none of them has "defaults not allowed"), the effect will indeed be to insert "an appropriately completed row" into both tables.

 

It follows from the foregoing that our design principle requires some slight refinement, as follows:

 

+----------------------------------------------------------------+

¦   Let A' and B' be any two base tables in the database, and    ¦

¦   let A and B be the projections of A and B (respectively)     ¦

¦   over all columns that do not have a default value (i.e.,     ¦

¦   all columns that have "defaults not allowed").  Then there   ¦

¦   must not exist nonloss decompositions of A and B into A1,    ¦

¦   A2, ..., Am and B1, B2, ..., Bn (respectively) such that     ¦

¦   two distinct projections in the set A1, A2, ..., Am, B1,     ¦

¦   B2, ..., Bn have overlapping meanings.                       ¦

+----------------------------------------------------------------+

 

Originally published in Database Programming & Design 7, No. 6 (June 1994) and published as a two-part article in RELATIONAL DATABASE WRITINGS 1991-94. It is republished here by permission of David McGoveran, Miller Freeman Inc. and Pearson Education,  Inc. © All rights reserved by C.J. Date. Research has shown that certain detail level corrections might be needed, which we may undertake in the future. However, we still believe strongly that the overall approach is sound.