Monday, September 19, 2016

The Principle of Orthogonal Database Design Part I


Note: This post replaces several previous posts which now redirect here. The example is not meant to be realistic, but to convey specific points in an accessible way.

"The principle of orthogonal design (abbreviated POOD) ... is the second of the two principles of database design, which seek to prevent databases from being too complicated or redundant, the first principle being the principle of full normalization (POFN). Simply put, it says that no two relations in a relational database should be defined in such a way that they can represent the same facts. As with database normalization, POOD serves to eliminate uncontrolled storage redundancy and expressive ambiguity, especially useful for applying updates to virtual relations (e.g., view (database)). Although simple in concept, POOD is frequently misunderstood ... is a restatement of the requirement that a database is a minimum cover set of the relational algebra. The relational algebra allows data duplication in the relations that are the elements of the algebra. One of the efficiency requirements of a database is that there be no data duplication. This requirement is met by the minimum cover set of the relational algebra." --Wikipedia.org


Base and Derived Relations

Every relation has an associated formal relation predicate (RP), a conjunction of integrity constraints expressed in a DBMS-specific relational data language that represent informal business rules in the database denoting the meaning of the relation and its tuples.

Relational databases consist of a set of base and derived relations, the latter--query results, views and snapshots--derived from the former by relational algebra operations. SQL base tables induced the notion that base relations are stored, but while their data can and often will be stored (and not necessarily as tables), this is not not a requirement. Rather, they represent the base facts (axioms) about the real world entities of interest (i.e., serve as the basis of the database). Derived relations represent facts that are logical implications of the axioms (theorems) and are still about the entities represented by base relations--no entities are created by data manipulation.

Relations are derived from base relations either directly, or via other derived relations and inherit the integrity constraints on the relations from which they were derived, supplemented by any added by the relational operations applied to produce them (e.g., restrictions, projections, joins and so on). Thus, at every relation derivation a true RDBMS knows and records the derived RP (constraints) in the database catalog, which is another way of saying that it supports constraint inheritance


Some derived relations (e.g., views) are given names, but they are just a shorthand for their RPs (constraints). Others (e.g., query results) don't have names. Note very carefully that application developers and end-users do not and need not know whether they interact with base or derived relations. 

Three Formal Design Principles


Formal systems governed by three principles:
  1. Principle of orthogonality (or axiomatic independence): No axiom is derivable from other axioms;
  2. Principle of minimality: There are no unnecessary axioms;
  3. Principle of completeness: All meaningful theorems are derivable from the axioms.

have desirable properties. Relational databases are formal logic systems and, therefore, will acquire those properties if their design adheres to the principles[1]There is a McGoveran conjecture that the Principle of Full Normalization (POFN) implies the three principles, but not vice-versa.
 
The Principle of Orthogonal Design (POOD) is the principle of orthogonality applied to relational databases. Base relations are independent if the facts they represent are not derivable from those represented by the other base relations. With respect to identical relations see Database Design: Relation Predicates and "Identical Relations". While the POFN and the POOD reduce logical redundancy, their principal objective is logical and semantic correctness.

Entity Supertypes and Subtypes


Consider a (very) simple conceptual model consisting of the following business rules and assume for now that no data are missing.
  • All employees have an id, name and hire date;
  • Some earn a salary;
  • Some make a commission;
  • Salary and commission are mutually exclusive.

Many practitioners would employ the following design:
EMPLOYEES (EMP#,ENAME,HIREDATE,SALARY,COMMISSION)
that in SQL would contain NULLs that stand for "inapplicable"--an artifact of poor design: the inaccurate representation of the conceptual model. The model specifies that salary and commission properties do not apply to some employees (i.e., they don't have them), yet the table design assigns the corresponding attributes to all tuples, forcing NULLs.

NULLs violate the core Information Principle (IP) of the relational data model (RDM), which mandates that all information in a relational database be represented explicitly and in exactly one way: as values of relation attributes defined on domains. A NULL is not a value, but a marker for the absence of a value that cannot be legitimately treated as a value--that would be a logical type error--so the design does not  represent a relation. Most data professionals are oblivious to the problems caused by NULLs (see [2] and bibliography in [3]), because they focus on representation (structure) and ignore the implications for integrity enforcement and manipulation.

In the past I advocated the following design:

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,SALARY)
COMM_EMPS (EMP#,COMMISSION)
It avoids NULLs, but what entities do the latter two relations represent?

When entities have some properties in common, but some also have unique properties, not all entities are of the same type. Entities with unique properties in addition to common ones are a subtype of those with only common entities, the supertype--an entity supertype-subtype (ESS) relationship.

Salaried and commissioned employees are subtypes of the employee supertype. The instances of a supertype are necessarily disjoint from instances of its subtypes, so three relations, one for each type, with multi-relation disjunctive constraints ensuring mutually exclusive tuple memberships would do the job.

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)
Physical design should not contaminate logical design, however, once the base relations are determined, for reasons of physical efficiency (e.g., anticipation of lots of queries about all employees) and barring other application requirements, there is nothing to prevent physical storage of the data as if they were of a single relation
(EMP#,ENAME,HIREDATE,SALARY,COMM)
with EMPS, SAL_EMPS and COMM_EMPS as derived projection views (relations do not occur, nor need they at the physical level, so this is a perfectly acceptable physical design). 

Note: Default values outside the otherwise permissible value ranges for the attributes can be used instead of NULLs for inapplicable SALARY and COMMISSION values. Default values are not just NULLs by another name: they occur in the physical, not logical design. They extend the domain and, except by caveat, are not legitimate values for computational purposes (again, logical inference and relational operators occur only in the logical model). So there is never a concern to the user--they are handled by a true RDBMS transparently, provided users understand the domains. 

The application developer, let alone the end user, should never be presented with a logical design that reflects anything other than the entities with which they are familiar. For example, if only employees that are salaried or commissioned are to be tracked, EMPS data need not be stored--if necessary, it can be derived logically as a union view from SAL_EMPS and COMM_EMPS. For this reason, it is assumed that applications will be built on top of views that represent particular entity supertypes or subtypes. 

In fact, to take advantage of logical independence (LI)--the insulation of users and applications from certain non-loss logical reorganizations of the database--applications should access the database exclusively via views, which must be updatable.

(Stay tuned for Part II)
 

References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming (chapter drafts).
[2] McGoveran, D., Nothing from Nothing, Parts 1-4.
[3] Pascal, F., The Final NULL in the Coffin: A Relational Solution to Missing Data.




Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

11 comments:

  1. I don't see anything wrong with having them all in one table, with nulls and a check constraint to ensure mutual exclusivity.

    ReplyDelete
    Replies
    1. There is little to respond to such a well reasoned argument.

      Delete
    2. That may work fine for a couple of sub-type entities that have only a small number of differences. Try it with many subtype entities with many unique attributes and complex validation rules though...it quickly breaks down, and may easily exceed the capability of a check constraint.

      Delete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete
    2. This comment has been removed by a blog administrator.

      Delete
    3. This comment has been removed by the author.

      Delete
  3. It seems a bunch of paragraphs picked from different sources and tossed here without a logical order. POOD?

    ReplyDelete
    Replies
    1. Boasting about your lack of knowledge and ability to understand is not a good idea.

      Delete
  4. There is a McGoveran conjecture that the Principle of Full Normalization (POFN) implies the three principles, but not vice-versa. And McG has three other Principles that correspond to your three for Formal Systems: POOD, Principle of (Expressive) Completeness, Principle of (Representational) Minimality.

    The trouble is, I've never seen any of those Principles stated in precise language. Indeed I don't know whether they're supposed to apply to a schema (including integrity constraints/Relation Predicates) or to the formal semantics of the possible facts the application is capable of expressing.

    I'm finding in both Date's (many) and McG's treatments of View Updating (or Database Updating), a lot of high-fallutin language and capital 'P' Principles. It sounds like it's a treatment in logic. But it ends up same place as attempting to square the circle.

    Or do you know something that isn't available from the literature?

    ReplyDelete
    Replies
    1. The RDM is a formal logic system and, therefore, should be governed by formal system properties. This is something Codd understood, but many of its interpreters (including myself) did not were fully understand for quite a while. What David has been doing is developing the correct interpretation. This is VERY hard work and he did his best in some very difficult circumstances. If you follow my site you should have seen that he is working on a book that will contain all the formalisms. I am working on informal treatments to make them accessible: revising my papers, writing a book and posting some of the ideas here. But it takes time--this is not your mum and pup "RDM", SQL or otherwise.

      Specifically to the three principles--as the post states, formal systems that are governed by them have certain desirable properties. Applied to databases, they guarantee view updatability and logical independence, and logical and semantic correctness of query and update results.

      For the formalisms you will have to wait for David's book, but can read some of the draft chapters at his website. It requires intellectual effort, particularly to those without a background in formal logic.

      What I am trying to do is provide informal and more accessible explanations. You will have to wait for my paper revisions, forthcoming book and follow posts here.

      Delete