The Principle of Orthogonal Database Design Part I

Monday, September 19, 2016

The Principle of Orthogonal Database Design Part I

Follow @DBDebunk Follow @ThePostWest

Note: This is a 11/24/17 re-write of Part I of a three-part series that replaced several older posts (the pages of which which now redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

"The principle of orthogonal design (abbreviated POOD) ... is the second of the two principles of database design, which seek to prevent databases from being too complicated or redundant, the first principle being the principle of full normalization (POFN). Simply put, it says that no two relations in a relational database should be defined in such a way that they can represent the same facts. As with database normalization, POOD serves to eliminate uncontrolled storage redundancy and expressive ambiguity, especially useful for applying updates to virtual relations (views). Although simple in concept, POOD is frequently misunderstood ... is a restatement of the requirement that a database is a minimum cover set of the relational algebra. The relational algebra allows data duplication in the relations that are the elements of the algebra. One of the efficiency requirements of a database is that there be no data duplication. This requirement is met by the minimum cover set of the relational algebra." --Wikipedia.org

Well, not quite.

Base and Derived Relations

Relational databases consist of a set of base and derived relations, the latter -- query results, views and snapshots -- derived from the former by applying relational algebra (RA) operations. Every database relation has an associated relation predicate (RP), a conjunction of constraints, which are formalizations of informal business rules that specify the individual and collective object properties required for membership in the object group represented by the relation in the database. Constraints are "computable" -- they can be enforced by a DBMS for consistency with the rules.

Base relations represent the base facts -- axioms -- about the real world objects of interest. Derived relations represent logical implications of (inferences from) of the axioms -- theorems -- about the objects represented by base relations -- no new objects are created relations are derived from base relations.

Note: The common misconception that base relations are stored is induced by SQL's base tables, which are stored. Base relations can be and often -- but not always -- are stored (not as tables), but that is not what base means. SQL tables are even guaranteed to be relations.

Constraint Inheritance and Logical Independence

Derived relations inherit constraints from the base relations from which they derive, supplemented by any constraints added by the relational operations applied to produce them (e.g., restrictions, projections, joins and so on). Thus, for every derived relation its constraints, including those inherited from the base relations (i.e., its RP) are recorded in the database catalog and enforceable by the DBMS -- another way of saying that the DBMS supports constraint inheritance -- it always knows how constraints on each derived relation relate back to the base constraints.

Persisted derived relations (e.g., views, snapshots) are given names, which to the DBMS are just shorthand for their RPs. In a truly relational system application developers and end-users do not need to know if they interact with base and or derived relations. In fact, database access should be only via views, to take advantage of logical independence (LI) -- insulation from logical reorganizations of the database.

Three Database Design Principles

Formal systems adhering to three principles have certain desirable properties. Relational databases are formal logic systems and, therefore, will acquire those properties if their design adheres to these principles [1] which, applied to database management, are:

Principle of Orthogonal Design (POOD): Base relations are independent;
Principle of Representational Parsimony (PORP): There are no superfluous base relations;
Principle of Expressive Completeness (POEC): All meaningful relations are derivable from the base relations;

Two desirable properties are:

Logical validity: A query result is logically valid (i.e., provably correct) if and only if it is derived by any sequence of RA operations on one or more relations. This implies no anomalous resultsa: an algebra does not have anomalies -- their occurrence means that either the operands are not relations, or that the operations are not RA operations (i.e., no algebra).

Semantic correctness: A query result is semantically correct if an only if for every assignment of meaning to RA relation operands under which their tuples represent true facts, the tuples of the result relations also represent true facts.

There is a McGoveran (yet unproven) conjecture that the three principles jointly imply the Principle of Full Normalization (POFN), but not vice-versa. Thus, there is much more to full normalization than the common industry belief of just redundancy elimination.

The Principle of Orthogonal Design

"A relation R is derivable from a set S of relations if there exists a sequence of [relational operations] which yields R from members of S." --E. F. Codd (1969)

Thus, a base relation is independent if it is not derivable from other base relations (identical base relations are a trivial case of non-independence).

The tuples of non-independent base relations represent non-independent facts (i.e., rather than distinct facts).

"An axiomatic system is said to be consistent if it lacks contradiction (i.e., the ability to derive both a statement and its denial from the system's axioms). In such a system, an axiom is called independent if it is not a theorem that can be derived from other axioms in the system. A system will be called independent if each of its underlying axioms is independent. Although independence is not a necessary requirement for a system, consistency is." --Wikipedia

The following example is not intended to be realistic, but to convey a point in an accessible way. Consider a (very) simple conceptual model consisting of the following business rules and assume that no data are missing.

All employees have an id, name and hire date;
Some earn a salary, some make a commission, which are mutually exclusive.

Many, if not most SQL practitioners would design the following table:

EMPLOYEES (EMP#,ENAME,HIREDATE,SALARY,COMMISSION)

with inapplicable NULLs. The NULLs are an artifact of poor design -- an inaccurate representation of the conceptual model, which specifies that some employees do not have a salary or a commission, yet the design assigns both to all tuples, forcing NULLs.

NULLs violate the core Information Principle (IP) of the RDM, which mandates that all information in a relational database be represented explicitly and in exactly one way -- as values of relation attributes defined on domains. NULL is not a value, but a marker for the absence of a value, which cannot be legitimately treated as a value by relational expressions -- that would be a logical type error -- so the table is not a R-table (i.e., does not visualize a relation). Data professionals are oblivious to the problems caused by NULLs [2,3], because they focus on representation (structure) and miss the implications for integrity enforcement and manipulation, which is where the problems crop up.

In the past I advocated the following NULL-avoiding design:

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,SALARY)
COMM_EMPS (EMP#,COMMISSION)

but what types of object do the latter two tables represent? Not salaried or commissioned employees -- they have employee numbers and names. Application developers and end users should always be presented with a logical design that reflects objects with which they are familiar.

When objects have both shared and unique properties, they are not of the same type. Objects with unique properties in addition to the shared ones are a subtype of those with only shared shared properties, the supertype (i.e., there are entity supertype-subtype (ESS) relationships).

Note: For historical reasons we prefer object (not in the OO sense) to entity, but ESS is entrenched.

Salaried and commissioned employees are subtypes of the employee supertype. One possible design is one base relation each for the supertype and two subtypes:

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)

Note: Once base relations are determined, for reasons of physical efficiency (e.g., anticipation of lots of queries about all employees) and barring other application requirements, there is nothing to prevent physical storage of the data in an indexed sequantial file with the following records:

(EMP#,ENAME,HIREDATE,SALARY,COMM)

Relations do not occur, nor need they, at the physical level, so this is a perfectly acceptable physical design. EMPS, SAL_EMPS and COMM_EMPS can be produced as projection views at the logical level. Default values outside the otherwise permissible value ranges for the attributes can be used instead of NULLs for inapplicable SALARY and COMMISSION values -- they are not just NULLs by another name: they occur in the physical, not logical design, they extend the domain and, except by caveat, are not legitimate values for computational purposes (again, logical inference and relational operators occur only in the logical model). So there is never a concern to the user -- they are handled by the DBMS transparently, provided users understand the domains and the DBMS supports PI (true RDBMSs do).

While this design avoids NULLs and each relation represents a real world object group, there is logical redundancy and it violates the POOD: EMPS is not independent -- it is derivable via union of projections of SAL_EMPS and COMM_EMPS.

(Continued in Part II)

References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] McGoveran, D., Nothing from Nothing, Parts 1-4.

[3] Pascal, F., The Final NULL in the Coffin: A Relational Solution to Missing Data.

15 comments:

AnonymousSeptember 25, 2016 at 4:44 AM
I don't see anything wrong with having them all in one table, with nulls and a check constraint to ensure mutual exclusivity.
ReplyDelete
Replies
AnonymousSeptember 26, 2016 at 6:37 AM
This comment has been removed by a blog administrator.
ReplyDelete
Replies
AnonymousSeptember 26, 2016 at 7:35 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownSeptember 26, 2016 at 4:50 PM
There is a McGoveran conjecture that the Principle of Full Normalization (POFN) implies the three principles, but not vice-versa. And McG has three other Principles that correspond to your three for Formal Systems: POOD, Principle of (Expressive) Completeness, Principle of (Representational) Minimality.

The trouble is, I've never seen any of those Principles stated in precise language. Indeed I don't know whether they're supposed to apply to a schema (including integrity constraints/Relation Predicates) or to the formal semantics of the possible facts the application is capable of expressing.

I'm finding in both Date's (many) and McG's treatments of View Updating (or Database Updating), a lot of high-fallutin language and capital 'P' Principles. It sounds like it's a treatment in logic. But it ends up same place as attempting to square the circle.

Or do you know something that isn't available from the literature?
ReplyDelete
Replies
UnknownNovember 7, 2016 at 8:30 PM
I've just noticed your opening paragraph use of meaning. (Not sure if this was in your o.p. or is a recent revision.)

Every database relation has an associated formal relation predicate (RP), a conjunction of integrity constraints expressed in a DBMS-specific data language. They that represent the informal business rules in the database, which denote the meaning of the relation and its tuples.

Your sentence is somewhat ambiguous. Do you mean:
* The informal business rules denote the meaning of the relation?
* The integrity constraints denote the meaning of the relation?

(In either case, I would say it is the relvar that carries meaning, not a relation value. Or a derived relation carries meaning derived from its based-on relvars combined with the relational expression that produces the relation.)

The informal business rules cannot denote meaning of a relation: they are rules about the business, not about the database or its content.

The integrity constraints cannot denote meaning: they constrain the possible values that can appear in the database, but say nothing about what it means. For example, stipulating that EMP# be a key for relvar EMPLOYEES tells nothing about what an employee number is, or what an employee is.

McGoveran's writings seem to continually mix up 'internal predicate' (what you call RP) with 'external predicate' (what TTM calls 'relvar predicate'). It might well be that integrity constraints are a consequence of business rules, which are a consequence of the enterprise's intended meaning and usage for its descriptions of entities and possible worlds. But the integrity constraints do not denote meaning.
ReplyDelete
Replies
Fabian PascalNovember 7, 2016 at 9:25 PM
You will notice that we don't use the term relvar. Codd used "time-varying" relations for a reason: set theory does not have the concept of variables. There may be relvars "under cover", but not at the data language level that users interact with.

Each and every base relation of the 'relation type' has the same meaning denoted by the informal business rules. However, they must be formalized and symbolized as integrity constraints to be accessible to the DBMS for update decisions. They are the formal database representation of the rules. So both denote the meaning of the relations at different levels of representation.

Of course they are about the business. But relations are assigned that meaning by the database designer.

How exactly do you formulate constraints, what do they represent in the database? They are predicates -- formal symbolizations of informal rules. If you have problem with 'meaning', think of rules as meaning as understood semantically by humans and constraints as understood algorithmically by the DBMS. There is one meaning with two representations at two levels of abstraction.

There are differences between the McGoveran and Date/Darwen interpretations of the RDM. It's up to you to pick one, but make sure you have sufficient knowledge and understanding of formal set theory and predicate logic.

Some of this is explained in my recent book which I recommend. For the formal theory wait for David's book.
ReplyDelete
Replies

Add comment

POSTS

Monday, September 19, 2016