The Principle of Orthogonal Database Design Part II

Monday, November 7, 2016

The Principle of Orthogonal Database Design Part II

Follow @DBDebunk Follow @ThePostWest

Note: This is a 11/24/17 re-write of Part II of a three-part series that replaced several previous posts (the pages of which redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part I)

To recall from Part I, adherence to the POOD means independent base relations (i.e., not derivable from other base relations), which the design example in Part I,

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)

violates: EMPS is derivable via union of projections of SAL_EMPS and COMM_EMPS. It requires at least:

A disjunctive constraint on each of the SAL_EMPS and COMM_EMPS relations, to ensure mutual exclusivity;
A redundancy control constraint to prevent inconsistency due to partial updates (can you formulate it?);
Use of the transaction management component of the data language to ensure that each candidate tuple is properly inserted (1) into EMPS and (2) the correct subtype relation;

Disjunctive and Redundancy Control Constraints

There is little declarative integrity support in SQL beyond shorthands for primary key (PK) and referential constraints. Even though disjunctive constraints are syntactically similar to referential constraints, they are not supported (at least not declaratively), in part because SQL DBMSs have not implemented the ANSI/ISO standard ASSERTION [2,3].

Note: Some provide imperative support via stored procedures. Although better than application enforced integrity, SPs tend to be complex, use proprietary syntax, vary considerably across products and are difficult to optimize. All products have the same non-trivial issues to solve, summarized by an expert as follows (see Chapter 11 in [4] for a more in-depth discussion):

"It is very difficult to (a) come up with an efficient DBMS implementation--one that does not 'activate' validation of a trigger when, given a transaction, it is not necessary, and one that, in the case it is necessary, to validate only what is minimally necessary and (b) cater to multiple concurrent transactions. The vendors address these issues, particularly isolation levels, in fundamentally different ways. In practice and assuming a 'last insert wins' protocol, a disjunctive constraint requires to delete from one table whenever inserted into the other. This is relatively easy in SQL Server, Sybase, and DB2, much harder in Oracle. Both correctness and efficiency are paramount. Without them, a solution will not be accepted in the real world, where large and high transaction rate databases are mainstream. " --Toon Koppelaars

This burden is exacerbated by SQL being a highly redundant language -- a query can be expressed in multiple ways -- which makes optimization much more difficult [5,6].

Declarative constraints would relieve application developers to focus on the logic, rather than on implementation details, for which DBMS vendors are in a much better position. Of course, declarative constraints are only as good as how efficient CPU-wise and concurrency-wise the underlying implementation code that actually validates the constraint during transactions is. This is why the real solution are shorthands similar to those for PK and referential constraints -- they are more reliable and relieve developers from drudgery. They would also help vendors: they can develop shorthand-specific validation code, which is much easier to do than accept developer formulated ASSERTIONs, parse and deep-analyze them.

Unique Relation Predicates vs. Names

3 would involve something like:

BEGIN TRANSACTION
INSERT INTO emps (EMP#="emp#",ENAME="ename",HIREDATE=date)
INSERT INTO sal_emps (EMP#="emp#",ENAME="ename",HIREDATE=date,SALARY=salary}
END TRANSACTION;

for tuples representing facts about salaried employees. But consider the candidate tuple {A21,Aronson,11/3/99,95000): does it belong in SAL_EMPS, or COMM_EMPS? The insertion is based on the relation (and attribute) names specified by the user/application. If they are wrong, the database will be rendered inconsistent without any awareness of it!

This is why the DBMS should decide and not ever rely on names. A relation name should be just a reference to the associated RP, namely the conjunction of

Domain constraints;
Attribute constraints;
Tuple constraints;
Multi-tuple constraints;

Note: After-the-insert relation must also satisfy, of course, any database (multi-relation) constraints to which it is subject.

Whenever the DBMS encounters a name, it should substitute for it the corresponding RP. But this will work only if RPs are unique (i.e., every relation is uniquely constrained) and, thus, a candidate tuple satisfies only one RP! For example, SALARY and COMMISSION should not be defined on the same domain and have overlapping value ranges.

Note: There is a common industry misconception that the meaning of a relation is in its heading -- the attribute names. But relations do not have headers -- they are an element of the tabular display of relations (meta-data). The informal meaning of a relation is denoted by business rules describing the represented object group and the formal semantics by its RP (constraints).

DBMS Support

Let's now adhere to POOD and make EMPS a union view (for simplicity, assume the data of the two subtype base relations are stored). A true RDBMS supports constraint inheritance and LI, which means that the view is updatable [7] and view updates correctly through to the two (projections of) base relations from which it is derived -- if all relations are uniquely constrained. In fact, even if the EMPS union view is defined with derived attribute INCOME (salary or commission), the DBMS can still push the correct value to the correct base relation, if each of the update values are uniquely constrained to be one of SALARY or COMMISSION.

What is more, with explicit support of ESS relationships, a RDBMS can even create the view and generate the disjunctive constraints.

Unfortunately, SQL DBMSs support none of this.

(Continued in Part III)

References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Koppelaars, T., CREATE ASSERTION: Impossible Dream?

[3] Smout, E., CREATE ASSERTION: Neither Impossible Nor a Dream.

[4] Koppelaars, T., and de Haan, L., APPLIED MATHEMATICS FOR DATABASE PROFESSIONALS.

[5] Pascal, F., Language Redundancy and DBMS Performance.

[6] Fernandez, I., The way you write your query matters.

[7] On View Updating (C. J. Date and D. McGoveran).

17 comments:

Erwin SmoutNovember 9, 2016 at 3:39 AM
Maybe that's just semantics and I'm not very good at those.

That was in reply to someone not making the imo proper distinction between an application programmer and a db designer, claiming the db designer "is also a programmer, namely of what the DBMS is to do". And it was mostly ironical.

"My DBMS is sufficiently expressive, thank you."

That was in reply to a remark that if one's DBMS cannot express the predicates, one should switch DBMS. My DBMS has structural built-in support for expressing all relevant predicates. That is, it has support for keeping a record of what the precise business meaning is of the tuples recorded in [a relvar in] the db. And it has support for expressing (AND enforcing !!!) all applicable constraints. That which Toon in his book claimed no one knew how to do.

In that light, I don't feel I have any pondering to do let alone any revisiting and what I said remains what I said.

Also in that light, and regarding

"What happens when users are not sophisticated enough to "read the mind" of the designer and have either incompatible interpretations or -- just as common -- interpretations that vary by day, mood, focus, etc.? What if two applications have incompatible interpretations?"

With my DBMS, users are not required to read any designer's mind, they only need to query the catalog. The risk of "Incompatible interpretations" can never be entirely eliminated when the predicate is expressed in NL (typical for external predicates defining business meaning) but can indeed be eliminated if the predicate is expressed as a mathematical formula (typical for definitions of constraints), which in my DBMS it is.
ReplyDelete
Replies
Erwin SmoutNovember 9, 2016 at 4:43 AM
(I'm going over this finish to start this time)

"any error in the name specification will result in the tuple being inserted in the wrong relation. But if the design adheres to the POOD the DBMS can decide on its own, based on which RP the tuple satisfies,"

If a user can make a mistake when trying to pick the right relvar/relation name, he can also make a mistake when trying to pick the right attribute name. The effect stays just the same.
ReplyDelete
Replies
Erwin SmoutNovember 9, 2016 at 4:57 AM
"However, the meaning -- captured by the business rules -- is represented in the database by the formal relation predicate (RP) -- a set of integrity constraints that the DBMS can enforce -- which is the criterion that the DBMS uses to decide whether the tuple belongs in the relation or not. And we should rely as much as possible on the DBMS, not the user, to make that decision."

If this is your way of saying that all applicable constraints should be definable to the DBMS in such a way that the DBMS, and the DBMS alone, will be able to derive/determine how to actually enforce them, and effectively do that, I agree wholeheartedly, and I'm aware of LOTS of participants on various db discussion fora who feel EXACTLY the same.

That all those people, when using SQL, have no other option than to hand-code lots of enforcement strategies (in SPROCS, in the business app itself, wherever) is not their fault. Ref. the remark you quoted from Toon&Lex book.
ReplyDelete
Replies
Erwin SmoutNovember 9, 2016 at 5:18 AM
"In order for this to work, however, every RP must be unique -- another way of saying that the database design adheres to the POOD. Erwin's claim -- assumed by many -- is that RPs are unique even if only relation (and attribute) names are distinct".

I have no idea what could possibly be meant by "every RP must be unique". Perhaps this problem could be resolved once and for all if you could exemplify two RP's that are "not unique".

If relvar/relation names are used as the identifier for all update and query operations, then uniqueness is indeed guaranteed by virtue of the name having to be able to act as the identifier. I don't see the problem. (You could theoretically have two distinct relvars/relations/tables in a db, or in distinct db's, representing the very same business meaning. That is more a cultural/organizational problem than it is one that is intrinsic to RM, imo.)
ReplyDelete
Replies
Erwin SmoutNovember 9, 2016 at 5:26 AM
"Of course, declarative constraints are only as good as how efficient CPU-wise and concurrency-wise the underlying implementation code that actually validates the constraint during transactions is, which should be the purview of the DBMS designer. This is why the real solution are shorthands like the ones for key and referential constraints, which are more reliable and relieve developers from the drudgery. They would also help vendors: they can develop shorthand-specific validation code, which is much easier to do than accept developer formulated ASSERTIONs, parse and deep-analyze them."

In this context, maybe I should once more point to my article in the NOCOUG journal on this very subject (NOV 2013 issue). And the one by Toon published in the preceding issue (AUG 2013) to which mine was a response.
ReplyDelete
Replies
coremansterNovember 12, 2016 at 2:52 PM
hmm... all attempts at defining formal systems are doomed to be incomplete. Doesn't mean we try to define all the rules and boundaries, etc., but it doesn't take too many user interactions in said system for holes, inconsistencies and outright breaks in said system can get found out by simply blind and stupid user interaction.

"you're not holding it right"...
ReplyDelete
Replies

Add comment

POSTS

Monday, November 7, 2016