Monday, November 7, 2016

The Principle of Orthogonal Database Design Part II




Note: This is a 11/24/17 re-write of Part II of a three-part series that replaced several previous posts (the pages of which redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part I)

To recall from Part I, adherence to the POOD means independent base relations (i.e., not derivable from other base relations), which the design example in Part I,

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)

violates: EMPS is derivable via union of projections of SAL_EMPS and COMM_EMPS. It requires at least:
  • A disjunctive constraint on each of the SAL_EMPS and COMM_EMPS relations, to ensure mutual exclusivity;
  • A redundancy control constraint to prevent inconsistency due to partial updates (can you formulate it?);
  • Use of the transaction management component of the data language to ensure that each candidate tuple is properly inserted (1) into EMPS and (2) the correct subtype relation;

 Disjunctive and Redundancy Control Constraints


There is little declarative integrity support in SQL beyond shorthands for primary key (PK) and referential constraints. Even though disjunctive constraints are syntactically similar to referential constraints, they are not supported (at least not declaratively), in part because SQL DBMSs have not implemented the ANSI/ISO standard ASSERTION [2,3].

Note: Some provide imperative support via stored procedures. Although better than application enforced integrity, SPs tend to be complex, use proprietary syntax, vary considerably across products and are difficult to optimize. All products have the same non-trivial issues to solve, summarized by an expert as follows (see Chapter 11 in [4] for a more in-depth discussion):

"It is very difficult to (a) come up with an efficient DBMS implementation--one that does not 'activate' validation of a trigger when, given a transaction, it is not necessary, and one that, in the case it is necessary, to validate only what is minimally necessary and (b) cater to multiple concurrent transactions. The vendors address these issues, particularly isolation levels, in fundamentally different ways. In practice and assuming a 'last insert wins' protocol, a disjunctive constraint requires to delete from one table whenever inserted into the other. This is relatively easy in SQL Server, Sybase, and DB2, much harder in Oracle. Both correctness and efficiency are paramount. Without them, a solution will not be accepted in the real world, where large and high transaction rate databases are mainstream. " --Toon Koppelaars
This burden is exacerbated by SQL being a highly redundant language -- a query can be expressed in multiple ways -- which makes optimization much more difficult [5,6].

Declarative constraints would relieve application developers to focus on the logic, rather than on implementation details, for which DBMS vendors are in a much better position. Of course, declarative constraints are only as good as how efficient CPU-wise and concurrency-wise the underlying implementation code that actually validates the constraint during transactions is. This is why the real solution are shorthands similar to those for PK and referential constraints -- they are more reliable and relieve developers from drudgery. They would also help vendors: they can develop shorthand-specific validation code, which is much easier to do than accept developer formulated ASSERTIONs, parse and deep-analyze them.


Unique Relation Predicates vs. Names


3 would involve something like:

BEGIN TRANSACTION
 INSERT INTO emps (EMP#="emp#",ENAME="ename",HIREDATE=date)
 INSERT INTO sal_emps (EMP#="emp#",ENAME="ename",HIREDATE=date,SALARY=salary}
END TRANSACTION;

for tuples representing facts about salaried employees. But consider the candidate tuple {A21,Aronson,11/3/99,95000): does it belong in SAL_EMPS, or COMM_EMPS? The insertion is based on the relation (and attribute) names specified by the user/application. If they are wrong, the database will be rendered inconsistent without any awareness of it!

This is why the DBMS should decide and not ever rely on names. A relation name should be just a reference to the associated RP, namely the conjunction of

  • Domain constraints;
  • Attribute constraints;
  • Tuple constraints;
  • Multi-tuple constraints;
Note: After-the-insert relation must also satisfy, of course, any database (multi-relation) constraints to which it is subject.

Whenever the DBMS encounters a name, it should substitute for it the corresponding RP. But this will work only if RPs are unique (i.e., every relation is uniquely constrained) and, thus, a candidate tuple satisfies only one RP! For example, SALARY and COMMISSION should not be defined on the same domain and have overlapping value ranges.

Note: There is a common industry misconception that the meaning of a relation is in its heading -- the attribute names. But relations do not have headers -- they are an element of the tabular display of relations (meta-data). The informal meaning of a relation is denoted by business rules describing the represented object group and the formal semantics by its RP (constraints).
 

DBMS Support


Let's now adhere to POOD and make EMPS a union view (for simplicity, assume the data of the two subtype base relations are stored). A true RDBMS supports
constraint inheritance and LI, which means that the view is updatable [7] and view updates correctly through to the two (projections of) base relations from which it is derived -- if all relations are uniquely constrained. In fact, even if the EMPS union view is defined with derived attribute INCOME (salary or commission), the DBMS can still push the correct value to the correct base relation, if each of the update values are uniquely constrained to be one of SALARY or COMMISSION.

What is more, with explicit support of ESS relationships, a RDBMS can even create the view and generate the disjunctive constraints.

Unfortunately, SQL DBMSs support none of this.

(Continued in Part III)


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Koppelaars, T., CREATE ASSERTION: Impossible Dream?

[3] Smout, E., CREATE ASSERTION: Neither Impossible Nor a Dream.

[4] Koppelaars, T., and de Haan, L., APPLIED MATHEMATICS FOR DATABASE PROFESSIONALS.

[5] Pascal, F., Language Redundancy and DBMS Performance.

[6] Fernandez, I., The way you write your query matters.

[7] On View Updating (C. J. Date and D. McGoveran).




17 comments:

  1. Maybe that's just semantics and I'm not very good at those.

    That was in reply to someone not making the imo proper distinction between an application programmer and a db designer, claiming the db designer "is also a programmer, namely of what the DBMS is to do". And it was mostly ironical.

    "My DBMS is sufficiently expressive, thank you."

    That was in reply to a remark that if one's DBMS cannot express the predicates, one should switch DBMS. My DBMS has structural built-in support for expressing all relevant predicates. That is, it has support for keeping a record of what the precise business meaning is of the tuples recorded in [a relvar in] the db. And it has support for expressing (AND enforcing !!!) all applicable constraints. That which Toon in his book claimed no one knew how to do.

    In that light, I don't feel I have any pondering to do let alone any revisiting and what I said remains what I said.

    Also in that light, and regarding

    "What happens when users are not sophisticated enough to "read the mind" of the designer and have either incompatible interpretations or -- just as common -- interpretations that vary by day, mood, focus, etc.? What if two applications have incompatible interpretations?"

    With my DBMS, users are not required to read any designer's mind, they only need to query the catalog. The risk of "Incompatible interpretations" can never be entirely eliminated when the predicate is expressed in NL (typical for external predicates defining business meaning) but can indeed be eliminated if the predicate is expressed as a mathematical formula (typical for definitions of constraints), which in my DBMS it is.

    ReplyDelete
    Replies
    1. I was as ironic as you were. Note my comment in the article "I am not worried about Erwin". As to pondering, it's up to you.

      As you will see, a true RDBMS according to our RDM interpretation should document semantics and provide it on demand to users. However, this does not mean that reliance on the DBMS should not preferred to human intervention, as you say you agree.

      Delete
  2. (I'm going over this finish to start this time)

    "any error in the name specification will result in the tuple being inserted in the wrong relation. But if the design adheres to the POOD the DBMS can decide on its own, based on which RP the tuple satisfies,"

    If a user can make a mistake when trying to pick the right relvar/relation name, he can also make a mistake when trying to pick the right attribute name. The effect stays just the same.

    ReplyDelete
    Replies
    1. Absolutely. Which is why update decisions should not be made based on attribute names either. Names are only references to the integrity constraints. An attribute name is a reference to the attribute constraint.

      Delete
  3. "However, the meaning -- captured by the business rules -- is represented in the database by the formal relation predicate (RP) -- a set of integrity constraints that the DBMS can enforce -- which is the criterion that the DBMS uses to decide whether the tuple belongs in the relation or not. And we should rely as much as possible on the DBMS, not the user, to make that decision."

    If this is your way of saying that all applicable constraints should be definable to the DBMS in such a way that the DBMS, and the DBMS alone, will be able to derive/determine how to actually enforce them, and effectively do that, I agree wholeheartedly, and I'm aware of LOTS of participants on various db discussion fora who feel EXACTLY the same.

    That all those people, when using SQL, have no other option than to hand-code lots of enforcement strategies (in SPROCS, in the business app itself, wherever) is not their fault. Ref. the remark you quoted from Toon&Lex book.

    ReplyDelete
    Replies
    1. Then why do you insist that decisions should be based on names such that users must "help the DBMS" when adherence to the POOD would obviate the need for such help?

      Delete
    2. Because I still have no idea what alternatives are available in your proposed data language for identifying the target relation(s) of a given update.

      Or rather, because I have no idea how "adherence to the POOD" can be enforced unless by relying on ... attribute names.

      Delete
    3. I hope you are not asking for syntax -- we are not designing a data language. We are talking formulation of constraints such that the set of values that comprise a candidate tuple will satisfy only one RP (conjunction of constraints). Then the name is just a reference to the RP and if there is only one that is satisfied, then using the wrong name won't insert the tuple in the wrong relation. Duplicate RPs can't prevent it.

      Orthogonality is a design discipline, same as full normalization (excepting normalization to 1NF which can be enforced), as I explained in a previous article.

      I also mentioned a McGoveran (unproven) conjecture that the POFN implies the other 3 principles, not the other way around.





      The semantics

      Delete
    4. No I'm not asking for syntax. But you have eschewed relvar (names) and attribute names as a means to express updates. We can't express tuple content as merely a bunch of values, because a given tuple type might have more than one String attribute or more than one Int, etc.

      Both you [Pascal] and McGoveran expend considerable verbiage telling us what update is not. I've seen nothing saying what it is. Not in enough detail to design a database that is updatable (orthogonally or otherwise).

      I have by now (after a long time of waiting) come to the conclusion the reason so little is forthcoming is because the quest for POOD is like trying to square the circle.

      Delete
    5. I suggest you re-read the article because you apparently did not understand it. See also my last reply to Erwin.

      As to how to handle updates of relational databases given that SET THEORY HAS NO CONCEPT OF VARIABLES (whether you like it or not), you'll have to have patience for David's book. That work is much more profound and difficult than what passes for relational theory these days. What I am offering is just some basic ideas here and there that will be fully developed in the book.

      In the meantime, why don't you refresh your knowledge -- assuming you have some -- of set theory and predicate logic, to verify that my claim is correct.

      That you dk or understand something does not mean it does not hold.

      FYI: Principle of axiomatic independence is basic in axiomatic theory, of which set theory is one. Designing independent base relations such the facts they represent are not derivable is a clear design principle and it was even alluded to by Codd's first 1969 paper when he discussed weak and strong redundancy. I suggest you re-read that too.

      SO please, we're not gonna spoon-feed you on fundamentals -- go educate yourself and spare me.

      Delete
  4. "In order for this to work, however, every RP must be unique -- another way of saying that the database design adheres to the POOD. Erwin's claim -- assumed by many -- is that RPs are unique even if only relation (and attribute) names are distinct".

    I have no idea what could possibly be meant by "every RP must be unique". Perhaps this problem could be resolved once and for all if you could exemplify two RP's that are "not unique".

    If relvar/relation names are used as the identifier for all update and query operations, then uniqueness is indeed guaranteed by virtue of the name having to be able to act as the identifier. I don't see the problem. (You could theoretically have two distinct relvars/relations/tables in a db, or in distinct db's, representing the very same business meaning. That is more a cultural/organizational problem than it is one that is intrinsic to RM, imo.)

    ReplyDelete
    Replies
    1. Insofar as a tuple insertion is concerned, name uniqueness ***means nothing*** to the DBMS -- it's just 2 distinct names x,y. Uniqueness of RP (conjunction of constraints on relation), OTOH, makes a single relation the recipient of the tuple without the DBMS having to understand meaning -- it "just computes" (somebody used this term).

      Delete
  5. "Of course, declarative constraints are only as good as how efficient CPU-wise and concurrency-wise the underlying implementation code that actually validates the constraint during transactions is, which should be the purview of the DBMS designer. This is why the real solution are shorthands like the ones for key and referential constraints, which are more reliable and relieve developers from the drudgery. They would also help vendors: they can develop shorthand-specific validation code, which is much easier to do than accept developer formulated ASSERTIONs, parse and deep-analyze them."

    In this context, maybe I should once more point to my article in the NOCOUG journal on this very subject (NOV 2013 issue). And the one by Toon published in the preceding issue (AUG 2013) to which mine was a response.

    ReplyDelete
    Replies
    1. Are they online? If so, can you provide the links?

      Delete
    2. Toon's : https://nocoug.files.wordpress.com/2014/08/nocoug_journal_201308.pdf

      Mine : https://nocoug.files.wordpress.com/2014/08/nocoug_journal_201311.pdf

      Delete
  6. hmm... all attempts at defining formal systems are doomed to be incomplete. Doesn't mean we try to define all the rules and boundaries, etc., but it doesn't take too many user interactions in said system for holes, inconsistencies and outright breaks in said system can get found out by simply blind and stupid user interaction.

    "you're not holding it right"...

    ReplyDelete
    Replies
    1. And your point is? That systems not fomally defined are better, or at least as reliable as formal ones?

      One of the points of a well-designed formal system is limit as much as possible "blind and stupid interaction" without affecting the meaningful functionality.

      Delete

View My Stats