Sunday, January 27, 2013

Database Design, Relational Fidelity and SQL Part III: Constraints, Triggers and View Updating

In Part II I discussed two possible database designs for a business model consisting of supertype and subtype entity classes: one proposed by WB and consisting of two base tables, two views and a referential (foreign key) constraint, and an alternative, consisting of two different base tables, a disjunctive (mutually exclusive) constraint and one view. Even though the second design seems to me more faithful to the business model than the first, in my book PRACTICAL ISSUES IN DATABASE MANAGEMENT I recommended the first. Why?


As you probably know, referential constraints are supported by practically all SQL products with a shorthand. You don't have to bother formulating even the relatively simple SQL standard ASSERTION I specified in Part II. All you have to do is add a REFERENCES clause to the CREATE TABLE statement and the DBMS takes care of the rest.
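For concreteness, here is roughly what the shorthand looks like for the first design's two base tables (column types are illustrative assumptions):
CREATE TABLE people (
 person_id INTEGER NOT NULL PRIMARY KEY,
 name      VARCHAR(50) NOT NULL,
 birth     DATE NOT NULL);

CREATE TABLE deaths (
 person_id INTEGER NOT NULL PRIMARY KEY
           REFERENCES people (person_id),  -- the referential shorthand
 death     DATE NOT NULL);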

In contrast, even though in standard SQL the disjunctive ASSERTION is pretty similar to the referential one, SQL products enforce the former via triggered procedures. They are better than application-enforced constraints, but unlike shorthands, they leave the burden of implementation details--which tend to be complex, use proprietary syntax that varies considerably across products and, depending on how procedural they are, are difficult to optimize--to developers.
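For reference, the disjunctive ASSERTION over the second design's base tables would look something along these lines in standard SQL (a sketch--Part II has the precise formulation, and no SQL product implements CREATE ASSERTION):
CREATE ASSERTION no_living_dead
CHECK (NOT EXISTS
        (SELECT person_id
         FROM live_people
         WHERE person_id IN (SELECT person_id FROM dead_people)));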

In Chapter 11 of their book, APPLIED MATHEMATICS FOR DATABASE PROFESSIONALS (recommended and available via this site), Toon Koppelaars and Lex de Haan detail what it takes to implement constraints via triggers. All products have exactly the same non-trivial issues to solve, which Toon summarizes as follows:
It is very difficult to (a) come up with an efficient DBMS implementation--one that does not 'activate' validation of a trigger when, given a transaction, it is not necessary and that, when it is necessary, validates only what is minimally necessary--and (b) cater to multiple concurrent transactions. The vendors address these issues, particularly isolation levels, in fundamentally different ways.
In practice, and assuming a 'last insert wins' protocol, the disjunctive constraint requires deleting a row from one table whenever a matching row is inserted into the other. This is relatively easy in SQL Server, Sybase and DB2, and much harder in Oracle.
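For a taste of the mechanics, here is a minimal sketch of one of the two SQL Server triggers under 'last insert wins' (illustrative only--the experts' actual code discussed below is considerably more elaborate):
CREATE TRIGGER live_people_ins ON live_people
AFTER INSERT
AS
  -- 'last insert wins': a person (re)inserted as alive cannot remain dead;
  -- the mirror trigger on dead_people is symmetric
  DELETE FROM dead_people
  WHERE person_id IN (SELECT person_id FROM INSERTED);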

I asked Will Sisson, Toon Koppelaars, Craig Mullins and Nathan Allan, specialists in SQL Server, Oracle, DB2 and Dataphor, respectively, to comment on the trigger implementation of the disjunctive constraint and they kindly offered their thoughts and even some sample code.
  • SQL Server (2 triggers): 
You need to have a lot of knowledge about how a particular DBMS works to implement them correctly. If I wanted to be really sure about my trigger implementation of the constraint from a concurrency point of view, I would have to work out every possible set of history sequences that could occur and whether any of them could violate the integrity of the database. I agree absolutely that DBMS vendors are better positioned to have the DBMS relieve the developers from this error-prone burden.
The use of the INSERTED and DELETED system-generated temporary tables allows the constraint to be restricted only to the set of rows being inserted, updated or deleted - after that it is just another SQL expression and is evaluated by the optimizer just like any other expression.
  • Oracle (4 very long and complex compound triggers): 
Since Oracle does not provide an "inserted" table that holds just the rows to be inserted (or new versions of the rows in case they are updated), we need row-level triggers and a temporary table that we maintain ourselves. We also need a separate procedure to cater to acquisition of predicate locks. Given Oracle's default 'snapshot isolation', one has to introduce predicate locks and acquire them at the appropriate time inside the transaction, to ensure that the queries that execute as part of the validation of a constraint can safely 'conclude' it is OK all the way until the end of the transaction (when it commits).
  • DB2 (4 triggers): Judging from the sample code, it falls somewhere between SQL Server and Oracle, but closer to the former.
Note that, as Toon points out, both correctness and efficiency are paramount. Without them, a solution will not be accepted in the real world, where large, high-transaction-rate databases are mainstream. This burden is exacerbated by SQL being a highly redundant language--the same query can be expressed in multiple ways--which makes optimization much more difficult (as I demonstrated in an '80s article, Language Redundancy and DBMS Performance; see The way you write your query matters; that's one reason Oracle does not optimize triggers).

Declarative ASSERTIONs would allow developers to express constraints declaratively and focus on constraint logic rather than on implementation details. Here, for example, is how the disjunctive constraint is expressed in Dataphor, a truly relational DBMS whose native data language is not SQL:
CREATE CONSTRAINT no_living_dead
 NOT EXISTS (dead_people {person_id} JOIN (live_people {person_id}));
Of course, Toon reminds us, declarative assertions are only as good as the CPU and concurrency efficiency of the underlying implementation code that actually validates the constraint during transactions.

This is why the real solution is shorthands, which relieve developers of the implementation burden. They would also help vendors: developing shorthand-specific validation code is much easier than accepting developer-formulated ASSERTIONs and parsing and deep-analyzing them.
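To be concrete about what such a shorthand could look like--purely hypothetical syntax, supported by no product--the disjunctive constraint might be declared in the CREATE TABLE statement itself:
CREATE TABLE dead_people (
 person_id INTEGER NOT NULL PRIMARY KEY
           DISJOINT WITH live_people (person_id),  -- hypothetical shorthand
 name      VARCHAR(50) NOT NULL,
 death     DATE NOT NULL);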

We have seen that both design options use views, the first to represent the two subtype classes, the second to represent the supertype class. Here are the SQL view definitions of the first design:
CREATE VIEW live_people
AS SELECT person_id, name, birth
   FROM people
   WHERE NOT EXISTS
               (SELECT *
                FROM deaths
                WHERE deaths.person_id = people.person_id);
CREATE VIEW dead_people
AS SELECT people.person_id, people.name, people.birth, deaths.death
   FROM people, deaths
   WHERE people.person_id = deaths.person_id;
and here's the one of the second design:
CREATE VIEW people
AS SELECT person_id, name
   FROM live_people
   UNION
   SELECT person_id, name
   FROM dead_people;
In fact, to benefit from logical data independence, applications should access a database only through views. A TRDBMS that correctly supports the relational model (i.e., "understands" predicates) would be able to update these views, but due to backward compatibility with various violations of the model, SQL DBMSs cannot guarantee propagation of view updates to the underlying base tables, so the views in either design are not updatable, defeating one of the core practical advantages of the model.
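For example, an attempt to insert a new person through the second design's supertype view fails--the DBMS cannot, among other things, determine which base table the row belongs in:
INSERT INTO people (person_id, name)
VALUES (42, 'Ambrose Bierce');
-- rejected: SQL DBMSs do not update UNION views
-- (absent proprietary workarounds such as INSTEAD OF triggers)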

Note: There is some genuine controversy regarding view updating (see my exchange with Erwin Smout).

To recall, this series was triggered by WB's inquiry into logical design that avoids NULLs. The first three posts covered the implications of avoiding 'inapplicable' NULLs and assumed no missing data.

In Part IV David McGoveran will discuss the correct relational solution to missing data--outlined in my The Final NULL in the Coffin paper--and another burden on developers due to its lack of DBMS support.



18 comments:

  1. Regarding the concurrency problem:

    1. I think that Postgres has a correct implementation of transaction serialization. This could help avoid all those locks (as explained in the Applied Mathematics... book).
    2. I have used Dataphor quite a bit and still wonder how it handles concurrency (as far as constraints are concerned).

    Any confirmation/ideas?

    Thank you

    J-C

    Replies
    1. A "correct implementation of transaction serialization" does not "avoid" locks, it __depends__ on them. Even preemptive scheduling mechanisms require a transaction to declare upfront the set of resources that transaction will need, and verify at each attempted resource access that the transaction doesn't step outside of its self-declared bounds. Although slightly different in kind from the ones we typically think of when speaking of "locks", that too is a form of "lock" !!!

  2. I do not consider myself a product expert, which is why I consult product experts when necessary.

    The point of this series is to sensitize to the many ways in which relational failures impact practicality.

    The core issue is that (1) implementation details should be tackled once, by the vendor, which is better equipped to do it; (2) every developer would become more productive; (3) systems and applications would become more reliable; (4) there would be less dependency on uneven developer skills; (5) there would be less risk that individual developers would err and render databases inconsistent for everybody; and so on.


  3. Thanks for this continued series - I am just now starting to get a hint of the big picture and so have much more studying and reading to do :-). Joe Celko recently mentioned the declarative assertion (I had never heard of it until your series) as a part of the ANSI SQL standard that is unimplemented (http://www.simple-talk.com/sql/t-sql-programming/unique-constraints-in-sql/). DBMS vendor advocates will defend not implementing parts of the ANSI standard on the grounds that they must implement the features customers demand rather than those in the standard. I think the 5-point core issue you close with is right on, but given the customer-demand model under which software is built, I am not sure how it can be advanced.

    As an aside, I've always been confused as to why access through views achieves logical data independence. In practice I've always observed the view requirement is an administrative mess. In fact Craig Mullins argues against it in his DB2 Developers guide. Perhaps this can be a good topic of some future post (or was of some past post or article that I've missed :-)...

    Replies
    1. I advise care with Celko's material.

      It is hardly surprising that vendors defend their failure to implement the RM and various important features of the standard. Why should they bother, when they can leave most of the burden to the customers and the latter accept it? Since foundation knowledge is poor at both ends, there is little incentive for vendors to make the effort when users demand unsound features that are easier to implement.

      Logical data independence is rather easy to understand IF you know the RM and what views are. A view is a saved query that defines a tabular window into one or more base tables. It is a virtual table that reflects the data from the underlying base table(s) over which the view is defined. So from the user's perspective, one looks at the base table(s) THROUGH the view window.

      Now suppose you have a base table that for various reasons you decide to split into two projections for further normalization. Applications that accessed the original table will no longer work and require maintenance. But you can define a join view over the two new tables and give it the name of the original table, and the applications will continue to work unimpaired. In essence the view insulates them from changes in the logical database structure.

      This, of course, holds true in a relational system that supports updating of join views (or multitable views in general). SQL systems do not, so updates must be applied directly to the base tables rather than through the view, and when a table splits, there is no insulation.
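      (A minimal sketch of the mechanics, with illustrative table and column names:)

      -- original base table: contacts (person_id, name, phone)
      -- split into two projections for further normalization:
      CREATE TABLE contact_names (
       person_id INTEGER NOT NULL PRIMARY KEY,
       name      VARCHAR(50) NOT NULL);
      CREATE TABLE contact_phones (
       person_id INTEGER NOT NULL PRIMARY KEY
                 REFERENCES contact_names (person_id),
       phone     VARCHAR(20) NOT NULL);
      -- a join view with the original table's name keeps existing
      -- applications working--at least for retrieval (see below):
      CREATE VIEW contacts
      AS SELECT contact_names.person_id, contact_names.name, contact_phones.phone
         FROM contact_names, contact_phones
         WHERE contact_names.person_id = contact_phones.person_id;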

    2. If a DB design has a table {K,D1,D2} (key K), then this design dictates a rule "if a D1 value corresponding with a K is deleted, then its corresponding D2 value must be deleted too".

      Existing programs will, of necessity, comply with that rule.

      But if the design is changed by splitting this table in two, then that rule goes away too. Is it even valid to just simply assume that the existing program really should continue to do as it did before, or could it be that the design change means a change in the set of business rules, that this change in the set of business rules affects the operation of that program, and that therefore the program MUST be revised ?

      Making the assumption that the program can just go on doing what it did before seems like a gratuitous choice for less work and quicker results, with any remaining needed thinking to be done later.

    3. Well, the example I gave was one of further normalization. IOW, the original R-table is not in 5NF, and there are very few who are even aware that an integrity constraint must be added to control the redundancy, let alone what the constraint is, or how it will defeat the performance gains from denormalization, if any.

      So in the example I gave, the need for a constraint that should have been added to the original R-table but wasn't is in practice eliminated, not a need for a new one created.

      You are, of course, correct about a change in the logical structure that requires a new constraint. This is actually clear from the example in the series if, for example, you switched from one of the designs to the other--and from a referential to a disjunctive constraint.

    4. My point regarding confusion on how views achieve logical data independence was more of a practical nature. What I have observed over the years is that in almost all cases the changes made to base tables add business function and must also be made known to the application, views or not, in order to present the new functionality to the user. Almost never have I seen a change as you describe, done for non-functional reasons, that can be totally shielded from the consuming application. So while "logical data independence" is theoretically possible (even without the ability to update join views, which can be circumvented in current RDBMS products with INSTEAD OF triggers, as sketched below), in practice the application seems practically dependent on the base schema despite the concept. The view layer becomes nothing more than administrative overhead. This is what Craig addresses in his book, recommending creating views only when they add specific value in a specific situation (as you showed in your example) instead of en masse.
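      (For illustration, the kind of INSTEAD OF trigger the commenter mentions--SQL Server syntax, against the contacts join view sketched earlier; one such trigger is needed per view and per operation, which is itself part of the administrative overhead described:)

      CREATE TRIGGER contacts_ins ON contacts
      INSTEAD OF INSERT
      AS
      BEGIN
        -- route the view insert to the two underlying base tables
        INSERT INTO contact_names (person_id, name)
        SELECT person_id, name FROM INSERTED;
        INSERT INTO contact_phones (person_id, phone)
        SELECT person_id, phone FROM INSERTED;
      END;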

    5. First, the further normalization case I illustrated is a good example of the benefit of LDI without "functional extension"--a benefit that is nevertheless extremely valuable, given the enormous number of databases denormalized for performance.

      Second, in Part IV David alerts us to the importance of having some notion of applications' data access patterns, so as to define views intelligently--views that will not be defeated by every functional extension.

      You can't ask views to protect you from base changes that views CANNOT protect from. My guess is, though, that if practitioners understood and appreciated views and LDI and vendors implemented them correctly, there would be a huge number of circumstances in which views and LDI would save a lot of time, effort and money.

    6. I agree with Erwin Smout's comment that changing the database design must mean that a change to the business rules has taken place.

      For example, if we declare the relvar Person {Person_ID, Person_Name, Date_of_Birth} key {Person_ID}, we are explicitly stating the constraint that we cannot introduce a person into our database unless we know their name and date of birth.

      If the business rules subsequently change such that we are allowed to introduce a person into the database without knowing their date of birth then the design of the database must be changed accordingly to reflect this new rule.

      As a specific example a person with an unknown date of birth could be a valid sales lead for a life insurance company, but no life insurance policy can be sold until the person's date of birth is known.
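      (A sketch of a revised design under the new rule, with illustrative names and types--the date of birth moves to its own table, so no NULL is needed:)

      CREATE TABLE persons (
       person_id   INTEGER NOT NULL PRIMARY KEY,
       person_name VARCHAR(50) NOT NULL);
      CREATE TABLE person_births (
       person_id     INTEGER NOT NULL PRIMARY KEY
                     REFERENCES persons (person_id),
       date_of_birth DATE NOT NULL);
      -- a sales lead may appear in persons only; selling a policy
      -- can then be made to require a matching person_births row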

    7. Views are not a panacea. There are db changes that are amenable to LDI and others that are not.

      A change in the business rules is a change in the business model, and LDI may not be guaranteed. That's the case with a switch between the two designs.

      My example was one of design repair, which does not involve a change in business rules. Given that so many databases are "denormalized", if there ever was an intention to repair, it would be practically impossible without LDI.




  4. En passant, I’d eschew columns named name for as often happens a name collision emerges! As for the utility of views I’ll add some brief, not terribly inspiring, anecdotes. One’s programming style can be view (and table) centric with some advantages of modularity and a tendency toward declarativism over proceduralism – particularly when contrasted with procedural wrappers with embedded complex queries that must be excised for analysis. I.e. accomplish as much computing with tables and virtual tables as practical. In the manner of appreciating the admonition to avoid premature optimization one should also avoid premature proceduralism. (I know this must be hard for Java school programmers who work from Java toward SQL rather than using as little Java “glue” as possible.) In vendor world, such as Oracle, views can facilitate some ease of management of performance hints as compared to using such hints in queries embedded in procedural code. Also in Oracle, views facilitate virtual private databases which provide some performance and data access security advantages in an Oracle conventional manner. Now have I mentioned materialized views? No? At times these beasts become necessary in Oracle SQL land for performance, and at times for assertions!

    Replies
    1. I consider this issue part and parcel of foundation knowledge. A smart developer can probably realize this from experience, but it may well take him a long time, whereas less talented developers will not realize it at all.

      Foundation knowledge needs to be part of education, which in turn is made effective by intellectual development. Sheer experience with tools is not sufficient and, because the tools are often flawed, it can actually be counterproductive.

      It is better for security to be handled by the security system than through views. Mixing logical design with security is a source of complications and management problems.

    2. Re: your security remark in response to my views and VPDs remark.

      "For various reasons, such an alternative design — an alternative view of the database, in effect — might be considered more suitable than the base design for certain users. More importantly, that alternative design might actually exclude parts of the underlying or “real” database that some users have no interest in, or perhaps are not authorized to see.

      Date, C. J. (2012-12-20). View Updating and Relational Theory (Kindle Locations 290-293). O'Reilly Media. Kindle Edition.

      A further conflation of security and the SQL DBMS is hinted at in http://en.wikipedia.org/wiki/WinFS . Interesting is the notion of rethinking operating system architectures in terms of databases rather than file systems. Reminds me of Lisp machine OSes written in Lisp: http://en.wikipedia.org/wiki/Genera_(operating_system)

    3. Yes, I know, but Chris and I do not always agree. Besides, I did not say using views for security is prohibited, I only said that it is preferable not to mix views and security, so that everything security-related is in one place. It's not a theoretical, but a pragmatic suggestion.

      As to building OS on db foundations -- old story. Don't hold your breath.

  5. "I've always been confused as to why access through views achieves logical data independence. In practice I've always observed the view requirement is an administrative mess. In fact Craig Mullins argues against it in his DB2 Developers guide." ???

    Every time I've ever used views in a system (in SQL it requires building the views AND restricting access to the database by using stored procedures for every operation) it has worked like a charm and the project became simpler to maintain and extend. But many times I've failed to implement this just because of people whining that it will be too much work@$#%! I hope they enjoyed the sweat and contortions for every change they made since I left! My Advice: it is not how fast you go; it is how you go fast.

  6. In general, if you are educated on the RM and really understand views, there is no way to fail to realize how crucially beneficial they are.

    Of course, on the one hand that does not happen and on the other lack of view updatability in SQL does defeat LDI.

    In the absence of education, there is no distinction between RM and SQL and there is inability to connect the negative implications with their real source.

    IOW, they may well run into that sweat, but they will not know what caused it and, with SQL systems, they won't be able to do much about it anyway.

    And that's how "RM sucks" and systems that generate even more sweat (NoSQL) proliferate.

  7. I may have the (mis?)fortune to know enough Oracle to challenge "Since Oracle does not provide an "inserted" table that holds just the rows to be inserted (or new versions of the rows in case they are updated) ..." The terse Oracle DDL create materialized view log on tab; would make the creation of such a table less arduous than the reader is led to believe. In turn, materialized views may facilitate hand-crafted assertions with a single simple trigger. I'll leave it at that for now and rejoin someday when I've reprised material from Messrs. Gupta, Koppelaars and de Haan--and when I've returned from analytics land to OLTP land.
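    (Roughly the kind of technique alluded to--an Oracle sketch for the disjunctive constraint using a commit-time-refreshed materialized view; this variant uses a CHECK constraint on the MV rather than a trigger, and fast refresh on commit has further prerequisites per Oracle's documentation:)

    CREATE MATERIALIZED VIEW LOG ON live_people WITH ROWID;
    CREATE MATERIALIZED VIEW LOG ON dead_people WITH ROWID;
    CREATE MATERIALIZED VIEW living_dead
     REFRESH FAST ON COMMIT
     AS SELECT l.ROWID l_rid, d.ROWID d_rid, l.person_id
        FROM live_people l, dead_people d
        WHERE l.person_id = d.person_id;
    -- a CHECK no row can satisfy turns the MV into an assertion: any
    -- transaction that materializes a 'living dead' person fails at COMMIT
    ALTER TABLE living_dead ADD CONSTRAINT no_living_dead
     CHECK (person_id IS NULL);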
