Tuesday, January 1, 2013

Integrity Is Not Only Referential

Many years ago I used this title for an article in which integrity meant something different: it was a critique of a vendor claim that its product was supporting referential integrity (RI), when in reality the enforcement was at the application level. I am reusing the title, but this time the meaning is strictly technical.

One of the online exchanges I participated in is Use of constraints on Db to improve data quality - good idea or better in the applications? initiated by the following question:
Can I ask whether people make use of the functionality provided by the database to ensure adequate data quality. Secondly do people apply this retrospectively as seems quite reasonable to me when a data problem is identified and the source data cleaned up - to do so could prevent future errors. There appears to be a tension between this sort of implementation and at least a perception of flexibility as database changes would be required should additional allowable values be required.

This is, of course, a question that foundation knowledge would obviate the need for and the same goes for some of the answers. I wrote frequently about this subject (most recently: Database vs Application Enforced Integrity).

Here is one sensible answers:
Why use an RDBMS if you are not going to use its strengths? DB level constraints come out of the box. Why code application logic that involves shifting data out of the db into a middle tier in order to decide whether it needs to accept or reject a given piece of data because of an RI rule, only to have to shift data I the db at the end of it? DB constraints are highly tuned in the database, a fundamental part of the architecture, tried and tested, robust and not likely to fail. Also, your data store has the rules built into it so if you change your application or allow access from more applications then the same RI rules always apply regardless of the route taken to access the data.

Coded RI in applications cannot be easily demonstrated through models or diagrams. ER models nicely show the presence of relationships (keys) in their context.
You don't need to reinvent wheels that are squarer, more wobbly and less useful that the wheels that are already available.
But there are two prerequisites for this:
  • The DBMS's data language supports fully and correctly integrity constraints of arbitrary complexity;
  • Database professionals are sufficiently proficient in logic and the data language to formulate them and verify their correctness.
Of course, this often does not hold true, hence my reaction:
RI and keys, PK and FK, are general constraints and are the easier part. The real burden are what we refer to as "arbitrary constraints", (for lack of a better term), which are business- and database-specific. Their support by SQL [products] is weak, usually proprietary and quite procedural [via triggers, see next post], which encourages application constraints.
My guess is that not many database professionals are aware of the classification of integrity constraints into general--common to all databases--and, for lack of a better term, "arbitrary". Domain, R-table (key, column) and database (referential or foreign key) constraints fall into the general category and, except for domains, SQL DBMSs usually provide shorthands for them. But most of the integrity burden is due to constraints of arbitrary complexity that represent business rules specific to each business and which are, therefore, database-specific. e.g. "Department X can hire only N employees of skills Y and Z and pay salaries not greater than that of its manager".

In an Appendix to my Business Modeling for Database Design paper I show that the formulation and verification of even the simpler general constraints would not be trivial were it not for SQL implementation of shorthands. Constraints of arbitrary complexity are much more demanding and I would trust neither data language designers nor users lacking proficiency in logic with their correct formulation (agian, see next post). Consider, in this context, an email I recently received from William Sisson:
A while back when we were looking at doing some restructuring of our design we looked at the possibility of making the database cross platform.
The data description language for creating the tables and most of the views proved to be quite straightforward, likewise the primary key, foreign key and check constraints. The main difficulty was of course with the stored procedures and though we didn't go ahead with the cross platform plan it prompted us to remove a lot of code from stored procedures and put the logic in views instead. Advances in recent versions of SQL have made many things that might previously have required SPs possible in SQL expressions - but there were also some things in SPs because we hadn't thought them through clearly enough.
The real stumbling block was the translation of constraints that are expressed in triggers. It is here that the different dialects differ very widely. So from the point of view of the book, I think it might be possible to settle on some kind of "standard" syntax for SQL, but constraints (other than those supported directly by SQL) are a difficulty.
The real "cross platform" layer is of course RM not SQL (I think this ties in with Erwin Smout's comment about people trying to teach converting SQL to relational algebra, rather than the other way round).
I have been reading Richard Snodgrass's book "Developing Time-Oriented Database Applications in SQL"+. The book is based on the SQL-92 standard and I was surprised to discover that the assert statement has been in SQL since then. Snodgrass uses the assert statement to express temporal constraints which I am sure would be very clumsy to express by any other means.
The question is why, more than 20 years after the standard,  do we still have no implementation of assertions in any SQL product (or at least none that I know of)? It doesn't even seem as if anybody is working on the issue, Toon Koppelaars comments on his Triggers Considered Harmful, Considered Harmful blog* "As far as I know the problem areas described above [about assert], haven't been fully researched yet by the scientific community: for one I'm unable to find this research,"
An implementation of the assert statement is one of the very few things that might make me consider changing to a different SQL-DBMS.
The answer is, of course, that the importance of fundamentals is not appreciated by neither vendors nor data professionals, nor management. If the design of a data language flouts them, it becomes much more difficult to correct or extend it later. So much so that vendors prefer to leave the prohibitive implementation burden to database and application developers and they tolerate it.

More on this in the next (four part) post.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment