Sunday, November 11, 2012

DBMS vs. Application Enforced Integrity

One of the online exchanges I participated in, Use of constraints on Db to improve data quality - good idea or better in the applications?, was initiated by the following question:
Can I ask whether people make use of the functionality provided by the database to ensure adequate data quality. Secondly do people apply this retrospectively as seems quite reasonable to me when a data problem is identified and the source data cleaned up - to do so could prevent future errors. There appears to be a tension between this sort of implementation and at least a perception of flexibility as database changes would be required should additional allowable values be required.

This is, of course, a question that foundation knowledge would obviate the need for, and the same goes for some of the answers. Here is one sensible response:
Why use an RDBMS if you are not going to use its strengths? DB level constraints come out of the box. Why code application logic that involves shifting data out of the db into a middle tier in order to decide whether it needs to accept or reject a given piece of data because of an RI rule, only to have to shift data into the db at the end of it? DB constraints are highly tuned in the database, a fundamental part of the architecture, tried and tested, robust and not likely to fail. Also, your data store has the rules built into it so if you change your application or allow access from more applications then the same RI rules always apply regardless of the route taken to access the data.

Coded RI in applications cannot be easily demonstrated through models or diagrams. ER models nicely show the presence of relationships (keys) in their context.

You don't need to reinvent wheels that are squarer, more wobbly and less useful than the wheels that are already available.
But, of course, there are two prerequisites:
  • Full and correct DBMS data language support of all constraints, including those of arbitrary complexity;
  • Proficiency in logic and the data language by database professionals, necessary to formulate and verify the constraints.
My guess is that not many database professionals are aware of the classification of integrity constraints into general (common to all databases) and, for lack of a better term, "arbitrary". Domain, R-table (key, column) and database (foreign key) constraints fall into the general category. Most of the integrity burden is due to constraints of arbitrary complexity that represent business rules specific to each business and which are, therefore, database-specific, e.g. "Department X hires only N employees of skills Y and Z and pays salaries not greater than that of its manager".
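To make the distinction concrete, here is a sketch of how the department rule above might be declared using SQL-92's assertion syntax, which, as discussed below, no mainstream SQL DBMS actually accepts. The table and column names (EMP, DEPT, SKILL, MGR_NO) and the headcount of 5 are illustrative assumptions:

```sql
-- Illustrative sketch only: SQL-92 CREATE ASSERTION syntax, unimplemented
-- in mainstream SQL products. All names and values are assumptions.
CREATE ASSERTION dept_x_rule CHECK (
    -- at most 5 employees in department X...
    (SELECT COUNT(*) FROM EMP WHERE DEPT_NO = 'X') <= 5
    -- ...all with skill Y or Z, none paid more than X's manager
    AND NOT EXISTS (
        SELECT * FROM EMP E, DEPT D
        WHERE E.DEPT_NO = 'X' AND D.DEPT_NO = 'X'
          AND (E.SKILL NOT IN ('Y', 'Z')
               OR E.SALARY > (SELECT M.SALARY FROM EMP M
                              WHERE M.EMP_NO = D.MGR_NO))));
```

Note that there is no shorthand here: the full logical formula must be written out, which is precisely why proficiency in logic is a prerequisite.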

Computers do not understand constraints semantically, as users do, but algorithmically. For every integrity constraint declared to it, the DBMS must have an algorithm that it uses to check whether the data to be inserted in the database satisfies the constraint and should, therefore, be accepted. SQL implementations simplify the formulation and verification of general constraints by providing shorthands: the user specifies the type of constraint, e.g. PRIMARY KEY, and the DBMS generates the checking algorithm. Hence my reply in the exchange:
Keys and RI are general constraints and are the easy part. The real burden is what we refer to, for lack of a better term, as "arbitrary constraints", which are business- and database-specific. Their formulation is [nontrivial without knowledge of logic] and support in SQL leaves much to be desired.
In an Appendix to my newly revised Business Modeling for Database Design paper I show that the formulation and verification of even the simpler general constraints would be non-trivial in the absence of those SQL shorthands. Constraints of arbitrary complexity are much more demanding.
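The shorthand mechanism is easy to demonstrate. A minimal sketch using SQLite via Python's sqlite3 module: the only things declared are PRIMARY KEY and REFERENCES, yet the DBMS supplies the checking algorithms and rejects violating rows on its own. The table names are illustrative assumptions.

```python
# Declarative shorthands in action: the DBMS, not the application,
# generates and applies the constraint-checking algorithms.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE dept (dept_no TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE emp (
    emp_no  INTEGER PRIMARY KEY,
    dept_no TEXT NOT NULL REFERENCES dept(dept_no))""")
conn.execute("INSERT INTO dept VALUES ('D1')")
conn.execute("INSERT INTO emp VALUES (1, 'D1')")        # satisfies both constraints

try:
    conn.execute("INSERT INTO emp VALUES (1, 'D1')")    # duplicate key
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    conn.execute("INSERT INTO emp VALUES (2, 'D9')")    # no such department
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

No checking code was written by the user; the two rejections come entirely from the declared shorthands, and they apply no matter which application routes data to the database.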

Consider, in this context, an email I recently received from a reader about the selection of a "standard" SQL dialect:
A while back when we were looking at doing some restructuring of our design we looked at the possibility of making the database cross platform.

The data description language for creating the tables and most of the views proved to be quite straightforward, likewise the primary key, foreign key and check constraints. The main difficulty was of course with the stored procedures and though we didn't go ahead with the cross platform plan it prompted us to remove a lot of code from stored procedures and put the logic in views instead. Advances in recent versions of SQL have made many things that might previously have required SPs possible in SQL expressions - but there were also some things in SPs because we hadn't thought them through clearly enough.

The real stumbling block was the translation of constraints that are expressed in triggers. It is here that the different dialects differ very widely. So from the point of view of the book, I think it might be possible to settle on some kind of "standard" syntax for SQL, but constraints (other than those supported directly by SQL) are a difficulty.

The real "cross platform" layer is of course the relational model [FP: that is, predicate logic] not SQL (I think this ties in with Erwin Smout's comment about people trying to teach converting SQL to relational algebra, rather than the other way round).

...I was surprised to discover that the ASSERT statement has been in the SQL standard since SQL-92. [Certain constraints] would be very clumsy to express by any other means. The question is why, more than 20 years after the standard, do we still have no implementation of ASSERT in any SQL product (or at least none that I know of)? It doesn't even seem as if anybody is working on the issue. Toon Koppelaars comments on his Triggers Considered Harmful, Considered Harmful blog:
As far as I know the problem areas described above [about ASSERT] haven't been fully researched yet by the scientific community: for one, I'm unable to find this research.
An implementation of the ASSERT statement is one of the very few things that might make me consider changing to a different SQL-DBMS.
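The trigger workaround the reader describes can be sketched as follows, here in SQLite's dialect via Python's sqlite3. The table names and the salary rule are illustrative assumptions, and the trigger syntax would have to be rewritten for each vendor's dialect, which is exactly the portability problem:

```python
# Procedural enforcement of an "arbitrary" constraint via a trigger,
# in the absence of ASSERT support. Dialect-specific (SQLite shown).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emp (
    emp_no INTEGER PRIMARY KEY,
    salary INTEGER NOT NULL,
    mgr_no INTEGER REFERENCES emp(emp_no))""")
conn.execute("""
    CREATE TRIGGER emp_salary_cap BEFORE INSERT ON emp
    WHEN NEW.mgr_no IS NOT NULL
     AND NEW.salary > (SELECT salary FROM emp WHERE emp_no = NEW.mgr_no)
    BEGIN
        SELECT RAISE(ABORT, 'salary exceeds manager''s');
    END""")
conn.execute("INSERT INTO emp VALUES (1, 100, NULL)")   # the manager
conn.execute("INSERT INTO emp VALUES (2, 90, 1)")       # ok: below the cap
try:
    conn.execute("INSERT INTO emp VALUES (3, 110, 1)")  # violates the rule
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Note that this trigger guards only INSERT; an UPDATE that lowers the manager's salary slips through unchecked. That gap, which a declared assertion would close, is one reason trigger-enforced constraints are so much harder to get right than declared ones.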
Toon states that efficient and correct support of assertions is hard. It is certainly not easy, but it is practically a lost cause if the DBMS is not designed as a true and full implementation of the relational model, which SQL DBMSs are not. If the authors of data languages initially flout sound relational and good language design principles, subsequent corrections or extensions require drastic surgery, which proves very difficult, particularly if the installed user base is large. It took a long time for vendors to add some of the general constraints, and implementations are complex and often inefficient.

Is it any wonder that so much integrity is either enforced by applications, or not at all? Or that integrity is traded off for efficiency?

But note very carefully that
  • Application constraints are even less likely to be correct or efficient, and awareness, detection and correction of errors are even more difficult;
  • NoSQL or other non-relational "innovations", which tend to give up on consistency altogether, are certainly not the solution.

