Sunday, June 9, 2013

Foreign Keys, Part 2: The Costs of Application-Enforced Integrity

In my previous Forward to the Past debunking of the exchange on foreign keys I promised that if I found any leftovers warranting comment, I would offer them. I did, there were, and here is a second installment.

I would like, first, to point out that even though the exchange is about FK's in particular--which, by the way, are referential integrity constraints--most of the discussion applies to integrity constraints in general, of which there are several kinds (see Business Modeling for Database Design).
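
For concreteness, here is a minimal SQL sketch (the tables and names are hypothetical) showing an FK, i.e., a referential constraint, alongside two other common kinds: a key constraint and an attribute (CHECK) constraint.

    CREATE TABLE customers (
        customer_id  INTEGER PRIMARY KEY,                         -- key constraint
        credit_limit DECIMAL(10,2) NOT NULL,
        CONSTRAINT credit_limit_nonneg CHECK (credit_limit >= 0)  -- attribute (CHECK) constraint
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id)
            REFERENCES customers (customer_id)                    -- referential (FK) constraint
    );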

The discussion hovered around whether FK's are mandatory or optional (which I may tackle in a future post). Here I will comment on some of the replies triggered by the FK question. Those grounded in foundation knowledge demonstrated its value over those that were not; the two are readily distinguishable by the informed eye, and if you cannot tell which is which, education is in order.

I went through all the informed responses and made a list of the negative consequences claimed to flow from integrity enforced by applications rather than the DBMS.

Relational infidelity: Integrity of the data and provably logically correct query results are guaranteed if and only if (1) the formal integrity constraints that represent the informal business rules are enforced in the database and (2) the DBMS implementation and the logical database design are truly and fully relational; otherwise all bets are off.

Multiple sources of corruption: The database can be updated by multiple applications and even by direct data language operations; if one of them corrupts the database, it does so for all data access. Even if some application developers were able to enforce constraints as well as a DBMS does, it cannot be guaranteed that all of them are; integrity must be enforced centrally. This is exacerbated by poorly designed, weakly relational data languages such as SQL and by developers without sufficient, if any, grounding in logic, particularly when it comes to constraints of arbitrary complexity (see Appendix A, Business Modeling for Database Design).
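
To illustrate with the hypothetical schema sketched above: once the constraint is declared to the DBMS, every update path is subject to it.

    -- With the FK above declared to the DBMS, this insert is rejected no matter
    -- which application or ad hoc session issues it; no update path can bypass
    -- the central check.
    INSERT INTO orders (order_id, customer_id)
    VALUES (1001, 9999);   -- fails if no customer with customer_id 9999 exists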

Consistency checks and cleanup procedures: This burden will have to be undertaken on a regular and frequent basis to detect and correct any integrity violations that slipped past the applications.
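
For example, a periodic check for referential violations that slipped through applications might look something like the following (again against the hypothetical tables above):

    -- Find "orphaned" orders whose customer_id matches no customer; with the FK
    -- declared and enforced by the DBMS, this query can never return a row.
    SELECT o.order_id, o.customer_id
    FROM   orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
    WHERE  c.customer_id IS NULL;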

Development/maintenance redundancy and burden: If constraints are not enforced centrally, they must be enforced by each and every updating application. Aside from the unnecessary burden on developers and the increased risk of errors of omission or commission, consider what happens when business rules change!
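
A sketch of the contrast, using the hypothetical CHECK constraint declared earlier (the constraint names are illustrative and the ALTER syntax varies somewhat across SQL DBMSs):

    -- If the rule changes (say, the minimum credit limit becomes 100), central
    -- enforcement means one declarative change:
    ALTER TABLE customers DROP CONSTRAINT credit_limit_nonneg;
    ALTER TABLE customers ADD CONSTRAINT credit_limit_min_100 CHECK (credit_limit >= 100);
    -- With application-enforced integrity, every updating application must be
    -- located, changed, retested and redeployed.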

Hidden information: The core relational Information Principle (IP) requires that all information be represented (1) explicitly and (2) in exactly one way: as values in R-tables. Representations that violate the IP, e.g., information embedded in row order, encoded in column or table names, or concatenated in single columns, are hidden from and inaccessible to the DBMS, and undocumented, with obvious consequences. The same is true of integrity constraints, including the referential constraints that protect the consistency of data values, when they are buried in application code rather than declared to the DBMS (see next).
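
One common violation of this kind, sketched with hypothetical tables:

    -- An IP violation of the "concatenated in single columns" variety: the
    -- individual items exist only inside a string the DBMS cannot interpret.
    CREATE TABLE orders_denormalized (
        order_id INTEGER PRIMARY KEY,
        items    VARCHAR(400)    -- e.g. 'widget:2,gadget:5' is hidden from the DBMS
    );

    -- The conforming design represents every fact explicitly, as values in columns,
    -- where keys and constraints can protect them.
    CREATE TABLE order_items (
        order_id INTEGER REFERENCES orders (order_id),
        item     VARCHAR(100),
        quantity INTEGER CHECK (quantity > 0),
        PRIMARY KEY (order_id, item)
    );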

Inhibited performance optimization: Integrity enforcement has a significant impact on performance, and it is critical for the DBMS optimizer to take constraints into consideration if it is to do its job effectively. When constraints are scattered in applications and "hidden" from the DBMS, optimization is inhibited (such optimization, of course, requires a TRDBMS and fully normalized R-tables).
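
As an illustration, consider the hypothetical schema again; "join elimination" of this sort is one example of constraint-aware optimization that some SQL optimizers perform, but only when the constraints are declared to, and trusted by, the DBMS:

    -- With the FK and NOT NULL declared, some SQL optimizers can remove the join
    -- below altogether, because every order is guaranteed to have exactly one
    -- matching customer. If that rule lives only in application code, the DBMS
    -- cannot know it and must perform the join.
    CREATE VIEW order_customers AS
    SELECT o.order_id, o.customer_id, c.credit_limit
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id;

    SELECT order_id, customer_id    -- touches columns of orders only
    FROM   order_customers;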

With all this in mind, let's now consider some of the replies.
...compromise for performance reasons, as too many foreign key validations can slow down high volume inserts. Other times you have to create breakpoints just to keep the web of relationships from becoming too tangled and connecting hundreds or thousands of tables.
It is important for data professionals to always keep in mind that performance is determined entirely at the physical implementation level, and if they encounter performance issues, that is where solutions should be sought. If and only if they exhaust all the options there--of which there are a myriad--and performance is still unsatisfactory should they (1) consider themselves forced by the specific DBMS implementation to trade off integrity for performance and (2) do so with full and conscious consideration of all the costs of such tradeoffs mentioned above (see The Costly Illusion: Normalization, Integrity and Performance). What they should not do is blame the relational data model for poor performance, which confuses the solution with the problem.
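
One illustration, against the hypothetical schema above, of a physical-level option that is often overlooked:

    -- A purely physical remedy worth exhausting before any integrity tradeoff:
    -- index the referencing column, so that FK checks triggered by deletes or
    -- key updates on customers need not scan orders. (Not all SQL DBMSs create
    -- such an index automatically.)
    CREATE INDEX orders_customer_id_idx ON orders (customer_id);
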
I believe in "defense in Depth" ... which means that you enforce referential integrity at every possible point: Data entry, data load, database.
This is a tricky issue, though a pragmatic rather than a theoretical one. There are some valid reasons for multi-level integrity enforcement, for example, to have the application give the user feedback on violations before it updates the database. But this should be weighed against the costs: redundancy, development and maintenance burden, increased error-proneness, possible inter-level inconsistencies, and the performance impact.
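
A sketch of what such multi-level enforcement looks like with the hypothetical schema above:

    -- The application may pre-check input to give the user immediate, friendly
    -- feedback (the literal 9999 stands for a user-supplied value)...
    SELECT 1 FROM customers WHERE customer_id = 9999;

    -- ...but the declared FK remains the final arbiter when the update is issued,
    -- covering every application, ad hoc updates, and any race condition between
    -- the check and the insert.
    INSERT INTO orders (order_id, customer_id) VALUES (1002, 9999);
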
I think, you have to learn about data structures and logical data design (not only database which is nowadays are interpreted mainly as only RDBMS), to be clear about usage primary, alternate, and foreign keys, normal forms, data integrity - and database integrity, because your database will work suboptimally without these knowledge if it will work anyhow.
Obviously, this respondent detected that the question reveals poor foundation knowledge, confirming my claim in the first post on this subject.
