Sunday, June 2, 2013

Forward to the Past: Foreign Keys -- Are They Really Necessary?

R: Do I Have to Use Foreign Keys? If I am already manipulating data properly, are foreign keys required? Do they have another purpose that I’m just not aware of? I appreciate the guidance!
If you are not sure what I mean when I so often deplore product self-training not preceded by education, this is a good example. Those who ask this question are not yet ready for database practice (and if they had a formal education, they should demand a tuition refund). Knowing a product's features is not enough for understanding why they are there, judge whether they should be there (and what others are missing), what are the pros/cons and how to exploit them intelligently (and work around what is missing).

The question triggered a large number of reactions, some that benefit from foundation knowledge and some suffer from the lack thereof. It is difficult for a practitioner who asks this kind of question to assess which is which--they may all seem reasonable, but the relevance, importance, or accuracy can be easily missed. Online exchanges are not effective and reliable education tools. Since my objective is to debunk misconceptions, I will focus on what is the problematic in the exchange.
DR: ... if you are "enforcing" FK integrity in your data pipeline by rejecting records which do not satisfy lookups or joins, for example. I think it is preferable to have FK constraints as an additional security layer and they can be disable[d] during loading if required; however, you need to be 100% certain your ETL is enforcing the constraints.
It is best to do both - have the ETL reject records which fail FK checks and report on these whilst also enforcing FK intergrity on insert / update, if appropriate. Even then, rejects can be handled and actioned appropriately. The only additional thing I can add is - when you delete from a FK enabled DB, make sure you do it in the correct order.
Integrity constraints are the formal approximation of the informal business rules in the conceptual model. FK's--the relational shorthand for representing attribute relationships in the database as referential constraints--is an important kind thereof. Because database consistency must be enforced for every application that updates the database, integrity enforcement is a DBMS, not application function. The whole point is to replace application/developer enforced integrity, which cannot be guaranteed to be complete and correct in each and every application accessing the database and to avoid constraint redundancy. A single application failure can render the database inconsistent for all.

Points arising:
  • This response implicitly assumes that the question refers to data warehousing, but there is nothing to indicate that that's what the asker meant. But even when that is the case, the last thing you want is to disable constraints!
  • Assuming that the "ETL layer" is a utility external to the DBMS, in the integrity context it is an application. You may argue that it is a "centralized" sort of application, but then what exactly is the advantage of having to declare the integrity constraint to it instead of to the DBMS? This would be particularly questionable when it comes to FK constraints, for which DBMS's implement shorthands. (see Database Design, Relational Fidelity and SQL Parts 1-4). And what happens if there are applications that update the database that do not go through the ETL layer?
  • Two integrity layers is a shaky proposition. What happens when business rules and, therefore, integrity constraints change? Integrity enforcement is placed in the DBMS precisely to avoid implementing it redundantly in each updating application, so why introduce such redundancy in another way? In the context of the ill-advised, but common willingness to trade off integrity for performance (see The Costly Illusion: Normalization, Integrity and Performance), does enforcing the constraints twice make sense? For what benefit?
JSH: We don't have every possible logical relationship enforced by the database. Sometimes you have to compromise for performance reasons, as too many foreign key validations can slow down high volume inserts. Other times you have to create breakpoints just to keep the web of relationships from becoming too tangled and connecting hundreds or thousands of tables.
See what I mean?
RJ: Databases can work with or without primary keys and foreign keys. The choice is yours... However ... enforcing referential integrity can be done by many methods ... TMTOWTDT = There is more than one way to do this ... It all depends on your approach... In the last ten years... every one is enforcing referential integrity with help of primary and foreign keys but before this ... a lot of applications were working without primary and foreign keys to enforce referential integrity and to avoid orphaned rows/avoid duplicate records.
Some database systems (e.g., MS Access) may have a limit on the parent-child links that you can establish for referential integrity (e.g., the number of automatic Cascade Deletes you can set up for related tables). To effectively implement referential integrity in these situations, I build lookup and delete queries in the application code for the applications that are attached to the database (incidentally, I would not call MS Access a database system precisely because of such limitations.
Referential integrity can be enforced by the application software also many applications were written before the year 2000 which did not depend on referential integrity being enforced by the database primary and database foreign keys.
This is an historically accurate description from which RJ fails to derive the right conclusion. He is essentially saying that because integrity was enforced by applications in the past, there is no good reason not to continue doing that--it's just a matter of individual preference. He misses the whole point of database management--to centralize certain functions that were cost-ineffectively scattered in applications--practically oblivious to the technological progress of the last four decades.

Think of what products and practices would have been if, instead of saying "if the DBMS tool does not support integrity constraints, I  enforce them in applications, practitioners had been saying "why should a DBMS force me either to incur unneccesary integrity burdens, or to trade integrity for performance?"
F: I think, you have to learn about data structures and logical data design (not only database, which is nowadays interpreted mainly as only RDBMS), to be clear about usage primary, alternate, and foreign keys, normal forms, data integrity-and database integrity, because your database will work suboptimally without these knowledge if it will work anyhow.
Exactly! (I assume that by database F means DBMS product).

If there’s something else worth debunking, I will tackle it in a next post.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment