Monday, July 8, 2013

Relational Theory and Database Practice

I shared the links to my recent three-part series on foreign keys (and integrity constraints in general) on LinkedIn. Comments on the second installment raised an important issue about keys (discussed in more depth in Business Modeling for Database Design), which deserves attention.
NK: Let me first affirm my position that I believe foreign keys are the fundamental bases on which relational database managements system operate. Foreign keys provide the relationship in database normalization. Foreign keys are like the framework of a building structure. While some developers may have the notion that constraints and integrity checks can be handled better at the application layer, I would want to refer them to tools like ER Studio, ERWIN, and Visual Studio ... A good database design starts at the logical design level. Abstracting constraints and integrity checks from this layer to the application layer can lead to corrupt database designs. A simple case in point; How would you enforce a unique constraint on a table with 10 million rows? Will it make better sense to have a unique index on the table\field or have the application layer enforce the constraint?

In my reply I agreed that, first, FK's are fundamental, but not more so than any other integrity constraint that represents some real world business rule in the database. Referential relationships are just one type of rules.

Second, "some" is actually a considerable number (possibly the majority) and that inclusion of a constraint in some design software is not the reason on which DBMS enforcement of integrity rests; neither is SQL DBMSs's use of physical indexes to enforce logical keys. Rather, such enforcement is a DBMS function in general and the relational model in particular: provably logically correct query results are guaranteed if and only if both the DBMS implementation and logical database design are truly and fully relational, otherwise, all bets are off.

I suspect that proponents of application-enforced integrity do not have keys in mind, but rather the several other types of constraint specified in my above-mentioned paper, particularly constraints of arbitrary complexity, for which SQL DBMS's do not provide shorthands.

This evoked the following response:
JH: I disagree that foreign keys are "not more so than any other integrity constraint." First of all, primary and foreign keys have no real world existence, they are an artifice used by relational databases [sic] to maintain referential integrity. If a flat file data store was being used, there would be no primary and foreign keys - right? For an RDBMS to work correctly, PK's and FK's are needed, because that's the way they're designed to work. Anything else constrained in the database is great, but pointless if referential integrity is lost.
The second part of the argument is exactly what I claim about correctness. But it is simply not true that "keys have no real world existence [and] are a "relational artifice".  To the contrary, PK's are an adaptation of relational theory to give it applicability to database management; FK's are "used in relational databases to maintain referential integrity", but integrity is itself a logical representation of business rules in the real world.

A relation on domains is a set of tuples, each of which is a collections of values, one from each domain. As a Cartesian product, the tuples are the set of all possible unique combinations of the domain values. As pure mathematical abstractions, domains, relations and tuples have no relationship to the real world.

Databases, on the other hand, represent facts about the real world. Consider, for example, the database R-table


If it were a relation, it would have one row for each unique possible combination of five domain values from which its columns draw their values. But that would not be a representation of employees, but rather of all poossible employees, past, present and future, that are considered valid possibilities by the employer. They include multiple "versions" or "generations" of an employee e.g. same EMP# value, but different DEPT# values (possible reassignments), or SALARY values (possible promotions), which cannot exist concurrently in the database.

In other words, uniqueness of all possible domain value combinations is not sufficient. To apply relational theory to database management requires the adaptation of abstract relation tuples to database R-table rows that represent real world actual employees. Real world entities do have an identifier (or identity attribute), or else they would not be distinguishable. A PK is a column representing the identifier.

Similarly, there are different classes of entities in the real world e.g., employees and departments, and some adaptation was needed to represent cross-class relationships e.g., departmental assignments of employees referencing existing departments, hence FK's.

It's such adaptations of the theory that made it applicable and useful to database management practice. As I demonstrate in my paper, every feature of the relational model represents some feature in the real world. It is ironic, therefore, to consider the practical adjustments "artifacts of the theory" and then question its practicality .

Some additional points:
  • There are very good reasons to assume that if integrity constraints are not enforced by the DBMS, they are unlikely to be enforced at all. Very often databases are "denormalized for performance" and contain redundancies. Many practitioners are unaware of the need to add integrity constraints to control the redundancy. But not only would their formulation be at the application level and prone to error, it would defeat the very purpose of denormalization (see The Costly Illusion: Normalization, Integrty and Performance).
  • Mixing logical features such as keys and constraints with physical features like indexes and "flat-files", indicates logical-physical confusion.
  • I argue the exact opposite of JH's interpretation: I certainly do not mean that keys are not fundamental, I merely mean that any constraint that is not enforced by the DBMS carries the same risk: data inconsistency and wrong query results.

No comments:

Post a Comment

View My Stats