Sunday, August 9, 2015

Surrogate Key Illusions

Note: This is a 12/05/17 re-write of two earlier posts, to bring them in line with the McGoveran's formalization and interpretation [1] of Codd's true RDM. The posts were my reactions, as author of multiple writings on keys, to  Ami Levin's presentation, Surrogate Key Illusions.
 

Primary Keys: A FOPL Formal Mandate


Levin's reference to a disagreement between Codd and Date on "what a primary key is" is inaccurate. The disagreement is on whether PKs should be mandatory (Codd) or just "a good idea" (Date, who considers candidate keys essential). A likely reason is that Codd did not explain the formal motivation for the PK mandate and Date focuses on PK selection -- the criteria for which are pragmatic (familiarity, stability and simplicity) -- and overlooks the formal requirement of first order predicate logic (FOPL) [1].




Keys Are Logical


An attendee defined key as "a field/combination of fields to select a record", which I suspect is rather a common view. But files, records and fields are physical implementation features, while keys are logical and confusing the two levels of representation is logical-physical confusion (LPC).

A key is a combination of one or more attributes that uniquely identifies tuples and represents formally in the database an identifier of object members of a group represented by the relation in the database. Relations, attributes and tuples are logical features. Issues such as performance and portability are determined exclusively by implementation, not logical factors. PK selection is purely logical design that should not be contaminated by implementation considerations [2] due to LPC. If PK selection affects the performance of a specific DBMS, it is a consequence of its implementation, not of the PK choice per se. Such effects, if they occur, should be considered a matter of how well the DBMS implementation handles the choice.

Use of surrogate keys (SK) is, thus, based on performance considerations, nor is it a matter of just personal preference. There are specific circumstances in which natural keys (NK) -- properly understood!!! -- will not do [3]. Moreover, the three PK selection criteria often conflict and must be traded off (e.g., a simple SK is sometimes preferable as referential target to a meaningful NK). Be that as it may, more often than not SKs are over-employed for wrong reasons [3]. They certainly should not be added universally to every relation and when they are, each SK should be valid -- defined such that:

  • It has 1:1 relationships with the NK;
  • All non-key attributes are functionally dependent on it;

Surrogate Keys As PKs


"When defining a surrogate primary key for a [SQL Server] table, two options are the most common: Integer and UniqueIdentifier (aka Globally Unique Identifiers, or GUID's) ... Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic. UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”. With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution."

While NKs represent real world intrinsic properties, ensure object integrity and provide meaningful logical database access, they tend to be complex (i.e., are either composite, or have a complex representation) and do not make good referential targets. SKs represent assigned artificial names, do not ensure object integrity or provide meaningful access, but they can be simpler referential targets and can substitute for NKs when they won't do [3]. GUIDs are themselves complex, so they make no sense as PKs. And as already explained, globally unique (database-wide) keys that emulate the object IDs of "OODBMSs" serve no function in relational databases [4].

Note: SQL DBMS system data types (SDT) such as INTEGER are, at best, primitive domains with vendor-defined value ranges. User-defined domains cannot be derived from them by constraints, simple domains are not enforced [5] and attribute constraints are not supported [6]. The domain from which an attribute draws its values and, therefore, the type of those values, is determined by the real world property that the attribute represents and the operations that make sense for that property. The choice is easier with support of attributes as representations of domains [6], unavailable in SQL.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., Don't Mix Model with Implementation.

[3] Pascal, F., Understanding Keys: Intrinsic Properties and Object Names.

[4] Pascal, F., Kinds of Keys: Natural, Surrogate and Primary Keys Are Sufficient.


[5] Pascal, F., Simple Domains and Value Atomicity.

[6] Pascal, F., To Really Understand Integrity, Don't Start with SQL.




Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

2 comments:

  1. Most complete yet pointed explanation I've read regarding NK, SK, and PK. FYI, I think the "Third" in "Third, surrogate PK's require joins" above should be "Fifth".

    ReplyDelete
    Replies
    1. Just checking if readers are awake :).

      Yup, thanks. See my today's post on keys.

      Delete