Sunday, August 9, 2015

Surrogate Key Illusions

 Revised 7/26/16 (See Understanding Keys of 7/31/16 for a more in-depth discussion). 
"When defining a surrogate primary key for a table, two options are the most common: Integer and UniqueIdentifier (aka Globally Unique Identifiers, or GUID's) ... Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic. UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”. With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution."

Keys are one of several practical adaptations of the mathematical theory of relations to database practice. Relations are mathematical abstractions that do not represent anything in the real world--their tuples are unique by definition. A 5NF database relation, on the other hand, represents a set of (facts about) a class of property-sharing entities, which are distinguishable in the real world by one or more properties, otherwise we would not be able to tell them apart, or even count them. NK's represent in the database entity-defining identifiers (EDI)--the set of inherent properties that defines them as such and uniquely identifies them. A class of entities without an EDI is a flag for reconsidering the modeling. [1]

If every relation has at least one NK, why a SK? NK's are immutable and meaningful to users, but tend to be complex (i.e., either composite, or have a complex representation) and do not make good targets for references by other relations, which, for practicality, should be simple. SK's are simple keys that substitute for NK's as primary keys (PK) for referential purposes.

Note: The most common--and erroneous argument for SK's is that they are more stable than NK's. But a NK, properly understood, is immutable, because it represents a set of inherent entity properties (e.g., DNA for people, ownership and a set of books for companies), that change only if the entities themselves change. SK's, on the other hand, are simple human or system generated arbitrary unique names that substitute for complex NK's and are subject to change (e.g., SSN, patient id). [1]

Note, first, that this disqualifies the likes of SQL Server's GUID's--deploying a complex SK as PK defeats the purpose. The quoted comment argues that they are
"[T]he only correct selection in any replication-like scenario. If multiple sources may be generating data for the same table (as in SQL Server replication), a UniqueIdentifier is required to ensure uniqueness of the PK. A standard auto-incrementing key could have potential collisions when the replication process takes place."
Second, SK's do not replace NK's, but are added to them. This preserves meaningful logical data access which SK's lack.

Third, to be valid, a SK must be assigned such that all descriptive attributes and other keys are functionally dependent on it, a principle not guaranteed to be followed in practice. In which case, by themselves, they cannot ensure entity integrity (there is nothing to prevent the insertion of a tuple more than once with distinct generated SK values). If the only difference between tuples is a SK, it is most likely misused.

Fourth, often SK's are deployed for the wrong reason. Keys--whether natural (NK) or surrogate (SK)--are a logical, not physical feature. But  SK's are often deployed for the wrong reasons, i.e., to improve performance (e.g., a numeric SK instead of a character NK, or a short SK instead of a long NK). It is often forgotten that  in SQL keys are implemented with indexes, so each SK  increases both physical storage and maintenance burden for the DBMS.

Fifth, surrogate PK's require joins that would be unnecessary with a natural PK ("You’ll never hear a user or developer ask you about record A78383A3-4AB1-42CF-B3FC-A4A23AD10398”). In other words, if a SK improves performance, it also imposes joins that complicate queries and may well defeat the purpose, as Ami Levin has argued [3].

 He points out that it is the familiarity of the NK that will determine the amount of reduced, (or unnecessary) joins. He warns some magical performance improvement myths that are commonly associated with SK's should, like all myths, not be taken for granted and be critically investigated. As demonstrated, in many cases it is the NK that wins in terms of performance, consistency, and maintenability.

[1] Understanding Keys,, forthcoming 7/31/16.
[2] The Key to Keys: A Matter of Identity, PRACTICAL DATABASE FOUNDATIONS paper #4.
[3] Levin, A., Microsoft SQL Server, Relational Databases and Primary Keys.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:


  1. Most complete yet pointed explanation I've read regarding NK, SK, and PK. FYI, I think the "Third" in "Third, surrogate PK's require joins" above should be "Fifth".

    1. Just checking if readers are awake :).

      Yup, thanks. See my today's post on keys.