Sunday, April 29, 2018

A New Understanding of Keys Part 3: Surrogate Key Illusions




Note: This the third of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades. 


(Continued from Part 2)
"When defining a surrogate primary key for a [SQL Server] table, two options are the most common: Integer and UniqueIdentifier (aka Globally Unique Identifiers, or GUID's) ... Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic. UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”. With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution."
--Jeffrey J. Keller, Vertabelo.com

As we explained in Parts 1 and 2, keys can be properly understood only within the RDM. We revealed a new perspective on keys, discussed relationally valid kinds of keys, and revised definitions of natural (NK) and surrogate keys (SK).

As we have seen, the formal PK mandate is distinct from PK selection, which may be pragmatic. A PK must represent a name -- either pre-assigned, or generated only when there is no simple name CK. Generated keys must ensure entity integrit and are managed by the DBMS transparently to users.

All this is absent from conventional wisdom and database practice, as the above example illustrates: generated SKs are overused for the wrong reasons, the most common being emulation of OIDs (a SK -- often database-wide and, so, unique across relationsn), followed by performance.

Note: While OIDs have unique values, they often also have some physical significance.



------------------------------------------------------------------------------------------------------------------
SUPPORT THIS SITE 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 

NEW PUBLICATIONS

HOUSEKEEPING

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
@THE POSTWEST

  • The Dystopia of Western Decadence, the Only Acceptable Racism, and the Myth of a “Palestinian nation”
------------------------------------------------------------------------------------------------------------------ 

The SK is considered without regard as to whether a suitable name CK exists and SQL DBMS do not manage SKs transparently. An integer type is recommended on grounds of "human-readibility" and "user-friendliness", which seems to be some combination of familiarity and simplicity in our parlance, but is actually neither:  they would be equally unfamiliar (i.e., would not provide meaningful logical database access), but integer is simpler and a better referential target. In other words, there is no formal or pragmatic database justification for ever considering the GUID SK (although there may be non-database considerations such as security).

Which brings us to "You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398".
 

Logical-physical Confusion


Relations, attributes and tuples are logical. A PK is a combination of one or more attributes representing a name that uniquely identifies tuples and, thus, is logical too[2], while performance is determined exclusively at the physical level, by implementation.

So generating SKs for performance reasons (see, for example, Natural versus Surrogate Keys: Performance and UsabilityPerformance of Surrogate Key vs Composite Keys)  is logical-physical confusion (LPC)[3]. Performance can be considered in PK choice only when there is no logical reason for choosing one key over another.

We do not mean that performance with a generated PK cannot be better than with a a pre-existing name CK, although such effects are much less common and significant than what proponents of universal SK generation believe. What is, however, key (pun intended) is not to confuse levels of representation[4]: if and when such gains occur, they are due to the specific DBMS and database implementations, not the PK choice per se. Otherwise put, if a generated SK is necessary for satisfactory performance, blame the DBMS and the physical implementation of the database for not performing well with the correct logical design, not -- as is common -- the RDM for imposing keys, neither of which has anything to do with performance[5].

Generating SKs for performance reasons may even defeat the intended purpose. For example, SQL DBMSs implement key constraints by indexing keys. Each SK adds not just data, but also the index, increasing both storage and DBMS management burden.  While there is nothing wrong with indexing to maximize performance, making a logical feature dependent on a specific physical feature violates physical independence (PI)[6] and induces LPC (e.g., many data professionals confuse indexes with keys).

It is because both users and the DBMS do a lot more work, and use a lot more storage to translate between SKs with values such as A78383A3-4AB1-42CF-B3FC-A4A23AD10398 and the alternate keys (AK) that provide meaningful logical database access, that DBMS managament of SK is desirable, but SQL DBMSs do not support it. This is why we keep reminding data professionals that while proper logical design in general, and PK selection in particular, are always advantageous, their full benefits accrue only with true RDBMSs. Had data professionals understood the RDM -- instead of confusing SQL DBMSs with RDBMSs -- they wouldn't have tolerated the former and might have had the latter by now.

Note: When data professionals refer to a "SK that provides meaningful access", it is a very good indication something is wrong.

"It is the familiarity of natural keys that will determine the amount of reduced, (or unnecessary) joins. The magical performance improvement myths that are commonly associated with SKs should, like all myths, not be taken for granted and be critically investigated. In many cases it is the former that wins in terms of performance, consistency, and maintenability."
--Ami Levin, Surrogate Key Illusions
provided that the NKs represent names, not properties.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Key to Relational Keys - A New Understanding.

[3] Pascal, F., Don't Mix Model with Implementation.

[4] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation.

[5] Pascal, F., "Denormalization for Performance": Don't Blame the Relational Model.

[6] Pascal, F., Physical Data Independence.


Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.




2 comments:

  1. Most complete yet pointed explanation I've read regarding NK, SK, and PK. FYI, I think the "Third" in "Third, surrogate PK's require joins" above should be "Fifth".

    ReplyDelete
    Replies
    1. Just checking if readers are awake :).

      Yup, thanks. See my today's post on keys.

      Delete

View My Stats