Sunday, April 22, 2018

A New Understanding of Keys Part 2: Kinds of Keys




Note: This the second of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades. 


(Continued from Part 1)
"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."

"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."
--Martijn Evers,dm-unseen.blogspot.com
I've deplored the misuse and abuse of terminology due a general lack of foundation knowledge in the industry [3] for longer than I care to remember, and keys are not an exception. If "the discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys", then there is no understanding of relational keys whatsoever: whatever it is, a data structure that contains NULLs is not a relation, one reason for which SQL tables are not relations, SQL databases are not relational and SQL DBMSs are not RDBMSs (for a relational solution to missing data without NULLs see[4]).

We sure can disambiguate, but the key (pun intended) to keys is that they are a relational feature and, thus, can only be properly understood within the dual theoretical foundation of the RDM, which is an adaptation and application of simple set theory (SST) expressible in first order predicate logic (FOPL) to database management. Thus, their "nomenclature on properties and behavior" should reflect what from the real world they represent, and what function they fulfill in the RDM. Which is precisely what the industry disregards.



------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 

NEW PUBLICATIONS


HOUSEKEEPING

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

@THE POSTWEST

  • The Dystopia of Western Decadence, the Only Acceptable Racism, and the Myth of a “Palestinian nation”
------------------------------------------------------------------------------------------------------------------ 

Primary Keys


As we've seen in Part 1, conventional wisdom notwithstanding, it is the primary key (PK), not candidate keys (CK), that has primacy within the RDM, and must satisfy several formal requirements of the theory[2]. There are fewer other kinds of relationally significant keys than those floated in the industry.



Candidate and Alternate Keys


Codd recognized that there may be multiple sets of attributes that uniquely identify tuples in a relation, known now as candidate keys (CK). They serve as a pool from which one is formally designated as PK.

Alternate keys (AK) are the CKs not designated PK, but are enforced together with the PK via key constraints on relations. There must be at least one representing an entity identifier in the real world, with which the PK has a 1:1 value relationship [2].

As far as we know, the CK and AK terminology does not appear in any of Codd's published writings.  



Composite Keys


In the original RDM papers, keys were essentially domains[5]. Only later did he introduce attributes, and we now understand that (1) an attribute is, formally, a domain representation, and that there are domain, as well as attribute constraints[6]. Notice that this distinction is essential to understanding a PK: it is the appearance of a domain as an attribute in a relation that has PK characteristics and, therefore, is subject to the PK constraint, not the domain on which the attribute is defined!
 

Today refer to a combination of attributes (that are defined on domains) as a composite key.  Composite keys have relational significance due to the formal requirements of (1) PK irreducibility[2] and (1) 5NF relations. Based on a careful analysis of Codd's work, we now contend that, conventional wisdom notwithstanding, a relation is not in just First Normal Form (1NF), but in Fifth Normal Form (5NF) by definition[7].


Natural and Surrogate Keys


Natural (NK) and surrogate keys (SK) are industry, not relational terms, the use of which has changed over time. As far as we know, the term SK was not used until Codd referred to "permanent surrogates for entities" in 1979[8].

Currently data professionals consider any name or combination of properties used to identify entities in the real world (that pre-exist the database) as "natural", hence the CKs representing them are NKs (e.g., SSN and fingerprints are both NKs). SK is reserved for PKs generated expressly for the database, the application, or the system. Pre-existing name CKs may be more familiar to users than generated SKs, but usually neither can be used for meaningful logical database access, which is why property AKs are also enforced.

Were we to adhere to the NK vs. SK classification, a more sensible view would be:

  • A NK is a CK that arises from the conceptual model of the enterprise and represents a name or a property, either of which may be composite;
  • A SK is artificially created as a substitute for an NK (possibly where no NK is known) and generated specifically for the application, for the business/enterprise, or in the common context and created by some other "organization" or group (e.g., SSN);

However, the classification of keys as NKs vs. SKs is, by itself, imprecise and does not relate in a useful way to either modeling (conceptual, logical, or physical), or the formal aspects of the RDM. Unfortunately, it is entrenched and hard to redefine. So we propose an alternative classification[2] that corresponds to a more formal SK concept (and in line with Codd's intent) than the current one in use.

To be the designated PK, a pre-existing CK or a generated SK must meet all the formal PK requirements[2]. However, a SK must also be managed entirely and transparently by the DBMS, as proposed by Codd. This means that a pseudo-SK created and managed by application code is, at best, a workaround for DBMS deficiencies and, at worst, confusion about the intended purpose of a SK.

Note: Global identifiers (i.e., system-wide, rather than just database-wide), system created and managed could be proper SKs, but iff the DBMS is properly integrated with the system facility that manages them.

This is about it. The other "keys" listed by Martijn disregard the RDM, or are used in explicitly in non-relational contexts. For example, an "intelligent key" is an "overloaded" attribute the values of which are concatenations of multiple property values, those properties often not having been made explicit in the conceptual model (e.g., prior to about 1990 SSN had meaningful components). Because it is a single attribute in the logical model, but can be and is parsed into components that have independent meaning and -- most important -- applications/users depend on knowledge of those parts, it violates value atomicity [9] and 1NF [10]. What is more, that information is embedded within the structure of the attribute, not represented as simple key values as required by the core Information Principle (IP) of the RDM [11] -- hardly intelligent practice [12]. 



(Continued in Part 3)


References


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Key to Relational Keys - A New Understanding.

[3] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER

[4] Pascal, F., The Last NULL in the Coffin: A Relational Solution to Missing Data.

[5] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970).

[6] Pascal, F.,
The Interpretation and Representation of Database Relations.

[7] Pascal, F., What Relations Really Are and Why They Are Important.

[8]
Codd, E. F., Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst. 4(4): 397-434 (1979).

[9] Pascal, F., Simple Domains and Value Atomicity. 

[10] Pascal, F., First Normal Form in Theory and Practice Part 1, 2, 3.

[11] Interpreting Codd: The 12 Rules.

[12] Pascal, F., Outsmarting the DBMS.



Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.



8 comments:

  1. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. Well, no. It's the entity's inherent properties in the real world that are immutable, a NK only represents them in the database. Addressing--be it physical or logical--is a separate issue.

      The adjective "natural" distinguishes between NK's that represent inherent and (non-natural) keys that represent assigned properties. I don't think Codd made this specific distinction--I suspect he used NK for keys representing both assigned or inherent properties and surrogate keys for those generated at database design time as a substitute.

      Delete
  2. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. I have no idea what your point is, probably because succinctness is not your forte.

      The immutability David and I refer to is of properties in the real world. A NK is just a database representation of it in the database.
      It is purely logical and has nothing to do with implementation--that's exactly what physical independence means.

      The physical implementation does have to preserve the properties of the logical design--which is why the former is derived from the latter--but that's a different matter.

      Delete
  3. What would be lost in considering any natural as a point in time more familiar surrogate key ? Would not that simplify definitions ?

    ReplyDelete
    Replies
    1. I don't understand your question.

      Delete
    2. You defined a natural key as an application specific surrogate key. Therefore, if application specific I am questionning the need to include such distinction as part of RDM. Is not a natural key a surrogate key that is familiar and becomes universally accepted.

      Delete
    3. The idea is "externally assigned", which includes app-specific.

      Delete