Sunday, January 7, 2018

Understanding Relational Keys - A New Perspective: Primary Keys

Note: Rewritten 1/1/18. This is one of three re-writes of older posts on keys to bring them in line with the McGoveran formalization and interpretation [1] of Codd's true RDM (see second [2] and third [3]). They are abbreviated extracts of the forthcoming rewrite of [4], which proposes a new perspective on relational keys in depth and more detail, distinct from the conventional wisdom of the last five decades. Re-reads are strongly recommended.

Revised: 1/14/18.

"A key is a column or columns that together have no duplicate values across rows. Also the columns must be irreducibly unique, meaning no subset of the columns has this uniqueness ... In most databases primary keys have survived as a vestige, and nowadays merely provide some conveniences rather than reflecting or determining physical layout. For instance declaring a primary key includes a NOT NULL constraint automatically, and defines the default foreign key target in a PostgreSQL table. Primary keys also give a hint that their columns are preferred for joins." --Joe Nelson, SQL Keys in Depth, begriffs.com

Q: "My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

A: "When a primary key is chosen that is not immutable?"
--Why should a primary key change, StackExchange.com

There is a general and persistent lack of foundation knowledge in the industry [5] and keys are not an exception. 70% of searches hitting this site are about keys, just one indication that this fundamental relational feature is poorly understood decades after the RDM. "wading through sixty-four articles, skimming sections in five books, and asking questions on IRC and StackOverflow" to "put the pieces together"? And then what he put together is conventional wisdom (besides, SQL is the last source to go to for really understanding anything in depth, let alone relational features [6].

Keys can only be understood within the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL) adapted and applied to database management -- which SQL authors never understood -- that is disregarded in the industry. For a proper understanding of keys, read my trilogy of posts, [4] for the in-depth treatment,  then compare to Nelson's take and your SQL DBMS support of keys.


Relation Interpretation: Object Identification, Properties and Names


Note: In what follows, object is used in the entity, not OO, sense.



Objects in the real world are uniquely identified within a group by either a combination of one or more properties, or by an assigned name.

A property is a set of observable/measurable values (obtained by a well-defined method). Properties describe objects and can be:

  • Intrinsic (i.e., inherent to objects) and either
- immutable (e.g., fingerprints) or
- dynamic (e.g., age is intrinsic, but its values change);
  • Contextual (e.g., an employee's job assignment);
All objects have a combination of intrinsic immutable properties that define them (i.e., constitute their identity).

A name is often assigned to objects as a simpler object identifier, but unlike a property (part color in Codd's original example), a name (part number) identifies or designates objects, but does not describe them. Although names can be (and in the industry are) thought of as properties, they are not observable/measurable and are inherently non-intrinsic: their values can change without changing the referent objects.

Since names, whether pre-existing, or created expressly for the database, are non-intrinsic, a name uniquely identifies objects iff there is a 1:1 relationship between its values and those of the combination of defining  (i.e., intrinsic immutable) properties (i.e., there is a well defined procedure that associates each name value with exactly one value of such a property combination).


Primary Keys


The primary key (PK) is one of Codd's several adaptations of a relation -- a mathematical abstraction devoid of any  meaning -- that were necessary to make it applicable to database management, namely, assign it real world meaning as representing an object group, with tuples representing (facts about) object members and attributes object properties or names [7]. While in SST the tuples of a relation are unique combinations of all their values by mathematical definition, to represent unique real world objects, a database relation must be constrained such that the set of attributes that represents the corresponding properties or names have unique values.

The original Codd RDM papers (1969-70) used the PARTS example for the definition of a PK which, expressed in today's and more precise language, is "a minimal set of relation attributes with unique values" [7]. But, as is usually the case, Codd should not be read casually: knowledge of SST and FOPL are required to understand the RDM he envisioned.

Aside from uniqueness and minimality, another formal PK property that Codd did not mention (probably because, as was his wont as a mathematician, he deemed obvious),  is stability:
"We assume that in the real world a combination of any one or more properties or substitute names identifies objects permanently and distinguishes each from the others. This assumption is the only basis for the claim that an object observed at one time is the same as an object observed at another time. Without this assumption SST simply cannot be applied to the real world ... conceptual modeling of reality cannot be formally represented in relational databases without permanent object identification." --David McGoveran

Formal Primary Key Mandate


Codd gave no formal motivation for the PK mandate, which focused attention on the "arbitrary" PK selection and, leading Date to question it: in the absence of a theoretical basis, a PK is a good idea practically, but it's the candidate keys (CK) -- a term he introduced -- that are essential [8]. This is much responsible for "Are PKs mandatory or optional?" becoming one of the most frequent questions in database practice.

Both users and the DBMS do a lot more work and use a lot more storage to translate among multiple CKs, so a PK is indeed practical. But the PK mandate is distinct from the "arbitrary" PK selection -- it is a formal FOPL requirement [4]. In other words, PKs are mandatory for a theoretical reason.

And that is not all. While within SST and the relational algebra (RA) a combination of any attributes that uniquely identifies tuples will do as (and any such combination will act analogously to) a PK, this is not the case within FOPL. There is a hint to that effect in Codd's "part number would be a primary key, while part color would not be", the significance of which has been missed. It is not just uniqueness: Codd seems to have thought of PKs as what he later called "permanent surrogates for entities", implying that PKs should represent names, not properties, which also turns out to be a formal FOPL requirement [4]. He recognized the possibility of changes in user controlled PK values, but identified problems with such keys and proposed name PKs generated expressly for the database and managed by the DBMS transparently (i.e., neither values, nor changes thereof visible to users) which, if they are changed by the DBMS, object integrity (i.e., tuple correspondence to some unique real world object) is preserved. He also introduced a coalesce operator for when two tuples turn out to represent one object
[9].

Primary Keys of 5NF Relations


According to conventional wisdom a database relation is at least in First Normal Form (1NF), and preferably in Fifth Normal Form (5NF) if certain drawbacks are to be avoided [10]. But for reasons we explain elsewhere [11], we now contend that relations are in 5NF by definition, otherwise they are not relations. A relation is in 5NF if (1) it is in 1NF and (2) the only dependencies that hold in it are functional dependencies of all non-key attributes on the PK -- a formal requirement that must hold for every PK [10].

Note: A relation is in 1NF if it is defined on simple domains with atomic values [12].

Primary Key Defined


We are now in a position to define a PK formally.

A PK is a combination of one or more relation attributes that:

  • Represents an object name;
  • Has unique values;
  • Is minimal;
  • Is stable;
  • Is user designated and, if generated, is DBMS managed;
  • All non-key attributes are functionally dependent on;
(see [4] for more details).
"Uniqueness, minimality and stability are formal properties of PKs. Names have an inherent scope in time for any particular context and after some period of time, one or more PK values may change. When stability is lost, uniqueness and minimality must be maintained at the cost of some maintenance of the PK values and possibly redesign to re-establish stability for a further period of time. All that matters is that (1) we select a PK that is as stable as possible and (2) ensure object integrity, namely, that every tuple is uniquely identifiable and represents the same object for the life of the database." --David McGoveran

Primary Key Selection


Only a name CK that meets all the formal PK requirements should be designated as PK. Properties, even if they meet all the formal PK requirements, must not be.
  • If no name CK pre-exists the database, one is generated and managed by the DBMS transparently, as proposed by Codd;
  • If multiple name CKs pre-exist and all satisfy the formal PK requirements, while there is no theoretical rule or algorithm for PK selection, neither is the choice purely arbitrary. There are two pragmatic considerations:
- Familiarity;
- Simplicity;
Name CKs are simpler, but often unfamiliar (particularly if generated) and do not provide meaningful logical database access. This is why property CKs are also enforced jointly with the PK.
"All the defining properties are required for unique object identification -- that is precisely the sense in which, taken together, they are defining the objects. In practice, we often know some properties of objects, but not necessarily all the defining ones, or even the sufficient combination for unique identification. Nor do we always know if a pre-existing name has a 1:1 relationship with such a combination.We know those defining properties exist, because we somehow manage to identify the objects as group members and to distinguish individual members, but perhaps we have not analyzed or observed the object group sufficiently to determine those defining properties. We might still have a 1:1 relationship between a name PK and some set of attributes representing a subset of known defining and or contextual properties,  but that relationship is confined to the logical model and, by its nature, violates database consistency with the conceptual model of reality." --David McGoveran
Did you know that? Compare to Nelson's take and your SQL DBMSs PK support.

We can now answer the questions at the beginning of this post.

A PK is immutable iff:

  • Its values have a 1:1 relationship with those of a CK that represents the defining object properties; and
  • Its implementation preserves immutability;


References

[1] McGoveran, D., LOGIC FOR DATABASE FOLK, forthcoming.


[2] Pascal, F., Understanding Relational Keys - A New Perspective: Kinds of Keys, forthcoming.

[3] Pascal, F., Understanding Relational Keys - A New Perspective: Surrogate Key Illusions, forthcoming.
 
[4] Pascal, F., The Key to Relational Keys:  A New Perspective, Practical Database Foundations #4.v3, forthcoming.


[5] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER.
 

[6] Pascal, F., To Really Understand Integrity, Don't Start with SQL.

[7] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Communications of the ACM Volume 13 Issue 6, June 1970, Pages 377-387. 

[8] Date, C. J., The Primacy of Primary Keys in RELATIONAL DATABASE WRITINGS 1991-1994 (Addison Wesley, 1995).

[9] Codd, E. F., Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst. 4(4): 397-434 (1979).

[10] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.

[11] Pascal, F., What Relations Really Are and Why They Are Important.

[12] Pascal, F., The First Normal Form in Theory and Practice Parts 1,2,3.









Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment