Monday, August 15, 2016

Understanding Keys: Entity-Defining Properties and Entity Names

Here's what's wrong with last week's picture, namely:
"My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

"When a primary key is chosen that is not immutable?"
--Why should a primary key change, StackExchange.com
This is, a more detailed treatment of keys undertaken in:
[1] On Kinds of Keys: Natural, Primary and Surrogate Are Sufficient
[2] Surrogate Key Illusions
[3] Duplicates: Stating the Same Fact More Than Once Doesn't Make It Truer, Only Redundant
which I have just revised. It is the result of several discussions I had with David McGoveran as part of my familiarization with his formal interpretation of Codd's relational data model (RDM), which he'll introduce in his forthcoming book LOGIC FOR SERIOUS DATABASE FOLKS. My own forthcoming book--THE DBDEBUNK GUIDE TO FUNDAMENTAL DATA MANAGEMENT MISCONCEPTIONS--incorporates some of his thoughts in an informal form accessible to the thinking data professional. As you shall see, it diverges in some important ways from the current understanding of the RDM (well, whatever little understanding there is).


Entity-Defining Identifiers and Natural Keys

Recall that a database relation represents a set of (facts about) property-sharing entities in the real world. A property is an observable/measurable set of values obtained by a well-defined procedure (e.g., fingerprinting, weighing, checking documents). Entities are distinguishable by definition--otherwise we could not tell them apart, or even count. They have an entity-defining identifier (EDI)--a set of inherent defining properties by which we recognize them--whether explicitly or implicitly--that uniquely identifies them and distinguishes each from the others (e.g., fingerprints for people, ownership and a set of books for companies and so on).

Natural keys (NK) are relation attributes at the logical level that represent EDI's at the conceptual level.

"If there exists in the real world a naturally occurring set of entity-defining properties that
  • Distinguishes each entity from the others and uniquely identifies them (i.e., the properties have unique values in combination);
  • Is immutable (i.e., can be used repeatedly to identify an entity for the intended life of the conceptual model in the context of the universe of discourse)
that property set is represented by a natural key (NK) in the database." --D. McGoveran
NK's are so labeled because they represents inherent entity properties. Immutability is an assumption based on belief (from evidence, not whim) about the real world--(1) that the EDI properties will be stable during the intended life of the conceptual and corresponding logical model and, therefore, the database and applications that use it and (2) there is no evidence to the contrary--belief which is incorporated in the conceptual model. We can be wrong and if we are, there is no way around database redesign (see below). Otherwise put, NK's change only if and when the entities themselves change.

Because they represent real world properties, NK's are used for meaningful logical data access.

Primary Keys

Codd mandated a primary key (PK) for each relation as the designated identifier of entities and--where appropriate--target for references by foreign keys (FK) of other relations. When there is more than one candidate key (CK), one would be selected as PK, or a PK would be created.
The criteria for choosing a PK are both pragmatic--simplicity,familiarity/naturalness (the DBMS has to do a lot more work and use a lot more storage to translate among multiple CK's)--and formal-- stability and uniqueness. 


Codd's seems to have thought of PK's as what he later called "permanent surrogates for entities", but he failed to make clear the formal motivation for them.

Every base tuple corresponds to some entity (possibly abstract) that is identifiable in the real world. In set theory, entity identity is determined by testing for properties and their values (observation/measurements in the real world). A unique set of properties having specific values determines a specific entity--the defining properties for that type of entity.

To make simple set theory (and the RDM) applicable to the real world, we assume some properties identify an entity permanently while others (e.g., location) change over time. This is the only basis for the claim that an entity observed at one time is the same as an entity observed at another time. Without that, simple set theory and the RDM--which relies on it as its foundation--simply cannot be applied to the real world, nor can first order predicate logic (FOPL), as it requires a predicate variable over which predicates range and in FOPL as it appears in RDM that is a tuple variable. Modeling is not possible without permanent identifiers.
Codd did recognize that users might have to change the values of "user controlled keys" (e.g., companies merging) and his own relational data language ALPHA allowed such updates via delete and insert. In the 1979 paper he discussed this issue and introduced system assigned surrogate keys (SSK) that, unlike those promoted in the industry
  • would not change in any way visible to the user;
  • if changed by the system, "tuple identity" (i.e., correspondence to some unique entity in the "real world") would be preserved.
He also introduced a "coalesce" operator to handle the case of two entities turning out to be one. --David McGoveran

Entity Names and Surrogate Keys

Consider SSN's. They were introduced as a simple numeric name to stand in for a complex set of properties that the SSA used to classify US citizens and alien residents as eligible to earn income in the US (i.e., it had some well-defined procedure by which it associated a SSN value with a set of distinct value of the EDI consisting of those properties). Although in time the SSN became a recognizable identifier, it's not the SSN that defines US eligible citizens and residents--the set of properties for which it stands do--it is just a set of simple artificial entity names that substitutes for them. Very often when we encounter multiple CK's for a relation, it is because the entity type it represents has multiple names. Such assigned names are represented by surrogate keys (SK) in the database.
"A SK assigns an entirely artificial name or label to every entity, such that
  • There is a well-defined procedure that ensures a 1:1 relationship between SK values and NK values (however abstract, complex, or difficult to measure/observe the NK values may be);
  • It is defined such that all attributes representing descriptive entity properties are functionally dependent on it.
There are three cases in which a SK might be used:
  • As a substitute for a (possibly complex) NK;
  • As a substitute for other, pre-existing names for the entity;
  • As a substitute for an unspecified NK (e.g., when we distinguish entities without knowing exactly the properties we use implicitly for that purpose).
In all these cases, once we identify a SK as the PK, it is just a synonym for any other entity name.

If the values of the defining properties represented by the presumed NK change within the intended life of the conceptual model, but still serve to uniquely identify the entity in the real world, the "changing NK" fails to capture this identification. If a SK is added to the relation and the properties represented by the presumed NK are treated as ordinary attributes, the SK enables tracking the relationship between the old and new NK property values, while recognizing that only one entity is being identified. NK's simply cannot suffice for this situation.

If entities themselves mutate (e.g., companies involved in mergers, acquisitions, spinoffs), NK's (which represent inherent entity defining properties) are almost certain to fail. In this case, the conceptual model does no longer represent reality faithfully and the database needs to be redesigned. That said, SK's can track how entities are related over time (entities combine to form a new entity, an entity divides into multiple entities, an entity is created or destroyed, etc).

If an EDI cannot be identified in the real world (e.g., clouds, sub-atomic particles, anonymous poems), the assignment of SK values to entities is inherently arbitrary and they are then just as good as the ability to somehow recognize individual entities implicitly within a class systematically." --D. McGoveran
Note: SK's are often deployed for the wrong reasons (e.g., to emulate object-id's, or  to improve performance). In the former case they serve no function. In the latter case logical design is contaminated with physical implementation considerations--a violation of physical independence (PI)--and can actually defeat the purpose. SQL DBMS’'s implement keys by indexing and each SK adds not only data, but also an index that must be managed. Arbitrary SK's cannot be used for meaningful logical data access and if they are visible to users, the need to translate to and from a meaningful/familiar key imposes joins that would otherwise be unnecessary. Query formulation is more complex and DBMS optimization is inhibited. Ami Levin warns that "magical performance improvement myths that are commonly associated with SK's should, like all myths, not be taken for granted and be critically investigated. In many cases it is the NK--if a simple and stable one can be identified--that wins in terms of performance, consistency, and maintenability" [2].

Thus:

NK's
  • represent EDI's;
  • are used for meaningful logical data access.
PK's serve
  • as designated entity identifiers;
  • referential targets.
SK's
  • serve as simple and stable PK's;
  • that help track changes when immutability assumptions fail, or entities mutate.



Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

6 comments:

  1. The bottom-line is, You can change values of keys as often as you want provided you can update all the places these keys are referenced.

    ReplyDelete
    Replies
    1. Looks like you did not understand one word from what I've written.

      Delete
  2. Entities are distinguishable by definition ... They have an entity-defining identifier (EDI)

    Is it always possible to distinguish an entity by inherent properties? Or at least properties that amount to a Natural Key? (Even if it's not practical to actually capture something like a DNA profile or fingerprints.)

    Take a humble tin of beans -- or rather a batch of 50,000 tins. What inherently distinguishes it from yesterday's batch or tomorrow's? We put the tins into cartons, cartons on to pallets, pallets on to trucks. What inherently distinguishes between pallets? What distinguishes the two pallets going to Customer 1 vs the three pallets going to Customer 2?

    And yet we do need a key to identify each shipment. That key is generated by the system at the time of pick/pack/ship. There is nothing about it inherent in the tins of beans. (A particular shipment might include tins from different batches, for example.)

    The entity (a shipment) comes into existence by the operations of the enterprise. And that is all that distinguishes it.

    ReplyDelete
  3. Usually, but not always. In many situations we distinguish entities by some implicit natural properties that we may not be able to spell out. In such cases we assign entities names that are shorthands for the identifiers and represented by SK's.
    However, the SK's must be functionally dependent on the presumed natural identifier, or the identification of entities is unreliable.

    In my fortcoming book I discuss specifically such a case--what Date refers to as "descriptively equivalent entities" (DEE)--but the principle is as above. See also my post: Stating the same fact twice does not make it truer, only redundant.

    Actually, there may well be inherent tin properties such as marks, scratches, stains and so on, but their representation is not cost-effective. Generally we assign names to entities that are meaningful to track. Tins are not, which is why enterprises deal with LOTS for shipping purposes.

    ReplyDelete
    Replies
    1. I'm interested what you think is the inherent property for the entity identified by product number P27 'Tin of beans'? P27 doesn't denote a specific tin of beans in some depot, so it's not about specific DNA or scratches/stains. It doesn't just denote (somehow) all tins in stock; it must denote all tins the enterprise has ever bought/sold/manufactured; and all tins in future.

      Over P27's product lifetime the enterprise might change the recipe, change the packaging, change the material used for the tin, etc, etc.

      So what is 'P27' a surrogate for? Note it's distinct from P29 'Tin of beans Old Fashioned recipe'.

      Delete
  4. Re-read more carefully my reply, part. the last paragraph--it answers your question. I also referred you to another post that deals with it specifically.

    You must distinguish the TYPE of entity you want to track: is it specific tins, or the product of type tin.

    ReplyDelete