Sunday, October 20, 2013

SQL Server, Relational Databases and Primary Keys

I attended a presentation so named by Ami Levin hosted by the local SQL Server user group for three reasons: as the author of A Matter of Identity: Keys--a chapter in my book PRACTICAL CONCEPTS IN DATABASE MANAGMENT--and numerous articles on the subject, including on this site; interest in presentations on data fundamentals, let alone relational ones, that are extremely rare and usually flawed; and curiosity how fundamentals are presented by product specialists, particularly when the state of foundation knowledge is poor. What follows are my observations, thoughts and quibbles, in the order in which they came up in the presentation.

An attendee defined a key as "field/combination of fields to select a record". A more enlightening one would be
'a column or combination of columns of an R-table that represents formally in the database an identity attribute or collection thereof--the identifier of an entity in the real world represented in the database by a row'.
Rows are uniquely identified by key values. This is the core criterion for selecting a candidate key (CK), whether it becomes the primary key (PK), or not. A key is a logical feature--issues such as performance and portability implementation considerations irrelevant in logical design (there are three other key selection criteria, see my forthcoming paper The Key to Keys).

Levin referred to a disagreement between Codd and Date as to "what a primary key is". That is inaccurate: the disagreement is on whether, given multiple CK's, selecting one as a PK is mandatory. Keys are an adaptation of abstract relation (set) theory, to make it applicable to database management (see Business Modeling for Database Design paper). PK selection is, therefore, pragmatic, not theoretical, hence Date's reluctance to mandate it, as Codd seemed to imply. But this  may be a distinction without a difference: for all practical purposes the term "mandatory" in this context means 'beneficial if adhered to, problematic if flouted' and has to do primarily with referential constraints: while there may be rare instances where such constraints refer to multiple CK's, it is generally simpler and more practical to restrict to one PK (all else being equal, this is where implementation considerationscan be brought to bear).

The claim that "normalization eliminates non-simple domains" is a leftover from the initial days of the RM, when Codd was concerned that relation-valued domains (RVD) would require second-order predicate logic. Normalization was the process by which the R-tables "nested" as column values were "unnested", leaving all R-tables with only simple, single-valued domains (SVD)--in their normal form (1NF). Subsequently it transpired that
  • As long as an R-table has a single value in every cell, that value can be anything, including an R-table and it is compatible with first-order predicate logic;
  • R-tables in normal form, or normalized, can still have certain drawbacks, and can be repaired by further normalization (2NF-5NF).
So today we say that an R-table is by definition in 1NF or normalized; and fully normalized, or in 5NF if all drawbacks were prevented by design, or eliminated by further normalization. The RM requires only the minimal 1NF. 

Surrogate keys (SK) are not just a matter of personal preference. In my above mentioned  book I specify four criteria for key selection (as well as define the various types of key). Because they are pragmatic, not theoretical, the criteria can conflict, which in some cases requires tradeoffs. In some cases SK's are advised, but they should not be deployed universally throughout the database and extreme care should be exercised if the reason for a SK is lack of a natural key (NK), as that may indicate poor business modeling. Log tables contain metadata, not data, so SK's e.g., datetimestamps, are unavoidable, but that does not mean they have no meaning whatsoever.

I reiterate that performance is an implementation, not logical design issue. If there are multiple natural CK's, all equivalent with respect to the PK selection criteria, then there is no problem in choosing a better performing PK for referential integrity purposes, but other than that choices should be made on business and logical grounds (as Levin correctly pointed out, NK's may actually be simpler or "shorter" than SK's). That SQL DBMS's implement keys by indexing induces logical-physical confusion (LPC). In fact, there is an implementation technology that obviates the need for indexes altogether (see Go faster! The TransRelational™ Approach to DBMS Implementation). It is critical for educators and trainers to ensure that no LPC occurs and that neither RM or logical design have nothing to do with performance, which is determined at the physical level.

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment