Monday, July 31, 2023

ON RELATIONAL KEYS (& DOMAINS) (t&n)



Note: "Then & Now" (t&n) is a new version of what used to be the "Oldies but Goodies" (OBG) series. To demonstrate the superiority of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, as well as the disregarded evolution/progress of RDM, I am re-visiting my old debunkings, bringing them up to the current state of knowledge. This will enable you to judge how well arguments have held up and realize the increasing gap between industry stagnation --  and scientific progress.

Then

ON KEYS (originally published September 2002)

“I am an application developer who is competent in SQL and have designed some small-scale schemas that have gotten the job done. I came to your website seeking enlightment on the issue of whether it is better to use strings to represent attributes (I believe these are called "speaking keys"), or to use integers and key them to a list of strings. I see that you are avidly pro-normalization, but I'm not even sure if this is a normalization issue. Here is a quote from a web site that captures my gut feeling. I hope you can comment or point me to an article that presents your view of the matter.

Many popular database tools make it very easy to create some sort of arbitrary numeric sequence for your primary keys. Be careful not to use these to excess. If you have are designing a music catalog system, it is definitely not useful to create a lot of arbitrary codes for music classification. There is unlikely to be much benefit in designating "1" to stand in for "Jazz", 2 for "Rock", 3 for "Classical" and 4 for "Latin Jazz" in an application. Why not just use the word "Jazz" for "Jazz" and use "Rock" to stand for "Rock"? It is certainly easier to remember. It also means that you won't have to lookup the value of 1 to see that it stands for "Jazz".”

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

05/28 INTENSION, EXTENSION AND R-TABLES (t&n)

05/08 ON PROPERTIES & CHEN'S E/RM (rm)

04/30 RELATIONSHIPS AND THE RDM V2 Part 3: SEMANTIC CONSTRAINTS

UPDATES

04/23 Added The Story of Mathematical Proof to LINKS page

04/03 Added First OrderLogic to LINKS page

04/03 Added Mathematical Logic - Reasoning in First Order Logic to LINKS page

03/26 Added Modeling of Integrity Constraints Dependencies to LINKS page

03/14 Added Russell’s On Denoting to LINKS page

03/14 Added Russell’s Paradox to LINKS page.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)

08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I re-published only the 2017 columns @dbdebunk, and within them links to sources external to AllAnalytics may or may not work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
------------------------------------------------------------------------------------------------------------------

It is not possible to be a competent database professional without foundation knowledge, which cannot be reliably acquired online each time you have a specific question, but rather via formal education, after whch if you are unclear on details, you can try to have them clarified online. There are no shortcuts and beware of anybody who promises otherwise.

Your question is about those fundamentals that you need to know prior to involving yourself in database practice. I suspect that this is not the only foundation knowledge you lack. I suggest the sources on our BOOKS and PAERS page, both of which have information about keys. [In my paper The Key to Relational Keys] I state that PK selection is a logical endeavor, which should not be contaminated by physical implementation considerations such as performance.
“OK, Debunkers, I ordered PRACTICAL ISSUES. I read it cover to cover. There was no discussion at all on the subject of speaking keys, although I could see that most of the keys in the examples were speaking keys (e.g. in many examples, "SUPP" in the "Activity" domain means "support"). There was also a "note" saying that some people like populating their tables with synthetic keys at every opportunity, but this isn't necessary, nor is it always recommended. I would have liked some more discussion of this issue.

So I ask my question again with some background information. Feel free to use it for your quote of the week.

I am an application developer who had developed an application and a database schema to go with it, and was handing the schema off to a DBA so he could complete the work. I originally had an "action" domain that was defined as the varchar values "add", "delete", and "move". He changed the schema to have "action_id" columns, taking system-generated integer values, which referred to an action lookup table with action_id and name columns. He did the same thing for other domains, ripping out varchars and replacing them with integers referenced to lookup tables. He said that this would ensure "flexibility" -- what if I wanted to change one of the names?

I then and especially now (after having read PRACTICAL ISSUES) believe that what he did had nothing to do with normalization and instead was an empty exercise in indirection and obfuscation. What's more, due to the lack of true domains in our DBMS, it makes it possible for a programmer to put an 'action_id' value in, say, a 'source_application_id' column, since they're both integers. To make matters worse, this DBA eschews views, which makes constructing queries very cumbersome. He suggested defining mnemonic constants in my code that would correspond to the appropriate integer values, but it seemed to me that that was just more work for me, and the ensuing "flexibility" of the system left more opportunities for error. I agreed to the changes mostly because it was going to be his schema to work with.

Shortly after I rewrote my application to accommodate the changes, the DBA left us for greener pastures, and I'm stuck with this (in my opinion) ugly schema. My first impulse was to revert the schema to the old design, but I wanted to check with people with more experience in these areas, to see if there was any advice on this common, non-vendor-specific issue.

So my question remains: was I right to use sets of mnemonic character strings for these small, stable domains?  Is it only a matter of taste? What are the practical considerations when making such a decision? I could not find any discussion of this matter on the Database Debunking website or in the PRACTICAL ISSUES book.

I understand you are busy, professional consultants, and don't really expect a personal reply. I'm just suggesting that you address this "practical issue" somewhere on your website, or in a future edition of the book.”
Most of what is needed to understand about keys is in the chapter in my book. If it does not address your question, it must mean that the issue is not a [formal relational one, but rather a pragmatic one, specific to the particular database].

In Chapter 1 state that in RDM every [database relation has a PK], but RDM has nothing to say about key selection -- that is a database-specific issue, which only those familiar with it (in this case, you) can decide. PK choice is made on pragmatic grounds, using four criteria -- familiarity, irreducibility, simplicity and stability -- which may have to be traded off, depending on how the database is used. I'm afraid there is nothing that can be added to that, it is up to you to decide which tradeoff is optimal in your particular case.
“"Because the type definition does not place any restriction on type representations, a data model worthy of its name the relational model, this is achieved by the use of DOMAINS, which are nothing but data types of arbitrary complexity." -- popular book on database management.

Comment: If a domain is "nothing but" a data type of arbitrary complexity, then it is nothing but a data type, period. Nothing but fuzzy thinking is "achieved" by using two different terms to mean the exact same thing.”
You got it somewhat backwards. The industry has been forever corrupting relational terminology. It was in order to avoid ensuing confusion with programming concepts that Codd used database domains to differentiate from application programming types. Yes, it would have been nice if he didn't have to do that, but you should place the blame where it really belongs -- on the industry, not on the RDM.

It’s always a good idea to acquire some background on subjects before you comment on them.

Now

 writings on the subject sKeys, like all fundamentals, are as poorly understood now as they were then (see, for example, my comments in this recent LinkedIn exchange: Concatenated Business Keys VS Hash Keys? Or consider the following:

“A database table must have a primary key for the DBMS to insert, update, restore, or delete data from a database table. A DBMS uses primary keys that are defined to the database. However, you can also define primary keys to supplement those in the database. A primary key is needed [in] any table that is visited more than once in a process, for example, a child table that has two or more parent tables...”    --IBM.com

I refer the reader to my publications and posts, particularly the paper on keys.

In a conceptual model, each group of entities of a single type is a named collection of a set of defining (i.e., required) first order properties (1OP). Each entity (instance of the type) is a collection of 1OP values that has a unique identifier -- an assigned name that can serve as a stand-in for the entity. Then in the logical model, for such an entity type, the value of each required attribute (corresponding to one of the defining 1OPs) depends functionally on that name. Put differently, the name uniquely determines each of the values of the defining 1OP. Symbolically, if each attribute name (e.g., COLOR) is the name of a function, then

COLOR(ID) = color_value (e.g., blue)

In standard dependency notation

COLOR -> ID

(which means the value of ID implies the value of COLOR).

Broadly speaking, there are two categories of entity type, which we will call here context-free entities (CFE) and context-dependent entities (CDE).

The more familiar or elementary notion is of entities defined as independent of their context. Most classical or traditional objects are CFEs. They can be distinguished from each other solely by the values of their defining 1OPs. As a consequence, the values of the defining 1OPs for each CFE must differ for at least one of those defining 1OPs. In fact, a unique combination of defining 1OP values can serve as a  unique identifier for each entity.

Some entities are distinguishable from others of the same type only by the context in which they appear. That is, the defining 1OPs do not suffice to distinguish them from each other. In this case we need higher ordered entities to describe the relationship of an entity to its context and use those to distinguish entities of the type (in  science, logic and philosophy such entities are often called "clones" of each other). For example, every electron has identical 1OP values and is distinguishable from other electrons only by its relationships to spacetime (location as a function of time, spin, momentum, energy); or, more mundanely, consider every dime in a pile of dimes -- they can be distinguished from each other only by their distinct location on the pile). With CDEs, the name thus uniquely identifies not only the entity's 1OPs, but one or more higher order properties (i.e., relationships to context) as well.

If there is no well-defined CFE type -- that is,  all the defining 1OPs for the entity type have not carefully been identified -- we can cheat (or defer that task) by assigning surrogate identifiers. Likewise, for CDE types, surrogate identifiers can be used if when all defining relationships to other entities are nailed down. When compelled to use surrogates identifiers, we are committing logically to the existence of a defining collection of properties without specifying them. It's a "logical debt" which must be considered when reasoning about the entities of the type. We can then use the arbitrary names to talk about two distinct individuals without explicitly identifying how those people are to be distinguished, but we must never assume they are distinct in one part of an argument (query driven analysis), but the same in another. To consider both possibilities, we must keep the "cases" (different vs same) carefully separated. A common example of this kind of confusion is informal, sloppy use of pronouns -- using "it" for multiple entities in the same sentence, conversation, etc. without carefully distinguishing the referent for each instance of use.

Note: When multiple 1OPs that uniquely identify an entity of a type are known, for convenience of referring to them a surrogate identifier (name) is assigned to each entity of that type.

 

References

Keys

The Key to Relational Keys: A New Understanding -- Primary Keys
Description

A New Understanding of Keys

No Such Thing As Primary Key Tuning

Keys & Indexes

Don't Conflate/Confuse Primary Keys, PK Constraints, and Indexes


Domains

In his book, RM/V2, Codd specified differences between programming types and relational domains, which are discussed in my writings.

Domains: The Database Glue
Description

Understanding Domains and Attributes

What Domains Are and Are Not

NO RDBMS WITHOUT RELATIONAL DOMAINS

Simple Domains and Value Atomicity

 

 

 

 

No comments:

Post a Comment

View My Stats