Sunday, August 5, 2018

No Such Thing As "Primary Key Tuning"

“The choice of good InnoDB primary keys is a critical performance tuning decision. This post will guide you through the steps of choosing the best primary key depending on your workload ... You would be surprised how many times I had to explain the importance of primary keys and how many debates I had around the topic as often people have preconceived ideas that translate into doing things a certain way without further thinking.”
--Yves Trudeau, Principal architect,

I will be labeled "pedantic" and a "purist" for saying this, but there is no such thing as "PK tuning". "Choice of good PK for performance" reflects logical-physical confusion (LPC), which, in turn, is rooted in lack of foundation knowledge, and failure to grasp the RDM -- not just by database practitioners, but even by DBMS designers.


I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 


NEW: The Key to Relational Keys: A New Perspective


I deleted my Facebook account. You can follow me on Twitter:
@dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back.

@ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.


  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

Primary Keys Misunderstood

In RDM, relation tuples represent (facts about) entities. But simple set theory (SST) is not a theory of tuples -- there is no way in SST to refer to, and operate on a single tuple, only relations (the sets) as a whole. PKs are one of the adaptations by Codd of mathematical relations to databases, which enabled reference to singletons (i.e., sets of a single tuple), and, thus, the representation by database relations of entity groups, tuples of facts, and inferencing from those facts (which is what querying is)[1]. PKs represent unique names for entity types in the database and, thus, are not "chosen", and cannot be "tuned" for performance. While the author refers to PKs, he actually discusses indexes:

“The choice of a good primary key for an InnoDB table is extremely important and can have huge performance impacts. When you start working with a customer using an overloaded x1.16xlarge RDS instance, with close to 1TB of RAM, and after putting a new primary in place they end up doing very well with a r4.4xlarge instance — it’s a huge impact. Of course, it is not a silver bullet – you need to have a workload like the ones I’ll highlight in the following sections. Keep in mind that tuning comes with trade-offs, especially with the primary key. What you gain somewhere, you have to pay for, performance-wise, elsewhere. You need to calculate what is best for your workload.”
And it's obvious why.

Logical-physical Confusion

“InnoDB is called an index-organized storage engine. An index-organized storage engine uses the B-Tree of the primary key to stores the data, the table rows. That means a primary key is mandatory with InnoDB. If there is no primary key for a table, InnoDB adds a hidden auto-incremented 6 bytes counter to the table and use that hidden counter as the primary key. There are some issues with the InnoDB hidden primary key. You should always define explicit primary keys on your tables. In summary, you access all InnoDB rows by the primary key values.”

If a DBMS is only a sheer storage engine, or you view it as such, the RDM is meaningless: there are no  PKs (a relational logical feature), only indexes (a physical access method to improve performance). In a real RDBMS logical PKs are implemented with, but are not indexes, and have nothing to do with performance. In fact, the dependence of PKs on indexes is, strictly speaking, a violation of physical independence (PI)[2] and the RDM[3], a contributor to LPC[4].

Most of the long article is a discusses of indexes. I suspect that if I go through all of it I will find other problems due to LPC -- there always are -- but I will not bother, beyond pointing out that it confuses PKs with indexes.

“In summary I always inform our teams, choose the candidate key that best presents the business, the one you will search by, and most likely order by, group by since the PK is typically the clustered, physically sorted already index ... What about the choice of using a varchar column as a PRIMARY KEY? Is there any limit concerning the maximum length of a varchar column used as PRIMARY KEY for Innodb table? Someone has suggested, as alternative to int or a bingint columns, the use of MD5 or an UUID as a PRIMARY key for innodb.”
In fact, candidate keys (CK) are not really "chosen". They are either (1) determined by the unique names assigned in the real world to entity types, or (2) generated for the database, if names are composite/complex, or have not been assigned in the real world. In the first case, as Trudeau admits, CKs "[re]present [simple, real world] business [names]"; if several exist, the simplest is selected as PK (i.e., it's the PK, not the CKs, that is chosen). If the CK(s) is/are complex, or if none exists, a key is generated to serve as PK which, because it is managed transparently by the DBMSs, can be the simplest possible; any complex keys are enforced as alternate keys (AK)[5,6].

Thus, if a name is assigned to some business entity type in the real world, the representation of which is a VARCHAR() attribute, and there is no simpler name, it is enforced as an AK, and the simplest possible PK is generated and managed transparently by the DBMS, but
not for performance reasons -- although it will also perform well -- but for meaningful logical access and simplicity of FK support[5,7], Whether or not performance is satisfactory given the keys dictated by the conceptual business model, simple PK and complex AKs included, is up to the DBMS implementation[8], including indexing and optimization, which is where indexes belong, and any tuning takes place.


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., Physical Indepdence Parts 1,2,3.

[3] Pascal, F., Interpreting Codd: The 12 Rules.

[4] Pascal, F., The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[5] Pascal, F., The Key to Relational Keys: A New Understanding.

[6] Pascal, F., A New Understanding of Keys Parts 1,2,3.

[7] Pascal, F., Foreign Keys Parts 1,2.

[8] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

No comments:

Post a Comment

View My Stats