Tuesday, March 27, 2018

Name the Relational Violation Part 1: Conditional Uniqueness Constraint




Note: This is a rewrite of of an older post (which now links here), to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

"I'm seeing more [data] professionals implementing the following type of conditional unique constraints, typically related to the use of 'soft deletes'):
TABLEX (SK,A,B,C,ACT_FLAG)
Uniqueness is defined for (A,B) iff ACT_FLAG='Yes'. Any row with a ACT_FLAG = 'No' is excluded from the unique requirement."

"Most SQL DB implementations I know of do not allow this type of constraint to be enforced declaratively. Instead it relies on tricks within the index specification for enforcement."

"This conditional application of unique constraints troubles me. The prevalent use of a surrogate primary key avoids duplicates in the table as a whole. But this approach seems to declare a business rule that can be turned off and on based on the value of a non-key column. It definitely feels there is more than one kind of business entity here -- but they do share the same attributes. I also see a similar design pattern when OO classes are mapped to a table during implementation (when the implementation approach is to combine classes into a single table and an attempt is made to enforce two types of 'uniqueness'."

"However, I cannot definitively find a specific rule/guideline within relational theory that it violates. I know there are design alternatives. The fact that there is no true always-on business key other than the surrogate key IS an issue, but the fact that technically the surrogate PK prevents duplicates is almost always presented as a counter argument."


The heart is in the right place, but the violation is obvious. Yet, the several responses did not pinpoint it.


------------------------------------------------------------------------------------------------------------------
SUPPORT THIS SITE

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you.

NEW PUBLICATIONS


HOUSEKEEPING
  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

@THE POSTWEST
  • The Dystopia of Western Decadence, the Only Acceptable Racism and the Myth of the “Palestinian nation".
------------------------------------------------------------------------------------------------------------------

Surrogate Keys


"The theoretical reason is this: The surrogate key is meaningless and therefore adds no meaning. It is not an attribute of the entity, but simply a contrived physical means to enforce uniqueness. It will not prevent duplicated data in non-key columns. That is not to say it is useless, but there must be something else that uniquely identifies any instance of the entity and it is not (A,B,ACT_FLAG) alone, because that would limit you to having at most 1 inactive row ... I cannot point you to chapter and verse in Codd, Date, Inmon, Kimball or whoever, but I'm pretty sure they thought about this and said something similar.

Note: I refer the reader to our revised definitions of natural (NK) and surrogate key (SK), which are distinct from conventional wisdom[2] (incidentally, I would never lump Inmon and Kimball with Codd and Date).

First, a SK is logical, not "physical".

Second, a generated SK represents an artificial name and by definition cannot be used for meaningful logical database access. However, it can be deemed "contrived" if and only if it is added even when there is a simple candidate key (CK) that 
  • Represents a name (not a property); and, 
  • Satisfies all other formal FK requirements[2].
A SK will "not prevent duplicated data in non-key attributes" if and only if a 1:1 relationship between its values and those of a CK that represents either (1) a set of properties or (2) another name known to uniquely identify objects in the real world is not guaranteed,

We don't know if (A,B) represents a compound name or two properties (abstract rather than concrete examples are usually a bad idea), but it is composite. Assuming that it is a valid CK (i.e., it is known to identify objects in the real world), a generated SK is justifiable on simplicity grounds[2].

"Someone please correct me if I'm wrong here, but I feel the prevalent use of the surrogate key is sometimes creeping into logical model deliberations as well. It is almost as if some practitioners are saying: "As long as we have a unique surrogate key, we don't need to go through all of that effort to define a valid unique business key". The surrogate key can be useful, but it is still a requirement to have a unique declarative business key."
"SKs cannot be a substitute for NKs. They can be added for referential purposes when NKs are composite or complex, but the NKs must be preserved."

It is true that SKs are overused and misused. But properly understood, they have valid use, and often are required for both theoretical and practical reasons[2]. In any case, the SK is not the main problem here.

"Bundling" of Object Types


"Note that if entity class/type identification in the business model is poor and tables "bundle" multiple entity classes (which looks like it may be the case here), SKs "legitimize" the poor design, by giving a false impression that there is no bundling. It is risky to base modeling decisions on SKs."

"Bundling" is poor design, but it is a relational violation if and only if it violates 5NF, in which relations must be by definition. This is not the case here.


"I agree that it feels like a kludge. In this type of case it is not really capturing temporal history. It contains two different types of collections: Active/good records to which uniqueness should apply and inactive rows which are essentially discards/soft deletes (which could serve as the basis for another whole discussion) to which we apply no uniqueness constraints. What I'm trying to clarify in my own mind is whether there is a theoretical reason that supports the feeling that this is a kludge. There are practical ways to handle this. One could simply go ahead and delete the 'inactive records'. You could also move the inactive records to a separate history entity."

This is getting closer, but fails to express the RDM violation explicitly.

Note: One respondent suggested:

"Set the active value to NULL if it is active and the "Deactivated" value to a NEWID()..." now you know when it is active and when it is not, and the deactive values will never be the same."
Substituting one RDM violation for another is not a solution.

No Violation


Here's a response from somebody known to be relationally profficient:

"Re. the constraint, it's just that, a constraint. Which happens to be a uniqueness constraint that goes beyond what the currently existing engines can support declaratively, so in order to enforce it, you're on your own (accompanied only by whatever your bag of acquired database design tricks has to offer). There is no "relational theory" that will somehow render rules "invalid" if they are beyond what currently existing engines can support. If the rule "within the set of XYZ's that are marked currently active, the {COL_A,COL_B} combination has to be unique" applies to the business, then it applies and no theory will ever say this is invalid. I agree 100% with your sentiments on the surrogates."
With respect, this is not "just an uniqueness constraint that is not supported by SQL". Quite the opposite.

Stay tuned and in the meantime test your foundation knowledge: let us know in the comments section.

(Continued in Part 2)


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Key to Relational Keys: A New Understanding.



Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.




No comments:

Post a Comment

View My Stats