Sunday, June 15, 2014

Conceptual Muddling and Database Kludges


The Problem with "Conditional" Unique Constraints... raised on LinkedIn and the response to it merit the attention of practitioners.
EP: I'm seeing more implementations where developers/database design professionals are implementing the following type of conditional unique constraints, typically related to the use of 'soft deletes'): 
Table X {COL_SURROGATE_PK,COL_A,COL_B,COL_C,COL_ACTIVE_FLAG}
Uniqueness is defined for {COL_A,COL_B} iff COL_ACTIVE_FLAG='YES'. Any row with a COL_ACTIVE_FLAG = 'NO' is excluded from the unique requirement. Note that I do not mean that COL_ACTIVE_FLAG is part of the key; instead it is being used to conditionally enforce the key.
Most SQL DB implementations I know of do not allow this type of constraint to be enforced declaratively. Instead it relies on tricks within the index specification for enforcement.

This conditional application of unique constraints troubles me. The prevalent use of a surrogate primary key avoids duplicates in the table as a whole. But this approach seems to declare a business rule that can be turned off and on based on the value of a non-key column. However, I cannot definitively find a specific rule/guideline within relational theory that it violates. Any thoughts on that matter? I know there are design alternatives. I'm looking at a way of critiquing (or ultimately accepting) this type of approach for a theoretical standpoint.
(The fact that there is no true always-on business key other than the surrogate key IS an issue, but the fact that technically the surrogate PK prevents duplicates is almost always presented as a counter argument).

It definitely feels there is more than one kind of business entity here - but they do share the same attributes. I also see a similar design pattern when OO classes are mapped to a table during implementation (when the implementation approach is to combine classes into a single table and an attempt is made to enforce two types of 'uniqueness').


AL: The theoretical reason is this: The surrogate key is meaningless and therefore adds no meaning. It is not an attribute of the entity, but simply a contrived physical means to enforce uniqueness. It will not prevent duplicated data in non-key columns. That is not to say it is useless, but there must be something else that uniquely identifies any instance of the entity and it is not {COL_A,COL_B,COL_ACTIVE_FLAG} alone, because that would limit you to having at most 1 inactive row. If COL_ACTIVE_FLAG were replaced or augmented by {COL_FROM_TIMESTAMP,COL_TO_TIMESAMP} and optionally - depending on the entity definition - maybe audit columns like CONTEXT or USER, then we would have a candidate for a true unique constraint. I cannot point you to chapter and verse in Codd, Date, Inmon, Kimball or whoever but I'm pretty sure they thought about this and said something similar.
Points arising (I would not include the last two names in the same sentence with the first two).
  • Are SK's a physical means?
  • SK's are perfectly acceptable in RM if circumstances justify them e.g. long composite NK's that must be referenced by FK's. They are not a theoretical violation, just a bad idea when universally deployed, a la OID's, without justification, like in this case.
EP: I agree that it feels like a kludge. In this type of case it is not really capturing temporal history. It contains two different types of collections: Active/good records to which uniqueness should apply and inactive rows which are essentially discards/soft deletes (which could serve as the basis for another whole discussion) to which we apply no uniqueness constraints. What I'm trying to clarify in my own mind is whether there is a theoretical reason that supports the feeling that this is a kludge. There are practical ways to handle this. One could simply go ahead and delete the 'inactive records'. You could also move the inactive records to a separate history entity.
ES: Re. the constraint, it's just that, a constraint. Which happens to be a uniqueness constraint that goes beyond what the currently existing engines can support declaratively, so in order to enforce it, you're on your own (accompanied only by whatever your bag of acquired database design tricks has to offer). There is no "relational theory" that will somehow render rules "invalid" if they are beyond what currently existing engines can support. If the rule "within the set of XYZ's that are marked currently active, the {COL_A,COL_B} combination has to be unique" applies to the business, then it applies and no theory will ever say this is invalid.
SIRA_PRISE (the result of my personal exploring around in the area of database constraint enforcement) effectively allow you to do something akin to:
    CREATE VIEW active_xyz
    AS SELECT *
       FROM xyz
       WHERE col_active_flag = 'Yes'
       KEY (col_a,col_b);
Note that third line which applies a KEY clause to this view, with the same semantics as if it were applied to a base table. (And making abstraction of the serious parsing problem introduced by syntaxes such as the one I gave - I'm only trying to illustrate a concept/functionality.)
I agree 100% with your sentiments on the surrogates.
The point about constraints is very important. Logical kludges indicate, more often than not, problems at the conceptual level and poor business modeling. Muddled conceptualizations cannot be resolved at lower levels of abstraction except by costly kludges that yield more complex, harder to understand, develop, maintain and error prone databases and applications.
ES: A bit of second thoughts, inspired by AL's remark about "combining two tables into one".
1. Each row in a R-table represents a statement about the world (in logic: a proposition) that is deemed to be true.
2. Formally, for the rows that appear in the same table, these statements/propositions can be regarded as deriving from a "parameterized" statement. The "parameters" in this statement are precisely the names of the columns. In logic: the "parameterized statement" is a predicate, the column names are the "free variables" in this predicate. These predicates define the real business meaning of the data. For example: "Citizen COL_A has bought a house in COL_B at price COL_C" with the key indicating (for example) that each citizen can buy only one house in each state.
But now try to formulate such a predicate that also incorporates your ACTIVE_FLAG. Try to make it such that it produces a grammatically valid English sentence both for the case where the ACTIVE_FLAG is TRUE and where it is FALSE (assuming this column is now a BOOLEAN). You'll have to work very hard on this one already.
In your scenario, this is indicative that your business is actually dealing with two distinct predicates, making two different kinds of statement about the world, and this is in turn indicative that at the logical level you are dealing with two different tables (as AL pointed out).
3. The next theoretical thing to consider is the Closed World Assumption (CWA). This one says that all the rows that make a predicate associated with a table TRUE in the real world, should effectively be present in that table.
The predicate for your table including the ACTIVE_FLAG is very likely to contain some phrasing like "It is ACTIVE_FLAG that it is currently believed that ...".Now ask yourself whether it is true that "it is FALSE that it is currently believed that ES has bought a house in Alaska at price $2". That is a true statement (most probably). And as a consequence, this row should be present in the database. And a row for the price of 3$, and 3.01$, and extra rows for every state in which I'm not currently believed to have bought a house, etc. etc.
Now re-fiddle your predicate such that these spurious rows are no longer required to appear, according to the CWA. That'll be even tougher than the previous exercise. And if you succeed in that (in managing to make those FALSE rows disappear from the database), then what is the only value that can still appear for ACTIVE_FLAG? And if there can only be one value for that column anyway, then why have the column altogether?
Database designs which include any kind of YES/NO, TRUE/FALSE indicator, usually constitute a breach of the CWA, and are usually indicative that what is really going on is "one separate table for the YES case and one other for the NO" [Ed. Note: some entities have some attribute, others don't].
That said, I must also add that it is unclear to me what exactly the damaging consequences are of such a breach. But hey, you asked for a theoretical underpinning of your gut feeling and I gave you one.
See what I mean?
EP: Someone please correct me if I'm wrong here, but I feel the prevalent use of the surrogate key is sometimes creeping into logical model deliberations as well. It is almost as if some practitioners are saying: "As long as we have a unique surrogate key, we don't need to go through all of that effort to define a valid unique business key". The surrogate key can be useful, but it is still a requirement to have a unique declarative business key.
SK's cannot be a substitute for NK's. They can be added for referential purposes when NK's are composite or complex, but the NK's must be preserved.

Note that if entity class/type identification in the business model is poor and tables "bundle" multiple entity classes (which looks like it may be the case here), SK's "legitimize" the poor design, by giving a false impression that there is no bundling. It is risky to base modeling decisions on SK's.       
CJB: Set the active value to NULL if it is active and the "Deactivated" value to a NEWID()...now you know when it is active and when it is not, and the deactive values will never be the same.
Yuckh! The point is to avoid problems, not pile them up.




Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

2 comments:

  1. The part of this article I thought was especially important:

    ES: Re. the constraint, it's just that, a constraint. Which happens to be a uniqueness constraint that goes beyond what the currently existing engines can support declaratively, so in order to enforce it, you're on your own (accompanied only by whatever your bag of acquired database design tricks has to offer). There is no "relational theory" that will somehow render rules "invalid" if they are beyond what currently existing engines can support. If the rule "within the set of XYZ's that are marked currently active, the {COL_A,COL_B} combination has to be unique" applies to the business, then it applies and no theory will ever say this is invalid.

    SIRA_PRISE (the result of my personal exploring around in the area of database constraint enforcement) effectively allow you to do something akin to:
        CREATE VIEW active_xyz 
        AS SELECT * 
           FROM xyz 
           WHERE col_active_flag = 'Yes'
           KEY (col_a,col_b);

    Note that third line which applies a KEY clause to this view, with the same semantics as if it were applied to a base table.

    I don't argue with the above as far as existing products and conventional habits are concerned.

    But as far as theory is concerned, specifically so-called declarative constraints, the "KEY" keyword is often called a shorthand even though its declarative meaning seems to be almost universally ignored. What I mean by that is practically nobody in the www / popular press ever points out that the equivalent in algebraic terms ( or in FOL style, with the appropriate changes in terminology ) of a relation with a so-called primary key is a projection of a join of one or more relations with extended headings.

    My complaint is that people often treat such a language keyword as if the relation it applies to can be treated as if its full declaration doesn't involve a join. And that this somehow makes a base table with such a key constraint not logically a join / projection of a join. If more people would recognize the join maybe we'd have some progress in the treatment of views. ( I think this is so even when a product supports Table_Dee and Table_Dum - which I don't object to as they make optimization easier .) Nothing against shorthands as long as they aren't shortcuts that distort what the declaration means.

    Regarding the relation in question, my attitude is that it's actually a union of several other relations. In other words its simplest algebraic expression doesn't declare a base relation. I admit that what I call "simplest" might seem very elaborate to many practioners, at least the ones I've known.

    ReplyDelete
  2. One of my core interests is what I call the generality-to-simplicity ratio in conceptual modeling and database design, which is practically ignored in the industry. In my paper on the subject I propose a way to approach an optimal ratio: maximum generality with minimum complexity possible.

    My take from the comment is that it deviates from that optimum and makes the database less understandable and, therefore, constraint formulation and application less simple and more prone to error.

    It is a pragmatic issue, but relational theory allows you to optimize the ratio and there is no necessity to complicate matters.

    ReplyDelete