Thursday, June 1, 2017

Redundancy, Consistency, and Integrity: Derivable Data

Database redundancy can wreak havoc with interpretation of analytics results, but it also poses consistency risks that can affect the correctness of the results themselves. The risks are too underappreciated for effective prevention. Given industry practices, analysts who use databases they did not design, or designed without sufficient foundation knowledge, should be on the alert.

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

A database represents a conceptual model of some segment of reality of interest. The model is informal and consists of business rules expressed in natural language that structure reality as groups of objects with properties. Database systems can only manipulate abstract symbols mathematically, so the model must be 'formalized' -- i.e., expressed symbolically in some formal data language -- for database representation. To that end, object groups are represented in the database by relations, facts about objects by tuples, properties by attributes, and business rules by integrity constraints. The DBMS enforces the constraints to ensure 'database consistency' -- i.e., that the formal database representation is faithful to the informal model of reality -- i.e., is consistent with the rules.

Reality, of course, is not redundant. Information about reality can be recorded redundantly due to database design -- often inadvertently, due to poor design skills, but sometimes intentionally, for user convenience.

'Derivable data' are one source of redundancy. It suffices for the purposes of this discussion to understand that querying relational databases means applying the set operations of the 'relational algebra' (RA), individually or in any sequence, to one or more relations, to produce -- derive -- relations as results, a property referred to as 'relational closure.' This is akin to the numeric algebra we call arithmetic, whose operations, when applied to numbers, produce numbers. The RA is closed to relations like arithmetic is closed to numbers.
A database that contains a relation that is derivable via some sequence of RA operations from the others is redundant. If a projection of that relation is derivable via some RA operations from other projections of relations, including self, it is 'strongly redundant.' A projection selects one or more attributes of a relation.

Consider the relation:
with EMP# as the primary key and MGR# as a foreign key referencing EMP#. Since managers are employees, the projection of EMPLOYEES on {MGR#,MGRNAME} is derivable (via a FK-PK self-join) from the projection on {EMP#,ENAME}.

Strong redundancy introduces inter-relation dependencies that, because they do not exist in the real world, are not reflected in the business rules and the DBMS is unaware of them. There is nothing to prevent updates that leave the database in an inconsistent state (e.g., a change of ENAME of a manager without a corresponding change of MGRNAME, or vice-versa). The responsibility for data integrity falls on users and applications, the unreliable approach that database management made obsolete.

Strong redundancy should be avoided by design, but if it is introduced, it must be 'declared to the DBMS' in the form of constraints, such that it can enforce database consistency. Consider, for example, three relations R, S, T, and two RA operations A, B:
  • Relation A(T) is derivable from relation T by operation A;
  • Relation B(R,S) is derivable from relations R and S via operation B;
  • Relation A(T) is derivable from relation B(R,S) -- i.e., A(T) = B(R,S).
A database that contains either A(T) or B(R,S) together with R, S, and T is strongly redundant. Consistency is guaranteed 'if and only if' the above equality is declared to the DBMS as a formal constraint, 'in addition to the constraints formalized from the rules.'

The reality is that many databases are strongly redundant, but devoid of the additional constraints. Even if database designers were aware of the need to add them -- they are not -- that requires a RDBMS with a relationally complete data language that supports fully and correctly all relational operations and constraints. Unfortunately, SQL DBMSs are not truly relational and SQL does not provide such support. (For relational constraints and how SQL compares, see To Understand Integrity, Don't Start with SQL.)

So analysts cannot take database consistency for granted and should not.

No comments:

Post a Comment

View My Stats