Friday, September 1, 2017

Don't Confuse/Conflate Database Consistency with Truth

Disregard for foundation knowledge and failure to learn from past mistakes by even data professionals deemed experts inhibit progress in data management and bring back problems already resolved that should be of foremost concern to data analysts. Consider the following:

"Above all else, we count on databases to reflect the truth consistently, or at least to reflect the table data perfectly. The database cannot be blamed when an application (or the end users of an application) place inaccurate data in its tables, but a database must accurately report the data it holds. Therefore, bugs are not all created equal; there are bugs, and there are wrong-rows bugs, bugs that silently misrepresent the data that the tables hold. Even the craziest, most obscure corner case that potentially misrepresents your data should rightly bring a loud chorus: "The emperor has no clothes!" We depend on the database, above all, not to lie."

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

Note: The common misuse of 'database' to refer interchangeably to both the data and the DBMS -- the software managing the data -- induces confusion (e.g., a database 'is' the data, so the database always reflects itself perfectly -- a tautology). What we are talking about here is 'data integrity' and 'database consistency' -- different issues altogether.

In the database context both 'truth' and 'consistency' are critical, but they are distinct and should not be confused, or conflated. We can and should count on DBMSs to guarantee database consistency with the conceptual model of the real world it represents -- "wrong tuples" are not bugs, but functional failure. On the other hand, a DBMS cannot (and should not be expected to) ensure truth.

A 'mathematical' relation is an abstraction -- tuples have no meaning. 'Database' relations, on the other hand, are assigned meaning by database designers: they represent facts about members of real world object groups. For example, the tuples in the relation
represent facts about employees of the type:
Employee identified by employee number (EMP#) has name (ENAME), works in department with department number (DEPT#), earns salary (SALARY).
Substituting values for the parenthesized terms yields facts about individual employees. But the values are not arbitrary, relations must be consistent with the conceptual model of reality it represents -- i.e., the business rules that specify the defining property values of the corresponding object groups. The rules are formally represented in the database by integrity constraints, enforced by the DBMS, that constrain relations to be consistent with the rules -- i.e., have only valid attribute values.

Note very carefully, however, that a tuple can satisfy the constraints -- i.e., consist of valid values -- yet represent a fact that is not true in the real world -- i.e., not about an actual employee! Hence the difference between consistency and truth.

When the database is queried, a true RDBMS applies the set operations of the 'relational algebra' (RA) to one or more database relations to derive other relations as results. Database tuples represent facts that are 'axioms' and result tuples facts that are 'theorems' in formal logic. 'If -- and only if' -- the axioms are true, the theorems are 'provably' true -- i.e., they are logical implications of the theorems. Querying is, thus, 'inferencing': RA queries are, in effect, logic 'proofs'; a true RDBMS together with a fully normalized relational database form a deductive system and query results are 'system-guaranteed' to be logically correct.

Ideally only true axioms should be represented in the database. But while a DBMS can enforce declared constraints for database consistency, it cannot ascertain truth (e.g., that there is an actual employee with specific property values in the real world). The facts recorded in the database are, thus, not statements of objective truth about the world -- they are 'assumed to be true only because they were asserted by trusted authorized users' and are as true as the trust accorded those users.

Note that even consistency is system-guaranteed 'if and only if' both the DBMS and databases are truly relational. The reality is they are not, so analysts: caveat emptor!

No comments:

Post a Comment

View My Stats