Tuesday, January 31, 2017

Outsmarting the DBMS: Analysts Should Beware



Last month I alerted you to the failure by data professionals to appreciate the importance, for soundness reasons, of reliance on the DBMS rather than application code for data manipulation and integrity enforcement. The following is an example of the consequences: 
"If you have multiple boolean fields in a record, consider combining them into a single Integer field. For instance in a User record create a single UserType field instead of 6 separate field for IsTrainee, IsManager, IsTrainer, IsHR, IsSupplier, IsSupport. By assigning 1,2,4,8 and 16, 32 as "yes" values for these then we can say that a value of 3 in this UserType field tell us that they are both Trainee and a Manager; 36 that they are the Trainer, and they are responsible for Support. The advantage of combining these into one field is that is another type can be added (e.g., IsFirstAider=64) without adding a field."


--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
--------------------------------------------------------------------------------- 


Note: "File, "record," and "field" are physical storage concepts and, therefore, indicate lack of awareness and appreciation of "physical independence" (PI) -- the insulation of queries and applications from reorganization in storage of data in databases. The logical concepts are "relation" (presented as R-table), "tuple" (presented as row) and "attribute" (presented as column). By insisting on the latter terms there is less likelihood of confusion of levels of representation that has serious deleterious consequences.

What is being suggested is what is referred to in database management as "attribute overloading" -- having one attribute represent multiple facts. This is advocated in the name of flexibility, but the result is just the opposite.

The core Information Principle (IP) of relational database management mandates that "All information in a relational database is represented explicitly and in exactly one way: as atomic attribute values in relations." Atomic means that attributes and their values do not have components. If database design adheres to the IP, i.e., is normalized (in 1NF), then each and every data value in the database is guaranteed to be logically accessible via the simple addressing scheme R+C+Kv, where: R = relation name, C = column name, and Kv = key value.

The following is true without guaranteed logical access:
  • There is no physical independence -- the insulation of queries and applications of data reorganization in storage, which means that they won't work anymore if the database is reorganized for performance.
  • The DBMS is unaware of the internal structure of attributes and
    • cannot enforce data integrity;
    • complex programming is required to access the components, defeating the relational objectives (see last month's post).
Aside from the fact that this can hardly be considered more flexible than the PI-adherent proper relational design, this means that with today's DBMSs -- which are not the on-demand self-documented systems mandated by the relational data model -- all bets are off, a serious issue for data analysts, particularly those accessing databases they have not designed themselves.

Another example of poor attribute design due to the "think ahead" syndrome is "store everything as strings, so we could store potentially any data format in any syntax we'd like" which, of course, similarly defeats DBMS-enforced integrity. Such designs are too clever by half, because by violating the IP and 1NF, they hide information from the DBMS and prevent it from guaranteeing correctness of the data and query results -- the core advantage of the relational model.

Relational design, databases, and DBMSs are the analyst's friends and, given the state of the database field, she/he would be well advised to be as familiar with them as she/he is with analytics and programming, to appreciate its unobvious risks to correctness.



No comments:

Post a Comment