Tuesday, January 31, 2017

Outsmarting the DBMS: Analysts Should Beware

Revised 5/4/2020.

Last month I alerted you to the failure by data professionals to appreciate the importance, for a variety of critical reasons, of reliance on the DBMS rather than application code for integrity enforcement and data manipulation. The following is an example of the consequences: 
"If you have multiple boolean fields in a record, consider combining them into a single Integer field. For instance in a User record create a single UserType field instead of 6 separate field for IsTrainee, IsManager, IsTrainer, IsHR, IsSupplier, IsSupport. By assigning 1,2,4,8 and 16, 32 as "yes" values for these then we can say that a value of 3 in this UserType field tell us that they are both Trainee and a Manager; 36 that they are the Trainer, and they are responsible for Support. The advantage of combining these into one field is that is another type can be added (e.g., IsFirstAider=64) without adding a field."
Note: "File, "record," and "field" are physical implementation concepts. The logical design concepts are relation (visualizable as R-table), tuple (visualizable as row) and attribute (visualizable as column). By using the proper terms there is less likelihood of confusion of levels of representation rampant in the industry, which has deleterious consequences[1].

Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you.
The industry is chockful of misconceptions due to lack of foundation knowledge. Corrections them are dismissed as "theory that is not practical", misinterpreted as "ad-hominem attacks", or ignored altogether, regardless of the amount and quality of reasoning and supporting evidence. Most practitioners -- be it user or vendor personnel -- cannot discern fallacies and do not realize the practical implications thereof and, thus, cannot associate problems with their real causes., hence the industry's "cookbook approach" and succession of fads.
What about you? Are you just a practitioner, or a thinking professional?
TYFK (Test Your Foundation Knowledge) posts will each present and debunk a pronouncement containing one or more misconceptions. First try to detect them, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.
  • 01/14/20 Updated the LINKS page
  • 01/04/20 Updated the POSTS page with the 2020 posts
  • 12/08/19 Added two educational references on set theory to the LINKS page.

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics site, the links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.


I deleted my Facebook account. You can follow me:
  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

What is being suggested is what is referred to in industry practice as "attribute overloading", which is advocated in the name of flexibility, but the result is just the opposite.

The core Information Principle (IP) of relational database management mandates that "All information in a relational database is represented explicitly and in exactly one way: as atomic values of relation attributes"[2]. Atomic means that attributes are defined on simple domains that do not have components meaningful to user applications (i.e., they are treated as atomic by the data sublanguage)[3]. If database design adheres to the IP -- consists of relations, which are normalized (in 1NF)[4] and fully normalized (in 5NF) by definition[5] -- then each and every data value in the database is guaranteed to be logically accessible via the simple addressing scheme:
relation name + attribute name + PK value[2]
Without such guarantee information is "hidden" from the DBMS in the internal structure of attributes, rather than represented explicitly as values drawn from simple domains as mandated by the IP: physical independence is not possible[6], integrity enforcement and data manipulation are relegated to application programs, defeating database management and relational advantages (see last month's post) and all bets are off[7]. This should be a serious issue for data analysts, particularly if they have not designed the database themselves.

Another example of poor design due to the "think ahead syndrome" is "store [sic] everything as strings, so we could store potentially any data format in any syntax we'd like". Such designs are too clever by half, because they too violate IP by hiding information from the DBMS and prevent it from guaranteeing data integrity and correctness of query results -- a core advantage of the relational model[8].

Relational design, databases, and DBMSs are analysts friends and, given the state of the database field, they are well advised to be familiar with them to appreciate the unobvious risks

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


[2]  Pascal, F.,  Interpreting Codd:The 12 Rules

[6]  Pascal, F., Data Sublanguage series

[7]  Pascal, F., Physical Independence series

No comments:

Post a Comment

View My Stats