Friday, December 17, 2021

OBG: No Understanding Without Foundation Knowledge Part 1 -- Debunking a Book Review

Note: To demonstrate the correctness and stability offered by a sound theoretical foundation (relative to the industry's fad-driven "cookbook" practices), I am re-publishing as "Oldies But Goodies" material from the old DBDebunk site (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references, which I enclose in square brackets.

The following was my debunking of a review of my third book (originally published on 01/14/2001).

“Many of us ... do not think that harmony is the great goal, or unity or peacefulness, [and] actually quite like hard questions for their own sake, and enjoy ... the life of the mind. To the question of how to live, the answer is "by disagreement."” --Christopher Hitchens

Let me say, first and foremost, that as the subtitle of the book -- A REFERENCE FOR THE THINKING PRACTITIONER -- indicates, it is targeted at the minority of practitioners who think clearly, independently and critically. It should not be a surprise, then, that those not belonging to that (alas, very small) target audience don't see its practical value. As I have said so many times, if my work gained mass appeal, I would wonder what I was doing wrong. This is the sad reality, whether we like it or not. In fact, to be consistent I will go one step further: I don't assume that positive reviews are any better than negative ones -- they are frequently grounded in reasoning just as faulty and/or in just as much ignorance.

Let me also make clear that I do not place all of the blame on individual database practitioners or users. Rather, the problems are rooted in a systemic, much more profound societal and business culture that fails to instill and encourage foundation knowledge and independent, critical thinking, and that not only does not reward them, but actually punishes them. This is true to a degree in all societies, of course, but in the US the problem is much more acute (there can hardly be a better demonstration of the horrendous implications of this than how the election was covered, perceived and accepted by most of the press and public). [I wrote this prior to the last two elections -- I leave it to the reader to judge the steepness of the subsequent regress.]




The reviewer, one Michael Sims, gives the book a rating of 7.5 out of 10, which is not too bad. However, his summary raised my sensitive antennae. He characterizes the book as follows:

“Not a practical, hands-on book at all; contains high-level database problems and theoretical solutions.”
To assess this statement, the reader must be familiar with the concept of practicality prevalent in the industry: "hands-on" is a code word for product-specific. For most data practitioners anything that is not product-specific is "just theory" and, therefore, not practical. Sure enough, here's Sims:
“Most of the time, when a computer book has the word "practical" in the title, it means one thing: examples. Lots and lots of real world, cut-and-paste examples intended to solve the exact problem you're facing. This book departs from that stereotype by containing little in the way of practical examples. I don't think it even mentions any specific database products. Instead, it mainly discusses the platonic ideal of a database from a scholarly standpoint, and never touches actual examples of database products. As such, it is a relatively timeless book, but it is not what I would describe as "practical".” [emphasis added]
Now, it is certainly true that this is characteristic thinking in the IT industry and trade media. But it is also true that this is patently wrong! It is tantamount to saying in the construction industry that the laws of physics are not practical and, thus, not worth bothering with. But specific practices and tools must, of course, obey those laws (theories) to have any practical value. [Similarly, database practices and DBMSs must obey the laws of mathematics and logic, but the industry has this fallacious notion that "theory is not practical". The book provides plenty of examples of the cost of practices and products that are not practical precisely because they disregard theory.] I will come back to this point below.

Theory in this context is, of course, another word for science. What we really need in any domain is a scientific foundation, if available. When we lack such we don't do very well (compare, for example, astronomy with astrology). It so happens that database management does have scientific foundations: one is simple set theory expressible in first order predicate logic (SST/FOPL) -- is Sims suggesting that the [correctness guaranteed by the theory in which the RDM is grounded] is not practical? If so, what else would he propose to make practice and products practical -- trial and error? [Another foundation is directed graph theory, applicable to what Codd called "network applications", which the industry has treated with similar disregard].

Be that as it may, note very carefully that my book is in no sense theoretical: it is not just about SST/FOPL, but about the practical implications thereof for database management [the RDM is SST/FOPL adapted for and applied to database management]. Sims notwithstanding, the book illustrates with real-world examples the tangible costs of products and practices that disregard theory.
“Essentially ... is a scholarly overview of the whole concept of databases, some common pitfalls that database administrators (DBAs) run into, and where actual database systems fall short of the platonic ideal. It would be a good book for an "Intro to Databases" class (and I don't mean a How to Use Excel course, I mean a CompSci course).”
I doubt that Sims has seen any "scholarly" books, as mine does not bear resemblance to them. Be that as it may, the notion that "common pitfalls DBAs run into" are "scholarly" and, thus, not practical, is odd indeed. As to the book serving as text for a computer science class, I refer Sims to both the book's preface and the editorial article launching DBDebunk, where I provide evidence that under pressure from the industry and students, computer science programs are turning into vocational training grounds for vendors, providing sheer product certification instead of education -- a total betrayal of their function.

Sims' treatment of individual chapters is rather superficial from the perspective of the targeted reader who wants to decide whether the book is useful. He admits that there are recommendations, some even valuable (so the book is not purely theoretical, as he claims). I will tackle here only comments that I find problematic.
“Chapter 1 discusses data types (how data is stored in the database) and suggests that DBAs should not fall into the trap of using complex, proprietary data types over standard character and numeric fields. Chapter 1 also includes the oddest section of the book: 20 pages of Webpage print-outs whose sole unifying theme seems to be "Look what weird stuff people want to put in databases -- and here's a ZDNet printout to prove it!" This section almost turned me off the book entirely, but thankfully it wasn't repeated. I don't know what they were thinking...”
The chapter is very explicit that in the RDM the implementation of a [relational domain, distinct from a programming data type] -- its actual representation in storage -- is hidden from users. Sims ignores that -- his concept of data types as "how data is stored in the database" is as oblivious to physical independence as the rest of the industry. Alas, this confusion of levels of representation is too entrenched to correct.

Sims' conclusion that I promote exclusive use of "simple" data types [is another common misconception about the RDM]. I clarify what true support of database domains (distinct from application data types) -- simple and complex -- by users and the DBMS means, so that practitioners can make informed decisions [as to which is cost-effective to use in specific circumstances, given available DBMSs. What I warn against is the uninformed use of complex domains to avoid the effort of proper database design in the absence of understanding of the implications]. If users believe, once they take these aspects into account, that complex domains are cost-effective (often they are not), they should by all means use them correctly.

Sims entirely misses the point of the appended Webpages, which is particularly odd for somebody who criticizes the book for lack of examples. What they clearly show is what happens if [proper database design is avoided through lazy use of complex domains]. Frankly, it is I who don't know what he was thinking.
“Chapter 2 discusses integrity rules. Integrity constraints are rules that your data should obey - enforcing the rules is the problem. For instance, no two employees should have the same employee number. Essentially, the author's advice boils down to implementing integrity in the database itself rather than via triggers or external logic.”
[Given that database design is mostly specification of integrity constraints], this is a rather superficial description of the chapter. Reducing this critical and much ignored subject to "enforcing the rules is the problem", without referring to the chapter's discussion of the problem and proposed solution (and, incidentally, many examples) is not very helpful.  

As one of the discussants points out to Sims, triggered procedures do implement integrity in the database. The chapter explains, however, that as their name implies, they are procedural, which is why they are inferior for several reasons to declarative constraints -- a critical distinction that the book makes, but that escapes Sims.
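A minimal sketch of the declarative/procedural distinction, using Python's built-in sqlite3 module purely as a convenient stand-in for a SQL DBMS. The table and trigger names are hypothetical and the example is not taken from the book. It illustrates only one of the possible drawbacks of the procedural route: a triggered procedure enforces a rule only for the operations somebody remembered to code it for, whereas a declared constraint holds for every update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Declarative: the rule is stated once, as part of the table definition.
    CREATE TABLE emp_declarative (
        emp_no INTEGER NOT NULL,
        salary INTEGER NOT NULL CHECK (salary > 0)
    );

    -- Procedural: the same rule coded as a triggered procedure, for INSERT only.
    CREATE TABLE emp_procedural (
        emp_no INTEGER NOT NULL,
        salary INTEGER NOT NULL
    );
    CREATE TRIGGER emp_procedural_ins BEFORE INSERT ON emp_procedural
    WHEN NEW.salary <= 0
    BEGIN
        SELECT RAISE(ABORT, 'salary must be positive');
    END;

    INSERT INTO emp_declarative VALUES (1, 100);
    INSERT INTO emp_procedural VALUES (1, 100);
""")


def accepted(statement):
    """Return True if the DBMS accepts the statement, False if it rejects it."""
    try:
        conn.execute(statement)
        return True
    except sqlite3.DatabaseError:
        return False


# Both tables reject an INSERT that violates the rule.
insert_declarative = accepted("INSERT INTO emp_declarative VALUES (2, -5)")
insert_procedural = accepted("INSERT INTO emp_procedural VALUES (2, -5)")

# Only the declared constraint also covers UPDATE; no trigger was written for it.
update_declarative = accepted("UPDATE emp_declarative SET salary = -5 WHERE emp_no = 1")
update_procedural = accepted("UPDATE emp_procedural SET salary = -5 WHERE emp_no = 1")
```

In this sketch the invalid UPDATE is rejected by the declared constraint but slips through the trigger-based table, because the rule was coded for only one operation.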
“Chapter 3 discusses keys. A key is a field in a record with data that you plan to use to pull that record from the table -- for instance, if you were getting information about employees, you might use that employee number as a key, because one employee number should correspond to one record and one employee. The author discusses the various types of keys and makes obvious recommendations.”
More logical/physical confusion. A key is simply a unique logical identifier for tuples [that represent entities]. It is true that keys are frequently used for lookups, but that's precisely because they are identifiers. SQL DBMSs implement key uniqueness with indexes that also improve performance, which causes practitioners to confuse indexes (physical implementation) with keys (logical identification).
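The key/index distinction can be sketched as follows, again with Python's sqlite3 module as a stand-in and with hypothetical table and index names (none of this is from the book). The assumption being illustrated is simply that uniqueness is a declared logical constraint, while an index is an access path that can be created or dropped without changing what the table means.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A key: uniqueness of emp_no is declared as a constraint.
    CREATE TABLE emp_keyed (
        emp_no INTEGER NOT NULL,
        name   TEXT NOT NULL,
        UNIQUE (emp_no)
    );

    -- No key: only a (non-unique) index on emp_no for faster lookups.
    CREATE TABLE emp_indexed (
        emp_no INTEGER NOT NULL,
        name   TEXT NOT NULL
    );
    CREATE INDEX emp_indexed_emp_no ON emp_indexed (emp_no);

    INSERT INTO emp_keyed VALUES (1, 'Smith');
    INSERT INTO emp_indexed VALUES (1, 'Smith');
""")


def accepted(statement):
    """Return True if the DBMS accepts the statement, False if it rejects it."""
    try:
        conn.execute(statement)
        return True
    except sqlite3.DatabaseError:
        return False


# The key rejects a second row with the same identifier; the index does not care.
duplicate_in_keyed = accepted("INSERT INTO emp_keyed VALUES (1, 'Jones')")
duplicate_in_indexed = accepted("INSERT INTO emp_indexed VALUES (1, 'Jones')")

# Dropping the index changes performance at most, never the query result.
query = "SELECT name FROM emp_indexed WHERE emp_no = 1 ORDER BY name"
lookup_with_index = conn.execute(query).fetchall()
conn.execute("DROP INDEX emp_indexed_emp_no")
lookup_without_index = conn.execute(query).fetchall()
```

The lookup returns the same rows with and without the index, while only the table with the declared key guarantees that an employee number identifies exactly one row.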

The chapter's recommendations ought, indeed, to be obvious, but I very much doubt that they all are. If Sims tests this hypothesis empirically, as I have, he will be surprised.
“Chapter 6 discusses entity subtypes and supertypes -- essentially, what do you do when you have items to store in a database that have some traits in common but some not in common. The nomenclature was a little confusing. He discusses some oddities in the most recent SQL standard, which mostly went over my head.”
I'm afraid it's Sims who is confused not by the chapter, whose purpose is to introduce clear thinking on the subject, but by his own fuzzy terminology. Entity types -- not "items" -- are to be represented -- not "stored" -- in the database. The "oddities" in the SQL standard are the complications that practitioners will likely encounter in the products so dear to Sims' heart. He fails to see any significance in the fact that, when compared to the simple solution provided by the theory, those complications (due to ignoring theory) "go over his head". Theory is practical, after all, but he misses that very point. One of the core objectives of relational technology is simplicity. Practitioners have been forced into unnecessary complexities for so long that they are no longer able to discern the difference between them and simple solutions. What better argument is there against the fallacious notion that products alone are practical, while theory is not? The exact opposite is true.
“Chapter 7 discusses trees and hierarchies. In a nutshell: there are no trees in SQL. The author is distressed by this.”
Trees are hierarchies. I would not be surprised if what Sims really means to say is "there are trees in the real world; SQL does not have trees; SQL is relational; hence, relational is not an adequate solution for representing the real world". What the chapter shows, though, is that it is possible to both represent and manipulate tree structures relationally, and in a much simpler way than any other proposed hierarchical solution. It's because SQL ignores and products violate relational theory that they fail to support trees properly. Note how Sims' conclusion not only misses the practicality of theory again, but because his position is generally adopted by practitioners, it ensures that the proper (relational) solution will never be implemented by vendors.

[Note on republication: While, strictly speaking, the above is correct, it is also fair to mention that in introducing the RDM Codd reserved it for what he called "non-network applications"; network applications are better served by directed graph theory, at the expense of the relational advantages.]
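The claim that tree structures can be both represented and manipulated relationally can be illustrated with a minimal sketch. It is not the solution given in the book: it simply assumes a hierarchy held as a relation, modeled here as a Python set of (parent, child) tuples with hypothetical part names, and it derives all descendants by joining the relation with itself until nothing new appears (a transitive closure).

```python
# A relation as a set of (parent, child) tuples; the part names are hypothetical.
PART_OF = {
    ("bike", "frame"),
    ("bike", "wheel"),
    ("wheel", "rim"),
    ("wheel", "spoke"),
}


def transitive_closure(relation):
    """Join the closure with the base relation until no new pairs are derived."""
    closure = set(relation)
    while True:
        joined = {
            (ancestor, descendant)
            for (ancestor, middle) in closure
            for (parent, descendant) in relation
            if middle == parent
        }
        if joined <= closure:
            return closure
        closure |= joined


def descendants(relation, node):
    """All nodes below the given node, at any depth."""
    return {child for (parent, child) in transitive_closure(relation) if parent == node}


bike_parts = descendants(PART_OF, "bike")
wheel_parts = descendants(PART_OF, "wheel")
leaf_parts = descendants(PART_OF, "rim")
```

Here `bike_parts` contains every component at any depth, and a leaf such as "rim" has no descendants. No pointers or navigation are involved, only sets of tuples and a join.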
“Chapter 8 covers redundancy, more or less an extension of chapter 4. Good coverage, mostly seems to be common-sense, but then I've seen plenty of databases that lacked this common sense, so perhaps it isn't as common as one would hope.”
Since there are several types of redundancy, it is somewhat of a stretch to deem Chapter 8 -- which covers them all -- an extension of Chapter 4, which focuses on one type thereof -- duplicate rows.
As Chris Date correctly pointed out while reviewing a draft of this article, duplicates are a special case because "unlike other kinds of redundancy, they violate the theoretical foundation (SST), with horrendous implications for everything" [emphasis his].
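A small sketch of why duplicates are different in kind, assuming Python's set as a stand-in for a relation and collections.Counter as a stand-in for a bag of rows (the rows themselves are hypothetical). It checks one elementary law of set theory, that the union of a set with itself is the same set, and shows that the additive combination of bags (the analogue of SQL's UNION ALL) does not obey it.

```python
from collections import Counter

rows = [("E1", "Smith"), ("E2", "Jones")]

# A relation is a set: stating the same fact twice adds nothing.
relation = set(rows)
set_union = relation | relation

# A bag keeps duplicates: adding it to itself doubles every row.
bag = Counter(rows)
bag_sum = bag + bag

set_law_holds = set_union == relation
bag_law_holds = bag_sum == bag
smith_copies = bag_sum[("E1", "Smith")]
```

The set law holds and the bag analogue fails, so anything that relies on set-theoretic identities cannot be assumed to remain valid once duplicates are admitted.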
“Chapter 10 covers missing data, the difference in database-land between a field with (say) Yes, No, an empty string, or a null value, which has given everyone who does any sort of database programming problems at one time or another. The author's analysis is sound and useful.”
Missing information is not "the difference in database-land between a field with (say) Yes, No, an empty string, or a null value", whatever that means. This trivializes one of the thorniest problems in database management, of which practically all products make a mess. Incidentally, "null value" is a contradiction in terms -- the precise point is that a SQL NULL is not a value [and cannot be treated as such by the data sublanguage]!
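That a SQL NULL does not behave like a value can be seen in a minimal sketch using Python's sqlite3 module as a stand-in SQL DBMS; the table and its contents are hypothetical. Comparing NULL with NULL does not yield true, and a row with a NULL satisfies neither a condition nor its negation, so the two "complementary" counts do not add up to the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_no INTEGER NOT NULL, bonus INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [(1, 100), (2, None), (3, 0)],  # employee 2 has a NULL bonus
)


def scalar(query):
    """Run a query that returns a single row with a single column."""
    return conn.execute(query).fetchone()[0]


# NULL = NULL evaluates to NULL (None in Python), not to true.
null_equals_null = scalar("SELECT NULL = NULL")

total = scalar("SELECT COUNT(*) FROM emp")
bonus_is_zero = scalar("SELECT COUNT(*) FROM emp WHERE bonus = 0")
bonus_is_not_zero = scalar("SELECT COUNT(*) FROM emp WHERE bonus <> 0")
```

Employee 2 is counted in `total` but in neither of the two conditional counts, which is exactly the kind of surprise that treating NULL as "just another value" hides.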
“To sum up, it's a decent book covering a wide range of areas pertaining to databases from a scholarly viewpoint. Perhaps it could be compared to Sun Tzu's Art of War -- it doesn't really discuss YOUR situation, but it gives a lot of tips, and if you pay attention, you'll probably find something in there that will help you in your present crisis. The author is more of a scholar than a hands-on instructor, but he obviously knows what he's talking about. The book title should probably be "The Zen of Databases" or something like that, though, rather than implying it will be some sort of practical guide to administering SQL Server 7 or anything along those lines.”

[Again, theory has practical implications, but the book is about the latter, not the former]. But according to Sims, only "administering SQL Server 7 or anything along those lines" is not theoretical and, therefore, practical. Moreover, the industry notion that theory is not relevant to "your situation", that it won't "help you in your present crisis" is a fallacy -- it is intended precisely to be relevant to the situation of many, not just one user.

To conclude my response to the review: First, it fails to appreciate the difference between working from general principles and what I call the "cookbook" (or "recipe") approach. It is sad that in this day and age -- and certainly in this society, where efficiency is a fundamental value -- there is a need to defend the general against the case-by-case approach. Any situation or crisis that practitioners encounter is an instance to which some general principle applies. Without one you will be limited to trial and error -- is that more practical?

Second, as the book makes clear, many of the problems in practice stem from the products' and users' violating the theory. If users do not realize this, what incentive is there for vendors to bring their products into theoretical compliance and how will they ever progress? Is it more practical to continue to deal with crises, or to avoid them?

“Probably the people who will get the most benefit from it will be DBAs who have learned database administration from the school of hard knocks -- learn by doing -- but find themselves doing it more often than they would like and want to get a little book-learning in to help them past the problems they are encountering. Novices won't get a lot out of it because they won't have hit the problems he describes. Experts will already know the solutions he recommends, although they'll probably get something out of it nonetheless.”
[Note on re-publication: this is exactly upside down and backwards: DBAs encounter many if not most of the problems because they do not learn the fundamentals; the book was written for "experts", most of whom do not know these solutions.]

Consider the following, quite representative "situation" from the preface of the book:
"I need to store 40 pieces of unrelated information. Is it better to create [one] table w[ith one] record [and] 40 fields, or create [one] table w[ith] 40 records [and one] field?"

First, novices clearly do encounter problems that originate in lack of foundation knowledge.

Second, no amount of expertise in any DBMS is sufficient, in itself, to address this sort of problem. Third, since a vast majority of DBAs "learned database administration from the school of hard knocks -- by doing", practically all of them would benefit from the book's focus on fundamentals, which they lack. Had they read it, I wouldn't have been getting the silence in my teaching when I ask attendees with long careers and extended experience "What is a database?"

[Let me conclude with an example from the book -- normalization, covered in Chapters 5 and 8 -- that demonstrates the huge cost of disregard for fundamentals. One of the most common and entrenched industry misconceptions is that RDBMSs force a trade-off of consistency for performance.

It is not realized that the redundancies -- and thus, update anomalies -- are a consequence of poor database design and a relational algebra (RA) without 5NF closure. Consequently, instead of designing databases correctly and incenting proper implementations of the RDM by vendors, practitioners accept SQL DBMSs as RDBMSs and "denormalize" databases for performance.]
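The link between redundancy and update anomalies can be sketched in a few lines, using Python's sqlite3 module as a stand-in SQL DBMS. The tables, columns and data are hypothetical and the example is not from the book. In the "denormalized" design the department's city is repeated in every employee row, so updating one row leaves the database contradicting itself; in the design without the redundancy the fact is recorded once and cannot disagree with itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Denormalized": the department's city is repeated per employee.
    CREATE TABLE emp_denorm (
        emp_no    INTEGER PRIMARY KEY,
        dept_no   INTEGER NOT NULL,
        dept_city TEXT NOT NULL
    );
    INSERT INTO emp_denorm VALUES (1, 10, 'Boston'), (2, 10, 'Boston');

    -- Without the redundancy: the city is recorded once, in dept.
    CREATE TABLE dept (
        dept_no   INTEGER PRIMARY KEY,
        dept_city TEXT NOT NULL
    );
    CREATE TABLE emp (
        emp_no  INTEGER PRIMARY KEY,
        dept_no INTEGER NOT NULL REFERENCES dept (dept_no)
    );
    INSERT INTO dept VALUES (10, 'Boston');
    INSERT INTO emp VALUES (1, 10), (2, 10);
""")

# Update anomaly: only one of the redundant copies gets changed.
conn.execute("UPDATE emp_denorm SET dept_city = 'Chicago' WHERE emp_no = 1")
cities_denorm = conn.execute(
    "SELECT COUNT(DISTINCT dept_city) FROM emp_denorm WHERE dept_no = 10"
).fetchone()[0]

# The same change against the non-redundant design touches a single row.
conn.execute("UPDATE dept SET dept_city = 'Chicago' WHERE dept_no = 10")
cities_norm = conn.execute(
    "SELECT COUNT(DISTINCT d.dept_city) "
    "FROM emp AS e JOIN dept AS d ON d.dept_no = e.dept_no "
    "WHERE e.dept_no = 10"
).fetchone()[0]
```

After the updates, department 10 is in two cities at once according to the redundant table, and in exactly one according to the other design. Keeping the redundant copies consistent would require additional integrity constraints, whose enforcement carries its own cost.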



