Saturday, January 1, 2022

Schema and Performance: Never the Twain Shall Meet

One of the core objectives of this site (and my work) has been to demonstrate that there will not be progress in data management as long as the industry and trade media require and promote exclusively (mainly tool) experience in the absence of foundation knowledge. I have published and analyzed ample evidence that relational language and terminology are used without grasping what it actually means -- a good way to gauge lack of foundation knowledge.

Recently I posted a four part series titled "Nobody Understands the Relational Model" showing that even a practitioner steeped in the RDM does not really understand it. Consider now a practitioner's mistake at the beginning of career -- "a bad database schema and what it did to system performance" -- which, he claims, belatedly taught him a lesson. Hhhhmmm, did it, really?


DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.


- 12/17 (OBG) No Understanding Without Foundation Knowledge Part 1: Reviewing a Book Review

- 12/11 Nobody Understands the Relational Model: Semantics, Closure and Database Correctness Part 4

- 12/05 TYFK: How Not to Explain the Relational Model

- 11/25 Nobody Understands the Relational Model: Semantics, Closure and Database Correctness Part 3

- 11/19 (OBG) The Fate of Fads: XML DBMS

- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.


“... a system intended to slow down or stop a lot of the "open proxy" spam ... spammer scum would grab an open proxy and then would connect outward to your mail exchanger and would try to spew a spam at you. Blocking them did little since they forged everything ... They also didn't use that same open proxy twice, so blocking it also didn't help much ... we could probably store the IP address, HELO string, FROM address and TO address in a table, and send back a 4xx "temporary failure" error the first time we saw that particular tuple (or "quad") ... Simple enough, right? Well, the database used for this thing was absolutely terrible. It was bog-standard MySQL but that's not the bad part. The bad part is the schema. The rows in the table looked like this:

id | ip | helo | m_from | m_to | time | ...

Isn't it supposed to be those four things plus a time? Yes, it is, but the problem was HOW it was stored. Those things were STRINGS (you know, varchars in database-speak). So, every damn row in the single "quads" table looked like this:

1 | ip1 | blahblah   | evil@spammer.somewhere  | victim1@our.domain | ...
2 | ip2 | foobar      | another@evil.spammer       | victim2@our.domain | ...
3 | ip3 | MAILSERV | legit@person                     | user@our.domain     | ...
4 | ip4 | foobar      | another@evil.spammer       | victim1@our.domain | ...

Every row was extremely heavyweight: it had the actual raw string in it. Now think about what the database had to do to check for a match. We'd say something like this:

SELECT whatever
FROM quads
WHERE ip='ip1'
 AND HELO='thing_they_sent_us'
 AND m_from='whatever'
 AND m_to='some_user';

You know what it should have been? It should have been normalized. *Someone* should have heard about things like "third normal form" and all of that stuff. Nobody had heard of that, and so it went, bumbling along, chewing memory and CPU and disk space, and being oh-so-slow.”
--A terrible schema from a clueless programmer

How can a schema (i.e., logical design) affect performance, which is determined exclusively by physical implementation? (note also: "store" instead of "represent"). What is more, if you consider carefully the fix proposed by the author, it involves what is essentially a storage (i.e., implementation, not schema) change, albeit necessitated by the poor support of physical independence by SQL DBMSs (see below).
There is, actually, a schema mistake -- which has correctness, not performance implications -- but it's not the VARCHARs.  As we have explained over and over again, system guaranteed correctness  (logical validity and semantic consistency) requires that database relations -- distinct from abstract mathematical relations -- be in both:

  • 1NF (or normalized): all attributes are defined on simple domains; and,
  • 5NF (or fully normalized): all dependencies that hold in it are functional dependencies of the non-key attributes on the PK.

and a relational algebra with 5NF, not just 1NF closure. But as is common in the industry, while the author claims that the schema "should have been normalized":

  • Provides no conceptual model with reference to which to ascertain the verity of this claim, nor how to correct the design if claim is incorrect.
  • Is unaware that
- normalization means 1NF; and,
- full normalization means 5NF.
  • Believes, erroneously, that 3NF is sufficient.
The author does diverge from the industry's "denormalization for performance" fallacy by implying that further normalization to 3NF might have improved performance, but a schema cannot affect performance in either direction. To the extent that performance changes after a change in logical design, it is usually due to poor support of physical independence by SQL DBMS implementations: they tie schema changes to physical changes that affects performance, reinforcing the logical-physical confusion. A simple example is one-file-per-SQL-table DBMS implementation: further normalization from one table to two (logical) also splits one storage file in two (physical) and it is the latter that affects performance. In the face of performance issues, practitioners should always orient themselves towards DBMS and database implementations: even if they are forced to bastardize logical design, they should understand it correctly as an unjustified burden due to product flaws, rather than blame the RDM.
“The first time you encounter something, you're probably going to make some mistakes. There's a post going around tonight about how someone forgot to put an index on some database thing and wound up doing full table scans (or something like that). The rub is that instead of just being slow, it also cost a fair amount of money because this crazy vendor system charged by the row or somesuch. So, by scanning the whole table, they touched all of those rows, and oh hey, massive amounts of money just set ablaze! My point is: EVERYONE goes through this, particularly if operating in a vacuum with no mentorship, guidance, or reference points. Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work? It's a massive problem, and we're all partly responsible. I'm trying to take a bite out of it now by writing stuff like this. What about you?”
Note very carefully, however, that the lessons learned are implementation ones ("forgot to put an index thing", "doing full table scans or something like that") -- avoiding them can be usually learned from experience, but not logical database design, normal forms, physical independence and so on. Conceptual and logical mistakes avoidance is guaranteed by proper education, not just trial and error ("trying anything that'll work"). "Mentorship and guidance" from those who lack such education themselves won't help. In a second post, for example, the author recalls another mistake typically due to lack of foundation knowledge:

“After it went online, I was reading it out loud to someone in my life, and that's when it hit me: I forgot all about the shit show that was how I SELECTed data back out of the database. Recall that it turned into a situation where the individual components of the query (ip, helo, from, to) were now in their own tables, so they all had a number now -- the "id" column. I figured, okay, I need to obtain those little suckers before I can build a query with them, so let's query!

SELECT id FROM m_ips WHERE m_ip = '...'
SELECT id FROM m_helos WHERE m_helo = '...'
SELECT id FROM m_froms WHERE m_from = '...'
SELECT id FROM m_tos WHERE m_to = '...'

So, yep, I fired off four queries to get four numbers, then built up a query, then fired that off as a fifth to get my results. Five round-trips to the database when one would have done. Did I mention that I obviously didn't know the first thing about JOIN back then? It took several more years for that to happen organically. Go on, bag on me for being ignorant. I know what that really means.”

Relational database management is set oriented and declarative and, unlike implementation aspects, this is something not readily realizable from sheer experience with SQL DBMSs. This practitioner figured some things out, but not the fundamentals.





No comments:

Post a Comment

View My Stats