THEORY, MY FOOT
by Fabian Pascal

 

 

 

There is a comp.database.theory Google Group. Now, one would expect, on the one hand, that such a group would discuss theoretical issues in database management. On the other hand, however, one wonders how could such a group even exist these days in the first place? After all, who knows what theory means, let alone cares about it? Everybody’s so busy with “practical” matters, such as XML, or whatever the mindless resident fad happens to be. When a reader sent me a link to a thread there, everything became clear.

 

The thread is an exchange following “a database design question”:

 

GKelly: I have a database for a school that has been in use now for a couple of years and it is working well. There are numerous tables obviously but consider these:

   CONTACT - all contacts,  students, faculty, or any other type of  contact; (probably should have called it Entity)
   FACULTY - info about specific faculty member, foreign key to CONTACT
   STUDENT - info about specific student, foreign key to CONTACT
Considering that the CONTACT table has fields such as: LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL ... Do you think it is a good idea to have a single table hold such info?  Or, would you have had the tables FACULTY and STUDENT store THE LAST, FIRST, ADDR and other fields? At what point do you de-normalize for the sake of being more practical?

 

This is the kind of question that C. J. Date labels “I don’t know how to do my job, and I am looking for somebody to do it for me”. Unfortunately, that’s the rule, rather than the exception these days. And, as hard as it is to imagine, it’s getting worse. There seems to be no bottom.

 

Note, first, that the asker provides no information about the conceptual/business model that the numerous tables are supposed to represent. In fact, it’s not clear that he himself knows that (or even knows that he must know!). In other words, he asks for advice on how to represent something complex in the database, without saying what that is. Typical of today’s practitioner.

 

It is obvious that GKelly is clueless about modeling and design in general, and the specific situation he is dealing with in particular. It is not advisable (a) to ask people you don’t know for advice, if you’re not knowledgeable enough to assess its quality and (b) to give such advice in the absence of sufficient information about the business model. But how reasonable is it to expect that the current average practitioner would heed this, or base their response on sound theory? Let’s see.

 

AK: What if Professor of Mathematics also takes courses in Digital Photography? Are you going to store her phone number in 2 rows, both in  Faculty and Student tables? What if a post doc student also has a part  time job teaching?  Depending on the platform, you might find materialized query tables aka materialized views, and / or index covering to be good alternatives to  denormalization. I'm speaking about DB2 and Oracle.

 

To his credit, AK noticed that not sufficient information is provided, specifically the logical relationships between entities of types Contact and those of types Faculty and Student. Without such information, it is impossible to tell whether the database is normalized or not in the first place. But he ignores the numerous other tables, and quickly jumps the gun, contaminating what is clearly a conceptual/logical design matter with physical implementation details (“materialized views”, a bad idea in itself, indexes, and commercial products).

 

Impulsive: My question is - why do you need to denormalize this?  Are there heavy reports being run on this data that don't execute fast  enough?  What kind of reports?  What other info is in faculty and student tables besides foreign keys?

 

Even though Impulsive does not know the business model (how can one worry about reports at this point escapes us), he too assumes that the database is normalized, and that what GKelly suggests is denormalizing it. Note, however, that his angle is physical—performance—as is the next response:

 

B Blink: In general when performance becomes an issue.

 

GKelly again:

 

   No, I do not 'need' to denormalize this.  And I don't want to change it just  for the sake of change. What would you do when you want to close out one year and start a new year? If you had standalone student and faculty tables then
you could archive them easily.   Have a school semester and year attached to them.  However, as you go from one year to the next information about a student  or faculty may change.  Like their addr and phone for example. The database model now is not very good because it doesn’t maintain a  history.  If Student A was in school last year as well but lived somewhere else would you have 2 contact records?  2 student records?  Or do you have just  one of each and have a change log.  which is best?

   The student table besides having a foreign key to the contact would have also the following: medical comments, current grade, foreign key to account which is just another foreign key to contact table for the parent/billing record.
   For the faculty table, additional fields would include: hire info, login/password for electronic timesheet login

 

Here’s precisely why remote design advice should never be offered online, in response to confused, incomplete questions, by somebody without foundation knowledge.  First, GKelly asked a question about denormalization—which, as we shall see, may not be the issue at all—but his real reason seems to have been history maintenance, a completely different issue, absent altogether from his initial question! Second, because the design is complex, the three tables under consideration are involved in relationships with other tables, which are unknown. It is entirely unclear how that affects the design, in either the normalization, or history context. For history issues, we refer the reader to C. J. Date, H. Darwen, and N. Lorentzos, TEMPORAL DATA AND THE RELATIONAL MODEL (Morgan Kaufman, 2002).

 

The next comment is one of those where you don’t know whether to laugh or cry.

 

--DA Morgan:One table name PERSON. Anything else is a denormalized nightmare.

 

Morgan is an Oracle practitioner who teaches at the University of Washington, with whom I happen to have had an exchange, More on the Myth of Market-based Education. The reader interested in forming an opinion about Morgan’s expertise, and his qualifications for academic teaching is urged to read it. Then his comment should not surprise.

 

Even if the issue were normalization, it’s the bundling of everything into one table, like Morgan proposes, that’s the maximum denormalization, yet he deems “anything else” denormalized! Wouldn’t want to be his student.

 

To Impulsive’s question:

 

Impulsive: Do you need to track all and every change to the tables? Then I would go with logs. If no - are your needs restricted to tracking address changes?

 

consider now GKelly’s response:

 

There are no real requirements for this database.   They are simply happy to have whatever is provided.
It would be easy to keep a log of changes; however, I am unsure of a more fundamental issue. I know this next statement is riddled with problems but, just for fun, consider this:  You track changes yes, but at the end of each day or week or whatever you compare the current record with the last snapshot of the record which would simple be an xml based representation of the record.  If the current record differs from the snapshot then you create a new snapshot. You could have a table like:

   SNAPSHOTXML
      id INT
      dts DATETIME
      tablename ...
      tableid INT
      xml CLOB  -- this field would hold the xml snapshot of the record

I know this leaves my initial question concerning normalization somewhat, but is still related.

 

So much for theory. Another excellent example of how fads take off when ignorance reigns supreme: a user organization without data management knowledge hires an IT “professional” who does not possess such knowledge either. XML has absolutely nothing to do with the issues under consideration, and the only reason GKelly brings it up is because it’s being pushed everywhere in the industry, and he jumps on the bandwagon because he does not know what the real issues are.

 

Eric Bohlman: I don't see how having separate tables for fundamentally different types of contacts could be considered denormalization.  In fact, your current scheme of having a bunch of 1-1 relationships seems to reflect object- oriented thinking rather than relational thinking; it looks like your "faculty" and "student" tables and their 1-1 relations are attempts to "subclass" your "entities." IMHO, it would make more sense to have separate tables for each type of "entity," each containing all and only the attributes that make sense for a faculty member, student, etc.  For one thing, it would make referential integrity checking easier.  As I see it, the only virtue of your current scheme is that if a faculty member happens to have a kid at the school, their contact information is stored in only one place.  That's likely to represent only a few cases, and it's subject to change (what if the teacher gets divorced and the other parent gets custody of the kid?  What if somebody wants to receive personal mail at one location and work- related mail at another?).  Otherwise, it looks like an attempt to micro- optimize space usage.

Here’s evidence that somebody with at least some knowledge of data fundamentals can immediately suspect what the real issue probably is (I say probably because, as already stated, there is no sufficient information to know for sure). It’s quite likely that this is an entity supertype/subtype situation. If so, it has nothing to do with normalization, and everything to do with orthogonality: indeed, tables ought to be independent precisely to avoid a type of redundancy that normalization cannot address. Informally and very loosely, normalization ensures that multiple entity types are not “bundled” into one table, and orthogonality ensures that no entity type is split into multiple tables. For the fallacy in denormalization for performance”, the reader is referred The Costly Illusion: Normalization, Integrity and Performance.

 

Entity supertypes/subtypes pose a challenge to design, because entities in such a case are neither of the same type (which would clearly result in one table in GKelly’s case), nor of distinct types (which would  result in one obvious table per entity type). In fact, as paper #10 demonstrates, Bohlman is mistaken: the correct solution does not involve referential constraints. But proper design requires, at the very minimum, knowledge of all the common and unique properties, which GKelly does not provide.

 

Since all contacts have an address, address attributes are common, which is why they reside, correctly, in the CONTACT table, yet GKelly considers, wrongly, transferring those attributes to the tables representing the entity subtypes. What is more, he deems such transfer “denormalization for practical reasons”. The dumbing down has become so severe, that it’s no longer just a matter of poor designs: now they are messing up whatever few good designs are left!!!!

 

 

From: Daniel Guntermann

To: Fabian Pascal

Date: 09 Apr 2005

 

I read with interest your Theory, My Foot.  I subsequently followed up to take a look at the thread you cited in the article myself.

 

1. To be fair and quite a bit more precise, the thread was cross-posted to two groups in the following sequence: comp.databases and then comp.databases.theory.  One would expect that kind of thread of conversation in the comp.databases Usenet group, which often reflects a more practical type of discourse than one encounters in comp.databases.theory.  In fact, that particular discussion group makes no claims towards theory whatsoever; but rather allows, by virtue of its charter, for questions and discourse about anything and everything related to databases and DBMSs, including implementation specific questions as well as questions and interchanges between beginners.

 

2. Having been inculcated in relational theory by Chris Date's books since undergraduate school, I have great respect for precision and logical support for arguments, particularly because of my exposure to his precision as well as his logical faculties in conveying concepts and ideas.  I was therefore somewhat surprised and baffled at your article, which seemed to indicate an opposite tack.

 

3. would recommend that you take a look at a few more conversations and threads in the comp.databases.theory newsgroup before making such large, sweeping generalizations about something you obviously spent little time researching to get the facts straight.  There are some doozies and some trolls, but you will find that not everything there is a nail as well. Thus, there is little need to use the rhetorical hammer.  You might be pleasantly surprised in some cases, especially if you go back a year or more.

 

4. By the way, Dan Morgan, whom you reference in your article, is not a University of Washington professor.  He teaches what basically is a vocational program.  It is simply sponsored by the University of Washington. There are no graduate level requirements levied in order to attend his type of class whatsoever; nor does one have to be part of an academic program.

 

5. His responses in your exchanges were somewhat surprising as well.  His vocational course is entirely centered on Oracle implementations of a SQL DBMS -- for that is the title of the course he tries to teach.  It has nothing to do with IBM, SQL Server, or any of the others implementations outside of Oracle.  A syllabus is posted outlining the topics of the Oracle vocational program and can be found at the UW extension program website (along with the C++, Unix, UML, and other courses taught by practitioners).

 

 

From: Fabian Pascal

 

[I numbered the paragraphs for convenience]

 

1. I was unaware of the other post, but the title does not, obviously, describe the full purpose of my critique. I am willing, however, to correct the title. Can you suggest a better one?

 

2. Why? Do you think that my critique of the exchange was i>not precise and logical?

 

3. Thanks, but no, thanks. I am not a masochist. I prefer to read intelligent material that I can learn from, debunking is a job, not a pleasure. The amount of crap in online exchanges is overwhelming material worth reading by so much, that spending any time on that beyond what my debunking duties require would be a horribly unproductive use of my time.

 

4. I stand corrected, but my response to that is: Morgan should not be teaching in academia. Anybody who teaches database management, even those who teach product and implementation, should be familiar with fundamentals. They exist for practical reasons and must serve as context. His reply to the design question was wrong, and that was a practical issue. Besides, Morgan's problem is much worse that just ignorance of fundamentals.

 

5. I am not surprised at all. To reiterate, his problem is much worse than knowledge.

 

 

From: Daniel Guntermann

 

I hate to criticise and then not have a better alternative to suggest, but not really.

No.  The title and introduction led off with some rather strong statements. Such positioning in an article usually reflects a hypothesis or main idea one intends to support.  This probably had an effect on perception for the rest of the paper, which in truth, was quite correct.  It seemed your central argument began with a focus on the flawed nature of theory today, using the comp.databases.theory newsgroup as a rhetorical device.  That is a hard hypothesis to "prove" based on a single use case, and that use case had its own flaws in supporting such a generalized supposition, since its context wasn't just comp.databases.theory, as I pointed out.

OK.  I understand perfectly.

I couldn't agree more.  Until someone was blunt enough to let him know his blunder, he cited 'Cobb' for over a year rather than E.F. Codd.

Thank you very much for taking the time out of what I am sure is a busy schedule to respond.

 

 

From: Fabian Pascal

 

Well, if I understand you correctly, your only complaint then is that I create a wrong impression about comp.databases.theory. I am still inclined to guess that if I did a systematic survey, it would prove I do not, but since I will not bother, and since you don't have a better title for the article, posting this exchange will allow readers to make their own minds.

 

 

Posted 4/15/05

© Fabian Pascal 2006 All Rights Reserved