There is a comp.database.theory Google Group. Now, one would
expect, on the one hand, that such a group would discuss theoretical issues in
database management. On the other hand, however, one wonders how could such a
group even exist these days in the first place? After all, who knows
what theory means, let alone cares about it? Everybody’s so busy with
“practical” matters, such as XML, or whatever the mindless resident fad happens
to be. When a reader sent me a link to a thread there, everything became clear.
The thread is an exchange following “a database design
question”:
GKelly: I have a database for a school that has been in
use now for a couple of years and it is working well. There are numerous tables
obviously but consider these:
CONTACT - all
contacts, students, faculty, or any
other type of contact; (probably should
have called it Entity)
FACULTY - info about specific faculty
member, foreign key to CONTACT
STUDENT - info about specific
student, foreign key to CONTACT
Considering that the CONTACT table has fields such as: LAST, FIRST, MI, ADDR,
CITY, STATE, ZIP, EMAIL ... Do you think it is a good idea to have a single
table hold such info? Or, would you have had the tables FACULTY and
STUDENT store THE LAST, FIRST, ADDR and other fields? At what point do you
de-normalize for the sake of being more practical?
This is the kind of question that C. J. Date labels “I don’t
know how to do my job, and I am looking for somebody to do it for me”. Unfortunately,
that’s the rule, rather than the exception these days. And, as hard as it is to
imagine, it’s getting worse. There seems to be no bottom.
Note, first, that the asker provides no information about the
conceptual/business model that the numerous tables are supposed to
represent. In fact, it’s not clear that he himself knows that (or even
knows that he must know!). In other words, he asks for advice on how to
represent something complex in the database, without saying what that is.
Typical of today’s practitioner.
It is obvious that GKelly is clueless about modeling and
design in general, and the specific situation he is dealing with in particular.
It is not advisable (a) to ask people you don’t know for advice, if you’re not
knowledgeable enough to assess its quality and (b) to give such advice in the
absence of sufficient information about the business model. But how reasonable
is it to expect that the current average practitioner would heed this, or base
their response on sound theory? Let’s see.
AK: What if Professor of Mathematics also takes courses
in Digital Photography? Are you going to store her phone number in 2 rows, both
in Faculty and Student tables? What if
a post doc student also has a part time
job teaching? Depending on the platform,
you might find materialized query tables aka materialized views, and / or index
covering to be good alternatives to
denormalization. I'm speaking about DB2 and Oracle.
To his credit, AK noticed that not sufficient information is
provided, specifically the logical relationships between entities of types
Contact and those of types Faculty and Student. Without such information, it is
impossible to tell whether the database is normalized or not in the first
place. But he ignores the numerous other tables, and quickly jumps the gun,
contaminating what is clearly a conceptual/logical design matter with physical
implementation details (“materialized views”, a bad idea in itself, indexes,
and commercial products).
Impulsive: My question is - why do you need to
denormalize this? Are there heavy
reports being run on this data that don't execute fast enough?
What kind of reports? What other
info is in faculty and student tables besides foreign keys?
Even though Impulsive does not know the business model (how
can one worry about reports at this point escapes us), he too assumes
that the database is normalized, and that what GKelly suggests is denormalizing
it. Note, however, that his angle is physical—performance—as is the next
response:
B Blink: In
general when performance becomes an issue.
GKelly again:
No, I do not 'need' to
denormalize this. And I don't want to change it just for the
sake of change.
What would you do
when you want to close out one year and start a new year?
If you had standalone
student and faculty tables then
you could archive them easily. Have a school semester
and year attached
to them. However, as you go from one
year to the next information about a student
or faculty may change. Like their addr and phone for example. The
database model now is not very good because it doesn’t maintain a history.
If Student A was in school
last year as well but lived somewhere else would you have 2 contact records?
2 student records? Or do you have just
one of each and have a change log. which is best?
The student table
besides having a foreign key to the contact would have also the following:
medical comments, current grade, foreign key to account which is just another
foreign key to contact table for the parent/billing record.
For the faculty table, additional fields would include: hire info,
login/password for electronic timesheet login
Here’s precisely why remote design advice should never be
offered online, in response to confused, incomplete questions, by somebody
without foundation knowledge. First,
GKelly asked a question about denormalization—which, as we shall see,
may not be the issue at all—but his real reason seems to have been history
maintenance, a completely different issue, absent altogether from his
initial question! Second, because the design is complex, the three tables under
consideration are involved in relationships with other tables, which are
unknown. It is entirely unclear how that affects the design, in either the
normalization, or history context. For history issues, we refer the reader to
C. J. Date, H. Darwen, and N. Lorentzos, TEMPORAL DATA AND THE RELATIONAL
MODEL (Morgan Kaufman, 2002).
The next comment is one of those where you don’t know whether
to laugh or cry.
--DA Morgan:One
table name PERSON. Anything else is a denormalized nightmare.
Morgan is an Oracle practitioner who teaches at the
University of Washington, with whom I happen to have had an exchange, More on the Myth of
Market-based Education. The reader interested in forming an opinion
about Morgan’s expertise, and his qualifications for academic teaching is urged
to read it. Then his comment should not surprise.
Even if the issue were normalization, it’s the
bundling of everything into one table, like Morgan proposes, that’s the maximum
denormalization, yet he deems “anything else” denormalized! Wouldn’t want
to be his student.
To Impulsive’s question:
Impulsive: Do you need to track all and every change to
the tables? Then I would go with logs. If no - are your needs restricted to
tracking address changes?
consider now GKelly’s response:
There are no real requirements for this database. They
are simply happy to have whatever is provided.
It would be easy to keep a log of changes; however, I am unsure of a more
fundamental issue. I know this next statement is riddled with problems but,
just for fun, consider this: You track
changes yes, but at the end of each day or week or whatever you compare the
current record with the last snapshot of the record which would simple be an
xml based representation of the record. If the current record differs
from the snapshot then you create a new snapshot. You could have a table like:
SNAPSHOTXML
id INT
dts DATETIME
tablename ...
tableid INT
xml CLOB -- this field would hold the xml snapshot of
the record
I know this leaves my initial question concerning normalization
somewhat, but is still related.
So much for theory. Another excellent example of how fads
take off when ignorance reigns supreme: a user organization without data
management knowledge hires an IT “professional” who does not possess such
knowledge either. XML has absolutely nothing to do with the issues under
consideration, and the only reason GKelly brings it up is because it’s being
pushed everywhere in the industry, and he jumps on the bandwagon because he
does not know what the real issues are.
Eric Bohlman: I don't see how having separate tables for
fundamentally different types of contacts could be considered
denormalization. In fact, your current
scheme of having a bunch of 1-1 relationships seems to reflect object- oriented
thinking rather than relational thinking; it looks like your
"faculty" and "student" tables and their 1-1 relations are
attempts to "subclass" your "entities." IMHO, it would make
more sense to have separate tables for each type of "entity,"
each containing
all and only the attributes that make sense for a faculty member, student,
etc. For one thing, it would make
referential integrity checking easier.
As I see it, the only virtue of your current scheme is that if a faculty
member happens to have a kid at the school, their contact information is stored
in only one place. That's likely to
represent only a few cases, and it's subject to change (what if the teacher
gets divorced and the other parent gets custody of the kid?
What if somebody wants to receive personal
mail at one location and work- related mail at another?).
Otherwise, it looks like an attempt to
micro- optimize space usage.
Here’s evidence that somebody with at least some
knowledge of data fundamentals can immediately suspect what the real issue
probably is (I say probably because, as already stated, there is no sufficient
information to know for sure). It’s quite likely that this is an entity
supertype/subtype situation. If so, it has nothing to do with
normalization, and everything to do with orthogonality: indeed, tables
ought to be independent precisely to avoid a type of redundancy that
normalization cannot address. Informally and very loosely, normalization
ensures that multiple entity types are not “bundled” into one table, and
orthogonality ensures that no entity type is split into multiple tables. For
the fallacy in denormalization for performance”, the reader is referred The Costly Illusion:
Normalization, Integrity and Performance.
Entity supertypes/subtypes pose a challenge to design,
because entities in such a case are neither of the same type (which would
clearly result in one table in GKelly’s case), nor of distinct types (which
would result in one obvious table per
entity type). In fact, as paper #10 demonstrates, Bohlman is mistaken: the
correct solution does not involve referential constraints. But proper
design requires, at the very minimum, knowledge of all the common and unique
properties, which GKelly does not provide.
Since all contacts have an address, address attributes are
common, which is why they reside, correctly, in the CONTACT table, yet
GKelly considers, wrongly, transferring those attributes to the tables
representing the entity subtypes. What is more, he deems such transfer
“denormalization for practical reasons”. The dumbing down has become so severe,
that it’s no longer just a matter of poor designs: now they are messing up
whatever few good designs are left!!!!
From: Daniel Guntermann
To: Fabian Pascal
Date: 09 Apr 2005
I read with interest your Theory, My Foot.
I subsequently followed up to take a look at
the thread you cited in the article myself.
1. To be fair and quite a bit more precise, the thread was
cross-posted to two groups in the following sequence: comp.databases and
then comp.databases.theory. One
would expect that kind of thread of conversation in the comp.databases
Usenet group, which often reflects a more practical type of discourse than one
encounters in comp.databases.theory.
In fact, that particular discussion group makes no claims towards theory
whatsoever; but rather allows, by virtue of its charter, for questions and
discourse about anything and everything related to databases and DBMSs,
including implementation specific questions as well as questions and interchanges
between beginners.
2. Having been inculcated in relational theory by Chris
Date's books since undergraduate school, I have great respect for precision and
logical support for arguments, particularly because of my exposure to his
precision as well as his logical faculties in conveying concepts and
ideas. I was therefore somewhat
surprised and baffled at your article, which seemed to indicate an opposite
tack.
3. would recommend that you take a look at a few more
conversations and threads in the comp.databases.theory newsgroup before
making such large, sweeping generalizations about something you obviously spent
little time researching to get the facts straight.
There are some doozies and some trolls, but you will find that not
everything there is a nail as well. Thus, there is little need to use the
rhetorical hammer. You might be
pleasantly surprised in some cases, especially if you go back a year or more.
4. By the way, Dan Morgan, whom you reference in your
article, is not a University of Washington professor.
He teaches what basically is a vocational program.
It is simply sponsored by the University of
Washington. There are no graduate level requirements levied in order to attend
his type of class whatsoever; nor does one have to be part of an academic
program.
5. His responses in your exchanges were somewhat surprising
as well. His vocational course is
entirely centered on Oracle implementations of a SQL DBMS -- for that is the
title of the course he tries to teach.
It has nothing to do with IBM, SQL Server, or any of the others
implementations outside of Oracle. A
syllabus is posted outlining the topics of the Oracle vocational program and
can be found at the UW extension program website (along with the C++, Unix,
UML, and other courses taught by practitioners).
From: Fabian Pascal
[I numbered the paragraphs for convenience]
1. I was unaware of the other post, but the title does not,
obviously, describe the full purpose of my critique. I am willing,
however, to correct the title. Can you suggest a better one?
2. Why? Do you think that my critique of the exchange was i>not
precise and logical?
3. Thanks, but no, thanks. I am not a masochist. I prefer to
read intelligent material that I can learn from, debunking is a job, not a
pleasure. The amount of crap in online exchanges is overwhelming material worth
reading by so much, that spending any time on that beyond what my debunking
duties require would be a horribly unproductive use of my time.
4. I stand corrected, but my response to that is: Morgan
should not be teaching in academia. Anybody who teaches database management,
even those who teach product and implementation, should be familiar with
fundamentals. They exist for practical reasons and must serve as context. His
reply to the design question was wrong, and that was a practical issue.
Besides, Morgan's problem is much worse that just ignorance of fundamentals.
5. I am not surprised at all. To reiterate, his problem is
much worse than knowledge.
From: Daniel Guntermann
I hate to criticise and then not have a better alternative to
suggest, but not really.
No. The title and introduction led off with some rather strong
statements. Such positioning in an article usually reflects a hypothesis or
main idea one intends to support. This probably had an effect on
perception for the rest of the paper, which in truth, was quite correct.
It seemed your central argument began with a focus on the flawed nature of
theory today, using the comp.databases.theory newsgroup as a rhetorical device.
That is a hard hypothesis to "prove" based on a single use case, and
that use case had its own flaws in supporting such a generalized supposition,
since its context wasn't just comp.databases.theory, as I pointed out.
OK. I understand perfectly.
I couldn't agree more. Until someone was blunt enough to let him know his
blunder, he cited 'Cobb' for over a year rather than E.F. Codd.
Thank you very much for taking the time out of what I am sure
is a busy schedule to respond.
From: Fabian Pascal
Well, if I understand you correctly, your only complaint then
is that I create a wrong impression about comp.databases.theory. I am still
inclined to guess that if I did a systematic survey, it would prove I do not,
but since I will not bother, and since you don't have a better title for the
article, posting this exchange will allow readers to make their own minds.
Posted 4/15/05
© Fabian Pascal 2006 All Rights Reserved