Monday, September 3, 2012

Normalization, Orthogonality and Database Design

UPDATE: Corrected some minor errors.

On 4/15/05 I debunked at the old site an online exchange from the comp.databases.theory group following “a database design question”. This is a completely revised debunking of that exchange.
GK: I have a database for a school that has been in use now for a couple of years and it is working well. There are numerous tables obviously but consider these:
CONTACT - all contacts,  students, faculty, or any other type of  contact (probably should have called it Entity)
FACULTY - info about specific faculty member, foreign key to CONTACT
STUDENT - info about specific student, foreign key to CONTACT

Considering that the CONTACT table has fields such as: LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL ... Do you think it is a good idea to have a single table hold such info?  Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you de-normalize for the sake of being more practical? 
My first reaction is: if it works well, why not leave it alone? But OK.

Some quibbles: GK is listing columns, not “fields” and he means ‘information specific to faculty and students’, not “info about specific faculty or student’. This may be pedantic, but when it comes to database design, completeness and precision are critical factors. Also, anyone advising GK should be uncomfortable with the existence of "numerous tables" of which nothing is known.

And there is more important confusion. I am inferring that the existing design is as follows:
CONTACTS {CPK, columns shared by all};
FACULTY {FPK, columns unique to faculty};
STUDENTS {SPK, columns unique to students};
where FPK and SPK are also foreign keys referencing CPK.

This is sufficient to figure out what the design issue is and it also makes clear what information is necessary for the design, but is missing. Let's see if the replies reflect that.
AK: What if Professor of Mathematics also takes courses in Digital Photography? Are you going to store her phone number in 2 rows, both in  Faculty and Student tables? What if a postdoc student also has a part time job teaching? Depending on the platform, you might find materialized query tables aka materialized views, and/or index covering to be good alternatives to  denormalization. I'm speaking about DB2 and Oracle.
To his credit, AK noticed that the business model aspects to which the tables correspond are underspecified: we can't tell how many distinct entity types there are. Unfortunately, he incorrectly assumed that this is a denormalization issue. Loosely, denormalization "bundles" distinct entity types into single tables. This is not the case here: GK asks about "unbundling" data in the CONTACT table to the FACULTY and STUDENT tables, but more importantly, the entity types here are not exactly distinct, as we shall see.

In any case, AK  quickly contaminates what is clearly a business model/logical design matter with physical implementation details (“materialized views”, indexes, and commercial products).

The next two responders make the same mistakes of denormalization and logical-physical confusion:
I: My question is - why do you need to denormalize this?  Are there heavy reports being run on this data that don't execute fast  enough? What kind of reports?  What other info is in faculty and student tables besides foreign keys?

BB: In general when performance becomes an issue.
I does ask about the columns, but that is not the main thrust of his questions.
GK:  No, I do not 'need' to denormalize this.  And I don't want to change it just  for the sake of change.

What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily. Have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change.  Like their address and phone for example. The database model now is not very good because it doesn’t maintain a  history.  If Student A was in school last year as well but lived somewhere else would you have 2 contact records?  2 student records?  Or do you have just one of each and have a change log. Which is best? 

The student table, besides having a foreign key to the CONTACT would have also the following: medical comments, current grade, foreign key to account which is just another foreign key to contact table for the parent/billing record.
For the faculty table, additional fields would include: hire info, login/password for electronic timesheet login.
Excellent validation of my reluctance to provide specific design advice online: GK asked a question about "denormalization", but he has now added a completely different aspect—history maintenance; and here are relationships (foreign keys) to those other unknown tables.
DAM: One table name PERSON. Anything else is a denormalized nightmare.
Even if the issue was denormalization, it would be his one table design that would be denormalized, not "anything else".
I: Do you need to track all and every change to the tables? Then I would go with logs. If no - are your needs restricted to tracking address changes?
GK: There are no real requirements for this database.   They are simply happy to have whatever is provided.
It would be easy to keep a log of changes; however, I am unsure of a more fundamental issue. I know this next statement is riddled with problems but, just for fun, consider this:  You track changes yes, but at the end of each day or week or whatever you compare the current record with the last snapshot of the record which would simple be an xml based representation of the record.  If the current record differs from the snapshot then you create a new snapshot. You could have a table like:
      id INT
      dts DATETIME
      tablename ...
      tableid INT
      xml CLOB--should hold the xml snapshot of the record
I know this leaves my initial question concerning normalization somewhat, but is still related.
So we have users who don't specify requirements, all the more reason to leave the database alone if it works. And a database practitioners who does not know much about the requirements, or design principles, but wants to shove XML into the bargain (this gives insight into the fad mechanism at work).
EB: I don't see how having separate tables for fundamentally different types of contacts could be considered denormalization.  In fact, your current scheme of having a bunch of 1-1 relationships seems to reflect object- oriented thinking rather than relational thinking; it looks like your "faculty" and "student" tables and their 1-1 relations are attempts to "subclass" your "entities."
IMHO, it would make more sense to have separate tables for each type of "entity," each containing all and only the attributes that make sense for a faculty member, student, etc.  For one thing, it would make referential integrity checking easier.
As I see it, the only virtue of your current scheme is that if a faculty member happens to have a kid at the school, their contact information is stored in only one place.  That's likely to represent only a few cases, and it's subject to change (what if the teacher gets divorced and the other parent gets custody of the kid?  What if somebody wants to receive personal mail at one location and work- related mail at another?).  Otherwise, it looks like an attempt to micro-optimize space usage.
Finally somebody with real foundation knowledge. Once GK mentioned common and unique attributes it was clear that the case is one of entity supertype-subtypes (ESS), not denormalization.

Normalization and denormalization have to do with distinct entity types, each with a unique set of attributes and are governed by the Principle of Full Normalization (PFN): loosely, one entity type per R-table. ESS involves entities that have both shared and unique attributes and has to do with orthogonality/lack thereof.

Bohlman's recommendation to "separate tables for each type of 'entity,' each containing all and only the attributes that make sense" indicates the design specified above as the the common one used for ESS, one that I advocated myself in PRACTICAL ISSUES IN DATABASE MANAGEMENT. This design requires two foreign key constraints and when a subtype row is inserted into the database, both the CONTACT and one of the two subtype tables must be updated.

But while the CONTACT table represents the entities of the supertype, what do the FACULTY and STUDENTS tables represent? Not the entities of the subtypes, because they have both shared and unique attributes, not just unique ones. It is possible to represent the two classes of entities of the subtypes by views that join the CONTACT table with the FACULTY and STUDENT tables. But aside from the problem of SQL systems not permitting multitable view updates, this seems to be in violation of the Principle of Orthogonal Design (POD) as initially formulated by Date and McGoveran: no two distinct R-tables shall have isomorphic projections, that is, overlapping meanings, which the two views have by virtue of their shared attributes. Indeed, the POD was discovered and intended to ensure view updatability.

What about the design considered by GK? 
FACULTY {FPK, columns shared and unique to faculty};
STUDENTS {SPK, columns shared and unique to students};
This design does not require foreign key constraints or multi-table updates (assuming no faculty take courses and no students teach), but shifts the violation of POD from the views in the previous design to the tables in this one.

But the POD is now controversial. Here's what I added to the article formulating it at the old site:
Database design is more art than science, that is, it is mainly of an informal nature. Whatever formality is there, it has been limited to what we currently call the Principle of Full Normalization (PFN). In 1993, however, McGoveran and Date presented in a two-part paper an additional formal design principle, The Principle of Orthogonal Design (POD). Jointly, the two principles are intended to prevent redundancy, update anomalies,  harder to understand databases, and other practical complications in database design.

The POD formally defined in the original paper was informally described as prohibition of tables or projections thereof with overlapping meanings. But there is now disagreement between the two authors on the original definition of POD, which has been rejected by Hugh Darwen. Date has since revised his formal definition, which is quite complex, does not have a very easy informal description and as far as I know, McGoveran neither supports it, nor did he publish his own definition.
Given the SQL restriction, this may be a moot issue for now, sothe first design is probably better, used with views for querying and direct multi-table updates.

But the main point is that this is not what most of the exchange was about and the only person who raised the real issue was the one who knew data fundamentals. Note in particular that what  DAM advocated would have resulted in a non-relational table with NULLs standing for inapplicable, clearly an incorrect design.

(Originally posted on 4/15/05 at; thoroughly revised  8/24/12)

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

No comments:

Post a Comment