Monday, September 3, 2012

Database Design: Classes of Property-sharing Entities and Relations

UPDATE: This was a revision of a debunking at the old dbdebunk.com. Rewritten 9/3/16.

Q1: "I have a database for a school that has been in use now for a couple of years and it is working well. There are numerous tables obviously but consider these:

  • CONTACT - all contacts,  students, faculty, or any other type of  contact (probably should have called it Entity)
  • FACULTY - info about specific faculty member, foreign key to CONTACT
  • STUDENT - info about specific student, foreign key to CONTACT
Considering that the CONTACT table has fields such as: LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL ... Do you think it is a good idea to have a single table hold such info?  Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you de-normalize for the sake of being more practical?  My first reaction is: if it works well, why not leave it alone? But OK." --comp.databases.theory
Some quibbles:
  • Attributes, not “fields”;
  • Information specific to faculty and students, not “info about specific faculty or student’;
  • "Numerous tables" (should be relations) that are unspecified are a concern;
This is a good example of the all too common practice of asking for design advice online with a poorly specified business model, from people unfamiliar with the reality being modeled and lacking foundation knowledge. 



Clearly there are several types of entities that share some attributes, while some attributes are unique or, in other words, entity supertypes and subtypes (ESS). Logical design without a well defined conceptual model is a fool's errand. Yet practically none of the replies focus on this fundamental problem.

Note: I have replaced references to table/column/row with relations/attributes/tuples in square brackets.

A1: "What if Professor of Mathematics also takes courses in Digital Photography? Are you going to store her phone number in 2 [tuples], both in Faculty and Student [relations]? What if a postdoc student also has a part time job teaching? Depending on the platform, you might find materialized query [relations] aka materialized views, and/or index covering to be good alternatives to denormalization. I'm speaking about DB2 and Oracle."
+ Notices an underspecified business model;
- Contaminates logical design with implementation considerations--denormalization, materialized relations, specific DBMS's (one must have a fully normalized design to denormalize).

A2: "My question is - why do you need to denormalize this? Are there heavy reports being run on this data that don't execute fast enough? What kind of reports? What other info is in faculty and student [relations] besides foreign keys?"
+ Asks about attributes;
- Same focus on denormalization.

Q2:"What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty [relations] then you could archive them easily. Have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact [tuples]? 2 student [tuples]?  Or do you have just one of each and have a change log. Which is best?

The student [relation], besides having a foreign key to the CONTACT would have also the following: medical comments, current grade, foreign key to account which is just another foreign key to contact [relation] for the parent/billing [tuple]. For the faculty [relation], additional [attributes] would include: hire info, login/password for electronic timesheet login."

Now he adds some of what's missing, but still not sufficient. An excellent validation of my strong recommendation to refrain from database design advice without sufficient knowledge. Q2 reveals new requirements and objectives that have almost nothing to do with the original Q1. No doubt, more will emerge with further prompts (see below).
A3: "One table name PERSON. Anything else is a denormalized nightmare."
- This does not merit comment. Incidentally, it would be the one-table design that would be denormalized and a nightmare, not "anything else"  (complete with NULLs for both inapplicable and missing data in SQL).
A4: "Do you need to track all and every change to the tables? Then I would go with logs. If no - are your needs restricted to tracking address changes?"
- Ditto.
Q3: "There are no real requirements for this database. They are simply happy to have whatever is provided. It would be easy to keep a log of changes; however, I am unsure of a more fundamental issue. I know this next statement is riddled with problems but, just for fun, consider this: You track changes yes, but at the end of each day or week or whatever you compare the current record with the last snapshot of the record which would simple be an xml based representation of the record.  If the current record differs from the snapshot then you create a new snapshot. You could have a table like: 
SNAPSHOTXML
 id INT
 dts DATETIME
 tablename ...
 tableid INT
 xml CLOB
should hold the xml snapshot of the record. I know this leaves my initial question concerning normalization somewhat, but is still related. So we have users who don't specify requirements, all the more reason to leave the database alone if it works. And a database practitioners who does not know much about the requirements, or design principles, but wants to shove XML into the bargain (this gives insight into the fad mechanism at work)."
What did I tell you?
A5: "I don't see how having separate [relations] for fundamentally different types of contacts could be considered denormalization. In fact, your current scheme of having a bunch of 1-1 relationships seems to reflect object-oriented thinking rather than relational thinking; it looks like your "faculty" and "student" [relations] and their 1-1 relationships are attempts to "subclass" your "entities".

IMHO, it would make more sense to have separate [relations] for each type of "entity," each containing all and only the attributes that make sense for a faculty member, student, etc.  For one thing, it would make referential integrity checking easier.

As I see it, the only virtue of your current scheme is that if a faculty member happens to have a kid at the school, their contact information is stored in only one place.  That's likely to represent only a few cases, and it's subject to change (what if the teacher gets divorced and the other parent gets custody of the kid?  What if somebody wants to receive personal mail at one location and work- related mail at another?). Otherwise, it looks like an attempt to micro-optimize space usage."

+ How to represent ESSs in a database has, indeed, nothing to do with denormalization;
- However, they do not involve "fundamentally different types of entities"--they share some (though not all) of the attributes;
- Which is why entity subtyping is precisely the correct modeling here--it is part of basic conceptual modeling that has nothing to do with OO per se, or the RDM.


And here is the crucial point: logical databases are formal representations of conceptual models--relations represent entity classes (including ESS's), which are defined by the properties they share. What properties are shared by what entities will determine the entity types/classes that map to relations. Which is what the questions do not fully specify and on which the responders fail to insist. Without well-specified classes of property-sharing entities--the neat three classes of Contacts, Students and Faculty established a-priori may not be the ones determined by the sharing of properties--what exactly will relations represent and on what grounds are they designed?

Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

2 comments:

  1. Interesting post! I can tell you that after building a system for an auction mgmt system that did NOT use property-sharing entities, I learned my lesson and ended up building a second system for a martial arts concern that did. The auction management system has separate bidder and consignor tables and in real life some bidders ended become consignors. By not using property-sharing entities I ended up having to duplicate records across both tables. It wasn't the end of the world, but clearly could have been thought out better. My second major build used a contact table, just as you mentioned, along with student, instructor and billee tables that each had an fk to contact. This has worked well and has reduced duplication. I think when designing your system the key question to consider is the role of these contacts in your system and how they might evolve over time. If a contact will only ever exist as one "type" forever (i.e. a student will only ever be a student, faculty will only ever be faculty) then having separate tables make sense. But if that contact (and contact is a much better name for the table that "entity") could ever exist as more than one "type" then it certainly makes sense to use property-sharing entities .

    As for the need for logging? Yes, you can go through the trouble, but I've never had a client tell me that they needed to know what address a contact was using 4 years ago. As long as the information for the contact is current that's what seems to matter most.

    ReplyDelete
    Replies
    1. Database design is driven by the conceptual model. How good the latter is determines how useful the former is.

      Practitioners tend to conflate the conceptual with the logical -- they rush into design with poorly specified models.

      Delete