Sunday, May 6, 2018

Meaning Criteria and Entity Supertype-Subtypes

Note: This is a re-write of an earlier post to bring it in line with the McGoveran formalization and interpretation[1] of Codd's RDM.

"I have a database for a school ... [with] numerous tables obviously, but consider these:
CONTACT - all contacts (students, faculty): has fields such as LAST, FIRST, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT."
"Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical? What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows?  Or do you have just one of each and have a change log. Which is best?"
I will ignore the flawed terminology (relations, not tables; attributes, not fields), and why would somebody who does not know past, or new requirements, modeling, and database design, mess with a working database just because "he heard something about (insert your favorite fad here)", and consider the replies.



I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you.



  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 

  • The Dystopia of Western Decadence, the Only Acceptable Racism, and the Myth of a “Palestinian nation”
"What if Professor of mathematics also takes courses in digital photography? Are you going to store her phone number in 2 rows, both in FACULTY and STUDENT tables? What if a postdoc student also has a part time job teaching? Depending on the platform, you might find materialized query tables aka materialized views and/or index covering to be good alternatives to denormalization. I'm speaking about DB2 and Oracle."
Starts well, pointing out the absence of a conceptual model[2]. But accepts denormalization as an issue (impossible to judge without a conceptual model, and which the asker probably does not understand), exhibits the logical-physical confusion (LPC)[3] underlying the common "denormalization for performance" misconception[4], rushing prematurely into physical implementation issues.
"One table name PERSON. Anything else is a denormalized nightmare."
Apparently, relations can be designed without knowledge of what exactly they're supposed to represent. Upside down and backwards, of course: it is the one-table design -- that would be "a denormalized nightmare". We note, in passing, that we currently contend relations are in 5NF, not just 1NF, by definition 
-- denormalized tables are not relations and out of the RDM[5].
"I don't see how having separate tables for fundamentally different types of contacts could be considered denormalization. In fact, your current scheme of having a bunch of 1:1 relationships seems to reflect entity-oriented thinking rather than relational thinking; it looks like your FACULTY and STUDENT tables and their 1:1 relationships are attempts to 'subclass' your 'entities'."

"IMHO, it would make more sense to have separate tables for each type of 'entity', each containing all and only the attributes that make sense for a faculty member, student, etc. For one thing, it would make referential integrity checking easier."

"As I see it, the only virtue of your current scheme is that if a faculty member happens to have a kid at the school, their contact information is stored in only one place. That's likely to represent only a few cases, and it's subject to change (What if the teacher gets divorced and the other parent gets custody of the kid? What if somebody wants to receive personal mail at one location and work-related mail at another?) Otherwise, it looks like an attempt to micro-optimize space usage."

In order to represent every group with single-type members by a relation-- which is what full normalization means informally -- you must have identified the groups, ensured that they have single-type members, and determined whether they are partitioned into subgroups, which has not been done here. The rest is essentially a description of some drawbacks of databases that are not fully normalized (i.e., not relational).

But although it also reflects some confusion of levels representation, this is the only reply that engages, albeit imperfectly, with the core issue, which is not normalization: "it looks like your ... tables are ... attempts to subclass your entities".

Meaning Criteria and Subgroups

From the little information offered we can guess that some properties are shared by all contacts, while some are unique to students, and some to faculty. In other words, there is what is known as an entity supertype-subtypes relationship (ESS): students and faculty are subtypes of the contacts supertype.

Entity groups have individual and collective defining properties that are required for all group members, but there are also optional individual properties that only some members -- a subgroup -- have, which are known as meaning criteria[6].

"Some defining properties are formed as the disjunction of two or more properties. These disjuncts, taken together, are called meaning criteria. Each meaning criterion (individual disjunct) induces a partitioning of a group into two subgroups of entities -- those that meet the criterion and those that do not -- and serves to differentiate each subgroup from the others. Some of the subgroups will be disjoint, while others will not. Each of the subgroups is defined by (“inherits”):
(1) the defining properties of the group conjoined with
(2) at least one meaning criterion (that meaning criterion, or those meaning criteria, thus becoming the defining property, or properties, respectively, specific to the proper subgroup).
--David McGoveran

Each meaning criterion partitions a group into two subgroups -- one that has the optional property, and one that doesn't -- which can be disjoint or overlapping. The defining properties of the subgroup consist of the defining properties of the group conjoined with the corresponding meaning criteria.

The optional properties unique to students and faculty are meaning criteria that partition the contacts group into two subgroups (if faculty can be students, they overlap). The defining properties of the students and faculty subgroups are the common contact properties that the inherit from the group conjoined with the corresponding meaning criteria.

The common database design of ESS relationships in the industry is usually the single table with NULLs recommended above, which, of course, is not a relation. I will leave it to the reader to judge whether the it is not the table that is the "denormalized nightmare", by considering the gap between the users' understanding of the table and what the DBMS knows, and the complexity of the manipulation required to extract certain information from it.

Prior to McGoveran's work, I was using the same kind of design as the one above. Here:


CONTACTS represents the supergroup, but what do the other two relations represent? Not the faculty and students subgroups, because their defining properties also include the common properties, not represented in the two subrelations.

In the RDM as currently understood, database design must adhere to the Principle of Orthogonal Design (POOD): each disjoint subgroup is represented by a relation[7].


Meaning Criteria and SQL

SQL DBMSs do not support ESS relationships. Users can create a UNION of two projections view:

SELECT id,last,first,...,email
FROM faculty
SELECT id,last,first,...,email
FROM students;

A true RDBMS can support ESS relationship and figure out things for itself iff it supports (1) full relation predicates and (2) a theory of types[1].


[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., Don't Design Databases Without Foundation Knowledge and Conceptual Models.

[3] Pascal, F., Physical Independence Part 2: Logical-physical Confusion

[4] Pascal, F., "Denormalization for Performance": Don't Blame the Relational Model.

[5] Pascal, F., What Relations Really Are and Why They Are Important.

[6] Pascal, F., Conceptual Modeling for Database Design: A New Perspective on a a Sound But Ignored Foundation, forthcoming.

[7] Pascal, F., The Principle of Orthogonal Design Part I, II, III.

Note: I will not publish or respond to anonymous comments. If you want to say something, stand behind it. Otherwise don't bother, it'll be ignored.

No comments:

Post a Comment