[Ed. Note]
This part uses a more formal language and in order to make it easier for the reader, here are the terms used in this part and the equivalents I used in the previous three parts:
- relation: R-table
- tuple: row
- attribute: column (I reserve attribute for the conceptual level)
- relation predicate: intersect of integrity constraints to which an R-table is subject
David refers to the tables whose data is stored (base tables in SQL) as the physical design. I prefer to reserve the term for the internal storage and access methods which are used to implement the tables' representation on disk.
His discussion of user-managed missing data is in the context of SQL DBMSs. Consider it as yet another burden left by vendors to developers. For a proposed relational (metadata) solution supportable by a TRDBMS see my paper The Last NULL in the Coffin. For an implementation technology that would facilitate such TRDBMS support see Go Faster! - The TransRelational(TM) Approach to DBMS Implementation]
[End Ed. Note]
The question how to handle missing information such as birth dates and death dates in a database about people has raised a number of issues. I addressed these issues somewhat abstractly in previous articles[1] and in more depth in seminars and workshops. The present attempt is both a bit more narrow in scope and more concrete.
Note: I invent language here to support the concepts as necessary. If you are using a data modeling tool, you should be able to translate into the language of that tool easily enough. (Otherwise obtain a better one!) Implementation is addressed as part of physical design (see below).
Whenever a database design issue arises, it is crucial that we begin with a purely logical design. At the logical level we do not make decisions about what is or is not physically stored. The logical design of the database representing the business (conceptual) model in Part I should be as follows:
PEOPLE {PERSON_ID, NAME};Explanation: We always give an instance its most specific type (and can then deduce that instance's supertype). In effect, instances of a type are necessarily disjoint from each of sets of instances of that type’s proper subtypes. We can easily see whether or not a relation B{A1, ... An, B1, ... Bn} is the subtype of relation A{A1, ... An}: B inherits all the attributes and constraints (including dependencies) of A, such that if relation predicate (i.e., the membership predicate[2] for a relation) PB is TRUE for tuple {A1, ... An, B1, ... Bn}, relation predicate PA is necessarily TRUE for tuple {A1, ... An}.
LIVE_PEOPLE {PERSON_ID, NAME, BIRTH} PROPER SUBTYPE OF PEOPLE;
DEAD_PEOPLE {PERSON_ID, NAME, BIRTH, DEATH} PROPER SUBTYPE OF LIVE_PEOPLE;
Conceptually, when a proposition (fact) is asserted as "TRUE"--i.e., when the tuple representing that proposition is inserted in the database--we start at the bottom of the type hierarchy and match attributes (more precisely, we check for consistency of the fact with the corresponding relation predicates for each relation in the hierarchy). In this example, if we know DEATH, the data can only match the relation predicate for DEAD_PEOPLE and if we know BIRTH but not DEATH, the data can only match the relation predicate for LIVE_PEOPLE. Finally, if we only know NAME, the data can only match the relation predicate for PEOPLE.
If, at a later time, we learn the birth date BIRTH of a person listed in PEOPLE as having name NAME, the type of that instance has been specialized to LIVE_PEOPLE. The corresponding PERSON_ID must be moved from PEOPLE to LIVE_PEOPLE. Similar comments apply if a death date DEATH is learned (How would you modify this example if it were possible to have (knowledge of) a DEATH without a BIRTH? What about a PERSON_ID without a NAME?).
Now let’s move on to questions of physical design. Technically, we can choose any design that is information preserving with respect to the logical design. That is, the physical design must be derivable from the logical design using relational operators. But this is where SQL DBMSs fail us horribly--they do not adequately support the full expressiveness of first order predicate logic, relation predicates, introduce NULLs, do not support a complete type theory and arbitrary expressions of constraints and of dependencies, confuse logical and physical issues, limit view updating, etc. So we are forced into designs that are product-specific and will have to use work-arounds to get the desired result.
We need to choose which of the relations in the logical design to store. Our first task is to determine if the application permits us to deduce that any of those relations will be empty. For example, if our application will only track persons that are named, and for which we have either a birth date or a birth date and a death date, then we never need to store a distinct PEOPLE relation. (Exercise: Why?) And, if we need to, we can derive the set of all instances of the PEOPLE type as:
LIVE_PEOPLE {PERSON_ID, NAME}By contrast, if we need to track named persons whether or not they have either a birth date or a death date, then PEOPLE should be a stored relation.
UNION
DEAD_PEOPLE {PERSON_ID, NAME}
Now, for reasons of physical efficiency (e.g., anticipating lots of queries that refer to all people I’m tracking) and barring other application constraints, I may want to store a single relation
ALL_PEOPLE {PERSON_ID, NAME, BIRTH, DEATH}and create projection views that have the attributes of DEAD_PEOPLE, LIVE_PEOPLE and PEOPLE. Instead of using NULLs, we use default values for BIRTH and DEATH that are outside the otherwise permissible values for the attributes--this escapes the use of many-valued logic, while permitting a simple projection/restriction view to select the tuples belonging to a particular subtype (write the queries for this!).
It also lets us index each of BIRTH and DEATH. In many SQL DBMSs, such single table views are updatable, assuming the defaults are declared. (What extra work is required to keep these "phony default" values from distorting data analysis? Hint: How you define your views makes all the difference!)
As I have commented elsewhere (Nothing from Nothing, Part IV), such designs are easy to extend if the application requires you to differentiate between "inapplicable" and "unknown" (missing). For example, suppose that we want to record that a person died, but we do not know their birth date. This situation is clearly different from what we have dealt with above, in which a missing death date for a living person is clearly "inapplicable". And there are many other ways in which we may want to classify so-called missing data:
- "unknown for now"
- "unknowable ever"
- "not entered"
- "have multiple possible values"
If I’m concerned about managing data quality as a background task, I prefer to store it in a set of adjunct tables which I treat as metadata and so in the same way as I would tables in the system catalog. For example, I might create a table
MISSING {TABLE, KEY, ATTRIBUTE, REASON}I could then search this table for REASON = "NOT_ENTERED" from time-to-time, in order to issue a request to the application user to locate the appropriate data, thereby improving the quality of the database.
On the other hand, classification of why data is missing for an attribute might be an inherent part of the application. In such cases, it is useful to add a "meta-attribute" column to the physical relation that explains the entry or lack thereof. The default for this column is a code meaning "VALID". Other entries need to be selected from a select list during data entry if no valid value is provided for the attribute in question. Meta-attributes can be hidden from the casual user with the views used to select a particular subtype.
Throughout the foregoing, I assume that the application per se will be built on top of views that select a particular type or subtype. The application programmer, let alone the business user, should never be presented with a design that reflects anything other than the entities with which they are familiar. As database experts, it is our responsibility to do our best in achieving that goal--no matter the obstacles SQL DBMS vendors may place in front of us. While the result may not be perfect, discipline and theory will help narrow the gap between how we use products and what a TRDBMS would enable.
[1] Nothing from Nothing, DBP&D, in four parts, Dec.1993-Mar. 1994. Also in C. J.Date, RELATIONAL DATABASE WRITINGS 1994-1997, Addison-Wesley, 1998.
[2] Whenever I refer to a predicate, I mean a symbolic formula that can be used for symbolic computation and, therefore, automated deduction or theorem proving. Contrary to other authors, I do not consider an English declarative statement to be a predicate--at best, such an expression is an informal (and possibly loose) interpretation of such a predicate. English interpretations can be useful for conveying the gist of relation predicates uses, but lead to serious errors if examined in detail.
Do you like this post? Please link back to this article by copying one of the codes below.
URL: HTML link code: BB (forum) link code:
Personally I'm usually tempted to create an (additional) key table: ALL_PEOPLE {PERSON_ID PK) holding all PERSON_ID's. This fixes a root type independent of the actual modeled subtypes
ReplyDeleteThe UNION strategy of creating supertypes I usually reserve for generalization and not specialization (ie disjunct subtypes that have a different identifier but share attributes).
What do you mean by "fixes the root type" and why is it necessary or useful, given that it adds constraints.
DeleteDisjunctive SUBTYPES with DIFFERENT identifiers? Example, pls.
Just to avoid any potential confusion, we're talking ENTITY typing here, not DOMAIN typing.
CREATE VIEW_PARTY AS
DeleteSELECT 'ORGANISATION' AS TYPE, ORGANISATION_ID AS PARTY_ID, NAME
FROM ALL_ORGANISATION
UNION
SELECT 'PERSON' AS TYPE, PERSON_ID, NAME
FROM ALL_PEOPLE
There is indeed a trade-off between constraint standardization and minimization. Standardization can minimize constraint refactoring when additional entity sub-types need to be recognized (e.g. a requirement to register dead people without a date of death)
What do you mean by standardization? Of what?
DeleteLogically, how do you hanle the constraint that a person's birthdate shoild not change when they die. Having birth date in both subsets does not seem to be corre t
ReplyDeleteThat is actually a larger issue--transition constraints--having to do with column values that once recorded in the database should not be allowed to change, except for corrections applied by authorized users. It's not just birth date, but, for example, death date, sex, even names.
ReplyDeleteNeedless to say, there is no DBMS support for them, certainly not in SQL products.
I think that transition constraints would only be needed at the physical level. In the logical model, the attributes of the relation state the dependencies. For example the logical model given incorrectly has birth date as a dependency on the state of being living or dead. Having living_person and dead_person be proper subtypes of a born_person relation represents the dependency accurately.
ReplyDeleteWhy only at the physical level? That birth date or death date or sex are not supposed to change (except for errors) is as logical an aspect as any other.
DeleteIf the text I was careful to italicize is not persuasive as to the design, it'll have to wait for David's response, if he decides to offer one.
Regarding (from the 3rd paragraph): "We can easily see whether or not a relation B{A1, ... An, B1, ... Bn} is the subtype of relation A{A1, ... An}: B inherits all the attributes and constraints (including dependencies) of A, such that if relation predicate (i.e., the membership predicate[2] for a relation) PB is TRUE for tuple {A1, ... An, B1, ... Bn}, relation predicate PA is necessarily TRUE for tuple {A1, ... An}."
ReplyDeleteIf that is so, then is it also so, that when PA is FALSE for tuple {A1, ... An }, PB is FALSE for tuple {A1, ... An, B1, ... Bn} ?
David replies:
DeleteObviously. A the predicate for a subtype can only be TRUE for tuples for which the predicate for the parent type is also TRUE. Conversely, if the predicate for the parent type is TRUE, then it must be TRUE for the predicate of at least one of its subtypes.
I've been wondering on and off how much this requires loosening up on the CWA.
DeleteIf the predicate is true for a subtype, then the tuple has to be there in the "subtype relvar". But if this also implies that the predicate must be true for all the supertypes, then the CWA dictates that the corresponding tuples also be present in the "supertype relvars".
Thanks. Sorry if I already asked this ( I'm pretty clumsy at navigating this site), but why do you use the term 'parent'? (if that makes any sense.)
DeleteMy guess: hierarchies involve parents and children and that's true for type hierarchies too.
DeleteWhat exactly is your navigation problem on this site?
Regarding navigation, I was sure I'd asked the same question about subtypes (or parents/children) twice. Couldn't quite tell whether I was replying to the original article or to a reply of the original. Maybe my mistake since the web page seems to make it obvious today.
ReplyDeleteI thought this was a great article, it being so emphatic about separating physical design. I imagine doing that must be very hard or even impossible for people who are used to SQL tables. The only thing that puzzled me was the mention of subtypes.
Regarding the reason for my question, my understanding of typing theory is feeble but even so it seemed to me that it's enough at design time to say that "P(A) is necessarily true" as David McGoveran put it, simply because it's a logical implication, eg., if P(A) is false, then P(B) can't be true. That attitude seemed more elementary to me. Not to say that the subtype angle isn't a useful one for an optimizer.