Thursday, January 10, 2013

Database Design, Relational Fidelity and SQL Part I: Entity Supertype-Subtypes

(I would like to thank Toon Koppelaars, Will Sisson, Craig Mullins and Nathan Allan for their input).

WB writes:
I have just finished reading your paper The final NULL in the Coffin. Unfortunately, I get the impression that with current RDBMS implementations, there is still no easy way of handling missing values. I don’t think Oracle allows a catalog R-table that would select out required rows. Thus the paper is more of theoretical interest than of practical use at the moment ... Another good thing your paper was useful for was confirming that the substitution of nonsense values for NULLs is not THE answer (but might help).  We still need to use filters, preferably in derived R-tables.

However, before I advise my manager of a strategy, I’d be grateful if you could confirm the following. I hope it is coherent! In the case of the People and Deaths R-tables, there would probably be no need to combine them into a derived R-table. So that is a practical way of avoiding NULLs in that case.  However, it would not be transparent to the user, since we would have the People and Deaths R-tables and a derived R_table Live_People. Thus in Oracle SQL:
CREATE TABLE people
 (person_id INTEGER NOT NULL,
  name VARCHAR(100) NOT NULL,
  birth DATE NOT NULL);

CREATE TABLE deaths
 (person_id INTEGER NOT NULL
  REFERENCES people,
  death DATE NOT NULL);

CREATE VIEW Live_People
  AS (SELECT person_id, name, birth
      FROM people
      WHERE person_id NOT IN (SELECT person_id
                              FROM deaths));
The above isn’t as good an example as I had hoped. Nevertheless, it seems that issues with NULLs can be minimised using current technology, by having more base R-tables and more derived R-tables. Even if it means  SQLthe users have to cope with a few more R-tables, the end result is worth it.
What appears to be a simple question about NULL avoidance raises a number of important issues about business modeling and database design and the effect a SQL DBMS would have relative to a truly relational DBMS (TRDBMS). This offers a good opportunity to address them all.

Missing vs Inapplicable Data

I do not know what in my paper could be interpreted as "nonsense values for NULLs might help". Not only Oracle, but all DBMSs out there fail to handle missing data correctly, which is why a theoretically sound solution--the exact opposite of nonsense--is required. My paper makes it very clear upfront that it offers only an informal outline of a possible relational solution that needs to be researched in depth, to verify that it is sound and to work out its structural, manipulative and integrity implications. This must precede any implementation and there is hardly a better proof of the costly consequences of not heeding this principle than SQL's NULLs.

The reader suggests a design that avoids NULLs that would occur in a SQL table if it were designed to represent one entity class, people:
PEOPLE {PERSON_ID, NAME, BIRTH, DEATH}
Every row representing a live person would have a NULL for DEATH. Such a SQL table is not an R-table (why?).

Note very carefully, though, that no data is missing! The NULLs would stand for 'inapplicable', and are an artifact of design: live people do not have a death date, so the column representing that attribute does not apply to the rows representing them, yet the table nevertheless assigns it to them.

Entity Supertype and Subtypes

Strictly speaking, an R-table represents a collection of facts--propositions asserted by authorized users  to be true. From facts recorded in the database (axioms) other facts (theorems) can be derived or inferred that are logical implications of the recorded facts--this is what querying is all about. But facts about what?

As I explain in Business Modeling for Database Design, facts are about entities of interest, that is what keys identify in the database. Entities that are of the same type--in that they share the same attributes and are subject to the same business rules--form an entity class.

Entities that do not share any attribute are of distinct types and belong to distinct classes. But entities can share some attributes and have additional attributes unique to them. We refer to the entities with both shared and unique attributes as being of a subtype of the entities that have only the shared attributes, referred to as being of a supertype.

In this case
  • all people--dead or alive--share person id and name;
  • live people share id and name with people, but also have a birth date. so they are a subtype of people;
  • dead people share id, name and birth date with live people, but also have a death date, so they are a subtype of live people;
Let's assume for now that all data is known and none is missing. Would you opt for WB's proposed design, choose a different design and, in either case, on what grounds? You can answer in the comments section of this post.

More in Parts II-IV.



Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

33 comments:

  1. I can't see in WB's problem statement anything that justifies defining the live_people view. Maybe it's useful, maybe it's not. Maybe—depending on the business problem—it is actually perverse to invite users to think that way. If it is useful, why is it useful to NOT also have the complementary dead_people view?

    I am not comfortable using numeric types for values where arithmetic is meaningless, such as surrogate/synthetic keys like person.id.

    In the absence of any reason to suppose person.id is immutable, I'd declare deaths.id REFERENCES people ON UPDATE CASCADE.

    Other than that, given the limited information about the business problem and assuming a real SQL implementation and no missing/unknown data, I would (and do) opt for a design similar to what WB suggests.

    At the end of his comments WB frets that "Even if it means SQL the users have to cope with a few more R-tables, the end result is worth it." I have never understood the suggested desirability of minimizing the number of tables in a database. If a design properly arrives at a large number of tables it reveals complexity that is really there. Concealing complexity by conflating multiple fact types in one table is deceptive and counterproductive. The complexity still exists. Almost as bad, you also end up using voluminous application code to extract the complexity at run-time. So, more effort; more code, more execution time? How is that a good thing?

    ReplyDelete
    Replies
    1. Re views, the general principle is that applications should only access views for logical data independence. This is a problem in SQL because it does not permit updating theoretically updatable views (see Date's just published book).

      I agree on types, but why is that an issue here? Anyway, if DBMSs supported fully and correctly UDFs it would be possible to use numeric types for non-numeric attributes without concerns, but UDFs raise some very thorny issues.

      As to minimizing # of tables, my missing data paper explicitly makes the same argument you make. In a well-implemented TRDBMS this would not be an issue and I refer to the TransRelational implementation technology on this subject. But what we have is poorly implemented SQL systems, which mislead people understanding of fundamental into this NoSQL delusion that is a direct result.

      For the rest, with for the series' end.

      Delete
  2. WB's design is the already storage optimized but not the most flexible design.

    My choice would be a root type holding only the key, extended with a 1-1 for attributes that are mandatory:

    CREATE TABLE people
    (person_id INTEGER NOT NULL PRIMARY KEY)

    (I would add REFERENCES people_basic_info as well when you need to make sure basic info on all persons is avialable, but this won't work in SQLDBMses)


    CREATE TABLE people_basic_info
    person_id INTEGER NOT NULL PRIMARY KEY
    REFERENCES people,
    name VARCHAR(100) NOT NULL,
    birth DATE NOT NULL);

    CREATE TABLE people_deceased_info
    (person_id INTEGER NOT NULL
    REFERENCES people,
    death DATE NOT NULL);

    View for the root type of person

    CREATE VIEW People_all
    AS (SELECT person_id, name, birth
    FROM people JOIN people_basic_Info)

    View for the subtype of living people

    CREATE VIEW People_living
    AS (SELECT person_id, name, birth
    FROM People_all
    WHERE person_id NOT IN (SELECT person_id
    FROM people_deceased_info));

    View for all the dead people:

    CREATE VIEW People_dead
    AS (SELECT person_id, name, birth, death
    FROM People_all JOIN People_deceased_info)

    Dead and living people are 2 complementary subtypes (complete and non overlapping)

    ReplyDelete
  3. I would opt for the entity subtype/supertype approach described, on the grounds that there's then no need for NULLs or metadata.

    This approach is often resisted, with complaints about query complexity, performance of joins and "where XXX not in (subselect)", and complaints about additional R-tables implying large amounts of additional storage space.

    So - if that was politically unacceptable, I would use a manual metadata approach similar to that described in "Practical Issues in Database Management". If a systematic manual metadata approach was unacceptable, and I were restricted to a single "Person" R-table, then I would add two additional boolean attributes signifying the applicability of the Birth and Death attributes. This is "bundling" - bad database design - and obviously to be avoided, but it has the virtue of avoiding NULLs and default values while restricting the design to a single R-table.

    I mention these unpalatable alternatives because we quite often have to work within a structure of existing designs and political expediency. It can be a question of which is worse... NULLs, 3VL and/or default values, or metadata?

    ReplyDelete
    Replies
    1. OK. Hold that thought to the end of this series.

      I am collecting these replies and will probably respond to them then.

      Delete
  4. Here's my take (in a NON-SQL shorthand!):

    table Person
    PersonKey keytype
    Name string

    table LivingPerson
    PersonKey keytype references Person.PersonKey
    BirthDate date

    table DeadPerson
    PersonKey keytype references Person.PersonKey
    DeathDate date


    view LivePeople
    select PersonKey, Name, BirthDate
    from LivingPerson
    join Person

    view DeadPeople
    select PersonKey, Name, DeathDate
    from DeadPerson
    join Person

    This is only concerned with facts that are known and assumes no other constraints. (Can't be living if you weren't born and aren't dead until you have died.)

    ReplyDelete
  5. In principle I agree with WB's design.

    The date of death is not-applicable, but if I don't own a car then my car registration number is not applicable and I think we would usually place car ownership in a separate table.

    The only difference here is that the appropriate constraints would be in place so that I can only die once (the primary and foreign key constraints on the death table) whereas I could own many cars (if the business model allowed that).

    I agree completely with anonymous contributor one's comment that if you need many tables to represent the database model correctly, then you need that many tables and trying to economize on the number of tables will only lead you into terrible logical difficulties later.

    However from a presentation perspective you probably do want to show the user {person_id, name, birth, death} as if this was a single table.

    It would certainly be possible to define a view that presented the data to the user in this form. However in SQL products this view would have to user outer join and the result would necessarily contain nulls. Furthermore this view would have to be updateable and though in SQL products it is possible to define “instead of “ triggers the process is labour intensive, error prone and highly implementation specific.

    The proper answer is, I think, to implement outer join properly using relation valued attributes and to extend the view updating capabilities of the DBMS.

    As the database “knows” about cases where the primary key and a foreign key of a table have exactly the same set of attributes we could, in principle, use these constraints to construct the necessary views automatically.

    ReplyDelete
    Replies
    1. Noted.

      An outer join would not be necessary, however. (In SQL that would involve NULLs). And yes, SQL limit the use of views.

      Delete
  6. Early on in my career as a data modeler I made extensive use of super/sub typing to ensure there were no NULLABLE attributes at all in the model. The basic idea was that an attribute that "does not apply" (missing information is a practical data entry concern of an implemented system and imho not relevant to the model) by definition means there is a business rule left uncovered. Why does it not apply? So in the people example I too would have modeled a super type of people, and a subtype of just "dead people" with the death date an attribute of that one sub-type. This would have made clear in the model the business rule that only people who are deceased have death dates.

    But more recently I have done a lot of reading of David Hay's approach to what he calls conceptual modeling which means using the ERD approach to model things in the world the enterprise cares about as opposed to a data store we are about to build. His approach is primarily one of intensely thought out classification hierarchies to "put everything in its place" as it were, using as a model the biology taxonomy. This approach requires that super and sub types are classified only on a single, fundamental, unchanging characteristic that each must share, and thus each occurrence must fall into one and only one sub type. Like the taxonomy of organisms, this can have many levels. The ERD notation he uses is the Barker-Ellis notation of box in box subtypes.

    Under this approach, the desire to handle attributes which don't apply is not really a primary concern. The notation simply notes optional attributes and in my application of it I always attempt to document why that attribute does not apply. Now this approach also has a hierarchy of classification that we are told must be flattened in order to be to be modeled in a relational model.

    I guess where I'm going with this is that, given I must build working databases of my models in RDBMS products, I've given up on attempting to explicitly model entities that have no optional attributes. While I used to use an approach similar to what you describe in my logical model, I gave up on it in favor of the David Hay approach I described, which seemed to me more focused on the fundamental question - what things does the organization care about and how are they fundamentally classified. But now I've read some of your material here and like it just as well and feel it is helping me continue my journey toward a greater understanding of data modeling. What are your thoughts on how one might reconcile these two approaches (David Hay's approach to classification to discover fundamental characteristics to classify by vs. your approach of classification to discover why an given attribute might be optional and place it into its valid context such that it always applies), and proceed both in a world where one must work with current commercial DBMS products and in one where one might have a DBMS product that is truly relational?

    In an attempt to answer my own question, could it be that these two approaches are not at all at odds but instead just at different levels of abstraction? That the "conceptual" model David Hay advocates is not relational and thus "optional" attributes are not a concern, and that it can then be translated into a "logical" model following that is relational using the approach of super/sub types that allows many different sets of sub types based on any distinguishing characteristic that explains why an attribute would be optional?

    Thank you for publishing this blog and for giving the daily workers like myself a way to interact with you and thus continually evolve our understanding of data modeling and of the relational model.

    ReplyDelete
    Replies
    1. @Todd, excellent issue. NULL avoidance is definitely an issue at the _logical_ (formal) level, but does that imply that it need also be an issue at the _conceptual_ (informal) level ? Not necesarily, imo. But the thing is of course, if you need to decide about such matters when going from conceptual to logical (from informal to formal), then of course you need the input to make the right decisions at the logical level. That input must be retrieved from somewhere, somehow. iow : possible missingness MUST be specified, somewhere, somehow.

      Not that I have any particularly convincing arguments about where precisely and how precisely that ought to be.

      Delete
    2. Inapplicable does not mean data is missing. It means that an entity does not have an attribute; if so, why does the logical design assign it the attribute? So, yes, you must do the conceptual right and you won't have inapplicable NULLs.

      Real missingness is a different matter altogether.

      NULL is a SQL issue, not a missing data issue per se. It's just IBM's bad implementation of 3VL which is applied by users as if it were 4VL. Can wreak havoc.

      Delete
    3. "Real missingness is a different matter altogether."

      Could you explain a little bit more about what you consider to be the distinction between inapplicable and missing data?



      Delete
    4. The conceptual modelling approach I am most familiar with is Object Role Modelling which I believe is closely related to NIAM.

      In object role modelling their is an important distinction between attributes playing a mandatory role and attributes playing an optional role.

      You can however define relations between attributes before defining whether they are mandatory or optional. This helps to prompt questions in analysis like "when we record the details of a person must be always know their date of birth? "

      Delete
    5. ESS does not mean attributes are mandatory or optional. All attributes are mandatory, the issue is identification of the entities and their types.

      Delete
  7. There are (1) a conceptual or business level (2) a logical database level (3) a physical or storage level

    (1) is informal, no matter what approach you use, whether you call it ERD, or classification hierarchy or NIAM.

    The RM is a mechanism for formalizing (1) into (2). So the adjective relational does not apply to (1).

    Entity supertype/subtypes relationships are at (1) and have nothing to do with relational. How to represent them in the database using the RM is a separate matter.

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. >Could you explain a little bit more about what you consider to be the distinction between inapplicable and missing data?

    I think I just did.

    Missing data is that that should be but isn't. Inapplicable data should not be. A live man does not have a death date, so no data is missing. A live man has a birth date, if there is no data, it's missing.

    The latter is about imperfect information, the former is not.

    ReplyDelete
  10. Thanks for all the comments on my earlier question. You have helped me sort this out a bit.

    >>Inapplicable does not mean data is missing. It means that an entity does not have an attribute; if so, why does the logical design assign it the attribute?

    This statement is a real eye opener. I had never thought of it that way. This formalizes my lack of comfort modeling optional attributes and thus a desire to at least require the model to explain why it is optional.

    >>The RM is a mechanism for formalizing (1) into (2). So the adjective relational does not apply to (1).

    This, coupled with that first statement, helps me greatly to resolve my issue and convinces me to back track and ensure I never allow NULL columns in (2). At (1) I can focus on classification and really not worry at all about attribute optionality, or I can focus on it if I choose. Perhaps the business is working on classification in one model, and working on understanding attributes in another. The model is informal.

    >>Entity supertype/subtypes relationships are at (1) and have nothing to do with relational. How to represent them in the database using the RM is a separate matter.

    I think what you are saying is that the RM is orthogonal to (1). Due to this I think the terms entity, attribute, and optional fit the informal, conceptual level the terms table, column, and NULL fit the logical level. So when transforming from (1) to (2), if I want (2) to be strictly relational (which I do), I must not have any null columns.

    But in all this I have forgotten to weigh in on your original question:

    >>Would you opt for WB's proposed design, choose a different design and, in either case, on what grounds?

    Given what I have learned today, I would have the following in my conceptual, business, informal model:

    A box for person with attributes for person_id, name, and birth_date. That is it. I would name the entity "person" instead of "people" as an entity should be named with a single noun or noun phrase that names a single occurrence. My informal, conceptual model would not even have death as an entity, as it is actually a weak entity - meaning it depends fully upon its parent and has no other relationship. It just clutters up the conceptual model. I think this fits what you have presented also as this conceptual model is informal. That is another point I was missing in my original confusion. Both this and the naming convention of singular nouns is from William G Smith who does data modeling training.

    Now when transforming from (1) to (2), before today, I would have just made death an NULL column on the person table. But after what I learned today I would take this approach:

    table person
    (person_id INTEGER NOT NULL,
    name VARCHAR() NOT NULL,
    birth DATE NOT NULL);

    table death
    (person_id INTEGER NOT NULL
    REFERENCES person,
    death DATE NOT NULL);

    I would argue that the death table is NOT a sub-type, but instead an example of a fully dependent child table of person that describes an event - a death. The column death is the date of that death - when the event occurred. Even at the logical, formalized level, I continue to like the approach of reserving a super/sub type as a mutually exclusive taxonomy based on a single unchanging characteristic. Each occurrence of the super type must be classified into one and only 1 sub type for now and ever more. An organism must be either a bacteria, a plant, or an animal. An animal can't later change into a plant. The difference is very subtle but important to distinguish.

    The concept of "live" or "dead" person does not enter into the model at all. Those are questions answerable by queries like any other question. Looking at the model, both are easily answered. Neither are needed. Now if a particular application wants to create a view to ensure they always retrieve only live persons, or dead persons, they are free to do so.

    ReplyDelete
  11. A final thought I had while pondering this. Over the past year I have focused a good deal of time distinguishing what I called the "logical" model from what I called the "physical" model. But you clarified:

    >>There are (1) a conceptual or business level (2) a logical database level (3) a physical or storage level

    So like most everyone else in the industry I am confused - the old logical - physical confusion you have written on and I had read and now find I am STILL confused. I'm confused as what I called the "physical" model was STILL just what I called an "external" schema - tables and column definitions. I called the storage level the "internal" schema and called that the "database design". The DAMA framework also uses this terminology and has these 4 levels. So today after reading your responses I had an epiphany. I really have TWO logical models. The first one is my best attempt to do a truly relational model independent of any implementation platform or application workload. The second one, which could undergo additional "external" schema transformations, was created due to the constraints of the DBMS we are using to implement the database given the workload that will be thrown at it. Denormalizations, cascading of particular keys to enable "internal" schema partitioning alignment, speed tables, and so on.

    This is very liberating! It is because I have a pathetic excuse for a DBMS that does not properly separate the physical storage level from the logical level that I even NEED that second logical model! And now I can create a real logical model and keep it REGARDLESS of what I am forced to do in SQL Server or Oracle or whatever else or by a developer who looks at Person and Death tables and tells me "too many tables" (and like Mozart I ask so which ones would you like me to remove?) and I am forced to implement it his/her way due to organizational culture. Now I can have my true logical model and say "Despite how we implement THIS is how it really is and NO ONE can make me change it unless they show me how I have err'd in relational design.

    I'm looking forward to parts II-IV and going back and re-reading some old posts given what I've learned today. I am very interested in your thoughts on what I have articulated to see where you think I am on the right track and where I am off base.

    ReplyDelete
  12. Well, the reality is that the confusion is instilled in practitioners by the practices and tools they are accustomed too.

    SQL vendors have responsibility for this by implementing SQL products as mirror images. The base tables are physically stored so why shouldn't people think that tables are physical?

    Your epiphany is in the right direction, but not complete.
    Logical design has nothing to do with physical implementation. You think in terms of tables that represent the conceptual/business model. Whose tables' data will be stored and how, and whose tables' data derived is irrelevant.

    That is why I so much emphasized the TransRelational Implementation approach -- there are no base tables at all. So in essence all tables are virtual, the DATA is stored, not the tables, if you catch my drift.

    ReplyDelete
  13. I am glad this site and the exchanges are helpful to you.

    I hope you will not hold it against me if I say that I am not able to read very long comments and then do them justice in comments.

    I still need to work for a living. It just so happens that my work DOES do justice to such: my writings and teachings. :)



    ReplyDelete
    Replies
    1. I certainly will not hold it against you :-) In the future I will strive for greater brevity.

      Delete
  14. This is a very interesting topic and I am looking forward to the other parts. Here is my take on this.

    As an OO programmer who has seen his fair share of code out in the wild I've made some observations. Sub-classing is heavily overused, you are often better of to favor composition over inheritance.

    In this case I would like to see a parameterised type supported by the DBMS, in particular something like Optional. This makes the N/A value explicit to the user.

    Lacking proper support for this the next best thing is to simulate it. I am a PostgreSQL user myself and would probably construct an optional_date type.

    CREATE TYPE optional_date AS (
    has_value BOOLEAN,
    value DATE
    );

    CREATE TABLE people (
    person_id INTEGER NOT NULL,
    name VARCHAR(100) NOT NULL,
    birth DATE NOT NULL,
    death optional_date NOT NULL
    );

    I would then add the constraint that optional_date.value <> NULL iff optional_date.has_value = TRUE.

    I think this is a reasonable approach given the limitations of todays SQL systems.

    My 2 cents.

    ReplyDelete
    Replies
    1. OK, I don't think we're talking optional here. Dead people simply don't have an attribute, it's not an option for live people.

      If people have patience until part IV is out, they will have the opportunity to judge whether their solution is optimal.

      Pls note that on top of the ESS we must also consider the solution to missing/unknown data!!! So the total solution must include both ESS and unknown.

      Delete
    2. Your suggestion of an optional_date type gets to the heart of the thing. You have to discern and make explicit what could have been properly and safely assumed to be the case without any additional work. And having hi-jacked the logic, you have taken on the burden of forevermore doing for yourself what the DBMS could have done for you.

      You have made complex something that was trivial. To what end?

      Even given the limitations of SQL this is not a reasonable approach. Given the vastly worse limitations of the OO programming paradigm, it might be.

      Delete
    3. OODBMS is all about types -- it lacks the equivalent to relations. Types encapsulate, relations do not.

      That's why OODBMS are DBMS building kits, not DBMSs.

      And that has to do with the fact that OO is a programming, not data paradigm a la RM

      Delete
    4. Hackers who kind of like this optional_date idea (but it's really just a half-baked hack, not a true solution) might want to ponder if optional_date(false,null) is indeed a valid value of the type, and if so, whether that value compares equal to itself ...

      Or whether optional_date(false,2012-01-01) could be made to compare equal to optional_date(false,2010-12-31) ...

      Delete
    5. See my earlier post on "Out-clevering the DBMS".

      Many developers overestimate their skill of coming up with "elegant" tricks which cause more problems than solve.

      It should be obvious that you know only a programming paradigm and no data paradigm, this is rather predictable.
      As the former OO is for packaging data structures and manipulation/integrity into apps, not to create them yourself.

      Delete
  15. Given that you get epiphanies from this site, how about supporting it?

    ReplyDelete
  16. Regarding political difficulties in this field, what is the "definitive" source of the Relational Model that people cite? I tend to go for Codd's original paper and follow it up with Mr. Pascal's excellent writings and "The Third Manifesto" by Date and Darwen, but I'm struck by the lack of a single, concise, authoritative document that "is" the Relational Model.

    Codd's own book attempting to resolve this problem, "The Relational Model For Database Management, Version 2" has a number of problems, not least of which is its treatment of NULLs.

    ReplyDelete
    Replies
    1. This is my opinion:

      Codd is the core authority except when he is not e.g. 4VL. David McGoveran is working on a formal exposition, refinement, expansion and filling in the incompleteness of Codd's concept.

      My material is adhering to it as far as possible.

      Most of Date's and Darwen's is consistent with that, except where it isn't e.g. they complete computational language.

      Delete