Sunday, March 3, 2013

Database Design, Relational Fidelity and SQL - An Exchange, Part 1

This is the first part of a reply by David McGoveran and to Erwin Smout's comments on David's 4th part of our earlier series on database design, missing data, relational fidelity and SQL, with some comments by myself. The format is (a) quoted DM original text (b) ES comments (c) DM response (d) my comments, if any.


Whenever a database design issue arises, it is crucial that we begin with  a purely logical design.  At the logical level we do not make decisions about what is or is not physically stored. The logical design of the database representing the business (conceptual) model in Part I should be as follows:
PEOPLE {PERSON_ID,NAME};
LIVE_PEOPLE {PERSON_ID, NAME, BIRTH} PROPER SUBTYPE OF PEOPLE;
DEAD_PEOPLE {PERSON_ID, NAME, BIRTH, DEATH} PROPER SUBTYPE OF LIVE_PEOPLE;"
ES: Well, since the main subject of this treatment is "how to handle missing information", it might have been more appropriate to also consider the possibility that someone has a DEATH but not a [known] BIRTH.

DM: I chose an example that illustrates the point of how to deal with missing data. If you want to redefine the problem to permit death dates without birth dates, the example is easy to modify. It simply involves one other relation and a different interpretation (i.e., the predicates for the attributes include something like 'the known birth date is' and 'the known death date is'.

FP: In my three parts in the series I discussed the elimination of inapplicable columns by design, where no data is missing. In his post David suggests mostly how to manually handle unknown and, therefore, missing data if the DBMS does not support them correctly, which is true of SQL implementations. The automated solution is not in the series, but outlined in my paper, The Final NULL in the Coffin. The full solution must provide support for both entity type hierarchies and the 2VL relational solution to missing data outlined in my paper.
"Explanation: We always give an instance its most specific type (and can then deduce that instance's supertype)."
ES: It's a bit "hidden", but this implies very clearly that that "most specific type" must be unique. Implying, in turn, that the type system must constitute a lattice. Not that that need be a problem. Just observing.  (Also observing : the word 'type' in here is used to mean something quite different than what it means in TTM.  Not necessarily a problem, but might be confusing to people who are rather deeply entrenched in TTM thinking.)

DM: My system supports multiple type hierarchies. However, one is selected for any logical inference. The multi-hierarchy (and attendant "true" multiple inheritance subtlety here I won't go into) exists only at the metalinguistic level. The type system does this by "framing" a problem so that logical inference occurs within a chosen type hierarchy (a point of view), while the human view of the world consists of many, "overlapping" and possibly inconsistent, type hierarchies. Hence there is always a most significant type.

I don't accept the TTM type system and objected to it in detail when I reviewed the first TTM draft.

FP: Perhaps this is a good place to refer the reader to my earlier post type, class and domain, which are often used interchangeably, but should not be (see also Business Modeling for Database Design).
"In effect, instances of a type are necessarily disjoint from each of sets of instances of that type’s proper subtypes. We can easily see whether or not a relation B{A1, ... An, B1, ... Bn} is the subtype of relation A{A1, ... An}: B inherits all the attributes and constraints (including dependencies) of A, such that if relation predicate (i.e., the membership predicate[2] for a relation) PB is TRUE for tuple {A1, ... An, B1, ... Bn}, relation predicate PA is necessarily TRUE for tuple {A1, ... An}."
ES: Can this in fact be achieved without taking some liberties with the CWA?

DM: I don't understand the comment about "liberties" with CWA. I think the comment hides some other misunderstanding and deserves elaboration.
"Conceptually, when a proposition (fact) is asserted as "TRUE"--i.e., when the tuple representing that proposition is inserted in the database--we  start at the bottom of the entity type hierarchy and match attributes (more precisely, we check for consistency of the fact with the corresponding relation predicates for each relation in the hierarchy). In this example, if we know DEATH, the data can only match the relation predicate for DEAD_PEOPLE and if we know BIRTH but not DEATH, the data can only match the relation predicate for LIVE_PEOPLE. Finally, if we only know NAME, the data can only match the relation predicate for PEOPLE."
ES: Brief, we pick the database relvar (/entity type) whose heading matches exactly with the heading of the tuple we are inserting. There need not be any "searching" involved, in the sense of an "upward graph walk".

DM: I object to the "brief" characterization of my explanation of identifying types by "walking the graph". I do not use "headings". Matching by comparing relation names and attribute names leads to false reasoning about relation or attribute equivalence. Instead, I use the formal symbolic predicates (not some natural language rendering of them) and Kleene's algorithm (which determines if two predicates are the same up to a simple difference of symbol assigned to represent some primitive).

FP: Emphasis mine. It is critical to distinguish between a formal predicate (expressed symbolically) in first order predicate logic and its informal interpretation (meaning) by users when applied to a real world situation (what Date calls the "external predicate". As I emphasize in my writings, a DBMS cannot understand the latter, it can only manipulate abstract symbols and values mathematically.

Stay tuned for Part 2.



Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

22 comments:

  1. About the CWA :

    I commented on "... such that if relation predicate (i.e., the membership predicate[2] for a relation) PB is TRUE for tuple {A1, ... An, B1, ... Bn}, relation predicate PA is necessarily TRUE for tuple {A1, ... An}."

    ES: Can this in fact be achieved without taking some liberties with the CWA?

    DM: I don't understand the comment about "liberties" with CWA. I think the comment hides some other misunderstanding and deserves elaboration."

    The CWA as I understand it dictates that if the relation membership predicate PA for tuple for tuple {A1, ... An} is TRUE, then this tuple should appear in the relation.

    Ergo, if relation membership predicate PB is true for tuple {A1, ..., An, B1, ... , Bm}, then this tuple should appear in relation B, but if it follows from that that relation membership predicate PA is also true for tuple {A1, ..., An}, then it also follows from that that {A1, ..., An} should appear in relation A.

    ReplyDelete
    Replies
    1. This one will require care to respond to and I will as soon as I have time. It’s a good question but also a little confused – CWA isn’t the issue, it’s what the predicates have to be when you use MST and have a type/subtype hierarchy. Also how predicates and CWA interrelate.

      David McGoveran



      Delete
  2. "Matching by comparing relation names and attribute names leads to false reasoning about relation or attribute equivalence. "

    I am not sure if I have understood this correctly. I presume data manipulation statements will still use attribute names but that the DBMS will infer which predicate the attribute names match?

    ReplyDelete
    Replies
    1. Yes.

      Which is how the Principle of Orthogonal Design surfaced: you don't want overlapping predicates. And it surfaced during work on view updatability, for obvious reasons.

      As David says in his piece, SQL does not understand formal predicates and therefore cannot update certain views.

      Delete
    2. So if I was to assert the proposition:
      {Birth : 12/12/1970, Name : Smith, Person_ID : 34} then there would be no need for me to state the relation name specifically?

      Delete
    3. That thing you specify is categorically _not_ a proposition ! It's a tuple. That tuple can be used to obtain really just any proposition, by instantiating some predicate with it (where that predicate has three free variables, Birth, Name and Person_ID). One possibility is, e.g., "My friend died on 12/12/1970, leaving a wife named Smith widowed at age 34".

      Delete
    4. Erwin,

      You are, of course correct. I am sure that Will knows that, but should have been more careful expressing it for the benefit of readers.

      Those of us who are versed in fundamentals sometimes relax our precision because things are obvious to us.

      What Will meant of course was that in the context of a specific interpretation of the corresponding predicate,the tuple represents a specific proposition.

      A DBMS operating on the basis of predicates would accept or reject the tuple based on the predicate, not on the name of the R-table.

      Delete
    5. Yes. I had two replies and erased the other in favor of the one I left (because the two were after all related). The other one was essentially about the mechanics of the alternative.

      Identifying a predicate in a traditional system, is done by associating it with a corresponding relvar name, and have the user specify relvar name plus tuple(s) if he wants to do inserts or deletes (assert truths or falseness of propositions).

      DM doesn't allow me to use headings or relvar names, so I wonder, what does that mean to me, the database programmer, who wants to issue some insert(/delete). I specify, what ? Proposition plus predicate ? Tuple plus predicate ? What does that look like, _linguistically_ ??? If the mechanics for identifying a predicate cannot be "relvar name", then what are they instead ?

      Delete
    6. Well, I understand your desire to know David means--I have it too--but he is not ready to publish and we must wait until he does. He has been working on it, but it is heavy formal stuff and a lot of his time is taken by more mundane work.

      Delete
    7. Erwin, Fabian thank you for the clarification. I shall try and be more precise in future.

      I too am still a little puzzled by this sentence "Matching by comparing relation names and attribute names leads to false reasoning about relation or attribute equivalence. Instead, I use the formal symbolic predicates (not some natural language rendering of them)". If we don't have the relvar name and we don't have attribute names what do we have?

      I am intrigued and await Part 2 with some anticipation.

      Delete
    8. I'm afraid the 2nd part does not provide an answer to your and Erwin's question.

      I've seen some rough draft of a few pages of the manuscript and sheer English is not enough to read it. Utterly formal stuff and if David says that's tough, trust me, it's real tough.

      I only got an informal glimpse when he said that "updates are applied to THE DATABASE, not any specific relation", but until he completes the formal framework and then try to explain it to us mere mortals we have no choice but wait.


      Delete
  3. IF there are no overlapping predicates AND the DBMS relies on predicates and not on names, yes.

    ReplyDelete
  4. Updating the database can be achieved using multiple assignments. I can't find the reference which defines database as a variable with relations as values, so there were at least hints about concept of updating the database. Having said all that, I'm very curious to see an approach that uses formal symbolic predicates for database updates. And, I'm looking forward to see how far this concept can go about enforcing the principle of orthogonal design.

    Sima

    ReplyDelete
    Replies
    1. If I remember correctly, David does not subscribe to multiple assignments and, in fact, not even to explicit reference to relvars in the data language. He thinks that Codd's use of the term "time-varying relations" and not relvars was intentional and significant. it's not that there are no relvars, the issue is whether to use them explicitly. And I am biased: when the 2D's and 1D disagree, I usually defer to the latter.

      We all do want to see David work, including himself. But it is "just a tad" harder than the TTM. And it does not pay for groceries.

      Delete
    2. Practice must be biased, but many conventional biases are based on nothing but convention, eg. the assumption that users update db's. But they don't, the dbms does that (or should, users can only express their desires). It's a little phony to amplify the importance of extremely intricate techniques with the word 'fundamental' when much that is fundamental is ignored. That ignorance is what leads to misapplication of headings, appeals to convention or various 'authorities'. (Obviously headings don't 'have' logical operators, but view definitions do.)

      Likewise talk of relvars and database updates in the same breath is a chaotic approach. If database updates are possible in the first place, obviously relvar name mentions are optional.

      It's more fundamental to approach design by asking why base relvars should even be available to applications when they are the very constituents of the relations (aka views) applications use. Letting applications manipulate base relvars, let alone mention them as an update 'target' only invites chaos because the apps can undercut the logical structure the designer intended. Adding base constraints only invites more chaos, opening the walls up.

      DM wrote: "At the logical level we do not make decisions about what is or is not physically stored.". Right, but more to the point is that the decisions are about what is recordable. What the designer must do is enforce what is recordable without applications being 'aware' of what is 'stored in some fashion', then decide what influence apps are allowed.

      Delete
    3. > It's a little phony to amplify the importance of extremely intricate techniques with the word 'fundamental' when much that is fundamental is ignored.

      Yes. Practitioners are not even aware that there is something called fundamentals, which makes the problem unsolvable at the industry level. This is educational failure.

      >Letting applications manipulate base relvars, let alone mention them as an update 'target' only invites chaos because the apps can undercut the logical structure the designer intended.

      IBM and Oracle are directly responsible for that: SQL.

      That's what most people missed about the TransRelational Model--it would have facilitated a TRDBMS operating in the way you describe. But, of course, it was viewed with conventional eyes, that prevented comprehension.

      BTW, here's a reminder that Date published a free downloadable e-book on TRM:

      www.zums.ac.ir/files/research/site/ebooks/it.../go-faster.pdf

      Delete
  5. "He thinks that Codd's use of the term "time-varying relations" and not relvars was intentional and significant."

    Obviously database users have a requirement for 'time-varying relation values'. Obviously with a shared db an r1 user may cause r2's value to 'vary' without him or a sleeping user of r2 knowing that. What r2's value MEANS to a user might change too. But how can r2's predicate change, ie., what r2 means?

    ReplyDelete
  6. If the predicate (meaning) changes, it requires a new modeling and design exercise.

    ReplyDelete
  7. 'Faster' doesn't mean 'more right'. Existing organizational techniques already 'facilitate' the way of operating I (briefly) mentioned. (Existing practice doesn't.) As far as fundamental db requirements are concerned, TRM is in the same realm as other appeals to technique, such as multiple assignment.

    A fundamental question for any db is whether the advantage of a logical structure is worth the cost of implementation. The developers of 1960's db's would certainly say the machine portion of the cost has fallen exponentially. I've seen many application requirements developed in the trenches that were renounced when brought to the attention, in detail, of the real application 'owner'. TRM is but one part of an overall implementation and doesn't necessarily imply all of the implementation, eg., all of the application of relational logic ('implementation' is often misconstrued as implementation of a logical system when it is actually an application of the logical aspects that are deemed feasible, eg., conventional union is a truncation).

    I admire the deep thinkers and careful writers in this field (at least the handful of very public ones, such as yourself) but they too are capable of misconstruing what Codd wrote and therefore causing confusion, though usually their misconstructions are fewer and more exact. One of those, IMHO, is assuming that technique such as base constraints can disguise the mistake of exposing base relvars. (I'd grant that TRM does make it fairly obvious that some regime for logical joins is needed 'on top' of itself.)

    What is available to a problem application and what is recordable are the same thing. Security aside, what's available, what's recordable and what's updateable are the same thing too. Problem applications should not be allowed to change that, otherwise they may as well save the trouble and stick with filesystem organizations.

    (Regarding the link, I'm guessing it is in Farsi. If so I'm glad but hope nothing was lost in the translation.)

    ReplyDelete
  8. >'Faster' doesn't mean 'more right'.

    I would never think otherwise. There is, in fact, a negative correlation. Expediency upfront slows down subsequently more than the initial gains.

    >TRM is in the same realm as other appeals to technique, such as multiple assignment.

    I strongly disagree. TRM is an implementation approach that provides full physical data independence, which is quite strategic benefit.

    But, of course, it can produce the benefits only if vendors and users practice sound data management and I am the last person to suggest anything is a magic wand. In the context of current tools and practices, having a TRDBMS implemented with TRM would have had some impact, which is better than nothing.

    >I admire the deep thinkers and careful writers in this field (at least the handful of very public ones, such as yourself) but they too are capable of misconstruing what Codd wrote and therefore causing confusion.

    Absolutely. In fact Codd himself made a big mistake with his 4VL.
    And the fact that he was a genius and could not communicate his ideas at the level of us, mortals has considerable responsibility for the sad state of affairs.

    All we can do is aspire to interpret him correctly and, if we make mistakes, to recognize and address them. That's the best humans can do. One of the major flaws of the IT industry is ignorance of the past and its failure to learn from mistakes.

    Here's the correct link:
    http://bookboon.com/en/textbooks/it-programming/go-faster

    ReplyDelete
    Replies
    1. "Absolutely. In fact Codd himself made a big mistake with his 4VL.
      And the fact that he was a genius and could not communicate his ideas at the level of us, mortals has considerable responsibility for the sad state of affairs.

      All we can do is aspire to interpret him correctly and, if we make mistakes, to recognize and address them. That's the best humans can do. One of the major flaws of the IT industry is ignorance of the past and its failure to learn from mistakes."

      I think of his 1970 paper as two dimensions. The first includes his profound observations about information having structure in its own right. The second includes his practical illustrations which are his particular synthesis of the possibilities the first might suggest. I suspect he deliberately coloured the latter with lingo that was familiar or conventional at the time and from various products we know that the illustrations are often taken too literally. But the observations can lead to different syntheses if we discard some of the familiar conventions, especially if we avoid carrying forward the operators from filesystem days.

      Delete
    2. >profound observations about information having structure in its own right.

      Indeed. I do emphasize that there are 2 major contributions: the concept of a 3-component formal data model; and the specific data model he deviseed.

      >I suspect he deliberately coloured the latter with lingo that was familiar or conventional at the time and from various products we know that the illustrations are often taken too literally.

      It hink it's more accurate to say that he had a dilemma. On the one hand he wanted, for communication reasons, to use familiar terminology; on the other he was keen in disassociate his terminology from the one used in the industry, also for obvious reasons (e.g. domain rather than type, mark rather than null, etc.).

      This is persistent dilemma in an industry dominated by ad-hoc, imprecise, vague and inconsistent terminology.

      Delete