Saturday, September 3, 2016

Relation Predicates and Identical Relations



Note: This is a 11/25/17 re-write of an earlier post, to bring it in line with the McGoveral formalization and interpretation [1] of Codd's real RDM.

Here's what's wrong with the last wrong picture I posted, namely:

Q: "Can you have 2 tables, VIEWS and DOWNLOADS, with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other."

A1:"I don't think that there is a problem, per se. From a E/R modeling point of view I don't see a problem with that, as long as they represent two semantically different entities."

A2:"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."

A3: "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

A4: "When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads." --StackOverflow.com


Business Rules and Meaning


To recall, a base relation (which can be visualized as a R-table) represents a set of facts about a group  of property-sharing objects. Its meaning is denoted by a conjunction of informal business rules that specify the individual and collective object properties required for membership in the group that the relation represents:

  • Property rules specify individual first order properties (1OP) shared by objects;
  • Object rules specify the second order properties (2OP) that arise from relationships among 1OPs;
  • Multiobject rules specify the collective third order properties (3OP) that arise from relationships among all members of a group;
  • Multigroup rules specify properties of the groups as a whole that arise from relationship among the object groups;

Note: A 'base relation' does not mean a stored relation (as in SQL), although it can and usually will be stored. Rather, it is a member of the base set of relations, the tuples of which represent axioms -- facts about the real world objects of interest recorded in the database, from which all relations meaningful to applications are derived, the tuples of which represent theorems -- facts that are logical implications of (inferences from) the axioms.

Because rules are informal, they are not "computable". They must be formalized as constraints -- predicates "understood" algorithmically by a DBMS that can be enforced by it for consistency with the rules. For each relation, the constraints that correspond to the first three types of rule comprise the relation predicate (RP). The RPs in conjunction with the constraints corresponding to the fourth type of rule comprise the database predicate (DBP).


Semantics and Constraints


Constraints are semantic -- they constitute the formal assignment of the meaning to the relation. If two relations are identically structured and constrained,

VIEWS (USER_ID, ITEM_ID)
DOWNLOADS (USER_ID,ITEM_ID)

the only discernible difference between them is in their names. But the core Information Principle (IP) of the RDM mandates that all information in a relational database be represented explicitly and in exactly one way -- as values of relation attributes defined on domains. The difference between relation names is, thus, meaningful information, the representation of which violates the IP and the RDM, for which reason it is inaccessible to the DBMS: consider the candidate tuple {v1,v2} -- it is impossible for the DBMS to know to which relation it belongs based on the relation and attribute names because it does not understand semantics! Applications/users must specify -- with little, or no help from the DBMS -- the proper relation. Not only is it prone to error, but also  because the information is implicit in the names, relational operations lose it: if you UNION the two relations, you get users that either viewed or downloaded items, or both. For the DBMS to be able to decide, RPs must be unique -- such that a candidate tuple always satisfies exactly one [2]).

One way to make the information explicit is by representing it as values of an attribute. Assuming there are only two actions, view and download,

User (USER_ID) performed action (ACTION) on item (ITEM_ID).
where ACTION is an attribute defined on the binary domain {view,download}. This yields a one-relation design with the difference incorporated as a domain constraint on the attribute.

All of this is lost, of course, on most data professionals, who lack data foundation knowledge [3
], as pointed out in two replies:
A1: The relations are semantically different, but the DBMS is kept unaware of the difference (the E/RM is irrelevant here);
A2: Renaming attributes (not fields) does not address the basic problem;


The Principle of Orthogonal Design


As to A3, the treatment given in the mentioned article to the Principle of Orthogonal Design (POOD) was incomplete, which created misunderstandings such as the one in the comment (the article authors do no longer agree on the subject
[4]). Suffice it to say here that the POOD has to do with the properties of formal systems, of which the RDM, grounded as it is on FOPL, is one.
"Formal systems have highly desirable properties when they are governed by three principles, one of which is the FOPL Principle of Orthogonality (or Axiomatic Independence): Axioms should be independent -- none should be derivable from the others. In database design terms base relations should be independent -- not derivable from the other base relations. Two relations are trivially independent if they are defined in terms of mutually exclusive domains, attributes and tuples (there are, of course, more complex ways they can be independent)." --David McGoveran
Obviously, the two relations do not violate the POOD -- the facts represented by VIEWS are not derivable from those represented by DOWNLOADS and vice-versa. The problem here is conceptual specification and logical design that "hide" information from the DBMS in the names.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Principle of Orthogonal Database Design Part I, II, III.

[3] Pascal, F., THE DBEDBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS. 


[4] On View Updating (C. J. Date and D. McGoveran).



16 comments:

  1. "The two relations have different meanings and, therefore, there should be some attribute or constraint differentiating the RPs. "

    There is, the table name. The table name itself is semantic content. The argument presented here treats the table name as something other than semantic content, which is incorrect. If it is necessary to encode table names and relationships into the database explicitly, then one should do so.

    The VIEWS and DOWNLOAD tables do not have a direct relationship to each other, nor is their relationship orthogonal. They are different semantic features of user behavior. It seems the obvious solution is to treat this behavior semantically in the same table:

    USERACTION (USER_ID, ITEM_ID, ACTIONTYPE);

    Action type being something like V for view and D for download. But it could also expand to include things like U for upload, or S for save-for-later.

    Personally, I find these logic based approaches to database design interfere with the creative process necessary to design databases. The logic approach move us away from treating the data semantically, and forces us towards treating the data syntactically and reductively, which confuses us when we face certain design problems. By treating tables themselves as semantic content, it becomes obvious that the table can be a value. And thus we can have an additional column which captures that table meaning in a superset of the VIEWS and DOWNLOAD rows. this kind of synthetic solution is not an obvious outcome from predicate style logics - unless the database itself becomes part of the predicate system. and as that feels like we might be slipping towards Godel's theorem, I'll stop.

    Thanks for an interesting article.

    ReplyDelete
    Replies
    1. The entire article and its claims are predicated on a very specific interpretation of the Information Principle (that what Date calls an external relvar predicate is itself information too and must therefore be accessible to the DBMS, with all that that entails). With no agreement on that interpretation, surely there's no point in debating/discussing.

      The "external relvar predicate" way of working is known to work well. It was the way of working even in the pre-database era : records in the file named X represent meaning Y.

      What the alternative looks & feels like, I think nobody knows that yet, including DMG himself.

      Delete
    2. There is a formal FOPL relation predicate (RP) which, when expressed in some specific relational data language, is a conjunction of declarative integrity constraints that a RDBMS enforces. The RP has an informal expression in natural language. Date chose to call them "internal" and "external" predicates, but so what? (If you read David's chapters he documents the frequent changes in Date's definitions of the two over time.)

      A relation name is only a shorthand for the RP and the RP surely does not consist of only attribute names as per Date. A RP can capture and symbolize more meaning than just attribute names, including verbs.

      Date just does not like the conceptual level because it's informal. He wants to stick strictly to the logical level, which renders the endeavor too abstract to be useful.

      Delete
    3. You seem to think that what Date calls the external predicate, is just the NL formulation of the internal one. That is wrong.

      You also seem to have arrived at the conclusion that according to Date, what you call the RP (and which he calls "internal predicate"), "consists of only attribute names". Given that you have just stated yourself that your RP is "a conjunction of integrity constraints", it is hard to imagine anyone, let alone Date, believing that such a thing can "consist of only attribute names".

      Delete
    4. How IS the external predicate different than the informal version of the internal one? How CAN it be?

      No, that is not what I said. What I said was that he uses the attribute names as meaning of relations.

      Delete
    5. Here is what you said : "the RP surely does not consist of only attribute names as per Date". That is what you said.

      The external predicate differs from the internal one in that it contains the part that cannot be DBMS-understood, which is why it is called "external", as opposed to the internal one, which is entirely DBMS-understood, which is why it is called "internal".

      Delete
    6. That was just a quick expression, not detailed.

      The DBMS does not have to understand anything in a semantic sense and it doesn't really even those parts that Date thinks it does understand.

      The meaning assigned by the db designer is all FOPL SYMBOLIZED, all the DBMS does is manipulates the symbols consistently. Semantics are applied by users AFTER symbol manipulation to interpret results based on in-system documentation of the symbol meaning which should be accessible to users on demand--something which SQL DBMS's don't do but RDBMSs should and would.

      Delete
  2. As a matter of principle I decided not to respond to anonymous comments. But I will suggest you re-read the article more carefully, with focus on relation and attribute names in the context of the IP and the definition of a RP.

    ReplyDelete
  3. My apologies for the anon. I simply chose the google option.

    I read it a few times, and then typed a response, and then read it again. Your solution:

    "One way to differentiate RPs is to represent the verbs as values of an attribute. Assuming there are only two actions:
    User (USER_ID) performed action (ACTION) on item (ITEM_ID)."

    is obviously the right solution.

    My confusion was in trying to read through and understand what you were saying with so many abbreviations. And when I read your solution it seemed you were dismissing it because it did not satisfy the POOD or it did and POOD was irrelevant...and it violates the... RDM...? oh, relational data model...

    either way, I still feel that a logic approach presents certain problems in design, as axiomatic features (tables) in a FOPL can become new semantic content which should produce a redesign in the database - which seemed to be the posters underlying problem. Because often in software development, we do not know what all the facts are until we start working with an inadequate model.

    -arthole

    ReplyDelete
    Replies
    1. No need to apologize--you are free to post anonymously, but if you want a response, I require to know to whom I am speaking.

      If you introduce yourself I'll reply.

      Delete
    2. I agree with the solution which resolves the false premise of the model: views and downloads are merely different aspects of the same object/entity and nothing on their own which is why they share primary keys.

      I can also agree that Fabian's style can be a little overly academic and that it's easy to trip up over the over the abbreviation ("IP" is used in so many contexts in IT (sic) that it should come with a health warning!) ;-) But in general I love his writing: Practical Issues in Database Management is a fantastic read and that it is out of print tells us more about the inability to think in the industry than it does about the book.

      But I think that the main point that is, that if your conceptual model has ambiguities then your DB schema is going to be a mess. And, as usual, some of the answers provided on things like Stack Overflow are positively harmful: A4 sticks out here! But a lot depends on the question you ask and this should be a question about the conceptual model and not the schema.

      Delete
    3. Charlie,

      When readers focus on style and abbreviations, they signal something about their ability to comprehend the substance.

      I have no idea what is academic about my style. My writing does not contain fluff, it is to the point and I strive to be precise, which is precisely what is missing in most of the published material these days.

      As to abbreviations, I always attach them to the full term first and I use them to save typing when I have to repeat them. There is no possibility to confuse them to other common uses.

      The point of the RDM is to maximize db mgmt. ***by the DBMS***. That is why all information must be represented EXPLICITLY and in EXACTLY one way, which is how much of the simplicity is achieved. Anything that's left only in the mind of the users is trouble.

      Delete
  4. If you want an example of how lack of education on fundamentals can handicap practitioners to they point of inability to comprehend anything beyond tools and products (and even that with limits), then check out the exchange that my post has triggered.

    https://news.ycombinator.com/item?id=12437389&goto=news

    There are a couple of exceptions, but other than that, it's very sad to see the level of intellect in the profession (or, more accurately, lack thereof).

    ReplyDelete
  5. Incidentally, there is criticism of the RDM that it does not capture sufficient meaning, particularly verbs (which is, of course, a misconception due particularly to how Date's treatment of attribute names. But a demonstration that verbs CAN be captured gets objected to with "I have not read this in any book".

    ReplyDelete
  6. Erwin,

    The issue is really how much of the semantics is symbolized in RP's and it is simply false that verbs cannot be.

    ReplyDelete
  7. Regarding the apt complaint "The entire article and its claims are predicated on a very specific interpretation of the Information Principle ... With no agreement on that interpretation, surely there's no point in debating/discussing":

    A frequent interpretation of the IP is that it is to be obeyed so that the intent or casual descriptions of various properties can be reflected either by table names or by attribute values and no more is a said. There's a big difference between a name and a value. Rarely do data designers talk of how to enable user programs to apply the IP.

    In practice, the information available to a user is less than that available to the dbms. This means the dbms has leverage the user doesn't have, specifically the values of relations the user is unaware of but which depend on the user's relations or on which the user's relations depend. The data design can determine how much leverage. Full leverage is only potential until dbms'es exist that reflect such an interpretation.

    Applying adjectives like "natural" to any usage of a database by a dbms is off-base because a dbms doesn't use so-called natural language. Strictly speaking, there is nothing "natural" about a database and all attributes are artificial.

    Most data design dogma is only about query interpretation. It's possible for a user to see a projection without knowing it's a projection and this is not necessarily an illogical requirement, knowing such may not be required. The user who specifies a restriction or who can only see restrictions is not looking at the same relation value the typical data design allows. Knowing a table is a projection is only necessary if the very inefficient attitude is taken that it is users who program the dbms.

    The original question was also about updates. It is logical and should be easy for the dbms to leverage attributes that users can't see, so as to allow certain updates and prevent others, as the data designer sees fit. They might be attributes given values by literals in the catalog, aka logical constants, so as to seem to appear in base tables but that doesn't necessarily mean that those attributes appear in physical tables, Or, they could appear in physical tables which users can't see anyway, at the whim of the dbms developer.

    The tables are just a manifestion, it is really the exact relations that matter, the ones the designer chooses so as to permit some updates and not others.

    ReplyDelete

View My Stats