Saturday, September 3, 2016

Database Design: Relation Predicates and "Identical Relations"

I have rewritten an earlier post that you should also read:
Database Design: Property-Sharing Classes and Relations
Here's what's wrong with the last wrong picture I posted, namely:

Q: "Can you have 2 tables
  • VIEWS
  • DOWNLOADS
with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other."
A1:"I don't think that there is a problem, per se. From a E/R modeling point of view I don't see a problem with that, as long as they represent two semantically different entities."
A2:"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."
A3: "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."
A4: "When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads."
--Can you have 2 identical tables in a good schema?, StackOverflow.com



Business Rules and Integrity Constraints

To recall, a base relation (which can be pictured as a R-table) represents a set of facts about a class of property-sharing entities. Its meaning and that of its tuples is provided by a conjunction of informal business rules (BR) expressed in natural language:
  • Property rules (PR)
  • Entity rule (ER)
  • Multi-entity rules (MER)
  • Multi-class rules (MCR)
The ER defines the type of fact represented by the relation (i.e., specifies the properties that entity members of the class share). The PRs and MERs constrain the facts that (1) were asserted by a trusted authority (as true) to those that (2) satisfy the rules (i.e., specify valid ranges for property values), for which reason they are referred to as constraints.

Note: A 'base relation' does not mean a stored relation (as in SQL), although it
can and usually will be stored. Rather, it is a set of tuples that represent primitive facts (axioms), as distinct from query results, views and snapshots--derived relations whose tuples represent facts that are logical implications of the axioms (theorems).

The BR's are formally represented in the database by a relation predicate (RP)--a conjunction of integrity constraints that are formalized first order predicate logic (FOPL) versions of the constraints, expressed in a RDBMS-specific data language. While the informal BR's are understood semantically by users, a DBMS that can only manipulate abstract symbols mathematically relies on the RP to enforce database consistency algorithmically: when a candidate tuple--representing a fact--is presented for insertion into the database by an authorized user/application, it enforces the integrity constraints to ensure that only tuples that do not violate the constraints are accepted in the proper relation. Base relations must be defined logically such that the DBMS
can enforce the integrity constraints and users manipulate them consistent with the RPs.

Note: We distinguish between constraints--informal BRs at the conceptual level expressed in natural language that constrain facts--and integrity constraints--
their formalized FOPL version at the logical level, expressed in a specific data language--that represents them in the database.

Meaning and Relation Predicates

The two relations have different meanings and, therefore, there should be some attribute or constraint differentiating the RPs. Furthermore, the core Information Principle (IP) of the RDM mandates that to be accessible to a RDBMS, all information in a relational database should be represented explicitly and exactly one way--as values of relation attributes defined on domains.

If two relations have identical designs, the difference can only be encoded in the relation (and possibly attribute) names.

VIEWS (USER_ID, ITEM_ID);
DOWNLOADS (USER_ID,ITEM_ID);

in violation of the IP. As such, it is inaccessible to the DBMS, which cannot rely on RP difference to accept/reject tuples as correct/incorrect in the proper relation. Users--with little, or no help from the DBMS--must ensure that the correct tuples are inserted in the correct relation. Moreover, because the RP difference is information not represented explicitly--as data values-- relational operations lose it: if you UNION the two relations, you get users that either viewed or downloaded items, or both.

One way to differentiate RPs is to represent the verbs as values of an attribute. Assuming there are only two actions:

User (USER_ID) performed action (ACTION) on item (ITEM_ID).
where action is an attribute defined on the binary domain {view,download}. This yields a one-relation design with the RP difference incorporated as a domain constraint on the attribute and represented explicitly as attribute values.

All this is lost, of course, on those lacking data and relational foundation knowledge.

  • A1: The relations are semantically different, but the DBMS is kept unaware of the difference (the E/RM is irrelevant here).
  • A2: Renaming attributes (not fields) does not address the basic problem.
  • A4: Common contamination of logical design with physical considerations.

Principle of Orthogonal Design

As to A3, the treatment given in the mentioned article to the Principle of Orthogonal Design (POOD) was incomplete, which created misunderstandings such as the one in the comment (the article authors do no longer agree on the subject). Suffice it to say here that the POOD has to do with the properties of formal systems, of which the FOPL underlying the RDM is one--a true RDBMS is a logic inference engine, it derives facts (theorems) from the facts recorded in the database (axioms).
"Formal systems have highly desirable properties when they are governed by three principles, one of which is the FOPL Principle of Orthogonality (or axiomatic independence): Axioms should be independent--none should be derivable from the others. In database design terms base relations--representing classes of primitive entities--should be independent. Two relations are trivially independent if they are defined in terms of mutually exclusive domains, attributes and tuples (there are, of course, more complex ways they can be independent)." --David McGoveran
Obviously, the two relations do not violate the POOD--the facts  represented by VIEWS are not derivable from those represented by DOWNLOADS and vice-versa. The problem here is poor logical design that "hides" information in the names and violates the IP and RDM.

For a detailed formal exposition of the three design principles and the consequences of their violation (e.g., on view updatability), see [1]. 




References:

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming (draft chapters here).
[2] Pascal, F., THE DBEDBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, forthcoming.
[3] Pascal, F., PRACTICAL DATABASE FOUNDATIONS papers (revisions forthcoming).




Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

16 comments:

  1. "The two relations have different meanings and, therefore, there should be some attribute or constraint differentiating the RPs. "

    There is, the table name. The table name itself is semantic content. The argument presented here treats the table name as something other than semantic content, which is incorrect. If it is necessary to encode table names and relationships into the database explicitly, then one should do so.

    The VIEWS and DOWNLOAD tables do not have a direct relationship to each other, nor is their relationship orthogonal. They are different semantic features of user behavior. It seems the obvious solution is to treat this behavior semantically in the same table:

    USERACTION (USER_ID, ITEM_ID, ACTIONTYPE);

    Action type being something like V for view and D for download. But it could also expand to include things like U for upload, or S for save-for-later.

    Personally, I find these logic based approaches to database design interfere with the creative process necessary to design databases. The logic approach move us away from treating the data semantically, and forces us towards treating the data syntactically and reductively, which confuses us when we face certain design problems. By treating tables themselves as semantic content, it becomes obvious that the table can be a value. And thus we can have an additional column which captures that table meaning in a superset of the VIEWS and DOWNLOAD rows. this kind of synthetic solution is not an obvious outcome from predicate style logics - unless the database itself becomes part of the predicate system. and as that feels like we might be slipping towards Godel's theorem, I'll stop.

    Thanks for an interesting article.

    ReplyDelete
    Replies
    1. The entire article and its claims are predicated on a very specific interpretation of the Information Principle (that what Date calls an external relvar predicate is itself information too and must therefore be accessible to the DBMS, with all that that entails). With no agreement on that interpretation, surely there's no point in debating/discussing.

      The "external relvar predicate" way of working is known to work well. It was the way of working even in the pre-database era : records in the file named X represent meaning Y.

      What the alternative looks & feels like, I think nobody knows that yet, including DMG himself.

      Delete
    2. There is a formal FOPL relation predicate (RP) which, when expressed in some specific relational data language, is a conjunction of declarative integrity constraints that a RDBMS enforces. The RP has an informal expression in natural language. Date chose to call them "internal" and "external" predicates, but so what? (If you read David's chapters he documents the frequent changes in Date's definitions of the two over time.)

      A relation name is only a shorthand for the RP and the RP surely does not consist of only attribute names as per Date. A RP can capture and symbolize more meaning than just attribute names, including verbs.

      Date just does not like the conceptual level because it's informal. He wants to stick strictly to the logical level, which renders the endeavor too abstract to be useful.

      Delete
    3. You seem to think that what Date calls the external predicate, is just the NL formulation of the internal one. That is wrong.

      You also seem to have arrived at the conclusion that according to Date, what you call the RP (and which he calls "internal predicate"), "consists of only attribute names". Given that you have just stated yourself that your RP is "a conjunction of integrity constraints", it is hard to imagine anyone, let alone Date, believing that such a thing can "consist of only attribute names".

      Delete
    4. How IS the external predicate different than the informal version of the internal one? How CAN it be?

      No, that is not what I said. What I said was that he uses the attribute names as meaning of relations.

      Delete
    5. Here is what you said : "the RP surely does not consist of only attribute names as per Date". That is what you said.

      The external predicate differs from the internal one in that it contains the part that cannot be DBMS-understood, which is why it is called "external", as opposed to the internal one, which is entirely DBMS-understood, which is why it is called "internal".

      Delete
    6. That was just a quick expression, not detailed.

      The DBMS does not have to understand anything in a semantic sense and it doesn't really even those parts that Date thinks it does understand.

      The meaning assigned by the db designer is all FOPL SYMBOLIZED, all the DBMS does is manipulates the symbols consistently. Semantics are applied by users AFTER symbol manipulation to interpret results based on in-system documentation of the symbol meaning which should be accessible to users on demand--something which SQL DBMS's don't do but RDBMSs should and would.

      Delete
  2. As a matter of principle I decided not to respond to anonymous comments. But I will suggest you re-read the article more carefully, with focus on relation and attribute names in the context of the IP and the definition of a RP.

    ReplyDelete
  3. My apologies for the anon. I simply chose the google option.

    I read it a few times, and then typed a response, and then read it again. Your solution:

    "One way to differentiate RPs is to represent the verbs as values of an attribute. Assuming there are only two actions:
    User (USER_ID) performed action (ACTION) on item (ITEM_ID)."

    is obviously the right solution.

    My confusion was in trying to read through and understand what you were saying with so many abbreviations. And when I read your solution it seemed you were dismissing it because it did not satisfy the POOD or it did and POOD was irrelevant...and it violates the... RDM...? oh, relational data model...

    either way, I still feel that a logic approach presents certain problems in design, as axiomatic features (tables) in a FOPL can become new semantic content which should produce a redesign in the database - which seemed to be the posters underlying problem. Because often in software development, we do not know what all the facts are until we start working with an inadequate model.

    -arthole

    ReplyDelete
    Replies
    1. No need to apologize--you are free to post anonymously, but if you want a response, I require to know to whom I am speaking.

      If you introduce yourself I'll reply.

      Delete
    2. I agree with the solution which resolves the false premise of the model: views and downloads are merely different aspects of the same object/entity and nothing on their own which is why they share primary keys.

      I can also agree that Fabian's style can be a little overly academic and that it's easy to trip up over the over the abbreviation ("IP" is used in so many contexts in IT (sic) that it should come with a health warning!) ;-) But in general I love his writing: Practical Issues in Database Management is a fantastic read and that it is out of print tells us more about the inability to think in the industry than it does about the book.

      But I think that the main point that is, that if your conceptual model has ambiguities then your DB schema is going to be a mess. And, as usual, some of the answers provided on things like Stack Overflow are positively harmful: A4 sticks out here! But a lot depends on the question you ask and this should be a question about the conceptual model and not the schema.

      Delete
    3. Charlie,

      When readers focus on style and abbreviations, they signal something about their ability to comprehend the substance.

      I have no idea what is academic about my style. My writing does not contain fluff, it is to the point and I strive to be precise, which is precisely what is missing in most of the published material these days.

      As to abbreviations, I always attach them to the full term first and I use them to save typing when I have to repeat them. There is no possibility to confuse them to other common uses.

      The point of the RDM is to maximize db mgmt. ***by the DBMS***. That is why all information must be represented EXPLICITLY and in EXACTLY one way, which is how much of the simplicity is achieved. Anything that's left only in the mind of the users is trouble.

      Delete
  4. If you want an example of how lack of education on fundamentals can handicap practitioners to they point of inability to comprehend anything beyond tools and products (and even that with limits), then check out the exchange that my post has triggered.

    https://news.ycombinator.com/item?id=12437389&goto=news

    There are a couple of exceptions, but other than that, it's very sad to see the level of intellect in the profession (or, more accurately, lack thereof).

    ReplyDelete
  5. Incidentally, there is criticism of the RDM that it does not capture sufficient meaning, particularly verbs (which is, of course, a misconception due particularly to how Date's treatment of attribute names. But a demonstration that verbs CAN be captured gets objected to with "I have not read this in any book".

    ReplyDelete
  6. Erwin,

    The issue is really how much of the semantics is symbolized in RP's and it is simply false that verbs cannot be.

    ReplyDelete
  7. Regarding the apt complaint "The entire article and its claims are predicated on a very specific interpretation of the Information Principle ... With no agreement on that interpretation, surely there's no point in debating/discussing":

    A frequent interpretation of the IP is that it is to be obeyed so that the intent or casual descriptions of various properties can be reflected either by table names or by attribute values and no more is a said. There's a big difference between a name and a value. Rarely do data designers talk of how to enable user programs to apply the IP.

    In practice, the information available to a user is less than that available to the dbms. This means the dbms has leverage the user doesn't have, specifically the values of relations the user is unaware of but which depend on the user's relations or on which the user's relations depend. The data design can determine how much leverage. Full leverage is only potential until dbms'es exist that reflect such an interpretation.

    Applying adjectives like "natural" to any usage of a database by a dbms is off-base because a dbms doesn't use so-called natural language. Strictly speaking, there is nothing "natural" about a database and all attributes are artificial.

    Most data design dogma is only about query interpretation. It's possible for a user to see a projection without knowing it's a projection and this is not necessarily an illogical requirement, knowing such may not be required. The user who specifies a restriction or who can only see restrictions is not looking at the same relation value the typical data design allows. Knowing a table is a projection is only necessary if the very inefficient attitude is taken that it is users who program the dbms.

    The original question was also about updates. It is logical and should be easy for the dbms to leverage attributes that users can't see, so as to allow certain updates and prevent others, as the data designer sees fit. They might be attributes given values by literals in the catalog, aka logical constants, so as to seem to appear in base tables but that doesn't necessarily mean that those attributes appear in physical tables, Or, they could appear in physical tables which users can't see anyway, at the whim of the dbms developer.

    The tables are just a manifestion, it is really the exact relations that matter, the ones the designer chooses so as to permit some updates and not others.

    ReplyDelete