Relation Predicates and Identical Relations

Saturday, September 3, 2016

Relation Predicates and Identical Relations

Follow @DBDebunk Follow @ThePostWest

Note: This is a 11/25/17 re-write of an earlier post, to bring it in line with the McGoveral formalization and interpretation [1] of Codd's real RDM.

Here's what's wrong with the last wrong picture I posted, namely:

Q: "Can you have 2 tables, VIEWS and DOWNLOADS, with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other."

A1:"I don't think that there is a problem, per se. From a E/R modeling point of view I don't see a problem with that, as long as they represent two semantically different entities."

A2:"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."

A3: "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

A4: "When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads." --StackOverflow.com

Business Rules and Meaning

To recall, a base relation (which can be visualized as a R-table) represents a set of facts about a group of property-sharing objects. Its meaning is denoted by a conjunction of informal business rules that specify the individual and collective object properties required for membership in the group that the relation represents:

Property rules specify individual first order properties (1OP) shared by objects;
Object rules specify the second order properties (2OP) that arise from relationships among 1OPs;
Multiobject rules specify the collective third order properties (3OP) that arise from relationships among all members of a group;
Multigroup rules specify properties of the groups as a whole that arise from relationship among the object groups;

Note: A 'base relation' does not mean a stored relation (as in SQL), although it can and usually will be stored. Rather, it is a member of the base set of relations, the tuples of which represent axioms -- facts about the real world objects of interest recorded in the database, from which all relations meaningful to applications are derived, the tuples of which represent theorems -- facts that are logical implications of (inferences from) the axioms.

Because rules are informal, they are not "computable". They must be formalized as constraints -- predicates "understood" algorithmically by a DBMS that can be enforced by it for consistency with the rules. For each relation, the constraints that correspond to the first three types of rule comprise the relation predicate (RP). The RPs in conjunction with the constraints corresponding to the fourth type of rule comprise the database predicate (DBP).

Semantics and Constraints

Constraints are semantic -- they constitute the formal assignment of the meaning to the relation. If two relations are identically structured and constrained,

VIEWS (USER_ID, ITEM_ID)
DOWNLOADS (USER_ID,ITEM_ID)

the only discernible difference between them is in their names. But the core Information Principle (IP) of the RDM mandates that all information in a relational database be represented explicitly and in exactly one way -- as values of relation attributes defined on domains. The difference between relation names is, thus, meaningful information, the representation of which violates the IP and the RDM, for which reason it is inaccessible to the DBMS: consider the candidate tuple {v1,v2} -- it is impossible for the DBMS to know to which relation it belongs based on the relation and attribute names because it does not understand semantics! Applications/users must specify -- with little, or no help from the DBMS -- the proper relation. Not only is it prone to error, but also because the information is implicit in the names, relational operations lose it: if you UNION the two relations, you get users that either viewed or downloaded items, or both. For the DBMS to be able to decide, RPs must be unique -- such that a candidate tuple always satisfies exactly one [2]).

One way to make the information explicit is by representing it as values of an attribute. Assuming there are only two actions, view and download,

User (USER_ID) performed action (ACTION) on item (ITEM_ID).

where ACTION is an attribute defined on the binary domain {view,download}. This yields a one-relation design with the difference incorporated as a domain constraint on the attribute.

All of this is lost, of course, on most data professionals, who lack data foundation knowledge [3], as pointed out in two replies:

A1: The relations are semantically different, but the DBMS is kept unaware of the difference (the E/RM is irrelevant here);

A2: Renaming attributes (not fields) does not address the basic problem;

The Principle of Orthogonal Design

As to A3, the treatment given in the mentioned article to the Principle of Orthogonal Design (POOD) was incomplete, which created misunderstandings such as the one in the comment (the article authors do no longer agree on the subject [4]). Suffice it to say here that the POOD has to do with the properties of formal systems, of which the RDM, grounded as it is on FOPL, is one.

"Formal systems have highly desirable properties when they are governed by three principles, one of which is the FOPL Principle of Orthogonality (or Axiomatic Independence): Axioms should be independent -- none should be derivable from the others. In database design terms base relations should be independent -- not derivable from the other base relations. Two relations are trivially independent if they are defined in terms of mutually exclusive domains, attributes and tuples (there are, of course, more complex ways they can be independent)." --David McGoveran

Obviously, the two relations do not violate the POOD -- the facts represented by VIEWS are not derivable from those represented by DOWNLOADS and vice-versa. The problem here is conceptual specification and logical design that "hide" information from the DBMS in the names.

References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Principle of Orthogonal Database Design Part I, II, III.

[3] Pascal, F., THE DBEDBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[4] On View Updating (C. J. Date and D. McGoveran).

16 comments:

UnknownSeptember 6, 2016 at 3:54 PM
"The two relations have different meanings and, therefore, there should be some attribute or constraint differentiating the RPs. "

There is, the table name. The table name itself is semantic content. The argument presented here treats the table name as something other than semantic content, which is incorrect. If it is necessary to encode table names and relationships into the database explicitly, then one should do so.

The VIEWS and DOWNLOAD tables do not have a direct relationship to each other, nor is their relationship orthogonal. They are different semantic features of user behavior. It seems the obvious solution is to treat this behavior semantically in the same table:

USERACTION (USER_ID, ITEM_ID, ACTIONTYPE);

Action type being something like V for view and D for download. But it could also expand to include things like U for upload, or S for save-for-later.

Personally, I find these logic based approaches to database design interfere with the creative process necessary to design databases. The logic approach move us away from treating the data semantically, and forces us towards treating the data syntactically and reductively, which confuses us when we face certain design problems. By treating tables themselves as semantic content, it becomes obvious that the table can be a value. And thus we can have an additional column which captures that table meaning in a superset of the VIEWS and DOWNLOAD rows. this kind of synthetic solution is not an obvious outcome from predicate style logics - unless the database itself becomes part of the predicate system. and as that feels like we might be slipping towards Godel's theorem, I'll stop.

Thanks for an interesting article.
ReplyDelete
Replies
Fabian PascalSeptember 6, 2016 at 4:11 PM
As a matter of principle I decided not to respond to anonymous comments. But I will suggest you re-read the article more carefully, with focus on relation and attribute names in the context of the IP and the definition of a RP.
ReplyDelete
Replies
UnknownSeptember 6, 2016 at 5:02 PM
My apologies for the anon. I simply chose the google option.

I read it a few times, and then typed a response, and then read it again. Your solution:

"One way to differentiate RPs is to represent the verbs as values of an attribute. Assuming there are only two actions:
User (USER_ID) performed action (ACTION) on item (ITEM_ID)."

is obviously the right solution.

My confusion was in trying to read through and understand what you were saying with so many abbreviations. And when I read your solution it seemed you were dismissing it because it did not satisfy the POOD or it did and POOD was irrelevant...and it violates the... RDM...? oh, relational data model...

either way, I still feel that a logic approach presents certain problems in design, as axiomatic features (tables) in a FOPL can become new semantic content which should produce a redesign in the database - which seemed to be the posters underlying problem. Because often in software development, we do not know what all the facts are until we start working with an inadequate model.

-arthole
ReplyDelete
Replies
Fabian PascalSeptember 6, 2016 at 7:25 PM
If you want an example of how lack of education on fundamentals can handicap practitioners to they point of inability to comprehend anything beyond tools and products (and even that with limits), then check out the exchange that my post has triggered.

https://news.ycombinator.com/item?id=12437389&goto=news

There are a couple of exceptions, but other than that, it's very sad to see the level of intellect in the profession (or, more accurately, lack thereof).
ReplyDelete
Replies
Fabian PascalSeptember 7, 2016 at 9:54 AM
Incidentally, there is criticism of the RDM that it does not capture sufficient meaning, particularly verbs (which is, of course, a misconception due particularly to how Date's treatment of attribute names. But a demonstration that verbs CAN be captured gets objected to with "I have not read this in any book".
ReplyDelete
Replies
Fabian PascalSeptember 7, 2016 at 10:51 AM
Erwin,

The issue is really how much of the semantics is symbolized in RP's and it is simply false that verbs cannot be.
ReplyDelete
Replies
toledobytheseaSeptember 16, 2016 at 8:51 AM
Regarding the apt complaint "The entire article and its claims are predicated on a very specific interpretation of the Information Principle ... With no agreement on that interpretation, surely there's no point in debating/discussing":

A frequent interpretation of the IP is that it is to be obeyed so that the intent or casual descriptions of various properties can be reflected either by table names or by attribute values and no more is a said. There's a big difference between a name and a value. Rarely do data designers talk of how to enable user programs to apply the IP.

In practice, the information available to a user is less than that available to the dbms. This means the dbms has leverage the user doesn't have, specifically the values of relations the user is unaware of but which depend on the user's relations or on which the user's relations depend. The data design can determine how much leverage. Full leverage is only potential until dbms'es exist that reflect such an interpretation.

Applying adjectives like "natural" to any usage of a database by a dbms is off-base because a dbms doesn't use so-called natural language. Strictly speaking, there is nothing "natural" about a database and all attributes are artificial.

Most data design dogma is only about query interpretation. It's possible for a user to see a projection without knowing it's a projection and this is not necessarily an illogical requirement, knowing such may not be required. The user who specifies a restriction or who can only see restrictions is not looking at the same relation value the typical data design allows. Knowing a table is a projection is only necessary if the very inefficient attitude is taken that it is users who program the dbms.

The original question was also about updates. It is logical and should be easy for the dbms to leverage attributes that users can't see, so as to allow certain updates and prevent others, as the data designer sees fit. They might be attributes given values by literals in the catalog, aka logical constants, so as to seem to appear in base tables but that doesn't necessarily mean that those attributes appear in physical tables, Or, they could appear in physical tables which users can't see anyway, at the whim of the dbms developer.

The tables are just a manifestion, it is really the exact relations that matter, the ones the designer chooses so as to permit some updates and not others.
ReplyDelete
Replies

Add comment

POSTS

Saturday, September 3, 2016