Wednesday, September 28, 2016

Brother, Spare Me the Paradigm

My September post @All Analytics:

Every few years -- the interval is getting shorter -- data management is claimed to undergo a “paradigm shift,” some new fundamentally different new way of doing things is being promoted (e.g., “Consider dimensional design and Big Data as two additional paradigms”) that if you don't adopt you’ll be left behind (I’ve written about Big Data, for data warehouse dimensional modeling see Data Warehouses and the Logical-Physical Confusion). For the few who understand what a paradigm is and are familiar with the history of data management, the irony could not be richer. 

Read it all.  (Please comment there, not here)



Monday, September 26, 2016

This Week

1. Quote of the Week
"Which leads to another bad experience: the pernicious use of foreign keys. In the ORMs I've used, links between classes are represented in the data model as foreign keys which, if not configured carefully, result in a large number of joins when retrieving the object. (A recent count of one such table in my work resulted in over 600 attributes and 14 joins to access a single object, using the preferred query methodology.)
...
When you have foreign keys, you refer to related identities with an identifier. In your application, "identifier" takes on various meanings, but usually it's the memory location (a pointer). In the database, it's the state of the object itself. These two things don't really get along because you can really only use database identifiers in the database (the ultimate destination of the data you're working with)." --wozniak.ca

Monday, September 19, 2016

The Principle of Orthogonal Database Design Part I

Note: This is a 11/24/17 re-write of Part I of a three-part series that replaced several older posts (the pages of which which now redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.
"The principle of orthogonal design (abbreviated POOD) ... is the second of the two principles of database design, which seek to prevent databases from being too complicated or redundant, the first principle being the principle of full normalization (POFN). Simply put, it says that no two relations in a relational database should be defined in such a way that they can represent the same facts. As with database normalization, POOD serves to eliminate uncontrolled storage redundancy and expressive ambiguity, especially useful for applying updates to virtual relations (views). Although simple in concept, POOD is frequently misunderstood ... is a restatement of the requirement that a database is a minimum cover set of the relational algebra. The relational algebra allows data duplication in the relations that are the elements of the algebra. One of the efficiency requirements of a database is that there be no data duplication. This requirement is met by the minimum cover set of the relational algebra." --Wikipedia.org
Well, not quite.

Monday, September 12, 2016

This Week

1. Quote of the Week
"Data sense-making does not benefit from the relational data model. Dr. Codd’s rules for relational modeling were designed to improve efficiencies in data processing and storage, not to make data more intelligible. In fact, structuring data relationally makes the work of data sensemaking more difficult, which is why Dr. Kimball created dimensional data modeling and also why an entire industry of middleware products emerged to hide the complexities of relational models." --Stephen Few, PerceptualEdge.com

2. To Laugh or Cry?

Saturday, September 3, 2016

Relation Predicates and Identical Relations

Note: This is a 11/25/17 re-write of an earlier post, to bring it in line with the McGoveral formalization and interpretation [1] of Codd's real RDM.

Here's what's wrong with the last wrong picture I posted, namely:

Q: "Can you have 2 tables, VIEWS and DOWNLOADS, with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other."

A1:"I don't think that there is a problem, per se. From a E/R modeling point of view I don't see a problem with that, as long as they represent two semantically different entities."

A2:"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."

A3: "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

A4: "When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads." --StackOverflow.com

Business Rules and Meaning


To recall, a base relation (which can be visualized as a R-table) represents a set of facts about a group  of property-sharing objects. Its meaning is denoted by a conjunction of informal business rules that specify the individual and collective object properties required for membership in the group that the relation represents:

  • Property rules specify individual first order properties (1OP) shared by objects;
  • Object rules specify the second order properties (2OP) that arise from relationships among 1OPs;
  • Multiobject rules specify the collective third order properties (3OP) that arise from relationships among all members of a group;
  • Multigroup rules specify properties of the groups as a whole that arise from relationship among the object groups;

Note: A 'base relation' does not mean a stored relation (as in SQL), although it can and usually will be stored. Rather, it is a member of the base set of relations, the tuples of which represent axioms -- facts about the real world objects of interest recorded in the database, from which all relations meaningful to applications are derived, the tuples of which represent theorems -- facts that are logical implications of (inferences from) the axioms.

Because rules are informal, they are not "computable". They must be formalized as constraints -- predicates "understood" algorithmically by a DBMS that can be enforced by it for consistency with the rules. For each relation, the constraints that correspond to the first three types of rule comprise the relation predicate (RP). The RPs in conjunction with the constraints corresponding to the fourth type of rule comprise the database predicate (DBP).

Semantics and Constraints


Constraints are semantic -- they constitute the formal assignment of the meaning to the relation. If two relations are identically structured and constrained,

VIEWS (USER_ID, ITEM_ID)
DOWNLOADS (USER_ID,ITEM_ID)

the only discernible difference between them is in their names. But the core Information Principle (IP) of the RDM mandates that all information in a relational database be represented explicitly and in exactly one way -- as values of relation attributes defined on domains. The difference between relation names is, thus, meaningful information, the representation of which violates the IP and the RDM, for which reason it is inaccessible to the DBMS: consider the candidate tuple {v1,v2} -- it is impossible for the DBMS to know to which relation it belongs based on the relation and attribute names because it does not understand semantics! Applications/users must specify -- with little, or no help from the DBMS -- the proper relation. Not only is it prone to error, but also  because the information is implicit in the names, relational operations lose it: if you UNION the two relations, you get users that either viewed or downloaded items, or both. For the DBMS to be able to decide, RPs must be unique -- such that a candidate tuple always satisfies exactly one [2]).

One way to make the information explicit is by representing it as values of an attribute. Assuming there are only two actions, view and download,

User (USER_ID) performed action (ACTION) on item (ITEM_ID).
where ACTION is an attribute defined on the binary domain {view,download}. This yields a one-relation design with the difference incorporated as a domain constraint on the attribute.

All of this is lost, of course, on most data professionals, who lack data foundation knowledge [3
], as pointed out in two replies:
A1: The relations are semantically different, but the DBMS is kept unaware of the difference (the E/RM is irrelevant here);
A2: Renaming attributes (not fields) does not address the basic problem;

The Principle of Orthogonal Design


As to A3, the treatment given in the mentioned article to the Principle of Orthogonal Design (POOD) was incomplete, which created misunderstandings such as the one in the comment (the article authors do no longer agree on the subject
[4]). Suffice it to say here that the POOD has to do with the properties of formal systems, of which the RDM, grounded as it is on FOPL, is one.
"Formal systems have highly desirable properties when they are governed by three principles, one of which is the FOPL Principle of Orthogonality (or Axiomatic Independence): Axioms should be independent -- none should be derivable from the others. In database design terms base relations should be independent -- not derivable from the other base relations. Two relations are trivially independent if they are defined in terms of mutually exclusive domains, attributes and tuples (there are, of course, more complex ways they can be independent)." --David McGoveran
Obviously, the two relations do not violate the POOD -- the facts represented by VIEWS are not derivable from those represented by DOWNLOADS and vice-versa. The problem here is conceptual specification and logical design that "hide" information from the DBMS in the names.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Principle of Orthogonal Database Design Part I, II, III.

[3] Pascal, F., THE DBEDBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS. 


[4] On View Updating (C. J. Date and D. McGoveran).