Friday, June 30, 2017

Data Meaning: Analytics vs. Data Mining


My July post @All Analytics

Data mining is distinct from analytics. The former is aimed at ‘finding’ meaningful data patterns—i.e., knowledge ‘discovery’—while the latter derives new knowledge from ‘existing’ knowledge—i.e., deduction (see Data, Information, Knowledge Discovery, and Knowledge Representation). ‘Sensible’ querying of databases to retrieve data for analytic applications and correct interpretation of results without a good grasp of data meaning is a fool's errand. Yet current database practices are extremely deficient in this respect.

Read it all.(Please comment there, not here.)

Sunday, June 25, 2017

Relations & Relationships Part I

Note: This is a 10/17 rewrite of a 04/13 post to bring it in line with the McGoveran formalization and interpretation [1] of Codd's true RDM. Some of the references have also been rewritten for the same reason and it is recommended that you re-read them.

There is much confusion, misuse and abuse of terminology and outright nonsense in the wrong picture of two weeks ago, all rooted in poor grasp of data and relational fundamentals. I debunk as much of each of the three paragraphs as is amenable to reason.



Levels of Representation


It is critical to keep levels of representation distinct in one's mind and avoid conceptual-logical conflation (CLC) and logical-physical confusion (LPC) [2], both of which are reflected in the comments. The following terminology [3] helps:
  • Conceptual modeling: Informal language and real world terms (object groups, properties, objects business rules);
  • Logical database design: Formal language and database terms (relations, domains/attributes, tuples, constraints);
  • Physical implementation: Formal language and physical storage and access terms (e.g., files, indexes);

Conceptual Modeling: In the Eyes of the Users


"This confusion of entities vs. attributes has been with us a long time ... a paper discussed this dilemma in 1939 [and] proposed calling the thing, which we could not determine whether it was an entity or an attribute, an "entribute ... William Kent's DATA AND REALITY (1978) is devoted to "attributes" and (in my words) he confesses that he cannot distinguish between "relationships" and "attributes". Thus, the later might be completely redundant."
First, entities are conceptual, attributes are logical (CLC). So the referenced confusion is actually entities vs. properties, the conceptual element to which logical attributes correspond (Historically, entity had a connotation with physical record, for which reason we prefer object -- though certainly not in the OO sense -- to avoid conceptual-physical confusion.

Second, difficulty to distinguish arises only when seeking absolute entities, relationships and attributes inherent in the real world. Modeling choices are based on subjective user perceptions of reality and pragmatic (‪i.e., driven by how well they satisfy specific application/user needs). There are no scientific grounds for preferring one perception/choice over another. For example,
  • A budget can be modeled either as a property of a department object, or as an type of object with properties of its own by financial management. Often both perceptions are useful (e.g., the former for the department staff, the latter for finanacial management) and can be accommodated within a model [4];
  • A many-to-many (M:N) relationship between two object groups can be modeled as an associative object group;
  • Some object group properties arise from relationships (a) between individual properties shared by, and (b) among all group members [5].
The difficulty is eased once it is realized that relationships and properties are "in the eye of the users": the modeler chooses based on their perceptions and needs discovered via requirements analysis. What is required is that choices are well defined, explicit and, once made, used consistently (Kent's is a classic, must read Recommended Book).

A well-defined and complete conceptual model is essential for a database representation -- business rules that specify the required (1) First order individual properties shared by members (2) Second order properties (2OPs) arising from relationships among those properties (3) Third order collective properties (3OPs) arising from relationships among all the group members  and (4) Collective properties of the set of groups as a whole [5].
"I have since come to realize that you must have the relationship first -- the notion of an attribute presumes a relationship, so we must define that first. In fact we must also define the exclusivity/multiplicity of that relationship (in both directions for a binary relationship) before we can determine how to put the information into tables. In reality, we are not modeling objects/entities/attributes, [ X ], [ A ], etc. at all in the relational model, we are modeling a bunch of relationships, say [ X | A ], [ X | B ], etc. mashed together into a table [ X | A | B | ...], hence perhaps Codd was correct in calling it a "relation", a bunch of relationships. We would also have to define and enforce referential integrity everywhere an A appeared. All of this is handled explicitly and correctly in ORM -- we model objects (each one appears only once in a data model diagram) and relationships. There are no attributes. As I said before, an attribute is an object playing a role in a relationship with another object."
Much of this paragraph is gibberish that I cannot hope to make sense of.
  • We don't put information in tables, we put data in relations; we visualize relations as R-tables on physical media (e.g., paper, screen);
  • A relation is and has always been "a bunch of relationships" [6] -- formal logical representation of real world relationships among user-defined
(a) object properties
(b) objects and
(c) object groups
Codd did not name it, it's been a special kind of set in set theory that way preceded Codd -- he only borrowed, adapted it and applied it to database management. It is testament to the poor education and foundation knowledge in the industry that practitioners either still don't know it, or only realize it now, five decades after the RDM was introduced [7].
  • Object Role Modeling (ORM) is a conceptual modeling methodology expanded from NIAM that is superior to E/RM (though not faultless [8]). You can use it to produce conceptual models of reality and the RDM to represent them formally in logical databases -- conflating the two is CLC. If you want proof, consider the title of NIAM author's book: CONCEPTUAL SCHEMA AND RELATIONAL DATABASE DESIGN (another Recommended classic).
  • A referential constraint is the formal database representation of a real world relationship between two object groups expressed by a business rule. It is -- and should be -- as explicit at the logical level as the business rule specifying the relationship is at the conceptual level.
"Furthermore, the relational model has no relationships since Codd decreed that all relationships must be represented by foreign keys, which are exactly the same as "attributes." Interesting that most people think of relationships as being the distinguishing characteristic of a relational model and it is not.  As I said before, an attribute is an object playing a role in a relationship with another object ... Looking back into original relational model (by Codd, Date, etc.) isn't it funny, that the term relation is implicitly mapped (in our minds) to a table of a database? If (loosely speaking) a relationship in our conventional data modeling is represented by a foreign key in a table (and combining both points together)--should a table (relation) consists only of foreign keys?" --Gordon Everest, LinkedIn.com
Well, which is it -- a bunch of relationships, or no relationships?

Modeling reality as groups of objects with properties is what we do at the conceptual level and business rules specify several types of relationship -- within groups, among properties and objects; and among groups -- a
ll supported by the RDM at the logical level.

FKs are indeed attributes involved in the representation of inter-group relationships [9] -- one of the adaptations of abstract set theory for database management, such that relations could be assigned real world meaning as representing object groups [10]. But it is not the FK that represents the relationship -- the referential constraint on both the referenced and referencing relations that enforces FK-PK values matching does. 


The old misconception that relational comes from "relationships between tables" -- it actually comes from relations -- gave rise to another that referential rules are the only kind of relationship that the RDM represents. Codd has "decreed" nothing of the sort -- stay tuned for Part 2.

The paragraph ends with more gibberish that does not merit attention.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F.,
The Conceptual-Logical Conflation and the Logical-Physical Confusion.

[3] Pascal, F., Levels of Representation Conceptual Modeling, Logical Design and Physical Implementation.

[4] Pascal, F., Formalizing the Informal: Business Modeling for Database Design.

[5] Pascal, F., The Interpretation and Representation of Database Relations.

[6] Pascal, F., Relational Data Model: It's All Relationships.


[7] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

[8]  Nijssen, G. M., Duke, D. J., Twine, S. M., The Entity-Relationship Data Model Considered Harmful, Empirical Foundations of Information and Software Science V 1990, pp 109-130, 1988. 

[9] Pascal, F., Foreign Keys -- Are They Really Necessary?

[10] Pascal, F., What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency.



Sunday, June 18, 2017

This Week

1. Database Truth of the Week

"The RDM is a formal system. It has two parts.
  • The Deductive Subsystem: the formal language
  • The Interpretation Subsystem i.e., the application--of that language
Without an interpretation subsystem there is no possibility of applying the formal system and it remains an abstract game of symbols.
Semantics is about applying the RDM to some subject. In effect, what you do is restrict the power of the abstract formalism so that it is more closely aligned with your intended use. That means you are using constraints to limit the vocabulary to the subject matter (and making it finite and usually fairly small) and restricting the possible interpretations that can be used consistently with the resulting subset of the formalism." --David McGoveran

Sunday, June 11, 2017

What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency


Note: This is a 10/23/17 rewrite of a 7/29/12 post to bring it in line with the McGoveran interpretation [1] of Codd's true RDM.


 

To understand what's wrong with the picture of two weeks ago, namely:
"If we step back and look at what RDBMS is, we’ll no doubt be able to conclude that, as its name suggests (i.e., Relational Database Management System), it is a system that specializes in managing the data in a relational fashion. Nothing more. Folks, it’s important to keep in mind that it manages the data, not the MEANING of the data! And if you really need a parallel, RDBMS is much more akin to a word processor than to an operating system. A word processor (such as the much maligned MS Word, or a much nicer WordPress, for example) specializes in managing words. It does not specialize in managing the meaning of the words ... So who is then responsible for managing the meaning of the words? It’s the author, who else? Why should we tolerate RDBMS opinions on our data? We’re the masters, RDBMS is the servant, it should shut up and serve. End of discussion." --Alex Bunardzic, Should Database Manage The Meaning?
it helps to consider the quote in the context of another article by the author, "The Myth of Data Integrity", where he reveals that those "DBMS opinions" are constraints (the article has been deleted, but a few comments remain online and are highly recommended for a feel  of the consequences of lack of foundation knowledge).

Thursday, June 1, 2017

Redundancy, Consistency, and Integrity: Derivable Data

My May post @All Analytics

Database redundancy can wreak havoc with interpretation of analytics results, but it also poses consistency risks that can affect the correctness of the results themselves. The risks are too underappreciated for effective prevention. Given industry practices, analysts who use databases they did not design, or designed without sufficient foundation knowledge, should be on the alert.


Read it all (Please comment there, not here).