Monday, October 30, 2017

The Importance of Understanding Classes, Sets, and Relations for Analytics

One of the clearest indications of poor foundation knowledge in data management practice is misuse and abuse of terminology. Many data professionals enter the industry without formal education, by way of programming and software tools, and use terms indiscriminately, as jargon, without understanding them. This has produced weak DBMS implementations and poorly designed databases that put the correctness of databased analytics at risk.

For example, 'class' in data management is confused with the programming class, the important distinction between a class and a 'set' is missed, and a 'relation' -- which is a set -- is often thought of as a class.

In object-oriented programming, a class is a code template for creating objects that encapsulate data and behavior, each object being an instantiation of a class (i.e., a specific application of the template). In data management, however, a class is not code, but a formal, well-defined concept from set theory, one half of the theoretical foundation of the Relational Data Model (RDM). Understanding it is critical for proper conceptual modeling, database design, and valid analytics.

A class is a group of objects that share the properties required for group membership (i.e., are of the same type). As I explained in Data Meaning: Analytics vs. Data Mining, there are two kinds of properties required for group membership:
  • Individual properties shared by class members
  • Collective properties that arise from relationships among (1) individual properties and (2) all members.
Given some well-defined universe of objects and a class definition (the properties), when the definition is applied to the universe, it selects out those objects that satisfy the definition (i.e., have the required properties). Otherwise put, a class induces a set of members (the definition is the class 'intension', the set of members its 'extension').
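
The intension/extension distinction can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the universe (the integers 1 through 10), the membership properties (even and greater than 4), and the name satisfies_definition are all hypothetical and chosen for brevity. Note that the class definition here is a plain predicate, not a programming class.

```python
# Hypothetical universe of objects: the integers 1 through 10.
universe = set(range(1, 11))


def satisfies_definition(n: int) -> bool:
    """Intension: the properties required for membership (even and greater than 4)."""
    return n % 2 == 0 and n > 4


# Extension: the set of members the definition selects out of the universe.
extension = {n for n in universe if satisfies_definition(n)}

print(sorted(extension))  # [6, 8, 10]
```

Applying the same definition to a different universe would induce a different set of members, which is why the class (the definition) and the set (its extension) should not be conflated.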

Conceptual (or business) modeling formulates business rules that define object classes of interest, each of which is jointly defined by several types of rules that specify the properties required for membership. Applying the rules to corresponding object universes induces sets of members, and these sets are represented formally in the database by relations:
  • Facts about each group's members are represented by 'tuples' (displayed as rows);
  • Individual properties are represented by 'attributes' defined on 'domains' (displayed as columns);
  • Collective property rules are represented by 'constraints' on relations (that constrain them to be consistent with the rules).
For example, when an enterprise determines the individual (e.g., education, skills, experience) and collective (e.g., uniqueness, a maximum number of hires) properties that data scientists must have individually and as a group to be hired for available positions, it specifies the rules defining the class of its 'data science employees.' This class definition is applied to a universe of applicants to hire those that satisfy the rules, inducing a set of employees represented by a relation (displayable as an R-table). The relation is subject to constraints that are formalizations of the rules, which ensure that it is consistent with the class definition rules.
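
The hiring example might be sketched as follows. Everything specific here is an assumption made for illustration: the attribute names, the degree and experience thresholds, the MAX_HIRES limit, and the applicant data are hypothetical, and a Python frozenset of tuples is only a rough stand-in for a relation, not a relational DBMS. The point is merely that individual rules are checked per tuple, while collective rules (uniqueness, maximum number of hires) are checked against the relation as a whole.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    """Record layout for one tuple; the attributes are hypothetical."""
    emp_id: int
    name: str
    degree: str
    years_experience: int


MAX_HIRES = 2  # hypothetical collective rule: at most two hires


def satisfies_individual_rules(t: Employee) -> bool:
    # Individual properties each member must have (assumed thresholds).
    return t.degree in {"MSc", "PhD"} and t.years_experience >= 3


def satisfies_collective_rules(relation: frozenset) -> bool:
    # Collective properties of the group: unique identifiers and a size limit.
    ids = [t.emp_id for t in relation]
    return len(ids) == len(set(ids)) and len(relation) <= MAX_HIRES


def check_constraints(relation: frozenset) -> bool:
    return (all(satisfies_individual_rules(t) for t in relation)
            and satisfies_collective_rules(relation))


def hire(relation: frozenset, applicant: Employee) -> frozenset:
    # Accept the applicant only if the resulting relation stays consistent.
    candidate = relation | {applicant}
    return candidate if check_constraints(candidate) else relation


# Hypothetical universe of applicants.
applicants = [
    Employee(1, "Ann", "MSc", 5),
    Employee(2, "Bob", "BSc", 1),  # fails the individual rules
    Employee(1, "Dee", "PhD", 7),  # violates uniqueness of emp_id
    Employee(3, "Cho", "PhD", 3),
    Employee(4, "Eve", "PhD", 4),  # would exceed MAX_HIRES
]

employees: frozenset = frozenset()
for a in applicants:
    employees = hire(employees, a)

print(sorted(t.name for t in employees))  # ['Ann', 'Cho']
```

Without the stated rules, nothing in the tuples themselves reveals why Dee or Eve was rejected, which is why a table cannot be judged well designed in isolation from its rules and constraints.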

Failure by data professionals to understand these fundamentals gives rise to a common mistake: asking whether one or more tables (not even guaranteed to be R-tables) are properly designed without specifying the rules (which denote the meaning the database designer assigned to the relations the tables represent) and the corresponding constraints that enforce those rules in the database. Knowledge of both is often insufficient.

In these circumstances databases are not guaranteed to be properly designed and constrained, and the 'logical validity' and 'semantic correctness' of datasets retrieved by queries for analysis cannot be assumed. Insofar as analytics are concerned, all bets are off.
