Friday, June 30, 2017

Data Meaning: Analytics vs. Data Mining

Years back a client consulted me on a problem: they had an important database the meaning of which they did not know. Could the meaning be reconstructed? The question reflected common poor practices due to poor grasp of data fundamentals.

Data mining is distinct from analytics. The former is aimed at 'finding' meaningful data patterns -- i.e., knowledge 'discovery' -- while the latter derives new knowledge from 'existing' knowledge -- i.e., deduction (see Data, Information, Knowledge Discovery, and Knowledge Representation). 'Sensible' querying of databases to retrieve data for analytic applications and correct interpretation of results without a good grasp of data meaning is a fool's errand. Yet current database practices are extremely deficient in this respect.
Data professionals commonly believe they can infer meaning from sheer inspection of the tabular display of database relations, which is an illusion even for simple ones, particularly when, as is so often the case, they are poorly designed. 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.

For example, what does this relation mean?

Here's what an analyst needs to be aware of when it comes to database interpretation.

Consider the relation
Unless you are familiar with the real world context of its design, you just don't know. Usually the interpretation, if any is given, is something informal like
Supplier with supplier number S# supplies part with part number P# in quantity QTY.
which (1) is ambiguous -- is QTY the packaging quantity? The only quantity ever supplied? Does it represent historical quantities? -- and (2) incomplete. Much more is required make all this clear in English.

Relations represent formally in the database facts about object groups in the real world -- in this case, supply objects are relationships between objects of two other types, suppliers and parts. The meaning assigned to them and the database as a whole 'by the database designer' is conveyed by a set of 'business rules' -- informal group descriptions in natural language that specify group properties of several types:
  1. 'Property rules' specify individual properties shared by group members independent of the other members (e.g., the three of supplies);
  2. 'Object rules' specify properties that arise from relationships between individual properties (e.g., the supply relationship);
  3. 'Multi-object rules' specify properties that objects have collectively as a group and arise from within-group relationships between all the group members (e.g., distinguishability of supplies);
  4. 'Multi-group rules' specify properties of the group of related groups -- which is what a database represents -- that arise from cross-group relationships between, members of distinct groups (e.g., between supplies and suppliers and parts);
The full meaning of a relation is conveyed by the conjunction of all the rules in the first three categories; that of the database by the conjunction of all the rules for all the relations with those in the fourth category.

A DBMS does not understand informal rules in natural language like users do, it can only manipulate abstract symbols mathematically. Rules must be 'formalized' -- i.e., expressed symbolically as  'constraints' in a formal data language that the DBMS "understands" 'algorithmically' -- i.e., as procedures to enforce database consistency with the corresponding rules.
  1. 'Domain' and 'attribute constraints' correspond to property rules;
  2. 'Tuple constraints' correspond to object rules;
  3. 'Multi-tuple constraints' correspond to multi-object rules;
  4. 'Multi-relation (or 'database') constraints' correspond to multi-group rules;
For full consistency, the constraints corresponding to all the rules 'must be declared -- explicitly spelled out -- to the DBMS in a formal data language and enforced by it'. They constrain relations to be consistent with the rules, but 'neither the rules, nor the constraints appear in the tabular display of relations'.
The above rule only specifies the three individual properties shared by supplies and the one arising from the relationship between them. Enforcement would be limited to the corresponding attribute constraints and one tuple constraint. Which means that tuples inserted in the relation will be consistent with the rule above, but may not be with any of the other rules.

Sound databased analytics require full consistency. They obtain 'if and only if':
  • Database designers have a good grasp of data fundamentals --information modeling and relational database design -- and are competent in formulation of rules and their formalization as relational constraints;
  • Databases are fully normalized;
  • Relationally complete declarative data languages express all the relational constraints;
  • 'True' RDBMSs enforce them, document the meaning 'in the system' and make it accessible to users on demand;
Sadly, none of this can be taken for granted in current industry practices. SQL DBMSs record very few constraints in the system catalog and no business rules. Caveat emptor.

"Not cost-effectively and with sufficient confidence in the result" I told the client.

No comments:

Post a Comment