Monday, July 25, 2016

Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant




Here's what what wrong with last week's picture, namely:
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --LinkedIn.com

One of my readers once wondered why "database professionals understand uniqueness via keys, but don't seem to understand why duplicate rows should be prohibited and the consequences of breaking relational closure": 


"What is the proper way to represent two or more identical entities (e.g., cake soda cans purchased by a customer)? If you buy five cans, the receipt has five duplicate rows, which actually have meaning--ask anybody and they understand perfectly well that those rows represent five different cans. What they don't understand is that those five duplicates are distinguished by their position on the receipt--which violates Codd's Information Principle--and why that is a bad thing."

So the "reasons for which you don't want duplicates in the database" are, apparently, not obvious.

Note: "Are keys mandatory?", or "Why do we need a PK?" are two of the most frequent hits @dbdebunk: 85 (40%) out of a non-scientific sample of 208),  so keys are not better understood than duplicates. I have dealt with keys in previous posts and my next one will be a final iteration on the subject, viewed from the perspective of McGoveran's interpretation of the RDM, so stay tuned).

The data on the receipt is a result of some query issued by an application to the supermarket database and how it is presented is at the discretion of the application. Closure is violated only if the supermarket database contains duplicates and/or the DBMS produces internally results that are not relations, in which case, the number of cans on the receipt may not be the one actually purchased by the customer (see below).
 

The table
=================
 ITEM      PRICE
-----------------
 Soda can  1.39
 Soda can  1.39
 ...
=================
pictures a bag, not a relation, that represents a single fact of the type
Item (ITEM) has price (PRICE).
multiple times. Codd used to say that stating the same fact more than once does not make it truer, only redundant.

Entity Distinguishability and Integrity

Entities--even identical ones, what Date calls descriptively equivalent entities (DEE)--are always distinguishable in the real world, otherwise we could not tell them apart, or count them. They have a set of inherent properties that define and uniquely identify them, by which we recognize them, explicitly or implicitly. In principle, with unlimited time and effort we could identify the set of properties that define soda cans and uniquely identifies them (dents, scratches and all). The questions are (1) Is distinguishability meaningful in a specific informational context (i.e., need the entities be tracked individually?) and (2) how practical it is to identify and represent such an identifier in the database? [4].
  • If it is not meaningful, the identifier need not be represented in the database. Stores handle DEE's such as soda cans collectively:
==================================
 PROD#   PRODUCT    UPRICE  UNITS
======----------------------------
 A1X     Soda can   1.39       5
...  
==================================  
  • If it is meaningful, even if we could identify a defining unique identifier cost-effectively, it would usually have a complex representation as a composite natural key (NK). Since this is impractical, DEE's must be assigned a unique name that substitutes for the presumed defining identifier, represented in the database by a simple surrogate key (SK).
In fact, all entities for which unique identification is meaningful are assigned names (e.g., SSN's for US residents, S/N's for computers, hospital ID's for patients, employee ID's for employees and so on), all of which are represented by SK's in databases.
==========================
 ITEM_ID  ITEM      PRICE
=======-------------------
 1        Soda can  1.39
 2        Soda can  1.39
 3        Soda can  1.39
 ...
==========================

Duplicates Are a Costly Contradiction

Thus,  whether distinguishability is meaningful or not, a design that induces duplicates
  • Contradicts the conceptual model;
  • Violates RDM's Information Principle (IP), which mandates that, to be accessible to a RDBMS, all information in a relational database--including distinguishability, if meaningful--must be represented explicitly as attribute values.
  • Contradictions and RDM violations have costly consequences, no matter "what you're trying to do".
Relational algebra is closed to relations like arithmetic is closed to numbers: arithmetic operates on numbers and yields a number as a result, relational algebra operates on relations and yields a relation. Just as arithmetic operations break when applied to anything other than numbers, relational operations break if applied to anything other than relations. Relations have unique tuples, so bags are not relations, relational closure is broken and
  • Logical correctness of query results is no longer guaranteed  (e.g., applying relational operations to SQL bags produces results that differ in the number of duplicates contained in the bags (see Chapter 8, Appendix 4B: SQL Redundancy and Duplicates in PRACTICAL ISSUES FOR DATABASE MANAGEMENT);
  • Results become difficult to interpret (if the results are semantically equivalent, why isn't there only one?; and if they are not equivalent--as SQL treats them--what do the differences mean?);
  • Relational operations lose their nestability (with loss of ability to narrow down results).
Regardless of what you are trying to do, you don't want duplicates in databases, be they operational, analytical, legacy, or SQL. They
  • Reflect a loss of integrity--a DBMS cannot distinguish between "intentional" duplicates and data entry errors
  • Defeat logical access, physical independence (why?) and logical correctness of query results, adding unnecessary complications (see Chapter 8, Appendix 4A: Duplicate Removal in SQL in PRACTICAL ISSUES IN DATABASE MANAGEMENT)
all of which the RDM was devised to prevent. 

Understanding this is not difficult, but
  • Lack of education on data and relational fundamentals;
  • Focus on representation (structure) that misses integrity and manipulation implications;
  • SQL tolerance for duplicates
inhibit it. Had SQL authors had a good grasp of RDM, they would have prohibited duplicates from the start. That they or legacy systems did not is not a justification for duplicates and misses one of the many advantages of the RDM.
  • Every SQL table should have a key.
  • DISTINCT should be always applied to eliminate duplicates from results.
although this does not eliminate all the problems due to duplicates in SQL.

See also:


[1] Duplicates: Not Worth Repeating
 
[2] Conceptual Muddling and Database Kludges


[3] Chapter 8, PRACTICAL ISSUES IN DATABASE MANAGEMENT (Addison Wesley, 2003)

[4] Understanding Keys -- A New Perspective (forthcoming)  



No comments:

Post a Comment

View My Stats