RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"
John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --LinkedIn.com
One of my readers once wondered why "database professionals understand uniqueness via keys, but don't seem to understand why duplicate rows should be prohibited and the consequences of breaking relational closure":
"What is the proper way to represent two or more identical entities (e.g., cake soda cans purchased by a customer)? If you buy five cans, the receipt has five duplicate rows, which actually have meaning--ask anybody and they understand perfectly well that those rows represent five different cans. What they don't understand is that those five duplicates are distinguished by their position on the receipt--which violates Codd's Information Principle--and why that is a bad thing."
So the "reasons for which you don't want duplicates in the database" are, apparently, not obvious.
Note: "Are keys mandatory?", or "Why do we need a PK?" are two of the most frequent hits @dbdebunk: 85 (40%) out of a non-scientific sample of 208), so keys are not better understood than duplicates. I have dealt with keys in previous posts and my next one will be a final iteration on the subject, viewed from the perspective of McGoveran's interpretation of the RDM, so stay tuned).
The data on the receipt is a result of some query issued by an application to the supermarket database and how it is presented is at the discretion of the application. Closure is violated only if the supermarket database contains duplicates and/or the DBMS produces internally results that are not relations, in which case, the number of cans on the receipt may not be the one actually purchased by the customer (see below).
pictures a bag, not a relation, that represents a single fact of the type=================
Soda can 1.39
Soda can 1.39
Item (ITEM) has price (PRICE).multiple times. Codd used to say that stating the same fact more than once does not make it truer, only redundant.
Entity Distinguishability and IntegrityEntities--even identical ones, what Date calls descriptively equivalent entities (DEE)--are always distinguishable in the real world, otherwise we could not tell them apart, or count them. They have a set of inherent properties that define and uniquely identify them, by which we recognize them, explicitly or implicitly. In principle, with unlimited time and effort we could identify the set of properties that define soda cans and uniquely identifies them (dents, scratches and all). The questions are (1) Is distinguishability meaningful in a specific informational context (i.e., need the entities be tracked individually?) and (2) how practical it is to identify and represent such an identifier in the database? .
- If it is not meaningful, the identifier need not be represented in the database. Stores handle DEE's such as soda cans collectively:
PROD# PRODUCT UPRICE UNITS
A1X Soda can 1.39 5
- If it is meaningful, even if we could identify a defining unique identifier cost-effectively, it would usually have a complex representation as a composite natural key (NK). Since this is impractical, DEE's must be assigned a unique name that substitutes for the presumed defining identifier, represented in the database by a simple surrogate key (SK).
ITEM_ID ITEM PRICE
1 Soda can 1.39
2 Soda can 1.39
3 Soda can 1.39
Duplicates Are a Costly ContradictionThus, whether distinguishability is meaningful or not, a design that induces duplicates
- Contradicts the conceptual model;
- Violates RDM's Information Principle (IP), which mandates that, to be accessible to a RDBMS, all information in a relational database--including distinguishability, if meaningful--must be represented explicitly as attribute values.
- Contradictions and RDM violations have costly consequences, no matter "what you're trying to do".
- Logical correctness of query results is no longer guaranteed (e.g., applying relational operations to SQL bags produces results that differ in the number of duplicates contained in the bags (see Chapter 8, Appendix 4B: SQL Redundancy and Duplicates in PRACTICAL ISSUES FOR DATABASE MANAGEMENT);
- Results become difficult to interpret (if the results are semantically equivalent, why isn't there only one?; and if they are not equivalent--as SQL treats them--what do the differences mean?);
- Relational operations lose their nestability (with loss of ability to narrow down results).
- Reflect a loss of integrity--a DBMS cannot distinguish between "intentional" duplicates and data entry errors
- Defeat logical access, physical independence (why?) and logical correctness of query results, adding unnecessary complications (see Chapter 8, Appendix 4A: Duplicate Removal in SQL in PRACTICAL ISSUES IN DATABASE MANAGEMENT)
Understanding this is not difficult, but
- Lack of education on data and relational fundamentals;
- Focus on representation (structure) that misses integrity and manipulation implications;
- SQL tolerance for duplicates
- Every SQL table should have a key.
- DISTINCT should be always applied to eliminate duplicates from results.
 Duplicates: Not Worth Repeating
 Conceptual Muddling and Database Kludges
 Chapter 8, PRACTICAL ISSUES IN DATABASE MANAGEMENT (Addison Wesley, 2003)
 Understanding Keys: Natural, Primary and Surrogate Are Sufficient (forthcoming)
Do you like this post? Please link back to this article by copying one of the codes below.URL: HTML link code: BB (forum) link code: