Sunday, June 28, 2020

TYFK: Misconceptions About the Relational Model

“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:
  • Data is stored in tables called relations.
  • Relations can be normalized, [in which case] values saved are atomic values.
  • Each row in a relation contains a unique value.
  • Each column in a relation contains values from a same domain.”

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).


DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

-12/24/20: Added 2021 to the
POSTS page

-12/26/20: Added “Mathematics, machine learning and Wittgenstein to LINKS page

- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- The PostWest blog for monthly samples of global Antisemitism – the only universally acceptable hatred left – as the (traditional) response to the existential crisis of decadence and decline of Western  civilization (including the US).
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.


The Misconceptions

  • The popular data model is not really the RDM.
  • There are no data models other than the RDM.
  • First order predicate logic (FOPL) is not by itself the theoretical foundation of the RDM.
  • FOPL does not "define a table as a n-ary relation".
  • Data is not stored in tables or relations.
  • Normalization is neither optional, nor sufficient for relations ("value atomicity" is not understood correctly).
  • Relations do not have rows and columns.

The Debunking

What is popular is not RDM, but what is erroneously believed in the industry to be RDM, namely the "SQL data model" (models, given the variety of SQL dialects), which has a very low relational fidelity[1,2,3,4,5]. Most practitioners do not know what a data model is, nor the RDM even as originally understood, let alone the current understanding[6].

The science in RDM is its dual theoretical foundation, which is responsible for most of it advantages[7], primary among them system-guaranteed logical validity and by-design semantic consistency[8]. We say that RDM is based on simple set theory (SST), but theory of relations (ToR) is more accurate than SST. Since ToR is based on set theory, however, we can safely say the RDM is based on SST.

“C. S. Peirce's theory of relations was published in 1870 (but largely ignored until the 1900s). He was very familiar with set theory as defined and understood at the time and his theory was in part motivated by the need to ***express relationships among sets***. By the time Codd came along in 1969, set theory was usually understood as the Zermelo-Fraenkel (ZFC) axiomatic systems and it was clear that these sufficed to express Peirce's theory of relations.”
--David McGoveran
In other words, set theory did not have a way to express relationships between sets and Peirce extended it by adding a special kind of set -- a relation -- that formalizes such set relationships.

The RDM is the only data model consistent with the concept introduced by Codd -- a theory-based combination of structure/integrity and manipulation -- that has been formalized. Industry claims notwithstanding, there are no other data models[9].

RDM is Codd's adaptation of SST/ToR to database management. One adjustment was combining ToR with FOPL.

“Codd read in Church's famous text that the predicates of predicate logic could be expressed as the mathematical relations in ToR. He realized that if he used some combination of attributes as unique identifiers and the rest as descriptive, he could interpret tuples as data records. Given that predicate logic is expressively complete (to 1st order), he had a formal way of expressing all data and data relationships!”
--David McGoveran
the addition of primary keys (PK) being another adjustment[10] (see note below).

In ToR a mathematical relation, which (unlike a database relation) is abstract (i.e., has no real world meaning) is a subset of the cross-product of two sets referred to as domains. For example:

Ackn. Wikipedia

The cross-product of the two domains A{a,b,c} and B{x,y,z} is:


a subset of which would be a relation. In our example:


is a relation on A,B -- it has three (a subset of the nine AxB) tuples and two attributes, a(A){a,c,b} and b(B){y,z,z}, drawing their values from the domains.

R can be presented as a table on a physical medium (here, screen):

 a(A) b(B)
  a    y
  c    z
  b    z
but the column and row arrangement plays no role in ToR and, thus, in RDM. Note that only the body of the table presents the relation! 

Note: Mathematical relations have unique tuples, so why do database relations need PKs?[10].

As relationships between two domains mathematical relations are binary. n-ary relations are another adjustment of the theory by Codd for database management.

“Prior to the RDM n-ary relations were understood as a composition of binary relations. Tarski (primarily a logician!) rediscovered Peirce's relation theory in 1940 and developed relation algebra (RA) based on it. Codd's initial RA was similar to Tarski's, but subsequently he took n-ary domain relationships as valid in their own right, so that expressing them as binary is secondary. He generalized relations as n-ary operands and the ToR RA operations proved tricky when generalized, which led to further normalization requirements beyond 1NF.”
--David McGoveran

One of the core advantages of the RDM is physical independence -- the insulation of applications and users from implementation details (storage and access methods) and changes thereof[11]. Relations are logical and the RDM is mute on how their data is physically stored and accesed. DBMS vendors have complete freedom to implement and DBAs to deploy any physical means they deem appropriate and change them at will without disrupting database access. Data is not stored in tables.

Database relations are both normalized (in 1NF)[12] and fully normalized (in 5NF)[13] by definition, otherwise they are not relations. 1NF means they are defined on simple domains the values of which are treated as atomic by the data sublanguage[14],  5NF means that they represent groups of entities of a single types. The three core design principles of the RDM jointly imply full normalization (but not vice-versa)[15]. Thus, explicit normalization to 1NF and further normalization to 5NF are required only to repair poor designs that do not adhere to the principles.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


[1] Pascal, F., SQL Sins

[2] Pascal, F., To Really Understand Integrity, Don't Start with SQL

[3] Pascal, F., DISTINCT and ORDER BY Are Not Relational

[4] Pascal, F., NULL Value is a Contradiction in Terms

[5] Pascal, F., Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant

[6] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.

[7] Pascal, F., Science, "Data Science", and Database Science

[8] Pascal, F., Logical Validity and Semantic Correctness

[9] Pascal, F., "Multi-model DBMSs" is an Empty Set

[10] Pascal, F., The Key to Relational Keys: Primary Keys

[11] Pascal, F., Physical Independence series

[12] Pascal, F., First Normal Form in Theory and Practice series

[13] Pascal, F., Normalization and Further Normalization series

[14] Pascal, F., Simple Domains and Value Atomicity

[15] Pascal, F., Database Design: What It Is and Isn't





No comments:

Post a Comment

View My Stats