DATABASE DEBUNKINGS: FOPL

Showing posts with label FOPL. Show all posts

Thursday, August 5, 2021

TYFK: Facts, Properties, Relationships, Domains, Relations, Tuples

Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

A statement from a 1986 book that "Data are facts represented by values -- numbers, character strings, or symbols -- which carry meaning in a certain context" triggered the following response on Linkedin:

“...In contrast, Date and Darwen (2000) say:
Domains are the things that we can talk about.
Relations are the truths we utter about those things.
Thus, the declarative sentence "Fred is in the kitchen." is a fact that links the domains Person[s] and Place[s] with the predicate "is in". The complete relation might be made up of three facts:
Fred is in the kitchen.
Mary is in the garden.
Arthur is in the garden.
This seems to be more precise than the 1986 statement.”

To which the book author responded:

“...back then we did not have the refinement, clarity, nor precision from people like Sjir Nijssen and Terry Halpin regarding facts, or elementary fact sentences, which today you and I know are the bedrock of data modeling. Facts are expressed in sentences (with domains and predicates).”

Unfortunately none of this is sufficiently clear and precise to prevent confusion and it inhibits understanding of the RDM.

RE-WRITE

Follow @DBDebunk Follow @ThePostWest

See: https://www.dbdebunk.com/2023/08/entities-properties-and-codds-sleight.html

Friday, March 19, 2021

Data Sublanguages vs. Programming Languages

Follow @DBDebunk Follow @ThePostWest

Revised 3/20/21

I recently came across a review of Edsger Dijkstra's work, where the following comment on a book he co-authored (referred to as D&S) raised my debunking antennae:

“... in general computer people seem to have a penchant for whipping up homebrew logics ... D&S is not the only example ... See E.F. Codd’s Relational Calculus, an obvious mess.”
--Maarten van Emden, A Bridge too Far: E.W. Dijkstra and Logic

Having recently argued that "Codd was wrong" and "You're teaching [his] gospel" Betray Lack of Foundation Knowledge, my suspicion should hardly surprise. Besides, criticism of Dijkstra is a very tall order in itself, particularly in the context of disputes among logicians). As a reader asked, "What’s so obviously messy in Codd’s Relational Calculus?". Answer:

TYFK: What Is a Relational Database and Why Is It Important?

Follow @DBDebunk Follow @ThePostWest

Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:
Data is stored in tables called relations.
Relations can be normalized.
In normalized relations, values saved are atomic values.
Each row in a relation contains a unique value.
Each column in a relation contains values from a same [sic] domain.”
--What is a relational database and why is it important, Quora.com

OBG: Missing Data -- Many-valued Logics and NULL Part 1

Follow @DBDebunk Follow @ThePostWest

Note: To demonstrate the correctness and stability of a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.

In response to a LinkedIn exchange we continue the series about missing data, NULL and the RDM. In Parts 1,2 and 3 we re-published a past exchange between myself and Hugh Darwen on the pros and cons of our relational solution to missing data vs. Hugh's "horizontal decomposition".

Here we re-publish my debunking of reactions to an article of mine exhibiting the common confusions evoked by NULL.

Understanding Codd's 12 Rules for RDBMS

Follow @DBDebunk Follow @ThePostWest

In response to an online publication of a book appendix regurgitating Codd's 12 famous rules (some of which were, typically, incorrect[1]) I posted earlier a clarification of the rules. This is a revision thereof for better consistency with the new understanding of the RDM based on McGoveran's re-interpretation, extension and formalization[2] of Codd's work.

TYFK: Nothing to Do With Relational

Follow @DBDebunk Follow @ThePostWest

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“The relational calculus is good in describing sets. But it´s bad at describing relations between data in different sets. Explicit identities (primary keys) need to be introduced and normalization is needed to avoid update inconsistencies due to duplication of data. To say it somewhat bluntly: The problem with the relational calculus and RDBMS etc. is the focus on data. It´s seems to be so important to store the data, that connecting the data moves to the background. That might be close to how we store filled in paper forms. But it´s so unlike how the mind works. There is no data stored in your brain. If you look at the fridge in your kitchen, there is no tiny fridge created in your brain so you can take the memory of your fridge with you, when you leave your kitchen.” --Weblogs.asp.net

The lack of foundation knowledge exposed by the above paragraph is so complete that its claims are practically upside down and backwards.

Fundamentals

As we have demonstrated, in mathematical set theory a relation (set) is a subset of a cross-product of domains (sets). In other words, it is a set that is a relationship among sets. Being abstract (i.e., having no real world meaning), the values of mathematical relations can be arbitrary.

The RDM is an application of simple set theory expressible in first order predicate logic (SST/FOPL) to database management: a relational database represents a conceptual model of some reality, namely (facts about) a multigroup in the real world -- a collection of related entity groups -- each database relation representing one such group; a database is also a set of related relations. The values in database relations (i.e., the data) are, thus, not arbitrary, but must be consistent with the conceptual model: relations and the database as a whole are semantically constrainted to be so consistent: (1) individual properties of entities and (2) collective properties of (a) groups (i.e., relationships among entities within groups), and (b) the multigroup (i.e., relationships among groups).

A primary key (PK) represents names given in the real world to entities of a given type, and the corresponding PK constraint (uniqueness) enforces consistency of a relation with the distinguishability of those entities in the real world, the facts about which it represents. These are not RDM artifacts, but rather part of the adaptation of SST/FOPL to database management.

For the primary advantage of the RDM -- guaranteed correctness of query results (i.e., inferences made from the database) -- to materialize, logical database design must adhere to three core principles which, jointly, imply fully normalized relations (5NF). In fact, in RDM relations are in 5NF by definition, otherwise they are not relations -- relational algebra (RA) operations lose information and all bets are off.

The RA is the manipulative component of the RDM -- a collection of primitive and derived set operations on relations that describe relationships among relations. For example, the join operation r1 JOIN r2 describes a relationship between r1 and r2 relation, the result itself a relation. Note that since every result of a RA operation on even one relation is always a relation and still describes a relationship -- between the "input" and "output" relations.

A data model -- and, industry claims notwithstanding, the only one satisfying Codd's definition that has been formalized is the RDM -- is by nature focused on data. However, the RDM supports physical independence (PI) and, thus, not concerned with how data is physically stored and accessed. The notion of "files stored in paper form" is an example of the common and entrenched logical-physical confusion (LPC) due to failure to understand the distinction between a logical relation and its tabular visualization on a physical medium, induced/reinforced by the industry's "direct image" implementation of SQL DBMSs.

Conclusion

We rephrase the above paragraph as follows:

“The relational algebra describes relationships among relations (sets). Primary keys are one of the adaptations of the SST/FOPL for database management: a PK constraint -- uniqueness -- represents formally in the database a within-group relationship among all its entities.

Mandatory adherence to three core design principles jointly imply full normalization, which is necessary to guarantees correctness of query results. True RDBMSs:

Implement the RA for logical data retrieval independent of how the data is physically stored and accessed. SQL DBMSs notwithstanding, vendors are free to store data whichever way they want as long as they don't expose it to users in applications.
Enforce relational constraints that are formal database representations of relationships in the conceptual model represented by the database.”

The "brain" stuff is sheer nonsense.

Thursday, August 20, 2020

TYFK: Relations, Tables, Domains and Normalization

Follow @DBDebunk Follow @ThePostWest

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:

Data is stored in tables called relations.

Relations can be normalized. In normalized relations, values saved are atomic values.

Each row in a relation contains a unique value.

Each column in a relation contains values from a same domain.”

OBG: Data Models and Physical Independence

Follow @DBDebunk Follow @ThePostWest

Note: To appreciate the stability of a sound foundation vs the industry's fad-driven cookbook practice, I am re-publishing some of the articles and reader exchanges from the old DBDebunk.com (2000-06), giving you the opportunity to judge for yourself how well my claims/arguments hold up and whether the industry has progressed at all. I am adding comments on re-publication where necessary. Long pieces are broken into smaller parts for fast reading.

From "Little Relationship to Relational" originally posted on March 29, 2001.

“E.F. ("Ted") Codd conceived of his relational model for databases while working at IBM in 1969. Codd's approach took a clue from first-order predicate logic, the basis of a large number of other mathematical systems and presented itself [sic] in terms of set theory, leaving the physical definition of the data undefined and implementation dependent. In June of 1970, Codd laid down much of his extensive groundwork for the model in his article, "A Relational Model of Data for Large Shared Data Banks" published in the Communications of the ACM, a highly regarded professional journal published by the Association for Computing Machinery. Buoyed by an intense reaction against the ad hoc data models offered by the physically oriented mainframe databases, Codd's rigid separation of the logical model, with its rigorous mathematical underpinnings, from the less elegant realities of hardware engineering was revolutionary in its day. Codd and his relational ideas blazed across the academic computing landscape over the next few years.”

TYFK: Misconceptions About the Relational Model

Follow @DBDebunk Follow @ThePostWest

“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:

Data is stored in tables called relations.

Relations can be normalized, [in which case] values saved are atomic values.

Each row in a relation contains a unique value.

Each column in a relation contains values from a same domain.”

Semantics and the Relational Model

Follow @DBDebunk Follow @ThePostWest

“The RDM is semantically weak ... struggles with consistent granularity and has limitations at the property level... it has no concept of data flow ... it is an incomplete theory. Great for its time but needs something better now ... it uses ill defined and linguistically suspect labels ... it has no rules for semantic accuracy ... this just makes the RDM 1% of the truth ... the RDM should have solved this all by now ... but it has clearly not. You fail to see the reality of the failure of RDM in the real world ... this is your choice. I understand why you cling to it ... it is a most excellent theory that I respect greatly ... [but o]pen minds make progress...”

Thus in a LinkedIn exchange. Criticism of the RDM almost always reflects poor foundation knowledge and lack of familiarity with the history of the field, and as we shall see, this one is not different. It is often triggered by what I call the "fad-to-fad cookbook approach", one of the latest fads being the industry's revelational "discovery" of semantics.

No Such Thing As "Current Relational Data Models"

Follow @DBDebunk Follow @ThePostWest

“... the concept of a state group is indeed a missing modeling concept in relational/current data models...”

Thus in a LinkedIn exchange. I don't know what a "state group" is, but I spent almost six decades debunking the misuses of data model in general and the abuses of the RDM in particular and I smell them from miles away. While the time when lack of foundation knowledge shocked me is long gone, practitioners' total unawareness of and indifference to it, and poor reasoning in a field founded on logic never ceases to amaze me.

What exactly are "relational/current data models"?

TYFK: What Is A Database Relationship?

Follow @DBDebunk Follow @ThePostWest

Note: This is a re-write of an earlier post. About TYFK posts (Test Your Foundation Knowledge) see the post insert below.

“Here two or more table[s] are related with each other. This is Database relationship. Database relationship is used a lot ... [in] relational database management systems ... shortly called RDBMS. Here is Join_data [sic] table and Interview_data table. For creating a relational database management system both of the table[s] must have a common field. Here Employee_ID is a common field ... Database relationship types: One-To-One relation, One-To-many relation, Many-to-many relation. Minimum one common field is essential in all the tables. The data type of common field and field size will be same in all the tables.”

First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

TYFK: "Multi-model DBMSs" is an Empty Set

Follow @DBDebunk Follow @ThePostWest

Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.

“Traditional databases ... don't have a multi-model capability. Point is that richer data models are underused, relational data models are overused, and graph data models have so many advantages that shouldn't be ignored. Relational models, on the other hand, have wildly complex structures often with hundreds to thousands of tables. Each table then contains tens to hundreds of columns, arbitrarily constructed in each and every relational system. And just in case the situation wasn't complex enough, many of those columns are exist exclusively to manage uniqueness and provide connections to other tables. This Structure-FIrst approach produced the cascade of complexity from which we have struggled to recover ever since.”

TYFK: What Is a Relational Database?

Follow @DBDebunk Follow @ThePostWest

“RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL, and for all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access. RDBMS store the data into collection of tables, which might be related by common fields (database table columns). RDBMS also provide relational operators to manipulate the data stored into the database tables. An important feature of RDBMS is that a single database can be spread across several tables. This differs from flat-file databases, in which each database is self-contained in a single table. The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:

Data is stored in tables called relations.

Relations can be normalized.

In normalized relations, values saved are atomic values.

Each row in a relation contains a unique value.

Each column in a relation contains values from a same domain.”

The question got 18 answers online, but none came even close to being correct. This is the only one that merits debunking -- the rest will be posted on LinkedIn as "To laugh or cry?".

Note: While the question is about database, due to routine interchangeable use of database and DBMS, we suspect the intention was DBMS. Our debunking applies to database, and our correct answer makes the proper distinction.

First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our posts, BOOKS, PAPERS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Comments on a Stonebraker Article

Follow @DBDebunk Follow @ThePostWest

These comments were prompted by a LinkedIn post referencing Michael Stonebraker's Those Who Forget the Past Are Doomed to Repeat It -- something I often reiterate myself -- where he argues:

“Over the past decade, there have been a number of DBMSs introduced (typically labeled as NoSQL) which utilize a network or hierarchical data model. MongoDB and Cassandra come immediately to mind as examples. Some such systems support networks through the concepts of "links" and some support hierarchical data using a nested data model often utilizing JSON. In my opinion, these systems have not internalized lessons from history.

“At the SIGFIDET (now SIGMOD) annual conference in 1974, there was a "Great Debate" over the merits of the relational model versus the network and hierarchical models ... Basically, the argument was about which model [relational or network] was a better fit for structured data (as opposed to documents, e-mails, etc.) and boiled down to two questions:

Question 1: Are high-level data sublanguages a good idea?
Question 2: Are tables the best data structure or should one use a network or hierarchy?”

“The last 45 years have definitely affirmed Codd’s position on both issues ... The conclusion from the 1970s was that the relational model provides superior data independence, compared to the network and hierarchical [graph] models. Forty-five years later, this conclusion is still true. If you want to insulate yourself from the changes that business conditions dictate, use a relational DBMS. If you want the successor to the successor to your job to thank you for your wise decision, use a relational model.”

I couldn't agree more, having repeatedly argued this myself. But he misses some old aspects that the industry has failed to recognize, has ignored, or dismissed[1]; and some important new aspects due to a new understanding of Codd's work[2].

Data Sublanguage Part 4: Conclusion

Follow @DBDebunk Follow @ThePostWest

In Parts 1, 2, and 3 we showed that when the RDM is the data model:

A data sublanguage is short for data manipulation language (DML) that combines (1) a relationally complete retrieval component (i.e., that expresses the RA) with (2) a component that expresses updates as relation transformations;
A DBMS language is a careful combination, for practical purposes, of the data sublanguage with several sublanguages, each of which expresses a data management function (e.g., data definition, transactions, concurrency, authorizations) -- that are not relational, but are consistent with the RDM, and must not include syntactic elements that are at odds with, or subvert those of the DML.

Note: The RDM is the only data model consistent with Codd's definition that has been formalized [1].

We are now in a position to debunk the two quotes that triggered this series.

Brother, Spare Me the "Paradigms"

Follow @DBDebunk Follow @ThePostWest

Note: This is a revised version of an old column @All Analytics in response to a recent LinkedIn exchange (check out my comments in the exchange).

“Consider dimensional design and Big Data as two additional paradigms ... Big Data paradigms like Hadoop and NoSQL will alleviate the temptation people have to try to use the relational database in unnatural ways.”

Every few years (and the intervals are getting shorter) a "fundamentally different" new way of doing data management -- a "paradigm shift" -- is being promoted that, if you don't adopt, you’ll be "left behind". In the above mentioned online exchange it is argued that data management is undergoing a paradigm shift from application-centric to data-centric data management. For the very few who (1) understand what a paradigm is and (2) are familiar with data fundamentals and the history of the field, the irony could not be richer.

Data Sublanguage Part 3: DBMS Language

Follow @DBDebunk Follow @ThePostWest

Note: 10/10/2019 significantly revised Part 2, which requires a re-read.

A formal data model consists of structure, integrity and manipulation[1,2] and so requires (1) a language that expresses data manipulation (retrievals and updates) augmented with (2) a metalanguage used to define the model's structural elements. As we showed in Part 1 and Part 2, when the data model is the RDM:

Data sublanguage is short for a relationally complete data manipulation sublanguage (DML) that expresses retrievals and updates, the latter correctly understood as set-theoretic relation transformations.
A data definition sublanguage (DDL) is a metalanguage for DML that is outside the theory but consistent with the RDM and at least as powerful expressively as the DML (e.g., a very carefully restricted SOL to avoid self-referencing).
The DML and the DDL can, for practical purposes, be carefully unified into what Codd called a "comprehensive data sublanguage", but we prefer DBMS language to avoid confusion.

By carefully we mean that because only the DML is, strictly speaking, based on relational theory, the DDL must be consistent with, but differentiated from it, such that the two can't be mixed in the same expression in a way that the former subverts the latter. A DML expression can be referenced as a sub-expression by a DDL expression (e.g., as in view definitions), but only if it contains strictly retrievals (e.g., SELECT) and no updates (e.g., INSERT, UPDATE, DELETE).

Note: Data definition and manipulation are possible without a DBMS. However:

“With a relational catalog, definition can be performed via the RA, which requires physical implementation to be determined exclusively by the catalog (behind the scenes as it were) -- a kind of skeletal, primitive, or rudimentary DBMS. This is why Codd created a relational catalog that contains a description of the database and could be managed using RA-based DML. It works well unless one is allowed to mix DDL (metalanguage) with DML (language) in the same expression. Otherwise put, the database can be read to modify the catalog, but not vice-versa (as far as the DML is concerned, the catalog that describes the database does not exist).

But with a data model that, unlike the RDM, does not define a catalog such that the same language can be used for both database and it, a rudimentary DBMS must provide a workaround, and if the model is computationally complete (like CODASYL was), there must limits on how "active" the catalog is to prevent users from writing self-referencing expressions that cannot be automatically implemented because they may corrupt the database (same as would mixing data sublanguage and host language). This is one reason some of the pre-RDM directed graph DBMSs had limited notions of catalog that often required completely separate facilities to maintain.”

--David McGoveran

The DML and the DDL express two core data management functions centralized in the DBMS. When the data model is the RDM, only data management functions are permitted to access the data. At the end of Part 2 we alluded to other such functions that, like data definition, are outside relational theory, but must be consistent with the RDM.

Data Sublanguage Part 2: Data Manipulation and Definition

Follow @DBDebunk Follow @ThePostWest

Revised 10/10/2019.

In Part 1 we showed that Codd intended in 1969 to base the RDM on axiomatic set theory (AST) and second order logic (SOL) to accommodate relation-valued domains (RVD) (i.e., sets of sets), but that for the benefit of relational advantages and to avoid SOL problems he had to trade off the expressive power of AST/SOL for the simple set theory (SST) of proper sets (i.e., relations in normal form) expressible in first order predicate logic (FOPL) and, thus, computational for relational completeness[1]. He retained the power of the former for applications by hosting a relationally complete FOPL-based language expressing the RA in computationally complete programming languages (CCL).

We also alerted to an important, but unnoticed detail: data sublanguage appeared in the 1970 paper -- in 1969 Codd referred to retrieval sublanguage. This can be understood only with reference to the theoretical foundation of the RDM.

POSTS

Thursday, August 5, 2021

Thursday, June 10, 2021

Friday, March 19, 2021

Friday, February 12, 2021

Saturday, January 9, 2021

Saturday, October 17, 2020

Saturday, September 26, 2020

Fundamentals

Conclusion

Thursday, August 20, 2020

Friday, August 7, 2020

Sunday, June 28, 2020

Friday, June 12, 2020

Thursday, May 28, 2020

Sunday, May 10, 2020

Monday, April 27, 2020

Thursday, January 30, 2020

Friday, November 1, 2019

Saturday, October 26, 2019

Saturday, October 19, 2019

Friday, October 11, 2019

Friday, September 27, 2019