2019 - DATABASE DEBUNKINGS

Monday, December 23, 2019

Friday, December 20, 2019

The RDM and Model Stability

“3rd normal form data models in data warehousing efforts struggle when changes impact parent child relationships. These impacts cause cascading changes to the data model, the queries, and the loading processes. [For example:]

There are bank accounts

Each account belongs to exactly one customer

A customer can have more than one account

The bank introduces a new product: joint accounts, which means an account can now have more than one owner. It is clear that the 3NF model has to be extended in order to keep this new information; the data vault models seems to be able to fulfill the new requirement.

Some banks propose joint accounts, some don’t, therefore some use M:N relation between client and accounts and others 1:N. A model which is good for any possible case is actually awful model because it describes nothing: by looking at this model you can’t say if joint accounts exist among bank's products.”

--Data Vault and Model (in)Stability

Data warehousing/vault[1] are a red herring here -- the real issue is data independence. Some corrections and clarifications first:

Normal forms do not pertain to the data model itself -- the RDM -- but to relations in logical models created using strictly the RDM[2].
3NF is insufficient -- relations are in 5NF by definition, otherwise correctness is not guaranteed[3].
The RDM was introduced as a database representation superior to old directed graph -- hierarchic and network (CODASYL) -- systems for conceptual models focused on relationships among entity groups, rather than among individual entities[4]. Graph database representation (nodes and edges) corresponds to a worldview at the conceptual level of parents-children (network) relationships, of which parent-children (hierarchy) is a special case. The relational representation (relations) corresponds to M:N relationships among entity groups, of which M:1 is a special case[5].

Note: Correctness -- logical and semantic[6] -- requires adherence to three principles of database design that jointly imply 5NF[7].
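The joint-accounts scenario from the quote can be sketched in code. This is a minimal illustration, not the design the post goes on to recommend: the table names (customers, accounts, ownership), the index name, and the sample rows are all hypothetical, and SQLite tables are only SQL tables, not relations. Under those assumptions, account ownership is recorded in a separate association table, the "one owner per account" rule is a declared constraint, and moving from M:1 to M:N amounts to dropping that constraint while the query stays the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (cid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE accounts  (aid INTEGER PRIMARY KEY);
-- Hypothetical association table: which customer owns which account.
CREATE TABLE ownership (
    aid INTEGER NOT NULL REFERENCES accounts(aid),
    cid INTEGER NOT NULL REFERENCES customers(cid),
    PRIMARY KEY (aid, cid)
);
-- The M:1 special case expressed as a constraint: one owner per account.
CREATE UNIQUE INDEX one_owner_per_account ON ownership(aid);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO accounts VALUES (10), (11);
INSERT INTO ownership VALUES (10, 1), (11, 1);
""")

# The same query is used before and after the business rule changes.
OWNERS_QUERY = """
SELECT o.aid, c.name
FROM ownership o JOIN customers c ON c.cid = o.cid
ORDER BY o.aid, c.name
"""

def try_add_owner(aid, cid):
    """Return True if the ownership row is accepted, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO ownership VALUES (?, ?)", (aid, cid))
        return True
    except sqlite3.IntegrityError:
        return False

# While the M:1 constraint is declared, a second owner is rejected.
rejected_before = not try_add_owner(10, 2)

# The bank introduces joint accounts: only the constraint changes.
conn.execute("DROP INDEX one_owner_per_account")
accepted_after = try_add_owner(10, 2)

owners = conn.execute(OWNERS_QUERY).fetchall()
print(rejected_before, accepted_after, owners)
```

In this sketch the rule "joint accounts are not offered" is visible in the schema as a named constraint, so the design does state which case applies.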

TYFK: 5NF, Association Relations and Join

Follow @DBDebunk Follow @ThePostWest

Assume a conceptual model of a multigroup consisting of two related entity groups, Customers and Orders, where a customer can issue multiple orders. The conventional logical database design is:

CUSTOMERS
===============================================
| CID | NAME     | AGE | ADDRESS   | SALARY   |
-=====-----------------------------------------
|   1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|   2 | Khilan   |  25 | Delhi     |  1500.00 |
|   3 | Kaushik  |  23 | Kota      |  2000.00 |
|   4 | Chaitali |  25 | Mumbai    |  6500.00 |
|   5 | Hardik   |  27 | Bhopal    |  8500.00 |
|   6 | Komal    |  22 | MP        |  4500.00 |
|   7 | Muffy    |  24 | Indore    | 10000.00 |
-----------------------------------------------

ORDERS
===================================
| OID | DATE       | CID | AMOUNT |
-=====-----------------------------
| 102 | 2009-10-08 |   3 |   3000 |
| 100 | 2009-10-08 |   3 |   1500 |
| 101 | 2009-11-20 |   2 |   1560 |
| 103 | 2008-05-20 |   4 |   2060 |
-----------------------------------

where ORDERS.CID is an "embedded" foreign key (FK) referencing the primary key (PK) CUSTOMERS.CID.

Consider the query "For all orders, find the CID, name, OID, amount, and date" that applies a join of the two relations on CID. In SQL:

SELECT c.cid,c.name,o.oid,o.amount,o.date
FROM customers c
INNER JOIN orders o
ON c.cid = o.cid;

with the result displayed by the table:

====================================================
| C.CID | C.NAME   | O.OID | O.AMOUNT | O.DATE     |
-=======------------=======-------------------------
|     2 | Khilan   |   101 |     1560 | 2009-11-20 |
|     3 | Kaushik  |   102 |     3000 | 2009-10-08 |
|     3 | Kaushik  |   100 |     1500 | 2009-10-08 |
|     4 | Chaitali |   103 |     2060 | 2008-05-20 |
----------------------------------------------------

Note: A table is just a tabular display of a relation and the two should not be confused[1,2]. Bear in mind that SQL tables are not relations.
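For readers who want to reproduce the conventional design and the join above, here is a minimal sketch using Python's built-in sqlite3 module. The column types and the ORDER BY clause are assumptions added for the illustration (the post specifies neither, and without ORDER BY the row order of an SQL result is not guaranteed). The data is copied from the two tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    cid INTEGER PRIMARY KEY, name TEXT, age INTEGER, address TEXT, salary REAL
);
CREATE TABLE orders (
    oid INTEGER PRIMARY KEY, date TEXT,
    cid INTEGER REFERENCES customers(cid),  -- the "embedded" FK
    amount INTEGER
);
""")

conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", [
    (1, "Ramesh", 32, "Ahmedabad", 2000.00),
    (2, "Khilan", 25, "Delhi", 1500.00),
    (3, "Kaushik", 23, "Kota", 2000.00),
    (4, "Chaitali", 25, "Mumbai", 6500.00),
    (5, "Hardik", 27, "Bhopal", 8500.00),
    (6, "Komal", 22, "MP", 4500.00),
    (7, "Muffy", 24, "Indore", 10000.00),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (102, "2009-10-08", 3, 3000),
    (100, "2009-10-08", 3, 1500),
    (101, "2009-11-20", 2, 1560),
    (103, "2008-05-20", 4, 2060),
])

# The query from the post, with an ORDER BY added so the output is deterministic.
rows = conn.execute("""
SELECT c.cid, c.name, o.oid, o.amount, o.date
FROM customers c
INNER JOIN orders o ON c.cid = o.cid
ORDER BY c.cid, o.oid
""").fetchall()

for row in rows:
    print(row)
```

Only customers 2, 3, and 4 appear in the result, because the inner join drops customers with no matching order.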

It may surprise you to know that both the design and the result are problematic from a relational standpoint.

The CAP Theorem and the RDM: Consistency, Availability, Partitioning

by David McGoveran

I have recently posted on LinkedIn two "To Laugh or Cry?" quotes about advantages and disadvantages of "relational" and "non-relational" DBMSs from "Choosing The Right Database" @TowardsDataScience.com (by Jun Wu, "a Content Writer for Technology, AI, Data Science, Psychology, and Parenting" with "background in programming and statistics" who "on her spare time, writes poetry and blogs on her parenting website").

I also brought the article to the attention of David McGoveran, who replied "This deserves a response -- it is simplistic, but makes the current, widespread thinking in the industry clear". His comments that follow focus on the three concepts in the title.

Although likely written with good intentions, Jun Wu's article is representative of the muddy thinking that pervades the database and "data science" communities[1,2]. Besides the rather obvious fact that one does not choose a "database" (an organized collection of data), but a DBMS (software that manages databases), it swallows whole and then spreads confused and inconsistent usage of the terms consistency, availability, and partitioning. [FP: That practitioners use the terms database and DBMS interchangeably is part and parcel of the general misuse and abuse of terminology reflecting poor foundation knowledge.]

The terms consistency, availability, and partitioning as used in Brewer's CAP Theorem are not the same terms as those traditionally used in relational database theory and even commercial SQL DBMS practice. For pragmatic reasons I'll qualify the former terms with "CAP" and the latter terms with "RDB", even though I find it misleading to refer to commercial SQL systems as "relational"[3,4,5,6]. Let's look at each term and its intended meaning.

Comments on a Stonebraker Article

These comments were prompted by a LinkedIn post referencing Michael Stonebraker's Those Who Forget the Past Are Doomed to Repeat It -- something I often reiterate myself -- where he argues:

“Over the past decade, there have been a number of DBMSs introduced (typically labeled as NoSQL) which utilize a network or hierarchical data model. MongoDB and Cassandra come immediately to mind as examples. Some such systems support networks through the concepts of "links" and some support hierarchical data using a nested data model often utilizing JSON. In my opinion, these systems have not internalized lessons from history.

“At the SIGFIDET (now SIGMOD) annual conference in 1974, there was a "Great Debate" over the merits of the relational model versus the network and hierarchical models ... Basically, the argument was about which model [relational or network] was a better fit for structured data (as opposed to documents, e-mails, etc.) and boiled down to two questions:

Question 1: Are high-level data sublanguages a good idea?
Question 2: Are tables the best data structure or should one use a network or hierarchy?”

“The last 45 years have definitely affirmed Codd’s position on both issues ... The conclusion from the 1970s was that the relational model provides superior data independence, compared to the network and hierarchical [graph] models. Forty-five years later, this conclusion is still true. If you want to insulate yourself from the changes that business conditions dictate, use a relational DBMS. If you want the successor to the successor to your job to thank you for your wise decision, use a relational model.”

I couldn't agree more, having repeatedly argued this myself. But he misses some old aspects that the industry has failed to recognize, has ignored, or dismissed[1]; and some important new aspects due to a new understanding of Codd's work[2].

Data Sublanguage Part 4: Conclusion

In Parts 1, 2, and 3 we showed that when the RDM is the data model:

A data sublanguage is short for data manipulation language (DML) that combines (1) a relationally complete retrieval component (i.e., that expresses the RA) with (2) a component that expresses updates as relation transformations;
A DBMS language is a careful combination, for practical purposes, of the data sublanguage with several sublanguages, each of which expresses a data management function (e.g., data definition, transactions, concurrency, authorizations) -- that are not relational, but are consistent with the RDM, and must not include syntactic elements that are at odds with, or subvert those of the DML.

Note: The RDM is the only data model consistent with Codd's definition that has been formalized [1].

We are now in a position to debunk the two quotes that triggered this series.

Brother, Spare Me the "Paradigms"

Note: This is a revised version of an old column @All Analytics in response to a recent LinkedIn exchange (check out my comments in the exchange).

“Consider dimensional design and Big Data as two additional paradigms ... Big Data paradigms like Hadoop and NoSQL will alleviate the temptation people have to try to use the relational database in unnatural ways.”

Every few years (and the intervals are getting shorter) a "fundamentally different" new way of doing data management -- a "paradigm shift" -- is being promoted that, if you don't adopt, you’ll be "left behind". In the above mentioned online exchange it is argued that data management is undergoing a paradigm shift from application-centric to data-centric data management. For the very few who (1) understand what a paradigm is and (2) are familiar with data fundamentals and the history of the field, the irony could not be richer.

Testing Your Foundation Knowledge

The Web is chock-full of unnoticed/unquestioned pronouncements by novices or self-taught "experts" that are (1) wrong, or (2) gobbledygook. Attempts to demonstrate the lack of foundation knowledge underlying these misconceptions are usually dismissed as "theory, not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of supporting evidence and logic. Practitioners who cannot discern such misconceptions and understand their practical implications are insufficiently prepared for a professional career in data management. They cannot associate problems with their real causes and come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads. What about you?

This is another batch in the Test Your Foundation Knowledge series of online statements reflecting common misconceptions due to lack of foundation knowledge. Can you debunk them? Which of the two categories, (1) or (2), do they fall in? If not, check out the recommended references.

Comment: The kind of clueless exchange leading nowhere that takes place in the absence of foundation knowledge. See if you learned anything from it, then check out the references.

Data Sublanguage Part 3: DBMS Language

Note: On 10/10/2019 Part 2 was significantly revised, which warrants a re-read.

A formal data model consists of structure, integrity and manipulation[1,2] and so requires (1) a language that expresses data manipulation (retrievals and updates) augmented with (2) a metalanguage used to define the model's structural elements. As we showed in Part 1 and Part 2, when the data model is the RDM:

Data sublanguage is short for a relationally complete data manipulation sublanguage (DML) that expresses retrievals and updates, the latter correctly understood as set-theoretic relation transformations.
A data definition sublanguage (DDL) is a metalanguage for DML that is outside the theory but consistent with the RDM and at least as powerful expressively as the DML (e.g., a very carefully restricted SOL to avoid self-referencing).
The DML and the DDL can, for practical purposes, be carefully unified into what Codd called a "comprehensive data sublanguage", but we prefer DBMS language to avoid confusion.

By carefully we mean that because only the DML is, strictly speaking, based on relational theory, the DDL must be consistent with, but differentiated from it, such that the two can't be mixed in the same expression in a way that the former subverts the latter. A DML expression can be referenced as a sub-expression by a DDL expression (e.g., as in view definitions), but only if it contains strictly retrievals (e.g., SELECT) and no updates (e.g., INSERT, UPDATE, DELETE).
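As a rough illustration of the last point, the sketch below uses SQLite through Python's sqlite3 module. SQL is not a relational language in the sense used here, so this only shows how one SQL product happens to behave, and the table, view names, and rows are hypothetical. A view definition (DDL) may reference a retrieval (SELECT), while an attempt to define a view over an update (INSERT) is rejected by the parser.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (oid INTEGER PRIMARY KEY, cid INTEGER, amount INTEGER);
INSERT INTO orders VALUES (100, 3, 1500), (101, 2, 1560), (102, 3, 3000);
-- DDL (CREATE VIEW) referencing a retrieval-only DML sub-expression.
CREATE VIEW large_orders AS
    SELECT oid, cid, amount FROM orders WHERE amount > 1500;
""")

large = conn.execute("SELECT oid FROM large_orders ORDER BY oid").fetchall()

def ddl_accepts(statement):
    """Return True if SQLite accepts the statement, False if it raises an error."""
    try:
        conn.execute(statement)
        return True
    except sqlite3.Error:
        return False

# A view whose body is an update rather than a retrieval is not accepted.
update_in_view_accepted = ddl_accepts(
    "CREATE VIEW bad_view AS INSERT INTO orders VALUES (999, 1, 1)"
)

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(large, update_in_view_accepted, row_count)
```

The rejected statement leaves the table untouched, so the row count stays at three.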

Note: Data definition and manipulation are possible without a DBMS. However:

“With a relational catalog, definition can be performed via the RA, which requires physical implementation to be determined exclusively by the catalog (behind the scenes as it were) -- a kind of skeletal, primitive, or rudimentary DBMS. This is why Codd created a relational catalog that contains a description of the database and could be managed using RA-based DML. It works well unless one is allowed to mix DDL (metalanguage) with DML (language) in the same expression. Otherwise put, the database can be read to modify the catalog, but not vice-versa (as far as the DML is concerned, the catalog that describes the database does not exist).

But with a data model that, unlike the RDM, does not define a catalog such that the same language can be used for both database and it, a rudimentary DBMS must provide a workaround, and if the model is computationally complete (like CODASYL was), there must be limits on how "active" the catalog is to prevent users from writing self-referencing expressions that cannot be automatically implemented because they may corrupt the database (same as would mixing data sublanguage and host language). This is one reason some of the pre-RDM directed graph DBMSs had limited notions of catalog that often required completely separate facilities to maintain.”

--David McGoveran

The DML and the DDL express two core data management functions centralized in the DBMS. When the data model is the RDM, only data management functions are permitted to access the data. At the end of Part 2 we alluded to other such functions that, like data definition, are outside relational theory, but must be consistent with the RDM.

Testing Your Foundation Knowledge

The Web is chock-full of unnoticed/unquestioned pronouncements by novices or "experts", many self-taught, that are (1) wrong, or (2) gobbledygook. Attempts to demonstrate the lack of foundation knowledge underlying these misconceptions, and their practical implications, are usually dismissed as "theory, not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of the supporting evidence and argument logic. This is understandable: in the absence of foundation knowledge and ability to reason, it is by definition impossible to comprehend and appreciate corrections that require them.

Practitioners who cannot detect such misconceptions and understand their practical implications and the importance thereof are insufficiently prepared for a professional career in data management. Worse, they cannot associate problems with their real causes and, thus, cannot come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads.

What about you? This is another batch in the Test Your Foundation Knowledge regular series of posts of online statements reflecting common misconceptions that are difficult to discern without foundation knowledge. You can test yours by trying to debunk them in Comments, including which category, (1) or (2), they fall in. If you can't, proper education is in order.

Data Sublanguage Part 2: Data Manipulation and Definition

Revised 10/10/2019.

In Part 1 we showed that in 1969 Codd intended to base the RDM on axiomatic set theory (AST) and second order logic (SOL) to accommodate relation-valued domains (RVD) (i.e., sets of sets). To preserve the relational advantages and avoid the problems of SOL, however, he had to trade the expressive power of AST/SOL for the simple set theory (SST) of proper sets (i.e., relations in normal form) expressible in first order predicate logic (FOPL) and, thus, trade computational completeness for relational completeness[1]. He retained the power of the former for applications by hosting a relationally complete FOPL-based language expressing the RA in computationally complete programming languages (CCL).

We also drew attention to an important but unnoticed detail: data sublanguage first appeared in the 1970 paper, whereas in 1969 Codd referred to retrieval sublanguage. This can be understood only with reference to the theoretical foundation of the RDM.

Testing Your Foundation Knowledge

The Web is chock-full of unnoticed/unquestioned pronouncements by self-taught novices or "experts" that are (1) wrong, or (2) gobbledygook. Attempts to demonstrate the lack of foundation knowledge underlying these misconceptions, and their practical implications, are usually dismissed as "theory that is not practical", attacked as "insulting ad-hominem", or ignored altogether, regardless of the amount and quality of the supporting evidence and argument logic. This is understandable: in the absence of foundation knowledge and ability to reason, it is by definition impossible to comprehend and appreciate corrections that require them.

I have always contended that practitioners who cannot detect such misconceptions and understand their practical implications, and the importance thereof, are insufficiently prepared for a professional career in data management. Worse, they cannot associate problems with their real causes and, thus, cannot come up with proper solutions, which explains the industry's "cookbook approach" and succession of fads.

What about you? This is another batch in the Test Your Foundation Knowledge regular series of posts of online statements reflecting common misconceptions that are difficult to discern without foundation knowledge. You can test yours by trying to debunk them in Comments -- which category, (1) or (2), do they fall in?

Data Sublanguage Part 1: Relational vs. Computational Completeness

Note: I have revised the "Logical Access, Data Sublanguage, Kinds of Relations, Database Redundancy, and Consistency" paper in the "Understanding the Real RDM" series (available from the PAPERS page) for consistency with this post.

“Recently I have read that SQL is actually a data sublanguage and not a programming language like C++ or Java or C# ... The answers ... have the pattern of "No, it is not. Because it's not Turing complete.", etc, etc. ... I am a bit confused, because since you can develop things through SQL, I thought it is similar to other programming languages ... I am curious about knowing why exactly is SQL not a programming language? Which features does it lack? (I know it can't do loops, but what else more?)”

--StackOverflow.com

“The SQL operators were meant to implement the relational algebra as proposed by Dr. Ted Codd. Unfortunately Dr. Codd based some of his ideas on a "extended set theory", which was an idea formulated and described in a 1977 paper by D. L. Childs ... But Childs’ extensions were not ideally suited, which is explained in quite some detail in [a] book ... by Professor Gary Sherman & Robin Bloor [who] argue that mainstream Zermelo-Fraenkel set theory (Cantor), would have been a better starting point. One key issue is that sets should be able to be sets of sets.”

--Dataversity.net

The concept of a sublanguage cannot be understood without foundation knowledge and familiarity with the history of the database management field, both lacking in the industry.

Testing Your Foundation Knowledge

Meaning Criteria and Entity Supertype-Subtypes Relationships

Note: This is a re-write of a previous post.

"I have a database for a school ... [with] numerous tables obviously, but consider these:

CONTACT - all contacts (students, faculty): has fields such as LAST, FIRST, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT."
"Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical? What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows? Or do you have just one of each and have a change log. Which is best?"

How would somebody who "does not know past, or new requirements, modeling, and database design" and messes with a working database just because "he heard something about (insert your favorite fad here)" figure out correct from bad answers? Particularly if the answers suffer from the same lack of foundation knowledge as the question?

Understanding Conceptual vs. Data Modeling Part 4: Properties-object Modeling

Friday, June 21, 2019

Data Meaning and Mining: Knowledge Representation and Discovery

Note: This is a re-write -- prompted by a LinkedIn exchange -- of two columns I published @All Analytics.

“Scientific research experiments that "require assignment of data to tables, which is difficult when the scientists do not know ahead of time what analysis to run on the data, a lack of knowledge that severely limits the usefulness of relational [read: SQL] databases.”

NoSQL DBMSs are recommended in such cases. But what does "scientists do not know ahead of time what analysis to run" really mean?

Data, Information, and Knowledge

One way to view the difference between data, information, and knowledge is:

“1. Data: Categorized sequences of values representing some properties of interest, but if and how they are related is unknown (e.g., research variables in scientific experiments);
2. Information: Properties further organized in named combinations -- "objects", but how they are related is unknown (e.g., "runs", or "cases" in scientific experiments);
3. Knowledge: Relationships among properties and among objects of different types are known.”

--David McGoveran

Normalization and Further Normalization Part 3: Understanding Database Design

Note: This is a re-write of two older posts, to bring them into line with McGoveran's formalization, re-interpretation, and extension[1] of Codd's RDM.

In Part 1 we explained that for a database to be relational, database design must adhere to three core principles, in which case it consists of relations that are by definition in both 1NF and 5NF. In Part 2 we showed that whether tables visualize relations (i.e., are R-tables) can be determined only with reference to the conceptual model that the database designer intended the database to represent (not what any users might think it does). This is obscured by the common and entrenched confusion/conflation of levels of representation and, consequently, of types of model -- conceptual, logical, physical, and data model -- that we have so often debunked[2].

Normalization and Further Normalization Part 2: If You Need Them, You're Doing It Wrong

In Part 1 we outlined some fundamentals of database design, namely the distinction between normalization to 1NF, and further normalization (to "full" 5NF), and explained that they are necessary only to repair poor designs -- if you (1) develop a complete conceptual model and (2) formalize it properly using the RDM, (3) adhering to the three core principles of database design, you should end up with a relational database in both 1NF and 5NF.

Here we apply this knowledge to the typical request for "normalization" help we presented in Part 1.

Normalization and Further Normalization Part 1: Databases Representing ... What?

Note: This is a re-write of older posts (which now link here), to bring them into line with the McGoveran formalization, re-interpretation, and extension[1] of Codd's RDM.

“A particular bug-bear and a mistake that +90% of "data modelers" make, is analyzing "point in time" views of the business data and "normalizing" those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):

SALES-INVOICE
{Invoice-Date,
Customer-Account-ID,
Customer Name,
Invoice-Address-Line-1,
Invoice-Address-Line-2,
Invoice-Address-Line-3,
Invoice-Address-Line-4,
Invoice-Address-Postcode,
Net-Amount,
VAT,
Total-Amount
};

Nearly every time, through the blind application of normalization we get this ... there’s even a term for it -- it’s called "over-normalization":

SALES-INVOICE
{Invoice-Date,
Customer-Account-Id
REFERENCES Customer-Account,
Net-Amount,
VAT,
Total-Amount
};

CUSTOMER-ACCOUNT
{Customer-Account-Id,
Customer-Name,
Invoice-Address
REFERENCES Address
};

ADDRESS
{Address-Line-1,
Address-Line-2,
Address-Line-3,
Address-Line-4,
Postcode
};”

One measure of the scarcity of foundation knowledge in the industry is that attempts to correct the plethora of common misconceptions[2] suffer from the very misconceptions they aim to correct. One of the most common fallacies is the confusion of levels of representation[3], which takes two forms[4]. We have written extensively about the logical-physical confusion (LPC)[5,6,7,8] underlying "denormalization for performance"[9], and the conceptual-logical conflation (CLC) that lumps conceptual with data modeling[10,11,12], inhibiting understanding that the latter is formalization of the former.

Reader Mail: Sets vs. Graphs, Education vs. Training

GK writes:

“I just wanted to drop a note of thanks for the website, especially the latest articles on understanding data modeling, which among other things, explains very nicely the difference between the application of set theory and graph theory. It parallels in the real world with the community (set of data elements) and the individual (node in a network) and how it is easier to connect communities (RDM), but how much more complex it would be to connect individuals directly (GDM) without going through such a community connection arrangement (e.g. e-mail, postal system).”

“I'm currently working out the concept of what I call CMCs or contextual metadata connectors. I'm sure such entities will be heavily dependent upon the usage of RDM to do their job. In the project, I would like to use both approaches (RDM, GDM) due to the power of set theory and graph theory, but exactly where one should do so is so critical.”

“It's exciting to think of the endless potential for AI-based automation when one correctly leverages the underlying principles of data relationships. Since my discovery in 2004 about a much better way to approach test automation which I called data-centric (vs. the code-centric industry standard), I have found that it applies anywhere there is data, as long as one holds to a proper understanding of data and how to view it relationally.”

“What I find very surprising though is how rare it is to find in the I.T. industry a proper understanding of data, especially when viewing it relationally. It is indeed one of the most massively misunderstood aspects of the I.T. industry to this day, as your website alludes to. Rather than running away from it, RDM should be the very first course taught in any program involved in either computer science or information science. Maybe then I wouldn't always be losing people in technical conversations whenever I start talking about it. I see a diamond and they just see carbon.”

While it is nice to have one's work appreciated, the following text appears in every post and cannot be missed:

“Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalytics column. In 2018 that website was discontinued. You will not find its content anywhere else, so if you deem it useful, particularly if you are a regular reader, please back up your appreciation with concrete support -- purchase publications, or regular donations. Thank you.”

Codd was explicit about introducing the set-based RDM to relieve what he called "non-network applications" -- concerned with relationships among groups of entities -- from the complexity burden of directed graphs for network applications concerned with relationships among individual entities. But this too, like so many other aspects of his work, was missed/ignored. Witness the GDBMS revival and promotion as "superior to RDBMSs" (which are confused with SQL DBMSs), without any reference to their distinct application domains.

Furthermore, as we have often pointed out, the older generation GDBMSs were actually not grounded in graph theory, but were abstractions from industry practices, and although the current crop are improvements -- having learned from the RDM -- there is no agreed, formally well defined, theory based graph data model (GDM)[1,2]. If there is, what are -- precisely, please! -- its structure, manipulation, and integrity components?[3].

I am not familiar with CMCs, but extreme care must be exercised with respect to "using both approaches (RDM, GDM) due to the power of set theory and graph theory", to prevent the latter (based on higher logic) from defeating the purpose and advantages of the former (intentionally restricted to FOPL)[4,5].
While I do not disagree with the data-centric vs. code-centric argument, I have serious reservations -- to put it politely -- about the "endless potential of AI-automation", for a multiplicity of reasons that are beyond the scope of this response.

Surprising? Since the late 80s all our writings (at the old DBDebunk, elsewhere, and at this blog), papers, books, and seminars have done nothing but document and explain the lack of knowledge and understanding of data fundamentals in the industry[6,7,8,9,10,11]. It has much to do with the destruction of education and its replacement with tool training[12,13], a component of the decadence and decline of Western civilization. The rich irony of promoting "data science" while discarding the real data science (the RDM) escapes the industry, of course[14,15].

References

[1] Pascal, F., Graph Databases: They Who Forget the Past...

[2] Pascal, F., OO/UML, and "Graph Data Models"

[3] Pascal, F., What Is a Data Model, and What It Is Not.

[4] Pascal, F., Structure, Integrity, Manipulation: How to Compare Data Models.

[5] Pascal, F., Natural, Programming, and Data Language.

[6] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER (October 2016).

[7] Pascal, F., Database Management: No Progress Without Data Fundamentals.

[8] Pascal, F., Industry Practice Is No Substitute for Foundation Knowledge.

[9] Pascal, F., The Cookbook Approach to Data Management.

[10] Pascal, F., Are You a Thinking Data Professional?

[11] Pascal, F., Lenin, Trotsky, Data Management, and the Tyranny of Knowledge and Reason.

[12] Pascal, F., A Note on Education vs. Training.

[13] Pascal, F., Education, Practicality and an Introductory SQL Book.

[14] Pascal, F., The Real Data Science.

[15] Understanding Relations: Tables? So What?

Saturday, May 11, 2019

Understanding Data Modeling Part 5: Conclusions

In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the "data modeling" mess in the industry that Friesendal has tried to catalog, and argued that it can help overcome it. In Part 2 we applied this knowledge to the first two industry "data models" considered by Friesendal -- the E/RM and RDM. In Part 3, we applied it to OO/UML and (a yet formally undefined) GDM, and in Part 4 to Fact Modeling (FM).

Here we apply it to Friesendal's conclusions.

Understanding Data Modeling Part 4: Fact Modeling

In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the "data models" mess in the industry that Friesendal has tried to catalog. In Part 2 we applied this knowledge to the first two industry "data models" considered by Friesendal, the E/RM and the RDM. In Part 3, we applied it to OO/UML and (a yet formally undefined) "GDM". Here we apply it to fact modeling (FM).

Fact Modeling

“... another school of modelers working with "fact modeling". Their approach is not new. It goes back to the 70's, where Eckhard Falckenberg and Sjir Nijssen started working on the approach (in parallel). Fact Modeling was known for many years as Object-Role-Modeling (ORM), and it was supported by the popular Visio diagramming tool at the time that Microsoft bought the company behind Visio. I like Nijssens name “Binary Relationship Modeling” a lot and it has been in the back of my head since the early 80's. Fact Modeling is definitely at the right level (concepts and their relationships), but it also contains all of the logic details required for formal, precise specifications. The visual syntax goes back to: Nijssen, G.M. and T.A. Halpin, Conceptual Schema and Relational Database Design — A fact oriented approach, Prentice Hall 1989.”

Understanding Data Modeling Part 3: OO/UML, and "Graph Data Models"

In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the industry's "data modeling" mess that Friesendal has tried to catalog. In Part 2 we applied this knowledge to the first two modeling approaches considered by Friesendal, the E/RM and RDM. We apply it here to the other two, OO/UML and "GDM".

Object Orientation and Unified Modeling Language

“A "counter revolution" against the relational movement was attempted in the 90’s. Graphical user interfaces came to dominate and they required advanced programming environments. Functionality like inheritance, sub-typing and instantiation helped programmers combat the complexities of highly interactive user dialogs. The corresponding Data Modeling tool is the Unified Modeling Language ...”

Understanding Data Modeling Part 2: "E/RM" and "RDM"

In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the industry's modeling mess that Friesendal has tried to map. We now proceed to apply it to the various industry "data models" considered by Friesendal, and his understanding thereof. In this part, we apply this knowledge to the first two industry "data models" considered by Friesendal -- the E/RM and RDM.

"Entity-Relationship Model"

“One of the first formal attempts at a framework for Data Modeling was the Entity-Relationship data model paradigm proposed [in 1976] by Peter Chen. Notice that in the original Chen-style, the attributes are somewhat independent and the relationships between entities are named and carry cardinalities ("how many" participants in each end of the relationship) ... Attributes are related to their "owner" entity" in what other people called "functional dependencies".”

Understanding Data Modeling Part 1: Models, Models Everywhere, Nor Any Time to Think

“... I needed to know what the constituent parts of data models really are. Across the board, all platforms, all models etc. Is there anything similar to atoms and the (chemical) bonds that enables the formation of molecules? My concerns were twofold ... I wanted a simple, DIY-style, metadata repository for storing 3-level data models -- what would the meta model of such a thing look like? -- [where] atomicity is of essence ... I took a tour (again) in the Data Modeling zone, trying to deconstruct the absolutely essential metadata, which data modelers cannot do without.”
--Thomas Friesendal, The Atoms and Molecules of Data Models, Dataversity.com

All data models? 3-level data models? Platforms? Hhhmmmm!

Graph Databases: They Who Forget the Past...

Out of the plethora of misconceptions common in the industry[1], quite a few are squeezed into this paragraph:

“The relational databases that emerged in the ’80s are efficient at storing and analyzing tabular data but their underlying data model makes it difficult to connect data scattered across multiple tables. The graph databases we’ve seen emerge in the recent years are designed for this purpose. Their data model is particularly well-suited to store and to organize data where connections are as important as individual data points. Connections are stored and indexed as first-class citizens, making it an interesting model for investigations in which you need to connect the dots. In this post, we review three common fraud schemes and see how a graph approach can help investigators defeat them.”

--AnalyticBridge.DataScienceCentral.com

Relational databases did not emerge in the 80s (SQL DBMSs did);

There is no "tabular data" (the relational data structure is the relation, which can be visualized as a table on a physical medium[2], and SQL tables are not relations);

Analysis is not a DBMS function, but an application function (while database queries, as deductions, are an important aspect of analysis, and computational functions can be added to the data sublanguage (as in SQL), the primary function of a DBMS is data management)[3];

A data model has nothing to do with storage (storage and access methods are part of physical implementation, which determines efficiency/performance[4]).

Here, however, we will focus on the current revival (rather than emergence) of graph DBMSs claimed superior -- without any evidence or qualifications -- to SQL DBMSs (not relational, which do not exist) that purportedly "make it difficult to connect data scattered across multiple tables". This is a typical example of how lack of foundation knowledge and of familiarity with the history of the field inhibit understanding and progress[5].
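To make the quoted claim concrete, here is a minimal sketch of "connecting data scattered across multiple tables" in a SQL DBMS, using Python's built-in sqlite3 module. The tables, columns, and rows are hypothetical (they do not come from the quoted article), and since SQL tables are not relations this only approximates what a true RDBMS would offer. The query finds pairs of persons that share a phone number, the kind of "connect the dots" question the quote attributes to graph databases.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE phone  (phone_no TEXT PRIMARY KEY);
    CREATE TABLE person_phone (
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        phone_no  TEXT    NOT NULL REFERENCES phone(phone_no),
        PRIMARY KEY (person_id, phone_no)
    );
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO phone VALUES ('555-0100'), ('555-0199');
    INSERT INTO person_phone VALUES (1, '555-0100'), (2, '555-0100'), (3, '555-0199');
""")

# Pairs of distinct persons that share a phone number: a self-join on the
# association table, then joins back to person to pick up the names.
shared = conn.execute("""
    SELECT a.name, b.name, pa.phone_no
    FROM person_phone AS pa
    JOIN person_phone AS pb
      ON pb.phone_no = pa.phone_no AND pa.person_id < pb.person_id
    JOIN person AS a ON a.person_id = pa.person_id
    JOIN person AS b ON b.person_id = pb.person_id
    ORDER BY a.name, b.name
""").fetchall()

print(shared)  # [('Ann', 'Bob', '555-0100')]
```

The connection is declared in the query, not stored as a pointer; how efficiently it is evaluated is a matter of physical implementation, not of the data model.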

Fourth Order Properties Part 2: Association Relations in Database Design - An Example

Part 1 outlined fundamentals of fourth order properties (4OP) of a multigroup arising from relationships among its group members due to 1:1, M:1, M1:M2, and, generally, M1:M2:M3:...:Mn relationships among the groups' entity members. Fundamentals are commonly missing from database practice, as reflected in the exchange:

“We have Building, Room, and Bed entities. Logically, if this is in the scope of some hypothetical hotel, then each one of those entities is dependent on their parent to exist ... you cannot have a bed without a room. Also, that room wouldn't exist without its parent, Building. So, why have I rarely seen this identifying relationship introduced? When I was learning databases, everything was apparently "non-identifying". When is this type of relationship necessary, if at all? I see the issue arises when that BED can exist without a BUILDING. If you were to INSERT into the BED table, you are constraint [sic] to provide a building_id, as the building_id is part of that BED's primary key. Couldn't you avoid an identifying relationship by giving each table its own surrogate primary key? Is this the correct representation of an identifying relationship? I could avoid that by just giving each table its own ID. At the end of the day, this is about IDENTIFYING relationships, not their existence, which is how I've been logically determining if something is an "identifying relationship" If that were the case, then any 1:N relationship could be "identifying" but that's not how you define identifying or non-identifying.”

“Interesting -- I’d never heard this term before. I’ve heard it referred to as a cached ID though, as that 2nd ID isn’t required, but may be beneficial for performance purposes. For this example with 3 levels it’s not a huge joint statement, but for some systems with 12 tables the joins get unpleasant. I’ve never started a system with this additional id, but I have added one later on once the need was there and the profiling led to this being the best solution for our specific situation. Usually though, just creating a view that does the joins for me has been easier. I’ll be curious what has led others to use this approach.”

“It's not really introduced because it's way more towards academic than functional.”

--Reddit.com

Knowledge of the fundamentals would have obviated the question, the ad-hoc terminology, and the answers. Note in particular how -- notwithstanding the conceptual and logical nature of the question -- the first answer typically delves directly into implementation[1]. The second answer does not merit attention, except as indicator of the sad state of the industry.

Given a conceptual model, we shall now compare the database design proposed in the question with (1) conventional industry practice that includes "embedded foreign keys", and (2) assuming a true RDBMS, the unified representation using association relations of not just M1:M2, or, generally, M1:M2:M3:...:Mn 4OP relationships, but also the 1:1 and M:1 special cases[2].
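As a rough sketch of the two designs being compared, the following uses Python's built-in sqlite3 module for the Room-to-Building M:1 relationship only (Bed-to-Room would follow the same pattern). All table and column names are assumptions made for illustration, the association design is one possible reading of "association relations", and SQL tables only approximate relations, so this is not the design a true RDBMS would support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    -- (1) Conventional practice: the M:1 relationship is an embedded foreign key.
    CREATE TABLE building_fk (building_id INTEGER PRIMARY KEY);
    CREATE TABLE room_fk (
        room_id     INTEGER PRIMARY KEY,
        building_id INTEGER NOT NULL REFERENCES building_fk(building_id)
    );

    -- (2) Association style: entity tables carry no relationship columns ...
    CREATE TABLE building (building_id INTEGER PRIMARY KEY);
    CREATE TABLE room     (room_id     INTEGER PRIMARY KEY);
    -- ... and the relationship gets a table of its own. Keying it on room_id
    -- alone makes it M:1; a composite key would make it M1:M2 instead.
    CREATE TABLE room_in_building (
        room_id     INTEGER PRIMARY KEY REFERENCES room(room_id),
        building_id INTEGER NOT NULL    REFERENCES building(building_id)
    );

    INSERT INTO building VALUES (1), (2);
    INSERT INTO room VALUES (10);
    INSERT INTO room_in_building VALUES (10, 1);
""")

def try_insert(sql):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# The same room cannot be placed in a second building (key on room_id).
second_building = try_insert("INSERT INTO room_in_building VALUES (10, 2)")
# A room that does not exist cannot take part in the relationship.
unknown_room = try_insert("INSERT INTO room_in_building VALUES (99, 1)")

print(second_building, unknown_room)  # False False
```

In the association design the same structure covers 1:1, M:1, and M1:M2 relationships; only the key of the association table changes.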

Fourth Order Properties Part 1: Association Relations vs. Foreign Keys

“We have Building, Room, and Bed entities. Logically, if this is in the scope of some hypothetical hotel, then each one of those entities is dependent on their parent to exist ... you cannot have a bed without a room. Also, that room wouldn't exist without its parent, Building. So, why have I rarely seen this identifying relationship introduced? When I was learning databases, everything was apparently "non-identifying". When is this type of relationship necessary, if at all? I see the issue arises when that BED can exist without a BUILDING. If you were to INSERT into the BED table, you are constraint [sic] to provide a building_id, as the building_id is part of that BED's primary key. Couldn't you avoid an identifying relationship by giving each table its own surrogate primary key? Is this the correct representation of an identifying relationship? I could avoid that by just giving each table its own ID. At the end of the day, this is about IDENTIFYING relationships, not their existence, which is how I've been logically determining if something is an "identifying relationship" If that were the case, then any 1:N relationship could be "identifying" but that's not how you define identifying or non-identifying.”

“Interesting -- I’d never heard this term before. I’ve heard it referred to as a cached ID though, as that 2nd ID isn’t required, but may be beneficial for performance purposes. For this example with 3 levels it’s not a huge joint statement, but for some systems with 12 tables the joins get unpleasant. I’ve never started a system with this additional id, but I have added one later on once the need was there and the profiling led to this being the best solution for our specific situation. Usually though, just creating a view that does the joins for me has been easier. I’ll be curious what has led others to use this approach.”

“It's not really introduced because it's way more towards academic than functional.”

--Reddit.com

Such questions, and ad-hoc terms like "identifying relationships"[1], come up because practice is driven by intuition and experience (if any), without the benefit of foundation knowledge[2]. Whether practitioners know/like it or not, a database is a formal computable representation of an informal conceptual model[3] and, therefore, data modeling (i.e., logical database design)[4] is impossible without (1) a well-defined and complete conceptual model and (2) a formal data model with which to formalize it as a logical model[5] -- and the two should not be confused[6]. Otherwise all bets are off.

Here's how foundation knowledge should have informed modeling and design.

Class, Type, Set, Relvar, and Relation

Note: This is a rewrite of a part of an older post (now redirecting here), to bring it into line with McGoveran's formalization, re-interpretation, and extension of Codd's RDM[1] (the rewrite of the other part was posted last week).

“[According to Date] relvar ≠ class. [But i]n simple terms, class applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a relation is a class (and trivially also a set), which contributes to confusion.”

“In modern programming parlance, class is generally distinguished from type only in that the latter refers to primitive (system-defined) data definitions, while class refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, type and class are actually synonymous.”

Class, type, and set are often used interchangeably in the industry. Relations are neither class, nor type, and Date's relvars must be placed properly in their formal context. While details regarding these concepts vary with the flavor of set theory, they are sufficiently well defined to be distinguishable in each of the three formal foundations of the RDM: simple set theory (SST), mathematical relation theory, and first order predicate logic (FOPL).

Understanding Domains and Attributes

Note: This is a rewrite of one section of an older post (page thereof now links here), to bring it into line with McGoveran's formalization, re-interpretation, and extension of Codd's RDM[1]. The rewrite of the other part will be posted next.

“I don't understand the concepts of domain and attribute in relational database modeling. Can someone give me an effective example?”

“Domain is an overloaded word in the DB lexicon. It probably should also be avoided. When one refers to an attribute domain in practice it is only referring to columns that have a check constraint on them that limit the values. Reference tables with foreign key constraints in general also fulfill the spirit of what domain attributes do outside of an RDBMS.”

“A domain in most SQL usage is essentially an alias name for an existing type + restrictions on an existing type that can be used in a column. As for an attribute, it's essentially a COLUMN in SQL, a field in other types of databases, etc.”

To the extent that practitioners are familiar with domains, they equate them with programming data types (PDT), or, at best, with SQL data types.

Test your foundation knowledge -- are domains the same as PDTs or SQL data types?

Data and Meaning Part 4: Query and Result Correctness

As we have seen in Parts 1, 2, and 3, the RDM is a formal theory adapted and applied to database management: database relations (1) preserve the formal properties of mathematical relations, but also (2) have interpretations -- carry a real world meaning assigned by a conceptual model: facts about entities, entity groups, and multigroups (i.e., their properties, some of which are relationships, specified by business rules (BR)). A relation is formally in 5NF and constrained for semantic consistency (i.e., to represent facts about an entity group).

“When we create specific domains, relations, and attributes we are constraining (restricting) an abstract logical system to a specific interpretation (meaning). Seen the other way around, an interpretation of the logical system is a representation of a specific segment of the world, and that is exactly the purpose of database design. For example, an attribute name created by the designer is assigned meaning intended by the modeler as representing an entity property, which is the very meaning of semantics. That is why full normalization cannot be achieved or assessed without reference to some conceptual model -- what attribute names mean, and how they are related to each other (i.e., their dependencies), and so on.” --David McGoveran

Yet requesting and giving design advice without a conceptual model is routine in the industry[1]. What is more, most practitioners are oblivious to the implications for correctness of queries and results[2].

Data and Meaning Part 3: Database Design

We have seen in Part 2 that the meaning of data in a database is the conceptual model that the database is intended to represent, namely (1) the three types of objects -- entities of multiple types that form entity groups that form a multigroup -- and (2) the business rules (BR) that specify their properties:

Properties in context (PiC) shared by entities of each type;
Collective group properties (i.e., relationships among entity group members);
Multigroup properties (i.e., inter-group relationships).

Often somebody produces one or more tables and asks if there's "anything wrong" with them, or "if they are in some specific normal form and, if not, how to normalize them". This reflects lack of foundation knowledge.

Data and Meaning Part 2: Types of Business Rules

Per Part 1, meaning is captured during conceptual modeling as information about objects of interest, specifically their properties (some of which are relationships), specified in business rules (BR). Because they are expressed informally in natural language, objects and BRs must be formalized into computable form. Data modeling (we prefer logical database design) uses a formal data model to formalize informal conceptual models as formal logical models for database representation: it assigns the meaning in the former to symbols and expressions in the latter[2]. Using the RDM:

Objects -- entities, entity groups, and multigroups -- formalize as tuples, relations, and databases, respectively;
Properties formalize as domains, and when associated with entities of specific types, as attributes;
Group and multigroup properties -- relationships among entities, and among groups[3] -- formalize as constraints on and among relations enforceable by the DBMS.
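The mapping above can be mimicked loosely in a few lines of Python. This is only a sketch under stated assumptions: the employee and department names are hypothetical, Python sets stand in for relations, and plain functions stand in for constraints that a DBMS would enforce declaratively.

```python
from typing import NamedTuple

# A property formalizes as a domain: here, a hypothetical set of admissible values.
DEPT_NAMES = frozenset({"Sales", "Research"})

class Employee(NamedTuple):
    """A tuple represents one entity; its fields are attributes."""
    emp_no: int
    dept: str  # the DEPT_NAMES domain associated with entities of type Employee

class Department(NamedTuple):
    dept: str

# A relation represents an entity group: a set of tuples, so the duplicate
# Employee(1, "Sales") below collapses into a single member.
employees = frozenset({Employee(1, "Sales"), Employee(2, "Research"), Employee(1, "Sales")})
departments = frozenset({Department("Sales"), Department("Research")})

# A database represents a multigroup: a collection of named relations.
database = {"employees": employees, "departments": departments}

def domain_ok(db):
    """Attribute constraint: every dept value is drawn from its domain."""
    return all(e.dept in DEPT_NAMES for e in db["employees"])

def key_ok(db):
    """Group property: emp_no uniquely identifies each employee."""
    emps = db["employees"]
    return len({e.emp_no for e in emps}) == len(emps)

def reference_ok(db):
    """Multigroup property: every employee's dept exists among the departments."""
    known = {d.dept for d in db["departments"]}
    return all(e.dept in known for e in db["employees"])

print(len(employees), domain_ok(database), key_ok(database), reference_ok(database))
```

The functions check a database value after the fact; in a DBMS the corresponding constraints would be declared once and enforced on every update.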
