Friday, November 27, 2020

Oldies But Goodies: Missing Data - "Horizontal Decomposition" Part 2


Note: To demonstrate the correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.

In Part 1 we re-published a reader's response to "horizontal decomposition" (Hugh Darwen's How to Handle Missing Information without Using NULLs) in comparison to our The Final NULL in the Coffin: A Relational Solution to Missing Data. Here's Hugh's response.

Sunday, November 22, 2020

Oldies But Goodies: Missing Data - "Horizontal Decomposition" Part 1

Note: To demonstrate the correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.
 

“I'm excited to share a data.world research partnership with Prof Leonid Libkin and Paolo Guagliardo from The University of Edinburgh. Our goal is to understand how NULL values are used in the real word to bridge theory and practice. Please help us by participating in a survey.”


Thus a recent announcement on LinkedIn, which triggered reactions in praise of this "much needed effort".

Sigh! SQL's NULL is a blunder unworthy of research. The commonly used "NULL value" is a contradiction in terms, indicating that industry surveys are not a path to enlightenment. The real issue is, of course, missing data, which is governed by long-studied and well-understood logic[1,2,3,4], though apparently not understood in the industry or in today's academia.

In 2004 we published The Final NULL in the Coffin: A Relational Solution to Missing Data (a paper revised since) that we believe is theoretically sound and, importantly, consistent with McGoveran's work re-interpreting, extending and formalizing Codd's RDM[5]. At the time it generated a series of exchanges with readers, which were posted at the old DBDebunk (2000-2006). In light of the above they warrant re-production.

I start with the first, split into three parts: this Part 1 presents a reader's reaction to both our solution and Hugh Darwen's "horizontal decomposition" alternative, How to Handle Missing Information without Using NULLs; Hugh's reply is in Part 2; and mine -- rewritten for clarity and to bring it up to date with the current state of knowledge -- is in Part 3.

Note: As far as we know, Darwen no longer abides by that approach -- in a later paper he referred to a "multi-relation" which seems an allusion to our solution -- but the exchange is useful for pedagogical reasons.

Thursday, October 29, 2020

Oldies But Goodies: Database Design Correctness

Note: To demonstrate the correctness and stability of database designs provided by a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" posts from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions they were intended to dispel (I may break long pieces into multiple posts, and add comments and references).

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalytics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available. Thank you.

LATEST UPDATES
- 07/22/20: LINKS update: Added “An Argument for Controlled Natural Languages in Mathematics”, “Let’s Make Set Theory Great Again”.
- 07/21/20 LINKS update: Added “How Gödel’s Proof Works”.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017, were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my AllAnalytics columns no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may still work.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
- @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

On Thinking: The Horror! 

(originally posted March 4, 2005)


RK:
"I am a long time follower of your website and enjoy it to no end. Not that I enjoy seeing people make fools of themselves, but the reading is enjoyable. I watch a number of groups and I found a thread that just is full of garbage. I know you probably get tons of these, but you really need to read this. Its amazing the things people come up with. If I wanted to think of crazy things to say, I don’t know if I could come up with these. Below is the link to the thread. I think you would find some great quotes of the week from this one."

database design question
"Keep up the fight. I have taken on our modeling tasks at our company some time ago. I fight all kinds of crazy ideas that people think will make their life easier, but really don’t."


Fabian Pascal:
I don't enjoy stupidity and ignorance either. Debunking is a dirty job, but somebody's gotta do it. [I may debunk that thread if I ever get to it].


RK:
"You're right, I should not enjoy reading threads like this. I guess I probably should have tried to respond with a correct answer to his question. I just read stuff like this and wonder what is going on in people's heads to come up with this stuff. Do they just stop thinking completely?"


Fabian Pascal:
"It's not entirely their fault, they were never taught to think [independently and critically], never rewarded [but] actually punished for such."

 
RK:
"I guess. It’s just so amazing to read what they say, and when you do think about it, how crazy it sounds. Over time as I get older and continue to learn, I understand first that there is always more to learn, but before accepting something, I want to know the foundation behind the concept. That generally reveals the truth. I look forward to additional updates on your site. Some of us do listen and enjoy hearing you speak on any number of topics."

 
Fabian Pascal:
May have something to do with your education. Currently there is none to be had really, and none is required. In fact, it is discouraged.

Incidentally, one of the comments in the thread is by Daniel Morgan, an Oracle guy who teaches at the [evening program @the] University of Washington, with whom I had an exchange out of which I posted as a weekly quote [one of his] absurd comments. Here's [his reaction] to the question that initiated the thread you referred me to:


gkelly:
"I have a database for a school that has been in use now for a couple of years and it is working well. There are numerous tables obviously but consider these:
  • CONTACT - holds all contacts - students, faculty, or any other type of contact - (probably should have called it Entity);
  • FACULTY - holds info about specific faculty member - foreign key to contact;
  • STUDENT - holds info about specific student - foreign key to contact.
Considering that the contact table has fields such as: LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL ... Do you think it is a good idea to have a single table hold such info. Or, would you have had the tables FACULTY and STUDENT store the LAST, FIRST, ADDR and other fields? At what point do you de-normalize for the sake of being more practical?"
DA Morgan:
"One table name PERSON. Anything else is a denormalized nightmare."

If the teachers are ignorant and can't think, what can we expect of their students?


RK:
"This is very true, that is why I try to focus on people being able to solve problems. I figure if someone gets a degree from a good school and the degree is in some type of applied math or some other program like this, then basic problem solving should be there. I do push to hear more about that. I don’t care what tools they have learned I care what type of foundation they have. This is not very exact but it’s all you can do these days.

I was lucky. In my education a large focus was on mathematics so I feel that I did learn how to solve problems. Also in the CS classes I did take, one was on relational databases and we spend 80–90 percent of the time studying theory, where we used Date’s INTRODUCTION TO DATABASE SYSTEMS as our text. We spent a very small amount of time in the end looking at current tools -- at the time Ingres. Looking back that was the most valuable class I ever took. I still keep Date’s book next to me and use it very often in discussions with other people."


Fabian Pascal:
Yup, that used to be the case. No more. It's getting worse and worse and it shows.


Comments on republication

If you do not understand what's absurd about Morgan's comment, education on data fundamentals is warranted.

As is almost always the case, a set of tables is given that supposedly visualizes a set of database relations (i.e., part of a logical database) without (1) the conceptual model (all properties and all relationships among the entity types) that the relations are supposed to represent and (2) the semantic constraints on the relations that are supposed to represent those relationships. Without those it is impossible to determine full normalization (i.e., whether the tables are R-tables that visualize relations). Specifically, depending on the specification of the entity types (i.e., properties), it is likely that entity supertype-subtypes are involved, in which case the logical design is incorrect (how?).

The point is that correctness of a logical model (database design) can only be established with reference to a complete conceptual model. Whoever does not understand this fundamental is not a competent data professional.

Note very carefully, however, that while it's unclear whether the three-table design is correct or not, it is Morgan's PERSON table that is a "denormalized nightmare" (i.e., is not an R-table) -- why? In other words, he has it upside down and backwards -- he does not understand what full normalization is.

Saturday, October 17, 2020

Understanding Codd's 12 Rules for RDBMS

In response to an online publication of a book appendix regurgitating Codd's 12 famous rules (some of which were, typically, incorrect[1]) I posted earlier a clarification of the rules. This is a revision thereof for better consistency with the new understanding of the RDM based on McGoveran's re-interpretation, extension and formalization[2] of Codd's work.

Saturday, September 26, 2020

TYFK: Nothing to Do With Relational

“The relational calculus is good in describing sets. But it's bad at describing relations between data in different sets. Explicit identities (primary keys) need to be introduced and normalization is needed to avoid update inconsistencies due to duplication of data. To say it somewhat bluntly: The problem with the relational calculus and RDBMS etc. is the focus on data. It's seems to be so important to store the data, that connecting the data moves to the background. That might be close to how we store filled in paper forms. But it's so unlike how the mind works. There is no data stored in your brain. If you look at the fridge in your kitchen, there is no tiny fridge created in your brain so you can take the memory of your fridge with you, when you leave your kitchen.” --Weblogs.asp.net
 

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

The lack of foundation knowledge exposed by the above paragraph is so complete that its claims are practically upside down and backwards.

Fundamentals

As we have demonstrated, in mathematical set theory a relation (set) is a subset of a cross-product of domains (sets). In other words, it is a set that is a relationship among sets. Being abstract (i.e., having no real world meaning), the values of mathematical relations can be arbitrary.
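To make the set-theoretic definition concrete, here is a minimal Python sketch. The domains `names` and `depts` and their values are hypothetical, chosen only for illustration; since this is a purely mathematical relation, the chosen subset is arbitrary and carries no real world meaning.

```python
from itertools import product

# Hypothetical domains (sets), used only for illustration.
names = {"Ann", "Bob"}
depts = {"Sales", "R&D"}

# The cross-product of the domains: every possible (name, dept) pair.
cross_product = set(product(names, depts))

# A mathematical relation is any subset of that cross-product.
relation = {("Ann", "Sales"), ("Bob", "R&D")}

# True when every tuple of the relation is drawn from the cross-product.
is_relation = relation <= cross_product
```

A database relation differs in that its tuples are not arbitrary: they must also satisfy the semantic constraints discussed next.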

The RDM is an application of simple set theory expressible in first order predicate logic (SST/FOPL) to database management: a relational database represents a conceptual model of some reality, namely (facts about) a multigroup in the real world -- a collection of related entity groups -- each database relation representing one such group; a database is also a set of related relations. The values in database relations (i.e., the data) are, thus, not arbitrary, but must be consistent with the conceptual model: relations and the database as a whole are semantically constrained to be consistent with (1) individual properties of entities and (2) collective properties of (a) groups (i.e., relationships among entities within groups), and (b) the multigroup (i.e., relationships among groups).

A primary key (PK) represents names given in the real world to entities of a given type, and the corresponding PK constraint (uniqueness) enforces consistency of a relation with the distinguishability of those entities in the real world, the facts about which it represents. These are not RDM artifacts, but rather part of the adaptation of SST/FOPL to database management.
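As a rough sketch of what the uniqueness requirement amounts to, assume tuples are modeled as Python dicts. The function `satisfies_pk` and the attribute names `cno` and `cname` are hypothetical and not part of the RDM or of any DBMS; a real RDBMS would enforce the constraint declaratively rather than by a procedural check like this one.

```python
def satisfies_pk(tuples, key_attrs):
    """Return True if no two tuples share the same values for key_attrs."""
    seen = set()
    for t in tuples:
        key = tuple(t[a] for a in key_attrs)
        if key in seen:
            # Two "entities" would be indistinguishable by name.
            return False
        seen.add(key)
    return True


# Hypothetical sample data for illustration only.
companies = [
    {"cno": 1, "cname": "Acme"},
    {"cno": 2, "cname": "Globex"},
]
with_duplicate_key = companies + [{"cno": 1, "cname": "Initech"}]

pk_holds = satisfies_pk(companies, ["cno"])
pk_violated = not satisfies_pk(with_duplicate_key, ["cno"])
```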

For the primary advantage of the RDM -- guaranteed correctness of query results (i.e., inferences made from the database) -- to materialize, logical database design must adhere to three core principles which, jointly, imply fully normalized relations (5NF). In fact, in the RDM relations are in 5NF by definition; otherwise they are not relations -- relational algebra (RA) operations lose information and all bets are off.

The RA is the manipulative component of the RDM -- a collection of primitive and derived set operations on relations that describe relationships among relations. For example, the join operation r1 JOIN r2 describes a relationship between relations r1 and r2, the result itself being a relation. Note that the result of an RA operation on even a single relation is always a relation and still describes a relationship -- between the "input" and "output" relations.
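The closure property can be sketched in Python under a simplifying assumption: a relation is modeled as a set of tuples, each tuple a frozenset of (attribute, value) pairs. The helpers `tup`, `natural_join` and `project` and the sample relations are hypothetical, written only for illustration, and ignore domains and constraints entirely.

```python
def tup(**attrs):
    """Build a tuple as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def natural_join(r1, r2):
    """Join two relations on their common attributes; the result is a relation."""
    result = set()
    for t1 in r1:
        d1 = dict(t1)
        for t2 in r2:
            d2 = dict(t2)
            common = d1.keys() & d2.keys()
            if all(d1[a] == d2[a] for a in common):
                result.add(frozenset({**d1, **d2}.items()))
    return result


def project(r, attrs):
    """Project one relation onto attrs; duplicates collapse because it is a set."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}


# Hypothetical sample relations.
company = {tup(cno=1, cname="Acme"), tup(cno=2, cname="Globex")}
employee = {tup(eno=10, cno=1), tup(eno=11, cno=1), tup(eno=12, cno=2)}

joined = natural_join(company, employee)    # an operation on two relations
company_nos = project(employee, {"cno"})    # an operation on one relation
```

Both results are again sets of tuples, so they can serve as input to further operations.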

A data model -- and, industry claims notwithstanding, the only one satisfying Codd's definition that has been formalized is the RDM -- is by nature focused on data. However, the RDM supports physical independence (PI) and is, thus, not concerned with how data is physically stored and accessed. The notion of storing "filled in paper forms" is an example of the common and entrenched logical-physical confusion (LPC) due to failure to understand the distinction between a logical relation and its tabular visualization on a physical medium, induced/reinforced by the industry's "direct image" implementation of SQL DBMSs.

Conclusion


We rephrase the above paragraph as follows:

“The relational algebra describes relationships among relations (sets). Primary keys are one of the adaptations of the SST/FOPL for database management: a PK constraint -- uniqueness -- represents formally in the database a within-group relationship among all its entities.

Adherence to three core design principles is mandatory: jointly they imply full normalization, which is necessary to guarantee correctness of query results. True RDBMSs:

  • Implement the RA for logical data retrieval independent of how the data is physically stored and accessed. SQL DBMSs notwithstanding, vendors are free to store data whichever way they want as long as they don't expose it to users in applications.
  • Enforce relational constraints that are formal database representations of relationships in the conceptual model represented by the database.”

The "brain" stuff is sheer nonsense.

Friday, September 4, 2020

Oldies But Goodies: Relationships and Relations


Note: To demonstrate the stability afforded by a sound foundation relative to the industry's fad-driven cookbook practices, I am re-publishing under "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my debunkings hold up, and whether the industry has progressed beyond the misconceptions they were intended to dispel. I may break long pieces into more pointed parts, and add comments and references to further reading.

From "Little Relationship to Relational" originally posted on March 29, 2001.

“Given the depth and complexity of Codd's thought, not to mention the arcane terms in which he often expressed himself, it is not difficult to grasp why so many of his key points have been widely misunderstood. Even programmers still often misconstrue the technical term "relational". The relational in relational theory refers to relations and not relationships. A relation is a special set of similar objects commonly modeled as entities or as database tables. Relationships may exist between these relations and if your relations are entities you could easily represent the whole thing using a Relational Entity Relationship approach. To elucidate a simple practical example, if you had a company table and an employee table and each company record could have many employee records associated with it, you would have two relations and one relationship. The relations would be the sets of similar objects found in the Employee and Company tables and the relationship would be the association between them. In this case one company to many employees.”
Codd's thought was very deep indeed -- new implications are still being derived from his original ideas -- but one major objective of relational technology, now almost forgotten, is simplicity. There is little that is complex in relational technology; in fact, it is the simplest approach possible. Any other general approach is more complex.

It is true that Codd, as a mathematician, did not present his ideas in a way comprehensible to the average practitioner. But it is also true that he had to use different terminology in order to distinguish his precise concepts from the fuzzy, problematic terms already used in the industry. It is also true that, as I argued in the first editorial launching this site, practitioners are so steeped in complex implementation details and devoid of education in fundamentals, that they have a hard time understanding simple logical concepts. It is rather ironic that the author of the article himself reveals some misunderstanding of his own. To clarify:

  • formally a relation is a set of tuples, representing propositions about the real world.
  • informally, a relational table can be viewed as representing an "entity type", with rows representing "entities" of that type.
But note carefully that:
  • "entity" has no precise, formal definition
  • "relationship" can and should be regarded as a special case of "entity"

Comments on re-publication: 
  • A relation is a relationship among domains that is constrained semantically to represent in the database real world relationships within and among entity groups. 
  • We no longer use R-table as a substitute for relation -- it is a visualization of a relation on some physical medium that plays no role in the RDM. Note that constraints are not visible in an R-table.
  • A relationship can be (1) among entities within an entity group, in which case it is a collective property of the group and is represented by a constraint or (2) between groups, in which case it is represented by an associative relation.


Further Reading

The Interpretation and Representation of Database Relations (Codd 1969-70)

Logical Symmetric Access, Data Sublanguage, Kinds of Relations, Redundancy and Consistency (Codd 1969-70)

What Relations Really Are and Why They Are Important

Understanding Relations series

Levels of Representation: Conceptual Modeling, Logical Database Design and Physical Implementation

Understanding Conceptual vs. Data Modeling series

Conceptual Modeling Is Not Data Modeling

Relationships and the RDM series

Relations & Relationships

Relationships, Rules, Relations and Constraints


What Is A Database Relationship



Friday, August 28, 2020

TYFK: Denormalization Part 1 - Fundamentals


“Main Question: How do we trade-off while doing denormalization?
  • Sub-question 1: the standard to implement
    - Do we always have to denormalize a model? For what kind of project must we use denormalization techniques while others may not?
    - Since denormalization has its gains and losses, how well should we denormalize a data model? Perhaps, the more complete we denormalize, the more complex, uncertain and poor the situation will be.
  • Sub-question 2: the characteristics of normalization
    - Does denormalization have several levels/forms the same as that of normalization? For instance: 1DNF, 2DNF...
    - Given we can denormalize a data model, it may never be restored to the original one because to do normalization, one can have many ways while to build a data model, you can have multiple choices in determining entities, attributes, etc.”

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

In Part 1 we discuss the relevant fundamentals in which we will ground the debunking in Part 2.
