Sunday, November 22, 2020

Oldies But Goodies: Missing Data - "Horizontal Decomposition" Part 1

Note: To demonstrate the correctness and stability of a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may break long pieces into multiple posts, revise, and/or add comments and references.
 

“I'm excited to share a data.world research partnership with Prof Leonid Libkin and Paolo Guagliardo from The University of Edinburgh. Our goal is to understand how NULL values are used in the real word to bridge theory and practice. Please help us by participating in a survey.”


Thus a recent announcement on LinkedIn, which triggered reactions in praise of this "much needed effort".

Sigh! SQL's NULL is a blunder unworthy of research. The commonly used "NULL value" is a contradiction in terms, indicating that industry surveys are not a path to enlightening. The real issue is, of course, missing data, which is governed by long studied and well understood logic[1,2,3,4], though apparently not in the industry and today's academia.

In 2004 we published The Final NULL in the Coffin: A Relational Solution to Missing Data (a paper revised since) that we believe is theoretically sound and, importantly, consistent with McGoveran's work re-interpreting, extending and formalizing Codd's RDM[5]. At the time it generated a series of exchanges with readers, which were posted at the old DBDebunk (2000-2006). In light of the above they warrant re-production.

I start with the first, split in three parts: In this Part 1 a reader's reaction to both our solution and Hugh Darwen's "horizontal decomposition" alternative, How to Handle Missing Information without Using NULLs; Hugh's reply is in Part 2 and mine -- re-written to bring up to date with current state of knowledge and for clarity --
is in Part 3.

Note: As far as we know, Darwen no longer abides by that approach -- in a later paper he referred to a "multi-relation" which seems an allusion to our solution -- but the exchange is useful for pedagogical reasons.

Thursday, October 29, 2020

Oldies But Goodies: Database Design Correctness

Note: To demonstrate the correctness and stability of databas designs provided by a sound foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" post from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions they were intended to dispel (I may break long pieces into multiple posts, and add comments and references).

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST UPDATES
-
07/22/20: LINKS update: Added “An Argument for Controlled Natural Languages in Mathematics”, “Let’s Make Set Theory Great Again”.
- 07/21/20 LINKS update: Added “How Gödel’s Proof Works”.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
- @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

On Thinking: The Horror! 

(originally posted March 4, 2005)


RK:
"I am a long time follower of your website and enjoy it to no end. Not that I enjoy seeing people make fools of themselves, but the reading is enjoyable. I watch a number of groups and I found a thread that just is full of garbage. I know you probably get tons of these, but you really need to read this. Its amazing the things people come up with. If I wanted to think of crazy things to say, I don’t know if I could come up with these. Below is the link to the thread. I think you would find some great quotes of the week from this one."

database design question
"Keep up the fight. I have taken on our modeling tasks at our company some time ago. I fight all kinds of crazy ideas that people think will make their life easier, but really don’t."


Fabian Pascal:
I don't enjoy stupidity and ignorance either. Debunking is a dirty job, but somebody's gotta do it. [I may debunk that thread if I ever get to it].


RK:
"You're right, I should not enjoy reading threads like this. I guess I probably should have tried to respond with a correct answer to his question. I just read stuff like this and wonder what is going on in people's heads to come up with this stuff. Do they just stop thinking completely?"


Fabian Pascal:
"It's not entirely their fault, they were never taught to think [independently and critically], never rewarded [but] actually punished for such."

 
RK:
"I guess. It’s just so amazing to read what they say, and when you do think about it, how crazy it sounds. Over time as I get older and continue to learn, I understand first that there is always more to learn, but before accepting something, I want to know the foundation behind the concept. That generally reveals the truth. I look forward to additional updates on your site. Some of us do listen and enjoy hearing you speak on any number of topics."

 
Fabian Pascal:
May have something to do with your education. Currently there is none to be had really, and none is required. In fact, it is discouraged.

Incidentally, one of the comments in the thread is by Daniel Morgan, an Oracle guy who teaches at the [evening program @the] University of Washington, with whom I had an exchange out of which I posted as a weekly quote [one of his] absurd comments. Here's [his reaction] to the question that initiated the thread you referred me to:


gkelly:
"I have a database for a school that has been in use now for a couple of years and it is working well. There are numerous tables obviously but consider these:
  • CONTACT - holds all contacts - students, faculty, or any other type of contact - (probably should have called it Entity);
  • FACULTY - holds info about specific faculty member - foreign key to contac;
  • STUDENT - holds info about specific student - foreign key to contact.
Considering that the contact table has fields such as: LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL ... Do you think it is a good idea to have a single table hold such info. Or, would you have had the tables FACULTY and STUDENT store the LAST, FIRST, ADDR and other fields? At what point do you de-normalize for the sake of being more practical?"
DA Morgan:
"One table name PERSON. Anything else is a denormalized nightmare."

If the teachers are ignorant and can't think, what can we expect of their students?


RK:
"This is very true, that is why I try to focus on people being able to solve problems. I figure if someone gets a degree from a good school and the degree is in some type of applied math or some other program like this, then basic problem solving should be there. I do push to hear more about that. I don’t care what tools they have learned I care what type of foundation they have. This is not very exact but it’s all you can do these days.

I was lucky. In my education a large focus was on mathematics so I feel that I did learn how to solve problems. Also in the CS classes I did take, one was on relational databases and we spend 80–90 percent of the time studying theory, where we used Date’s INTRODUCTION TO DATABASE SYSTEMS as our text. We spent a very small amount of time in the end looking at current tools -- at the time Ingres. Looking back that was the most valuable class I ever took. I still keep Date’s book next to me and use it very often in discussions with other people."


Fabian Pascal:
Yup, that used to be the case. No more. It's getting worse and worse and it shows.


Comments on republication

If you do not understand what's absurd about Morgan's comment, education on data fundamentals is warranted.

As is almost always the case, a set of tables is given that visualizes, supposedly, a set of database relations (i.e., part of a logical database) without (1) the conceptual model (all properties and all relationships among the entity types) which the relations are supposed to represent (2) the semantic constraints on the relations which are supposed to represent those relationships. Without those it is impossible to determine full normalization (i.e., whether the tables are R-tables that visualize relations). Specifically, depending on the specification of the entity types (i.e., properties) it is likely that entity supertype-subtypes are involved, in which case the logical design is incorrect (how?).

The point is that correctness of a logical model (database design) can only be established with reference to a complete conceptual model. Whoever does not understand this fundamental is not a competent data professional.

Note very carefully, however, that while it's unclear whether the three-table design is correct or not, it it Morgan's PERSON table that is a "denormalized nightmare" (i.e., is not a R-table) -- why? In other words, he has it upside down and backwards -- he does not understand what full normalization is.

Saturday, October 17, 2020

Understanding Codd's 12 Rules for RDBMS

In response to an online publication of a book appendix regurgitating Codd's 12 famous rules (some of which were, typically, incorrect[1]) I posted earlier a clarification of the rules. This is a revision thereof for better consistency with the new understanding of the RDM based on McGoveran's re-interpretation, extension and formalization[2] of Codd's work.

Saturday, September 26, 2020

TYFK: Nothing to Do With Relational

“The relational calculus is good in describing sets. But it´s bad at describing relations between data in different sets. Explicit identities (primary keys) need to be introduced and normalization is needed to avoid update inconsistencies due to duplication of data. To say it somewhat bluntly: The problem with the relational calculus and RDBMS etc. is the focus on data. It´s seems to be so important to store the data, that connecting the data moves to the background. That might be close to how we store filled in paper forms. But it´s so unlike how the mind works. There is no data stored in your brain. If you look at the fridge in your kitchen, there is no tiny fridge created in your brain so you can take the memory of your fridge with you, when you leave your kitchen.” --Weblogs.asp.net
 

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

The lack of foundation knowledge exposed by the above paragraph is so complete that its claims are practically upside down and backwards.

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST UPDATES
-
07/22/20: LINKS update: Added “An Argument for Controlled Natural Languages in Mathematics”, “Let’s Make Set Theory Great Again”.
- 07/21/20 LINKS update: Added “How Gödel’s Proof Works”.

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
- @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

Fundamentals

As we have demonstrated, in mathematical set theory a relation (set) is a subset of a cross-product of domains (sets). In other words, it is a set that is a relationship among sets. Being abstract (i.e., having no real world meaning), the values of mathematical relations can be arbitrary.

The RDM is an application of simple set theory expressible in first order predicate logic (SST/FOPL) to database management: a relational database represents a conceptual model of some reality, namely (facts about) a multigroup in the real world -- a collection of related entity groups -- each database relation representing one such group; a database is also a set of related relations. The values in database relations (i.e., the data) are, thus, not arbitrary, but must be consistent with the conceptual model: relations and the database as a whole are semantically constrainted to be so consistent: (1) individual properties of entities and (2) collective properties of (a) groups (i.e., relationships among entities within groups), and (b) the multigroup (i.e., relationships among groups).

A primary key (PK) represents names given in the real world to entities of a given type, and the corresponding PK constraint (uniqueness) enforces consistency of a relation with the distinguishability of those entities in the real world, the facts about which it represents. These are not RDM artifacts, but rather part of the adaptation of SST/FOPL to database management.

For the primary advantage of the RDM -- guaranteed correctness of query results (i.e., inferences made from the database) -- to materialize, logical database design must adhere to three core principles which, jointly, imply fully normalized relations (5NF).
In fact, in RDM relations are in 5NF by definition, otherwise they are not relations -- relational algebra (RA) operations lose information and all bets are off.

The RA is the manipulative component of the RDM -- a collection of primitive and derived set operations on relations that describe relationships among relations. For example, the join operation r1 JOIN r2 describes a relationship between r1 and r2 relation, the result itself a relation. Note that since every result of a RA operation on even one relation is always a relation and still describes a relationship -- between the "input" and "output" relations.

A data model -- and, industry claims notwithstanding, the only one satisfying Codd's definition that has been formalized is the RDM -- is by nature focused on data. However, the RDM supports physical independence (PI) and, thus, not concerned with how data is physically stored and accessed. The notion of "files stored in paper form" is an example of the common and entrenched logical-physical confusion (LPC) due to failure to understand the distinction between a logical relation and its tabular visualization on a physical medium, induced/reinforced by the industry's "direct image" implementation of SQL DBMSs.

Conclusion


We rephrase the above paragraph as follows:

“The relational algebra describes relationships among relations (sets). Primary keys are one of the adaptations of the SST/FOPL for database management: a PK constraint -- uniqueness -- represents formally in the database a within-group relationship among all its entities.

Mandatory adherence to three core design principles jointly imply full normalization, which is necessary to guarantees correctness of query results. True RDBMSs:

  • Implement the RA for logical data retrieval independent of how the data is physically stored and accessed. SQL DBMSs notwithstanding, vendors are free to store data whichever way they want as long as they don't expose it to users in applications.
  • Enforce relational constraints that are formal database representations of relationships in the conceptual model represented by the database.”

 The "brain" stuff is sheer nonsense.

 

 

 

 

Friday, September 4, 2020

Oldies But Goodies: Relationships and Relations


Note: To demonstrate the stability afforded by a sound foundation relative to the industry's fad-driven cookbook practices, I am re-publishing under "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my debunkings hold up, and whether the industry has progressed beyond the misconceptions they were intended to dispel. I may break long pieces into more pointed parts, and add comments and references to further reading.

From "Little Relationship to Relational" originally posted on March 29, 2001.

“Given the depth and complexity of Codd's thought, not to mention the arcane terms in which he often expressed himself, it is not difficult to grasp why so many of his key points have been widely misunderstood. Even programmers still often misconstrue the technical term "relational". The relational in relational theory refers to relations and not relationships. A relation is a special set of similar objects commonly modeled as entities or as database tables. Relationships may exist between these relations and if your relations are entities you could easily represent the whole thing using a Relational Entity Relationship approach. To elucidate a simple practical example, if you had a company table and an employee table and each company record could have many employee records associated with it, you would have two relations and one relationship. The relations would be the sets of similar objects found in the Employee and Company tables and the relationship would be the association between them. In this case one company to many employees.”
Codd's thought was very deep indeed--new implications are still being derived from his original ideas--and one major objective of relational technology, now almost forgotten, is simplicity. There is little that is complex in relational technology and, in fact, it is the most simple approach possible. Any other general approach is more complex.

It is true that Codd, as a mathematician, did not present his ideas in a way comprehensible to the average practitioner. But it is also true that he had to use different terminology in order to distinguish his precise concepts from the fuzzy, problematic terms already used in the industry. It is also true that, as I argued in the first editorial launching this site, practitioners are so steeped in complex implementation details and devoid of education in fundamentals, that they have a hard time understanding simple logical concepts. It is rather ironic that the author of the article himself reveals some misunderstanding of his own. To clarify:

  • formally a relation is a set of tuples, representing propositions about the real world.
  • informally, a relational table can be viewed as representing an "entity type", with rows representing "entities" of that type.
But note carefully that:
  • "entity" has no precise, formal definition
  • "relationship" can and should be regarded as a special case of "entity"

Comments on re-publication: 
  • A relation is a relationship among domains that is constrained semantically to represent in the database real world relationships within and among entity groups. 
  • We no longer use R-table as a substitute for relation -- it is a visualization of a relation on some physical medium that plays no role in RDM. Note that constraints are not visible in a R-table.
  • A relationship can be (1) among entities within an entity group, in which case it is a collective property of the group and is represented by a constraint or (2) between groups, in which case it is represented by an associative relation.


Further Reading

The Interpretation and Representation of Database Relations (Codd 1969-70)

Logical Symmetric Access, Data Sublanguage, Kinds of Relations, Redundancy and Consistency (Codd 1969-70)

What Relations Really Are and Why They Are Important

Understanding Relations series

Levels of Representation: Conceptual Modeling, Logical Database Design and Physical Implementation

Understanding Conceptual vs. Data Modeling series

Conceptual Modeling Is Not Data Modeling

Relationships and the RDM series

Relations & Relationships

Relationships, Rules, Relations and Constraints


What Is A Database Relationship



Friday, August 28, 2020

TYFK: Denormalization Part 1 - Fundamentals


““Main Question: How do we trade-off while doing denormalization? 
  • Sub-question 1: the standard to implement
- Do we always have to denormalize a model? For what kind of project must we use denormalization techniques while others may not?
- Since denormalization has its gains and losses, how well should we denormalize a data model? Perhaps, the more complete we denormalize, the more complex, uncertain and poor the situation will be.
  • Sub-question 2: the characteristics of normalization
-Does denormalization have several levels/forms the same as that of normalization? For instance: 1DNF, 2DNF...
- Given we can denormalize a data model, it may never be restored to the original one because to do normalization, one can have many ways while to build a data model, you can have multiple choices in determining entities, attributes, etc.””

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

In Part 1 we discuss the relevant fundamentals in which we will ground the debunking in Part 2.

Monday, August 24, 2020

Oldies But Goodies: Data Sublanguage vs. SQL


Note: To demonstrate the stability of a sound foundation relative to the industry's fad-driven cookbook practices, I am re-publishing under "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my debunkings hold up and whether the industry has progressed beyond the misconceptions they were intended to dispel. I may break long pieces into more pointed parts, and add comments and references to further reading.

From "Little Relationship to Relational" originally posted on March 29, 2001.
“Crucial to making the dream of relational databases a reality was the development of a new query language which took significant advantage of relational theory. Then contemporary query languages like CODASYL usually required incredibly complicated code to answer even the simplest of questions from a database. Codd himself developed a relational language called ALPHA that he often used to communicate his ideas in an academic context [Actually, this is not true], but it attracted almost no commercial interest. In 1974 and 1975 Raymond Boyce and Don Chamberlin of IBM designed a new fourth generation language to extract information from systems based on Codd's relational model known as Structured English Query Language, or SEQUEL. This would later be shortened to SQL but is still most correctly pronounced "sequel." As a relational language, SQL was designed to be descriptive rather than procedural. SQL allowed users to specify the results they wanted rather than having to specify the procedure to obtain the desired results. Theoretically this would absolve the user from having to figure out the best way to execute the program and make information available to a whole class of non-programmers. The model's scope and elegance soon drew many converts.”
It is somewhat of a stretch to deem SQL a full fledged data [sub-]language, and I am not sure how significant is the advantage it takes of relational theory. Codd, Date and others, myself included, have amply demonstrated that not only does SQL fail to support, or violates important relational features, but that it is actually a poorly designed language as well. It can be argued that at least one reason relational technology has not reached its potential is that SQL was practically its only concretization (an arguably better language, QUEL, could not compete due to lack of the then critical IBM compatibility). Not only do most practitioners think that SQL DBMSs are relational, but they actually blame the problems due to SQL's violations of, or lack of adherence to relational principles on the relational model itself!

The claim of non-procedurality (the term is declarative, not descriptive) is rather questionable. While SQL is, relatively speaking, less procedural than programming languages, or the quasi-data languages preceding, it is by no means as declarative as it should and could have been and in many cases (particularly those when it eschews relational features) the SQL formulation can be extremely procedural (see, for example, the chapters on normalization, redundancy and integrity in my latest book). One case in point are sub-queries (the "structured" in SQL's name comes from them): they were put in the language to avoid direct expressions of relational operations such as JOIN, INTERSECT, or DIFFERENCE. Unfortunately, they make SQL a highly redundant language--the same query can be expressed in many ways. Codd warned IBM about the optimization implications of such redundancy, but was ignored. In 1983 I ran multiple formulations of the same query against several SQL PC DBMSs and got response times ranging from 2 to 2500 seconds. Ironically, not all direct expressions could be avoided (e.g. UNION); what is more, explicit expression of all relational operations were later added to SQL anyway. So now there is double redundancy and the optimization complications worse!


Further Reading

Data Sublanguage series

Domains, R-tables, and SQL

Not Worth Repeating: Duplicates

To Really Understand Integrity, Don't Start with SQL

NULL Value is a Contradiction in Terms

Precision, Procedurality and SQL

DISTINCT and ORDER BY Are Not Relational
 

Language Redundancy and DBMS Performance: A SQL Story





Thursday, August 20, 2020

TYFK: Relations, Tables, Domains and Normalization


“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:
  • Data is stored in tables called relations.
  • Relations can be normalized. In normalized relations, values saved are atomic values.
  • Each row in a relation contains a unique value.
  • Each column in a relation contains values from a same domain.”


Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

Friday, August 7, 2020

Oldies But Goodies: Data Models and Physical Independence


Note: To appreciate the stability of a sound foundation vs the industry's fad-driven cookbook practice, I am re-publishing some of the articles and reader exchanges from the old DBDebunk.com (2000-06), giving you the opportunity to judge for yourself how well my claims/arguments hold up and whether the industry has progressed at all. I am adding comments on re-publication where necessary. Long pieces are broken into smaller parts for fast reading.

From "Little Relationship to Relational" originally posted on March 29, 2001.

 
“E.F. ("Ted") Codd conceived of his relational model for databases while working at IBM in 1969. Codd's approach took a clue from first-order predicate logic, the basis of a large number of other mathematical systems and presented itself [sic] in terms of set theory, leaving the physical definition of the data undefined and implementation dependent. In June of 1970, Codd laid down much of his extensive groundwork for the model in his article, "A Relational Model of Data for Large Shared Data Banks" published in the Communications of the ACM, a highly regarded professional journal published by the Association for Computing Machinery. Buoyed by an intense reaction against the ad hoc data models offered by the physically oriented mainframe databases, Codd's rigid separation of the logical model, with its rigorous mathematical underpinnings, from the less elegant realities of hardware engineering was revolutionary in its day. Codd and his relational ideas blazed across the academic computing landscape over the next few years.”

Monday, July 20, 2020

Oldies But Goodies: Data Independence and "Physical Denormalization"


Note: I am re-publishing some of the articles and reader exchanges from the old DBDebunk (2000-06). How well do they hold up -- have industry knowledge and practice progressed? Judge for yourself and appreciate the difference between a sound foundation and the fad-driven cookbook approach.


January 2, 2001


ML: ... one of the "4 great lies" is "I denormalize for performance." You state that normalization is a logical concept and, since performance is a physical concept, denormalization for performance reasons is impossible (i.e., it doesn't make sense). What term would you use to describe changing the physical database design to be different from the logical design to enhance performance? Because normalization is a logical concept, you imply that this is not called denormalization.

Friday, July 10, 2020

Oldies But Goodies: Skyscrapers with Shack Foundations

Ed. Note: I am re-publishing some of the posts (slightly refined) from the old dbdebunk (2000-06) to demonstrate how well they hold up to this day, and how not much has improved in the industry -- quite the opposite. The following is the first editorial with which I started dbdebunk in 2000.
 

June 4, 2000

“Well, it's really a judgment call and I think a lot of experience comes into it. It's a little bit like building a shack. Say you want to build a skyscraper, and you started out building a shack and you just keep trying to add onto it. After a while you have this severe structural problem ... So there is a fallacy to the build-upon-a-simple structure approach. Sometimes you get up to three stories and you have to do some major structural changes, and I just accept that.”
--Wayne Ratliffe, developer of dBase
“Client Servers were a tremendous mistake. And we are sorry that we sold it to you. Instead of applications running on the desktop and data sitting on the server, everything will be Internet based. The only things running on the desktop will be a browser and a word processor. What people want is simple, inexpensive hardware that functions as a window on to the Net. The PC was ludicrously complex with stacks of manuals, helplines and IT support needed to make it function. Client server was supposed to alleviate this problem, but it was a step in the wrong direction. We are paying through the nose to be ignorant.”
--Larry Ellison, CEO, Oracle Corp.

Sunday, June 28, 2020

TYFK: Misconceptions About the Relational Model


“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:
  • Data is stored in tables called relations.
  • Relations can be normalized, [in which case] values saved are atomic values.
  • Each row in a relation contains a unique value.
  • Each column in a relation contains values from a same domain.”

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

Friday, June 12, 2020

Semantics and the Relational Model

“The RDM is semantically weak ... struggles with consistent granularity and has limitations at the property level... it has no concept of data flow ... it is an incomplete theory. Great for its time but needs something better now ... it uses ill defined and linguistically suspect labels ... it has no rules for semantic accuracy ... this just makes the RDM 1% of the truth ... the RDM should have solved this all by now ... but it has clearly not. You fail to see the reality of the failure of RDM in the real world ... this is your choice. I understand why you cling to it ... it is a most excellent theory that I respect greatly ... [but o]pen minds make progress...” 
Thus in a LinkedIn exchange. Criticism of the RDM almost always reflects poor foundation knowledge and lack of familiarity with the history of the field, and as we shall see, this one is not different. It is often triggered by what I call the "fad-to-fad cookbook approach", one of the latest fads being the industry's revelational "discovery" of semantics.

Thursday, May 28, 2020

No Such Thing As "Current Relational Data Models"

“... the concept of a state group is indeed a missing modeling concept in relational/current data models...”

Thus in a LinkedIn exchange. I don't know what a "state group" is, but I spent almost six decades debunking the misuses of data model in general and the abuses of the RDM in particular and I smell them from miles away. While the time when lack of foundation knowledge shocked me is long gone, practitioners' total unawareness of and indifference to it, and poor reasoning in a field founded on logic never ceases to amaze me.

What exactly are "relational/current data models"?

Sunday, May 10, 2020

TYFK: What Is A Database Relationship?

Note: This is a re-write of an earlier post. About TYFK posts (Test Your Foundation Knowledge) see the post insert below.

“Here two or more table[s] are related with each other. This is Database relationship. Database relationship is used a lot ... [in] relational database management systems ... shortly called RDBMS. Here is Join_data [sic] table and Interview_data table. For creating a relational database management system both of the table[s] must have a common field. Here Employee_ID is a common field ... Database relationship types: One-To-One relation, One-To-many relation, Many-to-many relation. Minimum one common field is essential in all the tables. The data type of common field and field size will be same in all the tables.”
First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.


Monday, April 27, 2020

TYFK: "Multi-model DBMSs" is an Empty Set


Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.
“Traditional databases ... don't have a multi-model capability. Point is that richer data models are underused, relational data models are overused, and graph data models have so many advantages that shouldn't be ignored. Relational models, on the other hand, have wildly complex structures often with hundreds to thousands of tables. Each table then contains tens to hundreds of columns, arbitrarily constructed in each and every relational system. And just in case the situation wasn't complex enough, many of those columns are exist exclusively to manage uniqueness and provide connections to other tables. This Structure-FIrst approach produced the cascade of complexity from which we have struggled to recover ever since.”

First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Monday, March 23, 2020

TYFK: How (Not) to Compare NoSQL Systems and RDBMSs?


Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.
“But if you still want to compare NOSQL databases with RDBMS, they primarily vary in
1. "normalization" where RDBMS contains normalized (upto certain degree) data and NOSQL based database contains non-normalized data;
2. RDBMS based databases are (I MUST say, generally and it isn't a criteria) fully ACID compliant while NOSQL databases are partially ACID compliant.
3. RDBMS are much slower and difficult to scale while NOSQL databases are much faster and easily scalable.
4. RDBMS normalization was very useful 50 years ago when cost of disk and memory was high, and computation power was limited. With the revolution in computing power, cheapest disk and memory availability has made RDBMS normalization a matter of joke - many people do not really understand why they need to normalize data in today's time.”
First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Note: In what follows RDBMS refers to a truly relational DBMS (of which currently aren't any), not to be confused with a SQL DBMS.

Thursday, March 12, 2020

Muddling Modeling Part 2: An Example


In an old article I used a Hay-Ross exchange to illustrate how disregard for fundamentals and the associated name proliferation -- which underlies the industry's fad-to-fad tradition -- cause confusion that inhibits understanding of conceptual modeling for database design. A recent LinkedIn exchange -- hardly unique -- showed the article to be as relevant today as it was two decades ago, prompting me to bring it up to date.

In Part 1 we reiterated pertinent fundamentals. Here is the re-written article
-- try to apply the fundamentals from Part 1 before you proceed with our debunking.

Sunday, March 1, 2020

Muddling Modeling Part 1: Fundamentals


“Data modelling, star schema, snow flakes, data vault. Implementing virtual data warehouses (many stage to modify relationships). Normalisation (using a lot of surrogate keys) all for the sake of business reporting analytics. Reason a SQL DBMS approach columns rows is mandatory.”
--LinkedIn

This recent "comment" reminded me of a decades-old article I published in response to a critique by David Hay of the "fact model" then newly proposed by Ron Ross as an "alternative to the data model". In a Letter to the Editor, Hay correctly observed:
“In our industry, there is a strong desire to put names on things. This is natural enough, given the amount of information that we have to classify and deal with in our work. To give something a name is to gain control over it, and this is not necessarily a bad thing. The problem is when the name takes the place of true understanding of the thing named. Discourse tends to be the bantering of names, without true understanding of the concepts involved.”
of which the above comment is an exquisite example.

Friday, February 14, 2020

TFYK: What Is a Relational Schema?

Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.
“A relational database stores information in a structured format called a schema. This schema is defined according the rules of database normalization. These rules are meant to ensure the integrity of the data. The schema for a database is broken up into the objects such as tables and constraints. Tables hold your data and are broken down into rows. each row represents a single entity such as a person and has columns which define the attributes of the entity such as age. Constraints define limitations around the data. For example a check constraint might limit the range of valid dates in a datetime column. From there queries can be run to extract data from the database. These queries will often join multiple tables to pull data from them.”
First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Thursday, January 30, 2020

TYFK: What Is a Relational Database?

“RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL, and for all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access. RDBMS store the data into collection of tables, which might be related by common fields (database table columns). RDBMS also provide relational operators to manipulate the data stored into the database tables. An important feature of RDBMS is that a single database can be spread across several tables. This differs from flat-file databases, in which each database is self-contained in a single table. The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:
  • Data is stored in tables called relations.
  • Relations can be normalized.
  • In normalized relations, values saved are atomic values.
  • Each row in a relation contains a unique value.
  • Each column in a relation contains values from a same domain.”

The question got 18 answers online, but none came even close to being correct. This is the only one that merits debunking -- the rest will be posted on LinkedIn as "To laugh or cry?".


Note: While the question is about database, due to routine interchangeable use of database and DBMS, we suspect the intention was DBMS. Our debunking applies to database, and our correct answer makes the proper distinction.

First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our posts, BOOKS, PAPERS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Friday, January 24, 2020

Naming Relations: Singular or Plural?

Revised 1/24/20.
“There is a lot of confusion when it comes to designing tables in SQL Server around whether to pluralize names or not. How do you choose whether to pluralize or not? If we want to store a list of people and their details do we use "Person", "Persons", "People" or "Peoples"? Some people will use "People" and some will use "Person", other persons or people would go for "Peoples" or "Persons". The defined standard is to go for non-plural because in a table we are storing a set of an entity and we name the table as the entity so if we want to store one or more people in a single entity or table, we store it or them in the “Person” table. If we stick to this then it makes other situations simpler and stops us having to think about how to pluralize a word, I have for example seen hierarchy pluralized as "hierarcys" [sic].

If we look at Relational Model of Data Large Shared Data Banks by none other than "E. F. Codd" who basically invented the relational database, the examples he gives are singular (supplier and component). If we then look at the ISO standard for naming things (11179-5: Naming and identification principles), this also says that singular names should be used "Nouns are used in singular form only".

For new projects or where you can easily change the name of entities then I would say you must use singular names, for older projects you’ll need to be a bit more pragmatic!”

--The.AgileSQL.Club

Ignoring, for the purposes of this discussion, that a SQL table is not a relation[1], and we don't "store a list of an entity set" in it[2], naming relations involves two choices: (1) the name per se (person, people?), and (2) singular or plural (people or peoples? person or persons)? The former is determined at the conceptual level by the enterprise's business terminology[3]. While the RDM is mute on the latter, nevertheless foundation knowledge (here, what relations represent) is, as always, relevant.

Wednesday, January 22, 2020

TYFK: “Why is a relational database so powerful?”


 Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.
“...the theoretical awesomeness of relational algebra is kinda hard to intuitively relate back to your payroll or audit-log tables - the real power is the computed join ... it lets you dynamically fetch sets of data in the exact format that you need ... with any group of tables in the dataset. Unlike other data models, where the things you can fetch are typically fixed when you define your elements, and where relationships between data - if any - are statically defined in advance ... joins let you specify the relationships between objects (rows and tables in SQL-based relational databases) ... create queries and run them on your data without needing to write a lot of extra code beyond the SQL itself. This Ad Hoc Query capability ends up being hugely valuable when doing "secondary" business tasks in a big data world such as doing reporting and analytics, and is often hugely difficult to do in non-relational environments without a lot of extra code and often a specialized reporting schema.”

“Relational theory as applied to databases is that all data is connected to each other, keyed to each piece ... And with a SQL query [you can] create anything that can exist, as output.”

“...main so called power of RDBMs lies within ACID compliance. A transaction in a RDB is Atomic, Consistent, isolated, durable ... makes a database useful, unique, or I suppose powerful ... let's say you update or insert some new record, right in the middle power goes out, due to ACID compliance your transaction will not go through ... either the operation will complete or fail, nothing in between. [And RDBMSs are] tried, tested and true for almost 50 years.”

First try to detect the misconceptions then check against our debunking (some of the above is correct per se but not directly relevant to relational power, some only partially correct, and some is wrong. Can you discern which is which?) If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Wednesday, January 15, 2020

TYFK: "What is better than relational databases?"

Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.
“Relational databases like MySQL, PostgreSQL and SQLite3 represent and store data in tables and rows. They're based on a branch of algebraic set theory known as relational algebra. Meanwhile, non-relational databases like MongoDB represent data in collections of JSON documents. The Mongo import utility can import JSON, CSV and TSV file formats. Mongo query targets of data are technically represented as BSON (binary JASON).

“Relational databases use Structured Querying Language (SQL), making them a good choice for applications that involve the management of several transactions. The structure of a relational database allows you to link information from different tables through the use of foreign keys (or indexes), which are used to uniquely identify any atomic piece of data within that table. Other tables may refer to that foreign key, so as to create a link between their data pieces and the piece pointed to by the foreign key. This comes in handy for applications that are heavy into data analysis.

“If you want your application to handle a lot of complicated querying, database transactions and routine analysis of data, you’ll probably want to stick with a relational database. And if your application is going to focus on doing many database transactions, it’s important that those transactions are processed reliably. This is where ACID (the set of properties that guarantee database transactions are processed reliably) really matters, and where referential integrity comes into play.”
First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Friday, January 3, 2020

Science, "Data Science", and Database Science


“The foundation of modern database technology is without question the relational model; it is the foundation that makes the field a science.”
--C. J. Date, AN INTRODUCTION TO DATABASE SYSTEMS
“Over the past decades mainstream economics in universities has become increasingly mathematical, focusing on complex statistical analyses and modeling to the detriment of the observation of reality.”
--J. Luyendijk, Don’t let the Nobel Prize fool you, economics is not a science

Science is the formulation and validation of theories about the real world in the context of discovery (CoD) and context of validation (CoV), respectively. There is "hard" science -- theories about the physical world (physics, chemistry, biology) -- and "soft" science -- theories about human behavior (political, economics, psychology). All science uses data, initially only in the CoV, but increasingly also in the CoD -- computerized discovery of patterns as potential hypotheses (i.e., "data mining"). 

View My Stats