Friday, May 31, 2019

Normalization and Further Normalization Part 1: Databases Representing ... What?


Note: This is a re-write of older posts (which now link here), to bring them into line with the McGoveran formalization, re-interpretation, and extension[1] of Codd's RDM.
“A particular bug-bear and a mistake that +90% of "data modelers" make, is analyzing "point in time" views of the business data and "normalizing" those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-ID,
  Customer Name,
  Invoice-Address-Line-1,
  Invoice-Address-Line-2,
  Invoice-Address-Line-3,
  Invoice-Address-Line-4,
  Invoice-Address-Postcode,
  Net-Amount,
  VAT,
  Total-Amount
 };
Nearly every time, through the blind application of normalization we get this ... there’s even a term for it -- it’s called "over-normalization":
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-Id
   REFERENCES Customer-Account,
  Net-Amount,
  VAT,
  Total-Amount
 };

CUSTOMER-ACCOUNT
 {Customer-Account-Id,
  Customer-Name,
  Invoice-Address
   REFERENCES Address
 };

ADDRESS
 {Address-Line-1,
  Address-Line-2,
  Address-Line-3,
  Address-Line-4,
  Postcode
 };”
A measure of scarcity of foundation knowledge in the industry are the attempts to correct a plethora of common misconceptions[2] that suffer from the very misconceptions they aim to correct. One of the most common fallacies is confusion of levels of representation[3] that takes two forms[4]. We have written extensively about the logical-physical confusion (LPC)[5,6,7,8] underlying "denormalization for performance"[9], and the conceptual-logical conflation (CLC) that lumps conceptual with data modeling[10,11,12], inhibiting understanding that the latter is formalization of the former. 

Saturday, May 25, 2019

Reader Mail: Sets vs. Graphs, Education vs. Training


GK writes:
“I just wanted to drop a note of thanks for the website, especially the latest articles on understanding data modeling, which among other things, explains very nicely the difference between the application of set theory and graph theory. It parallels in the real world with the community (set of data elements) and the individual (node in a network) and how it is easier to connect communities (RDM), but how much more complex it would be to connect individuals directly (GDM) without going through such a community connection arrangement (e.g. e-mail, postal system).”

“I'm currently working out the concept of what I call CMCs or contextual metadata connectors. I'm sure such entities will be heavily dependent upon the usage of RDM to do their job. In the project, I would like to use both approaches (RDM, GDM) due to the power of set theory and graph theory, but exactly where one should do so is so critical.”

“It's exciting to think of the endless potential for AI-based automation when one correctly leverages the underlying principles of data relationships. Since my discovery in 2004 about a much better way to approach test automation which I called data-centric (vs. the code-centric industry standard), I have found that it applies anywhere there is data, as long as one holds to a proper understanding of data and how to view it relationally.”

“What I find very surprising though is how rare it is to find in the I.T. industry a proper understanding of data, especially when viewing it relationally. It is indeed one of the most massively misunderstood aspects of the I.T. industry to this day, as your website alludes to. Rather than running away from it, RDM should be the very first course taught in any program involved in either computer science or information science. Maybe then I wouldn't always be losing people in technical conversations whenever I start talking about it. I see a diamond and they just see carbon.”

While it is nice to have one's work appreciated, the following text appears in every post and cannot be missed:
“Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. You will not find its content anywhere else, so if you deem it useful, particularly if you are a regular reader, please back up your appreciation with concrete support -- purchase publications, or regular donations. Thank you.”
Codd was explicit about introducing the set-based RDM to relieve what he called "non-network applications" -- concerned with relationships among groups of entities -- from the complexity burden of directed graphs for network applications concerned with relationships among individual entities. But this too,  like so many other aspects of his work, was missed/ignored. Witness the GDBMS revival and promotion as "superior to RDBMSs" (which are confused with SQL DBMSs), without any reference to their distinct application domains.

Furthermore, as we have often pointed out, the older generation GDBMSs were actually not grounded in graph theory, but were abstractions from industry practices, and although the current crop are improvements -- having learned from the RDM -- there is no agreed, formally well defined, theory based graph data model (GDM)[1,2]. If there is, what are -- precisely, please! -- its structure, manipulation, and integrity components?[3].

I am not familiar with CMCs, but extreme care must be exercised with respect to "using both approaches (RDM, GDM) due to the power of set theory and graph theory", to prevent the latter (based on higher logic) from defeating the purpose and advantages of the former (intentionally restricted to FOPL)[4,5].
While I do not disagree with the data-centric vs. code-centric argument, I have serious reservations  -- to put it politely -- for a multiplicity of reasons to  "endless potential of AI-automation", which are beyond the scope of this response.

Surprising? Since the late 80s all our writings (at the old DBDebunk,  and elsewhere and at this blog; papers; books; and seminars have done nothing but document and explain the lack of knowledge and understanding of data fundamentals in the industry[6,7,8,9,10,11]. It has much to do with the destruction of education and its replacement with tool training[12,13], a component of the decadence and decline of Western civilization -- on which subject see my ThePostWest blog and Twitter page.
The rich irony of promoting "data science", while discarding the real data science (the RDM) escapes, of course, the industry[14,15].


References

[1] Pascal, F., Graph Databases They Who Forget the Past...

[2] Pascal, F., OO/UML, and "Graph Data Models"

[3] Pascal, F., What Is a Data Model, and What It Is Not.

[4] Pascal, F., Structure, Integrity, Manipulation: How to Compare Data Models.

[5] Pascal, F., Natural, Programming, and Data Language.

[6] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER (October 2016).

[7] Pascal, F., Database Management No Progress Without Data Fundamentals.

[8] Pascal, F., Industry Practice Is No Substitute for Foundation Knowledge.

[9] Pascal, F., The Cookbook Approach to Data Management.

[10] Pascal, F., Are You a Thinking Data Professional?

[11] Pascal, F., Lenin, Trotsky, Data Management, and the Tyranny of Knowledge and Reason.

[12] Pascal, F., A Note on Education vs. Training.

[13] Pascal, F., Education, Practicality and an Introductory SQL Book.

[14] Pascal, F.,  The Real Data Science.

[15] Understanding Relations: Tables? So What?




Thursday, May 16, 2019

Naming Relations: Singular or Plural?

“There is a lot of confusion when it comes to designing tables in SQL Server around whether to pluralize names or not. How do you choose whether to pluralize or not? If we want to store a list of people and their details do we use "Person", "Persons", "People" or "Peoples"? Some people will use "People" and some will use "Person", other persons or people would go for "Peoples" or "Persons". The defined standard is to go for non-plural because in a table we are storing a set of an entity and we name the table as the entity so if we want to store one or more people in a single entity or table, we store it or them in the “Person” table. If we stick to this then it makes other situations simpler and stops us having to think about how to pluralize a word, I have for example seen hierarchy pluralized as "hierarcys" [sic].

If we look at Relational Model of Data Large Shared Data Banks by none other than "E. F. Codd" who basically invented the relational database, the examples he gives are singular (supplier and component). If we then look at the ISO standard for naming things (11179-5: Naming and identification principles), this also says that singular names should be used "Nouns are used in singular form only".

For new projects or where you can easily change the name of entities then I would say you must use singular names, for older projects you’ll need to be a bit more pragmatic!”

--The.AgileSQL.Club

Ignoring, for the purposes of this discussion, that a SQL table is not a relation[1], and we don't "store a list of an entity set" in it[2], naming relations involves two choices: (1) the name per se (person, people?), and (2) singular or plural (people or peoples? person or persons)? The former is determined at the conceptual level by the enterprise's business terminology[3]. While the RDM is mute on the latter, nevertheless foundation knowledge (here, what relations represent) is, as always, relevant.

Saturday, May 11, 2019

Understanding Data Modeling Part 5: Conclusions

In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the "data modeling" mess in the industry that Friesendal has tried to catalog, and argued that it can help overcome it. In Part 2 we applied this knowledge to the first two industry "data models" considered by Friesendal -- the E/RM and RDM. In Part 3, we applied it to OO/UML and (yet a formally undefined) GDM, and in Part 4 to Fact Modeling (FM).

Here we apply it to Friesendal's conclusions.

Saturday, May 4, 2019

Understanding Data Modeling Part 4: Fact Modeling


In Part 1 we presented some foundation knowledge with which to debunk misconceptions lurking in the "data models" mess in the industry that Friesendal has tried to catalog. In Part 2 we applied this knowledge to the first two industry "data models" considered by Friesendal, the E/RM and the RDM. In Part 3,  we applied it to OO/UML and (a yet formally undefined) "GDM". Here we apply it to fact modeling (FM).

Fact Modeling


“... another school of modelers working with "fact modeling". Their approach is not new. It goes back to the 70's, where Eckhard Falckenberg and Sjir Nijssen started working on the approach (in parallel). Fact Modeling was known for many years as Object-Role-Modeling (ORM), and it was supported by the popular Visio diagramming tool at the time that Microsoft bought the company behind Visio. I like Nijssens name “Binary Relationship Modeling” a lot and it has been in the back of my head since the early 80's. Fact Modeling is definitely at the right level (concepts and their relationships), but it also contains all of the logic details required for formal, precise specifications. The visual syntax goes back to: Nijssen, G.M. and T.A. Halpin, Conceptual Schema and Relational Database Design — A fact oriented approach, Prentice Hall 1989.”