Friday, June 21, 2019

Data Meaning and Mining: Knowledge Representation and Discovery


Note: This is a re-write -- prompted by a LinkedIn exchange -- of two columns I published @All Analytics.
“Scientific research experiments that "require assignment of data to tables, which is difficult when the scientists do not know ahead of time what analysis to run on the data, a lack of knowledge that severely limits the usefulness of relational [read: SQL] databases.”
NoSQL are recommended in such cases. But what does "scientists do not know ahead of time what analysis to run" really mean?

Data, Information, and Knowledge


One way to view the difference between data, information, and knowledge is:
“1. Data: Categorized sequences of values representing some properties of interest, but if and how they are related is unknown (e.g., research variables in scientific experiments);
2. Information: Properties further organized in named combinations -- "objects", but how they are related is unknown (e.g., "runs", or "cases" in scientific experiments);
3. Knowledge: Relationships among properties and among objects of different types are known.”

--David McGoveran


Friday, June 14, 2019

Normalization and Further Normalization Part 3: Understanding Database Design


Note: This is a re-write of two older posts, to bring them into line with McGoveran's formalization, re-interpretation, and extension[1] of Codd's RDM.
 

In Part 1 we explained that for a database to be relational, database design must adhere to three core principles, in which case it consists of relations that are by definition in both 1NF and 5NF. In Part 2 we showed that whether tables visualize relations (i.e., are R-tables) can be determined only with reference to the conceptual model that the database designer intended the database to represent (not what any users might think it does). This is obscured by the common and entrenched confusion/conflation of levels of representation and, consequently, of types of model -- conceptual, logical, physical, and data model -- that we have so often debunked[2].


Sunday, June 2, 2019

Normalization and Further Normalization Part 2: If You Need Them, You're Doing It Wrong


In Part 1 we outlined some fundamentals of database design, namely the distinction between normalization to 1NF, and further normalization (to "full" 5NF), and explained that they are necessary only to repair poor designs -- if you (1) develop a complete conceptual model and (2) formalize it properly using the RDM, (3) adhering to the three core principles of database design, you should end up with a relational database in both 1NF and 5NF.

Here we apply this knowledge to the typical request for "normalization" help we presented in Part 1.

Friday, May 31, 2019

Normalization and Further Normalization Part 1: Databases Representing ... What?


Note: This is a re-write of older posts (which now link here), to bring them into line with the McGoveran formalization, re-interpretation, and extension[1] of Codd's RDM.
“A particular bug-bear and a mistake that +90% of "data modelers" make, is analyzing "point in time" views of the business data and "normalizing" those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-ID,
  Customer Name,
  Invoice-Address-Line-1,
  Invoice-Address-Line-2,
  Invoice-Address-Line-3,
  Invoice-Address-Line-4,
  Invoice-Address-Postcode,
  Net-Amount,
  VAT,
  Total-Amount
 };
Nearly every time, through the blind application of normalization we get this ... there’s even a term for it -- it’s called "over-normalization":
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-Id
   REFERENCES Customer-Account,
  Net-Amount,
  VAT,
  Total-Amount
 };

CUSTOMER-ACCOUNT
 {Customer-Account-Id,
  Customer-Name,
  Invoice-Address
   REFERENCES Address
 };

ADDRESS
 {Address-Line-1,
  Address-Line-2,
  Address-Line-3,
  Address-Line-4,
  Postcode
 };”
A measure of scarcity of foundation knowledge in the industry are the attempts to correct a plethora of common misconceptions[2] that suffer from the very misconceptions they aim to correct. One of the most common fallacies is confusion of levels of representation[3] that takes two forms[4]. We have written extensively about the logical-physical confusion (LPC)[5,6,7,8] underlying "denormalization for performance"[9], and the conceptual-logical conflation (CLC) that lumps conceptual with data modeling[10,11,12], inhibiting understanding that the latter is formalization of the former.