Friday, January 3, 2020

Science, "Data Science", and Database Science

“The foundation of modern database technology is without question the relational model; it is the foundation that makes the field a science.”
“Over the past decades mainstream economics in universities has become increasingly mathematical, focusing on complex statistical analyses and modeling to the detriment of the observation of reality.”
--J. Luyendijk, Don’t let the Nobel Prize fool you, economics is not a science

Science is the formulation and validation of theories about the real world, in the context of discovery (CoD) and the context of validation (CoV), respectively. There is "hard" science -- theories about the physical world (physics, chemistry, biology) -- and "soft" science -- theories about human behavior (political science, economics, psychology). All science uses data, initially only in the CoV, but increasingly also in the CoD -- computerized discovery of patterns as potential hypotheses (i.e., "data mining").


Science and Data Mining

The CoD -- traditionally the purview of human intellect -- has been increasingly "outsourced" to computers tasked with "discovery of data patterns". As practiced in the industry, this outsourcing is problematic from a science perspective.

First, hypothesis formulation is not a purely computational endeavor: computers can discover patterns that humans may miss, but they cannot determine how meaningful those patterns are with respect to the real world. To demonstrate what happens when the CoD is contaminated by, and reduced to, computational thinking, take, for example, A Quantitative Semantic and Topological Analysis of UK House of Commons:

with data analysis results such as the following: 

It is certainly complex enough to seem "scientific", but as a trained political scientist I find the political interpretation (i.e., real world meaning) of both the hypothesis and the analytical result -- to put it politely -- difficult[1]. I will leave it to the reader to judge whether it advances our knowledge and understanding of political behavior.

Second, even a mined hypothesis must be validated on data distinct from that from which it was mined. More often than not, validation is conflated with discovery, skipping the CoV altogether and short-circuiting the scientific method.
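
The second point can be illustrated with a minimal sketch in Python. It is not part of the original argument: the function names, the sample sizes, and the use of pure noise as data are all assumptions made for illustration. The sketch "mines" the feature most correlated with a target on one half of a random dataset (the CoD), then measures the same correlation on the other half (the CoV). Since the data are noise by construction, any mined pattern is spurious, and only the held-out half can expose that.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def mine_and_validate(n_rows=60, n_features=200, seed=0):
    """Mine the 'best' feature on one half of noise data, re-test it on the other.

    All data are independent Gaussian noise (an assumption of this sketch),
    so no feature has any real relationship with the target.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_features)]
    half = n_rows // 2

    def corr_on(j, lo, hi):
        return pearson(features[j][lo:hi], target[lo:hi])

    # Context of discovery: pick the strongest pattern in the first half.
    best = max(range(n_features), key=lambda j: abs(corr_on(j, 0, half)))
    r_discovery = corr_on(best, 0, half)

    # Context of validation: test the same hypothesis on distinct data.
    r_validation = corr_on(best, half, n_rows)
    return best, r_discovery, r_validation
```

Run over different seeds, the mined correlation tends to look impressive on the discovery half and unremarkable on the validation half. Reporting only the former is exactly the conflation of discovery with validation described above.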

"Data Science"

For these and other reasons beyond the scope of this discussion, much of the industry practice referred to as "data science" cannot be considered science (in fact, the label is used to obscure the absence thereof)[2]. Were it science, what would it be a science of -- theories about what? At best it would be "business science" (i.e., the formulation and validation of theories about business) -- part of the soft behavioral sciences.

The Real Data Science: Database Science

But the label suggests a science of data, namely the formulation and validation of theories not about the world, but about data, albeit applied theories (i.e., theories that need a real world interpretation to be useful). There actually is such a science, but it is a sad irony that the industry utterly dismisses and disregards it[3].

The mathematical theory of relations is abstract -- it has no real world meaning. The relational data model (RDM) is Codd's adaptation and application of that theory to database management, which enabled a real world interpretation[4]: database relations represent entity groups, attributes represent properties, tuples represent facts about entities, and constraints on relations represent relationships among properties, entities, and groups in databases[5].
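
The interpretation just described can be made concrete with a small sketch in Python. It is only an illustration under assumptions: the Employee entity group, its attributes, the sample tuples, and the satisfies_key helper are all hypothetical, and a Python frozenset is merely a rough stand-in for a relation, not an implementation of the RDM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    """Heading of a hypothetical relation representing the entity group 'employees'."""
    emp_id: int   # attribute representing the property "employee number"
    name: str     # attribute representing the property "name"
    dept: str     # attribute representing the property "department"


def satisfies_key(relation, key):
    """Check a key constraint: no two tuples agree on all key attributes."""
    seen = set()
    for t in relation:
        k = tuple(getattr(t, a) for a in key)
        if k in seen:
            return False
        seen.add(k)
    return True


# Each tuple represents one fact about one entity. Because the relation is
# a set, asserting the same fact twice does not create a second tuple.
employees = frozenset({
    Employee(1, "Ada", "R&D"),
    Employee(2, "Bob", "Sales"),
})
```

Here satisfies_key(employees, ("emp_id",)) holds. Adding Employee(1, "Eve", "R&D") would violate the constraint, because it would represent two different entities under the same employee number.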

“The RDBMS implements an abstract, uninterpreted logical system. When we create specific database domains, relations, and attributes we are restricting that system to a specific interpretation -- seen the other way around, an interpretation of a logical system is a representation of the world -- and that is exactly the purpose of database design. An attribute name is assigned a meaning (in other words, in the conceptual model, a property is represented by an attribute name that the designer creates) -- semantics. That is why, for example, the level of normalization cannot be established without knowledge of the conceptual model -- what those attribute names mean and how they are related to each other (i.e., dependencies).”
-- David McGoveran
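
The point about normalization can be sketched with the textbook attribute-closure test for Boyce-Codd normal form (BCNF). What follows is a minimal illustration in Python, not a method taken from the quoted source: the schema, the attribute names, and both sets of dependencies are hypothetical. The dependencies stand for what the designer knows from the conceptual model; they cannot be read off the attribute names.

```python
def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """List the nontrivial dependencies whose left side is not a superkey.

    Assumes every dependency in fds mentions only attributes of schema.
    """
    schema = set(schema)
    return [(lhs, rhs) for lhs, rhs in fds
            if not set(rhs) <= set(lhs)             # skip trivial dependencies
            and not closure(lhs, fds) >= schema]    # lhs is not a superkey


schema = ("emp_id", "dept", "city")

# Interpretation A (hypothetical): city is where the employee lives.
fds_a = [(("emp_id",), ("dept", "city"))]

# Interpretation B (hypothetical): city is where the department is located.
fds_b = [(("emp_id",), ("dept",)), (("dept",), ("city",))]
```

With identical attribute names, bcnf_violations(schema, fds_a) is empty, while bcnf_violations(schema, fds_b) reports the dependency of city on dept. The normal form depends on what the attributes mean, which only the conceptual model can supply.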
But with scientific education replaced by coding and tools experience[6,7], it is hardly surprising that the core critical advantage of relational database management -- its scientific foundation -- is entirely missed and unappreciated:

“In a relational database, each entity is represented by a table. A database table is simply a list of information, presented with rows and columns, about the category of person, thing, or concept you want to track. So in a phone book, you might have a table to store information about residences and another table to store information about businesses; or in a library catalog, you might have one table to store information about books and another to store information about authors.”

Einstein famously advised that everything should be as simple as possible, but not simpler. Aside from being riddled with misconceptions[8], the description quoted above is not even a definition, let alone one that distinguishes relational databases from any other type[9].

It is this absence of foundation knowledge[10,11] that explains why SQL DBMSs -- intended as implementations of relational theory -- fell so far short of it (in its thousands of pages the ANSI SQL standard intentionally does not mention the word ‘relational’ even once!), and why NoSQL products are promoted as "post-relational progress" because they "don't require a data model"[12].

“After attending NoSQL conference I am really hoping that companies think through this 'big data' implementation! No one there was interested in data model ... and said so ... forget the data model.”

There is, however, a corollary to Einstein's advice: everything should be as complex as is comprehensible, but not complexer. Unfortunately, computational complexity is nowadays mistaken for science, while database science -- the RDM -- is dismissed either as "4th grade mathematics" or as too complex for the average data practitioner. In the absence of foundation knowledge, industry practice fails both ways[13,14].

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.



[1] Pascal, F., Data Meaning and Mining: Knowledge Representation and Discovery

[2] Pascal, F., Industry Practice Is No Substitute for Foundation Knowledge

[3] Pascal, F., Social BigData and Relational Denial

[4] Pascal, F., The RDM Is Applied Theory

[5] Pascal, F., Conceptual Modeling For Database Design

[6] Pascal, F., Education vs. Training

[7] Pascal, F., Database Education Oughts and OughtNots


[9] Pascal, F., What Is a True Relational System (and What It Is Not)

[10] Pascal, F., Forward to the Past: From Codd to SQL to NoSQL

[11] Pascal, F., NoSQL and SQL: A Plague on Both Their Houses

[12] Pascal, F., What Is a Data Model and What It Is Not

[13] Pascal, F., Database Management No Progress Without Data Fundamentals

[14] Pascal, F., Healthcare, Data Fundamentals and the PASS Summit

