Friday, January 3, 2020

Science, "Data Science", and Database Science




“The foundation of modern database technology is without question the relational model; it is the foundation that makes the field a science.”
--C. J. Date, AN INTRODUCTION TO DATABASE SYSTEMS
“Over the past decades mainstream economics in universities has become increasingly mathematical, focusing on complex statistical analyses and modeling to the detriment of the observation of reality.”
-- J. Luyendijk, Don’t let the Nobel Prize fool you. Economics is not a science

Science is the formulation and validation of theories about the real world in the context of discovery (CoD) and the context of validation (CoV), respectively. There is "hard" science -- theories about the physical world (physics, chemistry, biology) -- and "soft" science -- theories about human behavior (political science, economics, psychology). All science uses data, initially only in the CoV, but increasingly also in the CoD -- computerized discovery of patterns as potential hypotheses (i.e., "data mining").



Science and Data Mining


The CoD -- traditionally the purview of human intellect -- has been increasingly "outsourced" to computers tasked with "discovery of data patterns". As practiced in the industry, this outsourcing is problematic from a scientific perspective.

First, hypothesis formulation is not a purely computational endeavor: computers can discover patterns that humans may miss, but they cannot determine how meaningful those patterns are with respect to the real world. To demonstrate what happens when the CoD is contaminated by, and reduced to, computational thinking, take, for example, A Quantitative Semantic and Topological Analysis of UK House of Commons:






with data analysis results such as the following: 

It is certainly complex enough to seem "scientific", but as a trained political scientist I find the political interpretation (i.e., real world meaning) of both the hypothesis and the analytical result -- to put it politely -- difficult[1]. I will leave it to the reader to judge whether it advances our knowledge and understanding of political behavior.

Second, even a mined hypothesis must be validated on data distinct from that from which it was mined. More often than not, validation is conflated with discovery, skipping the CoV altogether and short-circuiting the scientific method.
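
To make the second point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not anyone's actual pipeline. The data are pure random noise, and all names (`mine_best_feature`, `disc_rows`, `val_rows`, and so on) are hypothetical. The "pattern" is mined on one half of the data (CoD) and then checked on the other half (CoV). With many candidate features and few rows, the mined correlation usually looks impressive on the discovery half and tends to shrink on the held-out half, which is why validating on the discovery data proves nothing.

```python
import random
from statistics import mean


def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_best_feature(rows, target):
    """'Discovery': return the feature index most correlated with the target."""
    n = len(rows[0])
    scores = [abs(correlation([r[j] for r in rows], target)) for j in range(n)]
    best = max(range(n), key=scores.__getitem__)
    return best, scores[best]


# Assumption for illustration: pure noise, so any mined "pattern" is spurious.
rng = random.Random(0)
n_rows, n_features = 60, 200
rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
target = [rng.gauss(0, 1) for _ in range(n_rows)]

# Keep the discovery data and the validation data strictly separate.
half = n_rows // 2
disc_rows, disc_y = rows[:half], target[:half]
val_rows, val_y = rows[half:], target[half:]

# CoD: mine the hypothesis on the discovery half only.
best, mined_score = mine_best_feature(disc_rows, disc_y)

# CoV: test that same hypothesis on data it was not mined from.
validated_score = abs(correlation([r[best] for r in val_rows], val_y))

print(f"feature {best}: |r| on discovery data  = {mined_score:.2f}")
print(f"feature {best}: |r| on validation data = {validated_score:.2f}")
```

Reporting only the first number is the conflation of discovery with validation described above. And even a hypothesis that survives the second check still needs a real world interpretation, which no computation supplies.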

"Data Science"


For these and other reasons beyond the scope of this discussion, much of the industry practice referred to as "data science" cannot be considered science (in fact, the label is used to obscure the absence thereof)[2]. Were it science, what would it be a science of -- theories about what? At best it would be "business science" (i.e., the formulation and validation of theories about business) -- part of soft behavioral science.

The Real Data Science: Database Science


But the label suggests a science of data, namely the formulation and validation of theories not about the world, but about data, albeit applied theories (i.e., theories that have a real world interpretation to be useful). There actually is such a science, but it is a sad irony that the industry utterly dismisses and disregards it[3].

The mathematical theory of relations is abstract -- it has no real world meaning. The RDM is Codd's adaptation and application to database management, which enabled a real world interpretation[4]: database relations represent entity groups, attributes represent properties, tuples represent facts about entities, and constraints on relations represent relationships among properties, entities, and groups in databases[5].
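
The interpretation just described can be sketched in a few lines of Python. This is a toy illustration, not an RDBMS and not a formal rendering of the RDM. The `Employee` type, its attributes, and the two constraints are hypothetical names chosen for the example. A relation is modeled as a set of tuples (no duplicates, no ordering), each tuple stands for a fact about one entity, and constraints are predicates that every value of the relation must satisfy.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    """Each tuple represents one fact about an entity of the employee group."""
    emp_id: int   # attribute representing the identifying property
    name: str     # attribute representing the name property
    dept: str     # attribute representing the department property
    salary: int   # attribute representing the salary property


# A relation is a set of tuples: duplicates and ordering carry no meaning.
employees = frozenset({
    Employee(1, "Ann", "Sales", 5000),
    Employee(2, "Bob", "Sales", 4200),
    Employee(3, "Cy", "Research", 6100),
})


def key_constraint(relation):
    """Entity uniqueness: no two tuples share an emp_id."""
    ids = [t.emp_id for t in relation]
    return len(ids) == len(set(ids))


def salary_constraint(relation):
    """A constraint on a property: every salary is positive."""
    return all(t.salary > 0 for t in relation)


CONSTRAINTS = (key_constraint, salary_constraint)


def insert(relation, new_tuple):
    """Return relation plus new_tuple, or raise if a constraint is violated."""
    candidate = relation | {new_tuple}
    for constraint in CONSTRAINTS:
        if not constraint(candidate):
            raise ValueError(f"constraint violated: {constraint.__name__}")
    return candidate


# A fact consistent with the constraints is accepted...
employees = insert(employees, Employee(4, "Di", "Research", 5800))

# ...while one that contradicts entity uniqueness is rejected.
try:
    insert(employees, Employee(1, "Eve", "Sales", 3900))
except ValueError as err:
    print(err)
```

The constraints are what tie the abstract structure to its intended meaning: the same set of tuples with no constraints declared would admit "facts" that cannot all be true of the entities being represented.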

“The RDBMS implements an abstract, uninterpreted logical system. When we create specific database domains, relations, and attributes we are restricting that system to a specific interpretation -- seen the other way around, an interpretation of a logical system is a representation of the world -- and that is exactly the purpose of database design. An attribute name is assigned a meaning (in other words, in the conceptual model, a property is represented by an attribute name that the designer creates) -- semantics. That is why, for example, the level of normalization cannot be established without knowledge of the conceptual model -- what those attribute names mean and how they are related to each other (i.e., dependencies).”
-- David McGoveran
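
The point about normalization and dependencies can be illustrated with a small Python sketch. It is a hedged example: `fd_holds`, the attribute names, and the sample rows are all assumptions made up for the illustration. The function checks whether a functional dependency is satisfied by a particular set of rows. Data can refute a dependency, but it cannot establish one. Whether "a department is located in exactly one city" is a rule of the world being modeled comes from the conceptual model, not from the rows that happen to be present.

```python
def fd_holds(rows, lhs, rhs):
    """True if, in these rows, equal lhs values always go with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Remember the first rhs value seen for each lhs value; any later
        # disagreement means the dependency is violated by the data.
        if seen.setdefault(key, value) != value:
            return False
    return True


sample = [
    {"emp_id": 1, "dept": "Sales", "dept_city": "Boston"},
    {"emp_id": 2, "dept": "Sales", "dept_city": "Boston"},
    {"emp_id": 3, "dept": "Research", "dept_city": "Austin"},
]

# The sample is consistent with dept -> dept_city, but that alone does not
# tell us whether the dependency is a rule or a coincidence of these rows.
looks_dependent = fd_holds(sample, ["dept"], ["dept_city"])

# One more fact is enough to refute it, if departments may span cities.
later = sample + [{"emp_id": 4, "dept": "Sales", "dept_city": "Denver"}]
still_dependent = fd_holds(later, ["dept"], ["dept_city"])

print(looks_dependent, still_dependent)
```

If the dependency is a rule, a design that repeats the city for every employee is redundant and should be further normalized. If it is not a rule, the same design may be perfectly correct. The rows alone cannot settle which is the case.
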
But with scientific education replaced by coding and tools experience[6,7], it is hardly surprising that the core critical advantage of relational database management -- its scientific foundation -- is entirely missed or unappreciated:
“In a relational database, each entity is represented by a table. A database table is simply a list of information, presented with rows and columns, about the category of person, thing, or concept you want to track. So in a phone book, you might have a table to store information about residences and another table to store information about businesses; or in a library catalog, you might have one table to store information about books and another to store information about authors.”
Einstein famously advised that everything should be as simple as possible, but not simpler. Aside from being riddled with misconceptions[8], the description quoted above is not even a definition, let alone one that distinguishes relational databases from any other type[9].

It's this absence of foundation knowledge[10,11] that explains why SQL DBMSs -- intended as implementations of relational theory -- ended up falling so far short of it (in its thousands of pages the ANSI SQL standard intentionally does not mention the word ‘relational’ even once!), or why NoSQL products are promoted as "post-relational progress" because they "don't require a data model"[12].

“After attending NoSQL conference I am really hoping that companies think through this 'big data' implementation! No one there was interested in data model ... and said so ... forget the data model.”
There is, however, a corollary to Einstein's advice: everything should be as complex as is comprehensible, but not complexer. Unfortunately, computational complexity is nowadays mistaken for science, while database science -- the RDM -- is dismissed either as "4th grade mathematics", or as too complex for the average data practitioner. In the absence of foundation knowledge, IT industry practice fails both ways[13,14].




References

[1] Pascal, F., Data Meaning and Mining: Knowledge Representation and Discovery

[2] Pascal, F., Industry Practice Is No Substitute for Foundation Knowledge

[3] Pascal, F., Social BigData and Relational Denial

[4] Pascal, F., The RDM Is Applied Theory

[5] Pascal, F., Conceptual Modeling For Database Design

[6] Pascal, F., Education vs. Training

[7] Pascal, F., Database Education Oughts and OughtNots

[8] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING PRACTITIONER

[9] Pascal, F., What Is a True Relational System (and What It Is Not)

[10] Pascal, F., Forward to the Past: From Codd to SQL to NoSQL

[11] Pascal, F., NoSQL and SQL: A Plague on Both Their Houses

[12] Pascal, F., What Is a Data Model and What It Is Not

[13] Pascal, F., Database Management No Progress Without Data Fundamentals

[14] Pascal, F., Healthcare, Data Fundamentals and the PASS Summit
