Friday, June 21, 2019

Data Meaning and Mining: Knowledge Representation and Discovery

Note: This is a re-write -- prompted by a LinkedIn exchange -- of two columns I published @All Analytics.
“Scientific research experiments that "require assignment of data to tables, which is difficult when the scientists do not know ahead of time what analysis to run on the data, a lack of knowledge that severely limits the usefulness of relational [read: SQL] databases.”
NoSQL are recommended in such cases. But what does "scientists do not know ahead of time what analysis to run" really mean?

Data, Information, and Knowledge

One way to view the difference between data, information, and knowledge is:
“1. Data: Categorized sequences of values representing some properties of interest, but if and how they are related is unknown (e.g., research variables in scientific experiments);
2. Information: Properties further organized in named combinations -- "objects", but how they are related is unknown (e.g., "runs", or "cases" in scientific experiments);
3. Knowledge: Relationships among properties and among objects of different types are known.”

--David McGoveran



Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you. 


  • 05/21/19: Updated the LINKS page.
  • 04/20/19: Added POSTS page with links to all site posts, to be updated monthly. 



  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics may work.

I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.


In science there is a context of discovery and a context of validation (I wonder how many of those who call themselves "data scientists" know these concepts). In the former there is no full-fledged hypothesis/theory (i.e., there is only data or, at best, information -- knowledge is being sought), which is why "scientists do not know ahead of time what analysis to run". In the latter a theory/hypothesis has been "discovered" -- there is potential knowledge that  can be represented in a database and tested by analysis and, as long as it is not rejected, further implications are derived and tested for further validation.

Any DBMS can be used to represent data or information. RDBMSs are most useful for knowledge representation in the context of validation, when properties of objects (entity groups, and a multigroup thereof) -- some of which are relationships among properties and objects -- have been identified and can be represented in a database or, in other words, when there is a conceptual model that can be formalized using the RDM as a logical model for database representation[1]. Relationships formalize as constraints that a RDBMS enforces for consistency of the logical model (data) with the conceptual model (meaning)[2,3], and manipulates the data (query the database) for inferencing, or retrieval to analytical applications. The dual theoretical foundation of the RDM -- simple set theory (SST) and first order predicate logic (FOPL) -- ensures correctness: system-guaranteed logical validity, and semantic correctness[4,5], as well as other advantages.

Unfortunately, true RDBMSs don’t exist. The closest the IT industry has ever come are SQL DBMSs, which -- common misconceptions notwithstanding -- are not TRDBMSs[6]. Be that as it may, insofar as knowledge representation in the context of validation is concerned, NoSQL systems cannot do better, or even what SQL DBMSs do[7]. What about knowledge discovery (commonly known as "data mining")?

“It also requires a formal logical system supplemented with a theory about semantics and how properties and relationships are to be recognized in, or inferred from a subject matter. Not all DBMSs support any degree of automated knowledge discovery and, unfortunately, many even fail to enhance manual knowledge discovery to any substantial degree. That is certainly true of commercial NoSQL systems that lack formal logical underpinnings, so any discovery is ad-hoc and manual: it all occurs in the head of the user.”
--David McGoveran

Can't Infer Meaning from Data

Years back a client consulted me on a problem: they had an important database, but did not know what the data meant -- could I "reconstruct" the meaning?

I have written frequently on this subject ever since. Many data practitioners design databases without well-defined, complete conceptual models, and most think they can -- and do -- infer meaning from sheer inspection of tabular displays of data[8]. But what can -- if anything -- be inferred even from very simple tables used for education purposes is extremely limited. For example, the informal interpretation usually given for the famous  SUPPLIES (S#,P#,QTY) table:

"Supplier with supplier number S# supplies part with part number P# in quantity QTY".
is ambiguous (Is QTY the packaging quantity? The only quantity ever supplied? Does it represent historical quantities?), and incomplete (much more is required to make the meaning clear in natural language).

Since a relational database represents knowledge -- objects and properties, some of which are relationships among objects and properties (i.e., a conceptual model) -- the correctness of the representation (i.e.,  semantic consistency) can be determined only with reference to the model, in which case inferring the meaning from the data is a fool's errand[9]. While it might be possible -- with some effort -- from a complete specification of database relations in a relational data sub-language expressing all relational constraints, it is not possible from SQL tables (let alone from tabular displays thereof) and constraints[10] in industry practice. What, for example, do the data in the following table mean?

 hard   yes     09:30  10:30   SAVER
 hard   yes     11:00  12:00   SAVER
 hard   no      14:00  15:30   STANDARD
 grass  no      10:00  11:30   PREMIUM-B
 grass  no      11:30  13:30   PREMIUM-B
 grass  yes     15:00  16:30   PREMIUM-A
Without the conceptual model it is not even possible to tell if it is a R-table (i.e., that it visualizes a relation[11]). If it is not, the database is not relational and all bets are off: sensible querying and correct results (and interpretation thereof), as well as the other relational advantages, are not guaranteed. 

TRDBMSs would:
  • Support expression/declaration, and enforcement of all the FOPL-expressible relational constraints, thus guaranteeing semantically consistent databases; and,
  • Be semantically self-documenting: make meaning accessible to users on demand.
but, sadly, none is available. SQL DBMSs do not support all relational constraints do not self-document semantically, and non-SQL -- including "schema-less" NoSQL -- systems are hopeless in this respect [12].

I told the client that I could try, but it would require a serious effort with considerable user involvement, without guarantee that re-conceptualization and re-design could be avoided. Predictably, they declined.

And so it goes.

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


[1] Pascal, F., Conceptual (Business) Modeling for Database Design v.5, Parts 1,2, forthcoming.

[2] Pascal, F., Relationships and the RDM Parts 1-3.

[3] Pascal, F., What Meaning Means: Business Rules, Predicates, Integrity Constraints, and Database Consistency.

[4] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.

[5] Pascal, F., Logical Validity and Semantic Correctness.

[6] Pascal, F., SQL Sins.

[7] Pascal, F., Thinking Logically: SQL, NoSQL, and the Relational Model.

[8] Pascal, F., Tables: So What?

[9] Pascal, F., Don't Design Databases Without Foundation Knowledge and Conceptual Models.

[10] Pascal, F., What Relations Really Are and Why They Are Important.

[11] Pascal, F., To Really Understand Integrity, Don't Start with SQL.

[12] Pascal, F., NoSQL and SQL: A Plague on Both Their Houses.

No comments:

Post a Comment