Friday, November 5, 2021

OBG: Database Consistency and Physical Truth



Note: To demonstrate the correctness and stability that a sound theoretical foundation confers, relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may slightly revise, break into parts, and/or add comments and/or references.

This is an email exchange with a reader responding to my third book.
(Originally posted on 06/21/2001)

“I'm presently reading your book PRACTICAL ISSUES IN DATABASE MANAGEMENT and there are a couple of points that I find a little confusing. I'll start first by saying that I have no formal database-oriented education, and I'm attempting to familiarize myself with some of the underlying theories and practices, so that I can further my personal education and career prospects (but aren't we all!). My questions may sound a little bit ignorant, but that would be because I am! (Please note ignorant, not stupid!) I'll quote you directly from the book for this (possibly I'm taking you out of context or missing something important):

Chapter 3, A Matter of Identity: Keys, pg. 75: "Databases represent assertions of fact - propositions - about entities of interest in the real world. The representation must be correct - only true propositions (facts) must be represented."

Now, correct me if I'm wrong with a basic assumption here, but isn't a database simply a model of a "real world" data collection? I would've thought that the intention of a database would be to model real life effectively (and accurately) enough to provide useful data for interpretation. Now obviously this is not an easy process with complex data types, but would it even be possible to have a 100% true proposition with only atomic data types? (i.e. can a simplified model contain only facts?) In my understanding of modeling, any model that fits real life closely enough to be a good statistical representation is a usable model. e.g. Newton's Laws are accurate enough when applied on a local scale, but we need to use Einstein's model of space-time across larger scales. Wouldn't recording only "facts" (which I would presume you mean to be statements that are provable in the objective sense i.e. no interpretation, only investigation or calculation) possibly eliminate the utility of some aspects of the database? Or do we account for the interpretative aspect in the metadata or in some other way?

Essentially, I can see what you're saying, but not necessarily how you've reached the conclusion. Admittedly in an ideal world we should be able to record only facts in a database, but this is not an ideal world. As an example, in surveys we see such questions as "Are you happy with this product?" followed by a rating system of 1-5, or 'completely unhappy to completely happy'. This is an artificial enforcement of a quantitative measure on a qualitative property. How do we account for the fact that this is interpreted data and not calculated or measured?

My questions may have little relevance to database theory in general, but the concept fascinates me!”

Good for you.

I am not sure I understand your question, but I'll try to provide a general reply. The data that represent the real world at the logical database level -- relation tuples -- represent sets of facts (true propositions): property values of the entities the facts are about. The values are represented by attribute values, which can be as "simple" or "complex" as required (fuzzy terminology), but that does not change that fact (pun unintended).

I do not recall where in the book, but there is a note clarifying the meaning of "correct" as distinct from "true" in databases: it does not mean truth in the real world ("physical truth") -- no DBMS can know and guarantee that -- but consistency with the declared business rules comprising the conceptual model. If:

  • Business rules are asserted fully and correctly; and,
  • The database is designed correctly and the constraints represent the rules faithfully; and,
  • The DBMS implements the RDM and enforces the constraints,

semantic consistency can be guaranteed. If the data violates any of the constraints, it is inconsistent with (contradicts) the rules and, therefore, must be incorrect.
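The distinction can be sketched with a toy example. This is a minimal, hypothetical illustration using SQLite via Python: a business rule ("salary must be positive and no greater than 500,000") is declared as a CHECK constraint, so the DBMS guarantees consistency with the rule -- but it cannot guarantee physical truth, since a false-but-in-range value is still accepted. The table and values are invented for illustration.

```python
import sqlite3

# Hypothetical schema: the business rule "salary is positive and at most
# 500000" is declared as a CHECK constraint the DBMS will enforce.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  INTEGER NOT NULL CHECK (salary > 0 AND salary <= 500000)
    )
""")

# Consistent with the rule, so it is accepted -- even though the DBMS has
# no way of knowing whether Alice's salary is really 90000 ("physical truth").
conn.execute("INSERT INTO employees VALUES (1, 'Alice', 90000)")

# Violates the constraint: rejected, so data inconsistent with the declared
# rules never enters the database.
try:
    conn.execute("INSERT INTO employees VALUES (2, 'Bob', -5)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the first row is stored: the DBMS enforces consistency with the declared rules, while truth in the real world remains the users' responsibility.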

Comments on Republication

The questions are not only relevant to database theory, but also raise fundamental issues that I did not respond to at the time and that merit a response.

Even though they are distinct, correctness and truth are often confused in the industry. Correctness has to do with the consistency of the database with the conceptual model it is intended to represent; truth with whether the facts recorded in the database are true in the real world.

A database cannot be created and used without an interpretation (meaning or semantics) of the data it records -- the conceptual model of reality it is intended to represent. Via requirements analysis the modeler extracts from users the business rules and the types of facts to be recorded and designs the database to be consistent with the rules; users use the DBMS to record true facts in the database and make inferences from them -- query to derive new facts (theorems) from the recorded facts (axioms). Because they are the source of the model, users should know the intended meaning, although often this is not the case in industry practice. Databases should and can capture more of the meaning than they currently do, but the industry has failed to advance the RDM for that purpose and current DBMSs (SQL) are not really relational, let alone semantically rich (see an effort in this direction).
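The axioms-to-theorems point above can be sketched concretely. In this minimal, hypothetical example (again SQLite via Python, with invented tables and values), the recorded facts are the axioms -- who works in which department, and which department is at which site -- and a join query derives new facts (theorems) that were never explicitly recorded: who works at which site.

```python
import sqlite3

# Hypothetical axioms (recorded facts): employee-department and
# department-site assignments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE works_in   (emp TEXT, dept TEXT);
    CREATE TABLE located_at (dept TEXT, site TEXT);
    INSERT INTO works_in   VALUES ('Alice', 'R&D'), ('Bob', 'Sales');
    INSERT INTO located_at VALUES ('R&D', 'Boston'), ('Sales', 'Austin');
""")

# Derived facts (theorems): inferred by the join, guaranteed true only if
# the recorded facts are true and the design represents the rules faithfully.
derived = conn.execute("""
    SELECT w.emp, l.site
    FROM works_in w JOIN located_at l ON w.dept = l.dept
    ORDER BY w.emp
""").fetchall()
print(derived)  # [('Alice', 'Boston'), ('Bob', 'Austin')]
```

This is the logical-inference view of querying: the DBMS mechanically derives theorems from the axioms, which is why the truth of the recorded facts matters so much.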



