Sunday, February 19, 2017

Simple Domains and Value Atomicity

Revised 2/20/17.

Here's what's wrong with last week's picture, namely:

Q: "I'm currently trying to design a database and I'm not too sure about the best way to approach a dynamically sized array field of one of my objects. My first thought is to use a column in my object to store an array of integers. However the more I read, the more I think this isn't the best option. Concrete example wise, I have a player object that stores 0 to many items, which are represented by an integer. What is the best way to represent this?" 
A: "If a collection of values is atomic, store them together. Meaning, if you always care about the entire group, if you never search for nested values and never sort by nested values, then they should be stored together as a single field value. If not, they should be stored in a separate table, each value bring a row, each assigned the parent ID (foreign key) of a record on the other table that "owns" them as a group. For more info, search on the term "database normalization".

Some databases, support an array as a data type. For example, Postgres allows you to define a column as a one-dimension array, or even a two dimension array. If your database does not support array as a type of column definition, transform you data collection into an XML or JSON support if your database your database supports that type. For example, Postgres has basic support for storing, retrieving, and non-indexed searching of XML using XPath. And Postgres offers excellent industry-leading support for JSON as a data type including indexed support on nested values. Going this XML/JSON route can be an exception to the normalization rules I mentioned above." --StackOverflow.com

Focus on physical implementation ("dynamically sized array field") without well-defined conceptual and logical features it is supposed to represent ("a player object" is hardly enough) and confusion of levels of representation (a real world object does not "store" anything) are always a red flag, an indication of poor grasp of foundation knowledge. So let's introduce some.

As I explain in the forthcoming first paper [1] in my new "Understanding Codd's True RDM" series, even though the RDM was introduced almost 50 years ago, the current understanding--such as it is--is very far from the original intent. One of the least understood features of the RDM as defined by Codd is the "simple domain ... whose elements are atomic (nondecomposable) values" [2].


Value decomposability


Date has argued that because any value can be decomposed, value atomicity is not absolute and cannot be enforced. So, while according to Codd only a relation with all attributes defined on simple domains with atomic values is in its simplest, normal form, Date contends that a relation is by definition in normal form because its attributes are defined on single-valued domains. Well, yes and no.

Note: The normal form became first normal (1NF) form when additional normal form were introduced.

Date interprets Codd's atomicity as an inherent property of the data, while we believe that Codd meant a choice by the database designer driven by application needs (in the context of what is possible using relational operators). A designer can define any domain with values of arbitrary complexity (images, documents, relations) as simple, if it has no components that are meaningful to applications--i.e., there is no need to constrain or manipulate elements of the values.


For a domain defined as simple to the DBMS, if the data sublanguage does not allow relational expressions--constraints and relational operations--to reference domain components, it is in effect rendering values non-decomposable. This is another way of saying that values are only values of defined domains, not of undefined domain components [3], as mandated by the Information Principle (IP), Rule 1 of the famous 12 Codd rules: "All information in a relational database must be represented explicitly in exactly one way, as values of relation attributes drawn from domains". For relational access to domain components, make them explicit domains in their own right.

The answer to the question has the heart in the right place and cleaned from the logical-physical confusion (LPC) of which it suffers and expressed more rigorously, it is: "If applications do not need the DBMS to constrain or manipulate elements of the value collections (sets) individually, define a simple relation-valued domain (RVD)". Note the lack of any reference to physical implementation aspects such as storage, records, fields and so on.

Document-valued domains (e.g., XML/JSON) can also be defined as simple and their values treated as atomic--i.e., no constraints on, or manipulation of document elements. 



Normalization


Per Codd, a relation with all attributes defined on simple domains is in 1NF and, by our interpretation, the data language can enforce atomicity". While EFC referred to "non-1NF relations",
"It is clear to me--and I agree with Date--that the term relation must be reserved for those relations that are at least in 1NF. The formal definitions of the relational operators work correctly only with relations in their normal form." --David McGoveran
Note: In fact, there are reasons to believe that a formal relation is necessarily fully normalized [3] (e.g., for guaranteed semantic correctness of query results), but this is beyond the current discussion. 

If a relation was designed with attributes defined on non-simple RVDs, there is a simple normalization process that eliminates the RVDs by separating their values as relations in their own right, related to the original relation via a FK-PK referential constraint [2]. But as I reiterated more than once, explicit normalization is necessary only for repair of poor designs. Proper design  implicitly produces fully normalized (5NF) relations [4]. 

Non-simple domains that are not RVDs (e.g., XML/JSON-valued) are not easily amenable to normalization. Eliminating them is more complicated--the information content of the documents' must be modeled such that they can be represented by 1NF relations. But the cost of not doing so may prove higher.


If relations are in their normal form, first order predicate logic (FOPL) is sufficient for relational data sublanguages, which confers the following significant practical advantages:
  • Language decidability;
  • Physical independence (PI);
  • Guaranteed logical and semantic correctness of query results;
  • Simplicity.
All this would be possible with (1) a true RDBMS that supports simple domains (2) a truly relational data sublanguage that does not subvert atomicity and (3)  relations in at least 1NF. SQL DBMSs do not support true domains, let alone simple ones and have special SQL functions that can be used to subvert atomicity.

Application programs can, of course, use the hosted relational sublanguage to retrieve atomic values and process them in whichever way desired at the application level. They can even update values if the new values satisfy the domain specification. But, of course, this defeats the purpose of the RDM: it requires higher logic than FOPL and a computationally complete programming language, the relational benefits are lost and all bets are off in that respect. The reason you never see applications enforce the integrity of the elements of complex-valued domains is that it is prohibitive for programmers to analyze and enforce in code; and soundness cannot be system-guaranteed.


So:
  • Yes, there is no absolute atomicity;
  • No, the atomicity designed into domains is enforcible by the data language;
  • Yes, a relation is at least in 1NF by definition, in which case EFC's definition based on simple domains is just fine.
Simple RVD-, image-, or document-valued domains are often due to modeling and design laziness. If uncertain about application needs, adhere to the Principle of Cautious Design and use explicit simple domains. The cost of non-simple domains is greater than ignoring the simple domains if not needed.


References

  1. Pascal, F., Interpretation and Representation of Database Relations, Understanding Codd's Real RDM, paper #1, forthcoming.
  2. Codd, E. F., A Relational Model of Data for Large Shared Data Banks (LINKS page).
  3. Pascal, F., First Normal Form in Theory and Practice, Parts 1,2 and 3, www.dbdebunk.com.
  4. Pascal, F., The Costly Illusion: Normalization, Integrity and Performance, PRACTICAL DATABASE FOUNDATIONS paper #2, www.dbdebunk.com.



Do you like this post? Please link back to this article by copying one of the codes below.

URL: HTML link code: BB (forum) link code:

3 comments:

  1. "The answer to the question has the heart in the right place" -- I'm sincerely glad you came to the same conclusion as me, at first I feared a hatchet job! You application of foundation knowledge here is superb, as we've come to expect.

    ReplyDelete
  2. If Codd meant "meant a choice by the database designer driven by application needs" then his position suffers from the same mistake that people make when they speak of the colour of an object. An object that appears to be blue under white light will appear to be black under yellow light. A design that will appear to be xNF under application needs X might appear to be 0NF under application needs Y. Properties that depend in any way on ambient conditions are not properties at all, they are illusions of them. And moreover they are not formally assessable unless and until everything that they are a function of is itself formally expressible. In this case, until "application needs" are formally expressible.

    ReplyDelete
  3. >>All this would be possible with (1) a true RDBMS that supports simple domains (2) a truly relational data sublanguage that does not subvert atomicity

    Are you aware of any active research going on with respect to the above, in either university or industry setting? Other than what is listed on Date/Darwen TTM site?

    ReplyDelete