Saturday, March 24, 2018

Data Modeling and NoSQL




Revised 3/25/18
"To the question “How relevant is data modeling in the world of NoSQL?” I give the following answer.

The main purpose of data modeling is to understand the business, some application domain, some users world. The model becomes a representation of that world -- the "things" in it, the relationships among those things and any constraints on those things or relationships. A secondary purpose is to build a database to contain information which pertains to and describes that domain."

"Generally we speak of the model coming first, then the implementation, and finally, the data gets collected and stored according to the model. Hence, the business data model should not be concerned with issues of physical stored representation, or the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system. That could be a relational DBMS, or a NoSQL tool".

" ... increasingly the data already exists in some form. Which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain. NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!). Either way, you must understand the business in order to make sense of the data." 

--Gordon Everest, LinkedIn.com
I have written extensively on the three levels of representation and four types of model and I won't repeat it here -- readers can refresh their memory if necessary[1,2]. Everest's comments are at best ambiguous with respect to the levels and models (e.g., by data modeling he means business modeling, and his "business data model" lumps together business model and data model). It is to avoid such ambiguities and the resulting confusion that I recommend the three-fold terminology of conceptual modeling, logical database design and physical implementation, eschewing data modeling[3]. Here I will rely on my earlier writings to address strictly the issue of data modeling in the NoSQL context raised by Everest.

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you.

NEW PUBLICATIONS


HOUSEKEEPING

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
 
@THE POSTWEST

  • The Dystopia of Western Decadence, the Only Acceptable Racism, the Myth of the “Palestinian”
------------------------------------------------------------------------------------------------------------------

Out of the trio of terms I recommend, logical database design is the only one which might properly be referred to as data modeling -- provided data professionals understood it correctly: it consists of using a data model in the true sense of the term -- a formal theory of data -- to formalize informal conceptual models as "computable" formal logical models for database representation[4]. But, unfortunately, they don't have a good grasp of it.

Two of the three components of a data model -- structure and integrity -- provide the formal database representatives corresponding to the informal elements of conceptual models that jointly comprise logical models (of course, these components cannot be defined independent of the third (manipulation) component).

The RDM, for example, fulfills this function as follows:


Structure
  • Relations represent entity groups;
- tuples represent (facts about) entities;
- attributes represent entity properties;
Integrity
  • Constraints represent relationships:
- among properties;
- among entities;
- among entity groups.
Many data professionals belabor under the misconception that the RDM supports only uniqueness (PK) and referential (FK) constraints. But the latter represents relationship among groups, and the former only one type of relationships among entities. In fact, distinct from conventional wisdom, the RDM properly understood also supports relationships among properties, and other types of relationships among entities (e.g., functional dependencies and aggregate)[5]. They are not supported by SQL DBMSs, and in the absence of foundation knowledge, practitioners are unaware of them.

Because business models are independent of physical implementation, Everest believes that either NoSQL tools or RDBMSs can be used for data management, but that the former "process data that exist in some form" more efficiently than the latter. However, such efficiency, if any, is achieved by trading off physical independence (PI)[6] for it: programmers have direct access to the physical level, producing applications that are brittle, vulnerable to physical changes. RDBMS software supports PI and translates logical data requests by applications to physical hardware access, insulating applications from the physical level.

In a data modeling context it is important to distinguish between two kinds of NoSQL tools, and to understand that not all "processings" are created equal, with different implications for data management practice.

  • Schema-less NoSQL systems, which handle arbitrarily structured existing data lack a data model. Their attraction is skipping business and data modeling altogether. You can do the former, but you can't formalize the results as logical models and have a DBMS enforce integrity and guarantee logical and semantic correctness[7]. Such systems often depend on a pattern matching language facility (e.g., awk) to find data that matches some user-defined format and then manipulate it in some ad-hoc way.
  • NoSQL systems that require a specific data structure (e.g., key-value) lack the attraction of "handling existing data in some form", but support a data structure that is much inferior to the n-ary relation for integrity enforcement and data manipulation purposes.
"Processing" in neither case benefits from the critical advantages of a true RDBMS (system-guaranteed logical validity, semantic correctness, declarative and decidable data languages, logical independence, and so on), or of even the lesser ones of a SQL DBMS. NoSQL systems permit computationally complete languages (CCL), which are expressively more powerful than relational data languages (enabling them to represent a broader range of business models), but they are often incapable of having a powerful DBMS to manage data, are necessarily procedural (the programmer has to specify everything step by step), and logical consistency is difficult to enforce, especially during system recovery. These are the very problems of pre-relational DBMSs that the RDM was intended to eliminate. They who forget the past...

Constraints are not "imposed to facilitate physical implementation in some data (storage) management system" (and neither is manipulation): they are logical and enforced by a RDBMS to ensure logical database consistency with the relationships specified in the business model[8] (not the NoSQL physical notion of copy consistency!). 


In fact, most NoSQL products are merely storage management systems, never rising to the level of database management systems. Not exactly equivalent alternatives to RDBMSs, as Everest implies (is this why he calls them tools, not DBMSs?) But in the absence of foundation knowledge, all this is lost on data professionals.

Note: Check out the comments to Everest's post -- hard to know whether to laugh or cry.


References

[1] Pascal, F., Data Model: Neither Conceptual, Nor Logical, Nor Physical

[2] Pascal, F., What Is a Data Model?

[3] Pascal, F., Levels of Representation: Conceptual Modeling, Logical Database Design and Physical Implementation

[4] Pascal, F., Business Modeling for Database Design: Formalizing the Informal

[5] Pascal, F., Relationships and the Relational Model


[6] Pascal, F., Physical Independence, Parts 1 and 2 

[7] Pascal, F., Logical Validity and Semantic Correctness

[8] Pascal, F., Business Rules, Predicates, Integrity Constraints and Database Consistency

7 comments:

  1. I'd want to remark here that the phrasing "Constraints represent relationships" will serve more to perpetuate rather than debunk the misconception that constraints are more than just keys and foreign keys [and single-tuple CHECK constraints for those who think that far] .

    ReplyDelete
    Replies
    1. I see what you mean. I recently I wrote a post "Relationships and the RDM" where I expressly specified all the relationships beyond FKs. I will edit the text to stress the issue and add it to the References.

      Delete
  2. The NoSQL Tool idea needs more consideration, in fact, it's about dealing with the nature of the data which is unstructured comparing to relational structured data with referential integrity.
    For me, it's clear that an RDBMS which supports JSON (NoSQL) is far more able to take the maximum value form the DATA around the Relational Data Model.

    The following article recalls the need for a Model with the new BIG DATA, NoSQL era:
    https://tdwi.org/articles/2017/03/22/data-modeling-for-big-data-and-nosql.aspx

    ReplyDelete
    Replies
    1. You should read several of my posts/papers on this subject.

      "Unstructured data" is a contradiction: data are by definition structured, otherwise it's random noise. There is no "inherent nature" of data -- it has whatever structure we choose to give it. Documents have multiple structures -- we just don't analyze them and restructure them for different informational purposes.

      Integrity is not only referential.

      A RDBMS cannot do MORE with JSON than a NoSQL can. To take full advantage of the RDM the data must be structured AS RELATIONS.

      The attraction of BigData and NOSQL is precisely the lack of a data model. Their content can be analyzed, modeled and mapped to relations using the RDM, but this is precisely what nobody wants to do -- they delude themselves that they can achieve the same as with RDBMS, but without analysis, modeling and design.

      Delete
    2. I've wound up explaining this a bit more concretely: if I create an integrity constraint in my database, I know it's there because it raises an error when I violate it.

      Structure dictates the behavior of mathematical operations. It's why running your program against some inputs will produce useful outputs, and other inputs will crash, hang or return garbage.

      In an "unstructured" database, my code is still expecting inputs to have the correct types and sensible values. If they're not, I get an error message from deep in my application code *or* I get garbage output. The difference is that I don't find out until it's already in production! Or I wind up writing thousands of lines of tests and checks to do what dozens of lines of SQL could accomplish.

      This is the NoSQL swindle: they sell you half a DBMS, and you get to rewrite the other half (the optimizer, the integrity constraints, etc.) in your application code.

      Delete
  3. "Such systems often depend on a pattern matching language facility (e.g., awk) to find data that matches some user-defined format and then manipulate it in some ad-hoc way."

    In practice, what happens is you spread the schema all over application code. Your managers think it's simpler because they can't see it, but they're just living in denial.

    ReplyDelete
  4. That is why a real data model has all three components, which are defined jointly -- relational algebra and constraints are derived from relational structure.

    What the industry calls "unstructured data" are data has not ben modeled for any particular integrity enforcement and manipulation -- it's all adhoc and up to the application programmers and soundness is not guaranteed by the system.

    ReplyDelete