Friday, December 20, 2019

The RDM and Model Stability





“3rd normal form data models in data warehousing efforts struggle when changes impact parent child relationships. These impacts cause cascading changes to the data model, the queries, and the loading processes. [For example:]
  • There are bank accounts
  • Each account belongs to exactly one customer
  • A customer can have more than one account
The bank introduces a new product: joint accounts, which means an account can now have more than one owner. It is clear that the 3NF model has to be extended in order to keep this new information; the data vault models seems to be able to fulfill the new requirement.

Some banks propose joint accounts, some don’t, therefore some use M:N relation between client and accounts and others 1:N. A model which is good for any possible case is actually awful model because it describes nothing: by looking at this model you can’t say if joint accounts exist among bank's products.”

--Data Vault and Model (in)Stability

Data warehousing/vault[1] are a red herring here -- the real issue is data independence. Some corrections and clarifications first:

  • Normal forms do not pertain to the data model itself -- the RDM -- but to relations in logical models created using strictly the RDM[2].
  • 3NF is insufficient -- relations are in 5NF by definition, otherwise correctness is not guaranteed[3].
  • The RDM was introduced as a database representation superior to old directed graph -- hierarchic and network (CODASYL) -- systems for conceptual models focused on relationships among entity groups, rather than among individual entities[4]. Graph database representation (nodes and edges) corresponds to a worldview at the conceptual level of parents-children (network) relationships, of which parent-children (hierarchy) is a special case. The relational representation (relations) corresponds to M:N relationships among entity groups, of which M:1 is a special case[5].

Note: Correctness -- logical and semantic[6] -- requires adherence to three principles of database design that jointly imply 5NF[7].


------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

Up to 2018, DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 that website was discontinued. The content of this site is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. Thank you.

NEW

  • 12/20/19 Updated LINKS page.
  • 10/26/19: The POSTS page now links to all 2012-2018 posts (to be updated annually at year-end). Except for 2017, the (italicized) links are to abstracts of my columns @All Analytics site, which was discontinued (see below).
  • 10/26/19: Updated and cleaned up the WRITINGS page.
  • 08/09/19: Following my series of posts on data sublanguage (Parts 1-4), I have revised for consistency the corresponding section of paper #2 in the Understanding the Real RDM series, Logical Access, Data Sublanguage, Kinds of Relations, and Database Redundancy and Consistency, which is available for ordering from the PAPERS page.
LATEST PUBLICATIONS (order PAPERS and BOOKS) 

USING THIS SITE 

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search. 
  • Following the discontinuation of AllAnalytics site, the links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.
SOCIAL MEDIA 

I deleted my Facebook account. You can follow me:

  • @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
  • @The PostWest blog: Evidence for Antisemitism/AntiZionism – the only universally acceptable hatred – as the (traditional) response to the existential crisis of decadence and decline of Western (including the US)
  • @ThePostWest Twitter page where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------


Among the many advantages of -- and a core motivation for -- the RDM was data independence (DI) -- given a conceptual model, applications are insulated from changes in physical implementation (PI)[8] and logical design (LI)[9] (i.e., are stable). But note very carefully that the example is of a conceptual change: a M:1 relationship among account and customer groups changes to a M:N relationship, which a data model -- the RDM or some "graph data model", had one been formalized[10] -- cannot be expected to insulate logically (why?).

And yet, it turns out that for the specific type of conceptual change in the example, the RDM properly understood[11] can do even that -- yet another advantage for its domain of applicability.

Consider the practically identical example we used in the previous post: two entity groups, orders and customers, where many orders can be issued by exactly one customer (M:1). The conventional two-relation 5NF design displayed by two R-tables:

     CUSTOMERS
    ===============================================
    | CID | NAME     | AGE | ADDRESS   | BALANCE  |
    -=====-----------------------------------------
    |   1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
    |   2 | Khilan   |  25 | Delhi     |  1500.00 |
    |   3 | Kaushik  |  23 | Kota      |  2000.00 |
    |   4 | Chaitali |  25 | Mumbai    |  6500.00 |
    |   5 | Hardik   |  27 | Bhopal    |  8500.00 |
    |   6 | Komal    |  22 | MP        |  4500.00 |
    |   7 | Muffy    |  24 | Indore    | 10000.00 |
    -----------------------------------------------

    ORDERS
    ===================================
    | OID | DATE       | CID | AMOUNT |
    -=====-----------------------------
    | 102 | 2009-10-08 |   3 |   3000 |
    | 100 | 2009-10-08 |   3 |   1500 |
    | 101 | 2009-11-20 |   2 |   1560 |
    | 103 | 2008-05-20 |   4 |   2060 |
    -----------------------------------


where ORDERS.CID is an "embedded" FK referencing the CUSTOMERS.CID PK.

M:1 is just a special case (N=1) of M:N, the relational representation of which is the association relation -- so why the different representation?

When the RDM was introduced, the old generation graph DBMSs -- particularly hierarchic -- were dominant and entrenched and, relational advantages notwithstanding, proved extremely difficult to dislodge (as they remain to this day). We suspect that embedded FKs were Codd's attempt to make relational databases more familiar and attractive to users of hierarchic DBMSs by emulating their structure with relations. In other words, embedded FKs are an artifact of history, not a formal element of the RDM. M:N relationships including the M:1 special case should be represented uniformly by association relations[12]. Hence the 3-relation design in the previous post:

CUSTOMERS
===============================================
| CID | NAME     | AGE | ADDRESS   | SALARY   |
======----------------------------------------- 
|   1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|   2 | Khilan   |  25 | Delhi     |  1500.00 |
|   3 | Kaushik  |  23 | Kota      |  2000.00 |
|   4 | Chaitali |  25 | Mumbai    |  6500.00 |
|   5 | Hardik   |  27 | Bhopal    |  8500.00 |
|   6 | Komal    |  22 | MP        |  4500.00 |
|   7 | Muffy    |  24 | Indore    | 10000.00 |
-----------------------------------------------

ORDERS
=============================
| OID | DATE       | AMOUNT |
-=====-----------------------
| 102 | 2009-10-08 |   3000 |
| 100 | 2009-10-08 |   1500 |
| 101 | 2009-11-20 |   1560 |
| 103 | 2008-05-20 |   2060 |
-----------------------------

CUSTOMERS_ORDERS
=============
| OID | CID |
-===========-
| 101 |   2 |
| 100 |   3 |
| 102 |   3 |
| 103 |   4 |
-------------

Given this design, consider now the change of the relationship from M:1 to M:N: does the 5NF logical model need "to be extended in order to keep this new information", and if so, how?

This design together with the 5NF mandate for both base and derived relations are just two elements of the McGoveran's reinterpreted and extended formalization of Codd's RDM[13]. It involves revisions of, for example, of relational closure and the relational algebra (RA), among others, for which you will have to wait for McGoveran's forthcoming book[11].

By the way, in the previous post we asked you to figure out the result of a join revised to accommodate this design relative to that of the current join and the conventional embedded FK design  -- did you?





Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


References
[1] Pascal, F., Data Warehouses and the Logical-Physical Confusion

[2] Pascal, F., Data Model: Neither Business, Nor Logical, Nor Physical Model

[3] Pascal, F., What Relations Really Are and Why They Are Important 

[4] Pascal, F., Reader Mail: Sets vs. Graphs, Education vs. Training

[5] Pascal, F., Relationships and the RDM series

[6] Pascal, F., Relational Database Design, Logical Validity, and Semantic Consistency

[7] Pascal F., Database Design: What It Is and Isn't

[8] Pascal, F., Physical Independence series

[9] Pascal, F., On View Updating (C. J. Date and D. McGoveran)

[10] Pascal, F., Graph Databases: Them Who Forget the Past...

[11]
McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming. 

[12] Pascal, F., Fourth Order Properties series

[13] Pascal, F., Re-interpreting Codd


No comments:

Post a Comment

View My Stats