Sunday, June 10, 2018

Foreign Keys Part 1: Understanding the Fundamentals


Note: This is the first part of a two-part re-write of several earlier posts, to bring them into line with McGoveran's formalization and re-interpretation of Codd's true RDM, which includes his own corrections, refinements and extensions[1]. For a more in-depth treatment see the series of papers available here.

“Do I Have to Use Foreign Keys? If I am already manipulating data properly, are foreign keys required? Do they have another purpose that I’m just not aware of? I appreciate the guidance!”
“... [we] wish to make a point. There is something which is bad design/good design/mandatory/optional. Please stop insisting that Primary and Foreign keys are mandatory. They are good design habits but by no means mandatory. However, life is much more complex than a Normalized DB structure. This includes tables serving as event logs; tables, serving as User maintained materialized query tables, tables, serving as supporting structures, reflecting state of complex transactional databases; persistent tables serving as Result Set or Session keepers. And I personally believe that if they were truly mandatory, Sybase, Oracle, SQL Server, Ingres, DB2, etc. would require them. Oh, sorry, forgot the SQL standard itself. This is not the relational model we're talking about. These are commercially available RDBMSs which, not surprisingly, DO tend to listen to their customers. If they didn't, they wouldn't be in business!! Since Sybrand is unlikely to get FKs required by the SQL standard or the major RDBMS vendors, it seems that mandatory means that his answer to the question "Do I have to use foreign keys?" is "You would if you worked in my shop!". I'm inclined to agree with that.”
“Databases can work with or without primary keys and foreign keys. The choice is yours... However ... enforcing referential integrity can be done by many methods ... TMTOWTDT = There is more than one way to do this ... It all depends on your approach... In the last ten years... every one is enforcing referential integrity with help of primary and foreign keys but before this ... a lot of applications were working without primary and foreign keys to enforce referential integrity and to avoid orphaned rows/avoid duplicate records.”
“We don't have every possible logical relationship enforced by the database. Sometimes you have to compromise for performance reasons, as too many foreign key validations can slow down high volume inserts. Other times you have to create breakpoints just to keep the web of relationships from becoming too tangled and connecting hundreds or thousands of tables.”
“I think it is preferable to have FK constraints as an additional security layer and they can be disable[d] during loading if required; however, you need to be 100% certain your ETL is enforcing the constraints. It is best to do both - have the ETL reject records which fail FK checks and report on these whilst also enforcing FK intergrity on insert/update, if appropriate. The only additional thing I can add is - when you delete from a FK enabled DB, make sure you do it in the correct order.”
“I think, you have to learn about data structures and logical data design (not only database, which is nowadays interpreted mainly as only RDBMS), to be clear about usage primary, alternate, and foreign keys, normal forms, data integrity-and database integrity, because your database will work suboptimally without these knowledge if it will work at all.”
Data practitioners have high levels of tool knowledge, but lack a good grasp of fundamentals, for which reason they cannot be considered data professionals. Now, do not get me wrong: I do not mean that good knowledge of tools is unimportant -- if you work with them you gotta know them real well -- but the ability to fully assess them, use them optimally, and compensate for any shortcomings is limited in the absence of foundation knowledge. So let's have some before tackling the exchange.


------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so let's take advantage of his generosity. Purchasing my papers and books will also help. Thank you. 

NEW PUBLICATIONS

NEW: SOCIAL MEDIA

I deleted my Facebook account. You can follow me on Twitter:
The DBDebunk page will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back. I created a #RelModel hashtag for this purpose.

The PostWest pages will contain links to evidence for, and my take on Dystopian Western Decadence, The Only Acceptable Racism Left, and The Weaponized Myth of a "Palestinian Nation" for which the anti-semitic world fell lock stock and barrel. I will create a #PostWest hash tag for this purpose. 

HOUSEKEEPING

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
-----------------------------------------------------------------------------------------------------------------

Inter-group Relationships and Referential Constraints


As explained elsewhere, there are several types of relationships that are specified informally by business rules (BR) in a conceptual model, all of which are supported by the RDM[2]. One type of relationships exist among members of different entity groups that form a multi-group. Entity groups are represented formally in the database by relations, and relationships by integrity constraints on and among relations: by enforcing the constraints, a RDBMS (rather than redundantly every application) ensures that the database is consistent with (represents accurately) the conceptual model. Inter-group relationships are collective properties of a multi-group: to determine whether a relationship holds (i.e., if the multi-group has the property), multiple members of multiple groups must be examined. How many members depends on whether the relationship is one-to-one (1:1), many-to-one (M:1), or many-to-many (M:N). Referred to as fourth order relationships (4OP), they are represented in the database by referential constraints among relations[3,4].



Hierarchies, Networks and Foreign Keys


M:1 and M:N relationships among entities of different types are referred to as hierarchies and networks, respectively. Here is the four-object hierarchy example used originally by Codd[5]:

                                     EMPLOYEE
                         |          
    +--------------------+-----------------------+
    |                                            |
JOBHISTORY                                    CHILDREN
    |
SALARYHISTORY


In a hierarchic database it would be represented by four physical "segments" (record types) connected by pointers. In other words, relationships among entity types were represented by physical organization in storage and data had to be accessed accordingly. For example, SALARYHISTORY data would be accessible only via EMPLOYEE records and pointer navigation through the JOBHISTORY records. Aside from complexity, queries and applications had to reference physical structure explicitly, and any reorganization would impair them, necessitating prohibitive maintenance.

In relational systems access is based exclusively on the data themselves, not machine internals that are irrelevant to application developers and users, and disruptive when they change. This is enshrined in Codd's core Information Principle (IP):

"All information in a relational database -- including about relationships -- is represented explicitly, and in exactly one way: as attribute values in relations".
which ensures physical independence (PI) -- the insulation of applications and users from physical details and changes thereof.


Foreign Keys and the Normal Form


In a straight mapping of segments to relations, JOBHISTORY, SALARYHISTORY, and CHILDREN would become attributes defined on non-simple domains, specifically, relation-valued domains (RVD) that have relations as values. A "nested" relation represents an entity group just like a "nesting" relation does. The nesting was a way to emulate a hierarchy in a relational database. But relations nested within relations (sets of sets) require logic higher than first-order predicate logic (FOPL), the formal foundation of RDM, which would eliminate all its advantages. This is why database relations are defined only on simple domains (the values of which are treated as atomic by the data language)[6] -- another way of saying that they are in their normal form, which is why Codd introduced foreign keys (FK) to enable representation of hierarchic relationships without nesting[5]:

"In order to discuss a preferred way (or normal form), we must first introduce a few additional concepts ... primary key, foreign key, non-simple domain)..."
This substituted a value-based for a pointer-based representation, and replaced navigation with joins (i.e., matching of FK-PK values) of the relational algebra (RA)In the example, JOBHISTORY.MAN# and SALARYHISTORY.JOBDATE are FKs. As we have explained, in the RDM PKs represent entity names, not properties, and are managed transparently by the DBMS[11]. Which means that FKs also represent names. 

A M:N relationship among members of different entity groups (network) is decomposable into two M:1 relationships (hierarchies), and is represented by an associative relation and two referential constraints among it and each of the relations representing the related relations.

Note: Given nested relations, normalization unnests them and puts them in their normal form. We alert the reader that:
  • Codd's normalization[7] is distinct from Codd's further normalization to higher normal forms[8];
  • The current 1NF [9], which is not the same as Codd's original normal form, is insufficient -- database relations are also in Fifth Normal Form (5NF) by definition[10];
  • Neither normalization, nor further normalization are necessary if relations:
- Are defined on simple domains with atomic values;
- Represent entity groups with members of a single type[4].

Graph DBMSs (the formal foundation of hierarchic and network DBMSs is directed graph theory) are in vogue these days, promoted as if they were a new a superior alternative to RDBMSs (which means SQL DBMSs, as true RDBMSs do not exist). Practitioners who were not around to experience the old graph DBMSs that FKs and SQL replaced better check them out (e.g.,hierarchic hierarchic and network) before they get too excited. Those who forget the past...

We can now debunk the exchange: practitioners complain (erroneously) that the RDM does not support relationships[2], while at the same time they dismiss the support that actually exists.

(Continued in Part 2)

 

Follow:  @DBDEBUNK    @THEPOSTWEST



References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., Relationships and the RDM Parts 1,2,3.

[3] Pascal, F., Conceptual Modeling for Database Design, new version forthcoming.

[4] Pascal, F., Database Design: Formalizing the Informal, new version forthcoming.

[5] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970).

[6] Pascal, F., Simple Domains and Value Atomicity.

[7] Codd, E. F., Normalized Data Base Structure: A Brief Tutorial. IBM Research Report, San Jose, California RJ935 (1971).

[8] Codd, E. F., Further Normalization of the Data Base Relational Model. IBM Research Report, San Jose, California RJ909 (1971).

[9] Pascal, F., First Normal Form in Theory and Practice Parts 1,2.

[10] Pascal, F., What Relations Really Are and Why They Are Important.

[11] Pascal, The Key to Relational Keys: A New Understanding.

  
Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.




No comments:

Post a Comment