Wednesday, March 27, 2019

Graph Databases: They Who Forget the Past...


Out of the plethora of misconceptions common in the industry[1], quite a few are squeezed into this paragraph:
“The relational databases that emerged in the ’80s are efficient at storing and analyzing tabular data but their underlying data model makes it difficult to connect data scattered across multiple tables. The graph databases we’ve seen emerge in the recent years are designed for this purpose. Their data model is particularly well-suited to store and to organize data where connections are as important as individual data points. Connections are stored and indexed as first-class citizens, making it an interesting model for investigations in which you need to connect the dots. In this post, we review three common fraud schemes and see how a graph approach can help investigators defeat them.
--AnalyticBridge.DataScienceCentral.com

Relational databases did not emerge in the 80s (SQL DBMSs did);
  • There is no "tabular data" (the relational data structure is the relation, which can be visualized as a table on a physical medium[2], and SQL tables are not relations);
  • Analysis is not a DBMS, but an application function (while database queries, as deductions, are an important aspect of analysis, and computational functions can be added to the data sublanguage (as in SQL), the primary function of a DBMS is data management)[3];
  • A data model has nothing to do with storage (storage and access methods are part of physical implementation, which determines efficiency/performance[4]).

Here, however, we will focus on the current revival (rather than emergence) of graph DBMSs claimed superior -- without any evidence or qualifications -- to SQL DBMSs (not relational, which do not exist) that purportedly "make it difficult to connect data scattered across multiple tables". This is a typical example of how lack of foundation knowledge and of familiarity with the history of the field inhibit understanding and progress[5].


Saturday, March 9, 2019

Fourth Order Properties Part 2: Association Relations in Database Design - An Example


Part 1 outlined fundamentals of fourth order properties (4OP) of a multigroup arising from relationships among its group members due to 1:1, M:1, M1:M2, and, generally, M1:M2:M3:...:Mn relationships among the groups' entity members. Fundamentals are commonly missing from database practice, as reflected in the exchange:

“We have Building, Room, and Bed entities. Logically, if this is in the scope of some hypothetical hotel, then each one of those entities is dependent on their parent to exist ... you cannot have a bed without a room. Also, that room wouldn't exist without its parent, Building. So, why have I rarely seen this identifying relationship introduced? When I was learning databases, everything was apparently "non-identifying". When is this type of relationship necessary, if at all? I see the issue arises when that BED can exist without a BUILDING. If you were to INSERT into the BED table, you are constraint [sic] to provide a building_id, as the building_id is part of that BED's primary key. Couldn't you avoid an identifying relationship by giving each table its own surrogate primary key? Is this the correct representation  of an identifying relationship? I could avoid that by just giving each table its own ID. At the end of the day, this is about IDENTIFYING relationships, not their existence, which is how I've been logically determining if something is an "identifying relationship" If that were the case, then any 1:N relationship could be "identifying" but that's not how you define identifying or non-identifying.”
“Interesting -- I’d never heard this term before. I’ve heard it referred to as a cached ID though, as that 2nd ID isn’t required, but may be beneficial for performance purposes. For this example with 3 levels it’s not a huge joint statement, but for some systems with 12 tables the joins get unpleasant. I’ve never started a system with this additional id, but I have added one later on once the need was there and the profiling led to this being the best solution for our specific situation. Usually though, just creating a view that does the joins for me has been easier. I’ll be curious what has led others to use this approach.”
It's not really introduced because it's way more towards academic than functional.”
--Reddit.com
Knowledge of the fundamentals would have obviated the question, the ad-hoc terminology, and the answers. Note in particular how -- notwithstanding the conceptual and logical nature of the question -- the first answer typically delves directly into implementation[1]. The second answer does not merit attention, except as indicator of the sad state of the industry.

Given a conceptual model, we shall now compare the database design proposed in the question with (1) conventional industry practice that includes "embedded foreign keys", and (2) assuming a true RDBMS, the unified representation using association relations of not just M1:M2, or, generally, M1:M2:M3:...:Mn 4OP relationships, but also the 1:1 and M:1 special cases[2].


Saturday, March 2, 2019

Fourth Order Properties Part 1: Association Relations vs. Foreign Keys


 “We have Building, Room, and Bed entities. Logically, if this is in the scope of some hypothetical hotel, then each one of those entities is dependent on their parent to exist ... you cannot have a bed without a room. Also, that room wouldn't exist without its parent, Building. So, why have I rarely seen this identifying relationship introduced? When I was learning databases, everything was apparently "non-identifying". When is this type of relationship necessary, if at all? I see the issue arises when that BED can exist without a BUILDING. If you were to INSERT into the BED table, you are constraint [sic] to provide a building_id, as the building_id is part of that BED's primary key. Couldn't you avoid an identifying relationship by giving each table its own surrogate primary key? Is this the correct representation  of an identifying relationship? I could avoid that by just giving each table its own ID. At the end of the day, this is about IDENTIFYING relationships, not their existence, which is how I've been logically determining if something is an "identifying relationship" If that were the case, then any 1:N relationship could be "identifying" but that's not how you define identifying or non-identifying.”

“Interesting -- I’d never heard this term before. I’ve hears it referred to as a cached ID though, as that 2nd ID isn’t required, but may be beneficial for performance purposes. For this example with 3 levels it’s not a huge joint statement, but for some systems with 12 tables the joins get unpleasant. I’ve never started a system with this additional id, but I have added one later on once the need was there and the profiling led to this being the best solution for our specific situation. Usually though, just creating a view that does the joins for me has been easier. I’ll be curious what has led others to use this approach.”

“It's not really introduced because it's way more towards academic than functional.”

--Reddit.com

Such questions, and ad-hoc terms like "identifying relationships"[1] come up because practice is driven by intuition and experience (if any), without the benefit of foundation knowledge[2]. Whether practitioners know/like it or not, a database is a formal computable representation of an informal conceptual model[3] and, therefore, data modeling (i.e., logical database design)[4] is impossible without (1) a well-defined and complete conceptual model and (2) a formal data model with which to formalize it as a logical model[5]and the two should not be confused[6]. Otherwise all bets are off.

Here's how foundation knowledge should have informed modeling and design.