Saturday, November 30, 2019

5NF, Association Relations, and Join: Test Your Foundation Knowledge



Assume a conceptual model of a multigroup consisting of two related entity groups, Customers and Orders, where a customer can issue multiple orders. The conventional logical database design is:
CUSTOMERS
===============================================
| CID | NAME     | AGE | ADDRESS   | SALARY   |
-=====-----------------------------------------
|   1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|   2 | Khilan   |  25 | Delhi     |  1500.00 |
|   3 | Kaushik  |  23 | Kota      |  2000.00 |
|   4 | Chaitali |  25 | Mumbai    |  6500.00 |
|   5 | Hardik   |  27 | Bhopal    |  8500.00 |
|   6 | Komal    |  22 | MP        |  4500.00 |
|   7 | Muffy    |  24 | Indore    | 10000.00 |
-----------------------------------------------

ORDERS
===================================
| OID | DATE       | CID | AMOUNT |
-=====-----------------------------
| 102 | 2009-10-08 |   3 |   3000 |
| 100 | 2009-10-08 |   3 |   1500 |
| 101 | 2009-11-20 |   2 |   1560 |
| 103 | 2008-05-20 |   4 |   2060 |
-----------------------------------
where ORDERS.CID is an "embedded" foreign key (FK) referencing the primary key (PK) CUSTOMERS.CID.

Consider the query "For all orders, find the CID, name, OID, amount, and date" that applies a join of the two relations on CID. In SQL:

SELECT c.cid,c.name,o.oid,o.amount,o.date
FROM customers c
INNER JOIN orders o
ON c.cid = o.cid;
with the result displayed by the table:
====================================================
| C.CID | C.NAME   | O.OID | O.AMOUNT | O.DATE     |
-=======------------=======-------------------------
|     2 | Khilan   |   101 |     1560 | 2009-11-20 |
|     3 | Kaushik  |   102 |     3000 | 2009-10-08 |
|     3 | Kaushik  |   100 |     1500 | 2009-10-08 |
|     4 | Chaitali |   103 |     2060 | 2008-05-20 |
----------------------------------------------------
Note: A table is just a tabular display of a relation and the two should not be confused[1,2]. Bear in mind that SQL tables are not relations.

It may surprise you to know that both the design and the result are problematic from a relational standpoint.

Friday, November 8, 2019

The CAP Theorem and the RDM: Consistency, Availability, Partitioning


by David McGoveran


I have recently posted on LinkedIn two "To Laugh or Cry?" quotes about advantages and disadvantages of "relational" and non-relational" DBMSs from "Choosing The Right Database" @TowardsDataScience.com (by Jun Wu, "a Content Writer for Technology, AI, Data Science, Psychology, and Parenting" with "background in programming and statistics" who "on her spare time, writes poetry and blogs on her parenting website").

I also brought the article to the attention of David McGoveran, who replied "This deserves a response -- it is simplistic, but makes the current, widespread thinking in the industry clear". His comments that follow focus on the three concepts in the title.

Although likely written with good intentions, Jun Wu's article is representative of the muddy thinking that pervades the database and "data science" communities[1,2]. Besides the rather obvious fact that one does not choose a "database" (an organized collection of data), but a DBMS (software that manages databases), it swallows whole and then spreads confused and inconsistent usage of the terms consistency, availability, and partitioning. [FP: That practitioners use the terms database and DBMS interchangeably is part and parcel of the general misuse and abuse of terminology reflecting poor foundation knowledge.]

The terms consistency, availability, and partitioning as used in Brewer's CAP Theorem are not the same terms as those traditionally used in relational database theory and even commercial SQL DBMS practice. For pragmatic reasons I'll qualify the former terms with "CAP" and the latter terms with "RDB", even though I find it misleading to refer to commercial SQL systems as "relational"[3,4,5,6]. Let's look at each term and its intended meaning.

Friday, November 1, 2019

Comments on a Stonebraker Article


These comments were prompted by a LinkedIn post referencing Michael Stonebraker's Those Who Forget the Past Are Doomed to Repeat It  -- something I often reiterate myself -- where he argues:
“Over the past decade, there have been a number of DBMSs introduced (typically labeled as NoSQL) which utilize a network or hierarchical data model. MongoDB and Cassandra come immediately to mind as examples. Some such systems support networks through the concepts of "links" and some support hierarchical data using a nested data model often utilizing JSON. In my opinion, these systems have not internalized lessons from history.
“At the SIGFIDET (now SIGMOD) annual conference in 1974, there was a "Great Debate" over the merits of the relational model versus the network and hierarchical models ... Basically, the argument was about which model [relational or network] was a better fit for structured data (as opposed to documents, e-mails, etc.) and boiled down to two questions:

Question 1: Are high-level data sublanguages a good idea?
Question 2: Are tables the best data structure or should one use a network or hierarchy?”

“The last 45 years have definitely affirmed Codd’s position on both issues ... The conclusion from the 1970s was that the relational model provides superior data independence, compared to the network and hierarchical [graph] models. Forty-five years later, this conclusion is still true. If you want to insulate yourself from the changes that business conditions dictate, use a relational DBMS. If you want the successor to the successor to your job to thank you for your wise decision, use a relational model.”
I couldn't agree more, having repeatedly argued this myself. But he misses some old aspects that the industry has failed to recognize, has ignored, or dismissed[1]; and some important new aspects due to a new understanding of Codd's work[2].