Wednesday, January 22, 2020

TYFK: “Why is a relational database so powerful?”


 Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.
“...the theoretical awesomeness of relational algebra is kinda hard to intuitively relate back to your payroll or audit-log tables - the real power is the computed join ... it lets you dynamically fetch sets of data in the exact format that you need ... with any group of tables in the dataset. Unlike other data models, where the things you can fetch are typically fixed when you define your elements, and where relationships between data - if any - are statically defined in advance ... joins let you specify the relationships between objects (rows and tables in SQL-based relational databases) ... create queries and run them on your data without needing to write a lot of extra code beyond the SQL itself. This Ad Hoc Query capability ends up being hugely valuable when doing "secondary" business tasks in a big data world such as doing reporting and analytics, and is often hugely difficult to do in non-relational environments without a lot of extra code and often a specialized reporting schema.”

“Relational theory as applied to databases is that all data is connected to each other, keyed to each piece ... And with a SQL query [you can] create anything that can exist, as output.”

“...main so called power of RDBMs lies within ACID compliance. A transaction in a RDB is Atomic, Consistent, isolated, durable ... makes a database useful, unique, or I suppose powerful ... let's say you update or insert some new record, right in the middle power goes out, due to ACID compliance your transaction will not go through ... either the operation will complete or fail, nothing in between. [And RDBMSs are] tried, tested and true for almost 50 years.”

First try to detect the misconceptions then check against our debunking (some of the above is correct per se but not directly relevant to relational power, some only partially correct, and some is wrong. Can you discern which is which?) If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Wednesday, January 15, 2020

TYFK: "What is better than relational databases?"

Note: About TYFK posts (Test Your Foundation Knowledge) see the post insert below.
“Relational databases like MySQL, PostgreSQL and SQLite3 represent and store data in tables and rows. They're based on a branch of algebraic set theory known as relational algebra. Meanwhile, non-relational databases like MongoDB represent data in collections of JSON documents. The Mongo import utility can import JSON, CSV and TSV file formats. Mongo query targets of data are technically represented as BSON (binary JASON).

“Relational databases use Structured Querying Language (SQL), making them a good choice for applications that involve the management of several transactions. The structure of a relational database allows you to link information from different tables through the use of foreign keys (or indexes), which are used to uniquely identify any atomic piece of data within that table. Other tables may refer to that foreign key, so as to create a link between their data pieces and the piece pointed to by the foreign key. This comes in handy for applications that are heavy into data analysis.

“If you want your application to handle a lot of complicated querying, database transactions and routine analysis of data, you’ll probably want to stick with a relational database. And if your application is going to focus on doing many database transactions, it’s important that those transactions are processed reliably. This is where ACID (the set of properties that guarantee database transactions are processed reliably) really matters, and where referential integrity comes into play.”
First try to detect the misconceptions, then check against our debunking. If there isn't a match, you can acquire the necessary foundation knowledge in our POSTS, BOOKS, PAPERS, LINKS or, better, organize one of our on-site SEMINARS, which can be customized to specific needs.

Friday, January 3, 2020

Science, "Data Science", and Database Science


“The foundation of modern database technology is without question the relational model; it is the foundation that makes the field a science.”
--C. J. Date, AN INTRODUCTION TO DATABASE SYSTEMS
“Over the past decades mainstream economics in universities has become increasingly mathematical, focusing on complex statistical analyses and modeling to the detriment of the observation of reality.”
-- J. Luyendijk, Don’t let the Nobel Prize fool you. Economics is not a science

Science is the formulation and validation of theories about the real world in the context of discovery (CoD) and context of validation (CoV), respectively. There is "hard" science -- theories about the physical world (physics, chemistry, biology) -- and "soft" science -- theories about human behavior (political, economics, psychology). All science uses data, initially only in the CoV, but increasingly also in the CoD -- computerized discovery of patterns as potential hypotheses (i.e., "data mining"). 

Friday, December 20, 2019

The RDM and Model Stability



“3rd normal form data models in data warehousing efforts struggle when changes impact parent child relationships. These impacts cause cascading changes to the data model, the queries, and the loading processes. [For example:]
  • There are bank accounts
  • Each account belongs to exactly one customer
  • A customer can have more than one account
The bank introduces a new product: joint accounts, which means an account can now have more than one owner. It is clear that the 3NF model has to be extended in order to keep this new information; the data vault models seems to be able to fulfill the new requirement.

Some banks propose joint accounts, some don’t, therefore some use M:N relation between client and accounts and others 1:N. A model which is good for any possible case is actually awful model because it describes nothing: by looking at this model you can’t say if joint accounts exist among bank's products.”

--Data Vault and Model (in)Stability

Data warehousing/vault[1] are a red herring here -- the real issue is data independence. Some corrections and clarifications first:

  • Normal forms do not pertain to the data model itself -- the RDM -- but to relations in logical models created using strictly the RDM[2].
  • 3NF is insufficient -- relations are in 5NF by definition, otherwise correctness is not guaranteed[3].
  • The RDM was introduced as a database representation superior to old directed graph -- hierarchic and network (CODASYL) -- systems for conceptual models focused on relationships among entity groups, rather than among individual entities[4]. Graph database representation (nodes and edges) corresponds to a worldview at the conceptual level of parents-children (network) relationships, of which parent-children (hierarchy) is a special case. The relational representation (relations) corresponds to M:N relationships among entity groups, of which M:1 is a special case[5].

Note: Correctness -- logical and semantic[6] -- requires adherence to three principles of database design that jointly imply 5NF[7].

Saturday, November 30, 2019

5NF, Association Relations, and Join: Test Your Foundation Knowledge



Assume a conceptual model of a multigroup consisting of two related entity groups, Customers and Orders, where a customer can issue multiple orders. The conventional logical database design is:
CUSTOMERS
===============================================
| CID | NAME     | AGE | ADDRESS   | SALARY   |
-=====-----------------------------------------
|   1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|   2 | Khilan   |  25 | Delhi     |  1500.00 |
|   3 | Kaushik  |  23 | Kota      |  2000.00 |
|   4 | Chaitali |  25 | Mumbai    |  6500.00 |
|   5 | Hardik   |  27 | Bhopal    |  8500.00 |
|   6 | Komal    |  22 | MP        |  4500.00 |
|   7 | Muffy    |  24 | Indore    | 10000.00 |
-----------------------------------------------

ORDERS
===================================
| OID | DATE       | CID | AMOUNT |
-=====-----------------------------
| 102 | 2009-10-08 |   3 |   3000 |
| 100 | 2009-10-08 |   3 |   1500 |
| 101 | 2009-11-20 |   2 |   1560 |
| 103 | 2008-05-20 |   4 |   2060 |
-----------------------------------
where ORDERS.CID is an "embedded" foreign key (FK) referencing the primary key (PK) CUSTOMERS.CID.

Consider the query "For all orders, find the CID, name, OID, amount, and date" that applies a join of the two relations on CID. In SQL:

SELECT c.cid,c.name,o.oid,o.amount,o.date
FROM customers c
INNER JOIN orders o
ON c.cid = o.cid;
with the result displayed by the table:
====================================================
| C.CID | C.NAME   | O.OID | O.AMOUNT | O.DATE     |
-=======------------=======-------------------------
|     2 | Khilan   |   101 |     1560 | 2009-11-20 |
|     3 | Kaushik  |   102 |     3000 | 2009-10-08 |
|     3 | Kaushik  |   100 |     1500 | 2009-10-08 |
|     4 | Chaitali |   103 |     2060 | 2008-05-20 |
----------------------------------------------------
Note: A table is just a tabular display of a relation and the two should not be confused[1,2]. Bear in mind that SQL tables are not relations.

It may surprise you to know that both the design and the result are problematic from a relational standpoint.

Friday, November 8, 2019

The CAP Theorem and the RDM: Consistency, Availability, Partitioning


by David McGoveran


I have recently posted on LinkedIn two "To Laugh or Cry?" quotes about advantages and disadvantages of "relational" and non-relational" DBMSs from "Choosing The Right Database" @TowardsDataScience.com (by Jun Wu, "a Content Writer for Technology, AI, Data Science, Psychology, and Parenting" with "background in programming and statistics" who "on her spare time, writes poetry and blogs on her parenting website").

I also brought the article to the attention of David McGoveran, who replied "This deserves a response -- it is simplistic, but makes the current, widespread thinking in the industry clear". His comments that follow focus on the three concepts in the title.

Although likely written with good intentions, Jun Wu's article is representative of the muddy thinking that pervades the database and "data science" communities[1,2]. Besides the rather obvious fact that one does not choose a "database" (an organized collection of data), but a DBMS (software that manages databases), it swallows whole and then spreads confused and inconsistent usage of the terms consistency, availability, and partitioning. [FP: That practitioners use the terms database and DBMS interchangeably is part and parcel of the general misuse and abuse of terminology reflecting poor foundation knowledge.]

The terms consistency, availability, and partitioning as used in Brewer's CAP Theorem are not the same terms as those traditionally used in relational database theory and even commercial SQL DBMS practice. For pragmatic reasons I'll qualify the former terms with "CAP" and the latter terms with "RDB", even though I find it misleading to refer to commercial SQL systems as "relational"[3,4,5,6]. Let's look at each term and its intended meaning.

View My Stats