Saturday, December 24, 2016

This Week with Season's Greetings



Data Sublanguages, Programming and Data Integrity

My December post @All Analytics

Both data science employers and candidates stress the eclectic nature of the required skills, programming in particular. Indeed, coding has acquired such an elevated role, that it now entirely replaces education. Aside from the societal destructive consequences of this trend, in the context of data management it is a regressive self-fulfilling prophecy that obscures and disregards the core practical objective of database management to minimize programming. You can frequently encounter it in comments like:
"Anything you can model in a DBMS you can model in Java. The next paradigm shift is business rules centralized in Java business objects, rather than hard-coded in SQL for better manageability, scalability, etc. The only ones that should reside in a database are referential integrity (and sometimes even that isn't really necessary). Don't let pushy DBAs tell you otherwise -- integrity constraints slow down development as well as performance."
Upside down and backwards.

Read it all (and comment there, not here, please).


THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

("What's Wrong with this Picture" will return in 2017)


1. Quote of the Week

"The value of the model may be diminishing in certain enterprises, since busy with deliverables." --Harshendu Desai, LinkedIn.com

3. To Laugh or Cry?

5 Reasons Relational Databases Hold Back Your Business

4. Added to the LINKS page

  • What a Database Really Is: Predicates and Propositions
  • The Logical Fallacies

5. Of Interest


And now for something completely different


New at The PostWest (check it out)

My take of the week

Choosing not to veto, Obama lets anti-settlement resolution pass at UN Security Council
The press refused to publish Obama's Chicago speech to the Palestinian lobby to hide his anti-semitism. He was never as troubled by Assad, or Putin, or Erdogan as he was by Netanyahu.  That's because Jews have always been a soft target (Barak Obama's Israeli Settlements Canard).  If that is not anti-semitism, I don't know what is.
When the US is in the same camp with Russia, China, Iran and Turkey and her acts are cheered by Hamas, Islamic Jihad and Hezbollah, she has sold out and moved to the dark side.

America (like most other countries) is occupied Indian land via atrocities (and not by people who returned to their own country, like the Jews did). So when America returns its settlements to the Indians, Israel will return its "settlements" (which Israelis got when they defended themselves from "being thrown into the sea"). Until then moralizing and selling out Israel to genocidal terrorists is hypocritical anti-semitism, just like everyone else's (see below).


Global Hypocritical Anti-semitism

UN

 
US

EU

Article of the week

Israel and the Occupation Myth

Video of the week
The Red Disaster. The "life" in Romania during the 60s. The Jews did the worst due to deep anti-semitism. America paid Ceausescu to get us out, but neither she nor Europe wanted us. Had there been no Israel, we would have probably starved to death, not necessarily in a rotten jail. Nobody talks about us, or the hundreds of thousands of the Jewish refugees kicked out from the Arab countries, none of whom were murderous, but everybody is obsessed with the suffering of the Palestinians, who are genocidal.

Pinch-me of the week

Ahmad Tibi urges Israelis not to ‘live by the sword’. As if she is allowed to live without it.

Book of the week (Purchase via this link to support the site)
Bard, M., MYTHS AND FACTS: A GUIDE TO THE ARAB-ISRAELI CONFLICT

Note: I will not publish or respond to anonymous comments. If you want to say something, stand behind it. Otherwise don't bother, it'll be ignored.

Friday, December 23, 2016

Data Sublanguages, Programming, and Data Integrity

My December 2016 post @All Analytics (that I did not link to from here).

Second, many years ago I wrote an article titled “Integrity is Not Only Referential” with a double-entendre: I was criticizing a DBMS vendor that was claiming, misleadingly, that its product’s application-enforced integrity was actually relational DBMS-enforced integrity and also deploring data professionals’ poor grasp of this important distinction that was letting vendors get away with it. Sadly, nothing much has changed since then.

Read it all.

Please comment there, not here. 



Tuesday, December 20, 2016

On View Updating (C. J. Date and D. McGoveran)

My recent posts on denormalization [1], identical relations [2] and the POOD [3,4,5] based on D. McGoveran (DMG) interpretation of Codd's RDM, triggered online reactions [6,7] and some comments in place that reflect the current understanding of the RDM. One of my readers referred me back to a 2004 exchange triggered by an exchange @old dbdebunk.com with both CJD and DMG on view updating--an important aspect discussed in my posts--on which the two perspectives differ. Last week I asked what's wrong with CJD's position in the exchange. Here is the original exchange, albeit in abbreviated form (I made minor revisions for clarity and added references to more recent sources the reader may want to consult, such as the 2013 CJD book on the subject [8], which also purports to describe some of DMG's more recent thinking), in which DMG counters with his position and adds a new comment. Throughout, I've changed "relvar predicate" (still used by CJD) to "relation predicate", as preferred in the DMG interpretation.

Monday, December 12, 2016

This Week

THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. What's wrong with this picture

"First of all, let me say that I no longer regard view updating as a fully solved problem. A year ago or so I thought it was--but then Hugh Darwen started to ask some hard questions and I realized I was wrong. (David McGoveran will probably disagree with me here.) That said, I remain optimistic that the problem is solvable. The discussion in my 8th Ed. is generally along the right lines, though it gets some of the details wrong." --C. J. Date

2. Quote of the Week

"SQL is the lingua franca for retrieving structured data. Existing semantics for SQL, however, either do not model crucial features of the language (e.g., relational algebra lacks bag semantics, correlated subqueries, and aggregation)." --Konstantin Weitz, homotopytypetheory.org

3. To Laugh or Cry?

Why is MongoDB wildly popular? It's a data structure thing

Monday, December 5, 2016

Prediction, Explanation and the November Surprise

My November post @All Analytics. 


Given the overhyped promise of data science, the "shock" at the broad failure to predict the election outcome was inevitable. Skimming through the media and technical accounts, it looks like a better understanding of prediction and explanation is necessary for less surprises and sounder analytics.

Read it all. (Please comment there, not here.)

Re-write

See the rewrite
Class, Type, Relation and Domain in Database Management



Monday, November 28, 2016

This Week

THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.



1. What's wrong with this picture?

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases:
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" *is* a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" *only* in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous." --Comment @dbdebunk.com

Sunday, November 20, 2016

The Principle of Orthogonal Database Design Part III

Note: This is a 11/24/17 re-write of Part III of a three-part series that replaced several previous posts (the pages of which redirect here), to to bring it in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.
(Continued from Part II)

POOD and SQL


As we have seen, if relations are uniquely constrained, with a true RDBMS supporting logical independence (LI) and constraint inheritance, database design can adhere to the POOD and enable DBMS-enforced consistency. A RDBMS can also support ESS explicitly.

Industry misconceptions notwithstanding, SQL DBMSs are, of course, not relational. They have weak declarative integrity support, which, coupled with bad database designs, makes adherence to the POOD (as well as the other design principles) difficult. While even its weak relational fidelity was sufficient to render SQL superior to what preceded it, this is but one example of the many advantages of the RDM that SQL has failed to concretize.

Monday, November 14, 2016

This Week

I have revised:
THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. Quotes of the Week
"Wow. Been using SQL for 15 years, and didn’t even know about Except and Intersect!" --Blog.Jooq.org
"I am looking for database schemas with many tables (>100 tables). Where can I find them? I am currently using mysql and haven't done serious database design. So interested in looking at samples with ER diagrams." --YCombinator.com
2. To Laugh or Cry?

Monday, November 7, 2016

The Principle of Orthogonal Database Design Part II

Note: This is a 11/24/17 re-write of Part II of a three-part series that replaced several previous posts (the pages of which redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part I)

To recall from Part I, adherence to the POOD means independent base relations (i.e., not derivable from other base relations), which the design example in Part I,
EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)

violates: EMPS is derivable via union of projections of SAL_EMPS and COMM_EMPS. It requires at least:
  • A disjunctive constraint on each of the SAL_EMPS and COMM_EMPS relations, to ensure mutual exclusivity;
  • A redundancy control constraint to prevent inconsistency due to partial updates (can you formulate it?);
  • Use of the transaction management component of the data language to ensure that each candidate tuple is properly inserted (1) into EMPS and (2) the correct subtype relation;


Monday, October 31, 2016

This Week

1. Quote of the Week

"Normalization is and will always be a direct trade-off. You give up performance (time) for space. Indexing mitigates in the opposite direction of this trade-off." --YCombinator.com

2. To Laugh or Cry?

Comments on my "Denormalization for Performance: Don't Blame the Relational Model"

3. THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS 

is available. Order here.

Monday, October 24, 2016

The Costly Illusion of Denormalization for Performance

My October post @All Analytics.

Be that as it may, practitioners insist that performance improves when they denormalize databases, because "bundling" facts into less relations reduces joins. But even if this were always true -- it is not -- performance gains, if any, do not come from denormalization per se, but from trading off integrity for performance. What many data professionals miss is that the redundancy introduced by denormalization must be controlled by the DBMS to ensure data integrity, which requires special integrity constraints that, it turns out, involve the very joins that denormalization is intended to avoid, defeating its purpose. These constraints are practically never declared and enforced, which creates the illusion that denormalization improves performance at no cost.

Read it all. (Please comment there, not here)



Sunday, October 16, 2016

THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS Available to Order


When I discussed with a book publisher the idea of a guide/reference to misconceptions about data fundamentals, whose objective -- distinct from the usual cookbooks -- is to help data professionals base their practice on understanding, rather than cookbooks, he said "they are not interested in understanding, only in succeeding in their jobs". Apparently, the former is no longer a factor in the latter. Given the increasingly deteriorating experiences I had with publishers, it was time to stop bothering with them -- they pay and do very little -- and self-publish.

THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS  - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER is now available for purchase ($35, order via the BOOKS page (not to be confused with the RECOMMENDED BOOKS page); contact me by email for volume discounts). 


Monday, October 10, 2016

This Week

1. Quote of the Week
"Legion is a Hadoop MapReduce tool that turns big, messy data sources into clean, normalized flat files ready for ingestion into relational tables in a data warehouse (e.g., Postgres COPY)." --GitHub.com
2. To Laugh or Cry?

Wednesday, September 28, 2016

Brother, Spare Me the Paradigm

My September post @All Analytics:

Every few years -- the interval is getting shorter -- data management is claimed to undergo a “paradigm shift,” some new fundamentally different new way of doing things is being promoted (e.g., “Consider dimensional design and Big Data as two additional paradigms”) that if you don't adopt you’ll be left behind (I’ve written about Big Data, for data warehouse dimensional modeling see Data Warehouses and the Logical-Physical Confusion). For the few who understand what a paradigm is and are familiar with the history of data management, the irony could not be richer. 

Read it all.  (Please comment there, not here)



Monday, September 26, 2016

This Week

1. Quote of the Week
"Which leads to another bad experience: the pernicious use of foreign keys. In the ORMs I've used, links between classes are represented in the data model as foreign keys which, if not configured carefully, result in a large number of joins when retrieving the object. (A recent count of one such table in my work resulted in over 600 attributes and 14 joins to access a single object, using the preferred query methodology.)
...
When you have foreign keys, you refer to related identities with an identifier. In your application, "identifier" takes on various meanings, but usually it's the memory location (a pointer). In the database, it's the state of the object itself. These two things don't really get along because you can really only use database identifiers in the database (the ultimate destination of the data you're working with)." --wozniak.ca

Monday, September 19, 2016

The Principle of Orthogonal Database Design Part I

Note: This is a 11/24/17 re-write of Part I of a three-part series that replaced several older posts (the pages of which which now redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.
"The principle of orthogonal design (abbreviated POOD) ... is the second of the two principles of database design, which seek to prevent databases from being too complicated or redundant, the first principle being the principle of full normalization (POFN). Simply put, it says that no two relations in a relational database should be defined in such a way that they can represent the same facts. As with database normalization, POOD serves to eliminate uncontrolled storage redundancy and expressive ambiguity, especially useful for applying updates to virtual relations (views). Although simple in concept, POOD is frequently misunderstood ... is a restatement of the requirement that a database is a minimum cover set of the relational algebra. The relational algebra allows data duplication in the relations that are the elements of the algebra. One of the efficiency requirements of a database is that there be no data duplication. This requirement is met by the minimum cover set of the relational algebra." --Wikipedia.org
Well, not quite.

Monday, September 12, 2016

This Week

1. Quote of the Week
"Data sense-making does not benefit from the relational data model. Dr. Codd’s rules for relational modeling were designed to improve efficiencies in data processing and storage, not to make data more intelligible. In fact, structuring data relationally makes the work of data sensemaking more difficult, which is why Dr. Kimball created dimensional data modeling and also why an entire industry of middleware products emerged to hide the complexities of relational models." --Stephen Few, PerceptualEdge.com

2. To Laugh or Cry?

Saturday, September 3, 2016

Relation Predicates and Identical Relations

Note: This is a 11/25/17 re-write of an earlier post, to bring it in line with the McGoveral formalization and interpretation [1] of Codd's real RDM.

Here's what's wrong with the last wrong picture I posted, namely:

Q: "Can you have 2 tables, VIEWS and DOWNLOADS, with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other."

A1:"I don't think that there is a problem, per se. From a E/R modeling point of view I don't see a problem with that, as long as they represent two semantically different entities."

A2:"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."

A3: "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

A4: "When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads." --StackOverflow.com

Business Rules and Meaning


To recall, a base relation (which can be visualized as a R-table) represents a set of facts about a group  of property-sharing objects. Its meaning is denoted by a conjunction of informal business rules that specify the individual and collective object properties required for membership in the group that the relation represents:

  • Property rules specify individual first order properties (1OP) shared by objects;
  • Object rules specify the second order properties (2OP) that arise from relationships among 1OPs;
  • Multiobject rules specify the collective third order properties (3OP) that arise from relationships among all members of a group;
  • Multigroup rules specify properties of the groups as a whole that arise from relationship among the object groups;

Note: A 'base relation' does not mean a stored relation (as in SQL), although it can and usually will be stored. Rather, it is a member of the base set of relations, the tuples of which represent axioms -- facts about the real world objects of interest recorded in the database, from which all relations meaningful to applications are derived, the tuples of which represent theorems -- facts that are logical implications of (inferences from) the axioms.

Because rules are informal, they are not "computable". They must be formalized as constraints -- predicates "understood" algorithmically by a DBMS that can be enforced by it for consistency with the rules. For each relation, the constraints that correspond to the first three types of rule comprise the relation predicate (RP). The RPs in conjunction with the constraints corresponding to the fourth type of rule comprise the database predicate (DBP).

Semantics and Constraints


Constraints are semantic -- they constitute the formal assignment of the meaning to the relation. If two relations are identically structured and constrained,

VIEWS (USER_ID, ITEM_ID)
DOWNLOADS (USER_ID,ITEM_ID)

the only discernible difference between them is in their names. But the core Information Principle (IP) of the RDM mandates that all information in a relational database be represented explicitly and in exactly one way -- as values of relation attributes defined on domains. The difference between relation names is, thus, meaningful information, the representation of which violates the IP and the RDM, for which reason it is inaccessible to the DBMS: consider the candidate tuple {v1,v2} -- it is impossible for the DBMS to know to which relation it belongs based on the relation and attribute names because it does not understand semantics! Applications/users must specify -- with little, or no help from the DBMS -- the proper relation. Not only is it prone to error, but also  because the information is implicit in the names, relational operations lose it: if you UNION the two relations, you get users that either viewed or downloaded items, or both. For the DBMS to be able to decide, RPs must be unique -- such that a candidate tuple always satisfies exactly one [2]).

One way to make the information explicit is by representing it as values of an attribute. Assuming there are only two actions, view and download,

User (USER_ID) performed action (ACTION) on item (ITEM_ID).
where ACTION is an attribute defined on the binary domain {view,download}. This yields a one-relation design with the difference incorporated as a domain constraint on the attribute.

All of this is lost, of course, on most data professionals, who lack data foundation knowledge [3
], as pointed out in two replies:
A1: The relations are semantically different, but the DBMS is kept unaware of the difference (the E/RM is irrelevant here);
A2: Renaming attributes (not fields) does not address the basic problem;

The Principle of Orthogonal Design


As to A3, the treatment given in the mentioned article to the Principle of Orthogonal Design (POOD) was incomplete, which created misunderstandings such as the one in the comment (the article authors do no longer agree on the subject
[4]). Suffice it to say here that the POOD has to do with the properties of formal systems, of which the RDM, grounded as it is on FOPL, is one.
"Formal systems have highly desirable properties when they are governed by three principles, one of which is the FOPL Principle of Orthogonality (or Axiomatic Independence): Axioms should be independent -- none should be derivable from the others. In database design terms base relations should be independent -- not derivable from the other base relations. Two relations are trivially independent if they are defined in terms of mutually exclusive domains, attributes and tuples (there are, of course, more complex ways they can be independent)." --David McGoveran
Obviously, the two relations do not violate the POOD -- the facts represented by VIEWS are not derivable from those represented by DOWNLOADS and vice-versa. The problem here is conceptual specification and logical design that "hide" information from the DBMS in the names.


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Pascal, F., The Principle of Orthogonal Database Design Part I, II, III.

[3] Pascal, F., THE DBEDBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS. 


[4] On View Updating (C. J. Date and D. McGoveran).



Friday, August 26, 2016

Data, Information, Knowledge Discovery, and Knowledge Representation

My August post @All Analytics

As any true scientist should know, science has a context of discovery and a context of validation (I wonder how many of those who call themselves “data scientists” know the concepts). In the former, experiments are usually done to discover relationships -- a theory (to be validated in the latter, in which further implications thereof can also be derived). So there is no theory/conceptual model/knowledge yet. Data or information can be represented by any database (e.g. a SQL database can represent variables and/or runs as columns and/or rows in tables), docubase and even file -- none is superior or inferior for analytical purposes.

Read it all. (Please comment there, not here)

Monday, August 22, 2016

This Week

1. What's wrong with this picture?
"Can you have 2 tables
- VIEWS
- DOWNLOADS
with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other.

"I don't think that there is a problem, per se from a E/R modelling point of view, as long as they represent two semantically different entities."
"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."
 "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

"When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads."
--Can you have 2 tables with identical structure in a good DB schema?, StackOverflow.com

Monday, August 15, 2016

Understanding Keys: Intrinsic Object Properties and Object Names

Note: This is a 12/04/17 re-write of an older post to bring it in line with the McGoveran's formalization and interpretation [1] of Codd's true RDM.

Here's what's wrong with last week's picture, namely:

Q: "My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

A: "When a primary key is chosen that is not immutable?" --Why should a primary key change, StackExchange.com

Monday, August 8, 2016

This Week

1. What's wrong with this picture?
Q: "My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

A: "When a primary key is chosen that is not immutable?" --StackExchange.com

2. Quote of the Week

"We start with a value. This may be a number, a date, or a chunk of text. Domain refers to the meaning for a value or a set of possible values. When we have a value with a consistently and widely used set of units, value domains, or applications, we call this value a data element. A data element may be a ticket number, a temperature reading, a hair color. (Some modeling approaches omit the notion of data elements or domains.)
In all the techniques and stages of data modeling, the concepts of entity and attribute are universal. An entity is the “thing” that must be manipulated as a data object. Entity represents an indivisible concept that consists of data elements. Each data element in the entity is called an attribute. Conversely, we could say that we build an entity by attributing data elements to it. You create an entity as a whole, and you delete it as a whole." --What a Concept! Is Logical Data Modeling Obsolete? -- LinkedIn.com
Note: In the preface to my PRACTICAL DATABASE FOUNDATIONS series of papers I deplored an author's (considered an expert) reliance on an industry ANSI committee as a starting point in an explanation of conceptual, logical and physical data fundamentals--there is practically 0 chance of soundness and 100% chance of confusion. So when this article started with "In the common usage established by ANSI in 1975, data modeling goes from abstract to concrete in three steps" I knew I would not read very far. Sure enough, I stopped almost immediately, after the above quote. If you don't understand why, I recommend you check out my papers.

Friday, July 29, 2016

Data Science, Coding, the Automation Paradox, and the Silicon Valley State


My July post @All Analytics.

Indeed, if what Linus Torvald is admiringly quoted as saying is true -- "Don’t ever make the mistake [of thinking] that you can design something better than what you get from ruthless massively parallel trial-and-error with a feedback cycle. That’s giving your intelligence much too much credit." -- why bother with education, science, and theory at all? Just “plug your coding skills” by getting a Microsoft or IBM “Professional Degree”, or even take a free coding course by your favorite billionaire.

Read it all. (Please comment there, not here) 

Monday, July 25, 2016

Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant

Here's what what wrong with last week's picture, namely:
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --LinkedIn.com

One of my readers once wondered why "database professionals understand uniqueness via keys, but don't seem to understand why duplicate rows should be prohibited and the consequences of breaking relational closure": 

Sunday, July 17, 2016

This Week

1. What's wrong with this picture? 
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --Question about the relational model


Sunday, July 10, 2016

Levels of Representation: Relationships, Rules, Relations and Constraints

What's Wrong with Last Week's Picture (Question about relational model )

There are relationships at both the conceptual and logical representation levels. Confusing them is bad conceptual modeling and database design.

Relationships, Rules and Relations

AT: "In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)".
A relation is a set of tuples that is a subset of the Cartesian product of the domains, i.e., it is a relationship of domains at the logical level (Chen may have been misquoted: a tuple is a set of attribute values i.e., a relationship thereof, that represents a fact about a single entity).

Thursday, July 7, 2016

NoSQL, Big Data Analytics, and the Loss of Knowledge and Reason

My June post @All Analytics:

The data management industry operates like the fashion industry. Its most persistent characteristic is migration from fad to fad. Every few years -- the number keeps getting smaller -- some "new" problem is discovered, for which the solution is so magical, that it is extended everywhere to everything, whether it is applicable or not. But many of these problems are old and fundamental and some of the “solutions” bring them back, rather than solve them. ...

Read it all. (Please comment there, not here)



Saturday, July 2, 2016

This Week

1. What's wrong with this picture?

AT: Well, I think I am a bit confused now. In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)
GE: I argue that there is essentially no difference between relationships between entity (type tables) and between an entity and its attributes. They both represent relationships between two populations of things. Something is an attribute by virtue of there being a relationship. If relationships are represented by foreign keys and the entity tables must be in 1NF, as in the relational model, then all relationships must be at most Many-to-One (a very unnecessary limitation when modeling some user domain).
TF: The relational model was a mathematical construct, derived from set theory. Hence that particular terminology. The entity-relationship model is essentially a directed graph model, where relationships are prominent residents. Not so in the relational model (despite the name), where relationships (between relations, mind you) are not visible and in the SQL implementations is reduced to constraints. Relationships are about structure, which is as important as meaning (the semantics of the terms used in the universe being modeled).
2. Quote of the Week
"In Relational Theory sometimes the relationships, where we do our Joins are much more important than the attributes on an Entity." (quoted in LinkedIn.com exchange)

Sunday, June 26, 2016

On Kinds of Keys: Natural, Primary and Surrogate Are Sufficient

Note: This is a 12/04/17 re-write of a 2012 post (revised in 2016) to bring it in line with the McGoveran formalization and interpretation [1] of Codd's RDM.

Here is what is wrong with last week's picture, namely:

"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)?" --Martijn Evers, Kinds of Keys: On the Nature of Key Classifications

I, of course, sympathize, as I've deplored the misuse and abuse of terminology in the industry for longer than I care to remember. This is not unique to keys, it's a general absence of foundation knowledge in the industry [2]. The answer is yes, we can disambiguate -- the key (pun intended) to keys is understanding what from the real world they represent and what database function they fulfill. Things are much simpler than the mindless industry jargon.

Sunday, June 19, 2016

This Week

1. What's wrong with this picture?

This week's picture is the one of the state of knowledge about keys that Martijn Evers painted in Kinds of Keys: On the Nature of Key Classifications, that I had already commented on. As a result of discussions I've been having with David McGoveran in the context of our forthcoming books (his LOGIC FOR SERIOUS DATABASE FOLKS, my DBDEBUNK GUIDE TO FUNDAMENTAL DATA MANAGEMENT MISCONCEPTIONS), I've decided to rewrite my
comments On Kinds of Keys on the subject. I refer the reader to Martijn's article for a refresher--my rewrite will be posted next week.

2. Quote of the Week

There are no rules of normalization for non-relational databases. Effectively, you start out by denormalizing everything. Which means you're designing the data organization to serve specific queries. So follow the same principle in NoSQL databases as you would for denormalizing a relational database: design your queries first, then the structure of the database is derived from the queries. --Bill Karwin, What is a good way to design a NoSQL database

Sunday, June 12, 2016

Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation

From last week:

What's wrong with this picture? (Kinds of Data Models, LinkedIn.com)

David Hay: "Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business."

Nothing raises uncertainty whether to laugh or cry better than attempts to dispel confusion which suffer from the very confusion they purport to dispel.

Sunday, June 5, 2016

This Week

1. What's wrong with this picture?
David Hay: Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business.

Nigel Higgs: ... many folks do not get the difference between the Barker entity relationship style of modeling and the relational style of modeling ... [because] the modeling conventions are very similar and the former [is always] a precursor to RDBMS design.

Clifford Heath: Any terminology for models must project three aspects of intention: (a) audience, (b) level of detail and (c) purpose. These three variables are sufficient to discriminate all the main kinds of models in use. The traditional terms of "conceptual/logical/physical" are manifestly inadequate.

Remy Fannader: Models are meant to describe sets of instances (objects or behaviors).
--Kinds of Data Models, LinkedIn.com

2. Quote of the Week

The first consideration that needs to be made when selecting a database is the characteristics of the data you are looking to leverage. If the data has a simple tabular structure, like an accounting spreadsheet, then the relational model could be adequate. Data such as geo-spatial, engineering parts, or molecular modeling, on the other hand, tends to be very complex. It may have multiple levels of nesting and the complete data model can be complicated. Such data has, in the past, been modeled into relational tables, but has not fit into that two-dimensional row-column structure naturally. --Jnan Dash, RDBMS vs. NoSQL: How do you pick?

Sunday, May 29, 2016

Multidimensional Relations, Flat Tables and Logical-Physical Confusion

Many--and I would guess, most--data professionals are clueless about the Relational Data Model (RDM) and its practical value for database practice. They confuse RDBMS with SQL DBMS and use relational jargon, but don't understand what it really means and don't care whether they do or not. As I announced, every other week (with occasional exceptions), I will debunk common and entrenched myths and misconceptions exposed by the "What's wrong with this picture?" of the preceding week.

From last week:

"Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables. This usually leads to data storage becoming redundant, and also difficult to maintain on account of addition and deletion anomalies. This is only the case if we do not normalize the data. Keys are fields or combinations of fields used to identify records." --Nigel Peck, Denormalization Summary
"Relations are multidimensional. They are not flat. They are not two dimensional. Don't let the term table mislead you." I read the above statement on the back cover of CJ Date book on relational theory "Database in depth". Can anyone help how to visualize this multidimensional nature of relations? --LinkedIn.com

Wednesday, May 25, 2016

Why Data Scientists Must Understand Normalization

My May post @All Analytics:

We are constantly told how data scientists must be “jacks of many skills”, but one of the most important is rarely included in the list.Very few databases are properly designed. Many SQL databases are denormalized inadvertently, or intentionally (and erroneously) "for performance". They require special constraints to control data redundancy and prevent inconsistencies, which are practically never enforced. Analysts cannot, therefore, take database consistency for granted. Furthermore, to issue sensible queries and ensure correct results and interpretation thereof, it’s not enough for analysts to know the types of fact represented in the database, but also whether and how the database designer has chosen to bundle -- nest or merge -- those facts and how to disentangle them for analysis.

Read it all. (Please comment there, not here)



 



Sunday, May 22, 2016

This Week (& a Change in Format)

I am changing the format of the posts on DBDebunk.

"This Week" posts every other week will continue to be followed by posts on data and relational fundamentals, but with a twist: each of the latter posts will explain the preceding week's "What's wrong with this picture?"Here's the first post in the new format

1. What's wrong with this picture?

"Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables. This usually leads to data storage becoming redundant, and also difficult to maintain on account of addition and deletion anomalies. This is only the case if we do not normalize the data. Keys are fields or combinations of fields used to identify records." --Nigel Peck, Denormalization Summary

Sunday, May 15, 2016

Weak Entities, Referential Constraints and the Conceptual-Logical Conflation

Note: This is a rewrite of a 09/07/14 post.

A LinkedIn exchange initiated by the question What is a weak entity? diversified into various aspects of data modeling and database design, some of which was contaminated by the conceptual-logical conflation.

"Peter Chen (E/R Modeling, 1976) used the term "weak" entity to describe one which could not meaningfully have an independent existence. An example might be an Order which requires a customer (and a Product or set of products). A weak entity need not be a composite, as in your OrderItem example. The central issue here is dependency of one entity [of one] type on another [of another type]. Furthermore, there could be more that one such dependency. In your example, the OrderItem would be dependent on BOTH Order and Item.

It is important to note that this dependency will NOT be enforced by a simple RDBMS unless you also define referential integrity on both parts of the composite key, notwithstanding that some RDBMS's (e.g., SQL Server) couple the definition of the relationship with a referential integrity constraint. In other words, you cannot have a defined relationship without enforcing referential integrity.

This is not always desirable. Consider the task of bulk loading some records which have a relationship with some other entity type(s). The only practical way to do this is to first turn off referential integrity enforcement which means deleting the definition of the relationship entirely. After completing the loading of data, if you reinstate the defined relationship, will the system automatically ensure that referential integrity is not violated? Good question."
--Gordon Everest
Relationships in the E/RM are at the conceptual level. Dependencies of entities of one class/type on those of another are one kind of relationships defined as business rules in a conceptual model. Referential constraints are their formal representation at the logical level of a relational database. So it's not that "some" SQL DBMS's (true RDBMS's do not exist) "couple the definition of the relationship with a referential integrity constraint", the constraints are precisely how a RDBMS (what exactly is a "simple RDBMS"?) enforces business rules: a defined relationship is enforced as a referential constraint in a relational database. How does Everest propose to enforce business rules without constraints?

There are no circumstances in which FK constraints "are not desirable" and deactivating them is not the the solution to bulk loads, staging tables are. For example:

"In Oracle you can instruct the DBMS to log violating rows into an 'exceptions' table while enabling a constraint (be it, a primary key, unique key, foreign key, or plain check constraint). This fits the bill better, particularly if it's possible to INSERT...SELECT them after corrections." --Toon Koppelaars
"SQL Server will check on re-enabling the constraint that it is not violated by the new state of the database. If it is, it only reports that an error occurred. It has a real time violation catalog to easily map a constraint name to a specific violation and transaction id. In other words, doing bulk batch constraint validation is part of the product." --William Sisson and Racim Boudjakdji
Much of the criticism of the RDM is rooted in ignorance of data and relational fundamentals and the history of database management.
"How to implement“existence dependence” ... between two entities ...is a design decision and is dependent upon the architecture of the implementation technology. [Chen's E/R models] ... were completely technology agnostic and had no “relational” flavor at all. Why? With the rapid emergence and popularity of the NoSQL, NewSQL, Graph Databases, etc. no longer can relational implementation and FK be assumed and we again need to separate the “conceptual data model” from the “logical data model”. Whether this is evolution of de-evolution is a whole other subject (for those of us that remember WHY relational technology emerged in the 1980s--and as a warning to those who would forget their history--IDMS, IMS, etc.)"

The relational model has been so dominant for so long, we have forgotten our roots and the reason that there were originally three layers of data model, not just two. The rise of the post-relational technologies has again required their creation and reminds us of their utility.
--David Tryon
Data model agnosticism has nothing to do with the “emergence/popularity of non-relational technologies” (NoSQL, NewSQL did not exist when Chen introduced the E/RM), it is  is inherent in conceptual modeling--which is what E/RM is used for--by definition. Conceptual models are expressed in real world terms--facts about classes of property-sharing entities. To be represented in a database, a conceptual model must be formalized as a logical model expressed in database terms. That's when a data model comes into play, to provide abstract data structure, integrity and manipulation (relation, constraints and relational algebra in the relational case). The conceptual and  logical were always separate, but as Everest comments show, levels of representation are constantly confused. 

Incidentally
  • A logical model is implemented in specific hardware and DBMS software, a conceptual model is formalized as a logical model by means of a data model (e.g., the RDM). Using 'implementation' for both induces and reinforces the confusion.
  • Emergence, or popularity are not grounds for technology adoption: a superior generality-to-simplicity ratio (G2SR), soundness and flexibility are. 
If only there were relational dominance. Relational technology was treated by the industry as just any other "hot" fad and was never understood and implemented.  Instead we got SQL DBMS's, whose even poor relational fidelity proved superior to the hierarchic and network products that they replaced, but did not confer all the advantages of true RDBMS's.

Do away with the RDM and constraints and you're back to relying on application developers for database functions, particularly integrity enforcement and optimization, without guaranteed logical and semantic correctness--clearly de-evolution.