Saturday, December 24, 2016

This Week with Season's Greetings



Data Sublanguages, Programming and Data Integrity

My December post @All Analytics

Both data science employers and candidates stress the eclectic nature of the required skills, programming in particular. Indeed, coding has acquired such an elevated role that it now entirely replaces education. Aside from the destructive societal consequences of this trend, in the context of data management it is a regressive, self-fulfilling prophecy that obscures and disregards the core practical objective of database management: to minimize programming. It frequently surfaces in comments like:
"Anything you can model in a DBMS you can model in Java. The next paradigm shift is business rules centralized in Java business objects, rather than hard-coded in SQL for better manageability, scalability, etc. The only ones that should reside in a database are referential integrity (and sometimes even that isn't really necessary). Don't let pushy DBAs tell you otherwise -- integrity constraints slow down development as well as performance."
Upside down and backwards.

Read it all (and comment there, not here, please).


THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

("What's Wrong with this Picture" will return in 2017)


1. Quote of the Week

"The value of the model may be diminishing in certain enterprises, since busy with deliverables." --Harshendu Desai, LinkedIn.com

2. To Laugh or Cry?

5 Reasons Relational Databases Hold Back Your Business

3. Added to the LINKS page

  • What a Database Really Is: Predicates and Propositions
  • The Logical Fallacies

4. Of Interest


And now for something completely different


New at The PostWest (check it out)

My take of the week

Choosing not to veto, Obama lets anti-settlement resolution pass at UN Security Council
The press refused to publish Obama's Chicago speech to the Palestinian lobby to hide his anti-semitism. He was never as troubled by Assad, or Putin, or Erdogan as he was by Netanyahu.  That's because Jews have always been a soft target (Barak Obama's Israeli Settlements Canard).  If that is not anti-semitism, I don't know what is.
When the US is in the same camp with Russia, China, Iran and Turkey and her acts are cheered by Hamas, Islamic Jihad and Hezbollah, she has sold out and moved to the dark side.

America (like most other countries) is occupied Indian land via atrocities (and not by people who returned to their own country, like the Jews did). So when America returns its settlements to the Indians, Israel will return its "settlements" (which Israelis got when they defended themselves from "being thrown into the sea"). Until then moralizing and selling out Israel to genocidal terrorists is hypocritical anti-semitism, just like everyone else's (see below).


Global Hypocritical Anti-semitism

UN

 
US

EU

Article of the week

Israel and the Occupation Myth

Video of the week
The Red Disaster. The "life" in Romania during the 60s. The Jews fared the worst due to deep anti-semitism. America paid Ceausescu to get us out, but neither she nor Europe wanted us. Had there been no Israel, we would have probably starved to death, if not in a rotten jail. Nobody talks about us, or the hundreds of thousands of Jewish refugees kicked out of the Arab countries, none of whom were murderous, but everybody is obsessed with the suffering of the Palestinians, who are genocidal.

Pinch-me of the week

Ahmad Tibi urges Israelis not to ‘live by the sword’. As if she is allowed to live without it.

Book of the week (Purchase via this link to support the site)
Bard, M., MYTHS AND FACTS: A GUIDE TO THE ARAB-ISRAELI CONFLICT

Note: I will not publish or respond to anonymous comments. If you want to say something, stand behind it. Otherwise don't bother, it'll be ignored.

Friday, December 23, 2016

Data Sublanguages, Programming, and Data Integrity

My December 2016 post @All Analytics (that I did not link to from here).

Second, many years ago I wrote an article titled “Integrity is Not Only Referential” with a double-entendre: I was criticizing a DBMS vendor that was claiming, misleadingly, that its product’s application-enforced integrity was actually relational DBMS-enforced integrity and also deploring data professionals’ poor grasp of this important distinction that was letting vendors get away with it. Sadly, nothing much has changed since then.

Read it all.

Please comment there, not here. 



Tuesday, December 20, 2016

On View Updating (C. J. Date and D. McGoveran)

My recent posts on denormalization [1], identical relations [2] and the POOD [3,4,5], based on D. McGoveran's (DMG) interpretation of Codd's RDM, triggered online reactions [6,7] and some comments here that reflect the current understanding of the RDM. One of my readers referred me back to a 2004 exchange @old dbdebunk.com with both CJD and DMG on view updating--an important aspect discussed in my posts--on which the two perspectives differ. Last week I asked what's wrong with CJD's position in that exchange. Here is the original exchange in abbreviated form (I made minor revisions for clarity and added references to more recent sources the reader may want to consult, such as the 2013 CJD book on the subject [8], which also purports to describe some of DMG's more recent thinking), in which DMG counters with his position and adds a new comment. Throughout, I've changed "relvar predicate" (still used by CJD) to "relation predicate", as preferred in the DMG interpretation.

Monday, December 12, 2016

This Week

THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. What's wrong with this picture

"First of all, let me say that I no longer regard view updating as a fully solved problem. A year ago or so I thought it was--but then Hugh Darwen started to ask some hard questions and I realized I was wrong. (David McGoveran will probably disagree with me here.) That said, I remain optimistic that the problem is solvable. The discussion in my 8th Ed. is generally along the right lines, though it gets some of the details wrong." --C. J. Date

2. Quote of the Week

"SQL is the lingua franca for retrieving structured data. Existing semantics for SQL, however, either do not model crucial features of the language (e.g., relational algebra lacks bag semantics, correlated subqueries, and aggregation)." --Konstantin Weitz, homotopytypetheory.org

3. To Laugh or Cry?

Why is MongoDB wildly popular? It's a data structure thing

Monday, December 5, 2016

Prediction, Explanation and the November Surprise

Given the overhyped promise of data science, the "shock" at the broad failure to predict the election outcome was inevitable. Skimming through the media and technical accounts, it looks like a better understanding of prediction and explanation is necessary for fewer surprises and sounder analytics.

My November post @All Analytics. Read it all.

Domain vs. (Data) Type, Class vs. Relation (UPDATE)

Revised 12/8/16

What's wrong with last week's picture, namely:
"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" is a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" only in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous."
With respect to 1, well, yes, they are distinct, but not for the stated reason. With respect to 2, well, no insofar as "programming parlance" goes. The terminology introduced by Codd was explicitly intended to distinguish the formal concepts of set theory and first order predicate logic from the terminology used in programming practice.

Monday, November 28, 2016

This Week

THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.



1. What's wrong with this picture?

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases:
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" *is* a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" *only* in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous." --Comment @dbdebunk.com

Sunday, November 20, 2016

The Principle of Orthogonal Database Design Part III

(Continued from Part II)


As we have seen, a true RDBMS that supports constraint inheritance can--if database design adheres to the POOD--directly and transparently support entity supertype-subtype (ESS) relationships by
  • Generating the appropriate views;
  • Generating and enforcing disjunctive constraints on independent base relations;
  • Properly propagating view updates to the underlying base relations.

SQL and ESS Support

Common misconceptions notwithstanding, for many reasons SQL DBMSs are not truly relational.
There is little integrity support beyond shorthands for key and referential (foreign key) constraints, let alone constraint inheritance. In fact, even though disjunctive constraints are syntactically similar to referential constraints, users cannot define them (at least not declaratively), because SQL DBMSs have not implemented the ANSI/ISO standard SQL ASSERTION [1,2]. Consequently, and because designs are not guaranteed to adhere to the POOD, multi-relation views are not updatable. So even when database design adheres to the POOD, desirable as it is, ESS support by SQL DBMSs is not possible.
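To make the point concrete, here is a minimal sketch (using SQLite and hypothetical SAL_EMPS/COMM_EMPS tables in the spirit of this series' employees example; EMP# is spelled EMP_NO for SQL validity). Since no SQL DBMS implements the standard CREATE ASSERTION, the disjunctive constraint--no employee is both salaried and commissioned--can only be checked "by hand" with a query a true RDBMS would enforce declaratively:

```python
# Sketch: the check a standard SQL ASSERTION would enforce declaratively,
# run manually because SQL DBMSs do not support CREATE ASSERTION.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SAL_EMPS  (EMP_NO INTEGER PRIMARY KEY, SALARY INTEGER);
    CREATE TABLE COMM_EMPS (EMP_NO INTEGER PRIMARY KEY, COMMISSION INTEGER);
    INSERT INTO SAL_EMPS  VALUES (1, 50000), (2, 60000);
    INSERT INTO COMM_EMPS VALUES (2, 7000);  -- violates disjointness
""")

# Disjunctive constraint: SAL_EMPS and COMM_EMPS must not share employees.
violations = conn.execute("""
    SELECT EMP_NO FROM SAL_EMPS
    INTERSECT
    SELECT EMP_NO FROM COMM_EMPS
""".replace("EMP_NO FROM COMM", "EMP_NO FROM COMM")).fetchall() if False else conn.execute("""
    SELECT EMP_NO FROM SAL_EMPS
    INTERSECT
    SELECT EMP_NO FROM COMM_EMPS
""").fetchall()

print(violations)  # non-empty result means the constraint is violated
```

A declarative equivalent would be a one-line ASSERTION in the schema; instead, such checks end up in application code, if they are written at all.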

Monday, November 14, 2016

This Week

I have revised:
THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. Quotes of the Week
"Wow. Been using SQL for 15 years, and didn’t even know about Except and Intersect!" --Blog.Jooq.org
"I am looking for database schemas with many tables (>100 tables). Where can I find them? I am currently using mysql and haven't done serious database design. So interested in looking at samples with ER diagrams." --YCombinator.com
2. To Laugh or Cry?

Monday, November 7, 2016

The Principle of Orthogonal Database Design Part II

Revised: 11/11/16

(Continued from Part I)

To recall, adherence to the POOD means independent base relations--no base facts are derivable from other base facts--that have unique relation predicates (RP), i.e., relations are uniquely constrained.

Going back to the employees example in Part 1, because salary and commission are mutually exclusive properties of employees, salaried and commissioned employees are subtypes of the employee supertype. Entity subtypes inherit the properties and constraints of the supertype.

Entity Supertype-subtype Relationships

The three base relations
EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)
represent the entity supertype and two subtypes. This design avoids NULLs, but note that aside from redundancy, the relations are not independent--supertype facts are derivable from subtype facts--in violation of the POOD.
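The derivability is easy to demonstrate. In this sketch (SQLite, with hypothetical sample data; EMP# spelled EMP_NO for SQL validity), every EMPS fact is recoverable as the union of the subtype relations projected on the supertype attributes:

```python
# Sketch of the POOD-violating design above: supertype facts in EMPS are
# derivable from the subtype relations SAL_EMPS and COMM_EMPS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPS      (EMP_NO INTEGER PRIMARY KEY, ENAME TEXT, HIREDATE TEXT);
    CREATE TABLE SAL_EMPS  (EMP_NO INTEGER PRIMARY KEY, ENAME TEXT, HIREDATE TEXT, SALARY INTEGER);
    CREATE TABLE COMM_EMPS (EMP_NO INTEGER PRIMARY KEY, ENAME TEXT, HIREDATE TEXT, COMMISSION INTEGER);
    INSERT INTO EMPS      VALUES (1,'Smith','2015-01-01'), (2,'Jones','2015-02-01');
    INSERT INTO SAL_EMPS  VALUES (1,'Smith','2015-01-01', 50000);
    INSERT INTO COMM_EMPS VALUES (2,'Jones','2015-02-01', 7000);
""")

# Union of the subtype projections on the supertype attributes...
derived = conn.execute("""
    SELECT EMP_NO, ENAME, HIREDATE FROM SAL_EMPS
    UNION
    SELECT EMP_NO, ENAME, HIREDATE FROM COMM_EMPS
    ORDER BY EMP_NO
""").fetchall()
# ...yields exactly the base relation EMPS: base facts derived from base facts.
base = conn.execute("SELECT EMP_NO, ENAME, HIREDATE FROM EMPS ORDER BY EMP_NO").fetchall()

print(derived == base)
```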

Monday, October 31, 2016

This Week

1. Quote of the Week

"Normalization is and will always be a direct trade-off. You give up performance (time) for space. Indexing mitigates in the opposite direction of this trade-off." --YCombinator.com

2. To Laugh or Cry?

Comments on my "Denormalization for Performance: Don't Blame the Relational Model"

3. THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS 

is available. Order here.

Monday, October 24, 2016

The Costly Illusion of Denormalization for Performance

My October post @All Analytics.

Be that as it may, practitioners insist that performance improves when they denormalize databases, because "bundling" facts into fewer relations reduces joins. But even if this were always true -- it is not -- performance gains, if any, do not come from denormalization per se, but from trading off integrity for performance. What many data professionals miss is that the redundancy introduced by denormalization must be controlled by the DBMS to ensure data integrity, which requires special integrity constraints that, it turns out, involve the very joins that denormalization is intended to avoid, defeating its purpose. These constraints are practically never declared and enforced, which creates the illusion that denormalization improves performance at no cost.
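A minimal sketch of the point (SQLite, hypothetical ORDERS_D table): once customer facts are "bundled" into the orders table, the customer's city is repeated per order, and the integrity check that the repetitions stay consistent requires exactly the cross-row grouping/matching that the avoided join would have made unnecessary:

```python
# Sketch: denormalized ORDERS_D repeats each customer's city; the integrity
# constraint that controls this redundancy must compare rows across the
# table -- the very work denormalization was supposed to avoid.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDERS_D (ORDER_NO INTEGER PRIMARY KEY,
                           CUST_NO  INTEGER,
                           CITY     TEXT);  -- redundant per customer
    INSERT INTO ORDERS_D VALUES (1, 10, 'London'),
                                (2, 10, 'Paris'),  -- inconsistent redundancy
                                (3, 20, 'Rome');
""")

# The (practically never declared) constraint: each customer has one city.
inconsistent = conn.execute("""
    SELECT CUST_NO FROM ORDERS_D
    GROUP BY CUST_NO
    HAVING COUNT(DISTINCT CITY) > 1
""").fetchall()

print(inconsistent)  # customer 10 has two cities: redundancy uncontrolled
```

In the normalized design, CITY would be stated once per customer in a CUSTOMERS relation, and no such cross-row constraint would be needed.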

Read it all.

Please comment there, not here.



Sunday, October 16, 2016

THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS Available to Order


When I discussed with a book publisher the idea of a guide/reference to misconceptions about data fundamentals, whose objective -- distinct from the usual cookbooks -- is to help data professionals base their practice on understanding rather than recipes, he said "they are not interested in understanding, only in succeeding in their jobs". Apparently, the former is no longer a factor in the latter. Given my increasingly poor experiences with publishers, it was time to stop bothering with them -- they pay and do very little -- and self-publish.

THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS  - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER is now available for purchase ($35, order via the BOOKS page (not to be confused with the RECOMMENDED BOOKS page); contact me by email for volume discounts). 


Monday, October 10, 2016

This Week

1. Quote of the Week
"Legion is a Hadoop MapReduce tool that turns big, messy data sources into clean, normalized flat files ready for ingestion into relational tables in a data warehouse (e.g., Postgres COPY)." --GitHub.com
2. To Laugh or Cry?

Wednesday, September 28, 2016

Brother, Spare Me the Paradigm

My September post @All Analytics:

Every few years -- the interval is getting shorter -- data management is claimed to undergo a “paradigm shift”: some fundamentally different new way of doing things is promoted (e.g., “Consider dimensional design and Big Data as two additional paradigms”) that, if you don't adopt it, you’ll be left behind (I’ve written about Big Data; for data warehouse dimensional modeling see Data Warehouses and the Logical-Physical Confusion). For the few who understand what a paradigm is and are familiar with the history of data management, the irony could not be richer.

Read it all (and please comment there, not here). 



Monday, September 26, 2016

This Week

1. Quote of the Week
"Which leads to another bad experience: the pernicious use of foreign keys. In the ORMs I've used, links between classes are represented in the data model as foreign keys which, if not configured carefully, result in a large number of joins when retrieving the object. (A recent count of one such table in my work resulted in over 600 attributes and 14 joins to access a single object, using the preferred query methodology.)
...
When you have foreign keys, you refer to related identities with an identifier. In your application, "identifier" takes on various meanings, but usually it's the memory location (a pointer). In the database, it's the state of the object itself. These two things don't really get along because you can really only use database identifiers in the database (the ultimate destination of the data you're working with)." --wozniak.ca

Monday, September 19, 2016

The Principle of Orthogonal Database Design Part I (UPDATED)

REVISED: 11/2/16; 11/10/16; 03/13/17
 


Note: This two-part post replaces several previous posts which now redirect here. The example is not meant to be realistic, but to convey specific points in an accessible way.
"The principle of orthogonal design (abbreviated POOD) ... is the second of the two principles of database design, which seek to prevent databases from being too complicated or redundant, the first principle being the principle of full normalization (POFN). Simply put, it says that no two relations in a relational database should be defined in such a way that they can represent the same facts. As with database normalization, POOD serves to eliminate uncontrolled storage redundancy and expressive ambiguity, especially useful for applying updates to virtual relations (views). Although simple in concept, POOD is frequently misunderstood ... is a restatement of the requirement that a database is a minimum cover set of the relational algebra. The relational algebra allows data duplication in the relations that are the elements of the algebra. One of the efficiency requirements of a database is that there be no data duplication. This requirement is met by the minimum cover set of the relational algebra." --Wikipedia.org
Well, not quite.

Monday, September 12, 2016

This Week

1. Quote of the Week
"Data sense-making does not benefit from the relational data model. Dr. Codd’s rules for relational modeling were designed to improve efficiencies in data processing and storage, not to make data more intelligible. In fact, structuring data relationally makes the work of data sensemaking more difficult, which is why Dr. Kimball created dimensional data modeling and also why an entire industry of middleware products emerged to hide the complexities of relational models." --Stephen Few, PerceptualEdge.com

2. To Laugh or Cry?

Saturday, September 3, 2016

Relation Predicates and Identical Relations

I have rewritten an earlier post that you should also read:
Database Design: Property-Sharing Classes and Relations
Here's what's wrong with the last wrong picture I posted, namely:

Q: "Can you have 2 tables
  • VIEWS
  • DOWNLOADS
with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other."
A1:"I don't think that there is a problem, per se. From a E/R modeling point of view I don't see a problem with that, as long as they represent two semantically different entities."
A2:"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."
A3: "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."
A4: "When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads."
--Can you have 2 identical tables in a good schema?, StackOverflow.com

Friday, August 26, 2016

Data, Information, Knowledge Discovery, and Knowledge Representation

 My August post @All Analytics

As any true scientist should know, science has a context of discovery and a context of validation (I wonder how many of those who call themselves “data scientists” know the concepts). In the former, experiments are usually done to discover relationships -- a theory (to be validated in the latter, in which further implications thereof can also be derived). So there is no theory/conceptual model/knowledge yet. Data or information can be represented by any database (e.g. a SQL database can represent variables and/or runs as columns and/or rows in tables), docubase and even file -- none is superior or inferior for analytical purposes.

Read it all.

Please comment there, not here.

Monday, August 22, 2016

This Week

1. What's wrong with this picture?
"Can you have 2 tables
- VIEWS
- DOWNLOADS
with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other.

"I don't think that there is a problem, per se from a E/R modelling point of view, as long as they represent two semantically different entities."
"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."
 "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

"When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads."
--Can you have 2 tables with identical structure in a good DB schema?, StackOverflow.com

Monday, August 15, 2016

Understanding Keys: Entity-Defining Properties and Entity Names

Here's what's wrong with last week's picture, namely:
"My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

"When a primary key is chosen that is not immutable?"
--Why should a primary key change, StackExchange.com
This is a more detailed treatment of keys, undertaken in:
[1] On Kinds of Keys: Natural, Primary and Surrogate Are Sufficient
[2] Surrogate Key Illusions
[3] Duplicates: Stating the Same Fact More Than Once Doesn't Make It Truer, Only Redundant
which I have just revised. It is the result of several discussions I had with David McGoveran as part of my familiarization with his formal interpretation of Codd's relational data model (RDM), which he'll introduce in his forthcoming book LOGIC FOR SERIOUS DATABASE FOLKS. My own forthcoming book--THE DBDEBUNK GUIDE TO FUNDAMENTAL DATA MANAGEMENT MISCONCEPTIONS--incorporates some of his thoughts in an informal form accessible to the thinking data professional. As you shall see, it diverges in some important ways from the current understanding of the RDM (well, whatever little understanding there is).

Monday, August 8, 2016

This Week

1. What's wrong with this picture?
Q: "My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

A: "When a primary key is chosen that is not immutable?" --StackExchange.com

2. Quote of the Week

"We start with a value. This may be a number, a date, or a chunk of text. Domain refers to the meaning for a value or a set of possible values. When we have a value with a consistently and widely used set of units, value domains, or applications, we call this value a data element. A data element may be a ticket number, a temperature reading, a hair color. (Some modeling approaches omit the notion of data elements or domains.)
In all the techniques and stages of data modeling, the concepts of entity and attribute are universal. An entity is the “thing” that must be manipulated as a data object. Entity represents an indivisible concept that consists of data elements. Each data element in the entity is called an attribute. Conversely, we could say that we build an entity by attributing data elements to it. You create an entity as a whole, and you delete it as a whole." --What a Concept! Is Logical Data Modeling Obsolete? -- LinkedIn.com
Note: In the preface to my PRACTICAL DATABASE FOUNDATIONS series of papers I deplored the reliance of an author (considered an expert) on an industry ANSI committee as the starting point for an explanation of conceptual, logical and physical data fundamentals--there is practically 0 chance of soundness and 100% chance of confusion. So when this article started with "In the common usage established by ANSI in 1975, data modeling goes from abstract to concrete in three steps" I knew I would not read very far. Sure enough, I stopped almost immediately, after the above quote. If you don't understand why, I recommend you check out my papers.

Friday, July 29, 2016

Data Science, Coding, the Automation Paradox, and the Silicon Valley State


My July post at my All Analytics blog.

Indeed, if what Linus Torvalds is admiringly quoted as saying is true -- "Don’t ever make the mistake [of thinking] that you can design something better than what you get from ruthless massively parallel trial-and-error with a feedback cycle. That’s giving your intelligence much too much credit." -- why bother with education, science, and theory at all? Just “plug your coding skills” by getting a Microsoft or IBM “Professional Degree”, or even take a free coding course by your favorite billionaire.

Please comment there, not here. Thanks.

Read it all.

Monday, July 25, 2016

Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant

Here's what was wrong with last week's picture, namely:
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --LinkedIn.com

One of my readers once wondered why "database professionals understand uniqueness via keys, but don't seem to understand why duplicate rows should be prohibited and the consequences of breaking relational closure": 

Sunday, July 17, 2016

This Week

1. What's wrong with this picture? 
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --Question about the relational model


Sunday, July 10, 2016

Levels of Representation: Relationships, Rules, Relations and Constraints

What's Wrong with Last Week's Picture (Question about relational model )

There are relationships at both the conceptual and logical representation levels. Confusing them is bad conceptual modeling and database design.

Relationships, Rules and Relations

AT: "In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)".
A relation is a set of tuples that is a subset of the Cartesian product of the domains, i.e., it is a relationship of domains at the logical level (Chen may have been misquoted: a tuple is a set of attribute values i.e., a relationship thereof, that represents a fact about a single entity).
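The definition can be sketched in a few lines (Python sets standing in for domains; the sample values are, of course, hypothetical): a relation is a set of tuples drawn from, i.e., a subset of, the Cartesian product of its domains.

```python
# Sketch: a relation as a subset of the Cartesian product of its domains.
from itertools import product

EMP_NO = {1, 2, 3}           # domain of employee numbers
ENAME  = {"Smith", "Jones"}  # domain of employee names

# Every tuple the domains permit...
cartesian = set(product(EMP_NO, ENAME))
# ...versus the tuples representing facts that actually hold.
relation = {(1, "Smith"), (2, "Jones")}

print(relation <= cartesian)  # True: the relation is a subset of the product
```

Each tuple pairs one value from each domain, i.e., it is a relationship of attribute values representing a fact about a single entity.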

Thursday, July 7, 2016

NoSQL, Big Data Analytics, and the Loss of Knowledge and Reason

My June post @All Analytics:

The data management industry operates like the fashion industry. Its most persistent characteristic is migration from fad to fad. Every few years -- the number keeps getting smaller -- some "new" problem is discovered, for which the solution is so magical, that it is extended everywhere to everything, whether it is applicable or not. But many of these problems are old and fundamental and some of the “solutions” bring them back, rather than solve them. ...

Read it all.

Please comment there, not here.





Saturday, July 2, 2016

This Week

1. What's wrong with this picture?

AT: Well, I think I am a bit confused now. In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)
GE: I argue that there is essentially no difference between relationships between entity (type tables) and between an entity and its attributes. They both represent relationships between two populations of things. Something is an attribute by virtue of there being a relationship. If relationships are represented by foreign keys and the entity tables must be in 1NF, as in the relational model, then all relationships must be at most Many-to-One (a very unnecessary limitation when modeling some user domain).
TF: The relational model was a mathematical construct, derived from set theory. Hence that particular terminology. The entity-relationship model is essentially a directed graph model, where relationships are prominent residents. Not so in the relational model (despite the name), where relationships (between relations, mind you) are not visible and in the SQL implementations are reduced to constraints. Relationships are about structure, which is as important as meaning (the semantics of the terms used in the universe being modeled).
2. Quote of the Week
"In Relational Theory sometimes the relationships, where we do our Joins are much more important than the attributes on an Entity." (quoted in LinkedIn.com exchange)

Sunday, June 26, 2016

On Kinds of Keys: Natural, Primary and Surrogate Are Sufficient

Revised 7/26/16 (see the Understanding Keys post of 7/31/16 for a more in-depth discussion).

Here is what is wrong with last week's picture painted by Martijn Evers in Kinds of Keys: On the Nature of Key Classifications.

"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them they talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)?"
I, of course, sympathize, as I've deplored the misuse and abuse of terminology in the industry for longer than I care to remember. This is not unique to keys; it reflects a general absence of foundation knowledge. The answer is yes, we can disambiguate--the key to keys is understanding what in the real world they represent and what database function they fulfill. Things are much simpler than the mindless industry jargon implies.

A key is a set of one or more relation attributes that uniquely identifies tuples, which represent facts about entities in the database. A uniqueness (key) constraint on the relation for every key enforces entity integrity (i.e., no duplicate representation of facts about any entity).
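A minimal sketch of how a uniqueness constraint enforces entity integrity, using Python's built-in sqlite3 (table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,   -- a key designated as PK
        ssn    TEXT NOT NULL UNIQUE   -- uniqueness constraint on another key
    )
""")
conn.execute("INSERT INTO employee VALUES (1, '123-45-6789')")

# A second tuple with the same SSN would represent the same entity
# twice; the uniqueness constraint rejects it.
try:
    conn.execute("INSERT INTO employee VALUES (2, '123-45-6789')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The DBMS, not application code, guarantees that no entity is represented by more than one tuple, which is exactly the point of declaring the constraint in the database.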

Primary Keys

"Of all kinds of key, the primary key and the surrogate key gained the most discussion.If we take a look at the relational model we only see candidate keys a minimal set of one or more attributes that are unique for each tuple in a relation--no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."
Codd mandated a primary key (PK) for every relation with multiple keys, mainly for referential purposes (i.e., as the target for references by other relations). Obviously, a single key is the PK, but when there are multiple keys, they are candidates and one is selected as PK. The choice is, indeed, not formal (which is why Date downgraded Codd's mandate to just "a good idea"; I would not advise skipping it), but it is not arbitrary either: for practical reasons, stability and simplicity are desirable PK qualities.

Entity Properties and Natural Keys

Natural keys (NK) represent in the database entity-defining identifiers (EDI): a set of one or more observable/assertable properties inherent to all entities that defines and uniquely identifies each (e.g., DNA for people, ownership and a set of books for companies). As such, they are meaningful to users and are used for logical data access. EDI's and, therefore, NK's are immutable by definition--they change only if the entities themselves change (e.g., companies involved in mergers, acquisitions, spinoffs)--but tend to be complex (i.e., combinations of attributes), or have complex representations (e.g., DNA). They don't make practical PK's.

Entity Names and Surrogate Keys

When no stable and simple EDI can be identified or explicitly specified in the real world, entities are usually assigned a unique name to substitute for the complex or implicit EDI (e.g., SSN, S/N, patient id). Entity names are represented in the database by surrogate keys (SK), on which all other attributes and keys are functionally dependent.

SK's have nothing to do with SQL and NULL's--SQL tables with NULL's are not relations and, therefore, outside the RDM, where all bets are off.
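The NK/SK division of labor can be sketched with sqlite3. The table, the person_id surrogate and the (full_name, full_address) natural key are all hypothetical illustrations, not a recommendation for identifying people:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The NK (full name + full address) is complex and unstable, so a
# simple, stable surrogate is added for referential purposes.
conn.execute("""
    CREATE TABLE person (
        person_id    INTEGER PRIMARY KEY,    -- surrogate key (SK), the PK
        full_name    TEXT NOT NULL,
        full_address TEXT NOT NULL,
        UNIQUE (full_name, full_address)     -- natural key (NK)
    )
""")
conn.execute(
    "INSERT INTO person (full_name, full_address) VALUES (?, ?)",
    ("Jane Doe", "1 Main St, Springfield"),
)
# Other relations reference the simple surrogate, while users access
# data meaningfully via the NK.
pid = conn.execute("SELECT person_id FROM person").fetchone()[0]
```

Note that the NK's uniqueness constraint is still declared: the SK substitutes for the NK in references, it does not replace the NK's entity-integrity role.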

"If we treat surrogate keys as an arbitrary candidate key, we actually decrease the normalization of a data model from, say, 3NF to 2NF. This has led to the special and strict treatment we have for surrogate keys as to prevent this from leading to unwanted issues. Surrogate keys should be seen as not an extra key but as a surrogate for an existing key whose semantics and behavior is strictly regulated by the DBMS system."
As their name implies, SK's are not keys--they do not represent real world properties--but key substitutes that are added to the other keys, if any. Since all attributes are functionally dependent on a SK by definition, a surrogate PK does not decrease normalization.
"Business Keys on the other hand are independent keys that are not designated as a surrogate key whose familiarity, uniqueness, visibility and scope span an entire enterprise. Human keys for example are actually not keys in the strict sense, but are non-enforceable candidate keys that heuristically identify tuples, especially those related to real world entities. They are related to the process of relating (UoD) entities to tuples. E.g. using full name and full address to identify and relate a real world person to a tuple representing that person in a database and uniquely identifying that person by a Person ID."
NK's ensure entity integrity and are used for meaningful logical data access. SK's serve as simple and stable PK's for referential purposes. What in the real world do "business keys" or "human keys" represent, what database functions do they fulfill that keys/NK's/PK's and SK's do not, and what are their desirable qualities?

Note: An "intelligent key" (also referred to as column overload) is an attribute defined on a non-simple domain with meaningful components. This is hardly intelligent practice (see The First Normal Form in Theory and Practice, Parts 1-3).


Request to my readers (particularly the regular ones): My current writings are in large part dedicated to making accessible to practitioners David McGoveran's interpretation of Codd's RDM, with refinements, extensions and corrections he believes Codd would have ended up with had he lived long enough. In some important aspects there is divergence from the current understanding of the RDM (well, whatever little understanding there is). David will introduce the formal theory in his forthcoming book, LOGIC FOR SERIOUS DATABASE FOLKS. In my writings I strive to make the theory and its practical implications accessible to the thinking data professional--material that is not available anywhere else. Please disseminate it by linking to my posts, writings and publications on as many of the web and social media sites whose audiences may deem it of interest. Thanks.




Sunday, June 19, 2016

This Week

1. What's wrong with this picture?

This week's picture is the one of the state of knowledge about keys that Martijn Evers painted in Kinds of Keys: On the Nature of Key Classifications, which I had already commented on. As a result of discussions I've been having with David McGoveran in the context of our forthcoming books (his LOGIC FOR SERIOUS DATABASE FOLKS, my DBDEBUNK GUIDE TO FUNDAMENTAL DATA MANAGEMENT MISCONCEPTIONS), I've decided to rewrite my On Kinds of Keys comments on the subject. I refer the reader to Martijn's article for a refresher--my rewrite will be posted next week.

2. Quote of the Week

There are no rules of normalization for non-relational databases. Effectively, you start out by denormalizing everything. Which means you're designing the data organization to serve specific queries. So follow the same principle in NoSQL databases as you would for denormalizing a relational database: design your queries first, then the structure of the database is derived from the queries. --Bill Karwin, What is a good way to design a NoSQL database

Sunday, June 12, 2016

Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation

From last week:

What's wrong with this picture? (Kinds of Data Models, LinkedIn.com)

David Hay: "Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business."

Nothing makes it harder to decide whether to laugh or cry than attempts to dispel confusion that suffer from the very confusion they purport to dispel.

Sunday, June 5, 2016

This Week

1. What's wrong with this picture?
David Hay: Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business.

Nigel Higgs: ... many folks do not get the difference between the Barker entity relationship style of modeling and the relational style of modeling ... [because] the modeling conventions are very similar and the former [is always] a precursor to RDBMS design.

Clifford Heath: Any terminology for models must project three aspects of intention: (a) audience, (b) level of detail and (c) purpose. These three variables are sufficient to discriminate all the main kinds of models in use. The traditional terms of "conceptual/logical/physical" are manifestly inadequate.

Remy Fannader: Models are meant to describe sets of instances (objects or behaviors).
--Kinds of Data Models, LinkedIn.com

2. Quote of the Week

The first consideration that needs to be made when selecting a database is the characteristics of the data you are looking to leverage. If the data has a simple tabular structure, like an accounting spreadsheet, then the relational model could be adequate. Data such as geo-spatial, engineering parts, or molecular modeling, on the other hand, tends to be very complex. It may have multiple levels of nesting and the complete data model can be complicated. Such data has, in the past, been modeled into relational tables, but has not fit into that two-dimensional row-column structure naturally. --Jnan Dash, RDBMS vs. NoSQL: How do you pick?

Sunday, May 29, 2016

Multidimensional Relations, Flat Tables and Logical-Physical Confusion

Many--and I would guess, most--data professionals are clueless about the Relational Data Model (RDM) and its practical value for database practice. They confuse RDBMS with SQL DBMS and use relational jargon, but don't understand what it really means and don't care whether they do or not. As I announced, every other week (with occasional exceptions) I will debunk common and entrenched myths and misconceptions exposed in the preceding week's "What's wrong with this picture?".

From last week:

"Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables. This usually leads to data storage becoming redundant, and also difficult to maintain on account of addition and deletion anomalies. This is only the case if we do not normalize the data. Keys are fields or combinations of fields used to identify records." --Nigel Peck, Denormalization Summary
"Relations are multidimensional. They are not flat. They are not two dimensional. Don't let the term table mislead you." I read the above statement on the back cover of CJ Date book on relational theory "Database in depth". Can anyone help how to visualize this multidimensional nature of relations? --LinkedIn.com