Monday, July 25, 2016

Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant

Here's what what wrong with last week's picture, namely:
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --LinkedIn.com

One of my readers once wondered why "database professionals understand uniqueness via keys, but don't seem to understand why duplicate rows should be prohibited and the consequences of breaking relational closure": 

Sunday, July 17, 2016

This Week

1. What's wrong with this picture? 
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --Question about the relational model


Sunday, July 10, 2016

Levels of Representation: Relationships, Rules, Relations and Constraints

What's Wrong with Last Week's Picture (Question about relational model )

There are relationships at both the conceptual and logical representation levels. Confusing them is bad conceptual modeling and database design.

Relationships, Rules and Relations

AT: "In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)".
A relation is a set of tuples that is a subset of the Cartesian product of the domains, i.e., it is a relationship of domains at the logical level (Chen may have been misquoted: a tuple is a set of attribute values i.e., a relationship thereof, that represents a fact about a single entity).

Thursday, July 7, 2016

NoSQL, Big Data Analytics, and the Loss of Knowledge and Reason

My June post @All Analytics:

"The data management industry operates like the fashion industry. Its most persistent characteristic is migration from fad to fad. Every few years -- the number keeps getting smaller -- some "new" problem is discovered, for which the solution is so magical, that it is extended everywhere to everything, whether it is applicable or not. But many of these problems are old and fundamental and some of the “solutions” bring them back, rather than solve them. ..."

Please comment there, not here.

Read it all. 



Saturday, July 2, 2016

This Week

1. What's wrong with this picture?

AT: Well, I think I am a bit confused now. In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)
GE: I argue that there is essentially no difference between relationships between entity (type tables) and between an entity and its attributes. They both represent relationships between two populations of things. Something is an attribute by virtue of there being a relationship. If relationships are represented by foreign keys and the entity tables must be in 1NF, as in the relational model, then all relationships must be at most Many-to-One (a very unnecessary limitation when modeling some user domain).
TF: The relational model was a mathematical construct, derived from set theory. Hence that particular terminology. The entity-relationship model is essentially a directed graph model, where relationships are prominent residents. Not so in the relational model (despite the name), where relationships (between relations, mind you) are not visible and in the SQL implementations is reduced to constraints. Relationships are about structure, which is as important as meaning (the semantics of the terms used in the universe being modeled).
2. Quote of the Week
"In Relational Theory sometimes the relationships, where we do our Joins are much more important than the attributes on an Entity." (quoted in LinkedIn.com exchange)

Sunday, June 26, 2016

On Kinds of Keys: Natural, Primary and Surrogate Are Sufficient

Revised 7/26/16 (see the Understanding Keys post of 7/31/16 for a more in-depth discussion).

Here is what is wrong with last week's picture painted by Martijn Evers in Kinds of Keys: On the Nature of Key Classifications.
"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them they talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)?"
I, of course, sympathize, as I've deplored the misuse and abuse of terminology in the industry for longer than I care to remember. This is not unique to keys, it's a general absence of foundation knowledge. The answer is yes, we can disambiguate--the key to keys is understanding what in the real world they represent and what database function they fulfill. Things are much simpler than the mindless industry jargon.

A key is a set of one or more relation attributes that uniquely identify tuples, which represents facts about entities in the database. Uniqueness (key) constraints on the relation for every key enforces entity integrity (i.e., there is no duplicate representation of facts about any entity).

Primary Keys

"Of all kinds of key, the primary key and the surrogate key gained the most discussion.If we take a look at the relational model we only see candidate keys a minimal set of one or more attributes that are unique for each tuple in a relation--no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."
Codd mandated a primary key (PK) for every relation with multiple keys, mainly for referential purposes (i.e., as target for references by other relations). Obviously, a single key is the PK, but when there are multiple keys, they are candidates and one is selected as PK. The choice is, indeed, not formal (which is why Date downgraded Codd's mandate to just "a good idea"; I would not advise to skip it), but is not arbitrary either: for practical reasons, stability and simplicity are desirable PK qualities.

Entity Properties and Natural Keys

Natural keys (NK) represent in the database entity-defining identifiers (EDI): a set of one or more observable/assertable properties inherent to all entities that defines and uniquely identifies each (e.g., DNA for people, ownership and a set of books for companies). As such, they are meaningful to users and are used for logical data access. EDI's and, therefore, NK's are immutable by definition--they change only if the entities themselves change (e.g., companies involved in mergers, acquisitions, spinoffs)--but tend to be complex (i.e., combinations of attributes), or have complex representations (e.g., DNA). They don't make practical PK's.

Entity Names and Surrogate Keys

When no stable and simple EDI can be identified or explicitly specified in the real world, entities are usually assigned a unique name to substitute for the complex or implicit EDI (e.g., SSN, S/N, patient id), such that all descriptive properties and other names are functionally dependent on it. Entity names are represented in the database by surrogate keys (SK) and all other attributes and keys are functionally dependent on it.

SK's have nothing to do with SQL and NULL's--SQL tables with NULL's are not relations and, therefore, outside the RDM, where all bets are off.
"If we treat surrogate keys as an arbitrary candidate key, we actually decrease the normalization of a data model from, say, 3NF to 2NF. This has led to the special and strict treatment we have for surrogate keys as to prevent this from leading to unwanted issues. Surrogate keys should be seen as not an extra key but as a surrogate for an existing key whose semantics and behavior is strictly regulated by the DBMS system."
As its name implies, SK's are not keys--they do not represent real world properties--but key substitutes that are added to other keys, if any. Since all attributes are functionally dependent on a SK by definition, a surrogate PK does not decrease normalization.
"Business Keys on the other hand are independent keys that are not designated as a surrogate key whose familiarity, uniqueness, visibility and scope span an entire enterprise. Human keys for example are actually not keys in the strict sense, but are non-enforceable candidate keys that heuristically identify tuples, especially those related to real world entities. They are related to the process of relating (UoD) entities to tuples. E.g. using full name and full address to identify and relate a real world person to a tuple representing that person in a database and uniquely identifying that person by a Person ID."
NK's ensure entity integrity and are used for meaningful logical data access. SK's serve as simple and stable PK's for referential purposes. What in the real world do "business keys", or "human keys" represent, what database functions do they fullfill that keys/NK's/PK's and SK's do not and what are their desirable qualities?

Note: An "intelligent key" (also referred to as column overload) is an attribute defined on a non-simple domain with meaningful components. This is hardly intelligent practice (see The First Normal Form in Theory and Practice, Parts 1-3).


Request to my readers (particularly the regular ones): My current writings are in large part dedicated to making accessible to practitioners David McGoveran's interpretation of Codd's RDM, with refinements, extensions and corrections he believes Codd would have ended up with had he lived long enough. In some important aspects there is divergence from the current understanding of the RDM (well, whatever little understanding there is). David will introduce the formal theory in his forthcoming book, LOGIC FOR SERIOUS DATABASE FOLKS. In my writings I strive to make the theory and its practical implications accessible to the thinking data professional--material is not available anywhere else. Please disseminate it by linking to my posts, writings and publications on as many of the web and social media sites for whose audience may deem it of interest. Thanks.




 

Sunday, June 19, 2016

This Week

1. What's wrong with this picture?

This week's picture is the one of the state of knowledge about keys that Martijn Evers painted in Kinds of Keys: On the Nature of Key Classifications, that I had already commented on. As a result of discussions I've been having with David McGoveran in the context of our forthcoming books (his LOGIC FOR SERIOUS DATABASE FOLKS, my DBDEBUNK GUIDE TO FUNDAMENTAL DATA MANAGEMENT MISCONCEPTIONS), I've decided to rewrite my
comments On Kinds of Keys on the subject. I refer the reader to Martijn's article for a refresher--my rewrite will be posted next week.

2. Quote of the Week

There are no rules of normalization for non-relational databases. Effectively, you start out by denormalizing everything. Which means you're designing the data organization to serve specific queries. So follow the same principle in NoSQL databases as you would for denormalizing a relational database: design your queries first, then the structure of the database is derived from the queries. --Bill Karwin, What is a good way to design a NoSQL database