Monday, December 5, 2016

Prediction, Explanation and the November Surprise

Given the overhyped promise of data science, the "shock" at the broad failure to predict the election outcome was inevitable. Skimming through the media and technical accounts, it looks like a better understanding of prediction and explanation is necessary for less surprises and sounder analytics.

My November post @All Analytics. Read it all.

Domain vs. (Data) Type, Class vs. Relation (UPDATE)

(This is a new version of a 8/12/12 post)

What's wrong with last week's picture, namely:
"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" is a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" only in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous."
With respect to 1, well, yes, they are distinct, but not for the stated reason. With respect to 2, well, no insofar as "programming parlance" goes. The terminology introduced by Codd was explicitly intended to distinguish formal concepts from set theory and first order predicate logic from the terminology used in programming practice. 

Monday, November 28, 2016

This Week

THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.



1. What's wrong with this picture?

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases:
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" *is* a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" *only* in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous." --Comment @dbdebunk.com

Sunday, November 20, 2016

The Principle of Orthogonal Database Design Part III

(Continued from Part II)
 

As we have seen, a true RDBMS that supports constraint inheritance can--if database design adheres to the POOD--support directly and transparently entity supertype-subtype (ESS) relationships by
  • Generating the appropriate views;
  • Generating and enforcing disjunctive constraints on independent base relations;
  • Properly propagating view updates to the underlying base relations.

SQL and ESS Support

Common misconceptions notwithstanding, for many reasons SQL DBMSs are not truly relational.
There is little integrity support beyond shorthands for key and referential (foreign key) constraints, let alone constraint inheritance. In fact, even though disjunctive constraints are syntactically similar to referential constraints, users cannot define them (at least not declaratively) because SQL DBMSs have not implemented the ANSI/ISO standard SQL ASSERTION [1,2]. Consequently, and because designs are not guaranteed to adhere to the POOD, multi-relation views  are not updatable. So while the POOD is desirable, even if database design adheres to it, ESS support by SQL DBMSs is not possible.

Monday, November 14, 2016

This Week

I have revised:
THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. Quotes of the Week
"Wow. Been using SQL for 15 years, and didn’t even know about Except and Intersect!" --Blog.Jooq.org
"I am looking for database schemas with many tables (>100 tables). Where can I find them? I am currently using mysql and haven't done serious database design. So interested in looking at samples with ER diagrams." --YCombinator.com
2. To Laugh or Cry?

Monday, November 7, 2016

Principle of Orthogonal Database Design Part II (UPDATED)

Revised: 11/11/16

(Continued from Part I, which I revised significantly enough to recommend a re-read.)

To recall, adherence to the POOD means independent base relations--no base facts are derivable from the other base facts--that have unique relation predicates (RP) i.e., relations are uniquely constrained.

Going back to the employees example in Part 1, because salary and commission are mutually exclusive properties of employees, salaried and commissioned employees are subtypes of the employee supertype. Entity subtypes inherit the properties and constraints of the supertype.

Entity Supertype-subtype Relationships

The three base relations
EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)
represent the entity supertype and two subtypes. This design avoid NULLs, but note that aside from redundancy, the relations are not independent--supertype facts are derivable from subtype facts--in violation of the POOD.

Monday, October 31, 2016

This Week

1. Quote of the Week

"Normalization is and will always be a direct trade-off. You give up performance (time) for space. Indexing mitigates in the opposite direction of this trade-off." --YCombinator.com

2. To Laugh or Cry?

Comments on my "Denormalization for Performance: Don't Blame the Relational Model"

3. THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS 

is available. Order here.