Sunday, August 30, 2015

Weekly Update

The Real Data Science: Tables--So What?

My Presentation to Silicon Valley SQL Server User Group

6:30 PM, Tuesday, September 15, 2015

1065 La Avenida Building 1
Mountain View, CA

Free and open to the public (+ pizza)
For details and RSVP see Meetup

1. Quote of the Week

[Do] formalized languages need the definition of data types? Up to now I have not read strong arguments against my statement that for interpretation and operation on data the use of character strings is sufficient when
  • All data are expressed as character strings that are explicitly based in language communities, whereas the character strings denote concepts that are represented by UID's;
  • The denoted concepts are defined by their supertype concepts (among others);
  • Collections of allowed qualitative concepts (that are denoted by string values or value ranges) are defined to enable the specification of constraints;
2. To Laugh or Cry?

3. Online Debunkings

4. Elsewhere
Which technologies emerge from the abyss
Why Big Data gets it Wrong

5. Housekeeping Added to LINKS page:

  • Query Optimization
  • Relational Algebra
  • Relational
  • Relational Algebra Translator
  • System for Translating Relational Algebra Scripts into Microsoft SQL Server SQL Scripts
  • System for Translating Relational Algebra Scripts into Oracle SQL Scripts
And now for something completely different

Saturday, August 22, 2015

Silicon Valley SQL Server User Group Presentation

The Real Data Science: Tables -- So What? 

During hyping of fads such as "Data Science", all you hear is the "huge opportunities for enterprises to gain hitherto unimagined insights" and very little about the potential to tell enterprises really big lies, which can rise from 100% correct data in poorly designed databases. That's because what passes for "Data Science" is not science, let alone science of data. 

Most data professionals know that relational databases consist of tables, but so what? Provably correct query results are guaranteed by the real data science--the RDM--if and only if tables are well-designed and properly constrained R-tables and the DBMS truly and fully supports it. Unfortunately, more often than not tables are neither, and SQL DBMS's don't, which makes databases harder to understand,  queries don't always make sense and results are hard to interpret, or outright wrong. 

You will learn: 
  • The Real Data Science
  • Relations and databases
  • 5NF R-tables
  • "Table arithmetic"
  • RDM and SQL

6:30 PM, Tuesday, September 15, 2015

1065 La Avenida Building 1
Mountain View, CA

For details see Meetup.

Sunday, August 16, 2015

Weekly Update

1. Quote of the Week
Later, as use of RDBMS became more widespread, the complexity associated with design of a RDBMS was also well documented ... The associative database model is claimed to offer advantages over RDBMS ... “two fundamental data structures” as “„Items‟ and a set of „Links‟ that connect them together ... Items, which have "a unique identifier, a name and a type” and Links, which have “a unique identifier, together with the unique identifiers of three other things, that represent the source, verb and target of a fact that is recorded about the source in the database ... “each of the three things identified by the source, verb and target may each be either a link or an item.”
--Homan, J. V. and Kovacs, P. J., A Comparison of the Relational Database Model and the Associative Database Model, Issues in Informtion Systems, Vol. X, No. 1, 2009.
2. To Laugh or Cry?
A Comparison of the Relational Database Model and the Associative Database Model
3. Online Debunkings

4. Interesting Elsewhere
Why Domain Expertise is More Important than Algorithms
5. And now for something completely different

Thursday, August 13, 2015

Sunday, August 9, 2015

Surrogate Key Illusions

When designing a database, early decisions can have a huge impact on the performance and storage requirements. These decisions can be difficult to change later, as most subsequent work will depend on the physical model.
--Keller, J. J., SQL Server Datatypes: Common modeling dilemmas
Here's one of those early design decisions:
When defining a surrogate primary key for a table, two options are the most common: Integer and UniqueIdentifier (aka Globally Unique Identifiers, or GUID's) ... Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic. UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”. With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution.

Sunday, August 2, 2015

Weekly Update

1. Quote of the Week
I am designing a mySQL database. I created tables and added extra columns for future use. Will it affect performance?
2. To Laugh or Cry?
Why you should never, ever, ever use MongoDB
3. Online Debunkings
Fixing 7 common database design errors
4. From the industry
Amazon's MySQL database challenger Aurora exits preview
5. And now for something completely different

Sunday, July 26, 2015

Interpreting Codd: Normalization

One of the most common and entrenched misconceptions about the RDM is the confusion of its sound theoretical foundation with "just a theory", implying lack of practicality. Not many realize that RDM includes several adaptations Codd made to set theory, to make it applicable to database practice.

In 1997, a computer science course at an Australian university posted on the Net a Q&A discussion of Codd's work. The answers were actually closer to the truth than what can be expected today and with a few corrections and clarifications they can be useful. The reader is encouraged to test self in assessing the answers before proceeding to my comments.
Q: When we normalise, we remove non-simple domains. However, in doing so, we create a lot of duplication. I was under the impression that we should try to minimise duplication as it is hard to keep data consistent and it also doesn't waste space. So what is so special about normalising?
A: The first advantage of normalising is that it renders all values atomic, thus simplifying all data structures: a huge advantage for storage and communication purposes. It is true that as a result of normalisation (in the sense of removing non-simple domains) duplication is introduced, but the process of normalisation is at the very heart of the relational model, i.e., atomic units in n-column homogeneous arrays. Therefore, a small element of redundant information is introduced for this advantage. Redundancy can be further reduced by other degrees of normalisation.
One of the most entrenched and frequent complaints is that the RDM is "just theory" that often clashes with database practical needs in the real world. Data professionals are mostly unaware that Codd made many adjustments to the abstract mathematical theory of relations, to adapt it to real world database management (that is why I insist on the distinction between relations and R-tables). One such adjustment are keys--PK's and FK's--that relations do not have.