Showing posts with label FD. Show all posts
Showing posts with label FD. Show all posts

Wednesday, August 29, 2018

DISTINCT and ORDER BY Are Not Relational




“One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query ... most people quickly understand:
   
SELECT DISTINCT length
FROM film

[that] returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates) ... Most people also understand:
   
SELECT length
FROM film
ORDER BY length

[that] will give us duplicates, but in order ... And, of course, we can combine the two:
   
SELECT DISTINCT length
FROM film
ORDER BY length

[But if] somewhat intuitively, we may want to order the lengths differently, e.g. by title:
   
SELECT DISTINCT length
FROM film
ORDER BY title

[m]ost databases [sic] fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight ... this

SELECT length
FROM film
ORDER BY title

works after all ... So, how are these different? We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.”
--How SQL DISTINCT and ORDER BY are Related, Jooq.org

Sunday, April 22, 2018

A New Understanding of Keys Part 2: Kinds of Keys




Note: This the second of three re-writes of older posts to bring them in line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades. 


(Continued from Part 1)
"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."

"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."
--Martijn Evers,dm-unseen.blogspot.com
I've deplored the misuse and abuse of terminology due a general lack of foundation knowledge in the industry [3] for longer than I care to remember, and keys are not an exception. If "the discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys", then there is no understanding of relational keys whatsoever: whatever it is, a data structure that contains NULLs is not a relation, one reason for which SQL tables are not relations, SQL databases are not relational and SQL DBMSs are not RDBMSs (for a relational solution to missing data without NULLs see[4]).

We sure can disambiguate, but the key (pun intended) to keys is that they are a relational feature and, thus, can only be properly understood within the dual theoretical foundation of the RDM, which is an adaptation and application of simple set theory (SST) expressible in first order predicate logic (FOPL) to database management. Thus, their "nomenclature on properties and behavior" should reflect what from the real world they represent, and what function they fulfill in the RDM. Which is precisely what the industry disregards.


Sunday, February 11, 2018

The Key to Relational Keys - A New Understanding




Version 3 of paper #4 in the PRACTICAL DATABASE FOUNDATION series is now available to order here. The Key To Relational Keys: A New Understanding is a completely new, re-written version that incorporates David McGoveran's formalization and interpretation of Codd's true RDM and introduces a new perspective. It is distinct from and supersedes all previous versions.

Abstract


The dual theoretical foundation of the RDM — simple set theory (SST) expressible in first order predicate logic (FOPL) — is applicable to database management because it is a theory of unique objects and objects are unique in the real world that databases represent. In the real world objects are uniquely identified by (1) a combination of one or more defining properties and/or (2) names assigned to them as members of various groups. A database relation is a formal representation of an object group, with tuples representing (facts about) and attributes representing properties and/or names of object members. Relational keys are attributes that represent formally in the database those identifying properties and names.

Their fundamental database role notwithstanding, relational keys are poorly understood. 70% of hits @dbdebunk.com are about keys and misconceptions about their necessity, what kinds of keys are relational, their functions, their selection, and so on, abound.
This paper defines and explains:

  • The relational key concept
  • The kinds of relational keys, their properties and functions
  • The formal PK mandate
  • PK selection
  • RDBMS key support

and debunks common misconceptions.


Sunday, January 21, 2018

How to Think (and Not to Think) During Database Design



"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:
 NAME   VALUE
=================
 dept   HR
 dept   fin
 role   engineer
 role   designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:
 NAME  VALUE_JSON_CLOB
==================================
dept   {["HR", "fin"]}
role   {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com

Using a relational database to "maintain lists" probably does not merit attention and I actually considered canceling the debunking of this example. But it provides an opportunity to demonstrate the gap between conventional wisdom, database practice and SQL DBMSs and
Codd's true RDM, as formalized and interpreted by McGoveran [1]. Such use is induced by lack of foundation knowledge, so for the purpose of this discussion I treat the example as a case of "how not to think when performing database design".

Note: Certainly logical database design should not be contaminated with physical implementation considerations such as performance [2].


Wednesday, May 25, 2016

Why Data Scientists Must Understand Normalization



My May post @All Analytics:

We are constantly told how data scientists must be “jacks of many skills”, but one of the most important is rarely included in the list.Very few databases are properly designed. Many SQL databases are denormalized inadvertently, or intentionally (and erroneously) "for performance". They require special constraints to control data redundancy and prevent inconsistencies, which are practically never enforced. Analysts cannot, therefore, take database consistency for granted. Furthermore, to issue sensible queries and ensure correct results and interpretation thereof, it’s not enough for analysts to know the types of fact represented in the database, but also whether and how the database designer has chosen to bundle -- nest or merge -- those facts and how to disentangle them for analysis.

Read it all. (Please comment there, not here)



 



View My Stats