Sunday, January 21, 2018

How to Think (and Not to Think) During Database Design

"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:

 NAME   VALUE
=================
 dept   HR
 dept   fin
 role   engineer
 role   designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:

 NAME  VALUE_JSON_CLOB
==================================
dept   {["HR", "fin"]}
role   {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com

Using a relational database to "maintain lists" probably does not merit attention, and I actually considered canceling the debunking of this example. But it provides an opportunity to demonstrate the gap between conventional wisdom, database practice and SQL DBMSs on the one hand, and Codd's true RDM, as formalized and interpreted by McGoveran [1], on the other. Such use is induced by lack of foundation knowledge, so for the purpose of this discussion I treat the example as a case of "how not to think when performing database design".

Note: Certainly logical database design should not be contaminated with physical implementation considerations such as performance [2].
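
To make the two quoted options concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is not one of the DBMSs named in the question, and the table names (list_item, list_blob) and helper functions are hypothetical, chosen only for illustration. The sketch says nothing about performance; it only shows that in the first option the declared uniqueness constraint is enforced by the DBMS, while in the second the list content is an opaque string that the same kind of declaration cannot reach.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 1: one row per (name, value) pair; uniqueness is declared to the DBMS.
conn.execute(
    "CREATE TABLE list_item ("
    " name TEXT NOT NULL, value TEXT NOT NULL, UNIQUE (name, value))"
)
conn.executemany(
    "INSERT INTO list_item VALUES (?, ?)",
    [("dept", "HR"), ("dept", "fin"), ("role", "engineer"), ("role", "designer")],
)

def values_for(name):
    """Fetch all values recorded for a given name (option 1)."""
    rows = conn.execute(
        "SELECT value FROM list_item WHERE name = ? ORDER BY value", (name,)
    )
    return [row[0] for row in rows]

def try_insert_item(name, value):
    """Return True if the DBMS accepted the row, False if it rejected it."""
    try:
        conn.execute("INSERT INTO list_item VALUES (?, ?)", (name, value))
        return True
    except sqlite3.IntegrityError:
        return False

# Option 2: one row per name; the whole list is a JSON string.
conn.execute(
    "CREATE TABLE list_blob (name TEXT NOT NULL UNIQUE, value_json TEXT NOT NULL)"
)
conn.execute("INSERT INTO list_blob VALUES (?, ?)", ("dept", json.dumps(["HR", "fin"])))

def append_to_blob(name, value):
    """Read-modify-write of the JSON string; the DBMS does not see list elements."""
    (raw,) = conn.execute(
        "SELECT value_json FROM list_blob WHERE name = ?", (name,)
    ).fetchone()
    items = json.loads(raw)
    items.append(value)
    conn.execute(
        "UPDATE list_blob SET value_json = ? WHERE name = ?", (json.dumps(items), name)
    )
    return items

duplicate_accepted = try_insert_item("dept", "HR")  # duplicate pair in option 1
blob_after_append = append_to_blob("dept", "HR")    # duplicate element in option 2
```

In this sketch the duplicate pair is rejected by the declared constraint in the first table, whereas the duplicate element in the second is stored without complaint, because UNIQUE (name) constrains only the name column and knows nothing about what is inside the string.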

Monday, January 15, 2018

This Week and The End of Empire


1. Database truth of the week

"ALL names are human created, either by non-algorithmic assignment, or via some algorithm. We ONLY know that two types of objects are distinct because they have different sets of defining properties and, for a given object type, we ONLY know that two objects are distinct because the values (observed or measured) of that object type's defining properties are distinct. Names (of objects of some type) allow us to distinguish two such entities ONLY when they are 1:1 with the values of the object defining properties. Two sets of names (whether human assigned or machine generated) consistently identify the same set of entities ONLY when they are 1:1." --David McGoveran


2. What's wrong with this database picture?

"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:
 NAME   VALUE
=================
 dept   HR
 dept   fin
 role   engineer
 role   designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:
 NAME  VALUE_JSON_CLOB
==================================
dept   {["HR", "fin"]}
role   {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com

Sunday, January 7, 2018

Understanding Relational Keys - A New Perspective: Primary Keys

Note: Rewritten 1/1/18. This is one of three re-writes of older posts on keys that bring them in line with the McGoveran formalization and interpretation [1] of Codd's true RDM (see the second [2] and third [3]). They are abbreviated extracts from the forthcoming rewrite of [4], which presents in greater depth and detail a new perspective on relational keys, distinct from the conventional wisdom of the last five decades. Re-reads are strongly recommended.

Revised: 1/14/18.

"A key is a column or columns that together have no duplicate values across rows. Also the columns must be irreducibly unique, meaning no subset of the columns has this uniqueness ... In most databases primary keys have survived as a vestige, and nowadays merely provide some conveniences rather than reflecting or determining physical layout. For instance declaring a primary key includes a NOT NULL constraint automatically, and defines the default foreign key target in a PostgreSQL table. Primary keys also give a hint that their columns are preferred for joins." --Joe Nelson, SQL Keys in Depth, begriffs.com

Q: "My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

A: "When a primary key is chosen that is not immutable?"
--Why should a primary key change, StackExchange.com

There is a general and persistent lack of foundation knowledge in the industry [5] and keys are no exception. 70% of searches hitting this site are about keys, just one indication that this fundamental relational feature is poorly understood decades after the RDM was introduced. Does it really take "wading through sixty-four articles, skimming sections in five books, and asking questions on IRC and StackOverflow" to "put the pieces together"? And what Nelson then put together is conventional wisdom (besides, SQL is the last source to go to for really understanding anything in depth, let alone relational features [6]).

Keys can only be understood within the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL), adapted and applied to database management. SQL's authors never understood this foundation, and the industry disregards it. For a proper understanding of keys, read my trilogy of posts and [4] for the in-depth treatment, then compare them with Nelson's take and with your SQL DBMS's support of keys.
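
For readers who want to see what the conventional definition quoted above from Nelson ("no duplicate values across rows" and "irreducibly unique") amounts to operationally, here is a minimal Python sketch. The function names and the sample rows (taken from the list example in the StackOverflow question quoted elsewhere on this page) are assumptions made for illustration. Note the limitation: the check inspects one sample of rows, whereas a key is a constraint on all admissible values, which no inspection of current data can establish.

```python
from itertools import combinations

def is_unique(rows, columns):
    """True if no two rows in this sample agree on all the given columns."""
    seen = set()
    for row in rows:
        projected = tuple(row[c] for c in columns)
        if projected in seen:
            return False
        seen.add(projected)
    return True

def is_irreducibly_unique(rows, columns):
    """True if the columns are unique in this sample and no proper,
    non-empty subset of them is already unique."""
    if not is_unique(rows, columns):
        return False
    for size in range(1, len(columns)):
        for subset in combinations(columns, size):
            if is_unique(rows, subset):
                return False
    return True

# Sample rows only; they say nothing about rows that may be inserted later.
sample = [
    {"name": "dept", "value": "HR"},
    {"name": "dept", "value": "fin"},
    {"name": "role", "value": "engineer"},
    {"name": "role", "value": "designer"},
]

name_alone_unique = is_unique(sample, ("name",))
value_alone_unique = is_unique(sample, ("value",))
pair_irreducible = is_irreducibly_unique(sample, ("name", "value"))
```

On this particular sample, value alone happens to be unique, so the pair (name, value) fails the irreducibility test even though it is the constraint declared in the question. That is exactly why uniqueness observed in data is not a substitute for knowing what the data means.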

Friday, December 29, 2017

DBMS for Analytics: Risky Business Without Foundation Knowledge, Part 2 (A2)


My December Post @All Analytics
 
Data practitioners who perceive integrity as a weakness are in the wrong field: they do not understand what integrity and consistency are and, therefore, the meaning of data (see "What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency" and "Redundancy, Consistency, and Integrity Derivable Data"). By enforcing integrity constraints, a DBMS ensures consistency of the database with the business rules that denote the meaning of the data (i.e., the faithfulness of the database to the conceptual model of the segment of the real world it represents).
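
As a small illustration of a DBMS enforcing declared constraints, here is a minimal sketch using Python's built-in sqlite3 module. The employee table and its business rules (a department must be one of two known values, a salary must be positive, an employee id identifies one employee) are hypothetical, invented only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical business rules expressed as declared integrity constraints.
conn.execute(
    """
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        dept   TEXT    NOT NULL CHECK (dept IN ('HR', 'fin')),
        salary INTEGER NOT NULL CHECK (salary > 0)
    )
    """
)

def try_insert(emp_id, dept, salary):
    """Attempt an insert; report whether the DBMS accepted or rejected it."""
    try:
        conn.execute(
            "INSERT INTO employee VALUES (?, ?, ?)", (emp_id, dept, salary)
        )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert(1, "HR", 50000),     # consistent with all three rules
    try_insert(2, "sales", 40000),  # violates the department rule
    try_insert(3, "fin", -10),      # violates the salary rule
    try_insert(1, "fin", 30000),    # violates the key constraint
]
row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

Only the first row is stored. The rejected rows never reach the table, so every application reading it can rely on the rules holding without re-checking them.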

Read it all (and please comment there, not here). 



Thursday, December 21, 2017

Three Re-writes with Season's Greetings

The following posts have been re-written to bring them in line with the McGoveran formalization and interpretation of Codd's true RDM. Re-reading is strongly recommended.
"But the core Information Principle (IP) of the RDM mandates that all information in a relational database be represented explicitly and in exactly one way -- as values of relation attributes defined on domains. The difference between relation names is, thus, meaningful information, the representation of which violates the IP and the RDM, for which reason it is inaccessible to the DBMS: consider the candidate tuple {v1,v2} -- it is impossible for the DBMS to know to which relation it belongs based on the relation and attribute names because it does not understand semantics!" --Database Design Relation Predicates and “Identical Relations”
"Some set defining properties are formed as the disjunction of two or more properties (a kind of relationship between two common properties). These disjuncts, taken together, are meaning criteria. Each meaning criterion (an individual disjunct) induces a partitioning of a set into two subsets, those that meet the criterion and those that do not. Alternatively, we can say that each meaning criterion serves to differentiate a possible subset of a set from other subsets of the set (some of the possible subsets will be disjoint, while others not). Each of the possible subsets of the set is then defined by (“inherits”): The defining properties of the set conjoined with at least one meaning criterion (that or those becoming the defining property, or properties, respectively, specific to the proper subset)." --Meaning Criteria and Entity Supertype-Subtypes
"Although they are no longer used, inquiries about them persist and with the current proliferation of non-relational products (e.g., NoSQL, graph DBMSs) there is value in understanding them. The closest the industry came to implementing the RDM is SQL which, despite its poor relational fidelity, proved much superior relative to the complexity and inflexibility of preceding DBMSs. But the rules still expose poor relational fidelity of SQL DBMS's that have not been addressed for four decades, while new RDM violations were introduced.

We offer here our clarifications on the rules. For each rule, we:
  • Explain its intended objective;
  • Offer clarifications, some of which reflect our current understanding of the RDM -- distinct from conventional wisdom -- based on its dual theoretical foundation and a careful analysis of Codd's work;" --Interpreting Codd's 12 Rules


Sunday, December 17, 2017

This Week


1. Database truth of the week

"Within the database field, it is common to refer to three “levels” of description: conceptual, logical, and physical. Both the logical level and the physical level are formal systems. By contrast, the conceptual level is typically an informal system and refers to the subject of the database.

The conceptual language is a subject language, in the terminology of formal systems. The conceptual level identifies the concepts to be formally represented by the logical and physical levels, and how users think and talk about those concepts. This level corresponds only informally to the so-called “conceptual schema” of earlier approaches to information management, which emphasized the capture of conceptual information using various techniques including diagrams and documentation having various degrees of formality, but not forming a strictly formal system themselves."
-- David McGoveran

2. What's wrong with this database picture?


I re-wrote two older debunkings to bring them in line with the McGoveran formalization and interpretation of Codd's true RDM. Re-reads are recommended.
"Can you have 2 tables, VIEWS and DOWNLOADS, with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other ..."
"I have a database for a school ... [with] are numerous tables obviously but consider these:
CONTACT - all contacts (students, faculty) has fields such as LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password for electronic timesheet login, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT.
Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? ..."

3. To Laugh or Cry?

"Database design is the structure a database uses to plan, store and manage data. Data consistency is achieved when a database is designed to store only useful and required data ... The outline of the table allows data to be consistent. Cascading also ensures data uniformity ... Optimized relationships ensure efficient database performance ... Overall performance of a database is dependent on the design ... Database normalization, or data normalization, is a technique to organize the contents of the tables for transactional databases and data warehouses ... without normalization, database systems can be inaccurate, slow, and inefficient, and they might not produce the data you expect ... When you normalize a database, you have four goals: arranging data into logical groupings such that each group describes a small part of the whole; ... organizing the data such that, when you modify it, you make the change in only one place; and building a database in which you can access and manipulate the data quickly and efficiently without compromising the integrity of the data in storage ... Sometimes database designers refer to these goals in terms such as data integrity, referential integrity, or keyed data access." --Halil Lacevic, What are the uses or importance of database design? Quora.com

Sunday, December 10, 2017

Conventional Wisdom and True Relational Features

Here's what's wrong with last week's picture, namely:
"Per Date’s AN INTRODUCTION TO DATABASE SYSTEMS, Date & Darwen’s DATABASES, TYPES, AND THE RELATIONAL MODEL, and related references, the features of a relational database are values, types, attributes, tuples, relations, relation-valued variables, operators, and constraints.
  • A type is a set of values and related operators.
  • An attribute is a name, value, type triple.
  • A tuple is a set of attributes.
  • A relation is a set of tuples with a given heading.
  • A relation-valued variable (known as a relvar) is a persistent variable whose time-varying value is a relation." --Dave Voorhis, Computer scientist; lead developer of Rel, a true relational database system, Quora.com

This is more or less the conventional wisdom, which is nothing like the true RDM envisioned by Codd [1].