Sunday, January 21, 2018

How to Think (and Not to Think) During Database Design



"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:
 NAME   VALUE
=================
 dept   HR
 dept   fin
 role   engineer
 role   designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:
 NAME  VALUE_JSON_CLOB
==================================
dept   {["HR", "fin"]}
role   {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com

Using a relational database to "maintain lists" probably does not merit attention and I actually considered canceling the debunking of this example. But it provides an opportunity to demonstrate the gap between conventional wisdom, database practice and SQL DBMSs and
Codd's true RDM, as formalized and interpreted by McGoveran [1]. Such use is induced by lack of foundation knowledge, so for the purpose of this discussion I treat the example as a case of "how not to think when performing database design".

Note: Certainly logical database design should not be contaminated with physical implementation considerations such as performance [2].


--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. If you deem this site worthy of continuing, please support its upkeep. A generous reader has offered to match all contributions, so please take advantage of his generosity. Regular monthly contributions and books and papers purchases can ensure this material unavailable anywhere else will continue to be free. Thanks.
---------------------------------------------------------------------------------



Fundamentals


Conventional wisdom notwithstanding, a database relation represents a real world object group. Objects are uniquely identifiable by a set of defining properties and names assigned to them within groups of which they are members. For formal theoretical reasons and database consistency with the conceptual model of reality it represents, every relation:

  • Has a PK; that
  • Represents a name (not a property), either one assigned and used to identify objects in the real world (and, thus, pre-exists the database), or an artificial one generated expressly for it (in which case it must be managed by the DBMS transparently to users) [3,4]; and
  • Is subject to a PK constraint that imposes uniqueness on PK values;
  • Is not just in First Normal Form (1NF), but in Fifth Normal Form (5NF) by definition (otherwise, it is not a relation [5];

A relation:

  • Is in 1NF if all its attributes are defined on simple domains that have no components meaningful to users/applications (i.e., have values that are treated as atomic by the relational data language) [6,7];
  • Is in 5NF if the only dependencies that hold in it are functional dependencies (FD) of the non-key attributes on the PK (informally, the relation represents facts about objects of a single type) [8];

If the designer defines relations only on simple domains, a true RDBMS (but not SQL DBMSs) will support them and its data language (though not SQL) can treat their values as atomic by not allowing constraint and query expressions to reference domain components (which SQL does allow) and, thus, enforce 1NF.



Primary Keys


Is the first table a visualization of a relation and, if so, what objects does the relation represent?

There is no conceptual information to determine that. There seem to be departments, roles and employees involved, but the individual and collective properties required for the corresponding group memberships, the dependencies between individual properties and the relationships between the groups are not specified. Asking for a design recommendation without saying what in the real world the design is supposed to represent -- let alone giving it -- is a fool's errand. For example, a enforcing a uniqueness constraint on {NAME,VALUE} does not, per se, make it a PK, unless:

  • It represents a name used in the real world to identify the objects; or,
  • Was generated for the database to represent an artificial name (and has a 1:1 relationship with a (possibly unknown) set of object defining properties;
whatever those (unknown) objects are. So it's not clear whether there is a PK and, therefore, whether the table is a visualization of a relation (we suspect not).


1NF and 5NF


The above applies to the second table too -- it's not clear there is a proper PK (what with "and some other columns like auto generated ID"). If there is, given that there will be only "fetch all values for a given NAME":

  • The designer could declare the domain on which VALUE_JSON_CLOB is defined as simple; and
  • A RDBMS with a relational data sublanguage would enforce atomicity and 1NF;

But only with a RDBMS! SQL DBMSs do not support simple domains and implement functions that can subvert atomicity, so the Principle of Cautious Design advises against such a design.

Is VALUE_JSON_CLOB functionally dependent on NAME or whatever PK there is? Without the dependency information of the conceptual model that is not clear either.

I refrain from design advice in the absence of a well-defined conceptual model and advise others against it [9].


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.


[2] Pascal, F., Don't Mix Model with Implementation.

[3] Pascal, F., Understanding Relational Keys - A New Perspective: Primary Keys.

[4] Pascal, F., The Key to Relational Keys: A New Perspective, forthcoming.

[5] Pascal, F., What Relations Really Are and Why They Are Important.

[6] Pascal, F., Simple Domains and Value Atomicity.

[7] Pascal, F., First Normal Form in Theory and Practice Part 1,2,3.

[8] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.

[9] Pascal, F., Don't Design Databases Without Foundation Knowledge and Conceptual Models.

No comments:

Post a Comment

View My Stats