Wednesday, January 9, 2019

Data and Meaning Part 3: Database Design




We have seen in Part 2 that the meaning of data in a database is the conceptual model that the database is intended to represent, namely (1) the three types of objects -- entities of multiple types that form entity groups that form a multigroup -- and (2) the business rules (BR) that specify their properties:
  • Properties in context (PiC) shared by entities of each type;
  • Collective group properties (i.e., relationships among entity group members);
  • Multigroup properties (i.e., inter-group relationships).
Often somebody produces one or more tables and asks if there's "anything wrong" with them,  or "if they are in some specific normal form and, if not, how to normalize them". This reflects lack of foundation knowledge. 


------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE 

Up to 2018, DBDebunk maintained and kept free with the proceeds from my @AllAnalitics column. In 2018 the website was has been discontinued. The content is not available anywhere else, and if you deem it useful, particularly if you are a regular reader, please ensure its continuation and free availability by supporting it with as much as you can afford via purchases of publications, or donations. Thank you.

LATEST PUBLICATIONS

·   Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING OF THE REAL RDM series, is available for ordering here.
·   Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING OF THE REAL RDM series, is available for ordering here.

SOCIAL MEDIA 

I deleted my Facebook account. You can follow me on Twitter:

@dbdebunk: will contain links to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture, which I am bringing back. 

@ThePostWest: will contain evidence for, and my take on the spike in Anti-semitism that usually accompanies existential crises. The current one is due to the decadent decline of the West and the corresponding breakdown of the world order.

HOUSEKEEPING

  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 columns to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work. 
------------------------------------------------------------------------------------------------------------------

Fundamentals


There are three formal database design principles that jointly system-guarantee logical validity and semantic correctness of query results, and according to an yet unproven McGoveran conjecture, they jointly imply the Principle of Full Normalization (POFN), but not vice-versa[1,2]. In other words, if you adhere to the three principles, you end up with fully normalized databases, but a fully normalized database does not necessarily satisfy all three principles. Explicit normalization, or further normalization is for repair of poor designs that failed to adhere to the principles[3].

Notwithstanding the common understanding in the industry that database relations are at least in first normal form (1NF) (or normalized), we now contend that a database relation is by definition in fifth normal form (5NF) (or fully normalized), otherwise it is not a database relation[1]: the only dependencies that hold in it are functional dependencies (FD) of all the non-key attributes on the primary key (PK) (i.e., for every PK value there is exactly one value of each of the non-key values, but not vice-versa)[4]. This is a formal way of saying that a relation represents (facts about) a collection of entities of a single type (i.e., an entity group), which, as we have seen, means consistency with
the BRs that denote its meaning[5,6], which we categorized in Part 2, including the single entity type BR.

Normal form is a property of a relation, not of its tabular display on some physical medium (e.g., screen) -- the two should not be, but are constantly confused[5]. It makes no sense to talk about the "normal form of a table" -- what can be asked is if a table is a R-table (i.e., if it displays a relation), which cannot be determined from the table alone. You can ascertain that tables (the data) are R-tables only with reference to some conceptual model (i.e., the intended meaning of the data). Note very carefully: meaning intended by the modeler -- the objects and the BRs specifying their properties -- which is not in the table! When you are called to determine the "correctness" of tables without a corresponding model -- as in the examples I linked to above -- you are being asked to infer meaning from the data, which is like assuming what you need to prove. 



Semantically Driven Design


Consider, for example, the table:

==========================
 EMP# PROJ#         DEPT#
==========================
 100  Sys Support   E21
 100  Comp Svcs     E21
 100  Supp Svcs     E21
 160  Info Center   D11
 310  Documentation D11
 310  Mfg Systems   D11
 150  Info Center   D11
-------------------------

It is a R-table if and only if its data is consistent with its intended meaning (i.e., satisfies the corresponding BRs in some specific conceptual model). It's easy to show that whether it is a R-table or not depends on the intended meaning of

its data -- the conceptual model.

Note: By satisfying the BRs we mean the constraints they formalize as that are enforced by the DBMS[7,8,9].

Model 1

Suppose three types of entities are identified -- employees, projects, and departments -- and:

  • An employee is assigned to one department, and a department has one or more employees assigned to it;
  • An employee is assigned to one or more projects, and every project has one or more employees assigned to it;
such that there are:
  • A multigroup 4OP due to m:1 relationship among departments and employees;
  • A multigroup 4OP due to m:n relationship among projects and employees.
Given this model, there are three entity groups, each formalizing as a properly constrained relation:
E {EMP#,DEPT#,...}
D {DEPT#,...}
P {PROJ#,...}

and the m:n 4OP is represented by the relation:

EP {EMP#,PROJ#}
By this model the table is not a R-table (why?).


Model 2

Suppose now that an employee is assigned to one one or more departments, and a department has one or more employees assigned to it, such that:

  • There is a multigroup 4OP due to m:n relationship among departments and employees; and,
  • The assignments to departments and to projects are independent.
Given this model, each of the 4OP is represented by a relation:
ED {EMP#,DEPT#}
EP {EMP#,PROJ#}
Under this model the table is not a R-table either (why?).


Model 3

Suppose now the assignments to departments and to projects are not independent (i.e., there is a BR that specifies a multigroup 4OP due to a three-way m:n:r relationship among employees, departments, and projects) that is represented by the relation:

EDP {EMP#,DEPT#,PROJ#}
By this model the table is a R-table.
 

Unless you know which of the three models was intended, you cannot tell if the table is a R-table, nor can you infer which model was intended.


Conclusion


It is meaningless to assess database design in the absence of a conceptual model, yet this is one of the most common requests for help in the industry[10]. Practitioners are lured into inferring a model from tables and they do. The simplicity of our example may tempt you too, and, if so, how about this table?
=========================================
 COURT  MEMBER  START  END     RATE
=========================================
 hard   yes     09:30  10:30   SAVER
 hard   yes     11:00  12:00   SAVER
 hard   no      14:00  15:30   STANDARD
 grass  no      10:00  11:30   PREMIUM-B
 grass  no      11:30  13:30   PREMIUM-B
 grass  yes     15:00  16:30   PREMIUM-A
-----------------------------------------

Note: Actually, it's worse[11]. As the second example I linked to above shows, even with a (albeit, partial) model, practitioners do not know how to design the database, or to assess an existing design.

It is a mark of the knowledgeable practitioner to resist the temptation and induce those requesting such advice to come up with a conceptual model first. 


Continued in Part 4.


Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.


References

[1] Pascal, F., Logical Validity and Semantic Correctness.


[2] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming (draft chapters).

[3] Pascal, F., Database Design: What It Is and Isn't.

[4] Pascal, F., Depends on the Dependencies: Normal Forms and the Conceptual-Logical Conflation.

[5] Pascal, F., Understanding Relations, Parts 1-3.

[6] Pascal, F., What Relations Really Are and Why They Are Important.

[7] Pascal, F., Relationships and the RDM Part Parts 1-3.

[8] Pascal, F., What Meaning Means Business Rules, Predicates, Integrity Constraints and Database Consistency.

[9] Pascal, F., Redundancy, Consistency, and Integrity: Derivable Data.

[10] Pascal, F., Don't Design Databases Without Foundation Knowledge and Conceptual Models.

[10] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER.

No comments:

Post a Comment