Friday, January 1, 2021

OBG: Database Design and Guaranteed Correctness Part 2



Note: This is a re-write of an earlier post (which now links here), to bring it into line with the current understanding of the RDM derived from McGoveran formalization and interpretation of Codd's work[1]. Reference [9] is also an important re-write and is recommended pre-requisite for this post.

Continued from Part 1

 “The term database design can be used to describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views ... However, the term database design could also be used to apply to the overall process of designing, not just the base data structures, but also the forms and queries used as part of the overall database application within the database management system(DBMS). The process of doing database design generally consists of a number of steps which will be carried out by the database designer. Usually, the designer must:

  • Determine the data to be stored in the database.
  • Determine the relationships between the different data elements.
  • Superimpose a logical structure upon the data on the basis of these relationships.
Within the relational model the final step above can generally be broken down into two further steps, that of determining the grouping of information within the system, generally determining what are the basic objects about which information is being stored, and then determining the relationships between these groups of information, or objects.”
--What is a Relational Database, Quora.com
There is, typically, much vagueness and confusion here and instead of debunking it makes more sense to provide a rigorous description of what database design really is: formalization of a conceptual model -- expressed as business rules -- as a logical model for representation in the database using a formal data model. If the data model is the RDM, the logical model consists of relations constrained for semantic consistency with the conceptual mode, the constraints being formalizations of the business rules.

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST UPDATES
-12/24/20: Added 2021 to the
POSTS page

-12/26/20: Added "Mathematics, machine learning and Wittgenstein" to LINKS page

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- The PostWest blog for monthly samples of global Antisemitism – the only universally acceptable hatred left – as the (traditional) response to the existential crisis of decadence and decline of Western  civilization (including the US).
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

Enterprise-specific Models and Data Model

The common references to conceptual, logical and physical models as "data models" (e.g., "industry data models", or "what is the difference between conceptual data model and logical data model?", or "logical vs physical data model") are misleading and inhibit understanding. While the former three are enterprise-specific, a data model is not[2].

A data model, as defined by Codd [3], is an abstract theory of data used to formalize conceptual models as logical models for database representation[4]. It has three components, data structure/integrity and manipulation, which for the RDM are constrained relations and relational algebra (RA).
 

Logical Relations and Stored Data

 
There are base relations from which others are derived using RA operations. There are two types of persistent derived relations: views (or virtual relations) and snapshots
 
Relations are not "used to store data". While data of some (not necessarily base) relations are stored, all relations are logical (see below). R-tables only visualize relations on physical media, but play no part in the RDM.

Note: SQL has muddied the waters by calling its tables -- which are not relations - with stored data "base tables". If and how data is actually stored is part of the implementation, not the model -- physical independence (PI) being one of the core advantages of the RDM[5]. We refer to the common failure to distinguish between relations and the representation of their data in storage logical-physical confusion (LPC)[6].

Business Rules, Predicates and Constraints

Database design is formalization of a conceptual model consisting of business rules that specify properties of related entity groups that form a multigroup, namely:

  • First order properties (1OP) of individual entities;
  • Second order properties (2OP) of individual entities that are relationships among 1OPs;
  • Third order properties (3OP) of groups that are relationships among all of an entity group members;
  • Fourth order properties (4OP) of the multigroup that are relationships among the groups[7,8].

Some properties are relationships.

  • Properties formalize as domains;
  • Entity groups formalize as relations: properties in a group's context formalize as attributes defined on the corresponding domains;

- Property rules as domain constraints;
- Property in context rules as attribute constraints;
- Entity rules as tuple constraints;
- Group rules as multi-tuple constraints;
- Multigroup rules as database (multirelation) constraints.


Business rules specify in specialized natural language the properties/relationships of entities, groups and multi-groups that formalize as first order predicates.  Expressed as constraints in a FOPL-based relational data sublanguage, they
can be implemented by algorithms to ensure  semantic consistency with the rules (i.e., constraints capture the meaning of the model assigned by the designer to the data)[9].

For each relation, the domain, attribute, tuple and multituple constraints jointly correspond to the relation predicate (RP). All RPs jointly with the database constraints correspond to the database predicate (DBP) [10].

The database designer must keep the conceptual (model) and logical (database) levels distinct in mind and avoid conceptual-logical conflation (CLC).

Note: Properties (e.g., $amount) formalize as abstract domains (MONEY). Properties in context (e.g., salary in context of employees, budget in context of departments) formalize as attributes (SALARY, BUDGET), derived from domains by imposing attribute constraints. Attributes are domain representations (i.e., simple functions -- 1:1 mappings) in relations. Try to express all these constraints in SQL and see how relational it is[11].
 

Data Sublanguage, DBMS Language and Host CCL

As we explained elsewhere, a DBMS language consists of two components (1) a relationally complete data sublanguage that expresses constraints/RA operations) and (2) language for other data management functions (data definition, concurrency control, security, transactions, storage management, and so on). The DBMS language is hosted by a  computationally complete language (CCL). Only the data sublanguage is restricted to FOPL, the other DBMS language component is based on higher logic, which is why they are implemented and invoked in the CCL, both of which must not subvert the data sublanguage and lose relational advantages[12].

The following functions are the responsibility of applications:

  • User communication with the DBMS;
  • Computation;
  • Presentation of results to users

and are implemented in the host CCL.

Guaranteed Correctness

Under the SST/FOPL theoretical foundation of the RDM, correctness of query results is guaranteed as follows:

  • Logical validity is guaranteed by the RDBMS;
  • Semantic consistency is guaranteed by database design adherence to three principles:

- Principle of Expressive Completeness (POEC);
- Principle of Representational Parsimony (PORP);
- Principle of Orthogonal Design (POOD).

This produces a set of independent base relations (POOD), which is minimally necessary (PORP) to derive via RA operations all relations meaningful to the user applications (POEC) which the database is intended to satisfy[13].

Base relations are independent in the sense that none has a RP derivable from the others'. Base does not mean stored:

"Codd's original meaning of base relations is "members of this base set" from which all other meaningful relations are RA-derived. Although their data is usually stored, that was not what he meant. Entity groups are modeling primitives (i.e., non-decomposable) and, therefore, base relations cannot (must not) be analyzed into parts. A projection or restriction is not a "decomposition" -- as they are commonly viewed -- but a relation derived from, that depends on, the base relation. In a sense, base relations have a kind of atomicity which avoids circular definitions. If, in the conceptual model, object group A is defined in terms of group B and B in terms of C, and C in terms of A, nothing is defined." --David McGoveran
There is an unproven conjecture that the three principles jointly imply the Principle of Full Normalization (POFN) -- but not vice-versa. For mathematical reasons beyond the scope of this discussion the three principles jointly guarantee that all derived views that are theoretically updatable can be correctly updated, or in other words, full support of logical independence (LI) [1]. That is why we contend that correct database design (i.e. adherent to the principles) produces 5NF databases -- there is no need to "do normalization".

Note: A previous attempt to specify conditions that guarantee view updatability devoid of anomalies [14] was based solely on the POOD, which is insufficient -- POEC and PORP are also necessary. 

Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

 

References

McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK (draft chapters), forthcoming.
 
 
 
 

No comments:

Post a Comment

View My Stats