Sunday, October 29, 2017

Database Design: What It Is and Isn't



Revised 10/31/17.

Note: Posts starting with this one will be consistent with the TERMINOLOGY page. Fundamental terms -- the grasp of which is necessary for data management practice -- will be boldened. When you encounter one you don't understand, better find out what it means, chances are it's being misused or abused. Once the page is finalized, labels and, time permitting, old posts may also be revised accordingly. 

Reference [9] is an important rewrite and is recommended pre-requisite
for this post that you should read first.

Here's what's wrong with the picture of three weeks ago, namely:
"The term database design can be used to describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views. In an object database the entities and relationships map directly to object classes and named relationships. However, the term database design could also be used to apply to the overall process of designing, not just the base data structures, but also the forms and queries used as part of the overall database application within the database management system(DBMS).

The process of doing database design generally consists of a number of steps which will be carried out by the database designer. Usually, the designer must:

  • Determine the data to be stored in the database.
  • Determine the relationships between the different data elements.
  • Superimpose a logical structure upon the data on the basis of these relationships.
Within the relational model the final step above can generally be broken down into two further steps, that of determining the grouping of information within the system, generally determining what are the basic objects about which information is being stored, and then determining the relationships between these groups of information, or objects." --Halil Lacevic, What is a Relational Database?, Quora.com
Many problems in database practice are due to failure to grasp what a data model is and the important distinctions between DBMS functions on the one hand and application functions on the other.

The three design steps above are vague, somewhat confused and obscure more than enlighten. They do not reflect the fact that database design is formalization of a conceptual model of reality as relations constrained to be consistent with the business rules the model consists of. 

--------------------------------------------------------------------------------
I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Thanks.
--------------------------------------------------------------------------------- 

What Is a Data Model


The common references to enterprise-specific conceptual, logical and physical models as data models (e.g., "industry data models", or "what is the difference between conceptual data model and logical data model?", or "logical vs physical data model") are incorrect [1].


A data model, as defined by Codd [2,3], is an abstract theory of data used to formalize symbolically informal conceptual models as logical models for database representation. It has three logical components:
  • Data structure;
  • Data integrity;
  • Data manipulation;
which for the relational data model (RDM) are:
  • Structure = relation defined on domains (displayed as an R-table to users);
  • Integrity = several types of constraints;
  • Manipulation = relational algebra (RA);
Thus, not "tables and views", but relations, some of which are virtual (i.e., views of base relations -- possibly via other relations and views -- derived using RA operations). Note also the common logical-physical confusion (LPC): relations are logical structure, not "used to store data" [4].

DBMS and Application Functions


DBMS functions are performed by the DBMS via its data language that has several components. They fall into two categories, the distinction between which is important, but poorly understood.

  • Data management DBMS functions support the data model:
  • Data definition: creation of domains, relations and constraints;
  • Data manipulation: data retrieval via RA operations;
  • Other DBMS functions: concurrency control, transaction management, storage management, performance optimization;
The practical advantages of the RDM derive from its dual formal theoretical foundation, simple set theory and first order predicate logic (FOPL). These advantages are obtained by hosting the data language, which has several components, in a  computationally complete language (CCL), as follows [5]:
  • The manipulative data management DBMS function is expressed in a FOPL-based relational data sub-language as the Data Manipulation Language (DML) component of the data language; 
  • The structural and integrity data management DBMS functions are implemented in the CCL and invoked in the Data Definition Language (DDL) component of the data language;
  • The other DBMS functions are implemented in the CCL and invoked in the corresponding data language components;
The data management functions (DDL and DML) comprise DBMS support of the data model. Only the DML is restricted to FOPL to ensure the practical benefits of the RDM, the other components of the data language require higher logic, which is why they are implemented in the host CCL. 

Application functions:
  • Communication with the DBMS;
  • Computation;
  • Presentation of results to users;
are implemented in the host CCL [6].

Database Design


Database design involves data definition (the structural and integrity data management DBMS functions). Queries are the manipulative data management DBMS function, so they are not part of database design, except when they are view definitions. Forms are for presentation and part of application development, not database design.

Given a conceptual model consisting of business rules that specify 
- individual first order object properties (1OP);
- collective second and third order properties (2OP, 3OP);
required for object group memberships and
- properties arising from multigroup relationships;
database design proceeds as follows [7]:
  • Formalize each object property as a user-defined domain;
  • Formalize each object group as a relation: fact about) objects as tuples, properties in context about attributes, which are representations of domains (possibly further constrained); 
  • Formalize business rules as constraints expressed as FOPL predicates:
- Property rules as domain and attribute constraints;

- Object rules as tuple constraints;

- Multiobject rules as multituple constraints;

- Multigroup rules as multirelation database constraints;
While in practice some of these steps are performed sort of concurrently, the database designer should keep them distinct in mind and avoid conceptual-logical conflation (CLC). 

Note: User-defined domains (e.g., SALARY) are derived by imposing domain constraints on abstract primitive domains (e.g., MONEY). Attributes are domain representations (i.e., simple functions -- 1:1 mappings) that may impose attribute constraints on user-defined domains (e.g., SALARY and COMMISSION attributes may further constrain a COMPENSATION domain derived from the MONEY domain by a domain constraint.)

Constraints are computable formalizations of the business rules expressed as FOPL predicates that can be implemented by algorithms and enforced by the DBMS for database consistency with the rules (i.e., they capture meaning (semantics) assigned by the database designer to the database) [9]. DBMS constraint enforcement protects data integrity, for which reason they are referred to as integrity constraints.

Domain, attribute, tuple and multituple constraints jointly comprise a relation predicate (RP). All RPs jointly with database constraints comprise a database predicate (DBP) [8].

Now, try to express all these constraints in SQL and see how relational it is (not) integrity-wise.


Principles of Database Design


Joint adherence to three design principles [10]:
  • The Principle of Expressive Completeness (POEC);
  • The Principle of Representational Parsimony (PORP);
  • Principle of Orthogonal Design (POOD);
produces a base set of independent relations (POOD), which is minimally necessary (PORP) to derive via RA all other relations meaningful to applications (POEC) -- hence base and derived relations. Base relations are independent in the sense that none has a RP derivable from the others.

"Codd's original meaning of base relations is "members of this base set" from which all other meaningful relations are RA-derived. Although their data is usually stored, that was not what he meant.Object groups are modeling primitives (i.e., non-decomposable) and, therefore, base relations cannot (must not) be analyzed into parts. A projection or restriction is not a "decomposition" -- as they are commonly viewed -- but a relation derived from, that depends on the base relation. In a sense, base relations have a kind of atomicity which avoids circular definitions. If, in the conceptual model, object group A is defined in terms of group B and B in terms of C, and C in terms of A, nothing is defined." --David McGoveran

There is an unproven conjecture that the three principles jointly imply the Principle of Full Normalization (POFN) -- but not vice-versa. For mathematical reasons beyond the scope of this discussion [11], the three principles jointly guarantee that all derived views that are theoretically updatable can be correctly updated, or in other words, support of full logical independence (LI). A previous attempt to specify conditions that guarantee view updatability devoid of anomalies [12] was based solely on the POOD, which is insufficient -- POEC and PORP are also necessary. 


References




, E. F., Data Models in Database Management. Workshop on Data Abstraction, Databases and Conceptual Modelling : 112-114.

[3] Pascal, F.,  What Is a Data Model.

[4] Pascal, F., Don't Mix Model with Implementation.

[5] Pascal, F., Data Sublanguages, Programming and Data Integrity.

[6] Pascal, F., Database Design: What It Is and Is Not.

[7] Pascal, F.,  To Really Understand Integrity, Don't Start with SQL.

[8] Pascal, F.,  Relation Predicates and Identical Relations.

[9] Pascal, F., What Meaning Means: Business Rules, Predicates, Constraints, and Database Consistency.

[10] Pascal, F., SQL Cursors, ORDER BY and Relational Fidelity.

[11] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[12] McGoveran, D. and Date, C. J. On View Updating



 

No comments:

Post a Comment

View My Stats