Monday, March 5, 2018

Physical Independence Part 1: Don't Mix Model with Implementation




Note: This is a rewrite of several older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

"You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?" --DBDebunk.com query


--------------------------------------------------------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE



I have been using the proceeds from my monthly blog @AllAnalytics to maintain DBDebunk and keep it free. Unfortunately, AllAnalytics has been discontinued. I appeal to my readers, particularly regular ones: If you deem this site worthy of continuing, please support its upkeep. A regular monthly contribution will ensure this unique material unavailable anywhere else will continue to be free. A generous reader has offered to match all contributions, so please take advantage of his generosity. Purchasing my papers and books will also help. Thank you.



NEW PUBLICATIONS










HOUSEKEEPING



  • To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the FUNDAMENTALS page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 are relabeled. As other older posts are rewritten, they will also be relabeled, but in the meantime, use Blogger search for them. 
  • Following the discontinuation of AllAnalytics, the links to my columns there no longer work. I moved the 2017 content to dbdebunk and, time permitting, may gradually move all of them. Within the columns, only the links to sources external to AllAnalytics work.

----------------------------------------------------------------------------------------------------------------------------------------------------------------

Old DBMSs based on data models preceding the RDM -- hierarchic and network -- forced users and applications into the details of how data are internally stored and accessed (i.e., implementation) when they accessed databases. Such details are an irrelevant distraction from what users do -- make inferences about the world -- and when they changed, applications no longer worked and required maintenance.


Separation of Concerns


As an adaptation and application of simple set theory (SST) expressible in First Order Predicate Logic (FOPL) to database management, the RDM is a deductive formal system that can be used to derive new facts (theorems) that are logical implications of facts recorded in the database (axioms). Among its many critical advantages is physical independence (PI) -- the insulation of queries and applications from storage and access methods and changes thereof.

Here's Codd's explanation how this is achieved (slightly edited to bring up to date) [2]:

"Let us denote the [relational] data sublanguage by R and the host language by H. R permits the declaration of relations and their domains [and attributes, and] identifies the primary key for [every] relation ... [and] the specification for retrieval of any subset of data from the data bank ... H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage."
In other words, the expressions in the FOPL-based data sublanguage -- relational queries and constraints -- do  not reference physical implementation details, which are handled in a computationally complete language (CCL) that hosts it. Aside from not burdening users with machine internals, when they change (e.g., for performance reasons), queries and applications invoking them continue to work and do not require maintenance (i.e., logical models are independent of any specific physical implementation). Thus, PI eliminates the drawback of old DBMSs -- one of the core motivations for the RDM -- enshrined in the Physical Independence Rule, #8 of Codd's famous 12 rules [3]:
"Interactive applications and application programs should not have to be modified whenever changes in internal storage structures and access methods are made to the database".

Advantage, Not Liability


One of many misconceptions [4] is to misconstrue PI as a RDM liability: ("Without implementation, what good is the RDM?"), when it actually is a major relational advantage:
  • Data sublanguages are greatly simplified.
  • As long as rule 8 is not violated, no implementation is either imposed or prohibited. This gives DBMS vendors complete freedom of implementation and allows them to change it at will without disrupting queries and applications and the associated maintenance burden.
  • End users querying the database and developers of applications invoking database queries are not distracted by irrelevant implementation details and can focus on the logic of defining databases and inferencing. Both are taken out of the performance business.
These important advantages of PI should be considered in the context of the variety and complexity of the myriad of implementation configurations and options. For a feel of the importance of PI, consider the complexity associated with just a creation of what should be a logical SQL table:
CREATE TABLE [dbo].[Table1]
 ([ID] [INT] IDENTITY(1,1) NOT NULL,
  [Column1] [NVARCHAR](50) NOT NULL,
  [Column2] [NVARCHAR](15) NOT NULL,
  [Column3] [TINYINT] NULL,
  CONSTRAINT [pk_table1] PRIMARY KEY CLUSTERED ([id] ASC)
   WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
   IGNORE_DUP_KEY = OFF, FILLFACTOR=80,  ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION=PAGE))...
"Wait a minute! Some important options of the clustered index can't be controlled, like FILLFACTOR, ALLOW_xxx_LOCKS and DATA_COMPRESSION. What does that mean? Well, if you execute the following code, you can see that FILLFACTOR=0, ALLOW_xxx_LOCKS are 1 SELECT name, fill_factor, allow_row_locks, allow_page_locks FROM sys.indexes. That means that every page of the index will be filled to 100%, leaving no room for new records. Well, thats fine for indexes that are always increasing, but if not, you'll get extra IO operations caused by page splits and you'll get fragmented indexes, causing performance issues. To remove the fragmentation you need to rebuild your indexes regulary to increase the performance, but that is a subject for later blog posts. In the SQL azure platform, you shouldn't need to care about the infrastructure, like files or file groups and it isn't possible to place tables or indexes in different file groups."

Note: Of course, OS, hardware, concurrent access, network load and so on affect performance, but we are concerned here only with DBMS/database performance factors. 



The RDM and Logical Design Cannot Affect Performance


The deductive logic and mathematics of the RDM have absolutely nothing to say about storage and access methods. Logical models created using the RDM cannot possibly affect performance, which is determined exclusively by implementation, outside the RDM. By mixing logical models with physical implementation, non-relational DBMSs make the former dependent on the latter. True RDBMSs would restore former's independence, eliminating the drawbacks. Unfortunately, one of the most entrenched misconceptions in the industry is logical physical confusion (LPC), witness "denormalization for performance".

SQL DBMSs are not true RDBMSs and SQL tables are not relations, or treated as such. But while PI support is far from perfect, query expressions in SQL generally do not require or include physical details (although some may have crept in over time). Imagine such drudgery in all query expressions and the maintenance burden when they change. PI relegates it to DBMS vendors and DBAs, where it cost-effectively belongs. Without RDM and PI, application developers would have continued to get bogged down by these details and end users would be precluded from querying databases.
 

Failing to learn from experience, the industry has failed to absorb the importance of PI. There is regression away from even SQL DBMSs to non-relational ones, including graph (i.e., good old hierarchic) DBMSs, predictably bringing back all the drawbacks.

(Continued in Part 2).


References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.

[2] Codd, E. F., Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks, IBM Research Report, San Jose, California RJ599 (1969).

[3] Codd, E. F., Does Your DBMS Run By the Rules?, ComputerWorld (21 October 1985).

[4] Pascal, F., DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.


Note: I will not publish or respond to anonymous comments. If you have something to say, stand behind it. Otherwise don't bother, it'll be ignored.

No comments:

Post a Comment

View My Stats