Saturday, April 29, 2017

This Week

1. Database Truth of the Week

“… logic—is an analytical theory of the art of reasoning whose goal is to systematize and codify principles of valid reasoning. It has emerged from a study of the use of language in argument and persuasion and it is based on the identification and examination of those parts of language which are essential for these purposes. It is formal in the sense that it lacks reference to meaning. Thereby, it achieves versatility: it may be used to judge the correctness of a chain of reasoning (in particular, a “mathematical proof”) solely on the basis of the form (and not the content) of the sequence of statements, which make up the chain.” --R. R. Stoll

Monday, April 17, 2017

Don't Mix Model with Implementation

Revised: 11/15/17

Note: This is a rewrite of a 11/20/12 post to bring it in line with the McGoveran interpretation of his formalization [1] of Codd's true RDM.

 

Here's what's wrong with the last week's database picture, namely:
"When you design your database tables there are some important things to think of:
- Normalize to remove redundant data
- Use the smallest datatype possible
- Create as few indexes as possible, but not too few
- Avoid redundant indexes
- Every table must have clustered index
...
This is important in a normal database but it is even more important in SQL Azure because you have limited space for every database, your connections may be terminated due to heavy use of resources, you pay for what you use and the data that you transfer. You can use the SQL Azure management portal do design your tables or write the T-SQL statement yourself. The syntax to create a table in SQL azure is the same as in SQL server 2005/2008, but not all options are supported in SQL Azure.
...
CREATE TABLE [dbo].[table1]
 ([id] [int] IDENTITY(1,1) NOT NULL,
  [column1] [NVARCHAR](50) NOT NULL,
  [column2] [NVARCHAR](15) NOT NULL,
  [column3] [TINYNT] NULL,
  CONSTRAINT [pk_table1] PRIMARY KEY CLUSTERED ([id] ASC )
   WITH (-- PAD_INDEX = OFF,
            STATISTICS_NORECOMPUTE = OFF,
            IGNORE_DUP_KEY = OFF,
         -- FILLFACTOR=80,
         -- ALLOW_ROW_LOCKS = ON,
         -- ALLOW_PAGE_LOCKS = ON,
         -- DATA_COMPRESSION=PAGE)"
--HÃ¥kan Winther, A SQL Azure tip a day – Create a table

Logical-Physical Confusion


Failure to keep the three levels of representation
  1. Conceptual modeling: Informal representation of a reality of interest as business rules;
  2. Database (logical) design: Formal representation of the conceptual model as integrity constraints on relations;
  3. Physical implementation: Representation and access of relation data in storage.
distinct and and avoid conceptual-logical conflation (CLC)--mixing 1 and 2--and logical-physical confusion (LPC)--mixing 2 and 3--has been a constant theme here.

LPC is reflected in the five recommendations for database design: the last four are actually implementation, not database design recommendations--model. I know this may be considered pedantic, but LPC is so rampant and costly in the industry that this is a necessity, not pedantry.

Note: Incidentally, when you design databases you should think of relations, not tables--R-tables are representation of relations on some physical medium for visualization.

I don't know how an Azure SQL database is different from a "normal database" (whatever that means) and while I understand product-specific implementation recommendations, why the four above, let alone normalization, are "more important" for Azure, escapes me.



Physical Independence


LPC is one of the many fundamental misconceptions common in the industry [2] is that the RDM is not practical because it does not "handle" implementation, as reflected in the following:
"I have a single, if not simplistic, question: You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?
DBMS designers have not been immune to it: SQL products induce and reinforce it. Consider the above table definition statement in the context of Codd's 1969 proscription:
"Let us denote the data sublanguage by R and the host language by H. R permits the declaration of relations and their [constrained] domains [and] identifies the primary key for that relation. R permits the specification for retrieval of any subset of data from the data bank ... H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage." [2] (emphasis ours).
The section that starts with WITH violates the proscriptions by contaminating the model with implementation details.

It is in response to such violations that Codd issued his famous 12 rules to distinguish RDBMSs from non-relational ones [3]. Rule 8 states:

"Interactive applications and application programs should not have to be modified whenever [physical] changes in internal storage structures and access methods are made to the database".
So physical independence (PI) is a major relational advantage, not a weakness, because it simplifies data sub-languages and gives DBMS designers complete freedom of implementation and changes thereof at will without disrupting queries and applications, relegating performance optimization to the DBMS. This enormous value of PI should be considered in the context of the overwhelming variety and complexity of the myriad of DBMS configurations, settings and options, different OSs, hardware, storage and access methods, data growth, concurrent access, network loads and so on, each of which has configurable settings of its own. Without RDM's PI application developers would have hopelessly continued to optimize all of these and re-optimize when anything changed. Here's a glimpse at it:
"Wait a minute! Some important options of the clustered index can't be controlled, like FILLFACTOR, ALLOW_xxx_LOCKS and DATA_COMPRESSION. What does that mean? Well, if you execute the following code, you can see that FILLFACTOR=0, ALLOW_xxx_LOCKS are 1 SELECT name, fill_factor, allow_row_locks, allow_page_locks FROM sys.indexes. That means that every page of the index will be filled to 100%, leaving no room for new records. Well, thats fine for indexes that are always increasing, but if not, you'll get extra IO operations caused by page splits and you'll get fragmented indexes, causing performance issues. To remove the fragmentation you need to rebuild your indexes regulary to increase the performance, but that is a subject for later blog posts. In the SQL Azure platform, you shouldn't need to care about the infrastructure, like files or file groups and it isn't possible to place tables or indexes in different file groups."
Contaminating logical design--the model--with such implementation considerations is unproductive and does costly damage to both. Which demonstrates Codd's genius in targeting the RDM at PI. Unfortunately, the industry has yet to fully appreciate it. That practitioners still insist on blaming the RDM for poor performance is testament to the sorry state of foundation knowledge in the industry.



References

[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLK, forthcoming.


[2] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER.

[3] Codd, E. F., Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks, IBM Research Report, San Jose, California RJ599 (1969).




Saturday, April 15, 2017

This Week

Database Truth of the Week

"... systems of operations on data are most effective when they are formalisms, in which semantic considerations are unimportant until the formalism is applied to some specific application. In this way, database processing can join the ranks of successful mathematical abstractions. Differential equations, for instance, can be applied to situations ranging from orbit calculations to the quantum mechanics of the atom. The semantics of each application is unique to that application, but the formalism of differential equations is common. The power of the formalism lies in its abstraction from issues of meaning." --H. T. Merrett, Extending the Relational Algebra to Capture Less Meaning

Thursday, April 6, 2017

Understanding Kinds of Keys

My March  Post @All Analytics.

According to search queries hitting @dbdebunk.com, too many data professionals question the mandatory nature of primary keys, ask about changes to them, or prefer surrogate to natural keys. This indicates misunderstanding and misuse of a critical feature that can wreak havoc with inferences made from databases, including analytics. I have explained one function of keys, but there are several types of key that are poorly understood.

Read it all (and please comment there, not here




Saturday, April 1, 2017

"NULL Value" is a Contradiction in Terms

There is nothing wrong with Hugo Kornelis' picture of SQL NULL in NULL: The database's black hole. In fact, I recommend the series of which it is one part. It's SQL's picture of how to treat missing data that's wrong.
"Let’s first take a look at what NULL is supposed to be. Here is the definition of NULL from the SQL-2003 standard: null value--A special value that is used to indicate the absence of any data value."
While the absence of a value may be represented by some value at the implementation level, I strongly recommend users not think of NULL as any kind of value at the model level. The problems with NULL stem precisely from the fact that it is not a value, but rather a marker for the absence of a value. NULL value is a contradiction in terms that distracts from the problems.