Tuesday, November 27, 2012

Site Update

1.
A new 'To Laugh or Cry?' link was posted on the LAUGH/CRY? page.

2.
A new 'Quote of the Week' was posted on the QUOTES page.

3.
My latest blog post, Not All Structures Are Created Equal, at All Analytics.

4.
Want to get a sense of what a fad looks like? Check out
Big Data News Network. And if are familiar with my argument that instead of leading the industry with science, academia is following industry's fads by substituting vocational training for education, here's an excellent example: 

Announcing a New Master's Degree: Business Analytics 

The department of information systems and the supply chain management department have joined forces to launch an accelerated and specialized master's degree program in the fast growing field of business analytics. The Master of Science in Business Analytics program is full-time, and students will earn their MS in Business Analytics after one academic year. Classes will start in the fall of 2013, a



Sunday, November 25, 2012

Databases Representing ... What?

A database is the formal representation (logical model) of an informal business model via the relational data model. Informal business models have meaning understood only by humans: computer systems do not and cannot know what entities, properties, attributes, classes and business rules are. DBMSs "understand" only their formal (logical) database representations--domains, R-tables, integrity constraints--and even that understanding is algorithmic, not semantic. It is, therefore, impossible to either design a database, or assess accuracy of representation without detailed knowledge and understanding of the informal reality that the database is supposed to represent.

Tuesday, November 20, 2012

Don't Mix Model with Implementation

(This is a rewrite of a 11/20/12 post to bring it in line with the correct interpretation of Codd's true RDM as formalized by McGoveran.)

Here's what's wrong with the last week's database picture, namely:

"When you design your database tables there are some important things to think of:
- Normalize to remove redundant data
- Use the smallest datatype possible
- Create as few indexes as possible, but not too few
- Avoid redundant indexes
- Every table must have clustered index
...
This is important in a normal database but it is even more important in SQL Azure because you have limited space for every database, your connections may be terminated due to heavy use of resources, you pay for what you use and the data that you transfer. You can use the SQL Azure management portal do design your tables or write the T-SQL statement yourself. The syntax to create a table in SQL azure is the same as in SQL server 2005/2008, but not all options are supported in SQL Azure.
...
CREATE TABLE [dbo].[table1]
 ([id] [int] IDENTITY(1,1) NOT NULL,
  [column1] [NVARCHAR](50) NOT NULL,
  [column2] [NVARCHAR](15) NOT NULL,
  [column3] [TINYNT] NULL,
  CONSTRAINT [pk_table1] PRIMARY KEY CLUSTERED ([id] ASC )
   WITH (-- PAD_INDEX = OFF,
            STATISTICS_NORECOMPUTE = OFF,
            IGNORE_DUP_KEY = OFF,
         -- FILLFACTOR=80,
         -- ALLOW_ROW_LOCKS = ON,
         -- ALLOW_PAGE_LOCKS = ON,
         -- DATA_COMPRESSION=PAGE)"
--HÃ¥kan Winther, A SQL Azure tip a day – Create a table

Logical-Physical Confusion


Failure to keep the three levels of representation
  1. Conceptual modeling: Informal representation of a reality of interest as business rules;
  2. Database (logical) design: Formal representation of the conceptual model as integrity constraints on relations;
  3. Physical implementation: Representation and access of relation data in storage.
distinct and and avoid conceptual-logical conflation (CLC)--mixing 1 and 2--and logical-physical confusion (LPC)--mixing 2 and 3--has been a constant theme here.

LPC is reflected in the five recommendations for database design: the last four are actually implementation, not database design recommendations--model. I know this may be considered pedantic, but LPC is so rampant and costly in the industry that this is a necessity, not pedantry.

Note: Incidentally, when you design databases you should think of relations, not tables--R-tables are representation of relations on some physical medium for visualization.

I don't know how an Azure SQL database is different from a "normal database" (whatever that means) and while I understand product-specific implementation recommendations, why the four above, let alone normalization, are "more important" for Azure, escapes me.



Physical Independence


LPC is one of the many fundamental misconceptions common in the industry[1] is that the RDM is not practical because it does not "handle" implementation, as reflected in the following:
"I have a single, if not simplistic, question: You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?
DBMS designers have not been immune to it: SQL products induce and reinforce it. Consider the above table definition statement in the context of Codd's 1969 proscription:
"Let us denote the data sublanguage by R and the host language by H. R permits the declaration of relations and their [constrained] domains [and] identifies the primary key for that relation. R permits the specification for retrieval of any subset of data from the data bank ... H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage."[2] (emphasis ours).
The section that starts with WITH violates the proscriptions by contaminating the model with implementation details.

It is in response to such violations that Codd issued his famous 12 rules to distinguish RDBMSs from non-relational ones[3]. Rule 8 states:

"Interactive applications and application programs should not have to be modified whenever [physical] changes in internal storage structures and access methods are made to the database".
So physical independence (PI) is a major relational advantage, not a weakness, because it simplifies data sub-languages and gives DBMS designers complete freedom of implementation and changes thereof at will without disrupting queries and applications, relegating performance optimization to the DBMS. This enormous value of PI should be considered in the context of the overwhelming variety and complexity of the myriad of DBMS configurations, settings and options in the context of different OSs, hardware, storage and access methods, data growth, concurrent access, network loads and so on, each of which has configurable settings of its own. Without RDM's PI application developers would have hopelessly continued to optimize all of this and re-optimize when anything changed. Here's a glimpse at it:
"Wait a minute! Some important options of the clustered index can't be controlled, like FILLFACTOR, ALLOW_xxx_LOCKS and DATA_COMPRESSION. What does that mean? Well, if you execute the following code, you can see that FILLFACTOR=0, ALLOW_xxx_LOCKS are 1 SELECT name, fill_factor, allow_row_locks, allow_page_locks FROM sys.indexes. That means that every page of the index will be filled to 100%, leaving no room for new records. Well, thats fine for indexes that are always increasing, but if not, you'll get extra IO operations caused by page splits and you'll get fragmented indexes, causing performance issues. To remove the fragmentation you need to rebuild your indexes regulary to increase the performance, but that is a subject for later blog posts. In the SQL Azure platform, you shouldn't need to care about the infrastructure, like files or file groups and it isn't possible to place tables or indexes in different file groups."
Contaminating logical design--the model--with such implementation considerations is unproductive and does costly damage to both . Which demonstrates Codd's genius in targeting the RDM at PI. Unfortunately, the industry has yet to fully appreciate it. That practitioners still insist on blaming the RDM for poor performance is testament to the sorry state of foundation knowledge.


A Note on Normalization and Redundancy


Codd probably borrowed the term normal form for relations from logic, where relation specifications can be expressed as first order predicates in prenex normal form[4]. A relation is in its normal form by definition--i.e., has attributes defined on simple domains with atomic values, otherwise it is not a relation, and first order predicate logic (FOPL)--half of the formal foundation of the RDM--is not sufficient, which robs databases of all relational benefits. For  designs with attributes defined on non-simple relation-valued domains (RVD) Codd devised a method for eliminating them without loss of information that he called normalization. Note very carefully that it is only a design repair method--relational databases consisting of properly designed relations do not require it--and it involves no redundancy!

During the development of the relational algebra (RA)--whose operations derive relations from other relations--it was discovered that certain relation designs produced certain types of redundancy that cause complications. Although those designs had to do with attribute dependencies, not with the normal form, they were labeled "higher" normal forms (2NF-5NF), the normal form was re-labeled first normal form (1NF) and the methods devised to eliminate those redundancies were labeled "further normalization"[5]. The redundancy in each of those forms could be eliminated by further normalizing it--splitting an affected relation into multiple relations in a higher form--up to 5NF, a form devoid of (those types of) redundancy, at which point relations are fully normalized.

This historical sequence of events has inhibited the realization that relations should be really in 5NF--not just in 1NF--by definition, which is what the Principle of Full Normalization (POFN) really means. Adherence to it obviates  any explicit normalization altogether and guarantees not just logical, but also semantic correctness to boot.

In other words, strictly speaking, normalization to 1NF does not eliminate redundancy, only further normalization up to 5NF does. But neither is explicitly necessary if databases are properly designed such that they are in 5NF[6].

Note: We ignore, for the purposes of this discussion, that the initial normal form and the subsequent 1NF are not exactly the same; and 6NF.


References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER

[2] Codd, E. F., Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks, IBM Research Report, San Jose, California RJ599 (1969)

[3] Pascal, F., Interpreting Codd: The 12 Rules

[4] Pascal, F., The Interpretation and Representation of Database Relations

[5] Codd, E. F., Further Normalization of the Data Base Relational Model, IBM Research Report, San Jose, California RJ909 (1971)

[6] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance

Saturday, November 17, 2012

Site Update

1.
A new Quote of the week was posted on the QUOTES page.

2.
A new To Laugh or Cry? item was posted on the LAUGH/CRY? page.

3.
It has come to this. It was only a matter of time and I dk what took so long.



Sunday, November 11, 2012

DBMS vs. Application Enforced Integrity

One of the online exchanges I participated in is Use of constraints on Db to improve data quality - good idea or better in the applications? initiated by the following question:
Can I ask whether people make use of the functionality provided by the database to ensure adequate data quality. Secondly do people apply this retrospectively as seems quite reasonable to me when a data problem is identified and the source data cleaned up - to do so could prevent future errors. There appears to be a tension between this sort of implementation and at least a perception of flexibility as database changes would be required should additional allowable values be required.

Thursday, November 8, 2012

News

1.
Reminder: My presentation
NoSQL, No  Schema and Database Management: Data Fundamentals and Technology Evaluation
San Francisco SQL Server User Group, Wednesday, 11/14/12, 6:30pm
Free and open to the public. Details here.

2.
Quote of the Week was posted to the QUOTES page. Consider it in the context of the comments to my last post, Kinds of keys.

3.
A new To Laugh or Cry? item was posted to the LAUGH/CRY? page.

4.
A new online exchange I participate in was added to the FP ONLINE page.

5.
Here's an item that validates my arguments that (1) SQL or the relational model are not the real problems when it comes to data management distributivity (2) proponents of distributivity and the Cloud have no grasp of what distributivity really entails.

Why Google Went Offline Today and a Bit about How the Internet Works - CloudFlare blog


Thursday, November 1, 2012

Site News

1.
My 11/16 San Francisco presentation. Details here.

I am planning a European tour in the March-April time frame, which will include the UK and possibly Israel. Those interested in organizing seminars or presentations, or who know others that might be, please contact me.

I have added a SCHEDULE option to the site's top menu -- a page listing forthcoming presentations/seminars with links to the events' pages.

2.
I added an FP ONLINE option to the top menu -- a page listing my web columns and online discussions in which I participate.

3.
New Quote and To Laugh or Cry of the Week were posted today.

The QUOTES and LAUGH/CRY pages were also updated with all the posts to date.

From now on Site News will announce when these two pages are updated with new posts so that interested readers can check them out.

4. Matt Rogish drew my attention to an article on programming which is not inconsistent with my claim that the IT industry operates like the fashion industry:

Programming is a Pop Culture
Discussion