Saturday, December 29, 2012

Site Update

1.
The new Quote of the Week posted on the QUOTES page is a complement to the previous quote and is by the same author.

2.
Both the quote and the new To Laugh or Cry? item posted to the LAUGH/CRY? page have to do with the author referenced in my previous post, The Clouding Syndrome. Does he really think that titling his article the way he did elevates him to Codd's level?

My friend David McGoveran has some comments forthcoming, but in the meantime perhaps this is closer to reality.

3.
Added to FP ONLINE page:
  • My latest All Analytics column.
  • A LinkedIn exchange in which I participated (as of this writing my comments are not posted yet). One of the participants recommended the very same article by Meijer mentioned in 1.
4.
For quite a while I just could not display a LinkedIn Share button after each post, next to the other buttons provided by Blogger, no matter what I did, and neither Google nor LinkedIn was of any help. I was finally able to display it at the bottom of posts, though not exactly where I wanted it. I am working on relocating it, but in the meantime it is usable where it is.



Tuesday, December 25, 2012

The Clouding Syndrome (REVISED)

Over the years I have gotten enormous and nasty flak for my arguments about the sorry state of foundation knowledge in the industry, its contamination of academia, and the increasing deterioration of both. On one occasion, my claim that the designers of SQL did not really understand the relational model was dismissed as utter nonsense. But if that made me a crank, so were Ted Codd and Chris Date.

My friend Jim Lowden emailed me:
Last month [there] was a hopeless article All Your Database Are Belong to Us about DBMSs [in CACM] so badly mistaken, that it was too dreary even to reply. [The author] managed to get the Closed World Assumption pretty much backwards, if that's possible.

Wednesday, December 19, 2012

Site Update

1.
A new Quote of the Week was posted on the QUOTES page.

2.
A new To Laugh or Cry? item was posted on the LAUGH/CRY? page.

3.
A link to an exchange in which I participated was posted on the FP ONLINE page.

4.
I have added APPLIED MATHEMATICS FOR DATABASE PROFESSIONALS by Toon Koppelaars and Lex de Haan to the recommended books widget on the home page.


Sunday, December 16, 2012

"Schema-less Models" and the New World Disorder

There is a tendency in the database field to distort terminology, to define it poorly, or to use it inconsistently. As I have argued in previous posts, technologies that are founded or used without foundation knowledge and understanding will prove costly fads.

Wednesday, December 12, 2012

Site Update

1.
A new Quote of the Week was posted on the QUOTES page. Consider it in the context of my posts on database vs. application enforced integrity.

2.
A new To Laugh or Cry? item was posted on the LAUGH/CRY? page.


Sunday, December 9, 2012

Brother, Spare Me the Paradigms

In an interview Louis Davidson is mostly right and his heart is in the right place. A few quibbles and clarifications.
Consider dimensional design and Big Data as two additional paradigms. Data warehousing has not only allowed us to take the strain of reporting off our OLTP servers (leading to better reporting capabilities), but it has also given us the ability to support larger relational databases capturing more and more business data.

Wednesday, December 5, 2012

Site Update (UPDATED)

1.
A new Quote of the Week was posted on the QUOTES page.
I contend that in a certain sense the perspective is upside down and backwards. Can you figure out why?

2.
A new To Laugh or Cry? item was posted on the LAUGH/CRY? page.

3.
My latest Data Fundamentals column and a thread I participated in were posted on the ONLINE page.

4.
From Erwin Smout (via email):
In 2009, the ACM re-published Codd's paper Derivability, redundancy and consistency of relations stored in large data banks.

For searching purposes, the ACM "classifies" each published article using some "taxonomy" of their own making. Apparently, Codd's paper is classified under "Database Administration" and not under "Relational Database Model".
Says a lot, doesn't it?

UPDATE: I was wondering what ACM did classify under "relational model" (you can probably guess what I suspected). So I asked Erwin to check and here's what he found:

Saturday, December 1, 2012

Data Warehouses and the Logical-Physical Confusion

(Erwin Smout is co-author of this post.)

In Implementation Data Modeling Styles Martijn Evers writes:
Business Intelligence specialists are often on the lookout for better way to solve their data modeling issues. This is especially true for Data warehouse initiatives where performance, flexibility and temporalization are primary concerns. They often wonder which approach to use, should it be Anchor Modeling, Data Vault, Dimensional or still Normalized (or NoSQL solutions, which we will not cover here)? These are modeling techniques focus around implementation considerations for Information system development. They are usually packed with an approach to design certain classes of information systems (like Data warehouses) or are being used in very specific OLTP system design. The techniques focus around physical design issues like performance and data model management sometimes together with logical/conceptual design issues like standardization, temporalization and inheritance/subtyping.

Implementation Data Modeling techniques (also called physical data modeling techniques) come in a variety of forms. Their connection is a desire to pose modeling directives on the implemented data model to overcome several limitations of current SQL DBMSes. While they also might address logical/conceptual considerations, they should not be treated like a conceptual or logical data model. Their concern is implementation. Albeit often abstracted from specific SQL DBMS platforms they nonetheless need to concern themselves with implementation considerations on the main SQL platforms like Oracle and Microsoft SQL Server. These techniques can be thought of as a set of transformations from a more conceptual model (usually envisaged as an ER diagram on a certain 'logical/conceptual' level but see this post for more info on "logical" data models).

Tuesday, November 27, 2012

Site Update

1.
A new 'To Laugh or Cry?' link was posted on the LAUGH/CRY? page.

2.
A new 'Quote of the Week' was posted on the QUOTES page.

3.
My latest blog post, Not All Structures Are Created Equal, at All Analytics.

4.
Want to get a sense of what a fad looks like? Check out
Big Data News Network. And if you are familiar with my argument that instead of leading the industry with science, academia is following industry's fads by substituting vocational training for education, here's an excellent example: 

Announcing a New Master's Degree: Business Analytics 

The department of information systems and the supply chain management department have joined forces to launch an accelerated and specialized master's degree program in the fast growing field of business analytics. The Master of Science in Business Analytics program is full-time, and students will earn their MS in Business Analytics after one academic year. Classes will start in the fall of 2013 ...



Sunday, November 25, 2012

Databases Representing ... What?

A database is the formal representation (logical model) of an informal business model via the relational data model. Informal business models have meaning understood only by humans: computer systems do not and cannot know what entities, properties, attributes, classes and business rules are. DBMSs "understand" only their formal (logical) database representations--domains, R-tables, integrity constraints--and even that understanding is algorithmic, not semantic. It is, therefore, impossible to either design a database, or assess the accuracy of representation, without detailed knowledge and understanding of the informal reality that the database is supposed to represent.
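To make the distinction concrete, here is a minimal sketch (table and rule entirely hypothetical) of how an informal business rule becomes a formal constraint that the DBMS enforces algorithmically, without any semantic grasp of what the rule means:

```python
import sqlite3

# Hypothetical illustration: the informal rule "every employee has a unique
# ID and a salary between 30,000 and 200,000" has meaning only to humans.
# The DBMS "understands" only its formal representation as constraints.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,   -- uniqueness rule, formalized
        name    TEXT NOT NULL,
        salary  INTEGER NOT NULL
                CHECK (salary BETWEEN 30000 AND 200000)  -- range rule, formalized
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 'Smith', 50000)")
try:
    # Enforcement is algorithmic, not semantic: the DBMS rejects this row
    # with no notion of what a "salary" is.
    conn.execute("INSERT INTO employee VALUES (2, 'Jones', 10000)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Whether the constraint accurately represents the business rule is something only a human who knows the informal reality can assess.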

Tuesday, November 20, 2012

Don't Mix Model with Implementation

(This is a rewrite of an 11/20/12 post to bring it in line with the correct interpretation of Codd's true RDM as formalized by McGoveran.)

Here's what's wrong with last week's database picture, namely:

"When you design your database tables there are some important things to think of:
- Normalize to remove redundant data
- Use the smallest datatype possible
- Create as few indexes as possible, but not too few
- Avoid redundant indexes
- Every table must have clustered index
...
This is important in a normal database but it is even more important in SQL Azure because you have limited space for every database, your connections may be terminated due to heavy use of resources, you pay for what you use and the data that you transfer. You can use the SQL Azure management portal do design your tables or write the T-SQL statement yourself. The syntax to create a table in SQL azure is the same as in SQL server 2005/2008, but not all options are supported in SQL Azure.
...
CREATE TABLE [dbo].[table1]
 ([id] [int] IDENTITY(1,1) NOT NULL,
  [column1] [NVARCHAR](50) NOT NULL,
  [column2] [NVARCHAR](15) NOT NULL,
  [column3] [TINYINT] NULL,
  CONSTRAINT [pk_table1] PRIMARY KEY CLUSTERED ([id] ASC )
   WITH (-- PAD_INDEX = OFF,
            STATISTICS_NORECOMPUTE = OFF,
            IGNORE_DUP_KEY = OFF,
         -- FILLFACTOR=80,
         -- ALLOW_ROW_LOCKS = ON,
         -- ALLOW_PAGE_LOCKS = ON,
         -- DATA_COMPRESSION=PAGE)"
--Håkan Winther, A SQL Azure tip a day – Create a table

Logical-Physical Confusion


Failure to keep the three levels of representation
  1. Conceptual modeling: Informal representation of a reality of interest as business rules;
  2. Database (logical) design: Formal representation of the conceptual model as integrity constraints on relations;
  3. Physical implementation: Representation and access of relation data in storage.
distinct, and to avoid conceptual-logical conflation (CLC)--mixing 1 and 2--and logical-physical confusion (LPC)--mixing 2 and 3--has been a constant theme here.

LPC is reflected in the five recommendations for database design: the last four are actually implementation recommendations, not database design (i.e., model) recommendations. I know this may be considered pedantic, but LPC is so rampant and costly in the industry that the distinction is a necessity, not pedantry.

Note: Incidentally, when you design databases you should think of relations, not tables--R-tables are representations of relations on some physical medium for visualization.

I don't know how an Azure SQL database is different from a "normal database" (whatever that means) and while I understand product-specific implementation recommendations, why the four above, let alone normalization, are "more important" for Azure, escapes me.



Physical Independence


LPC underlies one of the many fundamental misconceptions common in the industry[1]: that the RDM is not practical because it does not "handle" implementation, as reflected in the following:
"I have a single, if not simplistic, question: You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?"
DBMS designers have not been immune to it: SQL products induce and reinforce it. Consider the above table definition statement in the context of Codd's 1969 proscription:
"Let us denote the data sublanguage by R and the host language by H. R permits the declaration of relations and their [constrained] domains [and] identifies the primary key for that relation. R permits the specification for retrieval of any subset of data from the data bank ... H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage."[2] (emphasis ours).
The section that starts with WITH violates that proscription by contaminating the model with implementation details.

It is in response to such violations that Codd issued his famous 12 rules to distinguish RDBMSs from non-relational ones[3]. Rule 8 states:

"Interactive applications and application programs should not have to be modified whenever [physical] changes in internal storage structures and access methods are made to the database".
So physical independence (PI) is a major relational advantage, not a weakness: it simplifies data sub-languages, gives DBMS designers complete freedom to implement and change implementations at will without disrupting queries and applications, and relegates performance optimization to the DBMS. The enormous value of PI should be considered against the overwhelming variety and complexity of DBMS configurations, settings and options across different OSs, hardware, storage and access methods, data growth, concurrent access, network loads and so on, each with configurable settings of its own. Without the RDM's PI, application developers would have continued to hopelessly optimize all of this, and to re-optimize whenever anything changed. Here's a glimpse at it:
"Wait a minute! Some important options of the clustered index can't be controlled, like FILLFACTOR, ALLOW_xxx_LOCKS and DATA_COMPRESSION. What does that mean? Well, if you execute the following code, you can see that FILLFACTOR=0, ALLOW_xxx_LOCKS are 1 SELECT name, fill_factor, allow_row_locks, allow_page_locks FROM sys.indexes. That means that every page of the index will be filled to 100%, leaving no room for new records. Well, thats fine for indexes that are always increasing, but if not, you'll get extra IO operations caused by page splits and you'll get fragmented indexes, causing performance issues. To remove the fragmentation you need to rebuild your indexes regulary to increase the performance, but that is a subject for later blog posts. In the SQL Azure platform, you shouldn't need to care about the infrastructure, like files or file groups and it isn't possible to place tables or indexes in different file groups."
Contaminating logical design--the model--with such implementation considerations is unproductive and does costly damage to both, which demonstrates Codd's genius in targeting the RDM at PI. Unfortunately, the industry has yet to fully appreciate it. That practitioners still insist on blaming the RDM for poor performance is testament to the sorry state of foundation knowledge.
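Physical independence can be demonstrated in a few lines (table and data hypothetical): the same logical query runs unchanged whether or not an index exists; only the access path the DBMS chooses differs.

```python
import sqlite3

# A sketch of physical independence: adding an index is a purely physical
# change, invisible to queries and applications.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 'acme', 100), (2, 'globex', 250), (3, 'acme', 75)])

query = "SELECT customer, SUM(amount) FROM orders WHERE customer = 'acme' GROUP BY customer"
before = conn.execute(query).fetchall()

# The physical change: add an index. No query or application is modified.
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer)")
after = conn.execute(query).fetchall()

print(before == after)  # → True: the logical result is identical
```

This is precisely what Codd's Rule 8 demands: storage structures and access methods may change, but applications need not.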


A Note on Normalization and Redundancy


Codd probably borrowed the term normal form for relations from logic, where relation specifications can be expressed as first order predicates in prenex normal form[4]. A relation is in its normal form by definition--i.e., it has attributes defined on simple domains with atomic values; otherwise it is not a relation, first order predicate logic (FOPL)--half of the formal foundation of the RDM--is not sufficient, and databases are robbed of all relational benefits. For designs with attributes defined on non-simple relation-valued domains (RVD), Codd devised a method for eliminating them without loss of information, which he called normalization. Note very carefully that it is only a design repair method--relational databases consisting of properly designed relations do not require it--and it involves no redundancy!

During the development of the relational algebra (RA)--whose operations derive relations from other relations--it was discovered that certain relation designs produced certain types of redundancy that cause complications. Although those designs had to do with attribute dependencies, not with the normal form, they were labeled "higher" normal forms (2NF-5NF), the normal form was re-labeled first normal form (1NF) and the methods devised to eliminate those redundancies were labeled "further normalization"[5]. The redundancy in each of those forms could be eliminated by further normalizing it--splitting an affected relation into multiple relations in a higher form--up to 5NF, a form devoid of (those types of) redundancy, at which point relations are fully normalized.

This historical sequence of events has inhibited the realization that relations should really be in 5NF--not just in 1NF--by definition, which is what the Principle of Full Normalization (POFN) really means. Adherence to it obviates any explicit normalization altogether and guarantees not just logical, but also semantic correctness.

In other words, strictly speaking, normalization to 1NF does not eliminate redundancy, only further normalization up to 5NF does. But neither is explicitly necessary if databases are properly designed such that they are in 5NF[6].

Note: We ignore, for the purposes of this discussion, that the initial normal form and the subsequent 1NF are not exactly the same; and 6NF.
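The redundancy that further normalization eliminates can be sketched in a few lines (design and data hypothetical): in the un-normalized design a fact is stored repeatedly, while the fully normalized design stores it once, and a join recovers the original relation without loss of information.

```python
import sqlite3

# A hedged sketch of further normalization: city depends on customer
# (customer -> city), not on the whole key, so the flat design repeats it.
conn = sqlite3.connect(":memory:")

# Not fully normalized: 'Boston' is stored once per order.
conn.execute("CREATE TABLE order_flat (order_id INTEGER PRIMARY KEY, customer TEXT, city TEXT)")
conn.executemany("INSERT INTO order_flat VALUES (?, ?, ?)",
                 [(1, 'acme', 'Boston'), (2, 'acme', 'Boston')])

# Further normalized: each fact is stored exactly once.
conn.execute("CREATE TABLE customer (customer TEXT PRIMARY KEY, city TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT REFERENCES customer)")
conn.execute("INSERT INTO customer VALUES ('acme', 'Boston')")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 'acme'), (2, 'acme')])

# The split is lossless: a join reconstructs the original relation.
joined = conn.execute("""
    SELECT o.order_id, o.customer, c.city
    FROM orders o JOIN customer c ON o.customer = c.customer
    ORDER BY o.order_id
""").fetchall()
flat = conn.execute("SELECT * FROM order_flat ORDER BY order_id").fetchall()
print(joined == flat)  # → True
```

Had the design been in 5NF from the start, of course, no such repair would have been needed.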


References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER

[2] Codd, E. F., Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks, IBM Research Report, San Jose, California RJ599 (1969)

[3] Pascal, F., Interpreting Codd: The 12 Rules

[4] Pascal, F., The Interpretation and Representation of Database Relations

[5] Codd, E. F., Further Normalization of the Data Base Relational Model, IBM Research Report, San Jose, California RJ909 (1971)

[6] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance

Saturday, November 17, 2012

Site Update

1.
A new Quote of the week was posted on the QUOTES page.

2.
A new To Laugh or Cry? item was posted on the LAUGH/CRY? page.

3.
It has come to this. It was only a matter of time and I don't know what took so long.



Sunday, November 11, 2012

DBMS vs. Application Enforced Integrity

One of the online exchanges I participated in is Use of constraints on Db to improve data quality - good idea or better in the applications? initiated by the following question:
Can I ask whether people make use of the functionality provided by the database to ensure adequate data quality. Secondly do people apply this retrospectively as seems quite reasonable to me when a data problem is identified and the source data cleaned up - to do so could prevent future errors. There appears to be a tension between this sort of implementation and at least a perception of flexibility as database changes would be required should additional allowable values be required.
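My answer to the question above can be sketched in a few lines (table hypothetical): a declarative database constraint guards every access path, while an application-level check guards only the applications that bother to implement it.

```python
import sqlite3

# A minimal sketch of database- vs. application-enforced integrity.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        acct_id INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)  -- enforced centrally, once
    )
""")

def sloppy_app_insert(acct_id, balance):
    # An application that "forgot" to validate its input: the DBMS
    # still rejects the bad data, for this and every other application.
    conn.execute("INSERT INTO account VALUES (?, ?)", (acct_id, balance))

sloppy_app_insert(1, 100)
try:
    sloppy_app_insert(2, -50)
except sqlite3.IntegrityError:
    print("constraint enforced regardless of the application")
```

The "flexibility" objection in the question amounts to this: if an allowable value changes, a constraint must be altered in one place in the database, rather than hunted down in every application.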

Thursday, November 8, 2012

News

1.
Reminder: My presentation
NoSQL, No Schema and Database Management: Data Fundamentals and Technology Evaluation
San Francisco SQL Server User Group, Wednesday, 11/14/12, 6:30pm
Free and open to the public. Details here.

2.
Quote of the Week was posted to the QUOTES page. Consider it in the context of the comments to my last post, Kinds of keys.

3.
A new To Laugh or Cry? item was posted to the LAUGH/CRY? page.

4.
A new online exchange I participate in was added to the FP ONLINE page.

5.
Here's an item that validates my arguments that (1) SQL and the relational model are not the real problems when it comes to data management distributivity, and (2) proponents of distributivity and the Cloud have no grasp of what distributivity really entails.

Why Google Went Offline Today and a Bit about How the Internet Works - CloudFlare blog


Thursday, November 1, 2012

Site News

1.
My 11/16 San Francisco presentation. Details here.

I am planning a European tour in the March-April time frame, which will include the UK and possibly Israel. Those interested in organizing seminars or presentations, or who know others that might be, please contact me.

I have added a SCHEDULE option to the site's top menu -- a page listing forthcoming presentations/seminars with links to the events' pages.

2.
I added an FP ONLINE option to the top menu -- a page listing my web columns and online discussions in which I participate.

3.
New Quote and To Laugh or Cry of the Week were posted today.

The QUOTES and LAUGH/CRY pages were also updated with all the posts to date.

From now on Site News will announce when these two pages are updated with new posts so that interested readers can check them out.

4. Matt Rogish drew my attention to an article on programming which is not inconsistent with my claim that the IT industry operates like the fashion industry:

Programming is a Pop Culture
Discussion


Wednesday, October 31, 2012

Weekly News

1.

I will give a presentation for the San Francisco SQL Server User Group:

Microsoft Office
835 Market Street
Suite 700
San Francisco, CA (map)

More details here

2. 
I am working on a European tour in the March-April time frame. The UK will probably be included and possibly Israel. Those interested in organizing seminars and/or presentations, or who know somebody who might be, please contact me.

3.
Google claims it has resolved the problem with updates changing page URLs, so I have added all the quotes to date to the QUOTES page. LAUGH/CRY page is next.

4. It will take me a while to revise all my papers, so if you order the papers now you will get the already revised Business Modeling for Database Design and the current versions of the other papers, which will entitle you to the revised versions when they are published.

5. There seems to be a problem with adding a LinkedIn share button to the blog, with LinkedIn pointing to Google and Google, as is its wont, couldn't care less. Until this is resolved--don't hold your breath--please share manually. Thanks.

 

Monday, October 29, 2012

The Cloud and SQL

In Why SQL Is Not Suited for the Cloud and What Is NuoDB, one Aswin writes:
In the data world, SQL is a relation-based, data model, and the most common means used to retrieve and manipulate data. Predicates, clauses, expressions and queries that are all used in SQL data management system have made it abundantly popular and user friendly. But this traditional data management system suffers serious of limitations...

Friday, October 26, 2012

Weekly News

1.
My latest column Why Managers Need Business Models (my original title was Modeling and Usefulness).

2.
I am working on a European tour in the March-April time frame. UK will probably be included. Those interested in organizing seminars and/or presentations, or know somebody who might be, please contact me.

Have a nice weekend.


Wednesday, October 24, 2012

Education, Practicality and an Introductory SQL Book

WS: I have a question. I have been asked about the possibility of teaching a SQL course. My audience will be people from a scientific rather than an IT background (which I think will make my job easier). If the training does take place then obviously in my own material I will cover the basics of the relational model before moving on to SQL. One of them has asked me if there is a book I can recommend on the subject to get them started. Now although I own a lot of books about the relational model and about SQL they contain mainly quite advanced material, I don't know of an introductory book about SQL that is theoretically sound. I wonder if you can recommend one? ... I haven't seen anything that I am particularly enthusiastic about. I would say it is a gap in the market, though unfortunately the market probably demands books about Oracle, Sql Server or MySQL. That is, of course, true of trade media and book publishers too.

Sunday, October 21, 2012

To Laugh or Cry


tblIsThere, The Daily WTF


Quote of the Week

Until the rise of nosql solutions, you would force-fit all data and relations into traditional RDBMS model and then do a fire dance in the application layer to manage them. In other words, you would force a cube into a cylinder with enough force that it fits. With nosql solutions, that is not necessary, so you can pick the right solution based on the problem domain. If your data is largely schemaless, there is no point in forcing it into a relational table with most of the columns being null, so a document oriented DB such as MongoDB or CouchDB would be ideal. If you dealing with lists, sets, queues etc., which need to be persisted, then Redis maybe an ideal candidate. If you are managing key/values then Cassandra maybe your choice. Of course, you don't want to have a dozen of these in your system, but a careful use of RDBMS and nosql solutions would greatly simplify application design.
--Have any NoSQL databases become viable replacements for SQL databases yet?, quora.com

Tuesday, October 16, 2012

What a Truly Relational System Is (and What It Is Not)

Rewrite 2/1/17.

Here's what's wrong with last week's picture, namely:

"A quick-and-dirty definition for a relational database might be: a system whose users view data as a collection of tables related to each other through common data values.

The whole basis for the relational model follows this train of thought: data is stored in tables, which are composed of rows and columns. Tables of independent data can be linked, or related, to one another if they each have columns of data that represent the same data value, called keys. This concept is so common as to seem trivial; however, it was not so long ago that achieving and programming a system capable of sustaining the relational model was considered a longshot with limited usefulness.

If a vendor’s database product didn’t meet Codd’s 12 item litmus tests, then it was not a member of the club ... these rules determine whether the database engine itself can be considered truly “relational”. These rules were constructed to support a data model that would ensure the ACID properties of transactions and also eliminate a variety of data manipulation anomalies that frequently occurred on non-relational database platforms (and **still do**)." --Kevin Kline, SQLBlog.com

Thursday, October 11, 2012

Weekly News

1.
Reminder

My presentation: Foundation Knowledge for Database Professionals
Mountain View, October 16, 6:30pm (open to the public and free)

More details here.

2.
Three LinkedIn Data Modeling group threads in which I participated.

One or two files
Is there a 3NF to denormalized tool
Kinds of Data Models -- And How to Name Them

3.
To new visitors: If you are wondering why the SEMINARS, PAPERS, QUOTES and LAUGH/CRY pages have not been updated, there is a bug in Blogger--the URL changes every time a page is updated--which wreaks havoc with traffic/SEO. Google has not been able to fix this for months and until they do many bloggers do not update the pages. Until a fix is provided, for previous quotes and To Laugh or Cry? posts please use the Blogger search facility at the top of every page or just scroll back to older posts.

4.
Cloud computing: here we go again

Came across this article that is consistent with my arguments about how industry operates.
Summary: The same pattern has emerged again and again in the history of IT. New ideas emerge. Vendors develop their own approaches. Camps form to support those approaches. Eventually standards emerge. The battle over cloud computing standards and approaches is only the latest repetition of the pattern.
The Achilles heel in all this is that many if not most "new" ideas are not new at all. The only way to ensure that the wheel--and a square wheel at that--is not constantly reinvented is to rely on sound foundations and knowledge of the history of the field. Otherwise, we're spinning wheels in place at best, or going into reverse.

5.
A comparison

Last week during Oracle Open World I had the pleasure of briefly meeting Jonathan Lewis (with Jonathan Gennick's help) and we had a chat about the state of the industry. In that context, please watch the two videos below and see if you discern a difference.

http://www.youtube.com/watch?v=g65fII0nX8g
http://www.youtube.com/watch?feature=endscreen&NR=1&v=FvRDK2B0T_I

Wednesday, October 10, 2012

To Laugh or Cry?

UPDATE: By mistake I posted a link to a piece that I intended to debunk, which is why you could not see any comments. Here is the correct one.



The Death of the Relational Database,
whydoeseverythingsuck.com

An oldie, but goodie. Particularly some of the comments.



Quote of the Week

SQL Server and all relational databases aren't much more than glorified file handlers. If you accept them as that, then you'll see the falacy [sic] of trying to do things like inheritence [sic] and the wad of other stuff people think it should do. SQL is not a full fledged language nor should it be. It's a simple file handling language. Tables are really nothing more than files with lines (rows) and fields (columns). It's a very simple thing... why do people keep trying to make it more complex?
--Relational Model Advantages,SQLServer Central

Sunday, October 7, 2012

Data Fundamentals, Fads and "Big Data"

WS writes:
It is fairly clear to me that the "NoSQL" movement (or "big data" as some people seem to be calling it nowadays) has thrown out the relational baby with the SQL bathwater.
That is exactly the point I was making in my several posts on NoSQL. Since SQL is practically the only commercially implemented data language with any relational characteristics, most practitioners confuse it with the relational model. I quoted with approval David McGoveran that the NoSQL movement has its roots in the anti-relational sentiment expressed by object and XML proponents. Underlying it is the failure to appreciate the distinction between the fads that characterize the industry's mode of operation and relational technology. This failure is due to disregard for and lack of knowledge of data fundamentals. In fact, the relational model has been treated for a time as just another fad.
In particular I wanted to say that I agree very strongly with Eric Kaun's comments about the supposed inflexibility of relational schemas. In my experience it is actually very easy to change the schema, but often fiendishly difficult to assess the impact of these changes on all the applications that use the database.
For a while I just could not understand what the claims of "fixed relational schema" meant, because it is obvious that the relational approach increased schema flexibility via several types of data independence relative to products that preceded it. But recently it occurred to me that what complainers mean by it is not that the schemas are fixed, but rather that inflexibility means exclusive reliance on tables. Why not change the structural basis of the schema "as needed"? This certainly is an explicit fundamental rejection of the relational model (they cannot express even their complaint correctly).
In spite of the apparent sophistication of many developer environments, none of them, as far as I am aware, do anything as simple as representing the relation between application language expressions and database attributes. This would mean a simple query could find all the dependencies between applications and the database. As is often the case application programmers are ignoring the failings of their own version and dependency management and blaming the DBMS instead.
It depends on what the meaning of "sophisticated" is. It may apply to programming languages, techniques and tools, but not necessarily to data fundamentals such as logic, the different data models with their pros and cons, their practical implications and the history of the field. That explains why old technologies are reinvented and relabeled as new and the constant stream of fads. As to the application-DBMS functional distinction, many developers are not even aware of it and see nothing wrong with having to undertake database functions in applications. Hence the increasing number of "application-specific databases".
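WS's point about dependency management deserves a sketch. Something as simple as the following (all names hypothetical) would let a single query find every application affected by a proposed schema change:

```python
import sqlite3

# A sketch of recording which application expressions reference which
# database attributes, so impact analysis becomes a simple query.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE app_dependency (
        application TEXT,
        module      TEXT,
        db_table    TEXT,
        db_column   TEXT
    )
""")
conn.executemany("INSERT INTO app_dependency VALUES (?, ?, ?, ?)", [
    ('billing',   'invoice.py', 'customer', 'credit_limit'),
    ('billing',   'dunning.py', 'customer', 'city'),
    ('reporting', 'summary.py', 'customer', 'credit_limit'),
])

# Impact analysis for a proposed change to customer.credit_limit:
affected = conn.execute("""
    SELECT DISTINCT application, module
    FROM app_dependency
    WHERE db_table = 'customer' AND db_column = 'credit_limit'
    ORDER BY application
""").fetchall()
print(affected)  # → [('billing', 'invoice.py'), ('reporting', 'summary.py')]
```

That developer environments do not routinely maintain such a relation says more about application tooling than about the DBMS that gets blamed.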
Regarding NoSQL, I suspect there is an element of cargo-cult thinking going on here. Google have been very successful (at least from a business perspective) using distributed processing so if you use the same methods as Google then you are bound to be successful too.
Obviously, if you lack foundation knowledge you will opt for emulating others without a real understanding of the extent to which their success--to the extent that it is not short-lived--is applicable to your circumstances (see SQL/PostSQL/NoSQL). As an example, from my experience with Google's Blogger in particular and their other products in general (see my previous post), I would not be surprised if their poor ability to fix problems and provide expedient support (if at all!) is due to some extent to their reliance on ad-hoc, schema-less products. These products have the potential to hit the same walls as pre-Codd non-relational technologies. Facebook, Amazon and other "Big Data" companies reveal similar problems.

The size of the data has nothing to do with the data model. Distribution has nothing to do with the data model. Consider, again, David McGoveran's list of circumstances which may justify a NoSQL product:
First, when discovery of relationships is more important than consistent processing and specific data results.
Second, if the data processing is meant to be inductive (e.g., suggestive) rather than deductive (i.e., precise).
Third, when the application is changing very fast, data complexity is great (variety or amount).
Fourth, if physical issues, like big data or a high degree of parallelism, are more crucial than data integrity. You must be willing to throw away data consistency in favor of performance and scalability.
Fifth, if you have a mission-critical one-off application for which a fixed data organization is ideal, in which case the costs and risks may be lower than licensing a vendor’s RDBMS [read: SQL] or trying to force an open-source RDBMS [read: SQL] to fit the need.
With knowledge of data fundamentals one is tempted to conclude that the intention here was to demonstrate that the set of applicable circumstances is very close to being the empty set. In the absence of such knowledge, the temptation to use these tools in order to avoid hard thinking and design effort upfront, while still expecting sound results, is just too strong to resist.

Thursday, October 4, 2012

Weekly News


It's been months since a problem was discovered in Blogger's new interface: pages' URLs change when they are updated, which is a disaster for SEO traffic. Amazingly, Google initially asked for details of the problems users experienced--lots were provided--but then there was no response from Google, nor any indication of when a solution would be available, except a very recent comment that they are working on one that might be ready within a couple of weeks. Worse, it switched users to the new interface before providing a solution (see Page URLs Change, Though Link References Do Not Change), although it looks like the update introduced the problem to the old interface too (my gut tells me it might be a data management problem).

There's more. As I wrote earlier, I was planning to drop the old dbdebunk.com site and point this blog to that domain, so that links to the old site would at least reach the blog, if not the original page. Google provides instructions on how to use a custom domain with Blogger, so I canceled the old site, but when I followed the instructions, I got an error. Then I was alerted to yet another problem with Blogger:
Known Issue: Custom domains in Google Sites

We are aware of an issue with custom domains in Google Sites. The functionality to add a new custom domain is currently unavailable - however, existing custom domains are working. We are working to resolve the issue as quickly as possible and expect a fix within the next few weeks
It was posted on September 19th; note the timeframe for a solution!
That a company like Google cannot resolve such problems for months (and is unresponsive to boot), despite the damage to its users and so many complaints, is a phenomenon on the rise. Companies like Facebook and Google have, on the one hand, huge institutional market power and, on the other, hugely complex software infrastructures based on proprietary, ad-hoc software (all those "BigData", NoSQL or even non-database application goodies) that are extremely difficult to manage, maintain, evolve and optimize. The combination produces both failure to satisfy user needs and lack of responsiveness.

Luckily, my domain registrar was able to provide a temporary solution by forwarding dbdebunk.com links to dbdebunk.blogspot.ca without users seeing the .ca.

To Laugh or Cry?

Database design and terminology, dbForums.com

Quote of the Week

If you have a wide existing table containing lots of frequently null columns, it is likely you can make it more efficient if you re-arrange the physical table so that most of the nulls are contiguous to the right hand side of the table. If the table is over 254 columns in certain RDBMSes, you may profit from breaking the table into two pieces such that the overflow is less often referenced and when you need the whole it is via a view slicing the two physical tables together one-for-one.
--Database tables and NULLs, stackexchange.com

Monday, October 1, 2012

Normalization, Further Normalization, Ease of Use, Integrity and Performance

Revised: 10/15/16
"Normalization was invented in the 70's as a way to put some structure around how developers were storing data in a database, in addition to trying to save disk space. You need to remember this was a time when 1MB was billions of dollars and a computer needing 1GB of data storage was inconceivable. It was a way to squeeze as much data into as small a space as possible." --Tom Phillips, social.technet.microsoft.com
Perhaps the lack of understanding of the relational model was so acute at the time it was first published (1969-70) that it would not surprise me if a belief existed then that normalization would save storage space, even if I don't understand what that belief was grounded in. But there is no justification for such a belief to persist in 2012, no matter what else one thinks of normalization.

For the multiple advantages of full normalization (5NF)--chief among them semantic correctness of query results (i.e., no anomalous side effects)--see the just published THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, available via the BOOKS page.
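A minimal sketch of the update anomaly that normalization prevents, not of any storage saving (table and data names are hypothetical): when a fact such as a customer's city is repeated in every order row, an update can leave the database contradicting itself; stated once in its own table, the fact has a single point of update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized: the customer's city is repeated in every order row.
conn.execute("CREATE TABLE orders_flat (order_id INTEGER, customer TEXT, city TEXT)")
conn.executemany("INSERT INTO orders_flat VALUES (?, ?, ?)",
                 [(1, "Acme", "London"), (2, "Acme", "London")])
# A partial update produces a contradiction: Acme is now in two cities.
conn.execute("UPDATE orders_flat SET city = 'Paris' WHERE order_id = 1")
inconsistent = conn.execute(
    "SELECT COUNT(DISTINCT city) FROM orders_flat WHERE customer = 'Acme'"
).fetchone()[0]

# Normalized: each fact is stated once; one update fixes every query.
conn.execute("CREATE TABLE customers (customer TEXT PRIMARY KEY, city TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT REFERENCES customers)")
conn.execute("INSERT INTO customers VALUES ('Acme', 'London')")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "Acme"), (2, "Acme")])
conn.execute("UPDATE customers SET city = 'Paris' WHERE customer = 'Acme'")
cities = {row[0] for row in conn.execute(
    "SELECT c.city FROM orders o JOIN customers c ON o.customer = c.customer")}
```

Note that the normalized design actually stores the repeated city value fewer times, but that is incidental: the motivation is correctness, not space.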

Thursday, September 27, 2012

A Note on Education vs. Training

Gene Wirchenko drew my attention to an Infoworld article by Andrew Oliver, Ill-informed haters go after MongoDB, a kind of response to the articles critical of MongoDB on which I commented in previous posts. The gist of the article is described as: "NoSQL databases like MongoDB are great for some tasks but not for others. Is it MongoDB's fault if misguided developers use it to solve the wrong problem?"
With any new technology comes a wave of marketing happy talk, which in turn leads to inexperienced developers "jumping on the train" of a new fad. Inevitably, these newbies find themselves disappointed that the technology doesn't deliver on their inflated expectations.
Oliver correctly identifies the core systemic problem in database management, one that I have been warning of for almost my entire 25+ year career in the field: the lethal combination of vendors unfamiliar with the foundation and history of the field hyping a proliferation of ad-hoc products and technologies to database professionals and users equally unfamiliar with same. Neither do I find fault with the advice he offers at the end of his article:
"Take blogs with a grain of salt ... make sure you understand the technology before using it on a critical project. If you don't heed this advice, some writer for Infoworld on a short deadline in a slow news week might decide to ridicule you!"
although I don't think the ridicule by journalists, even less knowledgeable about what they cover, is the most serious consequence.

But he fails to make the connection between a major source of the problem and the effectiveness of his advice.

The IT industry in general and the database field in particular rely almost exclusively on tools experience. Practitioners are inducted into the field mainly via practice with specific tools that happen to be in vogue at specific times; job descriptions don't require much beyond that; and academia has been turned away from science and education into a research and certification vehicle for vendors and their tools, a trend which Dijkstra attacked decades ago much better than I can. I experienced this personally on more than one occasion. To recall two:
  • When I offered a presentation on data fundamentals to a reputable computer science department, there was no interest, as they were too busy with "XML research".
  • When I tried to teach an introductory course in database management at a local university by developing a syllabus on data fundamentals, I was quickly disabused of that illusion by a demand to use a specific book and teach Oracle.
A lot is being made of the "higher education bubble", the exploding cost of an academic education and the burdensome indebtedness it causes. Among its many implications there is an insidious one: enormous pressure is exerted, for obvious reasons, on academia to turn from educational to vocational--from employers and vendors (via various incentives that are hard to resist) and from students, particularly those sponsored by employers or vendors.

I do not believe that knowledge of and experience with tools alone is sufficient to address the problems underlying the database field. Without foundation knowledge, including the history of the field, relabeled old discarded products will continue to proliferate and practitioners will lack the capacity to avoid being seduced by hype.

Indeed, one could argue that the attraction of the so-called "schema-less" NoSQL products is due to the difficulty of thinking conceptually and logically about requirements and of evaluating technologies and products critically, because the necessary knowledge and ability to reason and abstract--distinct from tool experience--have not been inculcated. Commonly used assertions such as "different databases for different purposes" or "the right tool for the right task" are trivial and trite, and can be misleading without the benefit of foundation knowledge external to the tools themselves.

Note very carefully that I do not mean to imply that tool experience is unimportant, which would be nonsense. Rather, I claim that it is necessary but insufficient for intelligent functioning in the database field, as it probably is in many other fields.

Wednesday, September 26, 2012

Mountain View Presentation

I will be giving a presentation to the Silicon Valley SQL Server User Group:

Foundation Knowledge for Database Professionals

October 16, 6:30pm

1065 La Avenida
Building 1
Mountain View, CA (map)

It is open to all. Please join us and invite anybody who might be interested.

More details here

Quote of the Week

I've read a few things about NoSQL (technology or movement? That is the question!). As my colleague Stuart McLachlan rephrased the question: If it's a movement, the real question is whether it's a religious or bowel movement.
--SQL/PostSQL/NoSQL, artfulopinions.blogspot.com

To Laugh or Cry?

To Laugh or Cry? items are specifically selected as lost causes (usually, the problems lie with either the author of the item, or with those who respond to it, or both). For that reason I normally do not comment on them. This week's piece is somewhat unusual in that the target is not the author, but rather a vendor. The following item was brought to my attention by Matt Rogish.

Diego Basch, I’ll Give MongoDB Another Try. In Ten Years.

Saturday, September 22, 2012

Davide and David on NoSQL

After publishing my 3-parter on NoSQL I came across a post by Davide Mauri and an interview with David McGoveran on the subject, to which I would like to add some clarifications (this is not a debunking, as they hardly justify such).

Mauri asks: "Do NoSQL people really want to drop the relational model?"

Thursday, September 20, 2012

PRACTICAL DATABASE FOUNDATIONS PAPER #1: BUSINESS MODELING FOR DATABASE DESIGN


Paper #1, revised and expanded, is available for purchase. Please follow the instructions on the PAPERS page to order.

About this Paper

The purpose of this paper is to introduce practitioners to data fundamentals and advance their understanding via business modeling for database design.
It focuses particularly on and distinguishes between the three types of model that are often confused: business model, logical model and data model.

The paper delves into each in detail, explaining the basic modeling and design concepts, principles and methodology for formalizing an informal business model, via the relational data model, into a formal logical database model that can be computerized.

Preface

Introduction

Business Modeling
Basic Modeling Concepts
Business Rules
Property Rules
Class Rules

Attribute Rules

Distinctness Rules

Functional Dependency Rules

Referential Rules

Associative Entities

Optional Rules
Business Models
Database Design
Formalizing the Informal
Predicates and Propositions

The Relational Data Model
Relational Structure

Relational Manipulation

Relational Integrity
External Predicates

Integrity Constraints and Internal Predicates
Logical Models
Understanding Database Management
A Foundation Framework
Applying the Framework (Exercises)
Appendix A: Integrity Constraints Formulation and Verification

References

Wednesday, September 19, 2012

Data Models and Usefulness

In my 3-part article on dynamic schema, NoSQL and the relational model I admitted that I did not know much about NoSQL products in general and MongoDB in particular. Nevertheless, it was not difficult to figure out what problems would be faced using a docubase for database management purposes.

Then Matt Rogish alerted me to Why I Migrated Away From MongoDB. I suggest you read it in the context of my article and see how close my suspicions were to reality. Here are some quotes worth pondering (emphasis mine):
Alas, not being aware of the mathematics behind relational algebra, I could not see clearly the trap I was falling into - document databases are remarkably hard to run aggregations on and aggregating the data and presenting meaningrful statistics on your receipts is one of the core features of digiDoc. Without the powerful aggregation features that we take for granted in RDBMSs, I would constantly be fighting with unweildy map-reduce constructs when all I want is SUM(amount) FROM receipts WHERE <foo> GROUP BY <bar>. 
People keep complaining that JOINs make your data hard to scale. Well, the converse is also true - Not having JOINs makes your data an intractable lump of mud. 
...when I last looked something as simple as case-insensitive search did not exist. The recommended solution was to have a field in the model with all your search data in it in lower case [FP: heh, heh]... And if I add a new field to the model? Time to regenerate all the search strings. I can only come to the conclusion that mongodb is a well-funded and elaborate troll.
...somewhere along the stack of mongodb, mongoid and mongoid-map-reduce, somewhere there, type information was being lost.
Then it might have been the lack of an enforced schema? Thinking about it though, schemas are wonderful. They take all the constraints about your data and put it in one place. Without a schema, this constraint checking would be spread all over my application. A document added a month ago and a document added yesterday could look completely different and I’d have no way of knowing. Such fuzzy schemaless data models encourage loose thinking and undisciplined object orientation.
Anybody with foundation knowledge would expect these problems. As I have argued so many times: a data structure determines manipulation and, therefore, usefulness for a given informational purpose.
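The aggregation the author was fighting map-reduce for is a one-liner against a relational structure, and his "missing" case-insensitive search needs no duplicate lower-cased field. A sketch with hypothetical receipts data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (description TEXT, amount REAL, category TEXT)")
conn.executemany("INSERT INTO receipts VALUES (?, ?, ?)",
                 [("Taxi", 20.0, "travel"),
                  ("HOTEL", 150.0, "travel"),
                  ("Lunch", 30.0, "meals")])

# The SUM(amount) ... GROUP BY aggregation the author wanted:
totals = dict(conn.execute(
    "SELECT category, SUM(amount) FROM receipts GROUP BY category"))

# Case-insensitive search, with no duplicate search field to regenerate:
hits = [r[0] for r in conn.execute(
    "SELECT description FROM receipts WHERE lower(description) = lower(?)",
    ("hotel",))]
```

The structure (relations) is what makes the manipulation (aggregation, restriction) available declaratively, which is precisely the point.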

Oh, and in this context, see my AllAnalytics posts and the exchanges with readers: 

Knowing What a Database Is
Unstructured Data

Tuesday, September 18, 2012

Object Orientation, Logic and Database Management

WS writes:
On the subject of object orientation I often feel inclined to quote Leslie Lamport's comment in the introduction to his book "Specifying Systems": "If exposure to C++ hasn't completely destroyed your ability to think logically, you should have no trouble filling in the gaps in your mathematics education".
I tried for many years to understand OO, but without success. It didn't occur to me until I encountered the material on dbdebunk that the problem might not lie with me. OO's lack of a formal definition leads, I think inevitably, to what I have come to call PALC (Pointless Additional Layers of Complexity). Every addition to OO leads to new problems that require ever more complicated solutions, that in turn create a whole new set of problems.

Monday, September 17, 2012

To Laugh or Cry?


Is it really necessary to normalize your databases in third normal form?, social.technet.microsoft.com

Quote of the Week

The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.

2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:

    Relational:
        Was designed to do relational stuff* from the get-go, even if it now does other things too.
        Supports a lot of SQL.
    Non-relational:
        Was designed primarily to do non-relational things.*
        Doesn’t support all that much SQL.

*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.
--Curt Monash

Weekly DBDebunk News

1. I have added
  • a "Link to this blog" button and code at the top right
  • URL, HTML and BB/Forum link codes at the end of each post
If you like the blog or posts, please use them to link to them. This being a new blog, it will help with visibility. Thanks.

2. Shortly I will switch the blog to the old domain, dbdebunk.com. You don't need to do anything as the blogspot will simply forward you to that domain.

3. For those interested in databases and gaming, there is a forum thread at

https://forums.eveonline.com/default.aspx?g=posts&m=1919633

about problems with EDK, an older 'massively multiplayer online' (MMO) game based on MySQL, the latter supposedly being broken by the application/user load. The discussion is about the new DBMS to be considered and the choice between SQL and NoSQL (MongoDB, about which I wrote in my three-part article, is mentioned).

I do not presume to know the technicalities and terminology of gaming systems and I am not in a position to judge whether current SQL systems can adequately satisfy gaming in general and EDK in particular. Perhaps readers who have knowledge in this area can enlighten us.

But if you read the exchange carefully you can detect the familiar sources of the (misleading) attraction of NoSQL: lack of familiarity with history and foundation knowledge, developers not transcending their application view of databases, the logical-physical confusion, the failure by SQL and its implementations to adhere to the relational model and so on. There is one developer with foundation knowledge, though, who is readily discernible.

We are regressing decades. I may not be around when somebody has to "discover" the relational model in order to address the very same problems that Codd thought he had solved 40+ years ago.

4. Matt Rogish has submitted this link: 

Why I Migrated Away From MongoDB
 
which I may comment on later.

Friday, September 7, 2012

Forward to the Past: "Out-clevering" the DBMS

When I see a title like "Software Engineering: What are some clever programming techniques that aren't used often enough?" my antennae go up. More often than not, (1) programmers consider certain things clever only because they are not familiar with history, and (2) in its quest to solve problems, engineering sometimes tries to out-clever science, which usually spells trouble.

Wednesday, September 5, 2012

Quote of the Week

Relational databases ... had poisoned whole generation of programmers thinking (incl. me). With no clue we approached everything through relational prism that we studied in university. MySQL was the only OSS alternative that cut enough corners to make relational DBMS fit the problems we tried to solve. For which the relational paradigm wasn't a good fit anyways.

So. then comes along the NoSQL stuff and you find out that you have much better tools than relational for these kinds of problems (simple websites, blogging sites, feed sites, social sites, etc.). Along the way you might see all the other problems that really need realational (ERP systems etc.). But now you need real relational not some half-baked all corners cut MySQL stuff. Yeah that also goes for InnoDB which might have become something if Oracle hadn't freezed it.

Then you try out Postgres (because it's free) and when it works for big workloads (yes it does in skilled hands) then you are hooked. If by chance you then check out Oracle ... it feels like a twelve ton dinosaur ("erm. no boolean type? whaat - implicit commit with DDL?"). If by any miracle you don't have the 1% of projects that need something special - Oracle RAC or Spatial or whatnot. Then Postgre is the most mainstream option available - there just is no other as popular, as functional and as cheap solution for these kinds of problem domains that really require relational DBMS.

PS: been using MySQL (both MyISAM/InnoDB) over 10 years, PostgreSQL over 5 years and Oracle over 5 years. --Hipsters Hacking on PosgreSQL, theRegister.com

To Laugh or Cry?


We don't Need No Stinkin' Database, The Daily WTF

Friday, August 31, 2012

Weekly DBDebunk News

1.
As most of my readers know, for various reasons I stay away from products and will occasionally touch on them only in a very specific context of data and relational fundamentals, in both the negative (e.g., false claims, violations, or errors) and positive (true and correct implementations) sense. For such purposes one needs the very rare combination of thorough product experience in actual practice and knowledge, understanding and appreciation of fundamentals. I happen to know a few such practitioners; I've asked them for input when necessary and they have kindly agreed. So in the future expect to see some of their contributions.

Recently somebody in a database-related LinkedIn group requested advice on how to select a DBMS. He received a few responses that included some important factors such as cost, support and development community, but all were practically devoid of critical factors such as technical properties and, as one of my advisors pointed out, their suitability to business needs. As another advisor correctly pointed out, this may have been OK several years ago, but not now, when the BigData/NoSQL fad is in full bloom and circumstances seem to have regressed to the good old pre-relational days of IMS and CODASYL.

I have asked some of my advisors to offer their thoughts on DBMS selection and am getting some interesting replies, which provide evidence for the validity of my gut sense about these "DBMSs" and which I will use in the appropriate contexts. It does look, though, like even with my limited knowledge of NoSQL products, the gist of my impression of them seems to have been spot on: "we don't know much about the data and what we logically need to do with it, except that there is a lot of it and we want to store it physically in ways that will maximize the performance of those operations that we hope to discover we need."

Stay tuned.


2.
I have added a TRDBMS links section (bottom right). It currently includes Dataphor and Rel, both implementations of Date & Darwen's Tutorial D (which is intended to be a full development language with integrated relational capabilities). The former is an industrial-strength open source TRDBMS used by the Database Consulting Group. The latter is currently an educational vehicle that is being developed into a product. Both are free, so you can get a feel for what a truly relational product looks like and can do relative to SQL and the new wave of products claiming to be improvements over it.

Have a good weekend.

Wednesday, August 29, 2012

To Laugh or Cry?

Help designing database table, dbForums

Quote of the Week

Q: normalization of receipt help pls

   DESCRIPTION|QTY|UNIT COST|TOTAL|TRANSACTION TYPE

   i need to convert it to 1st NF, 2nd NF , 3rd NF and BCNF
   pls help i already did a lot of searching in the net. i really cant seem to understand how normalization works.

A: (Expert): Look for E. F. Codd book.

--forums.databasejournal.com

Sunday, August 26, 2012

Schema, NoSQL and the Relational Model Part 3

I started in Part 1 with Haberman's justifiable doubts about "schema-less" NoSQL databases, using the MongoDB docubase as an example, and an exchange with Matt Rogish on the problematics of a "document data model". In Part 2 I completed the exchange using the example of the W3C effort to define such a model for XML. I now return to schema and the MongoDB DocBMS.

Thursday, August 23, 2012

Weekly DBDebunk News

1.
The last Laugh or Cry? item was a thread on NULLs at StackExchange.com. A reader commented that he could not tell whether I am "mocking (1) the question (2) the current answers or (3) the StackExchange forum" in the thread. So perhaps I should explain: Laugh or Cry? items are too taxing to debunk and pretty hopeless. But these items are not just entertainment (or pain, as it is for some of us), but also opportunities to test yourself on foundation knowledge. Lack of clarity on what is wrong indicates some gap that needs filling.

Now, of course, not everything is wrong in long threads, but it is precisely the ability to distinguish between sensible pronouncements and nonsense that foundation knowledge confers.

In fact, a thread I debunked at the old site is an excellent illustration of what I mean. Stay tuned.

UPDATE: One more reason for old posts: despite all the claims of "progress", some are evidence to the contrary: that fundamental problems are not resolved and that there is even regress in that progress.

2.
I last revised the papers in January 2011. The current revisions are major expansions/overhauls which I may well renumber as version 1. The first paper I announced earlier, Business Modeling for Database Design, required a bit more effort than I initially thought, so I have delayed publication a bit for further refinements--it'll be worth the wait.

I am also testing the waters of self-publishing on Kindle --no small hassle--with the kind help of Charlie Clark (thanks, Charlie).

The seminars are undergoing a similar overhaul. I highly recommend them (and I am not biased! :)). Go ahead, organize one.

3.
If you noticed, I am interleaving some posts from the old site with new ones. This is not just laziness or to fill space. I am soon going to drop the old site and I am selecting very carefully those that are too valuable to lose.

Besides, their revision does not take much less time and effort than writing new ones.

I have now added to all of them the original post date and the date of the revised section, for proper context (thanks, Yiorgos, for suggesting it).

Have a nice weekend.

Wednesday, August 22, 2012

To Laugh or Cry?

Representative Table, The Daily WTF

Quote of the Week

The First Normal Form: Stipulates that no repeating groups or sets should be stored within a table. These similar groups should be stored in a separate table. To identify each unique column of these tables, a primary key should be established

The Second Normal Form: Data Redundancy should be non-existent. Foreign key constraints should be enforced.

The Third Normal Form: Every column in the table should be related and dependant on the primary key of the table. If this is not possible, the field should be stored in a new table with a new key.

The Fourth Normal Form: This one would probably exists somewhere in dream land - the elimination of independent relationships in a Relational Database.

The Fifth Normal Form: Exists in never-never land - it calls for a perfect segregation of logically related  many-to-many relationships.

So now you know something about relationships (that's what our whole Relational Database thing is about right?). But just keep in mind that as we increase and tighten our relationship enforcements, there would be a little trade off with performance.--George Alexander, A fundamental approach to Database design

Sunday, August 19, 2012

SQL Sins


Ovid: I've recently purchased the book DATABASE IN DEPTH by C. J. Date and I've been enjoying it very much.  It's been quite an eye-opener to discover that most of my gripes about "databases" were actually related to SQL.

To that end, I've accidentally started a lively discussion about the "merits" of SQL at Why SQL Sucks (with a little Perl to fix it). dbdebunk.com is getting mentioned quite a bit (and not always in a favorable light, I might add).  (Mine is the root post in the thread). Admittedly, I'm just learning about many of the issues involved. So despite years of working with databases, I may have some of the basics off a bit.

The reason I mention this is because that site is very high profile amongst Perl programmers and convincing the people at that site about the issues with SQL would go a long way to reaching a broad swath of the tech world. 


Friday, August 17, 2012

Weekly DBDebunk News

1. My first post for my new Data Fundamentals column at TechWeb's AllAnalytics:

Knowing What a Database Is

2. Last week I reported some Blogger bugs. One of them changes the URLs of static pages when they are updated. The Seminars, Papers, LaughCry, Quotes and About are static pages, which means that I won't update them until Google fixes the bug, although I will continue to post updates in the blog.
If such an update scrolls off before you read it, you will have to go back to it. I apologize for the inconvenience.

I am keeping a list of the pertinent posts and will update the pages as soon as the problem is solved.

3. Oops, forgot. Some of you may be experiencing broken links. I have a bunch of posts in draft mode in the pipeline, and at one point I mistakenly posted all of them and had to take them off. Some of you are accessing those dead links. But not to worry: all of them will be posted in time, although they may have different URLs. Again, sorry for the inconvenience; chalk it up to birth pains.

Have a nice weekend.

Wednesday, August 15, 2012

To Laugh or Cry?

Database table and NULLs, stackexchange.com

Quote of the Week

In the database world, the raw physical data model is at the center of the universe, and queries freely assume intimate details of the data representation (indexes, statistics, metadata). This closed-world assumption and the resulting lack of abstraction have the pleasant effect of allowing the data to outlive the application. On the other hand, this makes it hard to evolve the underlying model independently from the queries over the model.

As the move to the cloud puts pressure on the closed-world assumption of the database, exposing naked data and relying on declarative magic becomes a liability rather than an asset. In the cloud, the roles are reversed, and objects should hide their private data representation, exposing it only via well-defined behavioral interfaces...

The world of database models is noun-based, talking about Customers, Orders, LineItems, etc. Once modelers have designed the data model correctly, they consider their job done.

In the realm of modelers, there is no notion of data abstraction that separates abstract properties of the model from the concrete details of the fully normalized realization in terms of tables with PK/FK (primary-key/foreign-key) relationships. --All Your Database Are Belong to Us, Erik Meijer

Schema, NoSQL and the Relational Model, Part 2

Part 1 ended with the following comment I made to Matt Rogish on the subject of a document data model:
As Codd realized, to do database management you must have some data model, period! You cannot do it without one. Indeed, a schema is based on it.

So in order to design a document database system of the kind you envision you must first define a document data model: structure, manipulation and integrity. What exactly is it?
Part 2 continues my response to Matt, using an exchange between me and Hugh Darwen to illustrate what happened when a W3C committee attempted to come up with an XML document data model.

Sunday, August 12, 2012

Domain vs. Type, Class vs. Relation

What's wrong with last week's picture (update of August 2012)
"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" is a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" only in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous."
With respect to 1, well, yes, they are distinct, but not for the stated reason. With respect to 2, well, no insofar as "programming parlance" goes. The terminology introduced by Codd was explicitly intended to distinguish the formal concepts of set theory and first order predicate logic from the terminology used in programming practice.

What Is a Relational Table? URL Changed

I mentioned Blogger problems in my weekly news post.

I just found out that when a post is taken offline, revised and its title changed, its URL changes. How stupid is that?

There is a similar problem with the URLs of static pages: when they are updated, random numeric suffixes are appended to them.

Until these problems are solved (don't hold your breath) I will not update my static pages and I will not be able to take posts offline, revise them and add UPDATED to their title.

Here is the changed URL of the post I revised:

What Is a Relational Table

Friday, August 10, 2012

Weekly DBDebunk News (UPDATED)

Welcome and thanks to readers, subscribers, well wishers and all expressions of interest and support.

1. Except for some Blogger bugs, the blog format seems to be working better than the old site, so at some point in the next few weeks I will drop the latter and probably shift the blog to the old domain. As far as I can tell, the current blogspot URL will continue to work. I will keep you posted.

2. Currently my top priority is to revise and expand my papers and seminars, and to add to them. I am also experimenting with self-publishing on Kindle, which is proving to be a hassle. This will keep me busy for a while, so if I am not as responsive/active as I would like to be, thanks for your patience.

3. As a result of comments on two of my posts that identified some problems, I have taken them offline for revision and will repost them as soon as I can.
It looks like I am a bit rusty after so many years in the wilderness. I apologize for the inconvenience. (UPDATE: The first revised post has been republished.)

Please feel free to contact me at the address on the About page with submissions for debunking, questions, comments or whatever.

It's good to be back.
FP

Sunday, August 5, 2012

Schema, NoSQL and the Relational Model, Part 1

In What's Wrong with the Schema? Stephen Haberman is
...wondering why schema-less databases are so popular lately. Most any NoSQL store is schema-less. And while perhaps schema-less-ness is an integral part of NoSQL (e.g. most NoSQL databases are just opaque key/value stores), I would assert it’s an orthogonal concern, and that document-oriented databases, e.g. MongoDB, could arguably have a schema. However, MongoDB doesn’t just say it’s schema-less for technical reasons related to being a NoSQL store, it actually touts its lack of a schema as a benefit, claiming it is “agile” and offers “simplicity and power”. I find all of this confusing, as I actually want a schema.

Friday, August 3, 2012

To Laugh or Cry (UPDATED)

Readers of my old site may remember the "To Laugh or Cry?" feature. It was a series of links to material about which it was difficult to decide which of the two reactions was justified.

Note: Weekly Quotes and To Laugh or Cry? items are archived on two pages accessible from the top menu, so that they can be perused after the posts scroll.  

Here is the first link in the series:

Comment On SQL Split, The Daily WTF

Wednesday, August 1, 2012

Quote of the Week

The relational calculus is good in describing sets. But it's bad at describing relations between data in different sets. Explicit identities (primary keys) need to be introduced and normalization is needed to avoid update inconsistencies due to duplication of data.

To say it somewhat bluntly: The problem with the relational calculus and RDBMS etc. is the focus on data. It seems to be so important to store the data, that connecting the data moves to the background.

That might be close to how we store filled in paper forms. But it's so unlike how the mind works.

There is no data stored in your brain. If you look at the fridge in your kitchen, there is no tiny fridge created in your brain so you can take the memory of your fridge with you, when you leave your kitchen. --Ralf's Sudelbücher, Musings on relations - or: WinFS is not enough

Physical Data Independence

CB: In your writings, you have made the point that a data model is something that's logical (e.g. orthogonal to its actual physical implementation). Does that mean, for example, that a DBMS could be truly relational if, under the hood, it is physically built upon such traditional programming constructs as linked lists, pointers, and arrays? And even if so - are you aware of a better way to develop a TRDBMS without using such constructs?

Sunday, July 29, 2012

What Meaning Means: Business Rules, Predicates, Constraints, Integrity Constraints and Database Consistency

This is a 6/10/17 rewrite of a 7/29/12 post to bring it in line with the McGoveran interpretation of Codd's RDM[1].

To understand what's wrong with the picture of two weeks ago, namely:

"If we step back and look at what RDBMS is, we’ll no doubt be able to conclude that, as its name suggests (i.e., Relational Database Management System), it is a system that specializes in managing the data in a relational fashion. Nothing more. Folks, it’s important to keep in mind that it manages the data, not the MEANING of the data! And if you really need a parallel, RDBMS is much more akin to a word processor than to an operating system. A word processor (such as the much maligned MS Word, or a much nicer WordPress, for example) specializes in managing words. It does not specialize in managing the meaning of the words ... So who is then responsible for managing the meaning of the words? It’s the author, who else? Why should we tolerate RDBMS opinions on our data? We’re the masters, RDBMS is the servant, it should shut up and serve. End of discussion." --Alex Bunardzic, Should Database Manage The Meaning?
it helps to consider the quote in the context of the author's other article, "The Myth of Data Integrity", where he reveals that those "DBMS opinions" are integrity constraints (the article has been deleted, but a few comments remain online and are highly recommended for a feel of the consequences of lack of foundation knowledge).

Hugh Darwen concurred:

"Couldn't agree more. The strength of the relational model lies in its total abandonment of meaning. BTW, I detest the term "semantic constraint" that some people use, imagining that some constraints are to do with meaning and others are not. They are all just constraints. The word "meaning" is bandied about sometimes in ways that make me wonder if some people don't know what it means, whether they spell it that way or the posh way, "semantics.""
Now, relations can indeed represent facts about anything and a RDBMS manipulates them mathematically as sets, independently of what they mean (just like arithmetic manipulates numbers). This is due to its foundation in logic, which is[2]:
"... an analytical theory of the art of reasoning whose goal is to systematize and codify principles of valid reasoning. It has emerged from a study of the use of language in argument and persuasion and it is based on the identification and examination of those parts of language which are essential for these purposes. It is formal in the sense that it lacks reference to meaning. Thereby, it achieves versatility: it may be used to judge the correctness of a chain of reasoning (in particular, a “mathematical proof”) solely on the basis of the form (and not the content) of the sequence of statements, which make up the chain." --R. R. Stoll
It is also true that few data professionals understand what meaning in database management means—Bunardzic being an example.

But as McGoveran points out, were logic or the RDM to "totally abandon meaning", they would be unusable! As Stoll points out, it is only the formal language "essential for argument and persuasion" (i.e., deduction) that "lacks reference to meaning" and thereby "achieves versatility." Logic does not abandon meaning, it merely lacks reference to it, and only in its formal deductive part, not in its equally important methods of formal interpretation (semantics), which are essential to its application; the RDM is that theory applied to database management. As I have explained many times, it is impossible to interrogate a database sensibly, ensure correct results and interpret them correctly without knowledge of the meaning assigned to relations and the database by their designer. That is why meaning should be documented within the system and accessible to users on demand (which it currently is not). What is more:
  • In the context of formal systems, meaning and semantics are distinct and the latter is not just a "posh spelling" of the former: meaning is informal, semantics is formal.
  • All meaning that can be formally captured relationally is enforced as constraints.
  • The RDM can capture more of it than the current, often erroneous, understanding would have you believe--certainly more than SQL DBMSs, with which true RDBMSs are constantly confused, support[3].

Business Rules and Meaning


Relations are sets that represent in the database facts about object groups in the real world. Their meaning is conveyed by informal descriptions of the groups, expressed in natural language, that specify their defining properties[4]. I refer to such descriptions as business rules. I already explained[5] that, per McGoveran, there are several types of properties that rules specify to describe a group of related groups, which is what a database represents. The meaning of every relation and of the database as a whole is conveyed by the conjunction of the rules that specify all the properties of those types.

The following example is for explanatory purposes only and is not intended to be realistic. Consider three object groups of interest--parts, suppliers and supplies--where a supply is a relationship between a supplier and a part. Each group is represented by a relation, with
SUPPLIES {S#,P#,QTY}
representing the supplies group. The kind of informal English interpretation—meaning—that would usually be given for SUPPLIES is:
Supplier with supplier number S# supplies part with part number P# in quantity of QTY
which we shall refer to as the supply rule. First, it specifies only the three first order properties (1OP) shared by supply objects--supplier number, part number and quantity--and the second order property (2OP) rule arising from the relationship between them. So it does not convey the full meaning of SUPPLIES, which includes rules that specify other 2OPs, if any, as well as third order properties (3OP). Second, natural language is ambiguous:
"Does this mean QTY is the packaging quantity? The only quantity ever supplied? Is the tuple recording a previously supplied quantity? Conclusion: "Supplier S# supplies part P# in quantity QTY" is ambiguous! It takes a lot more English verbiage to make the meaning clear." --David McGoveran
In short, the supply rule is a poor assignment of meaning to the relation.

Predicates and Constraints


Users familiar with the reality represented by SUPPLIES (as they should be to use the database) believe they know what suppliers, parts and supplies are and, therefore, what the database means--i.e., what it is intended to represent. They work around incompleteness and ambiguity by inferring meaning from familiarity, but their interpretation can easily be, and often is, at odds with the designer's intended meaning--which is the problem!

For example, all objects (including supplies) are distinguishable in the real world—i.e., uniquely identifiable—by a combination of one or more 1OPs. So if they know that for the period during which supply facts are recorded there is only one supply of a specific part by a specific supplier, users can infer the distinguishability rule:

Supplies are uniquely identified by the combination of supplier number and part number
This is a multi-object rule that specifies a 3OP arising from a relationship between all supplies in the group—uniqueness.

A DBMS does not understand informal meaning as users do--it can only manipulate abstract symbols mathematically--so rules must be expressed formally. In the case of RDBMSs, that means being expressed symbolically in first order predicate logic (FOPL), or an equivalently expressive formal language. The supply rule formalizes as a ternary (3-place) FOPL predicate:

S( S#,P#,QTY )
But because the supply rule does not specify all the properties (not even all the 1OPs!), the S() predicate does not formalize the full meaning of SUPPLIES. A DBMS cannot make the kind of inferences users might and is, therefore, unaware of the uniqueness predicate corresponding to the distinguishability rule:
{EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR NOT ((<S#,P#,QTY1>|SUPPLIES)
 AND (<S#,P#,QTY2>|SUPPLIES))}
where:

  • EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) is a binary equality predicate asserting that two arbitrary tuples are equal--i.e., that all their values are equal;
  • | symbolizes 'in', or 'belongs to';
  • < and > set off a tuple.

Its informal English interpretation is "either two tuples are equal--i.e., the same tuple--or they are not both in SUPPLIES".

This is the way to express tuple uniqueness in FOPL, and the predicate can be added as a conjunct to the S() predicate:

S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR NOT [(<S#,P#,QTY1>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES)]}

Constraints and Database Consistency


Relation-specifying predicates corresponding to group-describing rules can be expressed in a form that, when declared to the DBMS in a declarative FOPL-based data sub-language, can be enforced by it to constrain relations to be consistent with the rules, for which reason we shall refer to expressions in that form as constraints.

Here is the constraint expression of the conjunction of S() with the uniqueness predicate:

FORALL (<S#,P#,QTY1>),(<S#,P#,QTY2>)
[S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>)
OR NOT ((<S#,P#,QTY1>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES))}]
the interpretation of which is "For all tuples <S#,P#,QTY1> and all tuples <S#,P#,QTY2>, tuple <S#,P#,QTY1> satisfies S() and either tuple <S#,P#,QTY2> equals tuple <S#,P#,QTY1> or else it is not the case that both tuple <S#,P#,QTY1> is in SUPPLIES and tuple <S#,P#,QTY2> is in SUPPLIES"--another way of saying that there are only tuples with unique (S#,P#) values in SUPPLIES.


Note that while rules are expressed in natural language in real world terms (groups, properties, objects), constraints are expressed in FOPL in database terms (relations, domain/attributes, tuples).

The predicates corresponding to all the rules can be added as conjuncts in the same way. Thus:

  • The conjunction of the constraints corresponding to all the 1OP, 2OP and 3OP rules describing an object group comprises the relation predicate (RP) of the relation, which represents in the database the formalized meaning--semantics--of the relation representing the group;
  • The conjunction of all the RPs with the constraints corresponding to the 4OP rules comprises the database predicate (DBP), which represents the semantics of the database representing the group of groups.
Note: In Appendix A in [6] I provide a more detailed example of constraint formulation and verification, but the reader is warned that it is "proof of concept" rather than rigorous.
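To see how such a conjunction behaves, here is a minimal Python sketch of my own (illustrative only, not from the RDM literature): each constraint is a predicate over the whole relation, and the RP is their conjunction. `qty_positive` is an invented extra constraint, not part of the example above.

```python
# A relation modeled as a list of (S#, P#, QTY) tuples.

def key_unique(rel):
    # 3OP constraint: no two tuples share the same (S#, P#) values
    keys = [t[:2] for t in rel]
    return len(keys) == len(set(keys))

def qty_positive(rel):
    # Hypothetical additional constraint on the QTY attribute
    return all(t[2] > 0 for t in rel)

CONSTRAINTS = [key_unique, qty_positive]

def relation_predicate(rel):
    # The RP is the conjunction (logical AND) of all declared constraints
    return all(c(rel) for c in CONSTRAINTS)
```

A relation satisfies its RP only if it satisfies every conjunct; violating any single constraint falsifies the whole conjunction.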


Integrity Constraints


To be enforced, FOPL constraints must be expressed and declared to the DBMS in the syntax of a data language native to it—in which case we refer to them as integrity constraints.

In the distinguishability case, an identifier is represented formally in the database by a key—a combination of one or more attributes that satisfies a uniqueness constraint in the relation. A primary key (PK) integrity constraint is both (1) a predicate that identifies the PK and (2) a uniqueness predicate satisfied by the PK, expressed in a specific data language.

Unfortunately, there is currently no declarative, relationally complete FOPL-based data language, but the shorthand in SQL is an example of possible syntax for a PK integrity constraint:

CREATE TABLE supplies
 (S# CHAR(3),
  P# CHAR(3),
  QTY INTEGER,
  PRIMARY KEY (S#,P#));
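For a concrete feel of the enforcement, here is a minimal sketch of my own using SQLite through Python's sqlite3 module (an illustration, not from the original text; the S# and P# identifiers are double-quoted because '#' is not legal in a bare SQL identifier):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE supplies (
    "S#" CHAR(3),
    "P#" CHAR(3),
    QTY  INTEGER,
    PRIMARY KEY ("S#", "P#"))""")

conn.execute("INSERT INTO supplies VALUES ('S1', 'P1', 100)")
try:
    # Same (S#, P#) pair, different QTY: violates the PK constraint
    conn.execute("INSERT INTO supplies VALUES ('S1', 'P1', 300)")
except sqlite3.IntegrityError:
    print("duplicate (S#, P#) rejected")
```

Any second tuple with the same (S#,P#) values is rejected regardless of its QTY, consistent with the uniqueness predicate.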

To understand how integrity constraints lend themselves to algorithmic implementation for enforcement by a DBMS, it helps to think of the constraint


FORALL (<S#,P#,QTY1>),
FORALL (<S#,P#,QTY2>)
[S(S#,P#,QTY1) AND {EQ(<S#,P#,QTY1>,<S#,P#,QTY2>) OR
NOT ((<S#,P#,QTY1>|SUPPLIES) AND (<S#,P#,QTY2>|SUPPLIES))}]

in procedural terms as two nested loops.
"The first loop steps through all tuples in the SUPPLIES relation selecting a tuple for each pass, while the inner loop selects a second (not necessarily distinct) tuple. For each selection of specific values for the first tuple (the outer loop) and second tuple (the inner loop), the bracketed expression is evaluated. Notice that the tuple selected by the inner loop can differ from the first tuple only in the values of QTY1 and QTY2, with S# and P# being fixed until the  next iteration of the outer loop. So, this means that for a specific tuple with its values of S# and P#, there can be only one QTY value if both the first and second tuples belong in SUPPLIES." --David McGoveran
Note carefully that the constraint enforcement procedure is undertaken transparently by the DBMS and is implicit in the user's declarative syntax, such as the SQL shorthand.
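The two nested loops can be sketched in code--a hypothetical Python illustration of the logic only (the names are mine), checking a whole relation at once rather than per insert as a DBMS would:

```python
def pk_satisfied(supplies):
    """Outer loop selects a first tuple, inner loop a second one;
    the constraint holds only if no two distinct tuples share
    the same (S#, P#) values."""
    rel = list(supplies)
    for t1 in rel:                 # outer loop: first tuple
        for t2 in rel:             # inner loop: second tuple
            if t1[:2] == t2[:2] and t1 != t2:
                return False       # same (S#, P#), different QTY
    return True
```

For example, `pk_satisfied([("S1","P1",100), ("S1","P2",200)])` holds, while adding ("S1","P1",300) falsifies it.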

Note: A PK integrity constraint can only be enforced by such an algorithm, which SQL systems implement physically with a unique index. For each tuple that is a candidate for insertion there are tuples already in the index--the <S#,P#,QTY1> tuples (or none, in which case the new tuple cannot be rejected)--and the new tuple, <S#,P#,QTY2>. The integrity constraint is applied separately for each insert. This is yet one more reinforcement of the logical-physical confusion (LPC): the SQL user is led to think in terms of the physical implementation rather than the logical constraint declaration. DBAs often drop the index, then wonder why enforcing uniqueness requires so much processing when they recreate it.

 
With the integrity constraint enforced, only tuples with unique (S#,P#) values can be inserted in SUPPLIES, consistent with both the supply and distinguishability rules.


Conclusion


A RDBMS enforces the integrity constraints corresponding to the RPs and DBP declared to it, constraining the database to be consistent with the business rules describing the reality it represents.

First, integrity constraints are not "RDBMS opinions"--that is nonsense--and the avoidance of reference to meaning in the deductive system implies neither that "integrity is a myth", nor "total abandonment of meaning". Meaning is conveyed by users' business rules, and integrity constraints are formal expressions thereof, which the DBMS can enforce algorithmically. As Erwin Smout commented, by enforcing them the DBMS ensures database consistency with the meaning the users themselves--via the designer--assigned to it. In this sense "semantic constraints" is not nonsense, even if the DBMS does not understand meaning as users do. Indeed, it is the failure to appreciate the semantic function of constraints that is responsible for many industry misconceptions and for the weak integrity support of SQL DBMSs[7]--among them the illusion, which I often deplore, that the meaning of a relation can be inferred from sheer inspection of its tabular display, or comments such as this[8]:

"I just took "Please stop insisting that Primary and Foreign keys are mandatory. They are good design habits but by no means mandatory" to be a sloppy way of saying that defined constraints are not mandatory."
Second, consistency with the full meaning of relations and the database is ensured by enforcing the integrity constraints corresponding to the RPs and DBP--that is, to all the rules--and that is possible only with a declarative, relationally complete FOPL-based data sub-language, which has not yet been implemented. Any relational constraint that is expressible in FOPL but not in the data language remains an integrity risk, unless enforcement is relegated to application or procedural code--the approach that database management was explicitly intended to end, for a multiplicity of reasons.

Third,

"A name of a database object—whether of a database, relation, domain, attribute, or tuple—should never be treated as simply a reference to data to be manipulated, but as a mnemonic for its formal set definition by constraints and a reference to that set. The DBMS can then find and manipulate data members of those sets in the usual manner, but according to the expression that results after substitution of the appropriate defining integrity constraints for those names." --David McGoveran
Fourth, whether we like it or not,
"All semantics that can be formalized in FOPL--including verbs such as ‘supplies’—formalize as constraints. We use constraints to reduce the number of permissible interpretations—meanings—of the purely abstract FOPL deductive system. Codd did bring integrity to the forefront by making it an explicit component of a formal data model, but that does not mean that all the semantics that can be captured formally, have been captured in the standard description of RDM, let alone supported in products. In 1979 Codd described a way to "capture" meaning using the relational formalism beyond the then current understanding. That formalism doesn't tell you how to discover meaning, but if you have it, he shows (at least to some degree) how to express the corresponding semantics relationally." --David McGoveran
None of this is properly understood in the industry.


References



[1] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, (forthcoming)

[2] Stoll, R. R., SET THEORY AND LOGIC

[3] Codd, E. F., Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst. 4(4): 397-434 (1979)

[4] Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Commun. ACM 13(6): 377-387 (1970)



[7] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER