AGILITY DERIVES FROM, IS NOT SUBSTITUTE FOR FOUNDATION KNOWLEDGE
by Fabian Pascal

 

 

 

Recently Chris Date sent me a three-part article, Agile Data Modeling by Scott Ambler, published in the Software Development Magazine, with the note “A certain amount of nonsense in here that you may want to debunk”.

 

Readers may recall that we have referred on several occasions to pronouncements by Ambler, be it to debunk them e.g. On the Level of Discourse in the Industry, or as weekly quotes posted in 20002 (10/6, 12/1) and 2004 like the above. Based on those my instinct was to invoke Date’s own Incoherence Principle but, sigh, debunking is what I do, so here goes.

 

My expectations were confirmed right second paragraph: “Contrary to some claims, there's nothing special about data modeling”. What the heck does this mean???

 

The first version of the KSMS [agile model] supports the critical functions required to run the dojo: the management of basic student data and collection of money from them. When you look at the “Initial Data Model,” you see that we’re not tracking the student’s state/province. Because we’re building for a single dojo that isn’t near the border, we can safely assume that all students live in the same province. Lesson number one: Agile data models are just barely good enough. Agile developers solve today’s problem today and trust that they can solve tomorrow’s problem tomorrow; therefore, if we need to support people living in another province, we’ll add that functionality at a later date … Don’t build something until it’s needed.

 

We don’t have much to say about Ambler’s so-called “agile modeling” per se (see Iterative Software Development and No Silver Bullet). Prioritizing system functions is, well, nothing special, but a basic development tenet. But insofar as business modeling and database design are concerned, we very much doubt that matters are as trivial and feasible as Ambler implies, namely, that it is a matter of simply adding columns, tables and applications, without significant impact on both subsequent efforts and the preceding “good enough” results. In our experience, business modeling is more often than not characterized poor knowledge, confusion and incompleteness and riddled with errors, and produces serious problems that are costly, if not impossible to resolve or work around in subsequent development/maintenance stages. Too iterative an approach tends to defer burdens which more than cancel out the benefits, if any, from upfront “agility”.

 

Note the common, confusion of levels of representation (for another example see Universal Data Models and the Importance of Thinking Precisely). What Ambler refers to as “data model” is actually a business model. As we explain in Business Modeling for Database Design, the former is a general formal mechanism for mapping enterprise-specific business models to enterprise-specific logical models that represent them in the database . The hierarchic, network and relational models are data models. We suspect, but are not certain, that Ambler’s “persistence model” is the logical model (level confusion is a hallmark of object terminology).

 

For physical data modeling, I’m using the Unified Modeling Language … I’ve chosen to keep my key strategy simple, using surrogate values called persistent object identifiers (POIDs), which are unique across all records within the database. I could have used natural keys for many tables, or even just surrogate values unique only within each table, but I’ve found that POIDs work incredibly well.

 

We have no idea what UML has to do with physical modeling—there is nothing physical about Ambler’s endeavor—which indicates another level confusion: what he calls physical is actually logical. As to UML, we refer the reader to Basic Concepts in UML: a Request for Clarification,Part 1 and Part 2.

 

We’ve already debunked more than once the universal deployment of surrogate keys to emulateOIDs, which is neither necessary, nor productive (see, for example, On Auto-Incrementing Surrogate Key Generation, chapter 3 in PRACTICAL ISSUES IN DATABASE MANAGEMENT, pp. 826-7 in AN INTRODUCTION TO DATABASE SYSTEMS, 8th Ed.) Ambler declares without explanation that POIDs “work incredibly well” and natural keys don’t, and provides no evidence to back up his claim. Note very carefully that even if and when there are reasons to use surrogate keys (I explain that in my said book), they are additions to natural keys, which guarantee logical access familiar to users, and entity integrity.

 

Notice how I’m keeping the data schema normalized—a “normal” data schema stores information in one place and one place only. Although there are many normalization rules, some are more critical than others. First, remove repeating groups into their own table. For example, I don’t have 10 columns to track the last 10 payments that an individual made; instead, I introduced a Payment table. Then ensure that your tables are cohesive by insisting that all attributes are fully dependent on the primary key. For example, the payment type description (for example, Visa) was captured in the PaymentType table instead of in the Payment table because that information describes the payment type and not an individual payment.

 

First, a schema (what exactly is a “normal” one?)—a collection of structures and integrity constraints—and data are two separate components of a database, so the former does not store the latter. And a schema does not “store information” either, because it is a logical construct, while storage is physical; besides, only data in databases is physically stored, information is inferred from the data via querying.

 

Second, there are not that “many normalization rules”, but rather seven normal forms; why exactly are “some are more critical than the others”? It is true that most databases that are in 3NF are usually also in 5NF because violations of 4NF-5NF are less frequent, but that does not mean that when they do occur, they are less important (see The Costly Illusion: Normalization, Integrity and Performance).

 

And third, “10 columns to track the last payments” is not an example of a repeating group, but simply poor database design (see What First Normal Form Really Means, and What First Normal Form Means Not). The relational equivalent to a repeating group is the multivalued column.

 

Traditionally, data professionals take a relatively serial approach to development: They start by creating a nearly complete domain data model, which they use to create an almost complete physical data model representing the database design. While there’s ample opportunity to update the models as a project progresses, it’s often a difficult and time-consuming task, because the database schema is usually set early in the project and subsequently remains sacrosanct. This is a convenient assumption for data professionals, as it streamlines their work, but it doesn’t reflect the iterative and incremental processes commonly followed by developers. Minimally, data professionals should work in an evolutionary manner; better yet, they should take an agile approach that is both evolutionary and collaborative in nature.

 

Aside from repeating both the confusion between the types of model, and between levels of representation, Ambler criticizes data professionals for “creating a nearly complete domain data model” which does not serve “the iterative and incremental processes commonly followed by developers”. But this is backwards. Databases are resources shared by multiple applications, which must serve all applications optimally, not every application maximally. The real problem in today’s practice is just the opposite of that described by Ambler: instead of entrusting modeling, and database design and administration to professionals properly educated in data fundamentals, they are increasingly left in the hands of application developers, most of whom are self-trained in programming languages and tools, and have an one-application-at-a-time perspective. What is more, these days even data professionals lack proper education in data fundamentals. They fail to realize and appreciate the havoc that the kind of “incremental” approach promoted by Ambler can wreak with accessibility, integrity and development/maintenance in the long run.

 

Now, it is true that business circumstances are often characterized by poor knowledge of the details that need to be recorded in databases, and that time frames are seldom sufficient for thorough business modeling and database design efforts. But it is one thing to recognize this as a tradeoff of upfront effort for long-term costly development maintenance burdens, and quite another to promote it as “agility”, oblivious to the considerable burdens down the road. Only those with knowledge, understanding and appreciation of data fundamentals are guaranteed not to be misled by such illusions.

 

Caveat emptor.

 

Ed. Note: In Scott Ambler and His Strawman Ambler assesses this review as follows:

 

And yet many other people think that my article is pretty darn good.  Pascal criticism, if you're actually to read it, is inane.  He complains about disagreeing with the requirements, interesting but it has nothing to do with the actual technique, and complains about how difficult it is to collect people's emails after the fact if you don't get the email column in the database for your first release.  The problem domain was a karate school, and I can safely tell you that it's possible to teach people karate without knowing their email address.  Furthermore, considering how volatile email addresses are, and considering that many people don't have them, the
quality of the data in that column will always be questionable at best.

 

I dare anybody with minimal thinking capacity to find any relationship between my criticism and his understanding of it. I’m afraid the problem here goes beyond just lack of database foundation knowledge.

 

See also On “Agile Modeling”, On Logical Persistence Models and IEEE Database Certification.

 

 

Revised 4/28/06

Posted 10/1/04

© Fabian Pascal 2000-2006 All Rights Reserved