Recently Chris Date sent me a three-part article, Agile
Data Modeling by Scott Ambler, published in the Software Development
Magazine, with the note “A certain amount of nonsense in here that you may
want to debunk”.
Readers may recall that we have referred on several occasions
to pronouncements by Ambler, be it to debunk them e.g. On the Level of Discourse
in the Industry, or as weekly quotes posted in 20002 (10/6, 12/1) and 2004 like the above. Based
on those my instinct was to invoke Date’s own Incoherence Principle but,
sigh, debunking is what I do, so here goes.
My expectations were confirmed right second paragraph:
“Contrary to some claims, there's nothing special about data modeling”. What
the heck does this mean???
The first version of the KSMS [agile model] supports the
critical functions required to run the dojo: the management of basic student
data and collection of money from them. When you look at the “Initial Data
Model,” you see that we’re not tracking the student’s state/province. Because
we’re building for a single dojo that isn’t near the border, we can safely
assume that all students live in the same province. Lesson number one: Agile
data models are just barely good enough. Agile developers solve today’s problem
today and trust that they can solve tomorrow’s problem tomorrow; therefore, if
we need to support people living in another province, we’ll add that
functionality at a later date … Don’t build something until it’s needed.
We don’t have much to say about Ambler’s so-called “agile
modeling” per se (see Iterative
Software Development and No
Silver Bullet). Prioritizing system functions is, well, nothing special,
but a basic development tenet. But insofar as business modeling and database
design are concerned, we very much doubt that matters are as trivial and
feasible as Ambler implies, namely, that it is a matter of simply adding
columns, tables and applications, without significant impact on both
subsequent efforts and the preceding “good enough” results. In our experience,
business modeling is more often than not characterized poor knowledge, confusion
and incompleteness and riddled with errors, and produces serious problems that
are costly, if not impossible to resolve or work around in subsequent
development/maintenance stages. Too iterative an approach tends to defer
burdens which more than cancel out the benefits, if any, from upfront
“agility”.
Note the common, confusion of levels of representation (for
another example see Universal Data Models
and the Importance of Thinking Precisely).
What Ambler refers to as “data model” is actually a business model. As
we explain in Business
Modeling for Database Design, the former is a general formal
mechanism for mapping enterprise-specific business models to
enterprise-specific logical models that represent them in the database .
The hierarchic, network and relational models are data models. We suspect, but
are not certain, that Ambler’s “persistence model” is the logical model (level
confusion is a hallmark of object terminology).
For physical data modeling, I’m using the Unified Modeling
Language … I’ve chosen to keep my key strategy simple, using surrogate values
called persistent object identifiers (POIDs), which are unique across all
records within the database. I could have used natural keys for many tables, or
even just surrogate values unique only within each table, but I’ve found that
POIDs work incredibly well.
We have no idea what UML has to do with physical
modeling—there is nothing physical about Ambler’s endeavor—which indicates
another level confusion: what he calls physical is actually logical. As to UML,
we refer the reader to Basic
Concepts in UML: a Request for Clarification,Part
1 and Part 2.
We’ve already debunked more than once the universal
deployment of surrogate keys to emulateOIDs, which is neither
necessary, nor productive (see, for example, On Auto-Incrementing
Surrogate Key Generation, chapter 3 in PRACTICAL ISSUES IN
DATABASE MANAGEMENT, pp. 826-7 in AN INTRODUCTION TO DATABASE
SYSTEMS, 8th
Ed.) Ambler declares without explanation that POIDs “work incredibly well” and
natural
keys don’t, and provides no evidence to back up his claim. Note very
carefully that even if and when there are reasons to use surrogate keys (I
explain that in my said book), they are additions to natural keys, which
guarantee logical access familiar to users, and entity integrity.
Notice how I’m keeping the data schema normalized—a “normal”
data schema stores information in one place and one place only. Although there
are many normalization rules, some are more critical than others. First, remove
repeating groups into their own table. For example, I don’t have 10 columns to
track the last 10 payments that an individual made; instead, I introduced a
Payment table. Then ensure that your tables are cohesive by insisting that all
attributes are fully dependent on the primary key. For example, the payment
type description (for example, Visa) was captured in the PaymentType table
instead of in the Payment table because that information describes the payment
type and not an individual payment.
First, a schema (what exactly is a “normal” one?)—a
collection of structures and integrity constraints—and data are
two separate components of a database, so the former does not store the latter.
And a schema does not “store information” either, because it is a logical
construct, while storage is physical; besides, only data in databases is
physically stored, information is inferred from the data via querying.
Second, there are not that “many normalization rules”,
but rather seven normal forms; why exactly are “some are more
critical than the others”? It is true that most databases that are in 3NF are
usually also in 5NF because violations of 4NF-5NF are less frequent, but
that does not mean that when they do occur, they are less important (see
The Costly Illusion:
Normalization, Integrity and Performance).
And third, “10 columns to track the last payments” is not
an example of a repeating group, but simply poor database design (see What First Normal Form
Really Means, and What First Normal Form
Means Not). The relational equivalent to a repeating group is the
multivalued column.
Traditionally, data professionals take a relatively serial
approach to development: They start by creating a nearly complete domain data
model, which they use to create an almost complete physical data model
representing the database design. While there’s ample opportunity to update the
models as a project progresses, it’s often a difficult and time-consuming task,
because the database schema is usually set early in the project and
subsequently remains sacrosanct. This is a convenient assumption for data
professionals, as it streamlines their work, but it doesn’t reflect the
iterative and incremental processes commonly followed by developers. Minimally,
data professionals should work in an evolutionary manner; better yet, they
should take an agile approach that is both evolutionary and collaborative in
nature.
Aside from repeating both the confusion between the types of
model, and between levels of representation, Ambler criticizes data
professionals for “creating a nearly complete domain data model” which does not
serve “the iterative and incremental processes commonly followed by
developers”. But this is backwards. Databases are resources shared by multiple
applications, which must serve all applications optimally, not every
application maximally. The real problem in today’s practice is just the
opposite of that described by Ambler: instead of entrusting modeling, and
database design and administration to professionals properly educated in data
fundamentals, they are increasingly left in the hands of application
developers, most of whom are self-trained in programming languages and tools,
and have an one-application-at-a-time perspective. What is more, these days
even data professionals lack proper education in data fundamentals. They fail
to realize and appreciate the havoc that the kind of “incremental” approach
promoted by Ambler can wreak with accessibility, integrity and
development/maintenance in the long run.
Now, it is true that business circumstances are often
characterized by poor knowledge of the details that need to be recorded in
databases, and that time frames are seldom sufficient for thorough business
modeling and database design efforts. But it is one thing to recognize this as
a tradeoff of upfront effort for long-term costly development maintenance
burdens, and quite another to promote it as “agility”, oblivious to the
considerable burdens down the road. Only those with knowledge, understanding
and appreciation of data fundamentals are guaranteed not to be misled by such
illusions.
Caveat emptor.
Ed. Note: In
Scott Ambler and His Strawman Ambler assesses this review as follows:
And yet many other people think that my article is pretty darn
good. Pascal criticism, if you're actually to read it, is inane. He
complains about disagreeing with the requirements, interesting but it
has nothing to do with the actual technique,
and complains about how difficult it is to collect people's emails after the
fact if you don't get the email column in the database for your first release.
The problem domain was a karate school, and I can safely tell you that
it's possible to teach people karate without knowing their email address.
Furthermore, considering how volatile email addresses are, and
considering that many people don't have them, the
quality of the data in that column will always be questionable at best.
I dare anybody with minimal thinking capacity to find any
relationship between my criticism and his understanding of it. I’m afraid the
problem here goes beyond just lack of database foundation knowledge.
See also On “Agile Modeling”, On Logical Persistence
Models and IEEE Database Certification.
Revised 4/28/06
Posted 10/1/04
© Fabian Pascal 2000-2006 All Rights Reserved