In April, 2005, Jim Gray with Mark Compton published A Call to Arms inACM Databases, Vol. 3, No. 3, yet another
“manifesto” on the direction of database technology. Gray's basic premise is
that “traditional relational” database management is sagging under the
onslaught of modern computing challenges. But what he is really indicting are SQL
DBMSs by the major vendors—Oracle, Microsoft (SQL Server) and IBM (DB2). He
appeals to the prejudices of many in the information technology field, but
provides little hard evidence in support of his arguments. He uses simplistic
anecdotes and distorts history and current reality to support his premise. Gray
bundles together a plethora of current practices and techniques, and intimates
that all should be integrated into database systems.
This response examines a number of his points, beginning with
his indictment of relational technology, and attempts to separate reality from
opinion. The task is daunting, because the article tries to encompass most
aspects of current computing techniques, not just those who clearly belong in
the database management field. Rather than cover all his proposals, this
article will concentrate on several salient ones, while touching on a few other
topics.
Is Relational Becoming Irrelevant?
Is the relational database architecture—as epitomized by SQL
DBMSs—really sagging at the knees? Or, are the major SQL vendors not keeping
their eye on the ball?
The major SQL DBMS vendors virtually ignore relational
principles. SQL itself is notorious for weaknesses in its relational fidelity,
e.g. weak integrity support, optional primary keys, lack of true domain
support, and so on (see A GUIDE
TO THE SQL STANDARD). SQL92 did make some improvements in relational
compliance over SQL89 and SQL86 (but didn't fix the above problems.) However,
SQL99 and later standards made a sharp turn away from relational concepts in
introducing features that violate the relational model (RM), e.g. pointers.
Open source SQL DBMSs are even worse, almost defiant about
non-conformance. Rather than "polishing the round ball" (in the words
of Michael Stonebraker), vendors have concentrated on marketing
differentiation, non-database features, and quirky optimizers requiring
vendor-specific tuning that locks customers into their proprietary solutions.
Improving relational compliance is the lowest priority, or is often dismissed
outright.
Yet full or even just better compliance with RM offers true
solutions to many of the issues that Gray raises. He disregards this, treating
weaknesses of current products as evidence of the failings of RM which they
failed to implement. He offers XML and XQuery as alternatives, supposedly
based on 'developer' preferences. This is a rash judgment, supported by neither
past, nor current experience.
XML and XQuery are insolubly flawed, the former being just
another incarnation of the discredited hierarchical data model that RM
superseded. XQuery brings the excessive complexity that hierarchy instilled in
the products that preceded SQL. A relational query language uses only one
method to access and manipulate information—by data values. XQuery has at least
three—by value, by navigating the hierarchy and through external linkage. This
adds complexity, but no power, exactly the problem that Codd intended to
eliminate with RM.
XML is may be appropriate for data transfer between
systems, for streaming operations (although it was not necessary, nor the most
efficient for that purpose). But XML is not a good data management technology
(see The Exchange Tail
and the Management Dog).
Dichotomy Between Data and Procedures
Gray claims that there is an “artificial separation” between
data and procedures (or later in his article, algorithms) in 'traditional'
relational databases. He asserts that this dichotomy began with the COBOL
language and the COBOL DBTG (Database Task Group). But this distorts history
and current reality. RM has nothing to do with COBOL, and actually predates
DBTG. It is not “pure data”; it defines powerful set operators for
manipulating data using relational algebra and calculus languages,
which are declarative rather than procedural (SQL is an improvement over
languages preceding it, but it is not always as declarative as it should and
could have been).
Perhaps Gray thinks that RM rejects procedural code for most
database processing, and requires declarative interfaces. But there are very
good reasons for this stance. Procedural code is convoluted and obtuse. It is
difficult for humans to verify correctness, or even to understand its purpose.
Machine verification of procedural code is still not possible, and is not
likely to occur in the near future.
Another major problem with procedural code is that
optimization techniques yield limited gains in efficiency, even after years of
research in this area. Functional, logic and declarative code allow much
greater improvements. An optimizer for a declarative language can yield
multiple levels of magnitude improvement. Correctness is easier to verify.
Note: The major Object-Oriented Languages, Java, C++,
etc., are procedural in nature.
For these reasons, procedural code is generally a poor choice
for binding with databases. Even so, most SQL DBMSs support stored
procedures and triggers written in procedural languages. At least
some of the impetus for adding them is due to the weakness of SQL as a full
declarative, relational language. Gray thinks that we've polished the SQL round
ball enough and looks to OO and XML for solutions, certainly a popular view
(more on this later). But if we are must discard SQL altogether—which may be a
good idea on the one hand, but difficult practically on the other—why not
develop a good truly relational data language?
Procedural needs can be satisfied by host application
development languages, as Codd envisaged with his concept of an embedded data
sub-language; SQL is just a poor implementation of that idea.
Synergy Between Objects and Database
OO programming is widely used today, and RM is the
dominant database technology, at least insofar as SQL adheres to it. Together,
they power many applications in use. However, many consider this marriage to be
in trouble. They describe the problem as an “impedance mismatch” between OO
applications and relational—that is, SQL—database systems. They usually
attribute this mismatch to the difference in the way information is represented
in OO—as pointer-linked lattices of objects, and the relational
representation of information, as logically linked tables, based on data
values.
One reason for this so-called mismatch is the fundamental
distinction between the concerns of an individual application and those of a
shared database system. An application works within a subset of the total
problem space. It accesses and manipulates a privately viewed subset of a
database. A database system, on the other hand, concerns itself with
serving multiple applications with different views of the database, and
maintaining the security, integrity and accessibility of the shared data; a
database and DBMS service a variety of reporting capabilities, batch processing
and ad-hoc access, all using different subsets of the same data in different—and
changing—ways.
Gray touts SQLJ (an OO style embedded SQL) as a “nice”
integration of SQL and Java. The reality is that SQLJ is not so nice, and is
rarely used. Most OO applications either access the database API and use the
values directly, construct their own objects, or use Object/Relational (OR)
wrappers like Hibernate for Java. All this occurs on the client, so the DBMS
backend is not involved. Unfortunately, the special requirements of OR wrappers
sometimes drive the design of the database, to the detriment of general
usability of the shared resource.
The modern OO languages, such as Java and C#, compile to an
intermediate form which then is compiled to machine code at runtime. Aside from
full portability and greater efficiency, this makes them host language
candidates for embedding a truly relational data language. A runtime compiler
can analyze usage patterns for improved optimization and can recompile portions
dynamically as it gathers newer statistics. These languages also offer greater
security. The runtime environment employs a verifier for the intermediate code
to protect against malicious code or inadvertent corruption. The compiled code
runs in a 'sandbox' that prevents interference with other parts of the DBMS.
Other Topics
This section briefly looks at other areas in Gray's article.
Most, but not all, show promise in improving current database systems, but are
orthogonal to RM.
Self-Managing and Always-Up Database Systems
Like many of the concepts mentioned in Gray's article,
self-managing and always-up database systems already exist with these
characteristics, proving again that Gray is overstating his case in claiming
relational database architectures are “slowly sagging to their knees”. In fact,
truly relational DBMSs (TRDBMS) are uniquely able to deal with these
situations, because of their physical data independence. Additional information
on this topic can be found in Highly Available
Databases, (the major SQL DBMSs have limited capabilities in
this area).
Improved Physical Structures
RM and even the SQL Standard are logical in nature, and
silent about physical details This allows product designers complete physical
freedom in implementation. Unfortunately, current SQL DBMSs tend to use simple
storage structures and access paths (primarily direct-image storage,
indexing, hashing and sorting). They use random access disks as the storage
medium, converting data to octets (8 bit bytes) using serialization. Random access
and serialization can be expensive, especially for objects.
There is much room for improvement here. In-memory storage is
a promising technique supported by some newer RDBMSs. In-memory structures are
direct access and can avoid conversion/serialization to achieve
high-performance processing. This is especially important for systems with
inefficient or no random access storage.
Active Databases and Publish/Subscribe Capabilities
A system that manages active databases keeps clients
up-to-date about any changes to queried data as they occur. For example, the
client sends a query to the database selecting and retrieving a set of data. As
long as the client keeps the query open, it will receive notification of any
change (Insert, Update, Delete) made to the backend database. A
publish/subscribe facility provides a similar capability--retrieving an initial
data set and then receiving notifications of changes to the set. SQL
implementations of active databases (e.g. via event triggers) have
existed since the '90s.
Federated Databases
A federated database system—a new name for distributed
DBMS—combines two or more distinct servers under a single interface. A common
example of their use is a single query that joins tables from different
databases. The problem with federated databases is that they are not efficient
in the general case, because only limited optimization is possible.
If each participating database system is from the same vendor
(homogeneous databases), they can communicate to achieve some reasonable
optimization plan. However, this isn't feasible for heterogeneous databases. A
standard optimization dialogue/protocol to support this is prohibitive, because
of divergent proprietary implementations. An efficient federation of
heterogeneous databases is a pipe dream. A number of implementations have been
developed e.g. R*Star at IBM, Ingres*Star, but did not fare well for that
reason.
Evolving Schemas
Gray is using an OO mindset when referring to schema
evolution, also an essential component in the Extreme Programming (XP)
paradigm. Because OO lacks a data model akin to RM, constant restructuring of
the object design and application maintenance becomes a prohibitive necessity.
Schema changes are much less common in a relational
environment. Relational and normalization principles, when adhered to, create
flexible designs that minimize integrity and evolution burdens. Should
structural changes become necessary, normalized designs and relational views
can ease the transition and limit the undesirable effects of the change. Some
applications may require no changes at all. Indeed, that was one of the major
objectives of RM.
Lee Fesperman is a database practitioner who has
worked with relational technology for over 30 years. He has developed a number
of database systems based on the relational model and helped start DATABASE
DEBUNKINGS. Lee is currently president and CEO of FFE Software, Inc. (www.firstsql.com) and can be reached at
lee.at.firstsql.com.
FFE Software, Inc. is developing database systems. The flagship product is FirstSQL/J ORDBMS, which supports SQL92 with
relational extensions. The product features some capabilities mentioned in this
article—self-managing and always-up databases, and in-memory processing. It
also provides complex domains, stored procedures, triggers and expression
functions implemented in Java; this is detailed in An ORDBMS that is Truly
Relational (forthcoming at www.firstsql.com).
Ed. Note: See also Date and Darwen’s assessments of Gray’s
piece (which were way too mild for my taste) at The
Third Manifesto. My own critique, A Call to Reason will now be
forthcoming elsewhere or here, now that my DBAzine column was
canceled.
Posted: 8/26/05