COMMENTS ON A JIM GRAY INTERVIEW
by Fabian Pascal

 

 

 

(Thanks to Chris Date for a review of and helpful comments on a draft of this article.)

 

A reader has recently forwarded an interview with Microsoft's Jim Gray about the future of SQL Server 2000 and "extreme scaling from databases" (whatever that means), published on February 1, 2002. For those who don't know, "[Gray] founded and runs Microsoft's Bay Area Research Center (BARC), a research lab that focuses on scalable computing by building superservers and workgroup systems from commodity software and hardware. Jim Gray was a key player in building the TerraServer, a 12-terabyte spatial data warehouse using Microsoft SQL Server 2000 as its building blocks."

 

Jim Gray is one of the recipients of the industry's top Turing Award, another recipient being Ted Codd, the inventor of the relational model. And it is in the context of the award that I was quite disappointed with the interview. I have yet to encounter one instance of association of technical expertise with a commercial interest, where the latter has not had some deleterious effect--be it significant or mild, obvious or subtle, conscious or not--on the former (Codd resigned from IBM to avoid it). The Gray interview does nothing to undermine my discomfort with such associations.

 

Now, from personal experience I am very much aware that the trade media tends to distort articles or interviews, rendering them incomprehensible, or even reversing intended meaning. This is one reason I refrain from writing for, or giving interviews to the mainstream media -- too often I would not recognize my material when it was published--and why I take the trouble to maintain this web site. I have to assume that Gray's interview is plagued at least to some degree by this or, at the very least, I would like to hope this is the case. Indeed, many of the questions make no sense and in the main I ignore those parts. Unfortunately, the average reader will be misled whether the published interview accurately reflects Gray's pronouncements or not. I would like, therefore, to emphasize that the following critique is of the interview as published.

 

Let's consider the parts of the interview that merit comments (ignoring those that make no sense, or have too obvious commercial tones).

 

Q: How will the upcoming integration of .NET into SQL Server affect performance, use of SQL Server as a data source for Web services, and so on. And how will it work? Will SQL Server essentially become a .NET host? Or just hooks with .NET Framework namespaces and classes?

 

JG:The integration of SQL Server with .NET is a major advance. It's the next step after OLE DB-integrating objects with databases. SQL Server becomes the data integrator, the way to deal with sets of objects. We have been chasing the Object-Relational rainbow and we are near the pot of gold-an extensible database. Now you will be able to store objects in the database, and you will be able to treat databases as objects. Performance should be even better with compiled languages than with interpreted T-SQL. And now stored procedures will be first-class objects.

 

Object orientation has fundamental flaws for database management. Essentially, it is a set of guidelines for "good" programming, pertinent at the application level, and extending it to data management is an unintended use and a highly questionable proposition at best, not a pot of gold. As I explain in PRACTICAL ISSUES IN DATABASE MANAGEMENT, object concepts and terminology are fuzzy, there is no agreement on them, no sound theoretical foundation and no provision of a specific data model analogous to the relational model. "Store objects in the database, and treat databases as objects" are exactly the kind of statement that object thinking induces and it is unclear what calling things "objects" contributes to the discussion To quote from Chris Date's and Hugh Darwen's THIRD MANIFESTO:

 

"... we acknowledge the desirability of supporting certain features that are commonly regarded as aspects of object orientation. However, we believe that the features in question are orthogonal to (i.e., independent of) the relational model, and hence the relational model needs no extension, no correction, no subsumption--and above all, no perversion!--in order to support [those features]."[emphasis theirs]

 

Q: SQL Server appears to be taking on more and more roles for any and all types of static data stores. What types of data stores can you foresee SQL Server replacing? The Registry? The file services? E-mail storage?

 

JG:Yes. Oracle, DB2, and SQL Server are each on a path to store "files": photos, videos, voicemail, e-mail, documents, and everything else. The integration of SQL Server with an object system (.NET) makes this attractive. Microsoft has been shooting for this for a long time. To the database folks it seems a natural evolution. To others it seems unnatural. There is a competition happening-each camp proposing a way to solve the problems. No doubt the end result will be a synthesis of all these ideas. But I think we all agree that storage management has to be more automatic and easier. We all have a common goal. The only debate is how to get there.

 

"We" don't all have a common goal. Anyway, files are an implementation, not a model, aspect and focus on them raises suspicions of a logical-physical confusion that is prevalent in the industry. The real issue is whether the DBMS supports user-defined data types of arbitrary complexity (a model aspect), namely: what representations are declared to users (not what files are physically stored) and what constraints and operators are provided for them. As I explain in Chapter 1 of my book, support of such types raises some nontrivial complications, to put it mildly, for any DBMS, ODBMS included. Gray is silent on those issues.

 

Q: How can modern DBMSs deal with the impedance mismatch between relational databases and nonrelational data (e.g., XML)? What innovations are on the horizon to make access to hierarchical and nonrelational data transparent? What's new in terms of performance and scalability with nonrelational data?

 

JG: Much has been made of the impedance mismatch. SQL is set-oriented and Cobol-C-Java-C# are iterator-oriented (one record at a time, please). The languages have rich type systems, while databases have poor type systems. This is about to change with .NET. The database is a Web service. The type system is universal. You can add databases to types; you can add types to databases. There is no longer an inside and an outside-the database is both above and below the type system. XQuery extends SQL to operate on trees and graphs (unifies SQL and XPath). That story is not over yet, but XQuery is our best answer so far. We are still exploring how to handle XML data: Do we shred it all, do we BLOb it, or something in between? Between is the obvious answer.

 

How about, Chris Date asks, a reasoned assessment as to whether XQuery is a good idea (people who understand (truly understand) the relational model, should say no!) Support of "complex" data types requires neither XML, nor XQuery.

 

According to its proponents, XML was invented to serve as a standard format for data exchange. As such, it does not require--and it did not initially provide--a data model, namely: data types, organization, integrity and manipulation. Like object orientation, XML is now being extended to data management, for which it was explicitly never intended. And, of course, this is impossible without some data model. That's why types and querying must be added, so the industry is now reinventing the database wheel. The problem is, of course, that the XML structure is hierarchic and, thus, nonrelational, so the wheel being reinvented is a "square" one, one that we discarded decades ago (see my articles on XML in the Against the Grain series. Thus, not only is SQL a poor implementation of the relational model, but it is now further bastardized to support obsolete nonrelational databases (not to mention pointers!!!) And this is considered progress?

 

Q: What are your predictions about the future of SQL Server Analysis Services in the business world, and the fate of OLAP, data mining, and business intelligence in general? Is Microsoft doing enough to evangelize the usefulness of these technologies in industry? What prevents these technologies from becoming more mainstream and commonplace? Will Microsoft's new Data Analyzer product help here?

 

JG: Data Analyzer will definitely help Analysis Services both in terms of OLAP and data mining. One aspect of SQL Server is that everyone gets everything. There is no separate charge for replication or for OLAP or for data mining or English Query. So the fact that these features are widely used is not so obvious. I am told that each of these features is very widely used. Certainly, when I go to the users-group meetings and when I lurk in the discussion groups, there is a huge amount of interest and innovation in these areas. The basic evolution seems to be that people build an operational system using SQL Server, and then they build some ancillary decision-support systems using that data. Most of those ancillary systems are built with Analysis Services. This might not seem mainstream, but the integrated system and tools make it easy to build these analysis systems. This is a reasonable way to operate.

 

If "everybody gets everything" is only one aspect of SQL Server, I wonder what the others are. This is a good example of the discomfort I was referring to: such claims--not exactly technical--are not likely to be made except in a commercial context.

 

From a technical standpoint it is important to distinguish between industry practices as they are, and what they ought to be. OLAP is what people are doing, but is it an appropriate solution to the problem which it purports to solve? The fact is that, like object orientation and XML, the emergence of OLAP and so-called "analytical services" is yet another example of the industry ignoring data fundamentals (see On Business Rules and Integrity Constraints: A Reply to Ralph Kimball. It is yet another regression to the bad old days of application-biased files, which were discarded in favor of centralized, application-neutral databases. It is because the industry chose SQL instead of a true and full implementation of the relational model, that commercial products are inhibited in functionality and performance, forcing users to resort to obsolete, unproductive workarounds, which had already been discarded in the past. A true implementation of relational technology would have obviated the need OLAP (and, as I alluded in my writings, there is now an implementation technology which has the potential to facilitate truly relational products).

 

Q: Do you see Microsoft technologies gaining greater acceptance among academic computer scientists? Do you think the ubiquity of Unix in the academic world is a hindrance to Microsoft's success in the business world?

 

JG: All my academic friends have a Windows system for their PowerPoint presentations. But they often have a Solaris or Linux box where they do their programming. Certainly it would be better for Microsoft if everyone used only our stuff, but that will never happen. In the SkyServer project the heterogeneity costs us a lot of energy. We have to support Netscape 4-6 and Opera. All the client tools have to work on Linux. That makes it much more complex to deploy apps. But that's the reality. As for Microsoft making progress in academia, that is certainly happening, and it is mostly being driven by our tools people. Academics are working with languages like Eiffel to make them compatible with our .NET technologies so that Visual Studio .NET is a great environment for programming with any language. The benefit to the academics and their students is that they get a robust, highly usable environment: a debugger, integrated help, rich editing capabilities, and so on.

 

Shouldn't academia be about education, rather than products? Turns out that it is not less prone to succumb to industry hype than the business world (see my two editorialsand Denormalization for Performance – Et Tu Academia?).

 

Q: What do you see for these kinds of databases in the next decade? What will we be doing with data then?

 

JG: When data volumes get large, you can't grep them any more, you have to use an index. That's when you need some kind of database. My group has been trying to figure out how much you should put in the database-everything or just the indices. As I get into it, I find it is a lot easier to put everything in the database: photos, videos, images, and of course, text. Sometimes it is just BLOb data, but often you want to "parse" the data and extract as much meta information as possible. For example, we built a database that has all our personal photos. The classic design would have these photos in a file system and just the metadata in the database. But it is so much easier to manage when everything is in the database. There is only one thing to back up and to secure-everything is just simpler. There are fewer design decisions to make. My prediction is that eventually all storage systems will evolve to be database systems.

 

The logical-physical confusion raises its ugly head again. It is hard to believe that the justification for databases is being reduced to a need for indexes. Was the migration from application programs and files to DBMSs and databases mainly for improved performance via indexes? Weren't ISAM files indexed? Interestingly, the current generation of Java programmers, who lack knowledge of data fundamentals, explicitly dismiss the use of databases and DBMSs for precisely the opposite reason: they claim that databases are the "performance bottleneck" and should be avoided at all costs. For example:

 

"With Oracle9i anything you can model in Java, you can model in Oracle. If just a few good programmers see the value of using this existing reliable infrastructure rather than recreating it, we can get on with the next paradigm shift, which is to model our business problems as objects instead of relationally mapped to Java objects. We're still stuck in the relational rut just as the hierarchical and network databasers of the past were stuck in a rut when relational technology first came out."

 

"The only rules that should reside in a database are referential integrity (and sometimes that isn't really necessary). It is also best to keep rules out of your data access code (hardcoding WHERE values) Business rules should be centralized in Java business objects for better manageability, scalability, etc. Don't let pushy DBAs tell you otherwise. Rules in a database slow down development as well as data access time."

 

What is even more ironic is that indexes were made necessary by the lack of relational fidelity of SQL and its commercial implementations: their support of physical data independence is rather poor. SQL DBMSs maintain a 1:1 relationship between a logical row and a physical record, which imposes specific physical orderings on rows and on columns, robbing SQL tables of set properties and constraining optimization. The new implementation technology I alluded to should allow the development of RDBMSs that are not so constrained. Aside from being much more optimizable, true RDBMSs may actually not need to rely on indexes, and may obviate the need for various data administration complexities of SQL products.

 

As to the notion that there are fewer design decisions if multimedia files are stored in the database, it is an illusion-–the opposite is actually true (see Chapter 1 of PRACTICAL ISSUES IN DATABASE MANAGEMENT). As I already explained, files are not at the model level and do not do away with logical design—nothing does. But I understand the attraction of such illusions: design requires thinking and knowledge, and that is hard.

 

 

Posted 03/22/02

 

 

 

[ABOUT] [QUOTES] [LINKS]