MORE ON MULTIVALUED DBMS
with Fabian Pascal

 

 

 

From: GJ

To: Editorial

Date: 9 Dec 2004

 

I enjoyed the first two PRACTICAL DATABASE FOUNDATION papers. I will have to read the first one a few more times to get all of the information. I have been a fan of DATABASE DEBUNKINGS and your and Mr. Date's books for several years now.

 

I have considerable experience with multivalued database systems, and with SQL-based systems, and I've put a lot of personal effort into educating myself in the theory and principles of relational databases. I'm no expert but I think I know more than most of the programmers and DBA’s I work with. I share your view of vocational tool-oriented education and the focus on tools instead of understanding.

 

A couple of comments in your paper #2 What First Normal Form Means Not!  caught my eye. Let me offer some corrections and elaboration.

 

Dick Pick invented the first multivalued database system, for Microdata, back in the dark ages of the early 1970s. The ideas may be older than that. Despite the marketing puffery of multivalued vendors who try to make MV systems appear better than relational databases, Mr. Pick should not be blamed. His design filled a real need in the days of slow CPUs, very limited memory, and slow disk storage, and his database system pre-dated commercial SQL-based systems by several years.

 

You wrote "The hashing algorithm strongly suggests that ITEM IDs are system-generated pointers, akin to Object IDs." The ITEM IDs are assigned by the database developer and/or programmer; like keys in a SQL database ITEM IDs can be any type, or a composite type. ITEM IDs do not have to be unique (there are provisions to force uniqueness by having the system maintain a B*tree index). The ITEM ID is used as input to a hash algorithm--which is chosen in advance based on expected ITEM ID values when the file is created--and the result is a group number. The hash functions take arbitrary string or numeric values and generate a number in the range [0, modulo], which is the group that will contain the record. Depending on the initial choice of modulo and hash function, ITEM IDs may map to groups more or less evenly distributed across the range of group numbers. A poorly-chosen hash function or value for modulo will cause hash collisions and force records that won't fit into the full groups into overflow groups. Once an ITEM ID is used to find the corresponding group, the frames that make up the group are searched sequentially for the record. Since records are variable-length (usually stored as delimited strings, with the delimiters in the ASCII 252-254 range), searching through frames is costly, and much effort goes into creating and resizing files to get the right values for modulo, separation, and hash algorithm (usually called file type).

 

Of course these are all physical implementation details that should have nothing to do with designing a schema, but it's not possible to separate the two activities in a multivalued system. The essential point is that the ITEM ID is not assigned by the system, it's part of the record (often called attribute 0), and the hashed value (group number) is recomputed dynamically, not stored as part of the record (at least not in any way the programmer or database developer is aware of, though it's probably an optimization that has been implemented).

 

For a variety of reasons you describe in your paper, multivalued databases are not relational, and not capable of being made to act relational: the inherent ordering of records and fields/attributes, values, and subvalues can't be avoided or wished away. With careful design a multivalued database can approximate a SQL database; in fact IBM (which owns Universe/U2) has overlaid a SQL language layer on top of the underlying multivalued database, and it works as long as certain fairly obvious restrictions are obeyed (no multivalued fields, for example).

 

The database query language and application language that are part of multivalued systems (they were called 4GLs when that term was in vogue) do support multivalued and subvalued fields directly, and have many syntax features and functions to operate on them. The physical ordering can't be ignored, though. For example all Pick-derived systems I've used specify field/value/subvalue access with subscripts surrounded by angle brackets: record<1,3,5> references subvalue 5 of value 3 of field (attribute) 1 of a record.

 

It should be obvious by now that multivalued databases not only allow ordered lists of values in a field, but they allow arbitrary lists of values of different types. Even seasoned multivalued advocates cringe when they see such heterogeneous multivalued fields; the only reason for creating such structures are (a) laziness or (b) working around the inflexibility of the database, which is hard to change because the fields have a physical ordering that quickly gets embedded in code, and is hard to find (my record<1,3,5> example above is the kind of thing typically found in code).

 

You identified the reasons multivalued systems cannot be relational (or even normalized). As a long-time user now engaged in moving a large database out of Universe into a (relatively) relational SQL-based system, while keeping all of the program code working, I can say that the lack of logical/physical eparation, and the required ordering of records, fields, values, and subvalues (especially the physical ordering of fields within records) makes multivalued systems hell to understand, maintain, and extend.

 

 

From: Fabian Pascal

To: GJ

 

It's good to hear from a MVDBMS practitioner who thinks independently and critically and validates our arguments, which “cookbook” MVDBMS proponents fail to understand and/or appreciate.

 

 

Posted 2/11/05