From: GJ
To: Editorial
Date: 9 Dec 2004
I enjoyed the first two PRACTICAL
DATABASE FOUNDATION papers. I will have to read the first
one a few more times to get all of the information. I have been a fan of DATABASE
DEBUNKINGS and your and Mr. Date's books for several years now.
I have considerable experience with multivalued database
systems, and with SQL-based systems, and I've put a lot of personal effort into
educating myself in the theory and principles of relational databases. I'm no
expert but I think I know more than most of the programmers and DBA’s I work
with. I share your view of vocational tool-oriented education and the focus on
tools instead of understanding.
A couple of comments in your paper #2 What First Normal Form
Means Not! caught my eye. Let
me offer some corrections and elaboration.
Dick Pick invented the first multivalued database system, for
Microdata, back in the dark ages of the early 1970s. The ideas may be older
than that. Despite the marketing puffery of multivalued vendors who try to make
MV systems appear better than relational databases, Mr. Pick should not be
blamed. His design filled a real need in the days of slow CPUs, very limited
memory, and slow disk storage, and his database system pre-dated commercial
SQL-based systems by several years.
You wrote "The hashing algorithm strongly suggests that
ITEM IDs are system-generated pointers, akin to Object IDs." The ITEM IDs
are assigned by the database developer and/or programmer; like keys in a SQL
database ITEM IDs can be any type, or a composite type. ITEM IDs do not have to
be unique (there are provisions to force uniqueness by having the system
maintain a B*tree index). The ITEM ID is used as input to a hash
algorithm--which is chosen in advance based on expected ITEM ID values when the
file is created--and the result is a group number. The hash functions take
arbitrary string or numeric values and generate a number in the range [0,
modulo], which is the group that will contain the record. Depending on the
initial choice of modulo and hash function, ITEM IDs may map to groups more or
less evenly distributed across the range of group numbers. A poorly-chosen hash
function or value for modulo will cause hash collisions and force records that
won't fit into the full groups into overflow groups. Once an ITEM ID is used to
find the corresponding group, the frames that make up the group are searched
sequentially for the record. Since records are variable-length (usually stored
as delimited strings, with the delimiters in the ASCII 252-254 range),
searching through frames is costly, and much effort goes into creating and
resizing files to get the right values for modulo, separation, and hash
algorithm (usually called file type).
Of course these are all physical implementation details that
should have nothing to do with designing a schema, but it's not possible to separate
the two activities in a multivalued system. The essential point is that the
ITEM ID is not assigned by the system, it's part of the record (often called
attribute 0), and the hashed value (group number) is recomputed dynamically,
not stored as part of the record (at least not in any way the programmer or
database developer is aware of, though it's probably an optimization that has
been implemented).
For a variety of reasons you describe in your paper,
multivalued databases are not relational, and not capable of being made to act
relational: the inherent ordering of records and fields/attributes, values, and
subvalues can't be avoided or wished away. With careful design a multivalued
database can approximate a SQL database; in fact IBM (which owns Universe/U2)
has overlaid a SQL language layer on top of the underlying multivalued
database, and it works as long as certain fairly obvious restrictions are
obeyed (no multivalued fields, for example).
The database query language and application language that are
part of multivalued systems (they were called 4GLs when that term was in vogue)
do support multivalued and subvalued fields directly, and have many syntax
features and functions to operate on them. The
physical ordering can't be ignored, though. For example all Pick-derived
systems I've used specify field/value/subvalue access with subscripts
surrounded by angle brackets: record<1,3,5> references subvalue 5 of
value 3 of field (attribute) 1 of a record.
It should be obvious by now that multivalued databases not
only allow ordered lists of values in a field, but they allow arbitrary lists
of values of different types. Even seasoned multivalued advocates cringe when
they see such heterogeneous multivalued fields; the only reason for creating
such structures are (a) laziness or (b) working around the inflexibility of the
database, which is hard to change because the fields have a physical ordering
that quickly gets embedded in code, and is hard to find (my record<1,3,5>
example above is the kind of thing typically found in code).
You identified the reasons multivalued systems cannot be
relational (or even normalized). As a long-time user now engaged in moving a
large database out of Universe into a (relatively) relational SQL-based system,
while keeping all of the program code working, I can say that the lack of
logical/physical eparation, and the required ordering of records, fields,
values, and subvalues (especially the physical ordering of fields within
records) makes multivalued systems hell to understand, maintain, and extend.
From: Fabian Pascal
To: GJ
It's good to hear from a MVDBMS practitioner who thinks
independently and critically and validates our arguments, which “cookbook”
MVDBMS proponents fail to understand and/or appreciate.
Posted 2/11/05