Date: June 4, 2003
From: JPD
To: Editor
I'd like your opinion on the following construct - is it
compatible with, or does it violate, relational principles? From your
discussion of 'domains', it seems to me that the relational model allows what I
am going to describe; then again, this gets into issues of direct
representation of repeating groups, multi-valued columns, and hierarchical
representation of data, all of which you've criticized in one way or another.
I'll use the example of a 'person' table:
ID# NAME ADDRESS
--- ---- -------
In one of today's SQL DBMS products, you might perhaps
specify the data types as INTEGER, VARCHAR(30), and VARCHAR(50), respectively.
Note that a VARCHAR(N) can be viewed as a structured type defined as 'the
ordered concatenation of up to N character values' (I cannot recall if you have
objections to VARCHAR primitives or not - but lets use them for a first
example. I'll switch to CHARs further on).
What if in a hypothetical DBMS, you could define your own
structured type, T_NAME(M, N), that rather than being 'the ordered
concatenation of up to N character values', was instead 'the ordered
concatenation of up to M VARCHAR values, each of which is (by definition of a
VARCHAR) itself a concatenation of up to N character values'? To give a
concrete example of what I am describing:
VARCHAR(30): 'Edgar Frank Codd' or {'E','d','g',...,'d'}
T_NAME(10,15):
'Edgar' 'Frank' 'Codd' or {{'E','d',...,'r'}, {'F', 'r',...,'k'},{'C',...,'d'}}
Then T_NAME in this example corresponds to a business rule
that "persons' full names are comprised of an ordered set of at most 10
name-parts, each of which may be at most 15 characters long". I
acknowledge that this looks like repeating groups, and that isn't the
discussion I want to get into, so I'm happy to abandon this example and move on
to the next.
We can do an example paralleling the case of a fixed-length
CHAR(N), 'an ordered concatenation of exactly N character values', by saying
that T_NAME is 'an ordered concatenation of exactly N name parts each of fixed
length':
T_NAME: (First,Middle,Last)
Similarly you might define a T_ADDRESS type as follows:
T_ADDRESS: (Street,City,State,Zipcode,Country)
Using the above two hypothetical types, defining a table
Person as (ID# INTEGER, Name T_NAME, Address T_ADDRESS) would be almost
identical to defining it as (ID#, First, Middle, Last, Street, City, State,
Zipcode, Country). The difference is, the column Name would encapsulate
sub-columns First, Middle, and Last; a comparison of Name between two rows
would be an implicit comparison of First, Middle, and Last between those rows
(according to some defined equality rule on data type T_NAME).
However, you could represent this in a similar way with a
join of three ordinary SQL tables defined as follows:
Person: (P_ID#,N_ID,A_ID)
Name: (N_ID#,First,Middle,Last)
Address: (A_ID#,Street,City,State,Zipcode,Country)
Which require no concept of 'structured type', but does
require the use of the N_ID and A_ID keys to associate a particular name and a
particular address with a particular person, as well as the constraint that no
two people share the same N_ID, and that no N_ID exists which is not assigned
to a person - in other words, constraints to enforce strict one-to-one
correspondence between Person and Name rows. For normalization reasons, it
might actually make the most sense to have the addresses in a separate table;
but for names at least, it seems to me that the more elegant semantics are to
have the single value 'full name' recorded implicitly with the person, while
still allowing that single value to be decomposable into component parts.
So, is the relational model compatible with the single-table
(ID, Name_Composite, Address_Composite) described first, or is the only valid
approach the multiple joined table approach described second?
Finally, if the use of structured types in this way is NOT
valid, then could you please explain how a VARCHAR, CHAR, DATE, TIMESTAMP, or
any other 'decomposable' type is permissible as a primitive, or confirm the
alternative: that in its pure form, the relational model always requires
separate tables for representing the decomposition of these data types. I have
always felt that DATE and TIMESTAMP especially are good examples of data types,
which make a great deal of sense I have speculated on the form that such tables
might take below.
VARCHAR: (id, precedence, char)--multiple rows required to
represent one VARCHAR value; precedence dictates the ordering of characters in
the string. CHAR(10): (id, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9) DATE: (id,
year, month, day) TIMESTAMP: (id, year, month, day, hour, minute, second,
second_fraction)
Chris Date Responds: JPD's comments are very close to
being on the right lines. In particular,
the following definition does conform to the prescriptions of the relational
model:
VAR PERSON RELATION {ID# INTEGER,
NAME
TUPLE {FIRST...,
MIDDLE...,
LAST},
ADDRESS TUPLE {STREET...,
CITY...,
STATE...,
ZIPCODE...,
COUNTRY...}
KEY
{ID#};
(The syntax is Tutorial D, which is defined in C. J.
Date and Hugh Darwen, FOUNDATION
FOR FUTURE DATABASE SYSTEMS: THE THIRD MANIFESTO, 2nd edition.) The relation variable ("table") PERSON
has three attributes: ID#, of type INTEGER; NAME, of a certain tuple type; and
ADDRESS, of another tuple type. Points
arising:
·
NAME and ADDRESS values are indeed tuples (i.e., any
given PERSON tuple contains an ID# value, which is an integer; a NAME value,
which is a tuple; and an ADDRESS value, which is also a tuple). Note too that tuples are values!
·
Pace Dupuis, NAME and ADDRESS values are not
"encapsulated." The whole
point about tuples (and relations too) is that they're not encapsulated; that
is, users can see their components. For
example, the user explicitly knows that a NAME value consists of a FIRST value,
a MIDDLE value, and a LAST value.
·
The Third Manifesto doesn't do what Dupuis did,
which is to introduce new names for tuple types (there's no type called T_NAME,
for example). Instead, tuple types can
simply be used "inline," as it were, as shown in the
example. See The Third Manifesto
for arguments in support of this state of affairs. Note: SQL does
much the same thing with its row and array types.
·
TUPLE in the foregoing example is a type generator.
·
In contrast to the foregoing, types like DATE and TIME
are best regarded as scalar types, not tuple types. Scalar types, unlike tuple and relation types, are
encapsulated (in fact, scalar and encapsulated mean exactly the
same thing, which is one reason why I don't use the term
"encapsulated" very much myself).
Thus, a TIME value (for example) is not a tuple with HOUR,
MINUTE, and SECOND components. However,
it's still possible, thanks to the availability of certain operators (called
THE_ operators in The Third Manifesto), to access the HOUR, MINUTE, and
SECOND components of the HOUR / MINUTE / SECOND representation of a
given TIME value.
For further discussion of this whole business of legal types
in the relational model, see:
·
Chapter 5 (especially) of C. J. Date: AN INTRODUCTION TO
DATABASE SYSTEMS (8th edition, due at the end of July 2003)
·
C. J. Date: What Does First Normal Form
Really Mean?, Paper #1 in DATABASE
FOUNDATIONS series
·
THE
THIRD MANIFESTO book already mentioned (passim but see especially
Appendix C)
Posted
09/12/03
[ABOUT]
[QUOTES]
[LINKS]