ON “STRUCTURED TYPES”
with Chris Date

 

 

 

Date: June 4, 2003

From: JPD

To: Editor

 

I'd like your opinion on the following construct - is it compatible with, or does it violate, relational principles? From your discussion of 'domains', it seems to me that the relational model allows what I am going to describe; then again, this gets into issues of direct representation of repeating groups, multi-valued columns, and hierarchical representation of data, all of which you've criticized in one way or another.

 

I'll use the example of a 'person' table:

 

ID# NAME ADDRESS

--- ---- -------

 

In one of today's SQL DBMS products, you might perhaps specify the data types as INTEGER, VARCHAR(30), and VARCHAR(50), respectively. Note that a VARCHAR(N) can be viewed as a structured type defined as 'the ordered concatenation of up to N character values' (I cannot recall if you have objections to VARCHAR primitives or not - but lets use them for a first example. I'll switch to CHARs further on).

 

What if in a hypothetical DBMS, you could define your own structured type, T_NAME(M, N), that rather than being 'the ordered concatenation of up to N character values', was instead 'the ordered concatenation of up to M VARCHAR values, each of which is (by definition of a VARCHAR) itself a concatenation of up to N character values'? To give a concrete example of what I am describing:

 

VARCHAR(30): 'Edgar Frank Codd' or {'E','d','g',...,'d'}

T_NAME(10,15): 'Edgar' 'Frank' 'Codd' or {{'E','d',...,'r'}, {'F', 'r',...,'k'},{'C',...,'d'}}

 

Then T_NAME in this example corresponds to a business rule that "persons' full names are comprised of an ordered set of at most 10 name-parts, each of which may be at most 15 characters long". I acknowledge that this looks like repeating groups, and that isn't the discussion I want to get into, so I'm happy to abandon this example and move on to the next.

 

We can do an example paralleling the case of a fixed-length CHAR(N), 'an ordered concatenation of exactly N character values', by saying that T_NAME is 'an ordered concatenation of exactly N name parts each of fixed length':

 

T_NAME: (First,Middle,Last)

 

Similarly you might define a T_ADDRESS type as follows:

 

T_ADDRESS: (Street,City,State,Zipcode,Country)

 

Using the above two hypothetical types, defining a table Person as (ID# INTEGER, Name T_NAME, Address T_ADDRESS) would be almost identical to defining it as (ID#, First, Middle, Last, Street, City, State, Zipcode, Country). The difference is, the column Name would encapsulate sub-columns First, Middle, and Last; a comparison of Name between two rows would be an implicit comparison of First, Middle, and Last between those rows (according to some defined equality rule on data type T_NAME).

 

However, you could represent this in a similar way with a join of three ordinary SQL tables defined as follows:

 

Person: (P_ID#,N_ID,A_ID)

Name: (N_ID#,First,Middle,Last)

Address: (A_ID#,Street,City,State,Zipcode,Country)

 

Which require no concept of 'structured type', but does require the use of the N_ID and A_ID keys to associate a particular name and a particular address with a particular person, as well as the constraint that no two people share the same N_ID, and that no N_ID exists which is not assigned to a person - in other words, constraints to enforce strict one-to-one correspondence between Person and Name rows. For normalization reasons, it might actually make the most sense to have the addresses in a separate table; but for names at least, it seems to me that the more elegant semantics are to have the single value 'full name' recorded implicitly with the person, while still allowing that single value to be decomposable into component parts.

 

So, is the relational model compatible with the single-table (ID, Name_Composite, Address_Composite) described first, or is the only valid approach the multiple joined table approach described second?

 

Finally, if the use of structured types in this way is NOT valid, then could you please explain how a VARCHAR, CHAR, DATE, TIMESTAMP, or any other 'decomposable' type is permissible as a primitive, or confirm the alternative: that in its pure form, the relational model always requires separate tables for representing the decomposition of these data types. I have always felt that DATE and TIMESTAMP especially are good examples of data types, which make a great deal of sense I have speculated on the form that such tables might take below.

 

VARCHAR: (id, precedence, char)--multiple rows required to represent one VARCHAR value; precedence dictates the ordering of characters in the string. CHAR(10): (id, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9) DATE: (id, year, month, day) TIMESTAMP: (id, year, month, day, hour, minute, second, second_fraction)

 

 

Chris Date Responds: JPD's comments are very close to being on the right lines.  In particular, the following definition does conform to the prescriptions of the relational model: 

 

VAR PERSON RELATION {ID# INTEGER,

                     NAME TUPLE {FIRST...,

                                 MIDDLE...,

                                 LAST},

                     ADDRESS TUPLE {STREET...,

                                    CITY...,

                                    STATE...,

                                    ZIPCODE...,

                                    COUNTRY...}

                     KEY {ID#};

 

(The syntax is Tutorial D, which is defined in C. J. Date and Hugh Darwen, FOUNDATION FOR FUTURE DATABASE SYSTEMS: THE THIRD MANIFESTO, 2nd edition.)  The relation variable ("table") PERSON has three attributes: ID#, of type INTEGER; NAME, of a certain tuple type; and ADDRESS, of another tuple type.  Points arising: 

 

·   NAME and ADDRESS values are indeed tuples (i.e., any given PERSON tuple contains an ID# value, which is an integer; a NAME value, which is a tuple; and an ADDRESS value, which is also a tuple).  Note too that tuples are values! 

 

·   Pace Dupuis, NAME and ADDRESS values are not "encapsulated."  The whole point about tuples (and relations too) is that they're not encapsulated; that is, users can see their components.  For example, the user explicitly knows that a NAME value consists of a FIRST value, a MIDDLE value, and a LAST value. 

 

·   The Third Manifesto doesn't do what Dupuis did, which is to introduce new names for tuple types (there's no type called T_NAME, for example).  Instead, tuple types can simply be used "inline," as it were, as shown in the example.  See The Third Manifesto for arguments in support of this state of affairs.  Note:  SQL does much the same thing with its row and array types. 

 

·   TUPLE in the foregoing example is a type generator

 

·   In contrast to the foregoing, types like DATE and TIME are best regarded as scalar types, not tuple types.  Scalar types, unlike tuple and relation types, are encapsulated (in fact, scalar and encapsulated mean exactly the same thing, which is one reason why I don't use the term "encapsulated" very much myself).  Thus, a TIME value (for example) is not a tuple with HOUR, MINUTE, and SECOND components.  However, it's still possible, thanks to the availability of certain operators (called THE_ operators in The Third Manifesto), to access the HOUR, MINUTE, and SECOND components of the HOUR / MINUTE / SECOND representation of a given TIME value. 

 

For further discussion of this whole business of legal types in the relational model, see: 

 

·   Chapter 5 (especially) of C. J. Date: AN INTRODUCTION TO DATABASE SYSTEMS (8th edition, due at the end of July 2003)

·   C. J. Date: What Does First Normal Form Really Mean?, Paper #1 in DATABASE FOUNDATIONS series

·   THE THIRD MANIFESTO book already mentioned (passim but see especially Appendix C)

 

 

Posted 09/12/03

 

 

 

[ABOUT] [QUOTES] [LINKS]