Date: May 6, 2003
From: JD
To: Editor
I have read some of your articles and found them very
interesting. Especially somewhere you explain that tuple values in a relational
database should be considered as axioms. I have also begun to read THE THIRD
MANIFESTO by Date and Darwen.
My name is Johan Dufour and I am programmer at Datarutin, a
Swedish IT company. Recently I got the responsibility to redesign one of our
old systems. Of course I want to optimize the new architecture. Among other
things I would really like to fully follow the rules of the relational model if
I can defend such a strategy towards my colleagues. To better my chances to
succeed in that ambition, I feel that I self would need some guidance and it
seems to me to be a natural step to try to get it from the brains behind the DATABASE
DEBUNKINGS site.
I'd like to ask a question about which way should be
considered the right one to handle intervals within the relational model.
Please, consider the following (very simplified) example: A
business offers price reductions for some products depending of the quantity
the customer purchases. For the product A1:
1 to 4 gives no price reduction.
5 to 9 gives 5 % reduction.
10 and above gives 10 % reduction
A common way to store that information in a table could look
as following:
REDUCTION
==========================================
PRODUCT LOWEST HIGHEST REDUCTION
==========================================
A1 5 9 0,05
A1 10 999999999 0,1
------------------------------------------
To ask how much reduction (if any) you would get by
purchasing three A1 items, you could use the following query:
SELECT reduction
FROM reduction
WHERE product = 'A1'
AND lowest <= 3
AND highest >= 3
This doesn't seem too complicated but if the tuples are to be
considered as axioms, this structure has some serious problems because it
allows overlapping intervals as, for example:
==========================================
PRODUCT LOWEST HIGHEST REDUCTION
==========================================
A2 5 19 0,05
A2 10 999999999 0,1
------------------------------------------
Of course you could avoid such overlapping intervals with
some integrity constraints:
lowest <= highest (for each tuple)
r1.highest < r2.lowest or r1.lowest > r2.
highest (for each pair of distinct tuples r1 and r2 where
r1.product = r2.product (and r1.lowest <> r2.lowest (if you admit that
the primary key is composed of product and lowest)))
But is such a solution acceptable? Can it be considered to be
truly relational? If not I am also curious to know if you can identify which
normal form it breaks against and why.
Another way I could imagine is to eliminate the attribute
"Highest". As a matter of fact the "Highest" value of a
tuple seems to be redundant as it depends not only of the key but also of the
"Lowest" value of the "next" tuple (or if there is no
"next" tuple, it will be some artificial value which stands for
"no limit" - or what would you say about NULL? ;-) Hmm... the
acceptance of NULL values in "Highest" would not simplify the query
above...). So, by eliminating "Highest" the same information as above
would be stored as following:
REDUCTION
=============================
PRODUCT LOWEST REDUCTION
=============================
A1 5 0,05
A1 10 0,1
-----------------------------
Unfortunately this has some disadvantages. Asking how much
reduction (if any) you would get by purchasing three A1 items becomes more
complicated. For example you could use the following query:
SELECT reduction
FROM reduction r1
WHERE product = 'A1'
AND lowest <= 3
AND NOT EXISTS (SELECT *
FROM reduction r2
WHERE
r1.product = r2.product
AND
r1.lowest < r2.lowest
AND
r2.lowest <= 3)
Obviously, this is more complicated for the user and I'm
afraid this would take more time for the DBMS too to handle than the simpler
query above.
There is at least one more disadvantage with this structure.
It doesn't work in a general way with all kinds of intervals. You could need to
store some information depending of intervals but without the constraint that
the next interval must begin directly where the actual ends. In such a case the
attribute "Highest" is not redundant.
For example, you could consider the validity period of an
offer:
OFFER
============================
OFFER FROM TO
============================
O1 20030501 20030531
----------------------------
By the way, this is an interesting example as it illustrates
the need of more integrity constraints (concerning more than one table) to
avoid dissonant axioms. For example if you want a table to store when a
customer has accepted some offer...
PURCHASING
============================
CUSTOMER OFFER DATE
============================
C1 O1 20030505
----------------------------
... you want to be sure that this date lays within the valid
interval for that offer.
Back to the first example, the absolutely simplest way to ask
how much reduction (if any) I would get by purchasing three A1 items, would be
to use (if it was possible) the following minimalistic query:
SELECT reduction
FROM reduction
WHERE product = 'A1'
AND quantity = 3
That is in fact an exact translation of what we mean. Here is
a third way to store the same information as above, which indeed gives us the
possibility to formulate that simplest query:
REDUCTION
===============================
PRODUCT QUANTITY REDUCTION
===============================
A1 5 0,05
A1 6 0,05
A1 7 0,05
A1 8 0,05
A1 9 0,05
A1 10 0,1
A1 11 0,1
A1 12 0,1
: : :
A1 999999999 0,1
==============================
This would work with the other example too:
OFFER
================
OFFER DATE
================
O1 20030501
O1 20030502
O1 20030503
: :
O1 20030531
----------------
In fact this would also solve the integrity issue of the
PURCHASING table in an elegant way. Now the primary key of OFFER (Offer, Date)
is a foreign key for PURCHASING so we get safe with the usual referential
integrity constraint.
If I don't fool myself, these (extremely simplified and
hypothetical) tables should be fully normalized for now. However this structure
has at least one obvious disadvantage. In our real (much more complicated)
system we would need (among many other things) to store how the reduction
conditions vary from time to time. If both the quantities and the dates (perhaps
even more precise time values) have to be stored as distinct scalar values,
this will generate pretty many tuples. I'm not sure either my colleagues or our
DBMS would swallow it.
Even if the DBMS and the hardware had no limits, this way to
handle things would still force us to use some artificial limit when we really
mean "no upper limit". Finally this structure could only work with
discrete values (now I think it should be enough for our system but perhaps not
for all systems).
That's so far I have come by myself. I hope that I don't tire
you too much with my naive questions. I would be very grateful if you had time,
the opportunity and wanted to give me some answer about this for me and our
system important matter. Grateful for any comments, thoughts or references to
eventual literature that handle this issues in a way that may help me.
Chris Date Responds: Your message was a pleasure to
read for several reasons, most of all because--along with Hugh Darwen and Nikos
Lorentzos--I have recently published a book that seems to me directly relevant
to most if not all of your questions.
The book is TEMPORAL DATA AND
THE RELATIONAL MODEL, by C. J. Date, Hugh Darwen, and Nikos Lorentzos.
Although the title refers to temporal data specifically, the book is primarily
concerned with intervals and the relational model. Intervals are exactly the right abstraction
for numerous application areas, including temporal data in particular, and
including also your own application area as you observe yourself. Thus, while the book won't directly help
with your questions right now (because products don't support intervals yet),
it should give guidance on how to "roll your own" solutions in the absence
of such system support.
I note that what you call "the third way to store the
... information" effectively means using a point-based representation of
the information instead of an interval-based one. An interesting research question is whether we could efficiently
support such a point-based user-level view over an interval-based conceptual-
or internal-level representation. Your
observations regarding queries and integrity constraints (for a point-based representation)
are correct so far as they go, but you'd also need constraints to the effect
that (e.g.) every integer value greater than 4 must appear in conjunction with
every product ID.
By the way, there's no need to apologize--your questions are
far from naïve. For my part I find it
encouraging that your thinking very much parallels the thinking Hugh, Nikos,
and I went through when we were working on our book.
PS: No, don't use NULLs! Please.
Posted
09/05/03
[ABOUT]
[QUOTES]
[LINKS]