ON INTERVALS
with Chris Date

 

 

 

Date: May 6, 2003

From: JD

To: Editor

 

I have read some of your articles and found them very interesting. Especially somewhere you explain that tuple values in a relational database should be considered as axioms. I have also begun to read THE THIRD MANIFESTO by Date and Darwen.

 

My name is Johan Dufour and I am programmer at Datarutin, a Swedish IT company. Recently I got the responsibility to redesign one of our old systems. Of course I want to optimize the new architecture. Among other things I would really like to fully follow the rules of the relational model if I can defend such a strategy towards my colleagues. To better my chances to succeed in that ambition, I feel that I self would need some guidance and it seems to me to be a natural step to try to get it from the brains behind the DATABASE DEBUNKINGS site.

 

I'd like to ask a question about which way should be considered the right one to handle intervals within the relational model.

 

Please, consider the following (very simplified) example: A business offers price reductions for some products depending of the quantity the customer purchases. For the product A1:

 

1 to 4 gives no price reduction.

5 to 9 gives 5 % reduction.

10 and above gives 10 % reduction

 

A common way to store that information in a table could look as following:

 

REDUCTION

==========================================

PRODUCT   LOWEST   HIGHEST      REDUCTION

==========================================

A1            5            9      0,05

A1           10    999999999       0,1

------------------------------------------

 

To ask how much reduction (if any) you would get by purchasing three A1 items, you could use the following query:

 

SELECT reduction

FROM reduction

WHERE product = 'A1'

  AND lowest <= 3

  AND highest >= 3

 

This doesn't seem too complicated but if the tuples are to be considered as axioms, this structure has some serious problems because it allows overlapping intervals as, for example:

 

==========================================

PRODUCT   LOWEST   HIGHEST      REDUCTION

==========================================

A2            5           19       0,05

A2           10    999999999       0,1

------------------------------------------

 

Of course you could avoid such overlapping intervals with some integrity constraints:

 

lowest <= highest (for each tuple)

 

r1.highest < r2.lowest or r1.lowest > r2.

 

highest (for each pair of distinct tuples r1 and r2 where r1.product = r2.product (and r1.lowest <> r2.lowest (if you admit that the primary key is composed of product and lowest)))

 

But is such a solution acceptable? Can it be considered to be truly relational? If not I am also curious to know if you can identify which normal form it breaks against and why.

 

Another way I could imagine is to eliminate the attribute "Highest". As a matter of fact the "Highest" value of a tuple seems to be redundant as it depends not only of the key but also of the "Lowest" value of the "next" tuple (or if there is no "next" tuple, it will be some artificial value which stands for "no limit" - or what would you say about NULL? ;-) Hmm... the acceptance of NULL values in "Highest" would not simplify the query above...). So, by eliminating "Highest" the same information as above would be stored as following:

 

REDUCTION

=============================

PRODUCT   LOWEST   REDUCTION

=============================

A1            5      0,05

A1           10      0,1

-----------------------------

 

Unfortunately this has some disadvantages. Asking how much reduction (if any) you would get by purchasing three A1 items becomes more complicated. For example you could use the following query:

 

SELECT reduction

FROM reduction r1

WHERE product = 'A1'

  AND lowest <= 3

  AND NOT EXISTS (SELECT *

                  FROM reduction r2

                  WHERE r1.product = r2.product

                    AND r1.lowest < r2.lowest

                    AND r2.lowest <= 3)

 

Obviously, this is more complicated for the user and I'm afraid this would take more time for the DBMS too to handle than the simpler query above.

 

There is at least one more disadvantage with this structure. It doesn't work in a general way with all kinds of intervals. You could need to store some information depending of intervals but without the constraint that the next interval must begin directly where the actual ends. In such a case the attribute "Highest" is not redundant.

 

For example, you could consider the validity period of an offer:

 

OFFER

============================

OFFER   FROM       TO

============================

O1      20030501   20030531

----------------------------

 

By the way, this is an interesting example as it illustrates the need of more integrity constraints (concerning more than one table) to avoid dissonant axioms. For example if you want a table to store when a customer has accepted some offer...

 

PURCHASING

============================

CUSTOMER   OFFER   DATE

============================

C1         O1      20030505

----------------------------

 

... you want to be sure that this date lays within the valid interval for that offer.

 

Back to the first example, the absolutely simplest way to ask how much reduction (if any) I would get by purchasing three A1 items, would be to use (if it was possible) the following minimalistic query:

 

SELECT reduction

FROM reduction

WHERE product = 'A1'

  AND quantity = 3

 

That is in fact an exact translation of what we mean. Here is a third way to store the same information as above, which indeed gives us the possibility to formulate that simplest query:

 

REDUCTION

===============================

PRODUCT   QUANTITY   REDUCTION

===============================

A1               5        0,05

A1               6        0,05

A1               7        0,05

A1               8        0,05

A1               9        0,05

A1              10        0,1

A1              11        0,1

A1              12        0,1

:                :         :

A1       999999999        0,1

==============================

 

This would work with the other example too:

 

OFFER

================

OFFER   DATE

================

O1      20030501

O1      20030502

O1      20030503

 :             :

O1      20030531

----------------

 

In fact this would also solve the integrity issue of the PURCHASING table in an elegant way. Now the primary key of OFFER (Offer, Date) is a foreign key for PURCHASING so we get safe with the usual referential integrity constraint.

 

If I don't fool myself, these (extremely simplified and hypothetical) tables should be fully normalized for now. However this structure has at least one obvious disadvantage. In our real (much more complicated) system we would need (among many other things) to store how the reduction conditions vary from time to time. If both the quantities and the dates (perhaps even more precise time values) have to be stored as distinct scalar values, this will generate pretty many tuples. I'm not sure either my colleagues or our DBMS would swallow it.

 

Even if the DBMS and the hardware had no limits, this way to handle things would still force us to use some artificial limit when we really mean "no upper limit". Finally this structure could only work with discrete values (now I think it should be enough for our system but perhaps not for all systems).

 

That's so far I have come by myself. I hope that I don't tire you too much with my naive questions. I would be very grateful if you had time, the opportunity and wanted to give me some answer about this for me and our system important matter. Grateful for any comments, thoughts or references to eventual literature that handle this issues in a way that may help me.

 

 

Chris Date Responds: Your message was a pleasure to read for several reasons, most of all because--along with Hugh Darwen and Nikos Lorentzos--I have recently published a book that seems to me directly relevant to most if not all of your questions.  The book is TEMPORAL DATA AND THE RELATIONAL MODEL, by C. J. Date, Hugh Darwen, and Nikos Lorentzos. Although the title refers to temporal data specifically, the book is primarily concerned with intervals and the relational model.  Intervals are exactly the right abstraction for numerous application areas, including temporal data in particular, and including also your own application area as you observe yourself.  Thus, while the book won't directly help with your questions right now (because products don't support intervals yet), it should give guidance on how to "roll your own" solutions in the absence of such system support. 

 

I note that what you call "the third way to store the ... information" effectively means using a point-based representation of the information instead of an interval-based one.  An interesting research question is whether we could efficiently support such a point-based user-level view over an interval-based conceptual- or internal-level representation.  Your observations regarding queries and integrity constraints (for a point-based representation) are correct so far as they go, but you'd also need constraints to the effect that (e.g.) every integer value greater than 4 must appear in conjunction with every product ID. 

 

By the way, there's no need to apologize--your questions are far from naïve.  For my part I find it encouraging that your thinking very much parallels the thinking Hugh, Nikos, and I went through when we were working on our book. 

 

PS: No, don't use NULLs!  Please.

 

 

Posted 09/05/03

 

 

 

[ABOUT] [QUOTES] [LINKS]