ON AN OLD CELKO PUZZLE
with C.J. Date

 

 

 

From: PV
To: Editor

 

As I was searching for online tutorials on the relational model and related stuff, I stumbled on a puzzle problem posted on Joe Celko's old [Ed. Note: misnamed] column SQL For Smarties. Let me quote the puzzle:

 

Jack Wells (CompuServe 100442,3164) submitted this perplexing SQL problem in June 1996. His situation is pretty typical for SQL programmers who work with 3GL people. The programmers are writing a report on the employees, and they want information about each employee's current and previous salary status. The report will show the date of their promotion and the salary amount.

 

Jack spoke with Fabian Pascal, author of Understanding Relational Databases (John Wiley, 1993, ISBN 0-471-58538-6) and SQL and Relational Basics (M&T Books, 1989, ISBN 1-55851-063-X), the week he was working on this problem, and Mr. Pascal replied that this query could not be done. He said, 'In a truly relational language it could be done, but since SQL is not relational it isn't possible, not even with SQL-92.' Sounds like a challenge to me [Joe Celko]!

 

Oh, I forgot to mention an additional constraint on the query: Jack is working in Oracle. This product is still not up to SQL-92 standards (that is, there are no proper outer joins, no general scalar subexpressions, and so on), so your query must run under the old SQL-86 or SQL-89 rules.

 

Assume that you have this test data:

 

CREATE TABLE Salaries

(emp CHAR(10) NOT NULL,

sal_date DATE NOT NULL,

sal_amt DECIMAL (8,2) NOT NULL,

PRIMARY KEY (emp,sal_date));

 

INSERT INTO Salaries VALUES ('Tom','1996-06-20',500.00);

INSERT INTO Salaries VALUES ('Tom','1996-08-20',700.00);

INSERT INTO Salaries VALUES ('Tom','1996-10-20',800.00);

INSERT INTO Salaries VALUES ('Tom','1996-12-20',900.00);

INSERT INTO Salaries VALUES ('Dick','1996-06-20',500.00);

INSERT INTO Salaries VALUES ('Harry','1996-07-20',500.00);

INSERT INTO Salaries VALUES ('Harry','1996-09-20',700.00);

 

Tom has had two promotions, Dick is a new hire, and Harry has had one promotion.

 

Then near the bottom of the page, a solution is presented as follows:

SELECT S0.emp, S0.sal_date,S0.sal_amt,

S1.sal_date, S1.sal_amt

FROM Salaries AS S0,Salaries AS S1

WHERE S0.emp = S1.emp

AND S0.sal_date =

(SELECT MAX(S2.sal_date)

FROM Salaries AS S2

WHERE S0.emp = S2.emp)

AND S1.sal_date =

(SELECT MAX(S3.sal_date)

FROM Salaries AS S3

WHERE S0.emp = S3.emp

AND S3.sal_date < S0.sal_date)

UNION ALL

SELECT S4.emp,MAX(S4.sal_date),MAX(S4.sal_amt),

NULL, NULL

FROM Salaries AS S4

GROUP BY S4.emp

HAVING COUNT(*) = 1;

 

+-------+------------+--------+------------+--------+

| emp   | sal_date   |sal_amt | sal_date   | sal_amt|

+-------+------------+--------+------------+--------+

| Tom   | 1996-12-20 |  900   | 1996-10-20 | 800    |

| Harry | 1996-09-20 |  700   | 1996-07-20 | 500    |

| Dick  | 1996-06-20 |  500   | (NULL)     |(NULL)  |

+-------+------------+--------+------------+--------+

                             

I find the solution unintuitive and difficult to understand. I can't get the logic behind the MAX() trick, despite the explanation given in the article. And what's the purpose of "HAVING COUNT(*) = 1"?

 

Is the given solution indeed correct?

And why does the resulting table have duplicate columns? How would I distinguish between the previous and current salaries/dates, in case I create a report based on this table? I don't know that this is permitted in standard SQL.

I'll try to solve this problem using relational algebra. Fortunately, I have a copy of Chris Date's INTRODUCTION TO DATABASE SYSTEMS 7th edition, and your book UNDERSTANDING RELATIONAL DATABASES.

 

Here is my solution. I used the relational algebra operators as defined in Chris Date's book:


Assume that the following domains are defined: Name, Date, and Amount. Name is represented by a character string; Amount is represented by a number with two decimal places; and Date is represented by a string of the form 'YYYY-MM-DD'. Applicable comparison operators (=, <, >, etc.) are defined for each domain; in addition, the subtraction operator is defined for the Date domain. If D1 and D2 are two Dates, and D2 >= D1 (either D2 = D1 or D2 is later than D1), then D2 - D1 is the number of days between D1 and D2.

Suppose that the relation variable "salaries" is defined as follows (Tutorial D):

VAR salaries BASE RELATION

{emp Name,

   sal_date Date,

   sal_amt Amount}

PRIMARY KEY {emp,sal_date};

 

In my solution, all employees without previous salary records are assigned a previous salary of 0.00 and a previous salary date of '1900-01-01'. My solution is as follows (Tutorial D):

prevsal := salaries RENAME {sal_date AS prevdate,

sal_amt AS prevsalary};

currsal := salaries RENAME {sal_date AS currdate,

sal_amt AS currsalary};

 

/* These lines are for employees with previous salary records. */

 

joinsal := prevsal JOIN currsal;

lesssal := joinsal WHERE prevdate < currdate;

diffdatesal := EXTEND lesssal ADD currdate - prevdate

AS diff;

 

mindiffsal := SUMMARIZE diffdatesal PER

diffdatesal{emp, prevdate} ADD MIN(diff) as mindiff;

rendiffsal := mindiffsal RENAME {mindiff as diff};

joindiffsal := rendiffsal JOIN diffdatesal;

 

prevcurrsal := joindiffsal {ALL BUT diff};

 

/* These lines are for employees without previous

salary records. */

 

lookup := prevcurrsal {ALL BUT prevdate, prevsalary};

noprevsal := currsal MINUS lookup;

newempsal := EXTEND noprevsal ADD Date('1900-01-01')

AS prevdate, ADD Amount(0.00) AS prevsalary;

 

/* Here is the final result. */

 

result := prevcurrsal UNION newempsal;

 

A sample "result" relation:

+-------+------------+-----------+------------+-----------+

| EMP   | CURRDATE   |CURRSALARY | PREVDATE   |PREVSALARY |

+-------+------------+-----------+------------+-----------+

| Tom   | 1996-12-20 |  900.00   | 1996-10-20 | 800.00    |

| Harry | 1996-09-20 |  700.00   | 1996-07-20 | 500.00    |

| Dick  | 1996-06-20 |  500.00   | 1900-01-01 |   0.00    |

+-------+------------+-----------+------------+-----------+

 

I could have combined all those steps into a single expression, but that expression would be terribly long and complex.

 

 

Fabian Pascal Comments: This was a very long time ago and I do not recall the exact circumstances, and whether my reply was properly represented or understood (particularly coming from Celko). My guess is that it had something to do with inability to resolve such problems without a precise definition of the tables to which the query is to be applied, the business rules in effect for the tables, and the query at issue. I will let Chris Date to respond to PV’s solution.

 

 

Chris Date Responds:  

 

·   Regarding whether Celko’s solution is correct or not, I neither know, nor care.

 

·   Regarding duplicate column names, it is permitted in SQL, but you ask a good question. See A Sweet Disorder, Paper #3 in our DATABASE FOUNDATIONS SERIES.

 

·   Your solution is much better, but it is even better to use WITH instead of assignments to temporaries. I think this does it:

 

WITH (salaries RENAME (sal_date AS prev_date,

                       sal_amt AS prev_amt)) AS t1,

     ((salaries JOIN t1) WHERE sal_date > prev_date) AS t2.

      (salaries{emp} MINUS t2{emp}) JOIN salaries) AS t3,

      (EXTEND t3 ADD (DATE(‘1900-01-01’) AS prev_date,

                      0.00 AS pay_amt)) AS t4;

     t1 UNION t4;

 

Tedious, but essentially straightforward.

 

 

Posted 08/08/03

 

 

 

[ABOUT] [QUOTES] [LINKS]