From: PV
To: Editor
As I was searching for online tutorials on the relational
model and related stuff, I stumbled on a puzzle problem posted on Joe Celko's
old [Ed. Note: misnamed] column SQL
For Smarties. Let me quote the puzzle:
Jack Wells (CompuServe 100442,3164) submitted this perplexing
SQL problem in June 1996. His situation is pretty typical for SQL programmers
who work with 3GL people. The programmers are writing a report on the
employees, and they want information about each employee's current and previous
salary status. The report will show the date of their promotion and the salary
amount.
Jack spoke with Fabian Pascal, author of Understanding
Relational Databases (John Wiley, 1993, ISBN 0-471-58538-6) and SQL and
Relational Basics (M&T Books, 1989, ISBN 1-55851-063-X), the week he was
working on this problem, and Mr. Pascal replied that this query could not be
done. He said, 'In a truly relational language it could be done, but since SQL
is not relational it isn't possible, not even with SQL-92.' Sounds like a
challenge to me [Joe Celko]!
Oh, I forgot to mention an additional constraint on the query:
Jack is working in Oracle. This product is still not up to SQL-92 standards
(that is, there are no proper outer joins, no general scalar subexpressions,
and so on), so your query must run under the old SQL-86 or SQL-89 rules.
Assume that you have this test data:
CREATE TABLE Salaries
(emp CHAR(10) NOT NULL,
sal_date DATE NOT NULL,
sal_amt DECIMAL (8,2) NOT NULL,
PRIMARY KEY (emp,sal_date));
INSERT INTO Salaries VALUES
('Tom','1996-06-20',500.00);
INSERT INTO Salaries VALUES
('Tom','1996-08-20',700.00);
INSERT INTO Salaries VALUES
('Tom','1996-10-20',800.00);
INSERT INTO Salaries VALUES
('Tom','1996-12-20',900.00);
INSERT INTO Salaries VALUES
('Dick','1996-06-20',500.00);
INSERT INTO Salaries VALUES
('Harry','1996-07-20',500.00);
INSERT INTO Salaries VALUES
('Harry','1996-09-20',700.00);
Tom has had two promotions, Dick is a new hire, and Harry has
had one promotion.
Then near the bottom of the page, a solution is presented as
follows:
SELECT S0.emp, S0.sal_date,S0.sal_amt,
S1.sal_date, S1.sal_amt
FROM Salaries AS S0,Salaries AS S1
WHERE S0.emp = S1.emp
AND S0.sal_date =
(SELECT MAX(S2.sal_date)
FROM Salaries AS S2
WHERE S0.emp = S2.emp)
AND S1.sal_date =
(SELECT MAX(S3.sal_date)
FROM Salaries AS S3
WHERE S0.emp = S3.emp
AND S3.sal_date < S0.sal_date)
UNION ALL
SELECT S4.emp,MAX(S4.sal_date),MAX(S4.sal_amt),
NULL, NULL
FROM Salaries AS S4
GROUP BY S4.emp
HAVING COUNT(*) = 1;
+-------+------------+--------+------------+--------+
| emp | sal_date |sal_amt | sal_date | sal_amt|
+-------+------------+--------+------------+--------+
| Tom | 1996-12-20
| 900
| 1996-10-20 | 800 |
| Harry | 1996-09-20 |
700 | 1996-07-20 | 500 |
| Dick | 1996-06-20
| 500
| (NULL) |(NULL) |
+-------+------------+--------+------------+--------+
I find the solution unintuitive and difficult to understand.
I can't get the logic behind the MAX() trick, despite the explanation given in
the article. And what's the purpose of "HAVING COUNT(*) = 1"?
Is the given solution indeed correct?
And why does the resulting table have duplicate columns? How would I
distinguish between the previous and current salaries/dates, in case I create a
report based on this table? I don't know that this is permitted in standard
SQL.
I'll try to solve this problem using relational algebra. Fortunately, I have a
copy of Chris Date's INTRODUCTION TO DATABASE SYSTEMS 7th edition, and
your book UNDERSTANDING RELATIONAL DATABASES.
Here is my solution. I used the relational algebra operators
as defined in Chris Date's book:
Assume that the following domains are defined: Name, Date, and Amount. Name is
represented by a character string; Amount is represented by a number with two
decimal places; and Date is represented by a string of the form 'YYYY-MM-DD'.
Applicable comparison operators (=, <, >, etc.) are defined for each
domain; in addition, the subtraction operator is defined for the Date domain.
If D1 and D2 are two Dates, and D2 >= D1 (either D2 = D1 or D2 is later than
D1), then D2 - D1 is the number of days between D1 and D2.
Suppose that the relation variable "salaries" is defined as follows
(Tutorial D):
VAR salaries BASE RELATION
{emp Name,
sal_date Date,
sal_amt Amount}
PRIMARY KEY {emp,sal_date};
In my solution, all employees without previous salary records
are assigned a previous salary of 0.00 and a previous salary date of
'1900-01-01'. My solution is as follows (Tutorial D):
prevsal := salaries RENAME {sal_date AS prevdate,
sal_amt AS prevsalary};
currsal := salaries RENAME {sal_date AS currdate,
sal_amt AS currsalary};
/* These lines are for employees with previous salary records.
*/
joinsal := prevsal JOIN currsal;
lesssal := joinsal WHERE prevdate < currdate;
diffdatesal := EXTEND lesssal ADD currdate - prevdate
AS diff;
mindiffsal := SUMMARIZE diffdatesal PER
diffdatesal{emp, prevdate} ADD MIN(diff) as mindiff;
rendiffsal := mindiffsal RENAME {mindiff as diff};
joindiffsal := rendiffsal JOIN diffdatesal;
prevcurrsal := joindiffsal {ALL BUT diff};
/* These lines are for employees without previous
salary records. */
lookup := prevcurrsal {ALL BUT prevdate, prevsalary};
noprevsal := currsal MINUS lookup;
newempsal := EXTEND noprevsal ADD Date('1900-01-01')
AS prevdate, ADD Amount(0.00) AS prevsalary;
/* Here is the final result. */
result := prevcurrsal UNION newempsal;
A sample "result" relation:
+-------+------------+-----------+------------+-----------+
| EMP | CURRDATE |CURRSALARY | PREVDATE |PREVSALARY |
+-------+------------+-----------+------------+-----------+
| Tom | 1996-12-20
| 900.00 | 1996-10-20 | 800.00
|
| Harry | 1996-09-20 |
700.00 | 1996-07-20 |
500.00 |
| Dick | 1996-06-20
| 500.00 | 1900-01-01 |
0.00 |
+-------+------------+-----------+------------+-----------+
I could have combined all those steps into a single
expression, but that expression would be terribly long and complex.
Fabian Pascal Comments: This was a very long
time ago and I do not recall the exact circumstances, and whether my reply was
properly represented or understood (particularly coming from Celko). My guess
is that it had something to do with inability to resolve such problems without
a precise definition of the tables to which the query is to be applied, the
business rules in effect for the tables, and the query at issue. I will let
Chris Date to respond to PV’s solution.
Chris Date Responds:
·
Regarding whether Celko’s solution is correct or not, I
neither know, nor care.
·
Regarding duplicate column names, it is permitted in
SQL, but you ask a good question. See A Sweet
Disorder,
Paper #3 in our DATABASE FOUNDATIONS SERIES.
·
Your solution is much better, but it is even better to
use WITH instead of assignments to temporaries. I think this does it:
WITH (salaries RENAME (sal_date AS prev_date,
sal_amt AS prev_amt)) AS t1,
((salaries JOIN t1)
WHERE sal_date > prev_date) AS t2.
(salaries{emp} MINUS
t2{emp}) JOIN salaries) AS t3,
(EXTEND t3 ADD
(DATE(‘1900-01-01’) AS prev_date,
0.00 AS pay_amt)) AS t4;
t1 UNION t4;
Tedious, but essentially straightforward.
Posted
08/08/03
[ABOUT]
[QUOTES]
[LINKS]