From: NJ
To: Editor
Date: 28 Apr 2004
I have just re-read your book PRACTICAL ISSUES IN
DATABASE MANAGEMENT, which I bought several years ago. I really enjoyed
it (again), but the chapter dealing with trees contains some concepts that to
me seem like serious errors/mistakes.
The basic problem I see is that you do not distinguish in
subchapter 7.2.3 between "nodes" and "parts", or more
generally between "nodes" and "data about nodes".
As you write yourself in 7.2: In a tree every node (except
the root) has exactly one parent. So when in figure 7.5 you let P3 have both
P1, P2 and P4 as parents, these tables are not describing a tree! What is
missing in the drawing at figure 7.4, as well as in figure 7.5, is a unique
identifier of each node. I am surprised that you of all people would fail to
identify this missing key.
A proper implementation could look like this (using SQL for
my example - something I learned from Joe Celko):
CREATE TABLE parts (part_id text PRIMARY KEY);
INSERT INTO parts (part_id) VALUES ('P1');
INSERT INTO parts (part_id) VALUES ('P2');
INSERT INTO parts (part_id) VALUES ('P3');
INSERT INTO parts (part_id) VALUES ('P4');
INSERT INTO parts (part_id) VALUES ('P5');
INSERT INTO parts (part_id) VALUES ('P6');
INSERT INTO parts (part_id) VALUES ('P7');
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY,
part_id VARCHAR(3) NOT NULL REFERENCES parts (part_id));
INSERT INTO nodes (node_id, part_id) VALUES (1, 'P1');
INSERT INTO nodes (node_id, part_id) VALUES (2, 'P2');
INSERT INTO nodes (node_id, part_id) VALUES (3, 'P3');
INSERT INTO nodes (node_id, part_id) VALUES (4, 'P4');
INSERT INTO nodes (node_id, part_id) VALUES (5, 'P3');
INSERT INTO nodes (node_id, part_id) VALUES (6, 'P7');
INSERT INTO nodes (node_id, part_id) VALUES (7, 'P5');
INSERT INTO nodes (node_id, part_id) VALUES (8, 'P6');
INSERT INTO nodes (node_id, part_id) VALUES (9, 'P3');
INSERT INTO nodes (node_id, part_id) VALUES (10, 'P6');
CREATE TABLE links
(child_id INTEGER PRIMARY KEY REFERENCES nodes(node_id),
parent_id INTEGER NOT NULL REFERENCES nodes(node_id));
INSERT INTO links (child_id, parent_id) VALUES (2, 1);
INSERT INTO links (child_id, parent_id) VALUES (3, 1);
INSERT INTO links (child_id, parent_id) VALUES (4, 1);
INSERT INTO links (child_id, parent_id) VALUES (5, 2);
INSERT INTO links (child_id, parent_id) VALUES (6, 2);
INSERT INTO links (child_id, parent_id) VALUES (7, 3);
INSERT INTO links (child_id, parent_id) VALUES (8, 3);
INSERT INTO links (child_id, parent_id) VALUES (9, 4);
INSERT INTO links (child_id, parent_id) VALUES (10,4);
The (surrogate) key on the NODES table, and the use of c_id
as the primary key on the LINKS table gets rid of the problem with duplicates
in explosions and computed surrogate keys like LINK in your model to
distinguish them. As a result the grand explosion of the LINKS table has (p_id,
c_id) as a natural key, and can be joined to the parts table to get a view
including part numbers.
A couple of further points:
Ø
The
EXPLODE operator could have used a little more description. Syntax probably
should be something like "EXPLODE (links, parent_id, child_id)"
returning a relation containing parent_id and child_id only.
Ø
Your
critique of the implementation of trees by Oracle (CONNECT BY) is still very
valid. In fact, I still haven't found a way to express the grand explosion
using CONNECT BY (partially because I haven't found any consistent definition of
the CONNECT BY clause). However, your
example code at the bottom of page 179 does NOT give the result listed in
figure 7.6. In fact not two but four duplicates of (P1, P6) are in the results.
Ø I
believe the treatment of the DB2/SQL3 approach (WITH and recursive UNION) is a
little bit too harsh. It is true that it lacks some of the immediate clarity of
EXPLODE, but it does make it possible to express the grand explosion as a
relation - and you don't need to use UNION ALL:
WITH tmp (parent_id, child_id)
AS (SELECT parent_id, child_id
FROM links
UNION
SELECT tmp.parent_id,
links.child_id
FROM links INNER JOIN
tmp
ON (tmp.child_id =
links.parent_ID))
SELECT *
FROM parent
Indeed this is not as short or "functional" as
EXPLODE (links,parent_id, child_id)
but it certainly can be useful in implementing this in
practice - for instance as a view (possibly materializing for performance)
which can then be used for all queries on a given hierarchy.
What I like about the DB2 syntax is that it is
·
apparently well-defined (although I do not have access
to the SQL standards defining it) - using unions, nested sets and fixpoints
·
generic enough to be useful for stuff that doesn't
involve trees, but some other data structure (unlike EXPLODE).
That's all. I hope you will at least tell me if you think
I've missed the mark completely.
From: Fabian Pascal
To: NJ
You may be right that my language did not explicitly
distinguish between “parts” and “nodes”. However, the tree definition pertains
to nodes, not to parts, and in that sense there is no error.
It may be, indeed, possible for the user to address the
problem of node identity, but it is one of the main points of the chapter that
that is a DBMS function and as a general rule we frown upon users taking upon
themselves database functions, which defeats the idea behind database
management. It is precisely this kind of function that should be left to the
DBMS.
Nothing can persuade me to comment on anything by Celko. Too
much experience and wasted time.
I do not presume to have expertise in Oracle code (and I’d
like to keep it that way). Whether there are two or four duplicates does not
affect the argument. However, for the sake of accuracy, I will be glad to post
the actual Oracle result you got in the Errata, if you send it
to me.
The suggested syntax for the EXPLODE was schematic in
character. Considerable more thought should go into devising a real language
syntax.
My “harshness” towards the DB2 feature is justified, given
that it violates the relational model and unnecessarily risks the loss of the
practical advantages the model guarantees. It is trivial to say that it is
better than nothing, but that is not the point. The point is that we need the
correct solution and DB2 fails to deliver it.
Ed. Note: A PRACTICAL DATABASE FOUNDATIONS paper on
how trees and recursive operations are handled relationally is in the works and,
we hope, will include comparisons with Oracle, DB2 and SQL Server.
Posted
07/23/04