ON “CLIMBING TREES” IN SQL AND ITS IMPLEMENTATIONS
with Fabian Pascal

 

 

 

From: NJ

To: Editor

Date: 28 Apr 2004

 

I have just re-read your book PRACTICAL ISSUES IN DATABASE MANAGEMENT, which I bought several years ago. I really enjoyed it (again), but the chapter dealing with trees contains some concepts that to me seem like serious errors/mistakes.

 

The basic problem I see is that you do not distinguish in subchapter 7.2.3 between "nodes" and "parts", or more generally between "nodes" and "data about nodes".

 

As you write yourself in 7.2: In a tree every node (except the root) has exactly one parent. So when in figure 7.5 you let P3 have both P1, P2 and P4 as parents, these tables are not describing a tree! What is missing in the drawing at figure 7.4, as well as in figure 7.5, is a unique identifier of each node. I am surprised that you of all people would fail to identify this missing key.

 

A proper implementation could look like this (using SQL for my example - something I learned from Joe Celko):

 

CREATE TABLE parts (part_id text PRIMARY KEY);

 

INSERT INTO parts (part_id) VALUES ('P1');

INSERT INTO parts (part_id) VALUES ('P2');

INSERT INTO parts (part_id) VALUES ('P3');

INSERT INTO parts (part_id) VALUES ('P4');

INSERT INTO parts (part_id) VALUES ('P5');

INSERT INTO parts (part_id) VALUES ('P6');

INSERT INTO parts (part_id) VALUES ('P7');

 

CREATE TABLE nodes (node_id INTEGER PRIMARY KEY,

   part_id VARCHAR(3) NOT NULL REFERENCES parts (part_id));

 

INSERT INTO nodes (node_id, part_id) VALUES (1, 'P1');

INSERT INTO nodes (node_id, part_id) VALUES (2, 'P2');

INSERT INTO nodes (node_id, part_id) VALUES (3, 'P3');

INSERT INTO nodes (node_id, part_id) VALUES (4, 'P4');

INSERT INTO nodes (node_id, part_id) VALUES (5, 'P3');

INSERT INTO nodes (node_id, part_id) VALUES (6, 'P7');

INSERT INTO nodes (node_id, part_id) VALUES (7, 'P5');

INSERT INTO nodes (node_id, part_id) VALUES (8, 'P6');

INSERT INTO nodes (node_id, part_id) VALUES (9, 'P3');

INSERT INTO nodes (node_id, part_id) VALUES (10, 'P6');

 

CREATE TABLE links

 (child_id INTEGER PRIMARY KEY REFERENCES nodes(node_id),

  parent_id INTEGER NOT NULL REFERENCES nodes(node_id));

 

INSERT INTO links (child_id, parent_id) VALUES (2, 1);

INSERT INTO links (child_id, parent_id) VALUES (3, 1);

INSERT INTO links (child_id, parent_id) VALUES (4, 1);

INSERT INTO links (child_id, parent_id) VALUES (5, 2);

INSERT INTO links (child_id, parent_id) VALUES (6, 2);

INSERT INTO links (child_id, parent_id) VALUES (7, 3);

INSERT INTO links (child_id, parent_id) VALUES (8, 3);

INSERT INTO links (child_id, parent_id) VALUES (9, 4);

INSERT INTO links (child_id, parent_id) VALUES (10,4);

 

The (surrogate) key on the NODES table, and the use of c_id as the primary key on the LINKS table gets rid of the problem with duplicates in explosions and computed surrogate keys like LINK in your model to distinguish them. As a result the grand explosion of the LINKS table has (p_id, c_id) as a natural key, and can be joined to the parts table to get a view including part numbers.

 

A couple of further points:

 

Ø       The EXPLODE operator could have used a little more description. Syntax probably should be something like "EXPLODE (links, parent_id, child_id)" returning a relation containing parent_id and child_id only.

 

Ø       Your critique of the implementation of trees by Oracle (CONNECT BY) is still very valid. In fact, I still haven't found a way to express the grand explosion using CONNECT BY (partially because I haven't found any consistent definition of the CONNECT BY clause).  However, your example code at the bottom of page 179 does NOT give the result listed in figure 7.6. In fact not two but four duplicates of (P1, P6) are in the results.

 

Ø       I believe the treatment of the DB2/SQL3 approach (WITH and recursive UNION) is a little bit too harsh. It is true that it lacks some of the immediate clarity of EXPLODE, but it does make it possible to express the grand explosion as a relation - and you don't need to use UNION ALL:

 

 

WITH tmp (parent_id, child_id)

AS (SELECT parent_id, child_id

    FROM links

      UNION

    SELECT tmp.parent_id, links.child_id

    FROM links INNER JOIN tmp

    ON (tmp.child_id = links.parent_ID))

SELECT *

FROM parent

 

Indeed this is not as short or "functional" as

 

  EXPLODE (links,parent_id, child_id)

 

but it certainly can be useful in implementing this in practice - for instance as a view (possibly materializing for performance) which can then be used for all queries on a given hierarchy.

 

What I like about the DB2 syntax is that it is

·   apparently well-defined (although I do not have access to the SQL standards defining it) - using unions, nested sets and fixpoints

·   generic enough to be useful for stuff that doesn't involve trees, but some other data structure (unlike EXPLODE).

 

That's all. I hope you will at least tell me if you think I've missed the mark completely.

 

 

From: Fabian Pascal

To: NJ

 

You may be right that my language did not explicitly distinguish between “parts” and “nodes”. However, the tree definition pertains to nodes, not to parts, and in that sense there is no error.

 

It may be, indeed, possible for the user to address the problem of node identity, but it is one of the main points of the chapter that that is a DBMS function and as a general rule we frown upon users taking upon themselves database functions, which defeats the idea behind database management. It is precisely this kind of function that should be left to the DBMS.

 

Nothing can persuade me to comment on anything by Celko. Too much experience and wasted time.

 

I do not presume to have expertise in Oracle code (and I’d like to keep it that way). Whether there are two or four duplicates does not affect the argument. However, for the sake of accuracy, I will be glad to post the actual Oracle result you got in the Errata, if you send it to me.

 

The suggested syntax for the EXPLODE was schematic in character. Considerable more thought should go into devising a real language syntax.

 

My “harshness” towards the DB2 feature is justified, given that it violates the relational model and unnecessarily risks the loss of the practical advantages the model guarantees. It is trivial to say that it is better than nothing, but that is not the point. The point is that we need the correct solution and DB2 fails to deliver it.

 

 

Ed. Note: A PRACTICAL DATABASE FOUNDATIONS paper on how trees and recursive operations are handled relationally is in the works and, we hope, will include comparisons with Oracle, DB2 and SQL Server.

 

 

Posted 07/23/04