Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
475 views
in Technique[技术] by (71.8m points)

sql - Detect duplicate items in recursive CTE

I have a set of dependencies stored in my database. I'm looking to find all the objects that depend on the current one, whether directly or indirectly. Since objects can depend zero or more other objects, it's perfectly reasonable that object 1 is depended on by object 9 twice (9 depends on 4 and 5, both of which depend on 1). I'd like to get the list of all the objects that depend on the current object without duplication.

This gets more complex if there are loops. Without loops, one could use DISTINCT, though going through long chains more than once only to cull them at the end is still a problem. With loops, however, it becomes important that the RECURSIVE CTE doesn't union with something it has already seen.

So what I have so far looks like this:

WITH RECURSIVE __dependents AS (
  SELECT object, array[object.id] AS seen_objects
  FROM immediate_object_dependents(_objectid) object
  UNION ALL
  SELECT object, d.seen_objects || object.id
  FROM __dependents d
  JOIN immediate_object_dependents((d.object).id) object
    ON object.id <> ALL (d.seen_objects)
) SELECT (object).* FROM __dependents;

(It's in a stored procedure, so I can pass in _objectid)

Unfortunately, this just omits a given object when I've seen it before in the current chain, which would be fine if a recursive CTE was being done depth-first, but when it's breadth-first, it becomes problematic.

Ideally, the solution would be in SQL rather than PLPGSQL, but either one works.

As an example, I set this up in postgres:

create table objectdependencies (
  id int,
  dependson int
);

create index on objectdependencies (dependson);

insert into objectdependencies values (1, 2), (1, 4), (2, 3), (2, 4), (3, 4);

And then I tried running this:

with recursive rdeps as (
  select dep
  from objectdependencies dep
  where dep.dependson = 4 -- starting point
  union all
  select dep
  from objectdependencies dep
  join rdeps r
    on (r.dep).id = dep.dependson
) select (dep).id from rdeps;

I'm expecting "1, 2, 3" as output.

However, this somehow goes on forever (which I also don't understand). If I add in a level check (select dep, 0 as level, ... select dep, level + 1, on ... and level < 3), I see that 2 and 3 are repeating. Conversely, if I add a seen check:

with recursive rdeps as (
  select dep, array[id] as seen
  from objectdependencies dep
  where dep.dependson = 4 -- starting point
  union all
  select dep, r.seen || dep.id
  from objectdependencies dep
  join rdeps r
    on (r.dep).id = dep.dependson and dep.id <> ALL (r.seen)
) select (dep).id from rdeps;

then I get 1, 2, 3, 2, 3, and it stops. I could use DISTINCT in the outer select, but that only reasonably works on this data because there is no loop. With a larger dataset and more loops, we will continue to grow the CTE's output only to have the DISTINCT pare it back down. I would like the CTE to simply stop that branch when it's already seen that particular value somewhere else.

Edit: this is not simply about cycle detection (though there can be cycles). It's about uncovering everything referenced by this object, directly and indirectly. So if we have 1->2->3->5->6->7 and 2->4->5, we can start at 1, go to 2, from there we can go to 3 and 4, both of those branches will go to 5, but I don't need both branches to do so - the first one can go to 5, and the other can simply stop there. Then we go on to 6 and 7. Most cycle detection will find no cycles and return 5, 6, 7 all twice. Given that I expect most of my production data to have 0-3 immediate references, and most of those to be likewise, it will be very common for there to be multiple branches from one object to another, and going down those branches will be not only redundant but a huge waste of time and resource.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The word dep in the second query (after union) is ambiguous. In fact it is interpreted as the column of rdeps, not as an alias of objectdependencies.

with recursive rdeps as (
  select dep
  from objectdependencies dep
  where dep.dependson = 4 -- starting point
  union all
  select dep -- this means r.dep
  from objectdependencies dep
  join rdeps r
    on (r.dep).id = dep.dependson
) select (dep).id from rdeps;

This is why the query creates an endless loop. You can correct this by changing the alias:

with recursive rdeps as (
  select dep
  from objectdependencies dep
  where dep.dependson = 4 -- starting point
  union all
  select objectdep
  from objectdependencies objectdep
  join rdeps r
    on (r.dep).id = objectdep.dependson
) select (dep).id from rdeps;

 id 
----
  1
  2
  3
  1
  2
  1
(6 rows)    

Or better, just by using columns, like the good Lord intended:

with recursive rdeps as (
    select id, dependson
    from objectdependencies
    where dependson = 4
union all
    select d.id, d.dependson
    from objectdependencies d
    join rdeps r
    on r.id = d.dependson
) 
select *
from rdeps;

The first query in the question is all you can do in plain sql as there is no communication between different (paralel) branches generated by a recursive query. In a functional approach you can use a temporary table as a store common to all branches. The function may look like this:

create or replace function rec_function(int)
returns void language plpgsql as $$
declare
    i int;
begin
    for i in
        select id
        from objectdependencies
        where dependson = $1
    loop
        if not exists(
            select from temp_table 
            where id = i)
        then
            insert into temp_table values(i);
            perform rec_function(i);
        end if;
    end loop;
end $$;

Usage:

create temp table temp_table(id int);

select rec_function(4);

select *
from temp_table;

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...