I’m trying to aggregate the contents of a column from a directed acyclic graph (DAG) stored in a Postgres table.
Each dag
row has an id
, bytes
, and may also have a parent
referencing another row. I’m trying to write a function which identifies every “tip” in the graph starting from a particular parent id
. The result needs to include each graph tip and its aggregated collected_bytes
from the parent id
to that tip. (The DAG can be very deep, so some collected_bytes
arrays can have millions of elements.)
The function below works, but memory usage grows quadratically as the collected_bytes
get longer. The results
CTE keeps a copy of every iteration of collected_bytes
until the end of the query, then the ranked
CTE is used to select only the deepest node for each tip.
I think I’m approaching this incorrectly: how can I do this more efficiently?
Is it possible to instruct Postgres to drop the intermediate results as the recursive query is running? (So we can also skip the ranked
CTE?) Is there a more natural way to achieve the result below?
DROP TABLE IF EXISTS dag;
CREATE TABLE dag (
id bigint PRIMARY KEY,
parent bigint,
bytes bytea NOT NULL,
FOREIGN KEY (parent) REFERENCES dag(id)
);
INSERT INTO dag (id, parent, bytes) VALUES (0, NULL, 'x0000'),
(1, NULL, 'x0100'),
(2, 0, 'x0200'),
(3, 1, 'x0300'),
(4, 2, 'x0400'),
(5, 3, 'x0500'),
(6, 4, 'x0600'),
(7, 5, 'x0700'),
(8, 4, 'x0800');
DROP FUNCTION IF EXISTS get_descendant;
CREATE FUNCTION get_descendant (input_id bigint)
RETURNS TABLE(start_id bigint, end_id bigint, collected_bytes bytea(), depth bigint)
LANGUAGE sql STABLE
AS $$
WITH RECURSIVE results AS (
SELECT id AS start_id, id AS end_id, ARRAY(bytes) AS collected_bytes, 0 AS depth
FROM dag WHERE id = input_id
UNION ALL
SELECT start_id,
dag.id AS end_id,
collected_bytes || dag.bytes AS collected_bytes,
depth + 1 AS depth
FROM results INNER JOIN dag
ON results.end_id = dag.parent
WHERE depth < 100000
),
ranked AS (
SELECT *, rank() over (PARTITION BY start_id ORDER BY start_id, depth DESC) FROM results
)
SELECT start_id, end_id, collected_bytes, depth FROM ranked WHERE rank = 1;
$$;
Here’s the result for 0
, which has two valid tips, an id
of 6
and 8
. The collected_bytes
field is the aggregation of bytes
along each path:
postgres=# SELECT get_descendant.* FROM get_descendant(0::bigint);
start_id | end_id | collected_bytes | depth
----------+--------+-------------------------------------------+-------
0 | 6 | {"\x0000","\x0200","\x0400","\x0600"} | 3
0 | 8 | {"\x0000","\x0200","\x0400","\x0800"} | 3
(2 rows)
While here are the intermediate results
before being ranked
(and only the maximum depths selected):
postgres=# SELECT get_descendant.* FROM get_descendant(0);
start_id | end_id | collected_bytes | depth
----------+--------+-------------------------------------------+-------
0 | 0 | {"\x0000"} | 0
0 | 2 | {"\x0000","\x0200"} | 1
0 | 4 | {"\x0000","\x0200","\x0400"} | 2
0 | 6 | {"\x0000","\x0200","\x0400","\x0600"} | 3
0 | 8 | {"\x0000","\x0200","\x0400","\x0800"} | 3
(5 rows)
As you can see, this implementation is already wasting ~half of the memory in use. How can I make this more memory efficient?
Thanks!