PostgreSQL - Working with array of thousands of elements

I'm looking to select rows based on whether a column is contained in a large list of values that I pass as an integer array.

Here's the query I currently use:

SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        item_id = ANY ($1) -- Integer array
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12

The table is structured as such:

    Column     |            Type             | Collation | Nullable | Default
 --------------+-----------------------------+-----------+----------+---------
 item_id       | integer                     |           | not null |
 allowed       | boolean                     |           | not null |
 start_date    | timestamp without time zone |           | not null |
 end_date      | timestamp without time zone |           | not null |
 ...
Indexes:
    "idx_dtr_query" btree (item_id, start_date, allowed, end_date)
    ...
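For reference, a minimal DDL sketch that should reproduce this structure (abridged; the real table has more columns where the "..." appears):

-- Minimal reconstruction of the table and index above (sketch, abridged)
CREATE TABLE mytable (
    item_id    integer   NOT NULL,
    allowed    boolean   NOT NULL,
    start_date timestamp NOT NULL,
    end_date   timestamp NOT NULL
    -- ...
);

CREATE INDEX idx_dtr_query ON mytable (item_id, start_date, allowed, end_date);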

I came up with this index after trying different ones and running EXPLAIN on the query. This one was the most efficient for both querying and sorting. Here is the EXPLAIN ANALYZE of the query:

Subquery Scan on x  (cost=0.56..368945.41 rows=302230 width=73) (actual time=0.021..276.476 rows=168395 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 90275
  ->  WindowAgg  (cost=0.56..357611.80 rows=906689 width=73) (actual time=0.019..248.267 rows=258670 loops=1)
        ->  Index Scan using idx_dtr_query on mytable  (cost=0.56..339478.02 rows=906689 width=73) (actual time=0.013..130.362 rows=258670 loops=1)
              Index Cond: ((item_id = ANY ('{/* 15,000 integers */}'::integer[])) AND (end_date > '2018-03-30 12:08:00'::timestamp without time zone))
Planning time: 30.349 ms
Execution time: 284.619 ms

The issue is that the int array can contain up to 15,000 elements or so, and the query gets quite slow in that case (about 800 ms on my laptop, a recent Dell XPS).
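For reference, this is roughly how I reproduce the timing from psql (a sketch with the column list trimmed down; the small array literal stands in for the real ~15,000-element parameter):

-- Benchmarking sketch in psql; the array literal stands in for the
-- ~15,000-element parameter my application sends
PREPARE q(int[], timestamp) AS
SELECT item_id
FROM (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id
    FROM mytable
    WHERE item_id = ANY ($1)
      AND end_date > $2
) x
WHERE x.r <= 12;

EXECUTE q('{1,2,3}', '2018-03-30 12:08:00');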

I suspected that passing the int array as a parameter could be slow, and since the list of ids can be stored in the database beforehand, I tried that. I stored them in an array in another table and used item_id = ANY (SELECT UNNEST(item_ids) FROM ...), which was slower than my current approach. I also tried storing them row by row and using item_id IN (SELECT item_id FROM ...), which was even slower, even when that table contained only the rows relevant to my test case. Both variants are sketched below.
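Concretely, the two variants looked roughly like this (stored_ids, stored_id_rows and list_id are hypothetical names for the pre-filled tables; only the WHERE clause of the query changed):

-- Variant 1: ids kept as one array column in a separate table
-- (hypothetical names; slower than item_id = ANY ($1))
WHERE item_id = ANY (SELECT UNNEST(item_ids) FROM stored_ids WHERE list_id = $1)
  AND end_date > $2

-- Variant 2: ids kept one per row (slower still)
WHERE item_id IN (SELECT item_id FROM stored_id_rows WHERE list_id = $1)
  AND end_date > $2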

Is there a better way of doing this?

Update: following Evan's comments, here is another approach I tried. Each item belongs to several groups, so instead of passing the group's item ids, I tried adding the group ids to mytable:

    Column     |            Type             | Collation | Nullable | Default
 --------------+-----------------------------+-----------+----------+---------
 item_id       | integer                     |           | not null |
 allowed       | boolean                     |           | not null |
 start_date    | timestamp without time zone |           | not null |
 end_date      | timestamp without time zone |           | not null |
 group_ids     | integer[]                   |           | not null |
 ...
Indexes:
    "idx_dtr_query" btree (item_id, start_date, allowed, end_date)
    "idx_dtr_group_ids" gin (group_ids)
    ...
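Sketched as DDL, the change amounts to this (assuming the table definition from earlier):

-- Sketch of the schema change (assuming the earlier table definition)
ALTER TABLE mytable ADD COLUMN group_ids integer[] NOT NULL DEFAULT '{}';

CREATE INDEX idx_dtr_group_ids ON mytable USING gin (group_ids);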

New query ($1 is the targeted group id):

SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        $1 = ANY (group_ids)
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12

EXPLAIN ANALYZE:

Subquery Scan on x  (cost=123356.60..137112.58 rows=131009 width=74) (actual time=811.337..1087.880 rows=172023 loops=1)
  Filter: (x.r <= 12)
  Rows Removed by Filter: 219726
  ->  WindowAgg  (cost=123356.60..132199.73 rows=393028 width=74) (actual time=811.330..1040.121 rows=391749 loops=1)
        ->  Sort  (cost=123356.60..124339.17 rows=393028 width=74) (actual time=811.311..868.127 rows=391749 loops=1)
              Sort Key: item_id, start_date, allowed
              Sort Method: external sort  Disk: 29176kB
              ->  Seq Scan on mytable  (cost=0.00..69370.90 rows=393028 width=74) (actual time=0.105..464.126 rows=391749 loops=1)
                    Filter: ((end_date > '2018-04-06 12:00:00'::timestamp without time zone) AND (2928 = ANY (group_ids)))
                    Rows Removed by Filter: 1482567
Planning time: 0.756 ms
Execution time: 1098.348 ms

There might be room for improvement with indexes, but I'm having a hard time understanding how Postgres uses them, so I'm not sure what to change.
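For instance, two unverified ideas I still want to benchmark: as far as I understand, the $1 = ANY (group_ids) form cannot use the GIN index (the indexable form is the containment operator @>), and the external sort above spilled about 29 MB to disk, which more work_mem might avoid:

-- Unverified ideas, not benchmarked yet:

-- 1. Rewrite the predicate into the GIN-indexable containment form;
--    as far as I understand, "= ANY (column)" cannot use a GIN index.
--    Same query as above, with the WHERE clause changed to:
WHERE group_ids @> ARRAY[$1]
  AND end_date > $2

-- 2. Give the sort enough memory; the plan above spilled ~29 MB to disk
SET work_mem = '64MB';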

