postgresql performance – Postgres tracking count of recent items in count table or by selecting count of recent records from history table

Which would generally be more efficient, or does it not matter?
Option 1 is incrementing a count in a table that stores counts by the unique id, and resetting the count to zero when enough time has passed for that unique id.
Option 2 is not storing a count at all, but instead selecting the count of records from a history table by the records’ unique ids, within the past so many time units.
I would add an index on the id and timestamp fields for option 2.
The history table is in the low millions of records and doesn’t grow that fast.
Is the periodic vacuuming of the counts table likely to be much worse than selecting the count from the history table using indexed fields?
It seems like Option 2 would require fewer operations: no insert or update.
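
For concreteness, a minimal sketch of Option 2 with hypothetical table and column names (history, item_id, created_at), where a single composite index covers both the id and the timestamp:

-- Hypothetical schema: one composite index serving the recency count.
CREATE INDEX idx_history_item_ts ON history (item_id, created_at);

-- Option 2: count recent records directly from the history table.
SELECT count(*)
FROM history
WHERE item_id = 123  -- some unique id
  AND created_at > now() - interval '1 hour';  -- "the past so many time units"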

postgresql – Postgres use index to create index

I have a partial index where a certain column is not null. This is a very small percentage of the table. Thanks to this index, SELECT * FROM table WHERE column IS NOT NULL is incredibly fast (5 milliseconds). But the table has hundreds of millions of rows.

If I want to create a second index on the same set of rows (where that same column is not null), how can I make Postgres use the first index that already exists to find those rows? Currently Postgres just scans the entire table to find them again, which takes many minutes. I can query all the rows in milliseconds, so why can’t CREATE INDEX find them the same way?
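
To make the setup concrete, a sketch with hypothetical names; both indexes are partial over the same small subset of rows:

-- Existing partial index: covers only the rare non-null rows.
CREATE INDEX idx_col_not_null ON big_table (col) WHERE col IS NOT NULL;

-- Desired second index over the same row subset. As far as I know,
-- CREATE INDEX always builds from a scan of the heap, so this rereads
-- the whole table rather than reusing idx_col_not_null.
CREATE INDEX idx_other_col ON big_table (other_col) WHERE col IS NOT NULL;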

postgresql – find out why table insert is slow in postgres cluster

I recently created a PostgreSQL-HA 11 cluster using the Bitnami Helm charts.
There is one master and one replica, plus Pgpool and metrics.

I am using NFS as file storage for the cluster. It is served by a VMware guest machine with plenty of memory and CPU and a fast HP hard disk drive, in the same broadcast domain on the same vSwitch with a 1 Gbps virtual port.

But inserting into the database with a batch script written in JavaScript is slow:
2,000 records take 13.952 seconds.

The same script inserts very fast into the non-clustered PostgreSQL database on my local machine: 2,000 records in less than a second.

Since I am not a DBA or anything like that, how can I find out why it is slow? Can metrics or something else help?
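
One thing worth testing, assuming the script commits each row separately: every per-row commit pays a network round trip plus an fsync, and NFS makes the fsync cost much worse. Batching all rows into one transaction (or a few multi-row INSERT statements) usually changes the timing dramatically. A sketch with a hypothetical table:

BEGIN;
-- One statement carrying many rows pays the commit/fsync cost once,
-- not 2,000 times.
INSERT INTO my_table (a, b) VALUES
    (1, 'x'),
    (2, 'y');  -- ... all 2,000 rows in one or a few statements
COMMIT;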

postgresql – Postgres, trigger function, is it bad for performance?

I have added the following trigger function to one of my tables in my Postgres database.

CREATE OR REPLACE FUNCTION trigger_set_timestamp()
RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = NOW();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

I am new to this, so I am wondering: is this going to slow down my database? Should I avoid trigger functions for performance reasons, and if so, what is the alternative?

(In this case, this was the only way I knew to make the updated_at column always record the current time whenever any column in the table changes.)
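
For context, the function by itself does nothing until a trigger attaches it to the table; a sketch with a hypothetical table name:

-- Attach the function so it fires before every row update.
CREATE TRIGGER set_timestamp
    BEFORE UPDATE ON my_table
    FOR EACH ROW
    EXECUTE FUNCTION trigger_set_timestamp();  -- EXECUTE PROCEDURE before v11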

Are snapshots generated for each sub query inside a Postgres READ_COMMITTED transaction?

From what I understand, in a READ COMMITTED Postgres transaction “the transaction obtains a snapshot whenever an SQL command is executed” (source). Does this mean a snapshot will be obtained for each subquery in a nested query? Does the use of CTEs in place of subqueries affect this behavior?
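
For illustration, a sketch with a hypothetical orders table. To my understanding, the snapshot is taken per top-level statement, so a subquery and a CTE inside the same statement both see that single snapshot rather than obtaining their own:

-- One statement, one snapshot: the outer query and the subquery see the
-- same data even if another session commits mid-execution.
SELECT id FROM orders
WHERE total > (SELECT avg(total) FROM orders);

-- Still one statement, so the CTE shares that single snapshot as well.
WITH avg_total AS (SELECT avg(total) AS a FROM orders)
SELECT o.id FROM orders o, avg_total t WHERE o.total > t.a;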

postgresql – How restore a Postgres database on AMAZON RDS when the file is in another server?

I have spent around two days trying to simply restore my database to an Amazon RDS server. The tutorial page from Amazon is so bad that I can’t complete a simple task like this.

I have a db_example.bak file on my local machine and on my other server, and I want to simply restore this database on RDS from that other server.

Because of my database’s size, restoring from the client side would take the whole day. I can connect to Amazon RDS with psql, but I don’t know how I can import a file that doesn’t exist there.

I don’t know if this is a noob question, but I will try everything to get this simple thing done, and I think my doubt may be someone else’s doubt too.

Observation: I can only find tutorials for SQL Server, not for Postgres.
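
One possible approach, assuming db_example.bak is a custom-format pg_dump archive (if it is a SQL Server .bak file, it cannot be restored into Postgres at all). RDS has no server-side file access, so the restore must run on the machine that holds the file and stream over a client connection; running it on the other server avoids the slow local-machine path:

# Hypothetical endpoint, user, and database name; substitute your own.
pg_restore --host=mydb.xxxxxxxx.us-east-1.rds.amazonaws.com --port=5432 \
           --username=masteruser --dbname=db_example \
           --no-owner --no-privileges \
           db_example.bak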

postgresql 9.4 – Postgres dynamic filter conditions

I want to dynamically filter data based on a condition that is stored in a specific column. This condition can change for every row.
For example, I have a table my_table with a couple of columns, one of them called foo, which holds filter conditions such as AND bar > 1, or in the next row AND bar > 2, or in the next row AND bar = 33.
I have a query which looks like:

SELECT something from somewhere 
LEFT JOIN otherthing on some_condition
WHERE first_condition AND second_condition AND 
here_i_want_dynamically_load_condition_from_my_table.foo

What is the correct way to do this? I have read some articles about dynamic queries, but I have not been able to find a correct way.
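
As an illustration only, a sketch of the usual dynamic-SQL pattern in PL/pgSQL, keeping the placeholder names from the query above. Note that concatenating the stored fragments into EXECUTE is safe only if the contents of foo are fully trusted; otherwise this is a SQL-injection vector:

DO $$
DECLARE
    cond    text;
    matched bigint;
BEGIN
    -- foo holds fragments like 'AND bar > 1'; append each one verbatim.
    FOR cond IN SELECT foo FROM my_table LOOP
        EXECUTE 'SELECT count(*) FROM somewhere WHERE true ' || cond
            INTO matched;
        RAISE NOTICE 'condition "%" matches % rows', cond, matched;
    END LOOP;
END;
$$;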

postgresql – Postgres NOT IN performance

I’m currently doing some exploratory work with Postgres 12, and have bumped into an issue I want to try and understand more about. My background is primarily Microsoft SQL Server, so I’m looking at where various differences lie between the two.

Now in SQL Server, the query optimizer is quite good at seeing similar patterns and generating equivalent plans for a given query. In Postgres, however, I’m getting different plans. That in itself isn’t the issue, and I have no problem with it, but one of the execution plans incurs a large performance hit, so I don’t know whether I’ve missed something obvious or whether this is just the nature of Postgres (if so, fine, it’s something we can work with).

The questions I’m trying to answer are:

  1. Is this behaviour expected?
  2. Is NOT IN generally best avoided with Postgres where data sizes can’t be guaranteed?
  3. Have I missed something obvious with Postgres that would resolve the problem with the query?

I’ve created a basic table that I’ve loaded with various word lists (the Scrabble TWL, SOWPODS, and a few others sourced from this repository on GitHub):

CREATE TABLE wordlist
(
    src character varying(7) COLLATE pg_catalog."default" NOT NULL,
    word character varying(100) COLLATE public.utf8_uk_ci_ai NOT NULL,
    wordlength integer GENERATED ALWAYS AS (char_length((word)::text)) STORED,
    seqid integer NOT NULL GENERATED ALWAYS AS IDENTITY (INCREMENT 1 START 1)
)

I’ve got the following indexes on the table:

CREATE INDEX ix_words ON wordlist (word);
CREATE INDEX ix_words2 ON wordlist (word) INCLUDE (src);
CREATE INDEX ix_words3 ON wordlist (src) INCLUDE (word);
CREATE INDEX ix_srcword ON wordlist (src, word);
CREATE INDEX ix_wordsrc ON wordlist (word, src);

This is more due to experimentation/poking than any actual plan, so apologies if they don’t make immediate sense. I appreciate ix_words, ix_words2, and ix_words3 are likely superfluous to requirements given the last two indexes.

The collation utf8_uk_ci_ai is defined as:

CREATE COLLATION public.utf8_uk_ci_ai
    (LC_COLLATE = 'en_GB@colStrength=primary', LC_CTYPE = 'en_GB@colStrength=primary');

The breakdown of word counts (SELECT src, COUNT(*) FROM wordlist GROUP BY src) in the table looks like this:

[Image: word counts by source]

The three queries I’ve been comparing are below. They’re all intended to return the same dataset, just in different ways; the first two (EXCEPT and NOT EXISTS) run quickly, and it’s the third one (NOT IN) that has problems:

select word from wordlist where src = 'SOWPODS'
except
select word from wordlist where src = 'TWL'

select a.word 
    from wordlist a 
    where a.src = 'SOWPODS' 
        and not exists (select 1 from wordlist b where b.src = 'TWL' and a.word = b.word)

select a.word 
    from wordlist a 
    where a.src = 'SOWPODS' 
        and a.word not in (select word from wordlist b where b.src = 'TWL')

The first two complete in a second or so; the third, however, doesn’t complete even after minutes. I killed the query, so I don’t know how long it would have taken.

The EXPLAIN for the third reads as follows:

Gather  (cost=13054.03..1667897336.79 rows=136620 width=13)
  Workers Planned: 2
  ->  Parallel Bitmap Heap Scan on wordlist a  (cost=12054.03..1667882674.79 rows=56925 width=13)
        Recheck Cond: ((src)::text = 'SOWPODS'::text)
        Filter: (NOT (SubPlan 1))
        ->  Bitmap Index Scan on ix_words3  (cost=0.00..7261.72 rows=273239 width=0)
              Index Cond: ((src)::text = 'SOWPODS'::text)
        SubPlan 1
          ->  Materialize  (cost=4758.15..33614.09 rows=177254 width=13)
                ->  Bitmap Heap Scan on wordlist b  (cost=4758.15..31861.82 rows=177254 width=13)
                      Recheck Cond: ((src)::text = 'TWL'::text)
                      ->  Bitmap Index Scan on ix_words3  (cost=0.00..4713.84 rows=177254 width=0)
                            Index Cond: ((src)::text = 'TWL'::text)

Graphically, it looks like this:

[Image: graphical execution plan for the NOT IN query]

However, if I do the same query with a smaller list in the NOT IN (src = 'ME', which has 1,000 rows, rather than the 178,000 of src = 'TWL'), I get a much faster result and a different (and simpler) plan:

select a.word 
    from good.wordlist a 
    where a.src = 'SOWPODS' 
        and a.word not in (select word from good.wordlist b where b.src = 'ME')

EXPLAIN gives:

Bitmap Heap Scan on wordlist a  (cost=7304.08..36290.67 rows=136620 width=13)
  Recheck Cond: ((src)::text = 'SOWPODS'::text)
  Filter: (NOT (hashed SubPlan 1))
  ->  Bitmap Index Scan on ix_words3  (cost=0.00..7261.72 rows=273239 width=0)
        Index Cond: ((src)::text = 'SOWPODS'::text)
  SubPlan 1
    ->  Index Only Scan using ix_words3 on wordlist b  (cost=0.43..8.20 rows=1 width=13)
          Index Cond: (src = 'ME'::text)

which graphically looks like:

[Image: graphical execution plan for the NOT IN query with the smaller list]

So it goes straight to one of the indexes with the smaller list, but ignores it for the larger one. Now in MSSQL, all three of these queries give the same execution plan, and as a result we have more flexibility in how we write SQL, because we can trust (to a point) that the query optimizer will do the right thing.

With Postgres, it feels like the approach we need to take to writing SQL needs to be different, hence the three questions I posed at the start of this:

  1. Is this behaviour expected?
  2. Is NOT IN generally best avoided with Postgres where data sizes can’t be guaranteed?
  3. Have I missed something obvious with Postgres that would resolve the problem with the query?

Apologies for the length of this question; a truncated version of it didn’t seem to work (I did try).
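
One detail worth noting, as a hypothesis rather than a definitive answer: NOT IN must preserve NULL semantics, so Postgres cannot rewrite it as an anti-join the way it can NOT EXISTS, and it only uses the fast “hashed SubPlan” (as in the second plan) when it estimates the whole subquery result fits in work_mem. A sketch of a test that would confirm this:

-- If the plan switches from Materialize to "hashed SubPlan 1" and the query
-- becomes fast, the memory limit was the trigger for the slow plan.
SET work_mem = '64MB';  -- session-level; the value is illustrative

select a.word
    from wordlist a
    where a.src = 'SOWPODS'
        and a.word not in (select word from wordlist b where b.src = 'TWL');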

postgresql – Get partial results in Postgres in case of timeout?

Is it possible to send a query that does a sequential scan of the table – something like

SELECT content
  FROM some_big_table
 WHERE <some criteria(content) are met>
 LIMIT <...>;

with a time limit (e.g. 1000 milliseconds), so that the query terminates after the specified time period and returns any rows the scan has found by that time?

I’m not worried about the result being predictable – I just need to give the user the first matches the server can find, if any.

SET statement_timeout does not help here as it cancels the query altogether if it doesn’t execute within the time limit, while I need the partial results.
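
One workaround, sketched under the assumption that the client can hold a transaction open: pull the scan through a cursor in small batches and stop fetching once a client-side deadline passes, keeping whatever has arrived so far:

BEGIN;
DECLARE partial_scan CURSOR FOR
    SELECT content FROM some_big_table WHERE /* criteria */ true;

-- The client loops on this, checking its own clock, and stops fetching
-- once the 1000 ms budget is spent; rows already fetched are kept.
FETCH 100 FROM partial_scan;

CLOSE partial_scan;
COMMIT;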

postgresql – Postgres sometimes uses sequential scan instead of an index only scan for data that doesn’t exist

We are using Postgres 12 and we have a simple table like this:

> \d live_event_segmentation
                  Table "public.live_event_segmentation"
   Column   |            Type             | Collation | Nullable | Default 
------------+-----------------------------+-----------+----------+---------
 login_id   | bigint                      |           | not null | 
 event_id   | text                        |           | not null | 
 expires_at | timestamp without time zone |           |          | 
Indexes:
    "live_event_segmentation_pkey" PRIMARY KEY, btree (login_id, event_id)
    "live_event_segmentations_event_id_idx" btree (event_id)

Size of the table is:

> \dt+ live_event_segmentation
                              List of relations
 Schema |          Name           | Type  |   Owner    |  Size  | Description 
--------+-------------------------+-------+------------+--------+-------------
 public | live_event_segmentation | table | liveevents | 171 MB | 
(1 row)

and the whole DB fits into RAM.

This table has this distribution of event_id:

> select event_id, count(*) from live_event_segmentation group by 1;
        event_id        | count  
------------------------+--------
 euro2020Test           |     67
 fevent20               | 164310
 summer2020Test         |      9
(3 rows)

And our app is executing this query a few times a second:

explain analyze select 1 as "one" where exists (select 1 as "one" from "public"."live_event_segmentation" where "public"."live_event_segmentation"."event_id" = 'summer2020' limit 1);
                                                                                 QUERY PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=2.44..2.45 rows=1 width=4) (actual time=0.023..0.023 rows=0 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Index Only Scan using live_event_segmentations_event_id_idx on live_event_segmentation  (cost=0.42..2.44 rows=1 width=0) (actual time=0.022..0.022 rows=0 loops=1)
           Index Cond: (event_id = 'summer2020'::text)
           Heap Fetches: 0
 Planning Time: 0.106 ms
 Execution Time: 0.040 ms
(8 rows)

When we run this query from psql it always uses an index-only scan, but when our Java app runs the same query it often uses a sequential scan. We saw this in the logs by using the auto_explain extension and setting log_min_duration_statement=20 (so that we can see the query parameters, which auto_explain doesn’t display). The only difference between psql and our app is that the app uses a prepared statement, and when we tried a prepared statement in psql, it sometimes did use a sequential scan:

> prepare p1(text, int) AS select 1 as "one" where exists (select 1 as "one" from "public"."live_event_segmentation" where "public"."live_event_segmentation"."event_id" = $1 limit $2);
PREPARE
> explain analyze execute p1('summer2020', 1);
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.29..0.30 rows=1 width=4) (actual time=67.093..67.093 rows=0 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..2396.72 rows=8129 width=4) (actual time=67.090..67.090 rows=0 loops=1)
           ->  Seq Scan on live_event_segmentation  (cost=0.00..23968.34 rows=81294 width=4) (actual time=67.087..67.087 rows=0 loops=1)
                 Filter: (event_id = $1)
                 Rows Removed by Filter: 163728
 Planning Time: 0.138 ms
 Execution Time: 67.171 ms
(9 rows)

What’s important here is that the event summer2020 doesn’t exist in the table. We noticed that only queries selecting an event_id that doesn’t exist sometimes use a sequential scan.

Table live_event_segmentation grows constantly at about 1 row per second, which is very slow. Autovacuum is working and the table is correctly analyzed. Also, this happens constantly: we constantly get sequential scans on this table because of the query above, not in some strange spikes.
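
A plausible explanation, offered as an assumption: after several executions, a prepared statement can switch to a generic plan, in which $1 has no concrete value for the planner to consult, so it estimates an average selectivity across all event_id values (hence rows=81294 in the plan above) and picks a sequential scan. Postgres 12 added a setting that forces re-planning with the actual parameter on every execution:

-- Force a custom plan per execution so the planner sees the real parameter
-- and can tell that 'summer2020' matches no rows.
SET plan_cache_mode = force_custom_plan;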