Query performance – PostgreSQL LWLock: lock_manager issue

In our application we need to insert a lot of historical data that will be consumed later by some mobile devices.

This data is organized by an incrementing numeric attribute that each device uses to download only the ‘offset’ it needs. So, if in its last connection a device downloaded records 0 to 100, next time it will download records from 101 up to the last record available.

Let’s call the Producer the component that processes and populates (inserts) data into these historical tables. For each historical table it populates, the producer will:

  1. Open a transaction
  2. Insert data
  3. Update last available record for that table
  4. Close the transaction

There are many producers (about 15, each with 8 threads) running against the database concurrently.

Now, let’s say the Consumer is the component responsible for reading the information saved by the producers. All it does is check, table by table, whether there are new records beyond the requested offset, and return them to the device.

When a device requests data, it will:

  1. Create a transaction
  2. Iterate over each table, querying it for new records based on the requested offset.
  3. Close the transaction
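Per table, the consumer’s work amounts to a range query like the following (table and column names are hypothetical):

```sql
BEGIN;

-- Fetch everything between the offset the device already has and the
-- watermark the producer last published
SELECT record_seq, payload
  FROM history_table
 WHERE record_seq > 100            -- last record the device downloaded
   AND record_seq <= (SELECT last_available_seq
                        FROM table_watermark
                       WHERE table_name = 'history_table')
 ORDER BY record_seq;

COMMIT;
```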

Currently, the project runs on an AWS db.r5.2xlarge (8 vCPUs, 64 GB RAM) Oracle RDS instance without any problems.

Now, our team was asked to migrate from Oracle to PostgreSQL 12. So, we created a PostgreSQL RDS instance of the same size and started some tests.

The producers are working pretty well! Insertions are fast, and memory and CPU consumption are under control. No issues.

BUT, when we start the consumers, everything changes…
CPU goes to 100%, freeable memory drops to zero, swap usage increases, the system freezes, and "out of shared memory" errors appear with the hint "You might need to increase max_locks_per_transaction".

Apparently, the execution plans of the queries are OK! We are using prepared statements. Statistics are OK as well. No sessions are being blocked by other sessions… Records are being returned as cursors. Exactly as with Oracle.

Looking in pgAdmin, we see a lot of sessions waiting on "LWLock: lock_manager", but we can’t understand why…

We tried the following approaches, with no success:

  • Increase max_locks_per_transaction (this eliminated the out-of-shared-memory error, but the other symptoms remained)
  • Disable parallel queries
  • Increase work_mem
  • Reduce max_connections
  • Increase shared_buffers
  • Increase max_wal_size
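For what it’s worth, per-session lock usage while the problem is happening can be inspected with a query like the one below. In `pg_locks`, rows with `fastpath = false` are the locks that had to go through the shared lock manager, which is what the "LWLock: lock_manager" wait event protects:

```sql
-- Count locks held by each backend; non-fastpath locks are the ones
-- that contend on the shared lock manager
SELECT pid,
       count(*)                             AS total_locks,
       count(*) FILTER (WHERE NOT fastpath) AS non_fastpath_locks
  FROM pg_locks
 GROUP BY pid
 ORDER BY non_fastpath_locks DESC;
```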

We can’t figure out why this is happening. The PostgreSQL database load is smaller than the load Oracle handles in production. Does anybody have any hints we could try?

Additional info:

  1. The PGSQL tables are currently partitioned, but partition pruning seems to be working correctly.
  2. Multiple producers and consumers write to and read from the same tables at the same time, but a consumer never tries to consume a record before the producer has explicitly marked it as available.
  3. We read that "out of shared memory" occurs when tables are dropped and recreated within a long transaction… This is not our case.
  4. PostgreSQL Version: PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit
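Regarding point 1: with plan-time pruning working, an `EXPLAIN` of a consumer query like the sketch below (hypothetical names again) should list only the partitions that can contain the requested range, which is what we observe:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT record_seq, payload
  FROM history_table
 WHERE record_seq > 100
   AND record_seq <= 200;
```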

Attached are some pictures of what we see when the problem happens:

Grafana stats before and after starting the consumers
pgAdmin showing active sessions
AWS RDS Performance Monitor showing Wait Events

Thanks a lot!