postgresql – Postgres: ODBC: ERROR: SSL connection has been closed unexpectedly; Error while executing the query

I’m currently in Australia and so is the database.

Staff attempting to connect to the database from overseas (USA, UK) consistently run into the error ERROR: SSL connection has been closed unexpectedly; Error while executing the query. People here can connect without that error, but anyone overseas receives it if their query runs for longer than about a minute.

The database is a Postgres RDS instance hosted in the AWS Sydney region.

What is the cause of an SSL connection closing like this? Is it the server or is it just caused by the latency of someone overseas trying to connect to the database? Is there any way to combat this?

This error came from an ODBC connection using Power BI or Tableau.
We see a similar error when querying from Python or DBeaver as well.
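
The usual suspect for this pattern, long-running queries over a WAN dying with "SSL connection has been closed unexpectedly", is an intermediate firewall or NAT device silently dropping the TCP connection while it sits idle waiting for the result, rather than anything Postgres-specific. A common mitigation is TCP keepalives: server-side via tcp_keepalives_idle / tcp_keepalives_interval / tcp_keepalives_count (settable in the RDS parameter group), or client-side via the equivalent libpq connection options. A minimal sketch of the client-side version from Python, with placeholder connection details:

# Hedged sketch: enable client-side TCP keepalives so an idle WAN connection is not
# dropped by a firewall/NAT while a long query is still running on the server.
# Host, credentials and table name below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="mydb.xxxxxxxx.ap-southeast-2.rds.amazonaws.com",  # placeholder RDS endpoint
    dbname="mydb",
    user="report_user",
    password="secret",
    sslmode="require",
    keepalives=1,            # turn TCP keepalives on
    keepalives_idle=30,      # seconds of inactivity before the first keepalive probe
    keepalives_interval=10,  # seconds between probes
    keepalives_count=5,      # probes before the connection is considered dead
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM some_large_table")  # stand-in for a long-running query
    print(cur.fetchone())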

postgresql – Does daily pg_dump mess up postgres cache?

I migrated my geospatial Postgres 12.5 database to another cloud provider. I use postgis and I have around 35GB of data and 8GB of memory.

Performance is much worse than with my previous provider, and the new provider claims this is because the Postgres cache has to be "warmed up" every day after the automatic pg_dump backup operations that occur during the night.

Geospatial queries that would normally take 50ms sometimes take 5-10s on first request, and some that would run in 800ms take minutes.

Is there something else going on, or is the technical support right?

If so, should I disable daily backups? Or can I somehow use a utility function to restore the cache (pg_prewarm, perhaps)?
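
A nightly pg_dump does read every table, and with 35 GB of data on an 8 GB machine that can easily evict the working set from cache, so the provider's explanation is at least plausible. Rather than disabling backups, one option is to warm the hot relations back up with pg_prewarm after the backup window. A minimal sketch, assuming psycopg2 and made-up relation names; pg_prewarm is a contrib module, so it has to be available on the provider:

import psycopg2

conn = psycopg2.connect("dbname=gisdb user=me host=localhost")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# pg_prewarm is a contrib extension; it must be installed/allowed by the provider.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm;")

# Warm the relations the slow geospatial queries depend on (names are hypothetical).
for relation in ("parcels", "parcels_geom_gist_idx"):
    cur.execute("SELECT pg_prewarm(%s);", (relation,))
    print(relation, "blocks loaded:", cur.fetchone()[0])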

postgresql – Install postgres in VeraCrypt container

I would like to install Postgres in a VeraCrypt container, to avoid storing data in plain text. During the installation process on the mounted container I got the error

"Problem running post-install step. Installation may not correctly. The database cluster initialisation failed."

After finishing the installation I couldn't connect to the database via psql on the command line ("could not connect to server: Connection refused").

I am using the windows installer on Windows 10.

How could I handle this problem?

postgresql – Same postgres query in two different instances with the same data but with different times

I have the same postgres query running in two different instances restored with the same dump file:

  1. one instance in aws rds => https://explain.depesz.com/s/USMO (‘PostgreSQL 11.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit’)
  2. one instance in compute engine vm in gcp => https://explain.depesz.com/s/LTUL (‘PostgreSQL 11.10 (Debian 11.10-0+deb10u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit’)

But the query on 1) (40 s) is much faster than on 2) (400 s) (execution time measured in Python).

What I tried:

  • ran without cache (restarted the GCP Compute Engine instance)
  • changed the WHERE clause values
  • ran the query from different computers
  • analyzed the basics of the EXPLAIN output (same plan and same indexes)

What are the main reasons for this?

My main hypothesis now is the network traffic. How can I test that?

Thanks in advance

  • both have the same postgres version and hardware configuration
  • I am not sure, but these timings are from the first run of each query, so the cache is (probably) not involved
  • traceroute 1) = 13.864 ms / traceroute 2) = 32.469 ms
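
One way to test the network hypothesis from the client side: EXPLAIN (ANALYZE, BUFFERS) executes the query on the server but ships back only the plan text, so its reported Execution Time excludes result transfer, while a normal execute-and-fetch measured in Python includes it. Comparing the two numbers on each instance shows how much of the 40 s vs 400 s gap is network rather than execution. A rough sketch, assuming psycopg2 and placeholder connection strings:

import time
import psycopg2

QUERY = "SELECT 1"  # stand-in: paste the real query here

def profile(dsn):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()

    # Wall-clock time as seen by the client, including network transfer of the result set.
    start = time.monotonic()
    cur.execute(QUERY)
    cur.fetchall()
    client_seconds = time.monotonic() - start

    # EXPLAIN ANALYZE executes the query server-side but returns only the plan text,
    # so the reported Execution Time excludes result transfer over the network.
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + QUERY)
    plan = [row[0] for row in cur.fetchall()]
    conn.close()
    return client_seconds, plan

for label, dsn in [("rds", "host=rds-endpoint dbname=db user=me"),   # placeholder DSNs
                   ("gcp", "host=gcp-vm-ip dbname=db user=me")]:
    seconds, plan = profile(dsn)
    print(label, "client wall time:", round(seconds, 1), "s")
    print(label, plan[-1])  # usually "Execution Time: ... ms"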

postgresql – Prevent timing attacks in postgres?

I’m looking into accepting an API token from a user, and I would like to prevent timing attacks. The conventional wisdom seems to be that one should hash the token server-side in order to make the comparison non-deterministic. If you have some other identifier you can look up the record by that and then do a constant-time in-memory comparison, as with Rails secure_compare. In my case, though, I was planning on using the token to look up the user. I’m wondering if, by any chance, Postgres has some facility for doing constant-time comparisons when looking up records. Something that might look like:

SELECT * FROM users WHERE secure_compare(token, 'abcdef')
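
There is no built-in secure_compare in Postgres, but the hashing approach extends to the lookup case: store a digest of each token and look the user up by the digest of the presented token. Even though the index comparison can short-circuit, its timing only reveals information about the hash, which an attacker cannot turn back into a valid token. A minimal sketch, assuming psycopg2 and a hypothetical token_sha256 column:

import hashlib
import psycopg2

def find_user_by_token(conn, presented_token: str):
    # Hash the presented token application-side; only digests are stored and compared,
    # so lookup timing leaks nothing useful about the raw token.
    digest = hashlib.sha256(presented_token.encode("utf-8")).hexdigest()
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name FROM users WHERE token_sha256 = %s",  # token_sha256 is hypothetical
            (digest,),
        )
        return cur.fetchone()

conn = psycopg2.connect("dbname=app user=me host=localhost")  # placeholder DSN
print(find_user_by_token(conn, "abcdef"))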

postgresql – Ltree query performance optimization of Postgres RDS DB

I have an AWS RDS m5.large Postgres 10.13 database that runs a lot of queries like the following:

SELECT "bundles".* FROM "bundles" WHERE "bundles"."version_id" = $1 AND (tree_path ~ ?) LIMIT $2

The problem is the poor performance of the overall system. Via advanced monitoring we see a very high value for current activity:

[monitoring screenshot: current activity]

and it seems that the aforementioned query has some impact on the load by waits: [monitoring screenshot: load by waits]

What do you suggest I check? I'm not a DBA, so I can't judge whether those queries are efficient.
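
One concrete thing to check first: the ~ (lquery match) operator on an ltree column can only use a GiST index on that column, so if bundles.tree_path has no GiST index each of these queries may be scanning every row for the given version_id. A sketch of what to look at, assuming psycopg2 and the table/column names from the query above (the index name is made up):

import psycopg2

conn = psycopg2.connect("dbname=app user=me host=localhost")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# See which indexes exist on the table today.
cur.execute("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'bundles';")
for name, definition in cur.fetchall():
    print(name, "->", definition)

# A GiST index is what lets "tree_path ~ <lquery>" avoid scanning every row.
# (CONCURRENTLY avoids locking the table while it builds; it needs autocommit.)
cur.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS bundles_tree_path_gist "
            "ON bundles USING GIST (tree_path);")

# Then compare plans before/after with EXPLAIN (ANALYZE, BUFFERS) on a representative query.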

python – Parsing a large XML file and storing in Postgres efficiently using Pythonic code

I have a bunch of large XML files that I have to parse and store in Postgres. I have done a lot of procedural code but I want to think in terms of objects and reap the benefits. Any help in that direction is greatly appreciated.

The XML itself is fairly straightforward – Each Person object has employer, employer office, education information. I am wondering how best to design a Python Class to hold this data with the end goal of efficiently inserting data into Postgres.

Sample XML File

<?xml version="1.0" encoding="UTF-8"?>
<data>
    <person name="Mary Kitchen" personkey="123">
        <employers>
            <employer name="ABC Tech" id="767" startdate="02-2020" enddate="">
                <officeaddrs>
                    <officeaddr str1="101 MAIN ST" str2="" city="NYC" state="NY" zip="07789" />
                    <officeaddr str1="111 POOLE ST" str2="" city="NYC" state="NY" zip="07780" />
                </officeaddrs>
            </employer>
            <employer name="XYZ Tech" id="909"  startdate="06-2012" enddate="01-2020">
                <officeaddrs>
                    <officeaddr str1="122 Main St" str2="" city="NYC" state="NY" zip="07789" />
                    <officeaddr str1="199 Poole St" str2="" city="NYC" state="NY" zip="07780" />
                </officeaddrs>
            </employer>
        </employers>
        <educationrecords>
            <educationrecord type="Masters" school ="ABC School" graduated="12-14-2005"/>
            <educationrecord type="Bachelors" school ="XYZ School" graduated="12-14-2001"/>
        </educationrecords>
    </person>
    <person name="JASON KNIGHT" personkey="129">
        <employers>
            <employer name="NYState Bank" id="66" startdate="02-2015" enddate="">
                <officeaddrs>
                    <officeaddr str1="188 Main St" str2="" city="NYC" state="NY" zip="07789" />
                    <officeaddr str1="100 Poole St" str2="" city="NYC" state="NY" zip="07780" />
                </officeaddrs>
            </employer>
            <employer name="ZYK Tech" id="543" startdate="02-2010" enddate="01-2015">
                <officeaddrs>
                    <officeaddr str1="333 MAIN ST" str2="" city="NYC" state="NY" zip="07789" />
                </officeaddrs>
            </employer>
        </employers>
        <educationrecords>
            <educationrecord type="Bachelors" school ="Top School" graduated="04-01-2009"/>
        </educationrecords>
    </person>
</data>

Python code

# import xml.etree.ElementTree as ET
from lxml import etree


def fast_iter(context, func, args=None, kwargs=None):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    if kwargs is None:
        kwargs = {}
    if args is None:
        args = ()
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


class PersonParser:

    def __init__(self, element):
        self.element = element

    def myparser(self):
        print('hello')
        print(self.element)


class PersonData:

    def __init__(self, person=None, employer=None, office=None, education=None):
        self.person = person
        self.employer = employer
        self.office = office
        self.education = education


class Person:

    def __init__(self, personkey=None, name=None):
        self.name = name.title()
        self.personkey = personkey
        # print(self.name, self.personkey)

    def __repr__(self):
        return '(name={}, personkey={})'.format(
            self.name, self.personkey)


class Employer:

    def __init__(self, id=None, name=None, startdate=None, enddate=None):
        self.id = id
        self.name = name
        self.startdate = startdate
        self.enddate = enddate

    def __repr__(self):
        return f'(name={self.name}, id={self.id}, startdate={self.startdate}, enddate={self.enddate})'


class Office:

    def __init__(self, str1=None, str2=None, city=None, state=None, zip=None, personkey=None, empid=None):
        self.empid = empid
        self.personkey = personkey
        self.str1 = str1
        self.str2 = str2
        self.city = city
        self.state = state
        self.zip = zip
        # print(self.name, self.personkey)

    def __repr__(self):
        return '(str1={}, city={}, empid={}, personkey={})'.format(
            self.str1, self.city, self.empid, self.personkey)


class Education:

    def __init__(self, personkey=None, name=None):
        self.name = name
        self.personkey = personkey
        # print(self.name, self.personkey)

    def __repr__(self):
        return '(name={}, personkey={})'.format(
            self.name, self.personkey)


def myfunction(element):

    print(element)

    for person in element.iter('person'):
        # print('person', person.attrib)
        person_rec = person.attrib
        print(person_rec)

    for employer in element.iter('employer'):
        # print('employer', employer.attrib)
        employer_rec = employer.attrib
        employer_rec['personkey'] = person_rec['personkey']
        print(employer_rec)

        for office in employer.iter('officeaddr'):
            # print('office', office.attrib)
            office_rec = office.attrib
            office_rec['empid'] = employer_rec['id']
            office_rec['personkey'] = person_rec['personkey']
            print(office_rec)

    for education in element.iter('educationrecord'):
        # print('education', education.attrib)
        edu_rec = education.attrib
        edu_rec['personkey'] = person_rec['personkey']
        print(edu_rec)


def main():

    # tree = ET.parse('person_data.xml')
    context = etree.iterparse('person_data.xml', events=('end',), tag='person')
    fast_iter(context, myfunction)

    myPersonParser = PersonParser(context)
    myPersonParser.myparser()


if __name__ == "__main__":
    main()

Current Output

<Element person at 0x1006f9a40>
{'name': 'Mary Kitchen', 'personkey': '123'}
{'name': 'ABC Tech', 'id': '767', 'startdate': '02-2020', 'enddate': '', 'personkey': '123'}
{'str1': '101 MAIN ST', 'str2': '', 'city': 'NYC', 'state': 'NY', 'zip': '07789', 'empid': '767', 'personkey': '123'}
{'str1': '111 POOLE ST', 'str2': '', 'city': 'NYC', 'state': 'NY', 'zip': '07780', 'empid': '767', 'personkey': '123'}
{'name': 'XYZ Tech', 'id': '909', 'startdate': '06-2012', 'enddate': '01-2020', 'personkey': '123'}
{'str1': '122 Main St', 'str2': '', 'city': 'NYC', 'state': 'NY', 'zip': '07789', 'empid': '909', 'personkey': '123'}
{'str1': '199 Poole St', 'str2': '', 'city': 'NYC', 'state': 'NY', 'zip': '07780', 'empid': '909', 'personkey': '123'}
{'type': 'Masters', 'school': 'ABC School', 'graduated': '12-14-2005', 'personkey': '123'}
{'type': 'Bachelors', 'school': 'XYZ School', 'graduated': '12-14-2001', 'personkey': '123'}
  • Is it beneficial to create a class for each table I want to store in the database? (I will have a list of dicts for each table that I have to bulk-store in Postgres to avoid multiple round trips.)
  • Do I instantiate an object in the function myfunction?
  • How best to load the data into Postgres in bulk? A Person class that has all the individual lists of dicts (employer, person, office, education) as attributes and a method to load the data into the database? (See the bulk-insert sketch below.)
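
On the bulk-loading question above: one common pattern is to accumulate plain tuples per table while parsing (e.g. inside myfunction) and insert each batch with a single statement via psycopg2.extras.execute_values, one round trip per table per batch. A sketch under assumed table and column names that mirror the XML attributes:

import psycopg2
from psycopg2.extras import execute_values

# Rows accumulated while parsing; the values shown are illustrative and the
# table/column names are assumptions, not an existing schema.
person_rows = [("123", "Mary Kitchen")]
employer_rows = [("767", "ABC Tech", "02-2020", None, "123")]

conn = psycopg2.connect("dbname=app user=me host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    # One multi-row INSERT per table instead of one INSERT per record.
    execute_values(cur,
        "INSERT INTO person (personkey, name) VALUES %s",
        person_rows)
    execute_values(cur,
        "INSERT INTO employer (id, name, startdate, enddate, personkey) VALUES %s",
        employer_rows)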

postgresql – Tracking (un)successful autovacuums in postgres

I am trying to help a team of junior, senior, principal and chief (mostly JEE) developers to be more data-centric and data-aware. In some cases we look into the data-processing costs, the complexity of the algorithms, the predictability of the results and the statistical robustness of the estimates for query plans. In other cases we blindly believe that using indexes is always great and scanning tables is always bad. Sometimes we just opportunistically throw gazillions of insert, update and delete queries at the DB and hope for the best. We run load tests afterwards and notice that our tables and indexes are bloated beyond imagination, the tables become pretty much unmanageable in size, and chaos rules the area.

A good way to proceed is to train and learn complexity classes, understand the costs, have the right attitude. This change is very fruitful but hard and slow. As long as I am breathing, I’ll continue this journey.

For now we are trying to understand why autovacuum kicks in so seldom for some tables. We've got a Postgres server (v9.5, I believe) running in the Azure cloud (test environment). We pay for 10K IOPS and we use them fully (we write to the DB like hell). In the last 24 hours I see that autovacuum ran only twice, on two large tables, according to

select * from pg_stat_all_tables order by last_autovacuum desc

In order to trigger an autovacuum, I created:

create table a(a int)
 
ALTER TABLE a SET (autovacuum_vacuum_scale_factor  = 0.0 );
ALTER TABLE a SET (autovacuum_vacuum_threshold     = 10  );
ALTER TABLE a SET (autovacuum_analyze_scale_factor = 0.0 );
ALTER TABLE a SET (autovacuum_analyze_threshold    = 10  );

and ran the following two statements multiple times:

delete from a;
insert into a (a) select generate_series(1,10);

This should have triggered an autovacuum on the table, but pg_stat_all_tables has NULL for last_autovacuum column for table a.

We also set log_autovacuum_min_duration to a very low value (like 250ms or even 0), but the only two entries in the logs are:

postgresql-2021-02-18_010000.log:2021-02-18 01:56:29 UTC-602a2e9c.284-LOG:  automatic vacuum of table "asc_rs.pg_toast.pg_toast_3760410": index scans: 1
postgresql-2021-02-18_060000.log:2021-02-18 06:35:47 UTC-602a2e9c.284-LOG:  automatic vacuum of table "asc_rs.pg_toast.pg_toast_3112937": index scans: 1

Our settings are:

[screenshot: autovacuum settings]

We have a feeling that autovacuum is killed on large tables because of row locks. Can we log this information in any way? Can we also log (failing) autovacuum attempts? How does Postgres decide to start an autovacuum job on a very-high-load system (or, more generally, how does it trade off regular changes in the DB against maintenance jobs)? If the parameters for kicking off autovacuum are met, will it definitely be kicked off, or will it wait until the I/O load decreases?
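
As a starting point for tracking, pg_stat_all_tables / pg_stat_user_tables (already used above) also expose autovacuum_count and the dead-tuple counter that drives the trigger condition, roughly n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of rows, so you can watch which tables are over threshold yet still not being vacuumed; autovacuum workers that are cancelled by a conflicting lock do show up in the server log as "canceling autovacuum task" messages. A small sketch, assuming psycopg2 and a placeholder DSN:

import psycopg2

conn = psycopg2.connect("dbname=asc_rs user=me host=localhost")  # placeholder DSN
cur = conn.cursor()

# Tables with the most dead tuples, plus when and how often autovacuum actually ran.
cur.execute("""
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           last_autovacuum,
           autovacuum_count
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;
""")
for row in cur.fetchall():
    print(row)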

postgresql – Postgres date_trunc quarter with a custom start month

I’m trying to compute quarterly averages for player scores; however, the default behaviour of Postgres date_trunc('quarter', source) is that the first quarter starts at YYYY-01-01.

Is there any possible way or workaround to say that the first month of the first quarter is, for instance, September? So instead of the traditional Q1: 1-3, Q2: 4-6, Q3: 7-9, Q4: 10-12,
I want to be able to specify which month is the start of Q1, so if I say September it should become: Q1: 9-11, Q2: 12-2, Q3: 3-5, Q4: 6-8.

Here is how I make a standard quarterly score average with default quarter.

SELECT id,
       name,
       date_trunc('quarter', date) AS date,
       AVG(rank) AS rank,
       AVG(score) as score,
       country,
       device
FROM player_daily_score
GROUP BY id, name, 3, country, device
ORDER BY 3 desc;

I’m open to all suggestions to make this work.
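
A common workaround (there is no built-in option for this) is to shift the timestamp back by the offset between the desired start month and January before truncating, then shift the result forward again; for quarters starting in September the offset is 8 months, so 2021-09-15 truncates to 2021-09-01 and 2021-12-10 to 2021-12-01. A sketch of the adapted query, run from Python with the column names from the question (the offset value and DSN are the only assumptions):

import psycopg2

FISCAL_OFFSET_MONTHS = 8  # Q1 starts in September: September is 8 months after January

SQL = """
    SELECT id,
           name,
           date_trunc('quarter', date - make_interval(months => %(offset)s))
               + make_interval(months => %(offset)s) AS quarter_start,
           AVG(rank)  AS rank,
           AVG(score) AS score,
           country,
           device
    FROM player_daily_score
    GROUP BY id, name, 3, country, device
    ORDER BY 3 DESC;
"""

conn = psycopg2.connect("dbname=game user=me host=localhost")  # placeholder DSN
cur = conn.cursor()
cur.execute(SQL, {"offset": FISCAL_OFFSET_MONTHS})
for row in cur.fetchall():
    print(row)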

postgresql – determine maximum advisory locks supported by Postgres

According to the Postgres documentation, the maximum number of regular and advisory locks is limited by a shared memory pool:

Both advisory locks and regular locks are stored in a shared memory pool whose size is defined by the configuration variables max_locks_per_transaction and max_connections. Care must be taken not to exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands depending on how the server is configured.

How can I determine the size of this pool? Is this the same thing as shared buffers, which I can see with show shared_buffers; or is it something different? I am trying to determine roughly how many advisory locks my installation would be able to support because I am doing a ton of locking. My shared_buffers size is 5012MB.

I also have a couple more detailed questions:

  • If the server was unable to grant an advisory lock when I called pg_advisory_xact_lock(), would it hang, error out, or fail silently? As long as it doesn’t fail silently I’m good, although ideally it would hang and then continue once memory frees up.
  • I am locking not only with advisory locks but also with SELECT ... FOR UPDATE. If I know the size of the pool, how can I calculate roughly how much space in the pool each advisory lock takes and how much each SELECT ... FOR UPDATE takes? I know roughly how many rows will be impacted by each SELECT ... FOR UPDATE.

The documentation is a little confusing because if you look at the documentation for max_locks_per_transaction it says:

The shared lock table tracks locks on max_locks_per_transaction * (max_connections + max_prepared_transactions) objects (e.g., tables); hence, no more than this many distinct objects can be locked at any one time. This parameter controls the average number of object locks allocated for each transaction; individual transactions can lock more objects as long as the locks of all transactions fit in the lock table. This is not the number of rows that can be locked; that value is unlimited.

This seems to track with the idea that the memory pool is equal to max_locks_per_transaction * max_connections described earlier, but here it is saying that the max has more to do with the number of tables and not the number of rows. I’m not really sure how to square this with the first quote, or how this relates to the space taken by advisory locks.

Any tips on calculating would be greatly appreciated!
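
On the sizing question: the lock table is separate from shared_buffers; per the documentation quoted above it has room for roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) entries, which is where advisory locks and table-level locks live, while the row locks taken by SELECT ... FOR UPDATE are recorded in the rows themselves (which is why the docs say the number of locked rows is unlimited), apart from the one table-level lock each such query holds. If the table does fill up, acquiring a lock fails with an "out of shared memory" error suggesting a higher max_locks_per_transaction, rather than hanging or failing silently. A sketch to pull the relevant settings and estimate the ceiling, assuming psycopg2 and a placeholder DSN:

import psycopg2

conn = psycopg2.connect("dbname=app user=me host=localhost")  # placeholder DSN
cur = conn.cursor()

settings = {}
for name in ("max_locks_per_transaction", "max_connections", "max_prepared_transactions"):
    cur.execute("SHOW " + name)          # SHOW cannot take bind parameters
    settings[name] = int(cur.fetchone()[0])

# Rough ceiling on entries in the shared lock table (advisory + table-level locks).
capacity = settings["max_locks_per_transaction"] * (
    settings["max_connections"] + settings["max_prepared_transactions"]
)
print(settings)
print("approximate lock table capacity:", capacity)

# Current usage, broken down by lock type (advisory vs. relation etc.).
cur.execute("SELECT locktype, count(*) FROM pg_locks GROUP BY locktype;")
print(cur.fetchall())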