I’m helping administer two PostgreSQL servers (one primary, one replica) plus a separate server running Barman for WAL archiving. I was not originally involved in setting up the replication and archiving. The configuration is rather old, some of it dating back to 2015 and PostgreSQL 9.3, although we’re now running 9.5. The history of configuration changes is sadly not documented. The person who built the setup is still around, but this was their first time setting up replication as well.
Some time ago we ran into a mysterious data corruption issue on our primary server after some storage system troubles at our VPS provider. These issues were present even when restoring a Barman backup to a fresh server. This was preventing us from cleanly upgrading to PostgreSQL 12, but we determined that we were able to cleanly dump our actual production database in the
main cluster and restore it to a newly created recovery cluster.
We decided to shut down our replication + Barman archiving setup and start them over from scratch as well, and in researching how to do this I ran into some interesting configuration issues. tl;dr: the current setup is working, but not how I thought it was, and now I need guidance on how to cleanly shut it off.
The first clue that something was off was when I was looking at what replication slots were configured on the primary server with
select * from pg_replication_slots. This produced no results, to my confusion. My assumption (based on the description of the person who built the setup) was that we were slots at least for Barman, and I thought we would be doing so for the replication as well.
Before I go further, here are the relevant settings from the three servers, with sensitive details altered:
# Primary server's postgresql.conf
wal_level = hot_standby
max_wal_senders = 4
archive_mode = on
archive_command = 'rsync -a %p companyuser@backup-server:/backup/thingamabob/incoming/%f'
max_replication_slots = 2
# Replica server's postgresql.conf
wal_level = hot_standby
hot_standby = on
# Replica server's recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary-server user=postgres'
trigger_file = '/tmp/trigger_file0'
# /home/companyuser/.barman.conf on the backup server
barman_home = /backup
barman_user = companyuser
log_file = /backup/barman.log
minimum_redundancy = 1
description = "Thingamabob"
archiver = on
# 5436 is the SSH tunneled port to our primary DB server, which has
# a 'barman' user
conninfo = host=localhost port=5436 user=barman dbname=postgres
backup_method = postgres
slot_name = backup
retention_policy = RECOVERY WINDOW OF 4 DAYS
retention_policy_mode = auto
Let’s look at the Barman WAL archiving side first, which I think I’ve figured out.
rsyncing the completed WAL segments from the primary server to the backup server, and Barman has the requisite control connection to the primary server. Apparently due to a misunderstanding,
slot_name = backup set, even though we’re using traditional WAL archiving only, and
slot_name is used by the more recent WAL streaming functionality of Barman. The dead giveaway here is the lack of the
streaming_conninfo settings. So, the setting is useless, but fortunately harmless in this setup.
What I don’t really understand is the replication setup. We’re not using slots here (
recovery.conf doesn’t have the
primary_slot_name setting), and the documentation for the
primary_conninfo settings is not telling me how the replication is actually functioning correctly.
Finally, some concrete questions based on the information presented so far:
We’re not using replication slots, nor
restore_command. My assumption is that our replication setup is working through sheer luck: the replica server has not had significant downtime, and therefore whatever the default WAL file retention period of the primary server is has been sufficient for our replication to never fail due to the primary server removing an old WAL file that the replica has not yet received. Is this assumption correct?
How does the replica server know when and from where to retrieve WAL files from the primary server? What in our settings is actually doing this, or are PostgreSQL defaults just achieving this? Is it polling something over
I assume that the simple way to end the Barman WAL archiving is to remove the
archive_command settings from the primary server’s configuration, and then to stop
barman cron from running once per minute on the backup server. Is this correct? Based on my reading of the documentation,
archive_mode is not involved in the replication side of things.
What is the correct way of terminating the replication? My current assumption is to do one of the two things below, but is either one of these actually correct?
- shut down Postgres on the replica server, remove the
standby_mode setting, remove
recovery.conf and upon starting Postgres again, it’ll act like the primary server.
- create the trigger file and wait for recovery to be done, and then possibly remove
standby_mode and restart Postgres..? I’m not quite sure how
standby_mode behaves after recovery has ended.