We run nightly data sync jobs that move data from an external database into Postgres using an upsert statement. The data is written in batches by parallel workers (currently 10).
The problem is that even when no data has changed (e.g. when running two consecutive syncs with no changes in between), there is a large amount of disk I/O (100% disk utilization, 200 MB/s in writes). Our monitoring reports no updated rows and only reads on the table itself.
The expected behavior would be a high amount of reads but close to zero write operations on the disk. We can reproduce this consistently, even when running two syncs on a previously empty table.
What can be the reason for this high amount of disk write operations?
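For anyone trying to reproduce the symptom, here is a minimal sketch of how "no updated rows" and the write volume could be cross-checked from inside Postgres. It assumes the monitoring is based on the standard statistics views; `sync_target` is a hypothetical table name, and the two LSN literals are placeholders for values recorded before and after a sync run.

```sql
-- Row-level counters for the table (sync_target is a hypothetical name).
-- If n_tup_upd does not move during a sync, no rows were actually updated.
SELECT n_tup_ins, n_tup_upd, n_tup_hot_upd, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'sync_target';

-- Current WAL position; run once before and once after a sync
-- and note both values.
SELECT pg_current_wal_lsn();

-- Amount of WAL generated between the two recorded positions.
-- The literals below are placeholders, not measured values.
SELECT pg_size_pretty(pg_wal_lsn_diff('0/3000000', '0/2000000'));
```

Comparing the WAL volume against the row counters would show whether the writes come from WAL-logged activity even though no tuples are reported as updated.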
Some more details on our setup:
- Postgres 12.5 running on Kubernetes with 25 GB of memory
- 1 TB gp3 volume (AWS) with 3000 IOPS
- Table has around 570 million rows (120 GB)
- Primary key index is 43 GB
The upsert statement has the following form:
INSERT INTO ... (...) VALUES (...) ON CONFLICT (primaryKeys...) DO UPDATE ... WHERE (tableColumns...) IS DISTINCT FROM (excludedColumns...)
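For concreteness, a statement of this shape might look like the sketch below. The table `sync_target`, its columns, and the sample values are all hypothetical and only stand in for the real schema; the actual statement has more columns and a composite key.

```sql
-- Hypothetical table standing in for the real one (names are illustrative only).
CREATE TABLE IF NOT EXISTS sync_target (
    id     bigint PRIMARY KEY,
    name   text,
    amount numeric
);

-- One batch of the sync: insert new rows, and update existing rows
-- only when at least one non-key column differs from the incoming value.
INSERT INTO sync_target AS t (id, name, amount)
VALUES
    (1, 'alpha', 10.0),
    (2, 'beta',  20.0)
ON CONFLICT (id) DO UPDATE
SET name   = EXCLUDED.name,
    amount = EXCLUDED.amount
WHERE (t.name, t.amount) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.amount);
```

Running this batch twice in a row mirrors the scenario described above: on the second run every row conflicts on the primary key, and the `WHERE` condition is false for all of them, so no row is inserted or updated.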