MySQL freezes every hour at the exact same time

We have a hard-to-understand issue with our production MySQL 8.0.20 RDS instance.

Problem:

  • Every hour, at the exact same time, at :00, we experience a database glitch. Sometimes it's big and customers are affected, sometimes it isn't, but we always see a glitch.

    By big I mean it could be a few seconds or even up to 10 seconds; by small I mean around a second.
    What is a glitch? Well, our Java application fires the following sequence:

  1. set autocommit = 0
  2. Then makes insert(s) / update(s) / select(s)
  3. commit

So what we see is that this exact COMMIT takes a long time.
Literally, in our slow query log, we see 10s or even 20s COMMITs.
To put it simply: selects, inserts and updates run fast, while COMMITs get stuck.
Once an hour, at :00.
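
A sketch of how the slow COMMITs can be cross-checked from performance_schema rather than the slow query log (this assumes the events_statements_history_long consumer is enabled; it is off by default):

    -- slowest recent COMMITs; timer_wait is in picoseconds
    SELECT TRUNCATE(timer_wait / 1e12, 3) AS wait_seconds,
           thread_id,
           sql_text
    FROM performance_schema.events_statements_history_long
    WHERE event_name = 'statement/sql/commit'
      AND timer_wait > 1e12          -- longer than 1 second
    ORDER BY timer_wait DESC
    LIMIT 20;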

Observations:

  • In Performance Insights, we see a spike every hour, at the exact time of the glitch. wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done contributes to this spike.
    The queries are in the state “waiting for handler commit”.
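
    The same binlog condition waits can also be read straight from performance_schema with something like the sketch below (assuming the wait/synch instruments are enabled; many of them are off by default):

        SELECT event_name,
               count_star,
               TRUNCATE(sum_timer_wait / 1e12, 3) AS total_wait_seconds
        FROM performance_schema.events_waits_summary_global_by_event_name
        WHERE event_name LIKE 'wait/synch/cond/sql/MYSQL_BIN_LOG%'
        ORDER BY sum_timer_wait DESC;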

  • There is an increase in database connections at this time, but that is normal and expected: if the application keeps receiving HTTP requests while some connections from the pool are stuck, it starts opening new connections.

  • We noticed a minor spike in Write Latency (visible in the RDS console).
    It's usually 3 ms but jumps to 8 ms. We suspect the real value during the glitch is much higher and that 8 ms only looks small because it's averaged over a minute; unfortunately, one-minute granularity seems to be the finest available.
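
    One idea for getting finer-grained numbers than the one-minute CloudWatch average is to sample the file I/O instruments in performance_schema around :00 and diff the counters between samples; a minimal sketch (the wait/io/file instruments are enabled by default in 8.0):

        -- cumulative write and fsync ("misc") time for the redo log and the binlog;
        -- run it a few seconds apart around :00 and compare the deltas
        SELECT event_name,
               count_write,
               TRUNCATE(sum_timer_write / 1e12, 3) AS write_seconds,
               count_misc,
               TRUNCATE(sum_timer_misc / 1e12, 3)  AS misc_seconds
        FROM performance_schema.file_summary_by_event_name
        WHERE event_name IN ('wait/io/file/innodb/innodb_log_file',
                             'wait/io/file/sql/binlog');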

  • We haven't noticed any other metrics changing. Disk I/O, CPU, and InnoDB reads / writes / updates look good, both in regular monitoring and in Performance Insights. We haven't noticed anything in Enhanced Monitoring either. Write IOPS is 600, Read IOPS is 300, Queue Depth is 0.75. They sometimes jump due to heavy read SQLs, but they stay more or less stable during the glitch.

  • We don't see anything unusual in innodb_rows_read.avg, innodb_rows_updated.avg, innodb_rows_deleted.avg or innodb_rows_inserted.avg. Within ±2 minutes of the glitch these metrics are at their usual levels, which means there is no special activity going on at that time.

  • We have also reviewed all our cron jobs and similar; it doesn't look to us like anything happening in the database contributes to this glitch (a query to also rule out MySQL's own Event Scheduler is sketched below).
    There are increased locks at this time, but I think that is also expected: queries are waiting for a lock on the binlog.
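
    A quick way to confirm that nothing is scheduled inside MySQL itself (separate from OS-level cron jobs):

        SELECT event_schema, event_name, status, interval_value, interval_field, starts
        FROM information_schema.events;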

  • We have innodb_flush_log_at_trx_commit = 2 and sync_binlog = 0. We have innodb_log_file_size = 134217728, which is 128 MiB. I know it's a bit small and could be raised to 1 GB, but since the issue happens at an exact time without any unusual insert / update / delete activity, I doubt it's related.
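
    A hedged sketch of how the commit/binlog-related settings can be double-checked, and how sampling SHOW BINARY LOGS before and after the glitch would show whether a new binlog file appears right at :00:

        SHOW GLOBAL VARIABLES WHERE variable_name IN
            ('innodb_flush_log_at_trx_commit', 'sync_binlog',
             'innodb_log_file_size', 'max_binlog_size',
             'binlog_group_commit_sync_delay');

        -- file list and sizes; a new file appearing at :00 would point at a rotation/flush
        SHOW BINARY LOGS;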

  • There are no other slow queries logged (> 1 s) at that time or just before it.
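
    If log_output includes TABLE (an option on RDS), the slow log can also be queried directly; a small sketch:

        SELECT start_time, query_time, sql_text
        FROM mysql.slow_log
        WHERE start_time >= NOW() - INTERVAL 1 HOUR
        ORDER BY query_time DESC
        LIMIT 20;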

  • A bit of data from SHOW ENGINE INNODB STATUS (run at a normal time with no issue, but under a good production load):

    7174379566 OS file reads, 10959861316 OS file writes, 1045626171 OS fsyncs
    215.88 reads/s, 16384 avg bytes/read, 1249.28 writes/s, 128.21 fsyncs/s
    482.54 inserts/s, 210.98 updates/s, 48.02 deletes/s, 56656.29 reads/s
    Number of system rows inserted 10524, updated 61130, deleted 6743, read 66988909
    0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 17.58 reads/s