Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community.
We have a hard-to-understand issue with our production MySQL RDS 8.0.20 instance.
We hope you can give some hints on what to look at and/or what it could be.
Every hour, at exactly :00, we experience a database glitch. Sometimes it's big and customers are affected, sometimes it's not, but we always see it.
By "big" I mean it could be a few seconds or even 10 or 20 seconds; by "small" I mean around a second.
What is a glitch? Well, our Java application fires the following sequence:
- set autocommit = 0
- Then makes insert(s) / update(s) / select(s)
- Then issues COMMIT
What we see is that this exact COMMIT statement takes a lot of time.
Literally, in our slow query logs, we see 10s or 20s COMMITs.
To put it simply: selects, inserts, updates go fast, COMMITs are stuck.
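For reference, the statement sequence each pooled connection runs looks roughly like this (table and column names below are made up for illustration; only the statement shape matches our workload):

```sql
-- Illustrative sketch only: 'orders' and 'customers' are hypothetical names
SET autocommit = 0;

INSERT INTO orders (customer_id, total) VALUES (42, 19.99);
UPDATE customers SET last_order_at = NOW() WHERE id = 42;
SELECT id FROM orders WHERE customer_id = 42;

COMMIT;  -- this is the statement that shows up in the slow query log
```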
Once an hour, at :00.
In Performance Insights, we see a spike every hour, at the exact time of a glitch. wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done contributes to this spike.
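The accumulated time for that wait event can also be inspected from performance_schema directly, to see how much it grows across :00 (a sketch; timer columns are in picoseconds, and the LIKE pattern hedges against small differences in the event-name prefix):

```sql
SELECT EVENT_NAME,
       COUNT_STAR,
       SUM_TIMER_WAIT / 1e12 AS total_wait_seconds
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME LIKE '%MYSQL_BIN_LOG::COND_done';
```

Sampling this once before and once after :00 and diffing the two rows shows how much wait time the glitch itself contributes.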
The queries are in status “waiting for handler commit”.
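During a glitch, the stuck sessions can be listed live with a query like this (a sketch using information_schema.PROCESSLIST, which still works on 8.0):

```sql
-- Sessions currently stuck on the commit wait
SELECT ID, USER, TIME, STATE, INFO
FROM information_schema.PROCESSLIST
WHERE STATE = 'waiting for handler commit'
ORDER BY TIME DESC;
```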
There is an increased number of database connections at this time, but this is normal and expected: if the application keeps receiving HTTP requests while some database connections from the pool are stuck, it starts to open new connections.
We noticed a minor spike in Write Latency (visible in the RDS console).
It's usually 3 ms, but it jumps to 8 ms. We suspect it may reach a big/huge value during the glitch itself, and it only shows as 8 ms because it's averaged over a minute. Unfortunately, one-minute granularity seems to be the maximum.
We haven't noticed any change in other metrics. Disk I/O, CPU, and InnoDB reads/writes/updates look good, both in normal monitoring and in Performance Insights. We haven't noticed anything in Enhanced Monitoring either.
We don't see anything unusual in innodb_rows_read.avg, innodb_rows_updated.avg, innodb_rows_deleted.avg, or innodb_rows_inserted.avg. Basically, within ±2 minutes of a glitch these metrics look normal, which means there is no special activity going on during this time.
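For completeness, the raw counters behind those .avg metrics can be sampled directly around :00 (a sketch; these are standard MySQL global status counters, so diffing two samples a minute apart gives the per-minute rate):

```sql
SHOW GLOBAL STATUS
WHERE Variable_name IN
  ('Innodb_rows_read', 'Innodb_rows_inserted',
   'Innodb_rows_updated', 'Innodb_rows_deleted');
```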
We have also reviewed all our cron jobs etc.; it doesn't look to us like anything is happening in the database that contributes to this glitch.
There are increased locks at this time, but I think that's also expected: queries are waiting for a lock on the binlog.
We have innodb_flush_log_at_trx_commit = 2 and sync_binlog = 0. We have innodb_log_file_size = 134217728, which is 128 MiB. I know it's a bit small and could be made 1 GB, but since the issue happens at an exact time without any corresponding insert/update/delete activity, I doubt it's connected.
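These settings can be double-checked at runtime on the instance (a small sketch; all three are standard system variables):

```sql
SELECT @@innodb_flush_log_at_trx_commit AS flush_log_at_trx_commit,
       @@sync_binlog                    AS sync_binlog,
       @@innodb_log_file_size           AS log_file_size_bytes;
```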
There are no other slow queries ( > 1 s ) logged at or before that time.