We are using AWS RDS and we upgraded from MySQL 5.6.41 to MySQL 8.0.21 last December.
Since mid-January, we have noticed our replica getting stuck after running migrations that add foreign keys on new tables referencing our main table (main in the sense that it is queried by almost all API calls).
Typically, we have a Django migration creating a new model with a foreign key to our main table. This translates to an ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY ... REFERENCES ... DDL statement (verified by checking the SQL generated by the Django ORM).
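For illustration, such a migration generates DDL along these lines (the child table, column, and constraint names here are hypothetical, and pittsburgh_event is our assumption for the underlying table name of the pittsburgh.Event model):

```sql
-- Hypothetical example of the DDL Django generates for a new foreign key;
-- names are made up for illustration.
ALTER TABLE pittsburgh_ticket
    ADD CONSTRAINT pittsburgh_ticket_event_id_fk
    FOREIGN KEY (event_id) REFERENCES pittsburgh_event (id);
```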
When running such a migration in production, it executes fine on our master, which processes mostly write requests and a small portion of our read queries. Our replica, however, suddenly hangs on the referenced table (pittsburgh.Event in this case).
More specifically, RDS Performance Insights shows all the read requests to this pittsburgh.Event model hanging on the MDL_context lock (a.k.a. the table metadata lock).
When this happens, our entire production pretty much goes down and our only solution at the moment is to restart the replica.
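The stuck sessions can also be inspected directly on the replica; a minimal sketch, assuming the wait/lock/metadata/sql/mdl instrument is enabled (it is by default in 8.0):

```sql
-- Queries stuck on the metadata lock show up in the processlist with
-- State = 'Waiting for table metadata lock'
SHOW PROCESSLIST;

-- The granted and pending metadata locks themselves are visible here
-- ('pittsburgh' is our assumed schema name):
SELECT object_schema, object_name, lock_type, lock_status, owner_thread_id
FROM performance_schema.metadata_locks
WHERE object_schema = 'pittsburgh';
```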
We first thought this was caused by multi-threaded replication, which we had inadvertently enabled. However, the issue persisted after we disabled it.
After digging more, I was actually able to reproduce the issue with 3 sessions:
- In one session, I start a transaction and read something from the pittsburgh.Event table, but I do not commit or roll back yet: I just leave the transaction open
- In another session, I run a CREATE TABLE statement adding a foreign key constraint referencing pittsburgh.Event (autocommit enabled): this statement hangs, waiting for the metadata lock on pittsburgh.Event, which is still held by the transaction in the first session
- In a third session, I try to read something from the pittsburgh.Event table: this one also hangs, because MySQL queues the query behind the pending lock request from the second session (writes are prioritized over reads)
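Concretely, the three sessions can be sketched as follows (pittsburgh_event is our assumption for the underlying table name of the pittsburgh.Event model, and the child table is made up):

```sql
-- Session 1: open a transaction and leave it uncommitted
START TRANSACTION;
SELECT * FROM pittsburgh_event LIMIT 1;
-- (no COMMIT or ROLLBACK)

-- Session 2: hangs waiting for the metadata lock on pittsburgh_event
CREATE TABLE repro_child (
    id BIGINT PRIMARY KEY,
    event_id BIGINT,
    FOREIGN KEY (event_id) REFERENCES pittsburgh_event (id)
);

-- Session 3: also hangs, queued behind session 2's pending lock request
SELECT * FROM pittsburgh_event LIMIT 1;
```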
I tried to reproduce on MySQL 5.6, 5.7, and 8.0:
- on MySQL 5.6 and 5.7, the second and third sessions never hang, even if the first session has not committed yet
- on MySQL 8.0, the behavior described above occurs
Clearly, this seems to be a change in how MySQL 8.0 implements metadata locks. Digging through the MySQL change logs, I found this in the MySQL 8.0.3 release notes:
MySQL now extends metadata locks, as necessary, to tables that are related by a foreign key constraint. Extending metadata locks prevents conflicting DML and DDL operations from executing concurrently on related tables. This feature also enables updates to foreign key metadata when a parent table is modified. Previously, foreign key metadata, which is owned by the child table, could not be updated safely.
If a table is locked explicitly with LOCK TABLES, any tables related by a foreign key constraint are now opened and locked implicitly. For foreign key checks, a shared read-only lock (LOCK TABLES READ) is taken on related tables. For cascading updates, a shared-nothing write lock (LOCK TABLES WRITE) is taken on related tables that are involved in the operation.
If LOCK TABLES is active for a table in a foreign key relationship, ALTER TABLE … RENAME is not permitted for that table. This is a temporary restriction, lifted in MySQL 8.0.4 by the patch for Bug #26647340.
I’m not 100% certain this is the cause of our problem, but the description closely matches what I’m observing. If so, this behavior is apparently expected in MySQL 8.0.
From there, I’m not sure how to handle this problem. We can’t really downgrade to MySQL 5.7 (RDS only allows upgrades, and manually rebuilding the database at an older version from a backup would incur far too much downtime and sounds quite risky):
- Are there configuration options we can use to disable this new behavior? We ran MySQL 5.6 for years without issues, and we run migrations infrequently: this new mechanism does not appear to solve any problem of ours (in fact, it is causing us a lot of trouble)
- If we can’t disable it, what are appropriate ways to mitigate this? I saw some people playing with the lock timeout, but it does not seem to solve anything, and I’m not even sure it would apply on a replica
- Some recommendations are to track down the long-running transaction causing the hang. But how do we prevent this from being reintroduced later, with a mid-size engineering team where nobody can oversee all code changes? Again, we ran 5.6 for years without any such issue, so it is suddenly a lot of burden to review every database query change or to build monitoring that tracks this down
- We are also planning to improve our database request routing so we can detect problems on the replica and stop routing requests to it. That would keep our application from going down completely for too long and would improve our overall reliability, but again it just works around the problem without solving it
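For reference on the second and third points, the timeout people mention is lock_wait_timeout, and long-running transactions can be spotted via information_schema.innodb_trx; a minimal sketch (the 5-second and 60-second values are arbitrary, and whether the session timeout helps the replication applier is something I have not verified):

```sql
-- Cap how long DDL in this session waits for a metadata lock (the default
-- is 31536000 seconds, i.e. one year); the statement then fails with a
-- lock-wait timeout error instead of queueing behind an open transaction
-- and blocking all subsequent reads behind it.
SET SESSION lock_wait_timeout = 5;

-- Spot transactions that have been open for more than 60 seconds;
-- trx_mysql_thread_id can be cross-referenced with SHOW PROCESSLIST
-- (and passed to KILL) to clear the blocker.
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 60 SECOND;
```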