I am SQL Server DBA and encountered strange scenario, our cluster had 10 nodes 5 primary 5 secondary each node had SQL role …. Example Ams1pd11 to Ams1pd15 are primary .. and Ams3 side of PD11 to 15 was secondary … In this scenario, the behavior of the entire cluster after 2 node failure was unusual and all node availability groups were inaccessible, resulting in multiple customer failure.
Explanation of the real-time scenario, how it started when I was on the shift ….
Ams1pd12 has failed and hosted primary server A.
So role A on the best possible node was automatically canceled and he chose Ams1pd11.
Ams1pd11 already had a role, e.g. B. B
Now Ams1pd11 hosted both A and B when Ams1pd12 failed
Since both primary roles were on a node, there was a risk of balance. I failed over the B node to one of the secondary nodes on the Ams3 side. Now it was balanced and I was just about to investigate on Ams1pd12 why it failed and everything. ,
But suddenly the Ams1pd11 node also failed and the role did not fail and was stuck there …
Now 2 out of 10 nodes had failed and one role was stuck, so the customers in this role were affected.
We troubleshooted the same issue at Microsoft and found that other nodes were shown by the availability group for all blocked nodes and were not opening and blocking when expanding.
This affected all nodes in this cluster and therefore our backups were stopped.
There was data loss …
The stuck role and the customer on this node only had to accept a data loss of 15 minutes because the service for us and for them also failed at the same time
The nodes that appeared and AG groups were inaccessible. It was strange that the users who were already logged in could change the data. Only new connections were rejected. But the old connection was still active.
So if the problem started and the backup was stopped at 7am, most customers could access the database until 6pm, resulting in 11 hours of data loss.
Manual database recovery took a lot of effort …
These online nodes were easy to restore because we only had to attach the database databases while migrating data and log files for databases with deadlocked roles. We had to restore it manually.
Please suggest the best strategies to follow in this type of disaster to achieve rapid recovery.