Heads up for SQL Server on Linux folks using availability groups and Pacemaker. Pacemaker 1.1.18 has been out for a while now, but it’s worth mentioning that it changed how a cluster fails over. While the new behaviour is considered “correct”, it may affect you if you configured availability groups on a previous version (specifically 1.1.16).
Pacemaker package 1.1.18-11.el7 introduce[d] a behavior change for the start-failure-is-fatal cluster setting when its value is false. This change affects the failover workflow. If a primary replica experiences an outage, the cluster is expected to failover to one of the available secondary replicas. Instead, users will notice that the cluster keeps trying to start the failed primary replica. If that primary never comes online (because of a permanent outage), the cluster never fails over to another available secondary replica.
That quote comes from the cumulative update page for SQL Server 2017, along with two possible workarounds. The choice is entirely your call, but keep in mind that any changes in production need testing and a maintenance window.
- Roll back to Pacemaker 1.1.16.
- If you cannot roll back to 1.1.16, perform the following steps below, in sequence:
Remove the start-failure-is-fatal override from the existing cluster
- RedHat and Ubuntu
pcs property unset start-failure-is-fatal
or
pcs property set start-failure-is-fatal=true
- SuSE
crm configure property start-failure-is-fatal=true
Decrease the cluster-recheck-interval value
- RedHat and Ubuntu
pcs property set cluster-recheck-interval=<Xmin>
- SuSE
crm configure property cluster-recheck-interval=<Xmin>
Microsoft recommends that you set failure-timeout to 60s and cluster-recheck-interval to a value that is greater than 60 seconds. Setting cluster-recheck-interval to a small value is not recommended. For more information, refer to the Pacemaker documentation or consult the system provider.
Add the failure-timeout meta property to each AG resource
- RedHat and Ubuntu
pcs resource update ag1 meta failure-timeout=60s
- SuSE
crm configure edit ag1
- Then in the editor, add meta failure-timeout=60s after any params and before any ops.
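If you go with the second workaround on RedHat or Ubuntu, the three steps above could be strung together roughly like this. Treat it as a minimal sketch rather than an official script: the variable names, the run and apply_workaround helpers, and the dry-run switch are all hypothetical, and the 2min recheck interval is only an illustrative choice that satisfies the "greater than 60 seconds" guidance. The pcs commands themselves are the ones quoted above, and ag1 is the example resource name from the quote.

```shell
#!/bin/sh
# Sketch only: applies the second workaround with pcs (RedHat/Ubuntu).
# Defaults to a dry run that prints the commands instead of running them.
set -eu

AG_RESOURCE="${AG_RESOURCE:-ag1}"                 # example resource name from the quote
FAILURE_TIMEOUT_S="${FAILURE_TIMEOUT_S:-60}"      # 60s is the recommended value
RECHECK_INTERVAL_MIN="${RECHECK_INTERVAL_MIN:-2}" # assumption: illustrative value only
DRY_RUN="${DRY_RUN:-1}"                           # set to 0 to actually run pcs

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

apply_workaround() {
  # Guard: the recheck interval should be greater than the failure timeout.
  if [ "$((RECHECK_INTERVAL_MIN * 60))" -le "$FAILURE_TIMEOUT_S" ]; then
    echo "cluster-recheck-interval must be greater than failure-timeout" >&2
    return 1
  fi
  # Step 1: remove the start-failure-is-fatal override.
  run pcs property unset start-failure-is-fatal
  # Step 2: decrease the cluster-recheck-interval value.
  run pcs property set "cluster-recheck-interval=${RECHECK_INTERVAL_MIN}min"
  # Step 3: add the failure-timeout meta property to the AG resource.
  run pcs resource update "$AG_RESOURCE" meta "failure-timeout=${FAILURE_TIMEOUT_S}s"
}

apply_workaround
```

Run it as-is first to review the printed commands, then rerun with DRY_RUN=0 during your maintenance window. If you have more than one AG resource, step 3 needs repeating for each of them.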
Hopefully you haven’t been caught out by this change in the past year. Share your thoughts in the comments below.