SQL Server on Linux - feature change in Pacemaker 1.1.18

Heads up for SQL Server on Linux folks using availability groups and Pacemaker. Pacemaker 1.1.18 has been out for a while now, but it’s worth mentioning that it changed how a cluster fails over. While the new behaviour is considered “correct”, it may affect you if you configured availability groups on a previous version (specifically 1.1.16).
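If you want to check whether a node is on 1.1.18 or later, a version comparison along these lines may help. It is a minimal sketch rather than anything from Microsoft's notes: `version_at_least` is a hypothetical helper name, and it assumes plain dotted numeric versions, ignoring any release suffix such as `-11.el7`.

```shell
# Hypothetical helper: succeeds (exit status 0) when dotted version $1 is at
# or above dotted version $2. Anything after the first "-" (for example the
# "-11.el7" release suffix) is ignored. Assumes every field is numeric.
version_at_least() {
    have="${1%%-*}."
    want="${2%%-*}."
    while [ -n "$have" ] || [ -n "$want" ]; do
        h="${have%%.*}"; w="${want%%.*}"        # leading field of each version
        have="${have#*.}"; want="${want#*.}"    # drop that field
        [ "${h:-0}" -gt "${w:-0}" ] && return 0
        [ "${h:-0}" -lt "${w:-0}" ] && return 1
    done
    return 0    # every field was equal
}

if version_at_least "1.1.18-11.el7" "1.1.18"; then
    echo "Pacemaker is 1.1.18 or later: review your failover settings"
fi
```

On a RHEL-family node you could feed it the installed version with `version_at_least "$(rpm -q --queryformat '%{VERSION}' pacemaker)" 1.1.18`; other distributions need their own package query.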

> Pacemaker package 1.1.18-11.el7 introduce[d] a behavior change for the start-failure-is-fatal cluster setting when its value is false. This change affects the failover workflow. If a primary replica experiences an outage, the cluster is expected to failover to one of the available secondary replicas. Instead, users will notice that the cluster keeps trying to start the failed primary replica. If that primary never comes online (because of a permanent outage), the cluster never fails over to another available secondary replica.

That quote comes from the cumulative update page for SQL Server 2017, along with two possible workarounds. The choice is entirely your call, but keep in mind that any changes in production need testing and a maintenance window.

Roll back to Pacemaker 1.1.16.
If you cannot roll back to 1.1.16, perform the following steps, in sequence:

Remove the `start-failure-is-fatal` override from the existing cluster

RedHat and Ubuntu
- `pcs property unset start-failure-is-fatal`, or
- `pcs property set start-failure-is-fatal=true`

SuSE
- `crm configure property start-failure-is-fatal=true`

Decrease the `cluster-recheck-interval` value

RedHat and Ubuntu
- `pcs property set cluster-recheck-interval=<Xmin>`

SuSE
- `crm configure property cluster-recheck-interval=<Xmin>`

Microsoft recommends that you set failure-timeout to 60s and cluster-recheck-interval to a value that is greater than 60 seconds. Setting cluster-recheck-interval to a small value is not recommended. For more information, refer to the Pacemaker documentation or consult the system provider.

Add the `failure-timeout` meta property to each AG resource

RedHat and Ubuntu
- `pcs resource update ag1 meta failure-timeout=60s`

SuSE
- `crm configure edit ag1`
- Then, in the editor, add `meta failure-timeout=60s` after any params and before any ops.
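The three `pcs` steps above can be strung together in a small script. This is only a sketch for the RedHat and Ubuntu commands: `run` and `apply_workaround` are hypothetical names, the `DRY_RUN` switch is my own addition, and the `2min` recheck interval is a placeholder (the steps leave `<Xmin>` for you to choose). By default it only prints the commands so you can review them first.

```shell
# Hypothetical wrapper: prints the command unless DRY_RUN=0 is set,
# in which case it executes the command.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Hypothetical helper: $1 is the AG resource name, $2 the recheck interval.
# Stops at the first step that fails.
apply_workaround() {
    ag_resource="$1"
    recheck_interval="$2"

    # Step 1: set start-failure-is-fatal back to true
    # (the steps above also offer `pcs property unset`).
    run pcs property set start-failure-is-fatal=true || return 1

    # Step 2: shorten the recheck interval (keep it above 60 seconds).
    run pcs property set cluster-recheck-interval="$recheck_interval" || return 1

    # Step 3: let failures on the AG resource expire after 60 seconds.
    run pcs resource update "$ag_resource" meta failure-timeout=60s
}

# Placeholder arguments: ag1 is the resource name used in the steps above,
# 2min is an assumed interval.
apply_workaround ag1 2min
```

Once the printed commands look right, set `DRY_RUN=0` and run it again during your maintenance window.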

Hopefully you haven’t been caught out by this change in the past year. Share your thoughts in the comments below.

Photo by Mario Cañas on Unsplash.
