A Linux node in HA Cluster was fenced after "A processor failed, forming new configuration" event

Applies to

  • SLES15 SP1
  • Pacemaker
  • Corosync

Issue

  • Following a network fluctuation, the SLES cluster lost communication between the nodes.

Example:

Consider two SLES nodes, NODE_1 and NODE_2. During the issue, the following events are reported:

On NODE_1:

2021-03-22T19:23:53.519571+05:30 NODE_1 corosync[2399]:   [TOTEM ] A processor failed, forming new configuration.
2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]:   [TOTEM ] A new membership (100.70.47.199:2864) was formed. Members left: 2
2021-03-22T19:24:08.523644+05:30 NODE_1 corosync[2399]:   [TOTEM ] Failed to receive the leave message. failed: 2
2021-03-22T19:24:08.523787+05:30 NODE_1 corosync[2399]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.526645+05:30 NODE_1 pacemaker-based[3651]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.526943+05:30 NODE_1 sbd[2867]:    cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.527139+05:30 NODE_1 pacemaker-based[3651]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.527276+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.527444+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: Servant cluster is outdated (age: 880966)
2021-03-22T19:24:08.527580+05:30 NODE_1 corosync[2399]:   [QUORUM] Members[1]: 1
2021-03-22T19:24:08.527735+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.527895+05:30 NODE_1 corosync[2399]:   [MAIN  ] Completed service synchronization, ready to provide service.
2021-03-22T19:24:08.528077+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528223+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.528344+05:30 NODE_1 pacemaker-controld[3656]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
2021-03-22T19:24:08.528474+05:30 NODE_1 pacemaker-controld[3656]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528583+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.528837+05:30 NODE_1 pacemakerd[3649]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528979+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.529100+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Removing all NODE_2 attributes for peer loss
2021-03-22T19:24:08.529226+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.533635+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:50"
2021-03-22T19:24:08.535723+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:08.537831+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean
2021-03-22T19:24:09.537749+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_ip_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537871+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_sap_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537950+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Scheduling Node NODE_2 for STONITH
2021-03-22T19:24:09.538026+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Fence (reboot) NODE_2 'peer is no longer part of the cluster'
2021-03-22T19:24:09.538116+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move       rsc_ip_P4H_ERS10      ( NODE_2 -> NODE_1 )
2021-03-22T19:24:09.538191+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move       rsc_sap_P4H_ERS10     ( NODE_2 -> NODE_1 )

On NODE_2:

2021-03-22T19:24:08.497451+05:30 NODE_2 corosync[2350]:   [TOTEM ] A new membership (100.70.47.204:2864) was formed. Members left: 1
2021-03-22T19:24:08.501925+05:30 NODE_2 corosync[2350]:   [TOTEM ] Failed to receive the leave message. failed: 1
2021-03-22T19:24:08.502284+05:30 NODE_2 corosync[2350]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.502544+05:30 NODE_2 pacemaker-controld[2866]:  notice: Our peer on the DC (NODE_1) is dead
2021-03-22T19:24:08.502788+05:30 NODE_2 pacemaker-controld[2866]:  notice: State transition S_NOT_DC -> S_ELECTION
2021-03-22T19:24:08.502981+05:30 NODE_2 sbd[2681]:    cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.503233+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.503455+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: Servant cluster is outdated (age: 168738)
2021-03-22T19:24:08.503686+05:30 NODE_2 pacemaker-based[2861]:  notice: Node NODE_1 state is now lost
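
The "A processor failed, forming new configuration" message is logged by corosync's totem layer when the token is not received from the peer within the configured token timeout, at which point a new membership is formed without that node. A network fluctuation that outlasts the timeout is therefore enough to trigger the sequence above. As a rough sketch (the values shown are illustrative, not taken from this cluster), the relevant settings live in the totem section of /etc/corosync/corosync.conf:

totem {
    version: 2
    # Time in milliseconds to wait for the token before declaring a
    # processor failed; raising it makes the cluster more tolerant of
    # short network fluctuations at the cost of slower failure detection.
    token: 5000
    # Number of token retransmits attempted before the token is
    # considered lost.
    token_retransmits_before_loss_const: 10
}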

  • This causes a split-brain situation in which both nodes try to fence each other. This event is called a "Fence Race": data integrity is maintained, but access to all services is lost.

2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean 

2021-03-22T19:24:23.775660+05:30 NODE_2 pacemaker-schedulerd[2865]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Cluster node NODE_1 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:23.776130+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Node NODE_1 is unclean
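
The sbd "cluster health check: UNHEALTHY" warnings on both nodes show that each node's SBD inquisitor marked the cluster servant as outdated as soon as the peer was lost. As a rough sketch for inspecting SBD state (the device path below is a placeholder; the real path is defined by SBD_DEVICE in /etc/sysconfig/sbd):

# Find the configured SBD device(s)
grep ^SBD_DEVICE /etc/sysconfig/sbd

# List the messaging slots and any pending fence messages on the device
sbd -d /dev/disk/by-id/<sbd-device> list

# Dump the on-disk SBD header, including the watchdog and msgwait timeouts
sbd -d /dev/disk/by-id/<sbd-device> dump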

  • In the above example, the "Fence Race" was won by NODE_2, which fenced (rebooted) NODE_1. NODE_1's own reboot action against NODE_2 was randomly delayed, which is why NODE_2 shot first:

2021-03-22T19:24:09.540321+05:30 NODE_1 pacemaker-controld[3656]:  notice: Requesting fencing (reboot) of node NODE_2
2021-03-22T19:24:09.540428+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Client pacemaker-controld.3656.cafb628a wants to fence (reboot) 'NODE_2' with device '(any)'
2021-03-22T19:24:09.540527+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Requesting peer fencing (reboot) of NODE_2
2021-03-22T19:24:09.823655+05:30 NODE_1 pacemaker-fenced[3652]:  notice: stonith-sbd can fence (reboot) NODE_2: dynamic-list
2021-03-22T19:24:09.823908+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Delaying 'reboot' action targeting NODE_2 on stonith-sbd for 29s (timeout=60s, requested_delay=0s, base=0s, max=30s)
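
The 29-second delay in the last message is a random fencing delay (note max=30s in the log), corresponding to the pcmk_delay_max attribute on the fencing resource. A random delay is the usual safeguard against a fence race: the node that draws the shorter delay fences first and survives. As a rough crmsh sketch (the resource name stonith-sbd is taken from the logs above; verify it against your configuration):

# Show the current definition of the fencing resource
crm configure show stonith-sbd

# Set or adjust the maximum random fencing delay to 30 seconds
crm resource param stonith-sbd set pcmk_delay_max 30s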

 
