NetApp Knowledge Base

A Linux node in an HA cluster was fenced after an "A processor failed, forming new configuration" event

Category: fas-systems
Specialty: san

Applies to

  • SLES15 SP1
  • Pacemaker
  • Corosync

Issue

  • After a network fluctuation, the nodes of the SLES cluster lost communication with each other.

Example:

Consider two SLES nodes, NODE_1 and NODE_2. During the issue, the following events are reported:

On NODE_1:

2021-03-22T19:23:53.519571+05:30 NODE_1 corosync[2399]:   [TOTEM ] A processor failed, forming new configuration.
2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]:   [TOTEM ] A new membership (100.70.47.199:2864) was formed. Members left: 2
2021-03-22T19:24:08.523644+05:30 NODE_1 corosync[2399]:   [TOTEM ] Failed to receive the leave message. failed: 2
2021-03-22T19:24:08.523787+05:30 NODE_1 corosync[2399]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.526645+05:30 NODE_1 pacemaker-based[3651]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.526943+05:30 NODE_1 sbd[2867]:    cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.527139+05:30 NODE_1 pacemaker-based[3651]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.527276+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.527444+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: Servant cluster is outdated (age: 880966)
2021-03-22T19:24:08.527580+05:30 NODE_1 corosync[2399]:   [QUORUM] Members[1]: 1
2021-03-22T19:24:08.527735+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.527895+05:30 NODE_1 corosync[2399]:   [MAIN  ] Completed service synchronization, ready to provide service.
2021-03-22T19:24:08.528077+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528223+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.528344+05:30 NODE_1 pacemaker-controld[3656]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
2021-03-22T19:24:08.528474+05:30 NODE_1 pacemaker-controld[3656]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528583+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.528837+05:30 NODE_1 pacemakerd[3649]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528979+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.529100+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Removing all NODE_2 attributes for peer loss
2021-03-22T19:24:08.529226+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.533635+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:50"
2021-03-22T19:24:08.535723+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:08.537831+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean
2021-03-22T19:24:09.537749+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_ip_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537871+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_sap_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537950+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Scheduling Node NODE_2 for STONITH
2021-03-22T19:24:09.538026+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Fence (reboot) NODE_2 'peer is no longer part of the cluster'
2021-03-22T19:24:09.538116+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move       rsc_ip_P4H_ERS10      ( NODE_2 -> NODE_1 )
2021-03-22T19:24:09.538191+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move       rsc_sap_P4H_ERS10     ( NODE_2 -> NODE_1 )

On NODE_2:

2021-03-22T19:24:08.497451+05:30 NODE_2 corosync[2350]:   [TOTEM ] A new membership (100.70.47.204:2864) was formed. Members left: 1
2021-03-22T19:24:08.501925+05:30 NODE_2 corosync[2350]:   [TOTEM ] Failed to receive the leave message. failed: 1
2021-03-22T19:24:08.502284+05:30 NODE_2 corosync[2350]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.502544+05:30 NODE_2 pacemaker-controld[2866]:  notice: Our peer on the DC (NODE_1) is dead
2021-03-22T19:24:08.502788+05:30 NODE_2 pacemaker-controld[2866]:  notice: State transition S_NOT_DC -> S_ELECTION
2021-03-22T19:24:08.502981+05:30 NODE_2 sbd[2681]:    cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.503233+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.503455+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: Servant cluster is outdated (age: 168738)
2021-03-22T19:24:08.503686+05:30 NODE_2 pacemaker-based[2861]:  notice: Node NODE_1 state is now lost
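When triaging logs like the ones above, it can help to reduce them to a short timeline of the membership-related events. The following is a minimal Python sketch, not part of the original article: the regular expression, the `KEY_EVENTS` list, and the function names are illustrative assumptions based only on the line layout visible in these excerpts, and may need adjusting for other syslog formats.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpts above:
# <ISO timestamp> <host> <daemon>[<pid>]: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)

# Substrings copied from the log messages shown in this article.
KEY_EVENTS = (
    "A processor failed, forming new configuration",
    "A new membership",
    "state is now lost",
    "will be fenced",
)

def parse_line(line):
    """Split one syslog line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": datetime.fromisoformat(m.group("ts")),
        "host": m.group("host"),
        "daemon": m.group("daemon"),
        "msg": m.group("msg"),
    }

def key_events(lines):
    """Return the parsed lines that contain one of KEY_EVENTS, ordered by time."""
    events = []
    for line in lines:
        rec = parse_line(line)
        if rec is not None and any(k in rec["msg"] for k in KEY_EVENTS):
            events.append(rec)
    return sorted(events, key=lambda r: r["time"])

# Sample input: four lines copied from the NODE_1 excerpt.
sample = [
    "2021-03-22T19:23:53.519571+05:30 NODE_1 corosync[2399]:   [TOTEM ] A processor failed, forming new configuration.",
    "2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]:   [TOTEM ] A new membership (100.70.47.199:2864) was formed. Members left: 2",
    "2021-03-22T19:24:08.526645+05:30 NODE_1 pacemaker-based[3651]:  notice: Node NODE_2 state is now lost",
    '2021-03-22T19:24:08.533635+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:50"',
]

events = key_events(sample)
# Seconds between the "processor failed" line and the new membership line.
gap = (events[1]["time"] - events[0]["time"]).total_seconds()
for e in events:
    print(e["time"].isoformat(), e["host"], e["daemon"], e["msg"])
print("seconds from failure detection to new membership:", gap)
```

On the sample lines, the hawk-apiserver entry is filtered out and the gap between the first two events is about 15 seconds, which matches the timestamps in the NODE_1 excerpt.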

  • This causes a split-brain situation in which both nodes try to fence each other. This event is called a "Fence Race": data integrity is maintained, but access to all services is lost.

2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean 

2021-03-22T19:24:23.775660+05:30 NODE_2 pacemaker-schedulerd[2865]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Cluster node NODE_1 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:23.776130+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Node NODE_1 is unclean
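The mutual fencing decisions shown above can be detected mechanically when logs from both nodes are merged. The sketch below is an illustration only: it assumes the pacemaker-schedulerd message wording seen in these excerpts ("Cluster node ... will be fenced"), and the names `fence_intents` and `is_fence_race` are hypothetical helpers, not Pacemaker tooling.

```python
import re

# Assumed wording, copied from the pacemaker-schedulerd lines in this article.
FENCE_RE = re.compile(
    r"^\S+\s+(?P<host>\S+)\s+pacemaker-schedulerd\[\d+\]:\s+"
    r"warning: Cluster node (?P<target>\S+) will be fenced"
)

def fence_intents(lines):
    """Collect (deciding_host, target_node) pairs from schedulerd log lines."""
    pairs = set()
    for line in lines:
        m = FENCE_RE.match(line.strip())
        if m:
            pairs.add((m.group("host"), m.group("target")))
    return pairs

def is_fence_race(pairs):
    """True if some node A plans to fence B while B also plans to fence A."""
    return any(
        host != target and (target, host) in pairs
        for host, target in pairs
    )

# Sample input: one line from each node, copied from the excerpts above.
merged = [
    "2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster",
    "2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Cluster node NODE_1 will be fenced: peer is no longer part of the cluster",
]

pairs = fence_intents(merged)
print(sorted(pairs))
print("fence race:", is_fence_race(pairs))
```

With only one node's line as input, `is_fence_race` returns False, so the check needs logs from both cluster nodes to be meaningful.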

  • In the above example, NODE_2 won the "Fence Race" and fenced (rebooted) NODE_1. NODE_1's own request to fence NODE_2 was delayed by 29s:

2021-03-22T19:24:09.540321+05:30 NODE_1 pacemaker-controld[3656]:  notice: Requesting fencing (reboot) of node NODE_2
2021-03-22T19:24:09.540428+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Client pacemaker-controld.3656.cafb628a wants to fence (reboot) 'NODE_2' with device '(any)'
2021-03-22T19:24:09.540527+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Requesting peer fencing (reboot) of NODE_2
2021-03-22T19:24:09.823655+05:30 NODE_1 pacemaker-fenced[3652]:  notice: stonith-sbd can fence (reboot) NODE_2: dynamic-list
2021-03-22T19:24:09.823908+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Delaying 'reboot' action targeting NODE_2 on stonith-sbd for 29s (timeout=60s, requested_delay=0s, base=0s, max=30s)
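To see how the delay in the last log line affects the race, the earliest time at which NODE_1 could have fenced NODE_2 can be estimated from that line. This is a rough sketch under two assumptions that are not stated in the article: that the delay starts counting at the log timestamp, and that the message wording matches the excerpt above. The function name `earliest_fence_time` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed wording, copied from the pacemaker-fenced line in this article.
DELAY_RE = re.compile(
    r"Delaying '(?P<action>\w+)' action targeting (?P<target>\S+) "
    r"on (?P<device>\S+) for (?P<delay>\d+)s"
)

def earliest_fence_time(line):
    """Return (target, estimated time the delayed action can start), or None."""
    m = DELAY_RE.search(line)
    if m is None:
        return None
    logged_at = datetime.fromisoformat(line.split()[0])
    # Assumption: the delay is counted from the moment the line was logged.
    return m.group("target"), logged_at + timedelta(seconds=int(m.group("delay")))

delay_line = (
    "2021-03-22T19:24:09.823908+05:30 NODE_1 pacemaker-fenced[3652]:  notice: "
    "Delaying 'reboot' action targeting NODE_2 on stonith-sbd for 29s "
    "(timeout=60s, requested_delay=0s, base=0s, max=30s)"
)

target, not_before = earliest_fence_time(delay_line)
print(target, "cannot be fenced by NODE_1 before about", not_before.isoformat())
```

Under these assumptions the estimate is 19:24:38, which is later than the 19:24:23 timestamp at which NODE_2 logged its own decision to fence NODE_1. Whether NODE_2's request was also delayed is not visible in the excerpts shown here.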

 
