
NetApp Knowledge Base

A Linux node in HA Cluster was fenced after "A processor failed, forming new configuration" event

Category: fas-systems
Specialty: san

Applies to

  • SLES15 SP1
  • Pacemaker
  • Corosync

Issue

  • After a network fluctuation, the nodes of the SLES cluster lose communication with each other.

Example:

Consider two SLES nodes, NODE_1 and NODE_2. During the issue, the following events are reported:

On NODE_1:

2021-03-22T19:23:53.519571+05:30 NODE_1 corosync[2399]:   [TOTEM ] A processor failed, forming new configuration.
2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]:   [TOTEM ] A new membership (100.70.47.199:2864) was formed. Members left: 2
2021-03-22T19:24:08.523644+05:30 NODE_1 corosync[2399]:   [TOTEM ] Failed to receive the leave message. failed: 2
2021-03-22T19:24:08.523787+05:30 NODE_1 corosync[2399]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.526645+05:30 NODE_1 pacemaker-based[3651]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.526943+05:30 NODE_1 sbd[2867]:    cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.527139+05:30 NODE_1 pacemaker-based[3651]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.527276+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.527444+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: Servant cluster is outdated (age: 880966)
2021-03-22T19:24:08.527580+05:30 NODE_1 corosync[2399]:   [QUORUM] Members[1]: 1
2021-03-22T19:24:08.527735+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.527895+05:30 NODE_1 corosync[2399]:   [MAIN  ] Completed service synchronization, ready to provide service.
2021-03-22T19:24:08.528077+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528223+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.528344+05:30 NODE_1 pacemaker-controld[3656]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
2021-03-22T19:24:08.528474+05:30 NODE_1 pacemaker-controld[3656]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528583+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.528837+05:30 NODE_1 pacemakerd[3649]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528979+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.529100+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Removing all NODE_2 attributes for peer loss
2021-03-22T19:24:08.529226+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.533635+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:50"
2021-03-22T19:24:08.535723+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:08.537831+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean
2021-03-22T19:24:09.537749+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_ip_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537871+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_sap_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537950+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Scheduling Node NODE_2 for STONITH
2021-03-22T19:24:09.538026+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Fence (reboot) NODE_2 'peer is no longer part of the cluster'
2021-03-22T19:24:09.538116+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move       rsc_ip_P4H_ERS10      ( NODE_2 -> NODE_1 )
2021-03-22T19:24:09.538191+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move       rsc_sap_P4H_ERS10     ( NODE_2 -> NODE_1 )

On NODE_2:

2021-03-22T19:24:08.497451+05:30 NODE_2 corosync[2350]:   [TOTEM ] A new membership (100.70.47.204:2864) was formed. Members left: 1
2021-03-22T19:24:08.501925+05:30 NODE_2 corosync[2350]:   [TOTEM ] Failed to receive the leave message. failed: 1
2021-03-22T19:24:08.502284+05:30 NODE_2 corosync[2350]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.502544+05:30 NODE_2 pacemaker-controld[2866]:  notice: Our peer on the DC (NODE_1) is dead
2021-03-22T19:24:08.502788+05:30 NODE_2 pacemaker-controld[2866]:  notice: State transition S_NOT_DC -> S_ELECTION
2021-03-22T19:24:08.502981+05:30 NODE_2 sbd[2681]:    cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.503233+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.503455+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: Servant cluster is outdated (age: 168738)
2021-03-22T19:24:08.503686+05:30 NODE_2 pacemaker-based[2861]:  notice: Node NODE_1 state is now lost
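
The NODE_1 excerpt shows a gap between the "A processor failed" message and the formation of the new membership. As a minimal sketch (not a NetApp or SUSE tool), that gap can be measured from the timestamps. The function name `membership_gap_seconds` is hypothetical, and the sketch assumes each log line starts with an ISO 8601 timestamp as in the excerpt.

```python
from datetime import datetime

# Two lines copied verbatim from the NODE_1 excerpt.
NODE_1_LOG = """\
2021-03-22T19:23:53.519571+05:30 NODE_1 corosync[2399]:   [TOTEM ] A processor failed, forming new configuration.
2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]:   [TOTEM ] A new membership (100.70.47.199:2864) was formed. Members left: 2
"""


def membership_gap_seconds(log_text):
    """Seconds between 'A processor failed' and the next 'A new membership' line.

    Returns None if the pair of messages is not found.
    """
    failed_at = None
    for line in log_text.splitlines():
        # The timestamp is the first space-separated field of each line.
        stamp, _, message = line.partition(" ")
        if "A processor failed, forming new configuration" in message:
            failed_at = datetime.fromisoformat(stamp)
        elif "A new membership" in message and failed_at is not None:
            return (datetime.fromisoformat(stamp) - failed_at).total_seconds()
    return None


gap = membership_gap_seconds(NODE_1_LOG)
print(f"New membership formed {gap:.3f}s after the failure was detected")
```

For the two lines above, the gap is roughly 15 seconds.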

  • The loss of communication causes a split-brain situation in which both nodes try to fence each other. This event is called a "Fence Race": data integrity is maintained, but access to all services is lost.

2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean 

2021-03-22T19:24:23.775660+05:30 NODE_2 pacemaker-schedulerd[2865]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Cluster node NODE_1 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:23.776130+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Node NODE_1 is unclean
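
The mutual fencing visible in these two excerpts can be spotted mechanically when the logs of both nodes are combined. The following is a rough sketch under stated assumptions: the function name `find_fence_races` and the regular expression are illustrative, and the pattern is only known to match the pacemaker-schedulerd message format shown in this article.

```python
import re

# One line from each node, copied verbatim from the excerpts.
LINES = [
    "2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster",
    "2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Cluster node NODE_1 will be fenced: peer is no longer part of the cluster",
]

# Assumed layout: <timestamp> <reporting node> pacemaker-schedulerd[pid]: warning: ...
FENCE_RE = re.compile(
    r"^\S+ (?P<reporter>\S+) pacemaker-schedulerd\[\d+\]:\s+"
    r"warning: Cluster node (?P<target>\S+) will be fenced"
)


def find_fence_races(lines):
    """Return sorted node pairs in which each node scheduled fencing of the other."""
    decisions = set()
    for line in lines:
        match = FENCE_RE.search(line)
        if match:
            decisions.add((match["reporter"], match["target"]))
    # A race exists when both (a, b) and (b, a) were scheduled.
    races = {tuple(sorted(pair)) for pair in decisions if pair[::-1] in decisions}
    return sorted(races)


print(find_fence_races(LINES))
```

A single line on its own is not reported, because one node fencing a lost peer is normal behavior. Only the pair of opposing decisions indicates a race.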

  • In the above example, NODE_2 won the "Fence Race" and fenced (rebooted) NODE_1. On NODE_1, the request to fence NODE_2 was delayed by 29s:

2021-03-22T19:24:09.540321+05:30 NODE_1 pacemaker-controld[3656]:  notice: Requesting fencing (reboot) of node NODE_2
2021-03-22T19:24:09.540428+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Client pacemaker-controld.3656.cafb628a wants to fence (reboot) 'NODE_2' with device '(any)'
2021-03-22T19:24:09.540527+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Requesting peer fencing (reboot) of NODE_2
2021-03-22T19:24:09.823655+05:30 NODE_1 pacemaker-fenced[3652]:  notice: stonith-sbd can fence (reboot) NODE_2: dynamic-list
2021-03-22T19:24:09.823908+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Delaying 'reboot' action targeting NODE_2 on stonith-sbd for 29s (timeout=60s, requested_delay=0s, base=0s, max=30s)
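
The last line of this excerpt carries the delay applied to the fencing action on NODE_1. A small sketch of how that line could be parsed follows. The name `parse_fence_delay` and the regular expression are hypothetical, and treating the delay as counted from the log timestamp is an assumption made for illustration only.

```python
import re
from datetime import datetime, timedelta

# Copied verbatim from the NODE_1 excerpt.
LINE = (
    "2021-03-22T19:24:09.823908+05:30 NODE_1 pacemaker-fenced[3652]:  notice: "
    "Delaying 'reboot' action targeting NODE_2 on stonith-sbd for 29s "
    "(timeout=60s, requested_delay=0s, base=0s, max=30s)"
)

DELAY_RE = re.compile(
    r"Delaying '(?P<action>\w+)' action targeting (?P<target>\S+) "
    r"on (?P<device>\S+) for (?P<delay>\d+)s \((?P<params>[^)]*)\)"
)


def parse_fence_delay(line):
    """Extract target, device, delay, and parameters from a 'Delaying' message."""
    match = DELAY_RE.search(line)
    if match is None:
        return None
    # "timeout=60s, max=30s" -> {"timeout": 60, "max": 30}
    params = {}
    for item in match["params"].split(", "):
        key, value = item.split("=", 1)
        params[key] = int(value.rstrip("s"))
    logged_at = datetime.fromisoformat(line.split(" ", 1)[0])
    delay_s = int(match["delay"])
    return {
        "action": match["action"],
        "target": match["target"],
        "device": match["device"],
        "delay_s": delay_s,
        "params": params,
        # Assumption: the delay starts at the time the message was logged.
        "earliest_start": logged_at + timedelta(seconds=delay_s),
    }


info = parse_fence_delay(LINE)
print(info["target"], info["delay_s"], info["params"]["max"], info["earliest_start"].isoformat())
```

Under that assumption, NODE_1 would not begin fencing NODE_2 before 19:24:38, while NODE_2 had already scheduled the fencing of NODE_1 at 19:24:23.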

 
