CONTAP-449185: PANIC: Failover Monitor: unable to transit - takeover process is hung (wafl) in SK process cf_main on release 9.9.1P16 (C)
Issue
During a SnapMirror update, the source node encountered multiple "Out of Memory" (OOM) errors, causing subsequent SnapMirror failures. Eventually, a failover attempt resulted in a panic on the partner node.
Panic on cpu#10 : PANIC: Failover Monitor: unable to transit - takeover process is hung (wafl) in SK process cf_main on release 9.9.1P16 (C) on Tue Apr 29 15:40:36 CST 2025
This node started to take over its partner because the partner had panicked.
Tue Apr 29 15:30:34 +0800 [node01: cf_firmware: cf.fm.partnerFwTransition:info]: params: {'prevstate': 'SF_UP', 'newstate': 'SF_SPARECORE', 'progresscounter': '2'}
Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.firmwareStatus:info]: Failover monitor: partner Dumping sparecore
Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.
Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
Tue Apr 29 15:30:34 +0800 [node01: cf_takeover: ha.takeover.stateChng:debug]: params: {'old_state': 'NOT_IN_TAKEOVER', 'new_state': 'IN_CFO_TAKEOVER'}
Tue Apr 29 15:30:34 +0800 [node01: cf_takeover: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
...
Tue Apr 29 15:30:34 +0800 [node01: cf_takeover: cf.fm.takeoverCommitted:debug]: Failover monitor: takeover committed
Tue Apr 29 15:30:34 +0800 [node01: ThreadHandlerun: clam.update.partner.state:info]: CLAM on node (ID=1000) updated failover state of partner (ID=1001) to to.
...
Tue Apr 29 15:31:00 +0800 [node01: monitor: monitor.globalStatus.ok:notice]: This node is attempting to takeover node02.
However, the transit event timed out after 10 minutes (takeover started at 15:30:34, and the panic occurred at 15:40:36), which caused this node to panic.
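The 10-minute gap between the start of the takeover and the panic can be checked directly from the log excerpt. The following is a minimal sketch, not a NetApp tool: the regular expression, the function names, and the sample inputs are illustrative assumptions based only on the line format shown above. It also assumes that "CST" in the panic string means the same +0800 offset that appears in the EMS lines, and that the EMS lines (which carry no year) are from 2025.

```python
import re
from datetime import datetime, timedelta

# Assumed layout of an EMS line, inferred from the excerpt above:
#   <dow> <mon> <day> <HH:MM:SS> <+zzzz> [<node>: <process>: <event>:<severity>]: <message>
EMS_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"\[(?P<node>[^:]+): (?P<proc>[^:]+): (?P<event>[^:\]]+):(?P<sev>\w+)\]: "
    r"(?P<msg>.*)$"
)

TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_ems_time(ts, year):
    """Parse an EMS timestamp; the year is supplied by the caller because the log omits it."""
    return datetime.strptime(f"{ts} {year}", TIME_FORMAT)


def takeover_start(lines, year):
    """Return the time of the first 'UP --> TAKEOVER' state transition, or None if absent."""
    for line in lines:
        m = EMS_LINE.match(line.strip())
        if m and m["event"] == "cf.fsm.stateTransit" and "UP --> TAKEOVER" in m["msg"]:
            return parse_ems_time(m["ts"], year)
    return None


def parse_panic_time(ts, assumed_offset="+0800"):
    """Parse the panic-string timestamp, treating 'CST' as assumed_offset (an assumption)."""
    return datetime.strptime(ts.replace(" CST ", f" {assumed_offset} "), TIME_FORMAT)


# Sample input copied from the excerpt above.
sample_lines = [
    "Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.takeover.panic:alert]: "
    "Failover monitor: takeover attempted after partner panic.",
    "Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.stateTransit:info]: "
    "Failover monitor: UP --> TAKEOVER",
]

start = takeover_start(sample_lines, year=2025)
panic = parse_panic_time("Tue Apr 29 15:40:36 CST 2025")
elapsed = panic - start  # time spent in the TAKEOVER transition before the panic
```

With the timestamps from this case, `elapsed` comes to 10 minutes and 2 seconds, which matches the 10-minute transit timeout described above.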