Unexpected node reboot in MetroCluster IP
Applies to
- ONTAP 9
- MetroCluster IP
- AFF-A700
- X91146A T6 card
Issue
- Node reboots unexpectedly with no indication of an issue
- SP logs show the HA partner taking disk reservations, which normally occurs only after a takeover (CLAM takeover):
Apr 18 04:01:40 [NodeA1:clam.node.ooq:EMERGENCY]: Node (name=NodeA2,
ID=1001) is out of "CLAM quorum" (reason=quorum update).
A disk reservation was detected on disk 7a.10.3P3 at 18Apr2023 04:01:44
Ordinarily, this will only occur if the partner node has taken over.
This node will be shutdown.
HALT: HA partner has taken over disk reservations
Uptime: 47d18h37m13s
System rebooting...
- HA interconnect timeouts are reported shortly before the reboot, and a takeover is then triggered because the partner heartbeat is lost:
Sun Apr 18 20:35:39 +0200 [NodeA1: DR_heartbeat_thread: cf.ic.xferTimedOut:error]: HA interconnect: MCC_DRSOM transfer timed out.
Sun Apr 18 20:35:39 +0200 [NodeA1: cf_firmware: cf.ic.xferTimedOut:error]: HA interconnect: OFW transfer timed out.
Sun Apr 18 20:35:58 +0200 [NodeA1: cf_main:cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
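When reviewing a saved EMS log, it can help to confirm that the interconnect timeouts really do precede the takeover on the same node. The following is a minimal Python sketch, not a NetApp tool. The function names, the 60-second window, and the placeholder year are assumptions made for illustration. It assumes entries follow the layout shown above and that all entries share the same UTC offset. It only looks for the two event names that appear in this article.

```python
import re
from datetime import datetime, timedelta

# Month lookup avoids locale-dependent parsing of abbreviated month names.
MONTHS = {name: idx for idx, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches the EMS entry header as it appears in this article, for example:
#   Sun Apr 18 20:35:58 +0200 [NodeA1: cf_main:cf.fsm.takeover.noHeartbeat:alert]:
# The weekday is not captured, and the UTC offset is matched but ignored.
EMS_ENTRY = re.compile(
    r"(?P<mon>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) [+-]\d{4} "
    r"\[(?P<node>[^:\]]+):\s*(?P<proc>[^:\]]+):\s*(?P<event>[\w.]+):(?P<sev>\w+)\]:"
)

def parse_events(text, year=2000):
    """Return (timestamp, node, event) tuples for every EMS header in text.

    The log carries no year, so a placeholder year is used (assumption).
    """
    events = []
    for m in EMS_ENTRY.finditer(text):
        if m["mon"] not in MONTHS:
            continue
        hour, minute, second = map(int, m["time"].split(":"))
        ts = datetime(year, MONTHS[m["mon"]], int(m["day"]), hour, minute, second)
        events.append((ts, m["node"], m["event"]))
    return events

def timeouts_before_takeover(text, window_seconds=60):
    """For each no-heartbeat takeover, list how many seconds earlier each
    HA interconnect timeout was logged on the same node, within the window."""
    events = parse_events(text)
    timeouts = [e for e in events if e[2] == "cf.ic.xferTimedOut"]
    report = []
    for ts, node, event in events:
        if event != "cf.fsm.takeover.noHeartbeat":
            continue
        window_start = ts - timedelta(seconds=window_seconds)
        delays = [int((ts - t).total_seconds())
                  for t, n, _ in timeouts
                  if n == node and window_start <= t <= ts]
        report.append((node, ts, delays))
    return report

# The three entries quoted in this article, used as sample input.
SAMPLE = (
    "Sun Apr 18 20:35:39 +0200 [NodeA1: DR_heartbeat_thread: cf.ic.xferTimedOut:error]: "
    "HA interconnect: MCC_DRSOM transfer timed out. "
    "Sun Apr 18 20:35:39 +0200 [NodeA1: cf_firmware: cf.ic.xferTimedOut:error]: "
    "HA interconnect: OFW transfer timed out. "
    "Sun Apr 18 20:35:58 +0200 [NodeA1: cf_main:cf.fsm.takeover.noHeartbeat:alert]: "
    "Failover monitor: Takeover initiated after no heartbeat was detected from the partner node."
)

if __name__ == "__main__":
    for node, ts, delays in timeouts_before_takeover(SAMPLE):
        print(f"{node}: takeover at {ts:%H:%M:%S}, "
              f"interconnect timeouts {delays} seconds earlier")
```

For the sample entries, both timeouts are found 19 seconds before the takeover. The parser searches the whole text rather than going line by line, so it also handles logs where several entries have been pasted onto one line.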