AFF-A900 node abnormal reboot in MetroCluster IP
Applies to
- ONTAP 9
- AFF-A900
- MetroCluster IP
- Node reboot
Issue
- Node reboots unexpectedly, with no clear reason or panic recorded in the event log or BMC logs
- System logs report abnormal reboot events:
Record 1045: Wed Dec 13 11:54:07.700528 2023 [BMC.critical]: Filer Reboots
Record 1046: Wed Dec 13 15 11:54:07.711401 2023 [Trap Event.critical]: SNMP abnormal_reboot (28)
- HA partner reports that a takeover was initiated after losing the heartbeat:
Wed Dec 13 12:54:21 +0100 [Node_A: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
- Event logs show ICL errors for one of the T6 cards in the system:
[?] Wed Feb 15 12:53:25 +0100 [Node_A: ICL error: pcie.stealth.errors:debug]: params: {'pcie_errors': 'IIO0: RPT(166,2,0): T62100-CR Dual 40/100G NIC in slot 5 on Controller, Dv[600d](169,0,0) in slot 5: DevStatus(Corr), CorrErr(Rcvr); Dv[600d](169,0,1) in slot 5: DevStatus(Corr), CorrErr(Rcvr); Dv[600d](169,0,2) in slot 5: DevStatus(Corr), CorrErr(Rcvr); Dv[600d](169,0,3) in slot 5: DevStatus(Corr), CorrErr(Rcvr); Dv[640d](169,0,4) in slot 5: DevStatus(Corr), CorrErr(Rcvr); Dv[650d](169,0,5) in slot 5: DevStatus(Corr), CorrErr(Rcvr); Dv[660d](169,0,6) in slot 5: DevStatus(Corr), CorrErr(Rcvr); '}
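The three symptoms above can be checked together when reviewing collected logs. The following is a minimal sketch, not a NetApp tool: it assumes the log text has already been exported to plain lines, and the function name `scan_log_lines` and the signature labels are hypothetical names chosen for illustration. The match strings are taken only from the log excerpts shown above.

```python
import re

# Match strings come from the excerpts in this article; the dictionary keys
# are illustrative labels, not official event names.
SIGNATURES = {
    "abnormal_reboot": re.compile(r"SNMP abnormal_reboot"),
    "takeover_no_heartbeat": re.compile(r"cf\.fsm\.takeover\.noHeartbeat"),
    "pcie_corrected_errors": re.compile(r"pcie\.stealth\.errors"),
}

# Extracts the slot number from phrases such as "in slot 5".
SLOT_RE = re.compile(r"in slot (\d+)")


def scan_log_lines(lines):
    """Return (hits, slots): matching lines per signature, and the sorted
    slot numbers named in PCIe error lines."""
    hits = {name: [] for name in SIGNATURES}
    slots = set()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.strip())
                if name == "pcie_corrected_errors":
                    slots.update(int(s) for s in SLOT_RE.findall(line))
    return hits, sorted(slots)
```

If all three signature lists are non-empty and the PCIe lines point to a single slot, the logs are consistent with the pattern described in this article. This only narrows down where to look and does not confirm a root cause.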