PCIe stealth errors on X91153A card lead to node reboots
Applies to
- AFF A900
- ONTAP 9.12.1P4
- X91153A Ethernet Storage Controller
Issue
- Node reboots without a clear cause in the EMS or SP logs
- EMS logs repeatedly show pcie.stealth.errors events like the following:
Fri Oct 20 22:45:27 -0400 [cluster1-01: ICL error: pcie.stealth.errors:debug]: params: {'pcie_errors': 'IIO0: RPT(135,2,0): Microchip PCI-E Switch on Controller, Microchip PCI-E Switch in slot 11 on Controller, Br[4000](137,0,0): DevStatus(Corr), CorrErr(Rcvr,RpTim); Br[4036](139,0,0) in slot 11: DevStatus(Corr), CorrErr(RpTim); '}
Fri Oct 20 22:47:27 -0400 [cluster1-01: ICL error: pcie.stealth.errors:debug]: params: {'pcie_errors': 'IIO0: RPT(135,2,0): Microchip PCI-E Switch on Controller, Br[4000](137,0,0): DevStatus(Corr), CorrErr(Rcvr); '}
- The output of sysconfig -ac shows that the slot called out in the errors (slot 11 in this case) contains the X91153A card:
sysconfig: slot 11 OK: X91153A: 2p 40G/100G RoCE QSFP28
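When many of these events have been collected, it can help to extract the slot numbers and correctable-error types programmatically before comparing them with the sysconfig -ac output. The following is a minimal Python sketch, not a NetApp tool. The function name parse_stealth_errors and both regular expressions are hypothetical and were inferred only from the two sample lines above, so other ONTAP releases or platforms may format the pcie_errors string differently. The three numbers in parentheses after each bridge are kept as an opaque "location" tuple because the article does not state what they mean.

```python
import re

# Patterns inferred from the two sample EMS lines in this article only.
# They are assumptions, not a documented ONTAP log format.
SLOT_RE = re.compile(r"in slot (\d+)")
BRIDGE_RE = re.compile(
    r"Br\[(\d+)\]\((\d+),(\d+),(\d+)\)"   # bridge ID and its three-number location
    r"(?: in slot (\d+))?"                # slot is only present on some bridges
    r": DevStatus\(([^)]*)\), CorrErr\(([^)]*)\)"
)


def parse_stealth_errors(line):
    """Return (slots, bridges) found in one pcie.stealth.errors EMS line.

    slots   -- sorted list of distinct slot numbers mentioned anywhere in the line
    bridges -- one dict per "Br[...]" entry with its status and error flags
    """
    slots = sorted({int(s) for s in SLOT_RE.findall(line)})
    bridges = []
    for m in BRIDGE_RE.finditer(line):
        bridges.append({
            "bridge": m.group(1),
            "location": tuple(int(x) for x in m.group(2, 3, 4)),
            "slot": int(m.group(5)) if m.group(5) else None,
            "dev_status": m.group(6),
            "corr_err": m.group(7).split(","),
        })
    return slots, bridges
```

Applied to the first sample event above, the function reports slot 11 and two bridge entries, the second of which is tied to slot 11. A line that names no slot, such as the second sample event, yields an empty slot list, so it cannot be matched to a card on its own.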