CONTAP-452309: A800 or C800 nodes reboot due to disk failure
Issue
- A disk failure on the internal shelf may cause PCI errors on one or both nodes, resulting in an unexpected reboot with a panic message similar to the following:
Sat Apr 19 16:16:58 -0400 [Node01: idle: cpu14: sk.panic:alert]: Panic String: Uncorrectable Machine Check Error at CPU14. SKL_IIO Error: STATUS<0xbb80000000000e0b>(VALID,UC,EN,MISCV,PCC,S,AR,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0),MCACOD(0xe0b))MISC<0x000000005d000000>(UCR_BUS_LOG(93),UCR_DEVICE_LOG(0),UCR_FUNCTION_LOG(0),UCR_SEGMENT_LOG(0))IIO Machine Check from device(s):RPT(93,0,0):ErrSrcID(CorrSrc(0x5e00),UCorrSrc(0)), PCI Device 1b4b:2241 in NVMe slot 27 on Controller. IIO Machine Check from device(s): in process idle: cpu14 on release 9.16.1P1 (C)
- The affected node(s) reboot and return to service, and the disk called out in the panic string may be missing or failed.
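
When reviewing many panic messages, it can help to pull out the disk called out in the panic string programmatically. The following is a minimal sketch, not a supported tool: the function name `parse_panic_disk` and the regular expression are hypothetical, and the pattern assumes the "PCI Device <vendor>:<device> in NVMe slot <n>" wording seen in the example message above. Other releases or platforms may format this text differently.

```python
import re

# Assumption: the panic string contains a fragment shaped like
# "PCI Device 1b4b:2241 in NVMe slot 27", as in the example message.
PANIC_DISK_RE = re.compile(
    r"PCI Device (?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4}) "
    r"in NVMe slot (?P<slot>\d+)"
)


def parse_panic_disk(panic_line):
    """Return the PCI IDs and NVMe slot named in a panic string, or None."""
    match = PANIC_DISK_RE.search(panic_line)
    if match is None:
        return None
    return {
        "vendor_id": match.group("vendor").lower(),
        "device_id": match.group("device").lower(),
        "nvme_slot": int(match.group("slot")),
    }


# Fragment copied from the example panic message in this article.
SAMPLE = (
    "IIO Machine Check from device(s):RPT(93,0,0):"
    "ErrSrcID(CorrSrc(0x5e00),UCorrSrc(0)), "
    "PCI Device 1b4b:2241 in NVMe slot 27 on Controller."
)

if __name__ == "__main__":
    print(parse_panic_disk(SAMPLE))
```

For the example message, the sketch reports NVMe slot 27, which is the disk to check for a missing or failed state after the node returns to service.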