Uncorrectable Machine Check Error on AFF A250 following motherboard replacement
Applies to
AFF A250
Issue
- After the motherboard is replaced, the node continues to panic shortly after boot with uncorrectable machine check errors similar to the following:
PANIC : Uncorrectable Machine Check Error at CPU9. SKL_IIO Error: STATUS<0xbb80000000000e0b>(VALID,UC,EN,MISCV,PCC,S,AR,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0),MCACOD(0xe0b))MISC<0x0000000064000000>(UCR_BUS_LOG(100),UCR_DEVICE_LOG(0),UCR_FUNCTION_LOG(0),UCR_SEGMENT_LOG(0))IIO Machine Check from device(s):RPT(100,0,0):ErrSrcID(CorrSrc(0x66a0),UCorrSrc(0x66a0)), PLX PCIE 9797 switch on Controller, Br[9797](102,20,0): Link down. ,.
PANIC : Uncorrectable Machine Check Error at CPU7. SKL_IIO Error: STATUS<0xbb80000000000e0b>(VALID,UC,EN,MISCV,PCC,S,AR,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0),MCACOD(0xe0b))MISC<0x0000000064000000>(UCR_BUS_LOG(100),UCR_DEVICE_LOG(0),UCR_FUNCTION_LOG(0),UCR_SEGMENT_LOG(0))IIO Machine Check from device(s):RPT(100,0,0):ErrSrcID(CorrSrc(0x66a0),UCorrSrc(0x66a8)), PLX PCIE 9797 switch on Controller, Mellanox CX5 Ethernet in slot 2 on Controller, Mellanox CX5 Ethernet in slot 2 on Controller, PLX PCIE 9797 switch on Controller, Br[9797](102,21,0): Link down. ,.
- Console logs may report the following during boot:
Device Bus:76 Dev:0 Fun:0 (slot 2) failed to train at max link speed/width
-Expected GEN3, actual GEN1
-Expected x8, actual x8
Uncorrectable error detected at PCIE:Bus:100 Dev:0 Fun:0 for 1 time(s)!!!!
- BMC event logs show bus correctable errors followed by a bus uncorrectable error:
f1 | OEM record ee | Device Bus: 118 Dev: 0 Fun: 0 (slot 2) Failed to train at max link speed/width, retraining cycle 0
- Expected GEN3, actual GEN1
- Expected x8, actual x8
...
101 | 01/08/2023 | 18:20:06 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
102 | 01/08/2023 | 18:20:06 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
103 | 01/08/2023 | 18:20:06 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
104 | 01/08/2023 | 18:20:06 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
...
13b | OEM record ee | Device Bus: 118 Dev: 0 Fun: 0 (slot 2) Failed to train at max link speed/width, retraining cycle 0
- Expected GEN3, actual GEN1
- Expected x8, actual x8
...
149 | 01/08/2023 | 17:14:03 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
14a | 01/08/2023 | 17:14:03 | Critical Interrupt #0x31 | Bus Uncorrectable error | Asserted
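
The STATUS value in the panic strings above can be cross-checked against the flag list printed next to it. The following is a minimal sketch, assuming the value follows the architectural IA32_MCi_STATUS bit layout from the Intel SDM. The function names are hypothetical, the flag names follow the panic string (the SDM calls bit 63 VAL), and reading the bus number from bits 31:24 of MISC is an assumption inferred only from the sample (0x64 = 100), not from a documented layout.

```python
# Flag bits of IA32_MCi_STATUS, highest bit first (architectural layout).
STATUS_FLAGS = [
    (63, "VALID"),   # register contains a valid error
    (62, "OVER"),    # an earlier error was overwritten
    (61, "UC"),      # uncorrected error
    (60, "EN"),      # error reporting enabled
    (59, "MISCV"),   # MISC register is valid
    (58, "ADDRV"),   # ADDR register is valid
    (57, "PCC"),     # processor context corrupt
    (56, "S"),       # signaled as a machine check
    (55, "AR"),      # action required
]


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit machine check STATUS value into its fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "corr_err_status": (status >> 53) & 0x3,   # bits 54:53
        "corr_err_cnt": (status >> 38) & 0x7FFF,   # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,          # model-specific code
        "mcacod": status & 0xFFFF,                 # architectural error code
    }


def misc_bus(misc: int) -> int:
    """ASSUMPTION: bus number in bits 31:24 of MISC, inferred from the sample."""
    return (misc >> 24) & 0xFF


decoded = decode_mc_status(0xBB80000000000E0B)
print(",".join(decoded["flags"]))
print(hex(decoded["mcacod"]), decoded["mscod"])
print(misc_bus(0x0000000064000000))
```

For the sample value, the decoded flags match the list in the panic string (VALID, UC, EN, MISCV, PCC, S, AR), MCACOD is 0xe0b, and the assumed bus field is 100, which agrees with UCR_BUS_LOG(100) and the RPT(100,0,0) root port.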
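
The link training messages above report both an expected and an actual value for speed and width, and only the speed differs. As a rough sketch for scanning a saved console or BMC log for such mismatches, the snippet below pairs each "Expected ..., actual ..." line and keeps those where the two values differ. The function name and the sample text are illustrative, and the pattern assumes the message wording shown in this article.

```python
import re

# Matches lines such as "- Expected GEN3, actual GEN1" or "-Expected x8, actual x8".
EXPECTED_ACTUAL = re.compile(r"Expected\s+(\S+),\s+actual\s+(\S+)")


def link_mismatches(log_text: str) -> list:
    """Return (expected, actual) pairs where the trained value differs."""
    mismatches = []
    for line in log_text.splitlines():
        match = EXPECTED_ACTUAL.search(line)
        if match and match.group(1) != match.group(2):
            mismatches.append((match.group(1), match.group(2)))
    return mismatches


sample = """\
Device Bus:76 Dev:0 Fun:0 (slot 2) failed to train at max link speed/width
-Expected GEN3, actual GEN1
-Expected x8, actual x8
"""

print(link_mismatches(sample))
```

On the console excerpt from this article, only the speed pair (GEN3 expected, GEN1 actual) is reported; the width trained at the expected x8.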