CONTAP-170876: FAS80x0 panics with "Uncorrectable Machine Check Error at CPU<n>" message
Issue
A FAS80x0 storage system can panic with an uncorrectable machine check error (UMCE)
due to a catastrophic error (CATERR) on one of the CPUs on its process control
module (PCM). The panic message "Uncorrectable Machine Check Error at CPU<n>"
indicates that a CPU core has experienced an error. In addition, the service
processor (SP) system event log indicates which CPU has experienced the
catastrophic error.
Example panic message:
PANIC : Uncorrectable Machine Check Error at CPU1. SNB_DCU Error:
STATUS<0xbb80000000000175>(Val,UnCor,Enable,MiscV,PCC,ExtErr(0),ErrCode(Eviction,Data,L1))
MISC<0x0000000000000086>((0x86)).
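The flags in parentheses after the STATUS value can be reproduced from the raw hexadecimal number. The following is a minimal Python sketch, assuming the STATUS field follows the Intel machine-check status register layout (bit 63 = Val, 61 = UnCor, 60 = Enable, 59 = MiscV, 57 = PCC) and that the low 16 bits carry a cache-hierarchy error code. The function and table names are hypothetical and are not part of any storage OS or SP tooling.

```python
# Sketch only: decode a machine-check STATUS value such as the one in the
# panic string. Bit positions are an assumption based on the Intel
# IA32_MCi_STATUS register layout; all names here are illustrative.

FLAG_BITS = [
    (63, "Val"),     # register contents are valid
    (62, "Over"),    # an earlier error was overwritten
    (61, "UnCor"),   # error was not corrected by hardware
    (60, "Enable"),  # error reporting was enabled
    (59, "MiscV"),   # MISC register holds valid data
    (58, "AddrV"),   # ADDR register holds valid data
    (57, "PCC"),     # processor context may be corrupt
]

REQUESTS = {0: "Generic", 1: "Read", 2: "Write", 3: "DataRead",
            4: "DataWrite", 5: "InstFetch", 6: "Prefetch",
            7: "Eviction", 8: "Snoop"}
TYPES = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}


def decode_status(status: int) -> dict:
    """Return the set flags and, for cache errors, the decoded error code."""
    flags = [name for bit, name in FLAG_BITS if (status >> bit) & 1]
    code = status & 0xFFFF  # low 16 bits: architectural error code
    result = {"flags": flags, "mca_code": code}
    # Cache-hierarchy errors use the pattern 000F 0001 RRRR TTLL;
    # bit 12 (F) is masked out before comparing.
    if (code & 0xEF00) == 0x0100:
        result["errcode"] = (
            REQUESTS.get((code >> 4) & 0xF, "Unknown"),
            TYPES.get((code >> 2) & 0x3, "Unknown"),
            LEVELS[code & 0x3],
        )
    return result


if __name__ == "__main__":
    print(decode_status(0xBB80000000000175))
```

For the example value 0xbb80000000000175, this yields the flags Val, UnCor, Enable, MiscV, and PCC with the error code (Eviction, Data, L1), which agrees with the decoded text in the panic message.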
SP system event log messages:
Record 769: ...[Agent.notice]: 094.026: 36 : CPU0 Catastrophic Error asserted <<<< CATERR
Record 770: ...[Agent.notice]: 094.026: 36 : CPU0 Catastrophic Error de-asserted
Record 771: ...[SP.critical]: Heartbeat stopped
Record 772: ...[Trap Event.warning]: hwassist loss_of_heartbeat (30)
Record 773: ...[ASUP.notice]: First notification email | (HEARTBEAT_LOSS) WARNING | Sent
Record 774: ...[Controller.notice]: Appliance panic. See logs for cause of panic.
Record 775: ...[IPMI.notice]: 9002 | 02 | EVT: 6f406fff | Sensor 255 | Assertion Event, "Storage OS stop/shutdown"
Record 776: ...[Agent.notice]: 531.537: 11 : Controller Attention LED asserted
Record 777: ...[Agent.notice]: 531.537: 14 : Attention LED (at Midplane) asserted
Record 778: ...[Agent.notice]: 082.424: 49 : PCH Platform Reset asserted
Record 779: ...[Agent.notice]: 082.485: 63 : BIOS Complete from PCH de-asserted
Record 780: ...[Agent.notice]: 091.309: 49 : PCH Platform Reset de-asserted
Record 781: ...[SP.critical]: Filer Reboots
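
When reviewing a long SP system event log, the CPU that raised the catastrophic error can be picked out by searching for the "Catastrophic Error asserted" records. The sketch below assumes the log has been saved as plain text in the format shown above; the function name and the regular expression are hypothetical helpers, not part of the SP command set.

```python
import re

# Matches "CPU<n> Catastrophic Error asserted" but not the "de-asserted"
# record, because "de-asserted" does not directly follow "Error ".
CATERR_RE = re.compile(r"CPU(\d+) Catastrophic Error asserted")


def caterr_cpus(log_text: str) -> list:
    """Return the sorted CPU numbers that logged a CATERR assertion."""
    return sorted({int(m.group(1)) for m in CATERR_RE.finditer(log_text)})


if __name__ == "__main__":
    sample = (
        "Record 769: ...[Agent.notice]: 094.026: 36 : "
        "CPU0 Catastrophic Error asserted\n"
        "Record 770: ...[Agent.notice]: 094.026: 36 : "
        "CPU0 Catastrophic Error de-asserted\n"
        "Record 771: ...[SP.critical]: Heartbeat stopped\n"
    )
    print(caterr_cpus(sample))
```

Applied to records 769 through 771 above, the function returns a list containing only CPU 0.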
