CHW-2351: AFF A90/AFF C80 halts after eSPI Fatal Error
Issue
- The node reboots unexpectedly with the following console output:
eSPI Fatal Error
HALT!
System recovered from eSPI fatal error
PCH eSPI LnkErr0 = 0x0060FF00
PCH eSPI LnkErr1 = 0x8073FF25
Configuring Devices ...
CPU = 2 Processor(s) Detected.
Intel(R) Xeon(R) Gold 5416S (CPU 0)
CPUID: 0x000806F8. Cores per Processor = 16
Intel(R) Xeon(R) Gold 5416S (CPU 1)
CPUID: 0x000806F8. Cores per Processor = 16
131072 MB System RAM Installed.
NVME Device: 0X331111900329D0SAM000PM9A30001T00025000
Boot Loader version 8.2.0IR
Copyright (C) 2000-2003 Broadcom Corporation.
Portions Copyright (C) 2002-2024 NetApp, Inc. All Rights Reserved.
ACPI RSDP Found at 0x777fe014
- The node is able to recover after the failure
- The BMC logs show an NMI and an L2 watchdog hard reset, followed by the eSPI fatal error recovery messages:
Record 1855: Tue May 28 06:44:34.787314 2024 [BMC CLI.notice]: admin "events all "
Record 1856: Tue May 28 07:16:27.066671 2024 [IPMI.notice]: 0d99 | 02 | EVT: 6fc824ff | System_Watchdog | Assertion Event, "Timer interrupt"
Record 1857: Tue May 28 07:16:27.477566 2024 [IPMI Event.critical]: NMI
Record 1858: Tue May 28 07:16:27.478002 2024 [IPMI.notice]: 0d9a | 02 | EVT: 6f00ffff | CriticalInt | Assertion Event, "NMI/Diag Interrupt"
Record 1859: Tue May 28 07:16:28.309502 2024 [IPMI.notice]: 0d9b | 02 | EVT: 6fc124ff | System_Watchdog | Assertion Event, "Hard reset"
Record 1860: Tue May 28 07:16:28.456602 2024 [IPMI Event.critical]: L2 watchdog timeout hard reset
Record 1861: Tue May 28 07:16:28.487454 2024 [IPMI Event.critical]: System reset
Record 1862: Tue May 28 07:16:28.488306 2024 [IPMI.notice]: 0d9c | 02 | EVT: 0301ffff | SysReset | Assertion Event, "State Asserted"
Record 1863: Tue May 28 07:16:28.492843 2024 [IPMI Event.critical]: L2 watchdog action completed
Record 1864: Tue May 28 07:16:28.492644 2024 [IPMI.notice]: L2 to L1 is 1(s) 10179(us)
Record 1865: Tue May 28 07:16:49.493705 2024 [IPMI.notice]: 0d9d | 02 | EVT: 6f0500ff | Sensor 255 | Assertion Event, "Timestamp Clock Sync"
Record 1866: Tue May 28 07:16:50.000430 2024 [IPMI.notice]: 0d9e | 02 | EVT: 6f0580ff | Sensor 255 | Assertion Event, "Timestamp Clock Sync"
Record 1867: Tue May 28 07:16:50.086813 2024 [IPMI.notice]: (PUA) Enable power to all PCIe slots
Record 1868: Tue May 28 07:16:50.147757 2024 [IPMI.notice]: (PUA) Enable power to all PCIe on board device
Record 1869: Tue May 28 07:16:50.172322 2024 [IPMI.notice]: (PUA) P_stat :slots=0x1,onboard_devs=0x0,final
Record 1870: Tue May 28 07:16:50.172359 2024 [IPMI.notice]: (PUA) Atleast one PCIe slot's power status cha
Record 1871: Tue May 28 07:16:51.000000 2024 [SysFW.notice]: BIOS Version: 20.0IR
Record 1872: Tue May 28 07:16:51.363071 2024 [BMC.notice]: ScratchPad Config Info received from BIOS
Record 1873: Tue May 28 07:16:52.000000 2024 [SysFW.notice]: System recovered from eSPI fatal error
Record 1874: Tue May 28 07:16:52.000000 2024 [SysFW.notice]: PCH eSPI LnkErr0 = 0x0060FF00
Record 1875: Tue May 28 07:16:52.000000 2024 [SysFW.notice]: PCH eSPI LnkErr1 = 0x8073FF25
Record 1876: Tue May 28 07:16:55.779460 2024 [BMC.notice]: Delaying L2_WDOG ASUP email for 120 seconds
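When triaging several nodes, it can help to scan saved BMC event logs for this signature instead of reading them by eye. The following is a minimal sketch, assuming the log has been saved as plain text in the same "Record N: timestamp [source]: message" layout shown above. The function names (parse_records, summarize_espi_event) are hypothetical helpers written for this illustration and are not part of any NetApp tool. The sketch only extracts the LnkErr register values; it does not attempt to decode their bit fields.

```python
import re

# Matches one BMC event log record in the layout shown above, e.g.
# Record 1873: Tue May 28 07:16:52.000000 2024 [SysFW.notice]: System recovered ...
RECORD_RE = re.compile(
    r"^Record (?P<num>\d+): "
    r"(?P<ts>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{6} \d{4}) "
    r"\[(?P<source>[^\]]+)\]: (?P<msg>.*)$"
)

# Matches the two link error register lines reported after recovery.
LNKERR_RE = re.compile(r"PCH eSPI LnkErr(?P<idx>[01]) = (?P<val>0x[0-9A-Fa-f]{8})")


def parse_records(text):
    """Return a list of dicts, one per line that looks like a BMC record."""
    records = []
    for line in text.splitlines():
        m = RECORD_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a record line
        records.append(
            {
                "num": int(m.group("num")),
                "ts": m.group("ts"),  # kept as text; no timezone is assumed
                "source": m.group("source"),
                "msg": m.group("msg"),
            }
        )
    return records


def summarize_espi_event(text):
    """Flag the messages that make up the signature described in this article."""
    summary = {
        "nmi": False,
        "watchdog_reset": False,
        "espi_recovered": False,
        "lnkerr": {},  # register index (0 or 1) -> integer value
    }
    for rec in parse_records(text):
        msg = rec["msg"]
        if msg.strip() == "NMI" or "NMI/Diag Interrupt" in msg:
            summary["nmi"] = True
        if "L2 watchdog timeout hard reset" in msg:
            summary["watchdog_reset"] = True
        if "System recovered from eSPI fatal error" in msg:
            summary["espi_recovered"] = True
        m = LNKERR_RE.search(msg)
        if m:
            summary["lnkerr"][int(m.group("idx"))] = int(m.group("val"), 16)
    return summary
```

Calling summarize_espi_event() on the saved log text returns a dictionary. A log matches the signature in this article when nmi, watchdog_reset and espi_recovered are all True, and the lnkerr entries can be printed with format(value, "#010x") for comparison against the values shown above.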