Unexpected reboot on StorageGRID appliance due to hardware PCIe error
Applies to
- NetApp StorageGRID Appliance SG5700
- NetApp StorageGRID Appliance SG6000
- NetApp StorageGRID Appliance SG100/1000
Issue
StorageGRID reports an unexpected node reboot.
The BMC logs may contain entries such as:
[Information] [Extended PCIe Error] [OEM Record C0] ManufacturerID:000315/ VID:8086/ DID:2030/ ErrorID 1:51/ SlotNo : 1-1
[Information] [Extended PCIe Error] [OEM Record C0] ManufacturerID:000315/ VID:8086/ DID:2030/ ErrorID 1:24/ SlotNo : 1-1
[Critical] [PCIe Error] [Critical Interrupt] Bus Fatal (Bus17/Dev0/Fun0) - Asserted
[Critical] [Critical INT] [Critical Interrupt] Software NMI - Asserted
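When many BMC events have to be reviewed, the PCIe-related entries can be picked out with a short script. The following is a minimal sketch in Python. The patterns are inferred only from the sample lines above, and the function name `parse_bmc_line` is hypothetical. Other BMC firmware versions may format these events differently, so treat this as a starting point rather than a supported tool. Values are kept as strings because the log does not state whether the bus number is decimal or hexadecimal.

```python
import re

# Patterns derived only from the sample BMC lines in this article (assumption:
# other firmware versions may use a different layout).
EXTENDED_RE = re.compile(
    r"\[Extended PCIe Error\].*?VID:(?P<vid>[0-9A-Fa-f]{4})/\s*"
    r"DID:(?P<did>[0-9A-Fa-f]{4})/\s*ErrorID\s*(?P<error_id>[\d:]+)/\s*"
    r"SlotNo\s*:\s*(?P<slot>[\d-]+)"
)
BUS_FATAL_RE = re.compile(
    r"Bus Fatal \(Bus(?P<bus>\w+)/Dev(?P<dev>\w+)/Fun(?P<fun>\w+)\)"
)


def parse_bmc_line(line):
    """Return a dict describing a PCIe-related BMC event, or None."""
    match = EXTENDED_RE.search(line)
    if match:
        return {"kind": "extended", **match.groupdict()}
    match = BUS_FATAL_RE.search(line)
    if match:
        return {"kind": "bus_fatal", **match.groupdict()}
    return None


if __name__ == "__main__":
    sample = (
        "[Critical] [PCIe Error] [Critical Interrupt] "
        "Bus Fatal (Bus17/Dev0/Fun0) - Asserted"
    )
    print(parse_bmc_line(sample))
```

Lines that match neither pattern, such as the Software NMI entry, return `None` and can be reviewed separately.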
The file base-os-logs/run/mount-tmp/pge-actv-root/var/log/syslog
in the StorageGRID support bundle may contain:
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.860304] BERT: Error records from previous boot:
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.865158] [Hardware Error]: event severity: fatal
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.870009] [Hardware Error]: Error 0, type: fatal
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.874859] [Hardware Error]: section_type: PCIe error
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.880142] [Hardware Error]: port_type: 4, root port
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.885337] [Hardware Error]: version: 1.16
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.889669] [Hardware Error]: command: 0x0010, status: 0x0000
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.895557] [Hardware Error]: device_id: 0000:00:02.2
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.900752] [Hardware Error]: slot: 0
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.904563] [Hardware Error]: secondary_bus: 0x00
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.909412] [Hardware Error]: vendor_id: 0x8086, device_id: 0x6f06
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.915732] [Hardware Error]: class_code: 000604
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.920495] [Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.927937] [Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x00000000
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.935983] [Hardware Error]: aer_uncor_severity: 0x00062030
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.941785] [Hardware Error]: TLP Header: 00000000 00000000 00000000 0000000
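
The hexadecimal fields in the `[Hardware Error]` lines above can be extracted and the AER uncorrectable-error registers decoded with a short script. The following is a minimal sketch in Python. The function names `parse_hw_error_fields` and `decode_aer_uncor` are hypothetical, and the line layout is assumed to match the excerpt. The bit names follow the PCI Express Advanced Error Reporting uncorrectable error registers, where status, mask, and severity share the same bit layout.

```python
import re

# Bit positions in the PCIe AER Uncorrectable Error Status/Mask/Severity
# registers (all three registers use the same layout).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}

# Matches "name: 0x..." pairs; non-hex fields such as "slot: 0" are skipped.
HEX_FIELD_RE = re.compile(r"(\w+):\s*(0x[0-9A-Fa-f]+)")


def parse_hw_error_fields(lines):
    """Collect hex-valued fields from "[Hardware Error]:" syslog lines."""
    fields = {}
    for line in lines:
        if "[Hardware Error]:" not in line:
            continue
        payload = line.split("[Hardware Error]:", 1)[1]
        for name, value in HEX_FIELD_RE.findall(payload):
            fields[name] = int(value, 16)
    return fields


def decode_aer_uncor(value):
    """Return the names of the AER uncorrectable-error bits set in value."""
    return [
        name
        for bit, name in sorted(AER_UNCOR_BITS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    sample = [
        "kernel: [ 4.927937] [Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x00000000",
        "kernel: [ 4.935983] [Hardware Error]: aer_uncor_severity: 0x00062030",
    ]
    fields = parse_hw_error_fields(sample)
    print(decode_aer_uncor(fields["aer_uncor_status"]))
    print(decode_aer_uncor(fields["aer_uncor_severity"]))
```

For the excerpt above, `aer_uncor_status` is zero, so the record shows no individual uncorrectable-error bit latched. The `aer_uncor_severity` value only lists which error types the port treats as fatal. Decoding these registers does not by itself identify the failing component.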