Panic on two-node cluster with interconnect down and no takeover

Last updated: Apr 15, 2024

Category:: fas-systems

Specialty:: HW

Applies to

FAS2750 and other platforms with internal HA interconnect
Two-node clusters
ONTAP 9

Issue

The cluster master node panics with a PCI Error NMI similar to:

PANIC: PCI Error NMI from device(s):ErrSrcID(CorrSrc(0x8),UCorrSrc(0)), RPT(0,1,0):PLX PCIE 8725 switch on Controller, X3311A in slot 1 on Controller.

The HA interconnect goes down at the time of the panic:

[?]  Fri Apr 12 16:00:00 +0300 [cluster-01: statd: cf.takeover.disabled:alert]: HA mode, but takeover of partner is disabled due to reason : HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support.

[?]  Fri Apr 12 16:00:00 +0300 [cluster-01: statd: ic.HAInterconnectDown:error]: HA interconnect: Interconnect down for 29 minutes: links down

[?]  Fri Apr 12 16:00:00 +0300 [cluster-01: statd: callhome.hainterconnect.down:alert]: Call home for HA INTERCONNECT DOWN due to links down.
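
When reviewing a saved EMS log for this symptom, it can help to filter for the three HA interconnect events shown above. The following Python sketch is illustrative only and is not a NetApp tool: the line layout it assumes (`[node: process: event:severity]: message`) is inferred solely from the sample messages in this article, and the names `parse_ems_line` and `ha_interconnect_events` are hypothetical helpers.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout, inferred only from the sample messages in this article:
#   <timestamp> [<node>: <process>: <event>:<severity>]: <message>
EMS_LINE = re.compile(
    r"\[(?P<node>[^:\]]+):\s*(?P<process>[^:\]]+):\s*"
    r"(?P<event>[A-Za-z0-9_.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)"
)

# Event names copied from the sample messages above.
HA_IC_EVENTS = {
    "cf.takeover.disabled",
    "ic.HAInterconnectDown",
    "callhome.hainterconnect.down",
}


class EmsEvent(NamedTuple):
    node: str
    event: str
    severity: str
    message: str


def parse_ems_line(line: str) -> Optional[EmsEvent]:
    """Return the parsed event, or None if the line does not fit the assumed layout."""
    m = EMS_LINE.search(line)
    if m is None:
        return None
    return EmsEvent(m["node"].strip(), m["event"], m["severity"], m["message"].strip())


def ha_interconnect_events(lines: Iterable[str]) -> List[EmsEvent]:
    """Keep only the HA interconnect events listed in HA_IC_EVENTS, in log order."""
    events = []
    for line in lines:
        ev = parse_ems_line(line)
        if ev is not None and ev.event in HA_IC_EVENTS:
            events.append(ev)
    return events
```

Passing the lines of an exported log file to `ha_interconnect_events` returns the matching events in order, which makes it easier to compare their timestamps with the time of the panic.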

The HA partner node remains up but stops serving data, and all cluster applications go offline, as seen in the output of the advanced-privilege command cluster ring show.