Cluster network degraded alerts and takeover not possible on AFF A800


Category: aff-series

Specialty: hw

Applies to

AFF A800
AFF C800
X1146A T62100-CR

Issue

The cluster generates daily CLUSTER NETWORK DEGRADED alerts:

[cluster-01: vifmgr: callhome.clus.net.degraded:alert]: Call home for CLUSTER NETWORK DEGRADED: Total Packet Loss - Ping failures detected between cluster-01_clus2 ( 169.254.32.8 ) on cluster-01 and cluster-02_clus1 ( 169.254.99.167 ) on cluster-02

The cluster generates hourly HA interconnect down alerts:

6/21/2024 08:00:00  nodename     ERROR         ic.HAInterconnectDown: HA interconnect: Interconnect down for 93 minutes: link1 down
6/21/2024 07:00:00  nodename     ALERT         callhome.hainterconnect.down: Call home for HA INTERCONNECT DOWN due to link1 down.

The cluster also raises alerts that takeover is disabled because of unsynchronized NVRAM logs:

[cluster-01: statd: cf.takeover.disabled:alert]: HA mode, but takeover of partner is disabled due to reason : unsynchronized log.

The EMS log shows the following messages:

[cluster-01: nvmm_mirror_sync: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_LAYOUT_SYNCING is aborted because of reason NVPM_ERR_MSG_SEND_FAILED.
[cluster-01: nvmm_error: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_OFFLINE is aborted because of reason NVMM_ABORT_SYNCING_MIRROR.
[cluster-01: nvmm_helper: nvpm.state.changed:debug]: Node 1's NVPM state changed from "2" to "2".

These alerts begin after the following message is logged:

[cluster-01: intr: netif.fatal.err:alert]: The network device in slot 1 encountered fatal error e1a/e1b.
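
To check whether the HA and cluster network alerts really start after the adapter error, the bracketed EMS lines can be parsed and filtered. The following is a minimal sketch, not a NetApp tool: the regular expression is inferred only from the message layout shown in this article, the function names are hypothetical, and the input is assumed to be in chronological order (oldest first).

```python
import re

# Layout assumed from the EMS excerpts in this article:
# [node: process: event.name:severity]: message
EMS_LINE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<process>[^:\]]+): "
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]: (?P<message>.*)"
)


def parse_ems_line(line):
    """Return a dict of the EMS fields, or None if the line does not match."""
    match = EMS_LINE.match(line.strip())
    return match.groupdict() if match else None


def events_after(lines, trigger_event):
    """Return parsed events that follow the first occurrence of trigger_event.

    Assumes 'lines' is ordered oldest first; unparseable lines are skipped.
    """
    parsed = [entry for entry in map(parse_ems_line, lines) if entry]
    for index, entry in enumerate(parsed):
        if entry["event"] == trigger_event:
            return parsed[index + 1:]
    return []


# Two messages copied from this article, used as sample input.
sample = [
    "[cluster-01: intr: netif.fatal.err:alert]: The network device in slot 1 "
    "encountered fatal error e1a/e1b.",
    "[cluster-01: statd: cf.takeover.disabled:alert]: HA mode, but takeover of "
    "partner is disabled due to reason : unsynchronized log.",
]

followups = events_after(sample, "netif.fatal.err")
for entry in followups:
    print(entry["node"], entry["severity"], entry["event"])
```

Run against a full EMS export, the same filter lists every event logged after the first `netif.fatal.err`, which makes it easier to see whether the takeover and interconnect alerts cluster after that point.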

Service processor logs show:

e1a/e1b:Fatal parity error (0x10) PL_PERR_CAUSE 0x00004000 PL_PERR_ENABLE 0x1fffe3ff PCIE_INT_CAUSE 0x40002000
t6nex2: encountered fatal error, adapter stopped.
e1a/e1b:PCI DMA channel write request parity error (0x2000)
t6nex2: encountered fatal error, adapter stopped.
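
When collecting this output for a support case, it can help to pull the register values out of the service processor text so they can be compared across occurrences. The sketch below is only an illustration: it assumes the `NAME 0xVALUE` layout seen in the output above, the function name is hypothetical, and it does not interpret what individual bits mean.

```python
import re

# Matches NAME 0xVALUE pairs such as "PL_PERR_CAUSE 0x00004000".
# The uppercase-name convention is assumed from the excerpt in this article.
REGISTER = re.compile(r"\b([A-Z][A-Z0-9_]+) (0x[0-9a-fA-F]+)")


def extract_registers(text):
    """Return {register_name: integer_value} for every NAME 0xHEX pair in text."""
    return {name: int(value, 16) for name, value in REGISTER.findall(text)}


# Text copied from the service processor output in this article.
sp_log = (
    "e1a/e1b:Fatal parity error (0x10) PL_PERR_CAUSE 0x00004000 "
    "PL_PERR_ENABLE 0x1fffe3ff PCIE_INT_CAUSE 0x40002000 "
    "t6nex2: encountered fatal error, adapter stopped."
)

registers = extract_registers(sp_log)
for name, value in registers.items():
    print(f"{name} = {value:#010x}")
```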