Cluster network degraded alerts and takeover not possible on AFF A800
Applies to
- AFF A800
- AFF C800
- X1146A T62100-CR
Issue
- Receiving daily CLUSTER NETWORK DEGRADED alerts:
[cluster-01: vifmgr: callhome.clus.net.degraded:alert]: Call home for CLUSTER NETWORK DEGRADED: Total Packet Loss - Ping failures detected between cluster-01_clus2 ( 169.254.32.8 ) on cluster-01 and cluster-02_clus1 ( 169.254.99.167 ) on cluster-02
- Receiving hourly HA interconnect down alerts:
6/21/2024 08:00:00 nodename ERROR ic.HAInterconnectDown: HA interconnect: Interconnect down for 93 minutes: link1 down
6/21/2024 07:00:00 nodename ALERT callhome.hainterconnect.down: Call home for HA INTERCONNECT DOWN due to link1 down.
- The cluster also raises alerts that takeover is disabled because the NVRAM logs are unsynchronized:
[cluster-01: statd: cf.takeover.disabled:alert]: HA mode, but takeover of partner is disabled due to reason : unsynchronized log.
- The EMS log shows the following messages:
[cluster-01: nvmm_mirror_sync: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_LAYOUT_SYNCING is aborted because of reason NVPM_ERR_MSG_SEND_FAILED.
[cluster-01: nvmm_error: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_OFFLINE is aborted because of reason NVMM_ABORT_SYNCING_MIRROR.
[cluster-01: nvmm_helper: nvpm.state.changed:debug]: Node 1's NVPM state changed from "2" to "2".
- These alerts begin after the following message is logged:
[cluster-01: intr: netif.fatal.err:alert]: The network device in slot 1 encountered fatal error e1a/e1b.
- Service processor logs show:
e1a/e1b:Fatal parity error (0x10)
PL_PERR_CAUSE 0x00004000 PL_PERR_ENABLE 0x1fffe3ff
PCIE_INT_CAUSE 0x40002000
t6nex2: encountered fatal error, adapter stopped.
e1a/e1b:PCI DMA channel write request parity error (0x2000)
t6nex2: encountered fatal error, adapter stopped.
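The ordering described above (a netif.fatal.err message first, followed by the cluster network, HA interconnect, and takeover alerts) can be checked against exported log text with a short script. The following is a minimal sketch, not a NetApp tool: it assumes the EMS messages have already been saved as plain text lines in chronological order, and the function name symptoms_after_trigger is hypothetical. The event names are the ones quoted in this article.

```python
# Event names copied from the messages quoted in this article.
TRIGGER_EVENT = "netif.fatal.err"
SYMPTOM_EVENTS = (
    "callhome.clus.net.degraded",
    "callhome.hainterconnect.down",
    "cf.takeover.disabled",
    "nvmm.mirror.aborting",
)


def symptoms_after_trigger(lines):
    """Return (index of first trigger line, symptom events seen after it).

    Assumption: `lines` is an iterable of log lines in chronological order.
    Matching is a plain substring search, so both the bracketed EMS format
    and the timestamped format shown in the article are handled.
    """
    trigger_index = None
    seen = []
    for index, line in enumerate(lines):
        if trigger_index is None:
            # Ignore everything until the adapter fatal error is logged.
            if TRIGGER_EVENT in line:
                trigger_index = index
            continue
        for event in SYMPTOM_EVENTS:
            if event in line and event not in seen:
                seen.append(event)
    return trigger_index, seen


if __name__ == "__main__":
    sample = [
        "[cluster-01: intr: netif.fatal.err:alert]: The network device in slot 1 encountered fatal error e1a/e1b.",
        "[cluster-01: statd: cf.takeover.disabled:alert]: HA mode, but takeover of partner is disabled due to reason : unsynchronized log.",
    ]
    index, events = symptoms_after_trigger(sample)
    print("trigger line index:", index)
    print("symptom events after trigger:", events)
```

If the trigger index is None, the log excerpt does not contain the adapter fatal error, and the symptoms listed in this article may have a different cause.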
