"DISK REDUNDANCY FAILED" due to transceiver issue
Applies to
- MetroCluster IP
- Cisco Backend switch
Issue
- Error messages:
Tue Sep 03 04:51:31 +0200 [ClusterA-02: wafl_exempt09: mirror.stream.qp.error:debug]: params: {'mirror': 'DR PARTNER', 'qp_name': 'WAFL', 'error': 'NVMM_ERR_MIRROR_POLL_TIMEOUT'}
Tue Sep 03 04:51:31 +0200 [ClusterA-02: wafl_exempt09: nvmm.mirror.aborting:debug]: mirror of sysid 2, partner_type DR PARTNER and mirror state NVMM_MIRROR_ONLINE is aborted because of reason NVMM_ERR_MIRROR_POLL_TIMEOUT.
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: mirror.stream.qp.error:debug]: params: {'mirror': 'DR PARTNER', 'qp_name': 'WAFL', 'error': 'NVMM_ERR_MIRROR_COMPLETION'}
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: ems.engine.suppressed:debug]: Event 'rdma.rlib.event.error' suppressed 11 times in last 263 seconds.
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: rdma.rlib.event.error:debug]: QP wafl event error: client disconnect.
Tue Sep 03 04:51:31 +0200 [ClusterA-02: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'DR_PARTNER'}
Tue Sep 03 04:51:31 +0200 [ClusterA-02: DR_heartbeat_thread: cf.ic.xferTimedOut:error]: HA interconnect: MCC_DRSOM transfer timed out.
followed by successful retries like:
Tue Sep 03 04:51:32 +0200 [ClusterA-02: iw_cm_wq: rdma.rlib.connected:debug]: wafl:DR:A QP is now connected.
- High number of error messages mixed with successful retries applying to many different disks (all are remote disks):
Tue Sep 03 04:51:34 +0200 [ClusterA-02: doneq0: scsi.mcc.adt.ioTransportError:error]: mcc_adt[2] - Transport error during execution of command: HA status 0x13: CAM transport status 0x1b: cdb 0x28:356b73b3:000d.
Tue Sep 03 04:51:34 +0200 [ClusterA-02: doneq0: scsi.mcc.adt.ioTransportError:error]: mcc_adt[2] - Transport error during execution of command: HA status 0x13: CAM transport status 0x1b : cdb 0x28:356b6555:000d....Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0m.i1.2L17: Command aborted by host adapter: HA status 0x13: cdb 0x28:356b73b3:000d.
Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0m.i1.2L17: Command aborted by host adapter: HA status 0x13: cdb 0x28:356b6555:000d.
Tue Sep 03 04:51:34 +0200 [ClusterA-02: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0v.i1.0L17: request successful after retry #1/#0: cdb 0x28:356b73b3:000d (1967)