NetApp Knowledge Base

Multi-disk failure due to back-end FlexArray disk going missing

Category: ontap-9
Specialty: hw
Last Updated: 7/7/2025, 11:52:57 AM

Applies to

  • ONTAP 9
  • FlexArray

Issue

  • A single node is rebooting due to a multi-disk failure:

Thu May 15 05:04:39 -0400 [Node-01: cf_main: cf.fsm.takeover.mdp:alert]: Failover monitor: takeover attempted after multi-disk failure on partner

  • The issue is isolated to a single storage port (see the command sketch after this list for ways to confirm this).
  • EMS messages show I/O to disks being aborted on that storage port, with the retries succeeding through the partner switch:

Thu May 15 00:23:37 -0400 [Node-02: slifc_timeout_1: fci.device.quiesce:debug]: Adapter 2c encountered a command timeout on Disk device Switch-1:21.126 (0x010b1500) LUN 2 cdb 0x2a:0d3619d3:019b retry: 0 Quiescing the device.
Thu May 15 00:23:40 -0400 [Node-02: slifc_timeout_1: fci.device.timeout:debug]: HBA 2c encountered a device timeout on Disk device Switch-1:21.126 (0x010b1500) LUN 2 cdb 0x2a:0d3619d3:019b retry: 0
Thu May 15 00:23:46 -0400 [Node-02: slifc_intrd: scsi.cmd.abortedByHost:error]: Disk device Switch-1:21.126L42: Command aborted by host adapter: HA status 0x4: cdb 0x2a:0d3619d3:019b. 
Thu May 15 00:23:46 -0400 [Node-02: slifc_intrd: scsi.cmd.retrySuccess:debug]: Disk device Switch-2:21.126L42: request successful after retry #1/#0: cdb 0x2a:0d3619d3:019b (24266).

  • Occasionally, instead of the I/O being aborted and retried successfully, it times out and fails, causing the disk to be marked as not responding:

Thu May 15 05:04:39 -0400 [Node-02: slifc_intrd: scsi.cmd.pastTimeToLive:error]: Disk device Switch-1:21.126L42: request failed after try #1: cdb 0x8a:00000001cfccd24a:00000249. 
Thu May 15 05:04:39 -0400 [Node-02: config_thread: raid.config.filesystem.disk.not.responding:notice]: File system Disk /aggr1/plex0/rg0/Switch-1:21.126L42 Shelf - Bay - [HITACHI  OPEN-V 8301] S/N [XXXXXXXXXXXX] UID [xx...xx] is not responding.
Thu May 15 05:04:39 -0400 [Node-02: config_thread: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. aggr aggr1: raid volfsm, fatal disk error in RAID group with no parity disk..  Raid type - raid0 Group name plex0/rg0 state NORMAL. 1 disk failed in the group. Disk Switch-1:21.126L19 Shelf - Bay - [HITACHI  OPEN-V 8301] S/N [XXXXXXXXXXXX] UID [xx..xx] error: disk operation timed out..

  • After the reboot, all disks are visible again and the aggregates are healthy.
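
The commands below are a minimal sketch for confirming these symptoms from the clustershell; the node name (Node-02), the suspect adapter (2c), and the EMS message names are taken from the example events above, so substitute the values from your own environment.

# List the abort/timeout events logged against the back-end paths
::> event log show -node Node-02 -message-name scsi.cmd.abortedByHost
::> event log show -node Node-02 -message-name scsi.cmd.pastTimeToLive

# Check the back-end array LUN path layout and look for reported configuration errors
::> storage array config show
::> storage errors show

# Review link error counters on the suspect initiator adapter
::> system node run -node Node-02 -command "fcadmin link_stats 2c"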

