CONTAP-252261: Drive failure causing long IO delays
Issue
- When a drive returns a non-retriable IO error, ONTAP may incorrectly keep retrying the IO.
- ONTAP then takes a long time to fail the drive, resulting in IO delays that may affect clients (such as ESX).
- Messages similar to the following may be logged:
[Node1: scsi_cmdblk_strthr_admin: scsi.cmd.notReadyConditionEMSOnly:debug]: Disk device 0v.i1.1L34: Device returns not yet ready: CDB 0x2a:37e2f0f0:0001: Sense Data SCSI:not ready - (0x2 - 0x4 0x0 0x82)(32559).
[Node1:raid.label.io.writeError:notice]: Label write on Disk /aggr1/plex1/rg1/0v.i1.1L34 ... failed with storage error disk operation timed out
wafl_exempt02: wafl.cp.toolong:error]: Aggregate aggr1 experienced a long CP.
kernel: Nblade.nfsLongRunningOp:debug]: Detected a long running network process operation...
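
To check a saved log for this pattern, the four event names in the messages above can be searched for together. The following is a minimal sketch in Python, not a supported tool. The function names are hypothetical, and treating "all four events present" as the signature is an assumption made for illustration. Only the event names themselves come from the log excerpt.

```python
import re

# Event names copied from the log excerpt above.
SIGNATURE_EVENTS = (
    "scsi.cmd.notReadyConditionEMSOnly",
    "raid.label.io.writeError",
    "wafl.cp.toolong",
    "Nblade.nfsLongRunningOp",
)

# One alternation pattern; re.escape keeps the dots literal.
_EVENT_RE = re.compile("|".join(re.escape(e) for e in SIGNATURE_EVENTS))


def find_signature_events(lines):
    """Return {event_name: [line_index, ...]} for each signature event found."""
    hits = {}
    for index, line in enumerate(lines):
        match = _EVENT_RE.search(line)
        if match:
            hits.setdefault(match.group(0), []).append(index)
    return hits


def matches_signature(lines):
    """Assumption: the issue is suspected only when all four events appear."""
    hits = find_signature_events(lines)
    return all(event in hits for event in SIGNATURE_EVENTS)
```

A match only indicates that the log resembles the excerpt above. It does not confirm that this defect is the cause of the IO delays.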