CONTAP-402656: Multiple recovery attempts of a failing drive can cause client I/O disruption
Issue
- Client I/O disruption during drive failure
- The following messages are seen in EMS:
Mon Feb 10 20:08:05 +1100 [node-01: scsi_cmdblk_strthr_admin: disk.timeout.flush.start:notice]: Aggressive timeout flush started on disk 0n.6 S/N XXXXXXXXXXXXX. Details: Power Cycled: true Reason: Disk Spasm.
Mon Feb 10 20:08:05 +1100 [node-01: config_thread: raid.disk.assign.offline_ref:debug]: aggregate /aggr01/plex0/rg1/0n.1P2 assigned as an offline reference storage for /aggr01/plex0/rg1/0n.6P2.
Mon Feb 10 20:08:05 +1100 [node-01: config_thread: raid.disk.assign.offline_ref:debug]: aggregate /aggr01/plex0/rg1/0n.0P2 assigned as an offline reference storage for /aggr01/plex0/rg1/0n.6P2.
Mon Feb 10 20:08:05 +1100 [node-01: config_thread: raid.rg.degraded:notice]: : Raid group /aggr01/plex0/rg1 is degraded
Mon Feb 10 20:08:05 +1100 [node-01: scsi_cmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device 0n.6L0: Check Condition: CDB 0x28:3970c8be:0001: Sense Data SCSI:aborted command - (0xb - 0x90 0x6 0xfa)(5236).
Mon Feb 10 20:08:05 +1100 [node-01: disk_server_0: disk.IO.status:debug]: params:
{'deviceName': '0n.6L0', 'ETime': '5237', 'cdb': '0x28:293711b2:0044', 'victimRetryCount': '0', 'retryCount': '0', 'timeoutRetryCount': '0', 'pathRetryCount': '0', 'adapterStatus': '0x5', 'targetStatus': '0x0', 'sSenseKey': 'SCSI:no sense', 'sSenseCode': '', 'iSenseKey': '0x0', 'iASC': '0x0', 'iASCQ': '0x0', 'pathsTried': '0', 'basicTimeout': '5', 'returnCode': '9', 'disk_information': 'Disk 0n.6 Shelf 0 Bay 6 [NETAPP X4013S17337T6NTE NA54] S/N [XXXXXXXXXXXXX] UID [36305230:52201014:00253845:00000004:00000000:00000000:00000000:00000000:00000000:00000000]'}
Mon Feb 10 20:08:05 +1100 [node-01: scsi_cmdblk_strthr_admin: scsi.cmd.notReadyConditionEMSOnly:debug]: Disk device 0n.6L0: Device returns not yet ready: CDB 0x2a:37c8ab30:0002: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x82)(0).
Mon Feb 10 20:08:13 +1100 [node-01: scsi_cmdblk_strthr_admin: scsi.cmd.notReadyConditionEMSOnly:debug]: Disk device 0n.6L0: Device returns not yet ready: CDB 0x28:6fc6d2f8:000a: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x82)(0).
Mon Feb 10 20:10:06 +1100 [node-01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0n.6L0: request failed after try #2: cdb 0x2a:37c8ab30:0002.
Mon Feb 10 20:10:06 +1100 [node-01: disk_server_1: scsi.debug:debug]: shm_setup_for_failure disk 0n.6 (S/N XXXXXXXXXXXXX) error 80000000h
Mon Feb 10 20:10:06 +1100 [node-01: disk_server_1: scsi.debug:debug]: shm_setup_for_failure disk 0n.6 (S/N XXXXXXXXXXXXX) error 40000000h
Mon Feb 10 20:10:13 +1100 [node-01: wafl_exempt20: wafl.cp.toolong:error]: Aggregate aggr01 experienced a long CP.
Mon Feb 10 20:10:13 +1100 [node-01: kernel: Nblade.NfsResponseTraceTriggerHourly:debug]: params: {'responseCount': '1', 'trigger': '60'}
Mon Feb 10 20:12:20 +1100 [node-01: config_thread: raid.disk.offline:notice]: Marking Disk 0n.6 Shelf 0 Bay 6 [NETAPP X4013S17337T6NTE NA54] S/N [XXXXXXXXXXXXX] UID [36305230:52201014:00253845:00000004:00000000:00000000:00000000:00000000:00000000:00000000] offline.
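
To check whether a saved EMS log shows the same pattern, the sequence above can be scanned for with a short script. This is a minimal sketch, not a NetApp tool: the function names (parse_ems, summarize), the WATCHED list, and the line regex are illustrative assumptions based only on the message layout in the excerpt above. EMS timestamps carry no year, so the script uses a placeholder year.

```python
import re
from datetime import datetime

# Event names taken from the EMS excerpt above; the selection is an assumption
# about which messages are most useful for spotting this pattern.
WATCHED = (
    "disk.timeout.flush.start",
    "scsi.cmd.pastTimeToLive",
    "wafl.cp.toolong",
    "raid.disk.offline",
)

# Matches "Mon Feb 10 20:08:05 +1100 [node: process: event.name:severity]".
LINE_RE = re.compile(
    r"^\w{3} (?P<stamp>\w{3} +\d+ \d\d:\d\d:\d\d [+-]\d{4}) "
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): (?P<event>[\w.]+):(?P<sev>\w+)\]"
)


def parse_ems(lines, year=2000):
    """Yield (timestamp, event) for each line that looks like an EMS message.

    `year` is a placeholder because the log timestamps do not include one.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S %z")
            yield ts, m["event"]


def summarize(lines):
    """Return the first-seen time of each watched event and the number of
    seconds between the timeout flush starting and the disk going offline."""
    first = {}
    for ts, event in parse_ems(lines):
        if event in WATCHED:
            first.setdefault(event, ts)
    gap = None
    if "disk.timeout.flush.start" in first and "raid.disk.offline" in first:
        delta = first["raid.disk.offline"] - first["disk.timeout.flush.start"]
        gap = delta.total_seconds()
    return first, gap
```

Passing the lines of a log file to summarize() returns the watched events that were found and the elapsed seconds between the first disk.timeout.flush.start and the first raid.disk.offline. For the excerpt above that interval runs from 20:08:05 to 20:12:20, during which the long CP is reported.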