Potential E-Series performance degradation and host access issues in configurations where tray/drawer loss protection is disabled
Applies to
- E-Series platforms running SANtricity OS 11.70, 11.70R1, and 11.70R2 (pre-11.70R3).
- Includes StorageGRID Appliances.
- Dynamic disk pool (DDP) where tray/drawer loss protection is disabled.
- In SANtricity System Manager, check under Storage > Pools & Volume Groups > View/Edit Settings
- Dynamic disk pool (DDP) with an equal number of drives in each shelf or drawer residing in the DDP.
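As a quick triage aid, the two pool-level conditions above (tray/drawer loss protection disabled, equal drive count in every shelf or drawer) can be checked programmatically once the pool settings have been collected from System Manager or the support bundle. A minimal sketch, assuming you supply those values yourself (the function and parameter names are illustrative, not part of any SANtricity API):

```python
def pool_matches_affected_config(tray_loss_protection: bool,
                                 drives_per_shelf: list[int]) -> bool:
    """Return True if a DDP matches the affected configuration:
    tray/drawer loss protection disabled AND an equal number of
    drives in every shelf/drawer the pool spans."""
    if tray_loss_protection:
        return False  # protection enabled: configuration not affected
    if not drives_per_shelf:
        return False
    # Affected pools have the same drive count in each shelf/drawer.
    return len(set(drives_per_shelf)) == 1

# Protection disabled, 10 drives in each of 3 shelves -> matches
print(pool_matches_affected_config(False, [10, 10, 10]))  # True
print(pool_matches_affected_config(False, [10, 12, 10]))  # False
print(pool_matches_affected_config(True, [10, 10, 10]))   # False
```

The authoritative check remains Storage > Pools & Volume Groups > View/Edit Settings in SANtricity System Manager; this sketch only encodes the matching logic.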
Issue
Users may experience symptoms ranging from performance degradation and host-side connection issues to controller reboots as a result of storage-side I/O delays.
Below are a few potential issues the user may report as a result of the performance degradation:
Note: The signatures below are not unique to this issue; they are symptoms that could result from I/O delays or other storage-related operations.
- Performance degradation highlighted by high I/O latency. Host-side (initiator) detection of high latency to the E-Series storage array and volumes may surface as different alerts depending on the OS and application, and some applications may not notice it at all. VMware, for example, may report storage connectivity events such as:
"Lost access to volume xxxxxx (yyyyy) due to connectivity issues."
- Controller reset due to Ancient I/O. The E-Series bundle file "state-capture-data" would contain the following exception under the "excLogShow" output:
Reboot due to ancient IO, scsiOp=0x1031756c0 poolId=0 opCode=8a
age=330000ms
2020-12-03 18:44:16.892205
rebootReason 0x429c002, rebootReasonExtra 0x0
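When searching a large "excLogShow" output for this signature, a small filter can pull out every ancient-I/O reboot and its reported I/O age. A sketch, with the regular expression based only on the example line above (real output may vary):

```python
import re

# Matches the ancient-I/O reboot line from excLogShow and captures the
# I/O age in milliseconds; re.S lets the match span a wrapped line.
ANCIENT_IO = re.compile(r"Reboot due to ancient IO.*?age=(\d+)ms", re.S)

def find_ancient_io(exclog_text: str) -> list[int]:
    """Return the age (in ms) of each ancient-I/O reboot found."""
    return [int(m.group(1)) for m in ANCIENT_IO.finditer(exclog_text)]

sample = (
    "Reboot due to ancient IO, scsiOp=0x1031756c0 poolId=0 opCode=8a\n"
    "age=330000ms\n"
    "2020-12-03 18:44:16.892205\n"
    "rebootReason 0x429c002, rebootReasonExtra 0x0\n"
)
print(find_ancient_io(sample))  # [330000]
```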
- Controller reset due to software watchdog timeout. The E-Series bundle file "state-capture-data" would contain the following exception under the "excLogShow" output. This can also be caused by a drive failure, which results in both controllers stagger-rebooting due to watchdog timeouts:
Exception from kernel core:
2020-11-13 11:03:31.500638
WATCHDOG TIMEOUT
Backtrace of the crashed thread:
#0 0x00007fa2de5a2067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007fa2df28a4ea in vkiPanic () from /raid/lib/libeos-System.so
No symbol table info available.
#2 0x00007fa2df28a62a in _vkiReboot () from /raid/lib/libeos-System.so
No symbol table info available.
#3 0x00007fa2df279bf4 in watchdogTimerService () from /raid/lib/libeos-System.so
From the E-Series storage array support logs, there are a few signatures that NetApp support can check to confirm the system is exhibiting this exact issue:
- If the storage array was upgraded from a pre-11.70 release (i.e. 11.50.x or 11.60.x), the following panic reboot will occur during the upgrade process. The panic results in an additional controller reset during the upgrade, but should not cause complete loss of access to the E-Series storage array. This can be found in the E-Series bundle file "state-capture-data" under the "excLogShow" command output.
xx/yy/zz-xx:yy:zz (ProcessHandlers): PANIC: resume is being called on WORKING!
xxxx-xx-xx xx:xx:xx.560320
resume is being called on WORKING!
- After the E-Series storage array is on an 11.70x release, there will be signs of inter-controller communication delays in the E-Series debug queue logs (trace-buffers.7z). Example below:
02/24/21-23:25:29.086164 00 raidSched1 sas c0001 sas iditn:071 idcmd:122471322 req_idx:0204 skey:x05 asc:x26 ascq:x00
scsiStatus:2 mf:0x11bb97740 sasSendSense: Sense data
02/24/21-23:25:29.086172 00 raidSched1 sid c0001 SCSICmd <=E= iditn:071 idcmd:122471322 ioId:x00f87f7e devnum:x00f00011 lun:000 buf:0x1017c36c0 Bm IAC(C9) Target CkCond IllReq 2600 00 CR:False r
tUs:1012612 ageUs:1012614
CDB:c9 01 00 00 00 05 af e3 00 00 00 30
02/24/21-23:25:29.086192 00 raidSched1 eel hffff LogError ioId:x00f87f7e errId:x0 DST_DRV_CHK_COND(x10a) origin:Internal(3) fru/t/s:x0b0011
errSpecInfo:LDD-x580000 detectpt:x0000
02/24/21-23:25:29.086194 00 raidSched1 hid c0001 hid <=E=lid iditn:071 idcmd:122471322 action:FailCmd(2) failCmdReason:LastErr (4)
02/24/21-23:25:29.086197 00 raidSched1 hid c0001 IO Finish iditn:071 idcmd:122471322 ioId:x00f87f7e buf:0x1017c36c0 ioDone:_Z13dlbIOCompleteP3buf FailCmdReason:LastErr (4) #total:1 #errors:1 activeMs:1012/41000
02/24/21-23:25:29.086198 00 raidSched1 hid c0001 ErrorRecord iditn:071 idcmd:122471322 ioId:x00f87f7e buf:0x1017c36c0 #ticks:00254 02/24/21-23:25:28.620-02/24/21-23:25:29.652 Target CkCond IllReq 26/00 action:FailCmd(2)
02/24/21-23:25:29.086200 00 raidSched1 hid cffff <=E=hid ioId:x00f87f7e buf:0x1017c36c0 DevNum:x00f00011 bOp:IacResponse b_error:17 iodone:_Z13dlbIOCompleteP3buf uSec:1012186
02/24/21-23:25:30.096876 00 iacTask2 ras ffff RPM IACsend response failed - tgtDev: x00f00011 msgId: 372707 error: No target (0x3)
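Spotting these delays in a full trace-buffers dump is easier with a filter over the "ageUs" field that appears in the completion lines above. A sketch, with the pattern and the one-second threshold chosen for illustration only:

```python
import re

# Captures the I/O age in microseconds from debug-queue trace lines.
AGE_US = re.compile(r"ageUs:(\d+)")

def slow_iac_ages(trace_text: str, threshold_us: int = 1_000_000) -> list[int]:
    """Return every ageUs value in the trace text exceeding the threshold."""
    return [int(m.group(1)) for m in AGE_US.finditer(trace_text)
            if int(m.group(1)) > threshold_us]

sample = ("02/24/21-23:25:29.086172 00 raidSched1 sid c0001 SCSICmd "
          "iditn:071 idcmd:122471322 rtUs:1012612 ageUs:1012614\n")
print(slow_iac_ages(sample))  # [1012614]
```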
- After the E-Series storage array is on an 11.70x release, there will be signs of high volume-level latency, which can be found in the E-Series debug queue logs (trace-buffers.7z). Example below:
02/24/21-23:44:37.470583 00 raidSched1 vdm v0000 RVol RV 0x0, Op W Max Response time 4261676 us timeframe:66796 secs
02/24/21-23:45:40.940053 00 raidSched2 vdm v0000 RVol RV 0x0, Op R Max Response time 1018005 us timeframe:1413 secs
02/24/21-23:53:41.245400 00 raidSched1 vdm v0000 RVol RV 0x0, Op R Max Response time 2012095 us timeframe:480 secs
02/24/21-23:58:08.991755 00 raidSched1 vdm v0000 RVol RV 0x0, Op W Max Response time 4027504 us timeframe:811 secs
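The "Max Response time" lines above can likewise be filtered out of the debug queue logs to flag volumes with second-level latency. A sketch, with the regular expression based only on the example lines shown here and a one-second threshold chosen for illustration:

```python
import re

# Captures volume id, op (R/W), and max response time in microseconds
# from the vdm "Max Response time" debug-queue lines.
MAX_RT = re.compile(r"RVol RV (0x[0-9a-f]+), Op ([RW]) Max Response time (\d+) us")

def high_latency_events(trace_text: str, threshold_us: int = 1_000_000):
    """Return (volume, op, usec) tuples whose max response time exceeds the threshold."""
    return [(vol, op, int(us))
            for vol, op, us in MAX_RT.findall(trace_text)
            if int(us) > threshold_us]

sample = ("02/24/21-23:44:37.470583 00 raidSched1 vdm v0000 RVol RV 0x0, "
          "Op W Max Response time 4261676 us timeframe:66796 secs\n")
print(high_latency_events(sample))  # [('0x0', 'W', 4261676)]
```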