NetApp Knowledge Base

Potential E-Series performance degradation and host access issues in configurations where tray/drawer loss protection is disabled

Category:
e-series-systems
Specialty:
esg
Applies to

  • E-Series platforms running SANtricity OS 11.70, 11.70R1, and 11.70R2 (pre-11.70R3).
    • Includes StorageGRID Appliances.
  • Dynamic disk pool (DDP) where tray/drawer loss protection is disabled.
    • In SANtricity System Manager, check under Storage > Pools & Volume Groups > View/Edit Settings.
  • Dynamic disk pool (DDP) with an equal number of drives in each shelf or drawer residing in the DDP.

Issue

Users may experience various symptoms, ranging from performance degradation to host-side connection issues or, possibly, controller reboots as a result of storage-side I/O delays.

Below are a few potential issues the user may report as a result of the performance degradation:

Note: The signatures below are not unique to this issue; they are symptoms that could result from I/O delays or other storage-related operations.

  • Performance degradation highlighted by high I/O latency. High latency detected by the host (initiator) to the E-Series storage array and volumes may surface as different alerts depending on the OS and application, while some applications may not notice it at all. For example, VMware may report storage connectivity events such as "Lost access to volume xxxxxx (yyyyy) due to connectivity issues."
  • Controller reset due to Ancient I/O. E-Series bundle file "state-capture-data" would contain the following exception under "excLogShow".

Reboot due to ancient IO, scsiOp=0x1031756c0 poolId=0 opCode=8a
 age=330000ms
2020-12-03 18:44:16.892205
rebootReason 0x429c002, rebootReasonExtra 0x0

  • Controller reset due to software watchdog timeout. E-Series bundle file "state-capture-data" would contain the following exception under "excLogShow".
    • This can also be caused by a drive failure, which results in both controllers stagger-rebooting due to watchdog timeouts.

Exception from kernel core:
2020-11-13 11:03:31.500638
WATCHDOG TIMEOUT


Backtrace of the crashed thread:
#0  0x00007fa2de5a2067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fa2df28a4ea in vkiPanic () from /raid/lib/libeos-System.so
No symbol table info available.
#2  0x00007fa2df28a62a in _vkiReboot () from /raid/lib/libeos-System.so
No symbol table info available.
#3  0x00007fa2df279bf4 in watchdogTimerService () from /raid/lib/libeos-System.so
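When sifting through a large "state-capture-data" file, the two reboot signatures above can be matched with a short script. The following is a minimal sketch, not a NetApp tool; the patterns are taken directly from the excLogShow excerpts above, and the function name is illustrative.

```python
import re

# Patterns copied from the excLogShow excerpts above (ancient I/O and
# software watchdog). re.S lets the ancient-IO match span the line break
# before "age=NNNms".
ANCIENT_IO_RE = re.compile(r"Reboot due to ancient IO.*?age=(\d+)ms", re.S)
WATCHDOG_RE = re.compile(r"WATCHDOG TIMEOUT")

def scan_exclog(text):
    """Report which known reboot signatures appear in an excLogShow dump."""
    hits = {}
    m = ANCIENT_IO_RE.search(text)
    if m:
        # Age of the stuck I/O in milliseconds (330000 ms in the example above)
        hits["ancient_io_age_ms"] = int(m.group(1))
    if WATCHDOG_RE.search(text):
        hits["watchdog"] = True
    return hits
```

Run against the pasted excLogShow output, this flags whether either signature is present; remember that, per the note above, neither signature alone proves this specific issue.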

From the E-Series storage array support logs, there are a few signatures that NetApp Support can check to confirm the system is exhibiting this exact issue:

  • If the storage array was upgraded from a pre-11.70 release (for example, 11.50.x or 11.60.x), then the following panic reboot will occur during the upgrade process. The panic results in an additional controller reset during the upgrade, but should not cause complete loss of access to the E-Series storage array. This can be found in the E-Series bundle file "state-capture-data" under the "excLogShow" command output.

xx/yy/zz-xx:yy:zz (ProcessHandlers): PANIC: resume is being called on WORKING!
xxxx-xx-xx xx:xx:xx.560320
resume is being called on WORKING!

  • After the E-Series storage array is on an 11.70.x release, there will be signs of inter-controller communication delays in the E-Series debug queue logs (trace-buffers.7z). Example below:

02/24/21-23:25:29.086164 00 raidSched1         sas    c0001 sas          iditn:071 idcmd:122471322 req_idx:0204 skey:x05 asc:x26 ascq:x00
                                                                         scsiStatus:2 mf:0x11bb97740 sasSendSense: Sense data
02/24/21-23:25:29.086172 00 raidSched1         sid    c0001 SCSICmd <=E= iditn:071 idcmd:122471322 ioId:x00f87f7e devnum:x00f00011 lun:000 buf:0x1017c36c0 Bm     IAC(C9) Target  CkCond IllReq 2600 00 CR:False r
tUs:1012612 ageUs:1012614
                                                                         CDB:c9 01 00 00 00 05 af e3 00 00 00 30
02/24/21-23:25:29.086192 00 raidSched1         eel    hffff LogError     ioId:x00f87f7e errId:x0 DST_DRV_CHK_COND(x10a)           origin:Internal(3)   fru/t/s:x0b0011
                                                                          errSpecInfo:LDD-x580000 detectpt:x0000
02/24/21-23:25:29.086194 00 raidSched1         hid    c0001 hid <=E=lid  iditn:071 idcmd:122471322 action:FailCmd(2) failCmdReason:LastErr (4)
02/24/21-23:25:29.086197 00 raidSched1         hid    c0001 IO Finish    iditn:071 idcmd:122471322 ioId:x00f87f7e buf:0x1017c36c0 ioDone:_Z13dlbIOCompleteP3buf   FailCmdReason:LastErr (4) #total:1 #errors:1 activeMs:1012/41000
02/24/21-23:25:29.086198 00 raidSched1         hid    c0001  ErrorRecord iditn:071 idcmd:122471322 ioId:x00f87f7e buf:0x1017c36c0 #ticks:00254 02/24/21-23:25:28.620-02/24/21-23:25:29.652 Target  CkCond IllReq 26/00 action:FailCmd(2)
02/24/21-23:25:29.086200 00 raidSched1         hid    cffff <=E=hid      ioId:x00f87f7e buf:0x1017c36c0 DevNum:x00f00011 bOp:IacResponse    b_error:17 iodone:_Z13dlbIOCompleteP3buf uSec:1012186
02/24/21-23:25:30.096876 00 iacTask2           ras     ffff RPM IACsend  response failed - tgtDev: x00f00011 msgId: 372707 error: No target (0x3)
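In the queue-trace excerpt above, the key indicator is the "ageUs" field (I/O age in microseconds) exceeding one second. Aged entries can be pulled out of a trace-buffers extract with a filter like the sketch below; the one-second threshold and the function name are illustrative choices, not NetApp-defined values.

```python
import re

# "ageUs:NNN" is the I/O age field seen in the debug queue log lines above.
AGE_RE = re.compile(r"ageUs:(\d+)")

def slow_ios(trace_lines, threshold_us=1_000_000):
    """Yield (age_us, line) for trace entries whose I/O age exceeds the
    threshold (1 s here, matching the ~1012614 us example above)."""
    for line in trace_lines:
        m = AGE_RE.search(line)
        if m and int(m.group(1)) > threshold_us:
            yield int(m.group(1)), line
```

Feeding the extracted trace-buffers text through this filter quickly shows whether inter-controller commands are aging past the one-second mark, as in the example.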

  • After the E-Series storage array is on an 11.70.x release, there will be signs of high volume-level latency, which can be found in the E-Series debug queue logs (trace-buffers.7z). Example below:

02/24/21-23:44:37.470583 00 raidSched1         vdm    v0000 RVol         RV 0x0, Op W Max Response time 4261676 us timeframe:66796 secs
02/24/21-23:45:40.940053 00 raidSched2         vdm    v0000 RVol         RV 0x0, Op R Max Response time 1018005 us timeframe:1413 secs
02/24/21-23:53:41.245400 00 raidSched1         vdm    v0000 RVol         RV 0x0, Op R Max Response time 2012095 us timeframe:480 secs
02/24/21-23:58:08.991755 00 raidSched1         vdm    v0000 RVol         RV 0x0, Op W Max Response time 4027504 us timeframe:811 secs
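The "RVol" lines above report the worst response time per operation type (R or W) over a timeframe, in microseconds. A small parser can summarize the worst latency seen per op across an extract; this is a minimal sketch based on the line format shown above, not a NetApp utility.

```python
import re

# Matches the "Op W Max Response time 4261676 us" portion of the RVol
# lines shown in the example above.
RVOL_RE = re.compile(r"Op (\w+) Max Response time (\d+) us")

def max_latencies(trace_lines):
    """Return the worst response time seen per op type (R/W), in microseconds."""
    worst = {}
    for line in trace_lines:
        m = RVOL_RE.search(line)
        if m:
            op, usec = m.group(1), int(m.group(2))
            worst[op] = max(worst.get(op, 0), usec)
    return worst
```

Against the four example lines above, this reports a worst write latency of about 4.3 s and a worst read latency of about 2.0 s, well beyond what most hosts and applications tolerate.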



NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.