Skip to main content
NetApp Knowledge Base

ESXi host down with PSOD error and transient storage errors on host end

Views:
270
Visibility:
Public
Votes:
4
Category:
ontap-9
Specialty:
san
Last Updated:

Applies to

  • ESXi Host
  • Ontap 9

Issue

  • ESXi host down and entered Purple Screen of Death ( PSOD error).
  • From the zdump , there are tons of transient storage errors logged:

2023-08-24T11:19:52.549Z cpu31:22473151)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:706: Path (vmhba64:C2:T2:L74) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x3f 0x3.
2023-08-24T11:19:52.549Z cpu56:22156400)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:706: Path (vmhba64:C2:T2:L21) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x3f 0x3.
2023-08-24T11:19:52.549Z cpu74:22156402)StorageDevice: 7059: End path evaluation for device naa.600a09803831357734244e4c6dxxxxxx
2023-08-24T11:19:52.549Z cpu14:2099001)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0xa3 (0x45dabceec948, 0) to dev "naa.600a09803831357734244e4c6dxxxxxx" on path "vmhba64:C6:T2:L92" Failed:
2023-08-24T11:19:52.549Z cpu14:2099001)NMP: nmp_ThrottleLogForDevice:3875: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0x3. Act:NONE. cmdId.initiator=0x453a5741bbc8 CmdSN 0x0
2023-08-24T11:19:52.549Z cpu79:22473152)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:706: Path (vmhba64:C10:T2:L501) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x3f 0x3
2023-08-24T11:19:52.549Z cpu78:2098303)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0xa3 (0x45ba5d814648, 0) to dev "naa.600a09803831357734244e4c6dxxxxxx" on path "vmhba64:C1:T2:L455" Failed:

  • Link errors on host end:

2023-08-24T12:48:14.126Z: [netCorrelator] 413700037us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic10 is down. Affected dvPort: 37129774/50 21 f4 36 4c 7c 40 51-ec ee 57 d7 8d 0e 68 33. 1 uplinks up. Failed criteria: 128
2023-08-24T12:48:14.126Z: [netCorrelator] 413700045us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic10 is down. Affected dvPort: 37139132/50 21 f4 36 4c 7c 40 51-ec ee 57 d7 8d 0e 68 33. 1 uplinks up. Failed criteria: 128
2023-08-24T12:48:14.257Z: [netCorrelator] 413830728us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic5 is down. Affected dvPort: 538d9049-db44-4779-9bc7-df06af095601/50 21 f4 36 4c 7c 40 51-ec ee 57 d7 8d 0e 68 33. 0 uplinks up. Failed criteria: 128

2023-08-23T22:39:53.759Z cpu2:2099091)WARNING: iscsi_vmk: iscsivmk_StopConnection:739: Sess [ISID: 00023d000017 TARGET: iqn.1992-08.com.netapp:sn.786b6fe056c311e98c4100a098xxxxxx:vs.23 TPGT: 41a TSIH: 0]
2023-08-23T22:39:53.759Z cpu2:2099091)WARNING: iscsi_vmk: iscsivmk_StopConnection:740: Conn [CID: 0 L: 10.111.254.47:43401 R: 10.111.254.171:3260]

  • vmhba64 adapter, where the transient errors are reported is used to connect with the external storage. 

  • There was a sudden disconnection to storage over this adapter which caused the PSOD (Purple Screen of Death) where a memory address of 0x0 was passed to the `memcpy()` function, leading to an invalid memory access. 

  • This problem seems to stem from a race condition between two instances of `ScsiDeviceDataChangeCallback()` handling the `VMK_SCSI_DEVICE_EVENT_UA_INQUIRY_PARAMETERS_CHANGED` event.

  • When the disconnection of storage occured all the IOs started getting stored on cache so that it could be dumped back into the storage array once it came back online. The storage took too long to come back up and the cache got full causing the host to crash completely.

 

 

Sign in to view the entire content of this KB article.

New to NetApp?

Learn more about our award-winning Support

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.