VMware Atomic Test and Set heartbeats impact connectivity to storage
Applies to
- Data ONTAP
- SAN Protocols (iSCSI & FC)
- VMware vSphere 5.5U2 and up
- Atomic Test & Set (ATS) Heartbeat
Issue
- Unable to access the volume.
- This issue happens on one of an ESXi.
- Restarting the ESXi host resolves the issue.
- ESXi hosts experience repeated LUN resets. For example, from the vmkernel.log, events similar to the following are seen:
2022-04-04T12:33:17.949Z: [scsiCorrelator] 5693120512199us: [vob.scsi.scsipath.por] Power-on Reset occurred on naa.600a0980xxxxxxxxxxxxxxxxxxxxxxx
2022-04-04T12:41:45.715Z: [scsiCorrelator] 5693628237688us: [vob.scsi.scsipath.por] Power-on Reset occurred on naa.600a0980xxxxxxxxxxxxxxxxxxxxxxx
2022-04-04T12:41:48.217Z: [scsiCorrelator] 5693630739987us: [vob.scsi.scsipath.por] Power-on Reset occurred on naa.600a0980xxxxxxxxxxxxxxxxxxxxxxx
2022-04-04T12:41:59.394Z: [scsiCorrelator] 5693641916248us: [vob.scsi.scsipath.por] Power-on Reset occurred on naa.600a0980xxxxxxxxxxxxxxxxxxxxxxx
- In addition to LUN resets, there is also evidence of datastore heartbeat timeouts:
2022-04-04T08:03:16.087Z: [vmfsCorrelator] 5677371531974us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxx lun_vmfs_1
2022-04-04T08:03:28.087Z: [vmfsCorrelator] 5677383532045us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxx lun_vmfs_2
2022-04-04T08:03:28.087Z: [vmfsCorrelator] 5677383532279us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxx lun_vmfs_3
2022-04-04T08:03:28.087Z: [vmfsCorrelator] 5677383532445us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxx lun_vmfs_4
2022-04-04T12:33:17.354Z: [vmfsCorrelator] 5693572799758us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxx lun_vmfs_5
2022-04-04T12:43:14.401Z: [vmfsCorrelator] 5694169846835us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxx lun_vmfs_6
- When reviewing SCSI commands issued by the host, we find that both ATS (Atomic Test and Set) and SCSI Reservation commands are being issued:
SCSI RESERVE
2022-04-04T08:03:31.105Z cpu1:2097761)NMP: nmp_PathDetermineFailure:3560: SCSI cmd RESERVE failed on path vmhba64:C2:T1:L3, reservation state on device naa.600a0980383136446f3f4e4c5a484355 is unknown.
2022-04-04T08:03:31.118Z cpu1:2097761)NMP: nmp_PathDetermineFailure:3560: SCSI cmd RESERVE failed on path vmhba64:C2:T1:L3, reservation state on device naa.600a0980383136446f3f4e4c5a484355 is unknown.
2022-04-04T08:03:31.118Z cpu19:2097779)NMP: nmp_PathDetermineFailure:3560: SCSI cmd RESERVE failed on path vmhba64:C2:T1:L2, reservation state on device naa.600a0980383136446f3f4e4c5a484356 is unknown.
2022-04-04T08:03:31.131Z cpu1:2097761)NMP: nmp_PathDetermineFailure:3560: SCSI cmd RESERVE failed on path vmhba64:C2:T1:L3, reservation state on device naa.600a0980383136446f3f4e4c5a484355 is unknown.
ATS Compare and Write
2022-04-04T09:35:44.604Z cpu16:2098262)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x89 (0x45baf02353c0, 2099582) to dev "naa.600a0980xxxxxxxxxxxxxxxxxxxxxxx" on path "vmhba64:C3:T1:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE
2022-04-04T12:30:30.386Z cpu27:2098262)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x89 (0x459adcd68040, 2099582) to dev "naa.600a0980383136446f3f4e4c5a484356" on path "vmhba64:C3:T1:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE
2022-04-04T12:33:16.355Z cpu1:2097617)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x89 (0x45bac31031c0, 2097193) to dev "naa.600a09803831364468244e4c5a47494e" on path "vmhba64:C1:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2022-04-04T12:43:13.402Z cpu2:2097617)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x89 (0x45baf1751140, 2097193) to dev "naa.600a09803831364468244e4c5a47494e" on path "vmhba64:C0:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL