iSCSI Outage During Hardware Failure and Automated Giveback in ONTAP Cluster
Applies to
- NetApp ONTAP 9.11.1P9 and above
- AFF-A250
- Clustered ONTAP environments using iSCSI protocol
- VMware ESXi hosts connected via iSCSI
Issue
- VMware hosts experienced iSCSI outages when a controller PCI adapter failed and an automated giveback ran while the data LIFs were down.
- Affected VMs switched their filesystems to read-only due to I/O timeouts, and intermittent I/O failures persisted until the impacted controller was halted.
Relevant Log Output:
event log show -severity '*' | grep -iE 'panic|emergency|error|giveback|takeover|alert|debug'
[node_1: vifmgr: vifmgr.cluscheck.droppedall:alert]: Total packet loss when pinging from cluster lif 123 (node node_1) to cluster lif 123 (node node_2).
[Node1_: cfdisk_config: cf.diskinventory.sendFailed:debug]: params: {'reason':'HAInterconnectdown','errorCode':'0'}
[Node1_: vifmgr: vifmgr.cluscheck.droppedall:alert]: Total packet loss when pinging from cluster LIF ...
[Node1_: statd: callhome.hainterconnect.down:alert]: Callhome for HAINTERCONNECTDOWN due to all links are down.
[Node_1: vifmgr: vifmgr.lifdown.noports:alert]: LIF iscsi_123 (on virtual server 3), IP address 123, currently cannot be hosted on node node_1, port e1a_123, or any of its failover targets, and is being marked as down.
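If the event log output has been saved to a text file, the same keyword filter can be applied offline. The following is a minimal Python sketch, not a NetApp tool. It assumes each line follows the shape seen in the excerpt above ([node: process: event.name:severity]: message). The names EVENT_RE, KEYWORDS, Event, parse_event and filter_events are hypothetical and introduced only for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed line shape, inferred from the excerpt above:
#   [node: process: event.name:severity]: message
# Whitespace after each colon is optional.
EVENT_RE = re.compile(
    r"^\[([^:\]]+):\s*([^:\]]+):\s*([^:\]]+):\s*([^:\]]+)\]:\s*(.*)$"
)

# Same keywords as the grep -iE pattern in the command above.
KEYWORDS = re.compile(
    r"panic|emergency|error|giveback|takeover|alert|debug", re.IGNORECASE
)


class Event(NamedTuple):
    node: str
    process: str
    name: str
    severity: str
    message: str


def parse_event(line: str) -> Optional[Event]:
    """Split one log line into its fields, or return None if it does not fit."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    return Event(*(group.strip() for group in match.groups()))


def filter_events(lines: Iterable[str]) -> List[Event]:
    """Keep lines that contain a keyword and also parse as an event."""
    events = []
    for line in lines:
        if not KEYWORDS.search(line):
            continue
        event = parse_event(line)
        if event is not None:
            events.append(event)
    return events
```

For example, passing the lines of a saved log file to filter_events and grouping the result by the name field makes it easier to see that the HA interconnect alerts and the vifmgr.lifdown.noports alert come from the same node.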
