ONTAP Select giveback vetoed by raid and partner iSCSI sessions up/down
Applies to
- ONTAP Select with software RAID configuration
- ESXi 7.0U2 with multipathing configured to use HPP (High-Performance Plug-in) and the path selection scheme set to FIXED
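Whether a device is claimed by HPP and which path selection scheme it uses can be verified from the ESXi shell; for example (the device ID shown is the one that appears in the vmkernel.log excerpt further below):
 # esxcli storage hpp device list
 # esxcli storage core device list -d naa.51402ec010d30442
Devices claimed by HPP are expected to report FIXED as their path selection scheme in this configuration.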
Issue
- An automatic or manual takeover was issued; in this example, node cluster-01, which owns all *_01 aggregates, was taken over
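The HA state after the takeover can be confirmed from the surviving node; the rebooted node is expected to show a waiting-for-giveback state description:
 ::> storage failover show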
- The previously rebooted node boots into "waiting for giveback", but all plexes associated with the rebooted node remain offline and do not start resyncing:
::> aggr plex show
                    Is      Is         Resyncing
Aggregate Plex      Online  Resyncing  Percent   Status
--------- --------- ------- ---------- --------- ---------------
aggr1_01  plex0     false   false      -         failed,inactive
aggr1_01  plex1     true    false      -         normal,active
aggr2_02  plex0     true    false      -         normal,active
aggr2_02  plex1     false   false      -         failed,inactive
aggr0_01  plex0     false   false      -         failed,inactive
aggr0_01  plex4     true    false      -         normal,active
aggr0_02  plex0     true    false      -         normal,active
aggr0_02  plex4     false   false      -         failed,inactive
8 entries were displayed.
::> storage aggregate plex online -aggregate aggr0_01 -plex plex0
Error: command failed: Failed to bring plex aggr0_01/plex0 online. Reason: Plex is failed and cannot be operated on.
Note: In this example, the plexes numbered greater than 0 for node cluster-02's aggregates (plex1, plex4) are owned by the down node cluster-01.
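As a further check (commands illustrative, using names from this example), the failed plex and the disks backing it can be inspected with:
 ::> storage aggregate plex show -aggregate aggr0_01 -plex plex0 -instance
 ::> storage disk show -aggregate aggr0_01
While the partner iSCSI sessions are down (see below), the disks backing the failed plexes are expected to be reported as failed or unreachable, so the plexes cannot be brought online or resynced.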
- Giveback for the rebooted node is vetoed by the raid module:
::> storage failover giveback -ofnode cluster-01
::> storage failover show-giveback
               Partner
Node           Aggregate         Giveback Status
-------------- ----------------- ---------------------------------------------
Warning: Unable to list entries on node cluster-01. RPC: Couldn't make connection [from mgwd on node "cluster-02" (VSID: -1) to mgwd at 169.254.133.31]
cluster-02
               CFO Aggregates    Failed module: raid. Giveback vetoed: Cannot
                                 send all specified aggregates home. Use the
                                 "event log show -message-name
                                 gb.sfo.abort.raid.fm|gb.cfo.abort.raid.fm"
                                 command to get more information, and follow
                                 the provided corrective actions. To execute
                                 the giveback without checks, use the
                                 "override-vetoes" parameter. Warning:
                                 overriding vetoes may result in a data
                                 service outage.
               aggr1             Not attempted yet
2 entries were displayed.
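The details referenced in the veto can be retrieved with the EMS command quoted in the message, for example:
 ::> event log show -message-name gb.cfo.abort.raid.fm
 ::> event log show -message-name gb.sfo.abort.raid.fm
As the veto message itself warns, forcing the giveback with the "override-vetoes" parameter may result in a data service outage.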
- iSCSI sessions to the rebooted node are down and do not come up:
::*> storage iscsi-initiator show
                                                                       Status
Node Type Label    Target Portal      Target Name                      Admin/Op
---- ---- -------- ------------------ -------------------------------- --------
Warning: Unable to list entries on node cluster-01. RPC: Couldn't make connection [from mgwd on node "cluster-02" (VSID: -1) to mgwd at 169.254.133.31]
cluster-02
     mailbox
          2fef72d8-3b87-11ea-9c42-005056a16698-mailbox
                   10.0.0.1           iqn.2012-05.local:mailbox.target.select000000
                                                                       up/up
     partner
          2fee0ea2-3b87-11ea-9c42-005056a16698-partner
                   169.254.123.123:65200
                                      iqn.2012-06.com.bsdctl:target0   up/down
          Failure Reason: no ping reply after 58218 seconds
     partner2
          2fee0ea2-3b87-11ea-9c42-005056a16698-partner2
                   169.254.123.123:65200
                                      iqn.2012-06.com.bsdctl:target0   up/down
          Failure Reason: no ping reply after 58218 seconds
3 entries were displayed.
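In ONTAP Select with software RAID, the partner iSCSI targets are how the surviving node reaches the rebooted node's disks (and the mailbox target carries the HA mailbox), so while these sessions stay down the failed plexes cannot resynchronize. As an additional, illustrative data point, the HA mailbox disks can be checked at advanced privilege:
 ::*> storage failover mailbox-disk show -node cluster-02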
- The ESXi vmkernel.log shows the physical device in APD (All Paths Down) state and HPP in use:
2023-09-19T11:27:28.124Z cpu0:2097216)ScsiDeviceIO: 4176: Cmd(0x45b8c1940ac8) 0x9e, CmdSN 0x800101b8 from world 1234567 to dev "naa.51402ec010d30442" failed H:0x1 D:0x0 P:0x0
2023-09-19T11:27:28.926Z cpu22:2097425)ScsiVmas: 1074: Inquiry for VPD page 00 to device naa.51402ec010d30442 failed with error No connection
2023-09-19T11:27:28.930Z cpu17:2097424)HPP: HppIsDeviceAPD:5142: APD detected for HPP device "naa.51402ec010d30442".
2023-09-19T11:27:28.930Z cpu17:2097424)StorageDevice: 7060: End path evaluation for device naa.51402ec010d30442
2023-09-19T11:27:52.661Z cpu5:2097205)ScsiDeviceIO: 4176: Cmd(0x45b8d5516e08) 0x1a, CmdSN 0x80010005 from world 2152968 to dev "naa.51402ec010d30442" failed H:0x1 D:0x0 P:0x0
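The APD condition on the backing device can be cross-checked from the ESXi shell, for example (device ID taken from the log above):
 # esxcli storage core device list -d naa.51402ec010d30442
 # esxcli storage core path list -d naa.51402ec010d30442
Paths to a device in all-paths-down state are reported as dead until connectivity to the physical disk is restored.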