Unable to giveback after motherboard replaced on FAS62xx/FAS80xx due to no disks attached
Applies to
- FAS62xx
- FAS80xx
- AFF8080
- Motherboard replacement
- NVRAM replacement
- non-partitioned drives
Issue
- Giveback cannot be performed because no root volume is found, which is caused by the HA interconnect ports being down on the takeover node.
WARNING: there do not appear to be any disks attached to the system. No root volume found. Rebooting... (press ctrl-c during boot to break reboot loop)
- The interconnect links are down on the takeover node; the NVRAM card on the takeover node has likely gone into a hung state.
- In a controller-IOXM (CI) setup, the physical ports show as down on both ends (a loopback test shows that both interconnect links are down on the card).
- After takeover, you might see the following EMS messages on the takeover node:
Wed Dec 06 12:37:27 GMT [n2: ib_nap_tx_2: connectx.shoutTimeout:debug]: Node advertisement send timed out on Port ib0b.
Wed Dec 06 12:37:29 GMT [n2: ib_nap_tx_1: connectx.shoutTimeout:debug]: Node advertisement send timed out on Port ib0a.
Wed Dec 06 12:37:37 GMT [n2: cfdisk_config: cf.diskinventory.sendFailed:debug]: params: {'errorCode': '1', 'reason': 'HA Interconnect down'}
Wed Dec 06 12:37:40 GMT [n2: ib_nap_tx_2: connectx.shout.portDisabled:critical]: Node advertisement send timed out on Port ib0b. ConnectX registers have been dumped to the /etc/ConnectX_regdump file.
Wed Dec 06 12:37:40 GMT [n2: mlx4_intr_handler: mlx4.link.statusChange:info]: InfiniBand port ib0b: Link down.
Wed Dec 06 12:37:41 GMT [n2: ib_nap_tx_2: ems.engine.suppressed:debug]: Event 'rdma.rdr.opFailed' suppressed 5 times in last 29618503 seconds.
Wed Dec 06 12:37:41 GMT [n2: ib_nap_tx_2: rdma.rdr.opFailed:debug]: RDR operation get_entity_property failed on error 7005.
Wed Dec 06 12:37:42 GMT [n2: ib_nap_tx_1: connectx.shout.portDisabled:critical]: Node advertisement send timed out on Port ib0a. ConnectX registers have been dumped to the /etc/ConnectX_regdump file.
Wed Dec 06 12:37:42 GMT [n2: mlx4_intr_handler: mlx4.link.statusChange:info]: InfiniBand port ib0a: Link down.
Wed Dec 06 12:37:44 GMT [n2: ib_mad2_wq: ems.engine.suppressed:debug]: Event 'ic.rdma.qpDisconnected' suppressed 4 times in last 29618502 seconds.
Wed Dec 06 12:37:44 GMT [n2: ib_mad2_wq: ic.rdma.qpDisconnected:debug]: kstat is disconnected.
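The EMS signatures above can be picked out of a saved log with a short script. The following is a minimal sketch, assuming the log has been exported as plain text with one event per line in the format shown above; the function names and the choice of which events to flag are illustrative assumptions, not part of any NetApp tool.

```python
import re

# Event names taken from the EMS excerpt above; this set is an assumption
# about which events are worth flagging and can be extended as needed.
INTERCONNECT_EVENTS = {
    "connectx.shoutTimeout",
    "connectx.shout.portDisabled",
    "mlx4.link.statusChange",
    "cf.diskinventory.sendFailed",
}

# Matches the bracketed header "[node: process: event.name:severity]: message".
EMS_HEADER = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): "
    r"(?P<event>[\w.]+):(?P<sev>\w+)\]: (?P<msg>.*)"
)


def find_interconnect_events(lines):
    """Return (node, event, severity, message) for each flagged EMS line."""
    hits = []
    for line in lines:
        m = EMS_HEADER.search(line)
        if m and m.group("event") in INTERCONNECT_EVENTS:
            hits.append(
                (m.group("node"), m.group("event"), m.group("sev"), m.group("msg"))
            )
    return hits


def disabled_ports(lines):
    """Return the sorted port names reported by connectx.shout.portDisabled."""
    ports = set()
    for _node, event, _sev, msg in find_interconnect_events(lines):
        if event == "connectx.shout.portDisabled":
            m = re.search(r"Port (\w+)\.", msg)
            if m:
                ports.add(m.group(1))
    return sorted(ports)
```

If both interconnect ports (ib0a and ib0b in the excerpt) are returned by `disabled_ports`, the log matches the pattern described in this article.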
- When a giveback is attempted, the takeover node does not show the partner as waiting for giveback:
Example:
7-mode (partner in takeover, but not showing "Waiting for giveback"):
n2(takeover)> cf status
n1 has taken over n2.
Cluster-mode:
n2    n1    false    In takeover
n1    n2    -        Unknown
Note: the partner should be shown as "Waiting for giveback".
- When checking the interconnect status, both links show as down:
7-mode:
n2*> ic status
Link 0: down
Link 1: down
IC RDMA connection : down
Cluster-mode:
cluster::*> storage failover interconnect show-link local
Node Port Number Link State
------------------------------------------------------------------------------
n2
0 down
1 down
2 entries were displayed.
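When the link status is collected from many systems, the check for "all links down" can be automated. The sketch below is a hedged example that parses captured text in either of the two layouts shown above; the function names are hypothetical, and the "up" state is an assumption because only "down" appears in this article.

```python
import re

# Matches "Link 0: down" (7-mode `ic status`) or a row such as "0 down"
# (cluster-mode show-link output). The "up" value is assumed.
LINK_LINE = re.compile(r"^\s*(?:Link\s+)?(\d+)\s*:?\s+(up|down)\s*$", re.IGNORECASE)


def link_states(output):
    """Map each link number found in the captured output to 'up' or 'down'."""
    states = {}
    for line in output.splitlines():
        m = LINK_LINE.match(line)
        if m:
            states[int(m.group(1))] = m.group(2).lower()
    return states


def all_links_down(output):
    """True only if at least one link was parsed and every link is down."""
    states = link_states(output)
    return bool(states) and all(state == "down" for state in states.values())
```

Lines that do not describe a link, such as "IC RDMA connection : down" or "2 entries were displayed.", are ignored by the pattern.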
- If the controllers are in a controller-IOXM (CI) setup, the physical HA interconnect ports show no link light. If you loop back the HA interconnect ports (cable from port 0 to port 1 on the same controller) while the down node is waiting for giveback, the link lights come on for the down controller but not for the takeover node.
- Attempting to manually bring the interconnect port up fails with the following error:
7-mode:
n2(takeover)*> ic link on 0
Error: Failed to perform requested operation on port 0 due to an internal error.
The port has been disabled. To re-enable the port, reboot the system.
Cluster-mode:
cluster::*> interconnect link on -node n2 -link 0
(system ha interconnect link on)
Error: command failed: Failed to perform requested operation on link 0 due to
an internal error. The port has been disabled. To re-enable the port,
reboot the system.
- If the above error is observed on the takeover node, the NVRAM card has likely gone into a hung state.
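Because the cluster-mode output wraps the error text across several lines, a plain substring search on the raw console capture can miss it. A minimal sketch, assuming the console output has been saved as text (the function name is hypothetical):

```python
import re

# Marker sentence taken verbatim from the error output shown above.
DISABLED_PORT_MARKER = (
    "The port has been disabled. To re-enable the port, reboot the system."
)


def port_disabled_error(output):
    """Return True if the captured output contains the disabled-port error."""
    # Collapse line wraps and repeated whitespace so the marker matches
    # both the 7-mode and the wrapped cluster-mode layout.
    normalized = re.sub(r"\s+", " ", output)
    return DISABLED_PORT_MARKER in normalized
```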