StorageGRID node connection state is unknown due to faulty network adapter
Applies to
- NetApp StorageGRID
- Bare metal-based Storage Node
Issue
- A Storage Node's Connection State is Unknown in Grid Manager.
To check, select Nodes > the affected node > Overview.
- servermanager.log indicates a network issue:
2021-01-23 12:39:10 +0000 | dynip | Possible network isolation: Node has no contact with other nodes.
- The base OS messages log shows i40e driver errors, and all interfaces of bond0 are link down:
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: HMC error interrupt
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: HMC error info 0x80000090, HMC error data 0x0
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.1: unhandled interrupt icr0=0x00010000
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: unhandled interrupt icr0=0x00010000
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: device will be reset
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.1: device will be reset
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.1: VSI seid 393 Tx ring 128 disable timeout
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.1: VSI seid 393 Rx ring 128 disable timeout
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: VSI seid 390 Tx ring 0 disable timeout
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: VSI seid 390 Rx ring 0 disable timeout
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: VSI seid 392 Tx ring 128 disable timeout
Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: VSI seid 392 Rx ring 128 disable timeout
Jan 23 12:33:41 dc1-sn1 kernel: bond0: link status definitely down for interface eno1, disabling it
Jan 23 12:33:41 dc1-sn1 kernel: device eno1 left promiscuous mode
Jan 23 12:33:41 dc1-sn1 kernel: bond0: now running without any active interface!
Jan 23 12:33:41 dc1-sn1 kernel: bond0: link status definitely down for interface eno2, disabling it
Jan 23 12:33:57 dc1-sn1 kernel: i40e 0000:1a:00.1: PF reset failed, -15
Jan 23 12:33:57 dc1-sn1 kernel: i40e 0000:1a:00.0: PF reset failed, -15
Jan 23 12:34:01 dc1-sn1 kernel: i40e 0000:1a:00.1: Rebuild AdminQ failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Jan 23 12:34:01 dc1-sn1 kernel: i40e 0000:1a:00.0: Rebuild AdminQ failed, err I40E_ERR_ADMIN_QUEUE_TIMEOUT aq_err OK
Jan 23 12:34:01 dc1-sn1 kernel: i40e 0000:1a:00.0: ignoring delete macvlan error on PF, err I40E_ERR_QUEUE_EMPTY, aq_err OK
Jan 23 12:34:17 dc1-sn1 kernel: i40e 0000:1a:00.1: PF reset failed, -15
Jan 23 12:34:17 dc1-sn1 kernel: i40e 0000:1a:00.0: PF reset failed, -15
...
Jan 23 12:39:10 dc1-sn1 journal: Possible network isolation: Node has no contact with other nodes. If this warning persists, use the /usr/sbin/add_node_ip.py command to tell this node the address of another node in the grid. See the Recovery and Maintenance Guide for details.
Jan 23 12:39:10 dc1-sn1 journal: 2021-05-23 13:39:10 +0000 | dynip | Possible network isolation: Node has no contact with other nodes.
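The signatures in the excerpts above can be searched for in a saved copy of the messages log. The following is a minimal sketch only, not a NetApp-provided tool: the function name scan_messages and the signature labels are hypothetical, and the patterns are taken directly from the log lines shown in this article. It accepts any iterable of lines, for example an open file handle.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts in this article.
# The label names are illustrative, not official identifiers.
SIGNATURES = {
    "hmc_error": re.compile(r"i40e \S+: HMC error interrupt"),
    "pf_reset_failed": re.compile(r"i40e \S+: PF reset failed"),
    "adminq_rebuild_failed": re.compile(r"i40e \S+: Rebuild AdminQ failed"),
    "bond_no_active_interface": re.compile(
        r"bond\d+: now running without any active interface"
    ),
    "network_isolation": re.compile(r"Possible network isolation"),
}

# Captures the bond name and the member interface reported as down.
MEMBER_DOWN = re.compile(
    r"(bond\d+): link status definitely down for interface (\S+),"
)


def scan_messages(lines):
    """Count known signatures and collect (bond, interface) pairs reported down."""
    counts = Counter()
    down = set()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
        match = MEMBER_DOWN.search(line)
        if match:
            down.add((match.group(1), match.group(2)))
    return counts, sorted(down)


# A few lines from the excerpt above, used as sample input.
SAMPLE = [
    "Jan 23 12:33:41 dc1-sn1 kernel: i40e 0000:1a:00.0: HMC error interrupt",
    "Jan 23 12:33:41 dc1-sn1 kernel: bond0: link status definitely down for interface eno1, disabling it",
    "Jan 23 12:33:41 dc1-sn1 kernel: bond0: now running without any active interface!",
    "Jan 23 12:33:41 dc1-sn1 kernel: bond0: link status definitely down for interface eno2, disabling it",
    "Jan 23 12:33:57 dc1-sn1 kernel: i40e 0000:1a:00.1: PF reset failed, -15",
    "Jan 23 12:33:57 dc1-sn1 kernel: i40e 0000:1a:00.0: PF reset failed, -15",
]

counts, down = scan_messages(SAMPLE)
print(dict(counts))
print(down)
```

Repeated "PF reset failed" entries together with every bond0 member reported down match the pattern described in this article; a single transient link-down event on its own would not.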