StorageGRID appliance frequently has all HIC ports down
Applies to
NetApp StorageGRID Appliances
Issue
A StorageGRID node intermittently loses connectivity on some ports. The ports disconnect and, when they reconnect, they may resynchronise with LACP if it is configured.
The warn log under /var/local/log of the affected node shows instances of Tx Timeout for the HIC ports:
Jan 10 03:12:23 localhost kernel: [1456351.753113] [qede_tx_timeout:991(hic2)]Tx timeout!
Jan 10 03:12:23 localhost kernel: [1456351.753338] [qed_mfw_report:3613(hic2)]Txq[1]: FW cons [host] fce8, SW cons fc97, SW prod fce8 [idx c6] [Jiffies 4658987302]
Jan 10 03:12:23 localhost kernel: [1456351.753588] [qed_mfw_report:3613(hic2)]Txq[1]: SB[0x0002] - IGU: prod 00339d9f cons 00339b03 CAU Tx fce8
Jan 10 03:12:23 localhost kernel: [1456351.753832] [qed_mfw_report:3613(hic2)]Last DB: 0000fce8 [Jiffies 4658985126]
Jan 10 03:11:57 localhost kernel: [1456325.502522] NETDEV WATCHDOG: hic4 (qede): transmit queue 6 timed out
Jan 10 03:11:58 localhost kernel: [1456326.281083] [qede_tx_timeout:991(hic4)]Tx timeout!
Jan 10 03:11:58 localhost kernel: [1456326.337487] bond0: link status down for interface hic4, disabling it in 200 ms
Jan 10 03:11:58 localhost kernel: [1456326.337490] bond0: invalid new link 1 on slave hic4
Jan 10 03:11:58 localhost kernel: [1456326.474543] qede 0000:42:00.3 hic4: speed changed to 0 for port hic4
Jan 10 03:11:58 localhost kernel: [1456326.497102] [qede_generic_hw_err_handler:4012(hic4)]Starting a generic HW error handling (sleep requiring operations) - err_flags 0x80000002, err_flags_override 0x0
- Later, the HIC ports recover:
Jan 10 03:34:59 localhost kernel: [ 9.312373] qede 0000:42:00.1 hic2: renamed from eth0
Jan 10 03:35:08 localhost kernel: [ 43.979425] bond0: Enslaving hic2 as a backup interface with a down link
Jan 10 03:35:08 localhost kernel: [ 44.104547] [qede_validate_bond:423(hic2)]RDMA bonding - Can't bond PF1 and PF3
Jan 10 03:35:08 localhost kernel: [ 44.273897] device hic2 entered promiscuous mode
Jan 10 03:35:10 localhost kernel: [ 45.863791] [qede_link_update:3829(hic2)]Link is up
Jan 10 03:35:10 localhost kernel: [ 45.901661] bond0: link status up for interface hic2, enabling it in 0 ms
Jan 10 03:35:10 localhost kernel: [ 45.908646] bond0: link status definitely up for interface hic2, 10000 Mbps full duplex
Jan 10 03:34:59 localhost kernel: [ 9.398066] qede 0000:42:00.3 hic4: renamed from eth3
Jan 10 03:35:08 localhost kernel: [ 44.112259] bond0: Enslaving hic4 as a backup interface with a down link
Jan 10 03:35:08 localhost kernel: [ 44.280087] device hic4 entered promiscuous mode
Jan 10 03:35:10 localhost kernel: [ 46.077201] [qede_link_update:3829(hic4)]Link is up
Jan 10 03:35:10 localhost kernel: [ 46.137659] bond0: link status up for interface hic4, enabling it in 200 ms
Jan 10 03:35:10 localhost kernel: [ 46.144587] bond0: invalid new link 3 on slave hic4
Jan 10 03:35:10 localhost kernel: [ 46.353923] bond0: link status definitely up for interface hic4, 10000 Mbps full duplex
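
To check how often each port is affected, the kernel messages quoted above can be counted per interface. The following Python sketch is only an illustration, not a NetApp tool: the function name summarize_hic_events and the SAMPLE list are hypothetical, and the patterns assume the log lines keep exactly the wording shown in this article (qede Tx timeout messages, and bond "link status down" and "link status definitely up" messages).

```python
import re
from collections import defaultdict

# Patterns copied from the kernel messages shown in this article.
# Assumption: other driver or kernel versions may word these differently.
PATTERNS = {
    "tx_timeout": re.compile(r"\[qede_tx_timeout:\d+\((?P<port>\w+)\)\]Tx timeout!"),
    "link_down": re.compile(r"bond\d+: link status down for interface (?P<port>\w+)"),
    "link_up": re.compile(r"bond\d+: link status definitely up for interface (?P<port>\w+)"),
}


def summarize_hic_events(lines):
    """Count Tx timeout, link-down and link-up messages per port (hypothetical helper)."""
    summary = defaultdict(lambda: {key: 0 for key in PATTERNS})
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                summary[match.group("port")][key] += 1
    # Convert to a plain dict so the result prints and compares cleanly.
    return {port: dict(counts) for port, counts in summary.items()}


# A few lines taken from the log excerpts above, used as sample input.
SAMPLE = [
    "Jan 10 03:12:23 localhost kernel: [1456351.753113] [qede_tx_timeout:991(hic2)]Tx timeout!",
    "Jan 10 03:11:58 localhost kernel: [1456326.281083] [qede_tx_timeout:991(hic4)]Tx timeout!",
    "Jan 10 03:11:58 localhost kernel: [1456326.337487] bond0: link status down for interface hic4, disabling it in 200 ms",
    "Jan 10 03:35:10 localhost kernel: [ 45.908646] bond0: link status definitely up for interface hic2, 10000 Mbps full duplex",
]

if __name__ == "__main__":
    for port, counts in sorted(summarize_hic_events(SAMPLE).items()):
        print(port, counts)
```

In practice the same function could be fed the lines of the log file instead of SAMPLE. A port that shows Tx timeouts and link-down events on many separate dates matches the recurring behaviour described in this article.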