CRC errors on T6 ports after converting from 40GbE to 100GbE
Applies to
- AFF A800, AFF C800, ASA A800, ASA C800 with onboard T6 ports e0a and e0b (T62100-MEZZ) and T62100-CR (P/N X1146A) NIC in Slot 1
- AFF A320 onboard T6 ports e0g and e0h (T62100-SABR)
- IO Module,2p MC IP,40GbE QSFP+,100GbE QSFP28 (P/N X91146A) - T62100-CR
- 2p 100GbE iWARP QSFP28 NIC (P/N X1146A) - T62100-CR
Issue
- After converting T6-based Ethernet ports from 40GbE to 100GbE, a continuously high number of CRC errors is reported due to corrupted Ethernet packets (example commands to confirm the counters are shown at the end of this section).
[node1 vifmgr: callhome.clus.net.degraded:alert]: Call home for CLUSTER NETWORK DEGRADED: CRC Errors Detected - High CRC errors detected on port e0a node node1
- Link parameters are not cleared after a 40GbE to 100GbE port conversion, resulting in the generation of malformed packets.
- In some cases, receiving these corrupted packets can lead to a system disruption (data outage). Examples:
- VIFMgr crashes, disrupting NAS protocols. Examples of ONTAP event messages reported:
[node_name: vifmgr: vifmgr.startup.merge.err:error]: The Logical Interface Manager (VIFMgr) encountered errors during startup.
vifmgr.startup.failover.err: VIFMgr encountered errors during startup.
vifmgr.dbase.checkerror: VIFMgr experienced an error verifying cluster database consistency. Some LIFs might not be hosted properly as a result.
- Multiple disks fail due to checksum errors. Examples of ONTAP event messages reported:
raidio_thread: raid_tetris_cksum_err_1:notice]: params: {'owner': '', 'disk_info': 'Disk /aggr1/plex4/rg0/0v.i1.2L4P3 Shelf 10 Bay 1 ...
io_thread: raid_rg_readerr_repair_parity_1:notice]: params: {'owner': '', 'disk_info': 'Disk /aggr1/plex4/rg0/0v.i1.2L4P3 Shelf 10 Bay 1 ...
io_thread: raid_rg_readerr_repair_cksum_stored_1:notice]: params: {'owner': '', 'disk_info': 'Disk /aggr1/plex4/rg0/0v.i1.2L4P3 Shelf 10 Bay 1 ...
disk_server_0: disk_checksum_verifyFailed_1:alert]: params: {'diskName': '0v.i1.2L4', 'bno': '13160468', 'vol': '------', 'fileid': '-1', 'block': '0'}
hamsg: disk.fail.ssdstats:info]: Disk 0v.i1.2L4 (S6H0NA0TB04199) failed with rated life used 0 %, percent spare blocks 0 %, spare blocks N/A.
hamsg: disk.outOfService:notice]: Drive 0v.i1.2L4 (S6H0NA0TB04199): message received. Power-On Hours: 8006, GList Count: 0, Drive Info: Disk 0v.i1.2L4 Shelf 10 Bay 1 ...
config_thread: raid.shared.disk.awaiting.done:info]: Received shared disk awaiting done Disk 0v.i1.2L4 Shelf 10 Bay 1 ..., state failing, substate 0x10, partner state failing, partner substate 0x4, partner dblade ID xxxx host type 1 add details receive awaiting done
dmgr_thread: raid.notify.on.failure:debug]: Received SDM_NOTIFY_ON_FAILURE for disk uid x:x:x:x, originating sysid 538300608, with failure reason 1 (failed).
raid_disk_thread: raid.disk.unload.done:info]: Unload of Disk 0v.i1.2L4 Shelf 10 Bay 1 [NETAPP X4010S173A1T9NTE NA50] S/N [S6H0NA0TB04199] UID [x:x:x:x] has completed successfully
- Port speed changes can occur in the following example scenarios:
- 40GbE cluster switches or Ethernet data switches are replaced with 100GbE models
- Cluster ports are temporarily configured at 40GbE for a storage system upgrade, but the final port speed configuration is 100GbE
- During a switched-to-switchless cluster conversion
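The commands below are an illustrative sketch of one way to confirm the symptom; node1 and e0a are placeholder names, and the exact counter labels reported by ifstat vary by ONTAP release and NIC firmware.

::> network port show -node node1 -port e0a
Verify the administrative and operational speed reported for the converted port.

::> system node run -node node1 -command "ifstat e0a"
Review the receive counters; a continuously increasing CRC error count on the converted port matches this issue.

A CRC error count that keeps climbing after the 40GbE to 100GbE conversion is consistent with the stale link parameters described above.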