StorageGRID compute node reports overheating due to no fans installed
Applies to
- NetApp StorageGRID
- Appliance Model SG1000
Issue
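CPU overheating is detected on a StorageGRID appliance node. Overheating can damage components and cause the node to reboot. The kernel log shows repeated CPU thermal throttling, followed by high-temperature health errors from the network adapter: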
[ 608.522299] CPU79: Package temperature above threshold, cpu clock throttled (total events = 118108)
[ 608.522327] CPU78: Package temperature above threshold, cpu clock throttled (total events = 118110)
[ 608.525326] CPU79: Package temperature/speed normal
[ 608.525331] CPU77: Package temperature/speed normal
[ 608.525340] CPU78: Package temperature/speed normal
[ 608.689771] CPU74: Package temperature above threshold, cpu clock throttled (total events = 118058)
[ 608.692332] CPU58: Package temperature above threshold, cpu clock throttled (total events = 176869)
[ 608.692341] CPU57: Package temperature above threshold, cpu clock throttled (total events = 176870)
[ 608.692344] CPU56: Package temperature above threshold, cpu clock throttled (total events = 176869)
[ 608.692359] CPU59: Package temperature above threshold, cpu clock throttled (total events = 176866)
[ 608.692369] CPU55: Package temperature above threshold, cpu clock throttled (total events = 176867)
[ 608.693295] CPU56: Package temperature/speed normal
[ 608.693301] CPU59: Package temperature/speed normal
[ 608.693305] CPU58: Package temperature/speed normal
[ 608.693324] CPU57: Package temperature/speed normal
[ 608.693328] CPU55: Package temperature/speed normal
[ 1070.198284] mlx5_core 0000:af:00.0: device's health compromised - reached miss count
[ 1070.206809] mlx5_core 0000:af:00.0: assert_var[0] 0x00000073
[ 1070.213230] mlx5_core 0000:af:00.0: assert_var[1] 0x00000073
[ 1070.219391] mlx5_core 0000:af:00.0: assert_var[2] 0x00000000
[ 1070.225546] mlx5_core 0000:af:00.0: assert_var[3] 0x00000000
[ 1070.231663] mlx5_core 0000:af:00.0: assert_var[4] 0x00000000
[ 1070.237782] mlx5_core 0000:af:00.0: assert_exit_ptr 0x00a4557c
[ 1070.244029] mlx5_core 0000:af:00.0: assert_callra 0x009a4d90
[ 1070.250132] mlx5_core 0000:af:00.0: fw_ver 16.25.1020
[ 1070.255621] mlx5_core 0000:af:00.0: hw_id 0x0000020d
[ 1070.261080] mlx5_core 0000:af:00.0: irisc_index 0
[ 1070.266316] mlx5_core 0000:af:00.0: synd 0x10: High temperature
[ 1070.272730] mlx5_core 0000:af:00.0: ext_synd 0x0000
[ 1070.278119] mlx5_core 0000:af:00.0: raw fw_ver 0x101903fc
[ 1070.710279] mlx5_core 0000:af:00.1: device's health compromised - reached miss count
[ 1070.718920] mlx5_core 0000:af:00.1: assert_var[0] 0x00000073
[ 1070.725245] mlx5_core 0000:af:00.1: assert_var[1] 0x00000073
[ 1070.731315] mlx5_core 0000:af:00.1: assert_var[2] 0x00000000
[ 1070.737395] mlx5_core 0000:af:00.1: assert_var[3] 0x00000000
[ 1070.743471] mlx5_core 0000:af:00.1: assert_var[4] 0x00000000
[ 1070.749583] mlx5_core 0000:af:00.1: assert_exit_ptr 0x00a4557c
[ 1070.755911] mlx5_core 0000:af:00.1: assert_callra 0x009a4d90
[ 1070.762109] mlx5_core 0000:af:00.1: fw_ver 16.25.1020
[ 1070.767723] mlx5_core 0000:af:00.1: hw_id 0x0000020d
[ 1070.773252] mlx5_core 0000:af:00.1: irisc_index 0
[ 1070.778607] mlx5_core 0000:af:00.1: synd 0x10: High temperature
[ 1070.785116] mlx5_core 0000:af:00.1: ext_synd 0x0000
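The log excerpt above can be summarized with a short script. The following is a minimal sketch, not NetApp tooling: it assumes the kernel log has already been saved as text, and the function name `summarize_thermal_events` is hypothetical. The regular expressions are written only against the message formats shown in this article.

```python
import re
from collections import Counter

# Patterns match the kernel message formats shown in the log excerpt above.
THROTTLE_RE = re.compile(
    r"CPU(\d+): Package temperature above threshold, cpu clock throttled "
    r"\(total events = (\d+)\)"
)
NIC_TEMP_RE = re.compile(r"mlx5_core (\S+): synd 0x10: High temperature")


def summarize_thermal_events(lines):
    """Hypothetical helper: summarize thermal messages in kernel log lines.

    Returns (throttle_counts, max_total_events, nic_devices).
    """
    throttle_counts = Counter()  # CPU id -> throttle messages seen in this input
    max_total_events = {}        # CPU id -> highest "total events" value reported
    nic_devices = set()          # PCI addresses of mlx5 devices reporting high temperature
    for line in lines:
        m = THROTTLE_RE.search(line)
        if m:
            cpu, total = int(m.group(1)), int(m.group(2))
            throttle_counts[cpu] += 1
            max_total_events[cpu] = max(total, max_total_events.get(cpu, 0))
            continue
        m = NIC_TEMP_RE.search(line)
        if m:
            nic_devices.add(m.group(1))
    return throttle_counts, max_total_events, nic_devices


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
[ 608.522299] CPU79: Package temperature above threshold, cpu clock throttled (total events = 118108)
[ 608.525326] CPU79: Package temperature/speed normal
[ 608.692332] CPU58: Package temperature above threshold, cpu clock throttled (total events = 176869)
[ 1070.266316] mlx5_core 0000:af:00.0: synd 0x10: High temperature
[ 1070.778607] mlx5_core 0000:af:00.1: synd 0x10: High temperature
""".splitlines()

counts, totals, nics = summarize_thermal_events(SAMPLE)
print(dict(counts), totals, sorted(nics))
```

A large "total events" value on many CPUs at once, together with both ports of the network adapter reporting `High temperature`, points to a chassis-wide cooling problem rather than a single failing component.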