Node reboots with unexpected takeover after no heartbeat and no panic message reported
Applies to
- AFF A400
- FAS 8700
- FAS 8300
- Unexpected Takeover (no Heartbeat Alert)
Issue
- Unexpected takeover of node with the following events:
[Node-01: kltp: clam.heartbeat.state.change:info]: Heartbeats to node (name=Node-02, ID=1001) are Failing.
[Node-01: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
- The following heartbeat debug messages are reported in the event logs:
[node_name_1: cf_main: fm_lastHeartbeatInfo_1:debug]: params: {'time_since_firmware_rcvd': '5000', 'time_since_htbt_read_attempt_ic': '1000', 'time_since_mb_upd_minor_seq': '13', 'time_since_mb_upd_major_seq': '520405022', 'time_since_htbt_read_success_ic': '6000', 'current_time': '520650179', 'time_since_htbt_read_success_mb': '4029', 'time_since_ic_upd_major_seq': '520407022', 'time_since_htbt_write_ic': '0', 'time_since_first_htbt_write_mb_drop': '259840013', 'time_since_htbt_write_mb': '13', 'time_since_ic_upd_minor_seq': '0', 'mb_htbt_drop_count': '10', 'time_since_recent_htbt_write_mb_drop': '259837513', 'time_since_firmware_written': '0', 'time_since_htbt_read_upd_seq_ic': '6000', 'partner_minor_seq_num_mb': '728568', 'time_since_firmware_read': '6979', 'partner_minor_seq_num_ic': '728567', 'partner_major_seq_num_ic': '1669711236', 'time_since_htbt_read_upd_seq_mb': '4029', 'partner_major_seq_num_mb': '1669711236', 'time_since_htbt_read_attempt_mb': '4029'}
[node_name_1: cf_main: fm_lastHeartbeatInfo_1:debug]: params: {'time_since_firmware_rcvd': '20000', 'time_since_htbt_read_attempt_ic': '1000', 'time_since_mb_upd_minor_seq': '13', 'time_since_mb_upd_major_seq': '520420022', 'time_since_htbt_read_success_ic': '21000', 'current_time': '520665179', 'time_since_htbt_read_success_mb': '867', 'time_since_ic_upd_major_seq': '520422022', 'time_since_htbt_write_ic': '0', 'time_since_first_htbt_write_mb_drop': '259855013', 'time_since_htbt_write_mb': '13', 'time_since_ic_upd_minor_seq': '0', 'mb_htbt_drop_count': '10', 'time_since_recent_htbt_write_mb_drop': '259852513', 'time_since_firmware_written': '0', 'time_since_htbt_read_upd_seq_ic': '21000', 'partner_minor_seq_num_mb': '728568', 'time_since_firmware_read': '21979', 'partner_minor_seq_num_ic': '728567', 'partner_major_seq_num_ic': '1669711236', 'time_since_htbt_read_upd_seq_mb': '19029', 'partner_major_seq_num_mb': '1669711236', 'time_since_htbt_read_attempt_mb': '867'}
- No panic message is found.
- In some instances, SRAM logs may be populated:
SRAM record type(CPU) from Data ONTAP: socket(0) core(8) bank(6)
SRAM record type(LOG) from Data ONTAP: IIO MCE Root Bus(23), Device(0), Function(0), Segment(0).
- The rebooted node completes the giveback and continues to operate normally.
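
The two fm_lastHeartbeatInfo_1 events above can be compared field by field. The following is a minimal Python sketch, not a NetApp tool: the helper names `parse_params` and `stale_counters` are hypothetical, the sample lines are trimmed to a few fields copied from the events above, and the reading that a `time_since_*` counter which grows by exactly the elapsed `current_time` was not reset between the two samples is an assumption. The source does not state the units of these counters.

```python
import ast
import re

# Matches the "params: {...}" payload of an fm_lastHeartbeatInfo_1 event line.
PARAMS_RE = re.compile(r"params:\s*(\{.*\})")

# Sample lines trimmed to a few fields; values are copied from the events above.
EARLIER_LINE = (
    "[node_name_1: cf_main: fm_lastHeartbeatInfo_1:debug]: params: "
    "{'time_since_htbt_read_attempt_ic': '1000', "
    "'time_since_htbt_read_success_ic': '6000', "
    "'time_since_htbt_read_success_mb': '4029', "
    "'time_since_htbt_read_upd_seq_mb': '4029', "
    "'current_time': '520650179'}"
)
LATER_LINE = (
    "[node_name_1: cf_main: fm_lastHeartbeatInfo_1:debug]: params: "
    "{'time_since_htbt_read_attempt_ic': '1000', "
    "'time_since_htbt_read_success_ic': '21000', "
    "'time_since_htbt_read_success_mb': '867', "
    "'time_since_htbt_read_upd_seq_mb': '19029', "
    "'current_time': '520665179'}"
)


def parse_params(line):
    """Return the params payload of one event line as a dict of ints."""
    match = PARAMS_RE.search(line)
    if match is None:
        raise ValueError("no params payload found in line")
    raw = ast.literal_eval(match.group(1))
    return {key: int(value) for key, value in raw.items()}


def stale_counters(earlier, later):
    """List time_since_* counters that grew by the full elapsed time.

    Assumption: a counter that grows by exactly the change in current_time
    was never reset between the two samples.
    """
    elapsed = later["current_time"] - earlier["current_time"]
    return sorted(
        key
        for key in earlier
        if key.startswith("time_since_")
        and key in later
        and later[key] - earlier[key] == elapsed
    )


if __name__ == "__main__":
    first = parse_params(EARLIER_LINE)
    second = parse_params(LATER_LINE)
    print("elapsed:", second["current_time"] - first["current_time"])
    for name in stale_counters(first, second):
        print("not reset between samples:", name)
```

With the trimmed samples, `current_time` advances by 15000 between the two events, and `time_since_htbt_read_success_ic` and `time_since_htbt_read_upd_seq_mb` advance by the same amount, while `time_since_htbt_read_attempt_ic` stays at 1000. Under the stated assumption, this suggests that interconnect heartbeat reads were still attempted but none succeeded between the two events.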
