Unexpected controller reboot on AFF A250 and FAS500f with automatic partner takeover and giveback
Applies to
- FAS500f
- AFF A250
- BMC firmware 15.3 or earlier
Issue
- A node reboots unexpectedly, followed by automatic takeover and giveback by the partner.
- The rebooting node logs no suspicious EMS messages before the reboot. Example:
Sun Jan 02 01:25:45 +0200 [node_name-01: config_thread: raid.rg.scrub.summary.lw:notice]: Scrub found 0 RAID write signature inconsistencies in /aggregate/plex0/rg0.
Sun Jan 02 01:43:35 +0200 [node_name-01: kernel: netif.linkUp:info]: Ethernet lo0: Link up.
- The BMC event log shows a BMC reboot. Example:
35d | 01/01/2022 | 10:39:55 | System Event #0xff | Timestamp Clock Sync | Asserted
35e | 01/01/2000 | 00:00:20 | System Event | Timestamp Clock Sync | Asserted
35f | 01/01/2000 | 00:00:20 | System Event #0xff | Timestamp Clock Sync | Asserted
360 | 01/01/2022 | 23:42:54 | System Event #0xff | Timestamp Clock Sync | Asserted
361 | 01/01/2022 | 23:42:54 | System Event | Timestamp Clock Sync | Asserted
362 | 01/01/2022 | 23:43:10 | Other FRU #0x50 |
363 | 01/01/2022 | 23:43:10 | Other FRU #0x50 |
364 | 01/01/2022 | 23:43:10 | Other FRU #0x50 |
365 | 01/01/2022 | 23:43:10 | Other FRU #0x50 |
366 | 01/01/2022 | 23:43:10 | Power Supply #0x20 | Presence detected | Asserted
367 | 01/01/2022 | 23:43:10 | Power Supply #0x25 | Presence detected | Asserted
368 | 01/01/2022 | 23:43:14 | Battery #0x4f | State Deasserted
369 | 01/01/2022 | 23:45:00 | System Event #0xff | Timestamp Clock Sync | Asserted
- The partner node logs takeover messages. Example:
Sun Jan 02 01:41:39 +0200 [node_name-02: cf_main: cf.fsm.partnerNotResponding:notice]: Failover monitor: partner not responding
Sun Jan 02 01:41:39 +0200 [node_name-02: cf_main: cf.fsm.takeoverCountdown:info]: Failover monitor: takeover scheduled in 10 seconds
Sun Jan 02 01:41:39 +0200 [node_name-02: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node_name-02 by node_name-01 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
Sun Jan 02 01:41:49 +0200 [node_name-02: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
Sun Jan 02 01:41:49 +0200 [node_name-02: cf_takeover: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
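In the BMC event sample above, the timestamps fall back from 01/01/2022 to 01/01/2000 and then resynchronize, which appears to mark the BMC restart. A minimal sketch of how such a fallback could be spotted in a saved copy of the event log follows. It assumes the pipe-separated layout shown in the sample. The function names parse_sel_line and find_clock_resets and the one-year threshold are illustrative, not part of any NetApp tool.

```python
from datetime import datetime


def parse_sel_line(line):
    """Return (record_id, timestamp, description), or None if the line is not an event record."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 4:
        return None
    try:
        # Assumed layout: id | MM/DD/YYYY | HH:MM:SS | sensor | event | state
        stamp = datetime.strptime(f"{fields[1]} {fields[2]}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return fields[0], stamp, " | ".join(f for f in fields[3:] if f)


def find_clock_resets(lines, min_jump_days=365):
    """Return records whose timestamp is at least min_jump_days older than the previous record."""
    resets = []
    previous = None
    for line in lines:
        record = parse_sel_line(line)
        if record is None:
            continue
        if previous is not None and (previous[1] - record[1]).days >= min_jump_days:
            resets.append(record)
        previous = record
    return resets


# Lines copied from the BMC event example in this article.
sample = [
    "35d | 01/01/2022 | 10:39:55 | System Event #0xff | Timestamp Clock Sync | Asserted",
    "35e | 01/01/2000 | 00:00:20 | System Event | Timestamp Clock Sync | Asserted",
    "35f | 01/01/2000 | 00:00:20 | System Event #0xff | Timestamp Clock Sync | Asserted",
    "360 | 01/01/2022 | 23:42:54 | System Event #0xff | Timestamp Clock Sync | Asserted",
    "362 | 01/01/2022 | 23:43:10 | Other FRU #0x50 |",
]

for record_id, stamp, description in find_clock_resets(sample):
    print(f"{record_id}: clock fell back to {stamp:%m/%d/%Y %H:%M:%S} ({description})")
```

With the sample lines, only record 35e is flagged, because it is the first entry after the timestamp falls back. A flagged record is a hint to compare against the partner's takeover messages, not proof of a BMC reboot on its own.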