(PARTNER DOWN, TAKEOVER IMPOSSIBLE) followed by unplanned switchover
Applies to
- MetroCluster IP
- ONTAP Mediator
Issue
- The following EMS alert is received:
[Node-02: cf_main: callhome.partner.down:EMERGENCY]: Call home for PARTNER DOWN, TAKEOVER IMPOSSIBLE
- Forced switchover occurs moments later, initiated by the remote site
[RemoteNode-01: DR_heartbeat_thread: mcc.auso.triggered:notice]: The node has triggered an Automatic Unplanned Switchover of the DR partner. Reason: DR partner Hearbeat Lost. If Automatic Unplanned Switchover was triggered because of heartbeat loss, any ongoing dump on the partner may be aborted.
- Prior to this, iSCSI sessions to both the remote site and the mediator disconnect and then fail to reconnect
[Node-01: clock: ctl.session.stateChanged:notice]: iSCSI CAM target layer's session state is changed to terminated for the initiator iqn.1994-09.org.freebsd:61750f47-13fb-11ee-af94-d039ea4d5a17 (address: 192.168.255.4). Reason: no ping reply after 5 seconds.
[Node-01: clock: ctl.session.stateChanged:notice]: iSCSI CAM target layer's session state is changed to terminated for the initiator iqn.1994-09.org.freebsd:9cc17864-13fa-11ee-accc-d039ea4d5a9e (address: 192.168.255.3). Reason: no ping reply after 5 seconds.
[Node-01: kernel: iscsi.session.stateChanged:notice]: iSCSI session state is changed to Reconnecting for the target iqn.2016-07.com.netapp:9cc17864-13fa-11ee-accc-d039ea4d5a9e (type: dr_partner, address: 192.168.255.3:65200). Reason: no ping reply after 5 seconds.
[Node-01: kernel: iscsi.session.stateChanged:notice]: iSCSI session state is changed to Reconnecting for the target iqn.2016-06.com.netapp:9cc17864-13fa-11ee-accc-d039ea4d5a9e (type: dr_partner, address: 192.168.255.3:65200). Reason: no ping reply after 5 seconds.
[Node-01: kernel: iscsi.session.stateChanged:notice]: iSCSI session state is changed to Reconnecting for the target iqn.2016-06.com.netapp:9cc17864-13fa-11ee-accc-d039ea4d5a9e (type: dr_partner, address: 192.168.255.131:65200). Reason: no ping reply after 5 seconds.
[Node-01: clock: ctl.session.stateChanged:notice]: iSCSI CAM target layer's session state is changed to terminated for the initiator iqn.1994-09.org.freebsd:9cc17864-13fa-11ee-accc-d039ea4d5a9e (address: 192.168.255.131). Reason: no ping reply after 5 seconds.
[Node-01: kernel: iscsi.session.stateChanged:notice]: iSCSI session state is changed to Reconnecting for the target iqn.2016-06.com.netapp:61750f47-13fb-11ee-af94-d039ea4d5a17 (type: dr_auxiliary, address: 192.168.255.4:65200). Reason: no ping reply after 5 seconds.
[Node-01: kernel: iscsi.session.stateChanged:notice]: iSCSI session state is changed to Reconnecting (destroy sim) for the target iqn.2012-05.local:mailbox.target.248b3fde-13fd-11ee-accc-d039ea4d5a9e:92a5b3ba-13fe-11ee-838f-d039ea4d5a1f:1 (type: mailbox, address: 192.168.222.170). Reason: session login timed out 2 times.
- Followed by scsi.cmd errors to the remote disks (as they became unreachable) and to the mediator mailbox disks; after multiple retry attempts, disk missing alerts are received
[Node-01: pha_main000: scsi.cmd.abortedByHost:error]: Disk device 0f.1L0: Command aborted by host adapter: HA status 0x4: cdb 0x2a:00000008:0008. Disk 0f.1 Shelf - Bay - [NETAPP PHA-DISK 0001] S/N [XXXXXXXXXXXXXXXX] UID [37643538:33396663:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]
[Node-01: pha_main000: scsi.cmd.abortedByHost:error]: Disk device 0f.1L0: Command aborted by host adapter: HA status 0x4: cdb 0x2a:0000a410:0008. Disk 0f.1 Shelf - Bay - [NETAPP PHA-DISK 0001] S/N [XXXXXXXXXXXXXXXX] UID [37643538:33396663:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]
[Node-01: config_thread: raid.config.filesystem.disk.missing:info]: File system Disk 0m.i1.2L26 Shelf 20 Bay 20 [NETAPP X357_SLBPE3T8ATE NA52] S/N [XXXXXXX] UID [5000C500:EC7AC0BB:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] is missing.
[Node-01: config_thread: raid.config.filesystem.disk.missing:info]: File system Disk 0v.i2.0L13 Shelf 20 Bay 23 [NETAPP X357_SLBPE3T8ATE NA52] S/N [XXXXXXX] UID [5000C500:EC7A50BF:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] is missing.
[Node-01: config_thread: raid.config.filesystem.disk.missing:info]: File system Disk 0m.i2.0L5 Shelf 20 Bay 19 [NETAPP X357_SLBPE3T8ATE NA52] S/N [XXXXXXX] UID [5000C500:EC7ACF53:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] is missing.
- Both the partner mailbox disks and mediator mailbox disks are declared not healthy
[Node-01: cf_worker: cf.mccip.med.auso.stDisabled:error]: Automatic switchover disabled: Mediator mailbox disk not healthy.
[Node-01: cf_main: cf.fsm.backupMailboxError:error]: Failover monitor: partner mailbox error detected.
[Node-01: cf_main: cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of Node-02 disabled (partner mailbox disks not accessible or invalid).
[Node-01: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of Node-01 by Node-02 disabled (unsynchronized log).
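When EMS output is copied out of a log viewer, several entries can run together on one line, which makes the order of events hard to follow. The sketch below is one possible way to split such text and list the event names in sequence. It is not a NetApp tool: the function name `split_ems` and the header pattern `[<node>: <process>: <event.name>:<severity>]: ` are assumptions inferred only from the excerpts in this article, and other EMS layouts may not match.

```python
import re

# Assumed header shape, inferred from the excerpts in this article:
#   [<node>: <process>: <event.name>:<severity>]: <message text>
# Bracketed fields inside a message (disk model, S/N, UID) do not match,
# because they lack the "<name>: <name>: " prefix.
EMS_HEADER = re.compile(
    r"\[([^\[\]:]+): ([^\[\]:]+): ([A-Za-z0-9_.]+):([A-Za-z]+)\]: "
)

def split_ems(text):
    """Split text holding one or more EMS entries into a list of dicts."""
    matches = list(EMS_HEADER.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # The message runs until the next header, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        node, process, event, severity = m.groups()
        entries.append({
            "node": node,
            "process": process,
            "event": event,
            "severity": severity,
            "message": text[m.end():end].strip(),
        })
    return entries

# Two entries from this article, run together as they might appear in a paste.
SAMPLE = (
    "[Node-01: cf_worker: cf.mccip.med.auso.stDisabled:error]: "
    "Automatic switchover disabled: Mediator mailbox disk not healthy."
    "[Node-01: cf_main: cf.fsm.backupMailboxError:error]: "
    "Failover monitor: partner mailbox error detected."
)

for entry in split_ems(SAMPLE):
    print(entry["node"], entry["severity"], entry["event"])
```

Listing only the node, severity, and event name makes the progression easier to see: iSCSI session loss, then disk errors, then mailbox and switchover state changes.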
