CFBRIDGE-263: Both nodes reboot with a multi-disk panic after an NS224 disk failure
Issue
- Both nodes in the same HA pair reboot around the time of a disk failure.
- The initial messages from the misbehaving disk can look similar to the following example:
(...) [node_name: scsi_cmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device e0a.1.1.1L0: Check Condition: CDB 0x28:09857b04:0024: Sense Data SCSI:aborted command - (0xb - 0x90 0x5 0xfb)(10509).
(...) [node_name: scsi_cmdblk_strthr_admin: disk.timeout.flush.start:debug]: Aggressive timeout flush started on disk e0a.1.1.1.
- Both nodes trigger a "power on" AutoSupport message within a short period of time, even though there are no power issues in the data center. Example:
(...) HA Group Notification (REBOOT (power on)) NOTICE
- In some scenarios, a MultiDisk Panic (MDP) is reported before the node goes down.
- In some scenarios, one node starts a takeover, and the partner's events show a message similar to the following example:
Node encountered a multidisk error or other fatal error while waiting to be taken over
- The BMC/SP logs show records similar to the following example:
Record (...) [BMC.critical]: Filer Reboots
- The NS224 shelf module logs report an event similar to the following example:
Failure: software watchdog detected fault
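
As a rough triage aid, the example messages above can be turned into a small log scan. The following is a minimal sketch, assuming the EMS, AutoSupport, BMC/SP and shelf module logs have already been exported as plain text. The names SIGNATURES and match_signatures and the labels are hypothetical, chosen for this illustration; they are not part of ONTAP or of any NetApp tool. The patterns are copied from the example lines in this article, so real messages that are worded differently will not match.

```python
import re

# Patterns taken from the example messages in this article.
# The labels on the left are hypothetical names used only in this sketch.
SIGNATURES = {
    "scsi_check_condition": re.compile(r"scsi\.cmd\.checkCondition:error"),
    "timeout_flush": re.compile(r"disk\.timeout\.flush\.start"),
    "power_on_asup": re.compile(r"REBOOT \(power on\)"),
    "multidisk_takeover": re.compile(r"multidisk error or other fatal error"),
    "bmc_filer_reboots": re.compile(r"\[BMC\.critical\]: Filer Reboots"),
    "shelf_watchdog": re.compile(r"software watchdog detected fault"),
}


def match_signatures(lines):
    """Return the set of signature labels found in an iterable of log lines."""
    found = set()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(label)
    return found


if __name__ == "__main__":
    sample = [
        "(...) HA Group Notification (REBOOT (power on)) NOTICE",
        "Failure: software watchdog detected fault",
    ]
    # Print the labels that matched, in a stable order.
    print(sorted(match_signatures(sample)))
```

A match only shows that a line resembles one of the examples. It does not confirm this issue on its own, and the timing across both nodes and the shelf still has to be checked manually.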