SHUTDOWN PENDING (degraded mode) CRITICAL - AutoSupport message
Applies to
- ONTAP 9
- callhome.shutdown.pending
- monitor.shutdown.brokenDisk
- HA Group Notification from node_name (SHUTDOWN PENDING (degraded mode)) ALERT
Event Summary
This message occurs when an automatic shutdown sequence is initiated because a degraded RAID group cannot be reconstructed: there are not enough appropriate spare disks. In other words, the RAID group is completely degraded.
The definition of "degraded" depends on the RAID group types used by the aggregate:
- raid4 - RAID group has one missing or failed disk
- raid-dp - RAID group has two missing or failed disks
- raid-tec - RAID group has three missing or failed disks
- A mirrored aggregate is considered "degraded" if both plexes of the aggregate have missing or failed disks in the same positional RAID group.
- In ONTAP versions earlier than 9.12.1, if the system runs in completely degraded mode for the defined timeout interval, it halts automatically to prevent a RAID group integrity failure and possible loss of data.
- The default timeout is 24 hours.
- If a spare drive becomes available while the system is running in degraded mode, the system immediately begins rebuilding the failed drive.
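The per-RAID-type definitions above can be read as a simple lookup table. The following is a minimal Python sketch of that reading, not ONTAP code: the names DEGRADED_THRESHOLD and is_completely_degraded are hypothetical, and the mirrored-aggregate rule is deliberately left out.

```python
# Illustrative only: failed or missing disk counts at which this article
# describes a RAID group as "degraded" for each RAID type.
DEGRADED_THRESHOLD = {"raid4": 1, "raid-dp": 2, "raid-tec": 3}


def is_completely_degraded(raid_type: str, failed_disks: int) -> bool:
    """Return True when failed_disks equals the article's threshold for raid_type.

    Counts above the threshold fall outside the article's definition and
    return False here (hypothetical helper, not an ONTAP API).
    """
    key = raid_type.lower()
    if key not in DEGRADED_THRESHOLD:
        raise ValueError(f"unknown RAID type: {raid_type!r}")
    return failed_disks == DEGRADED_THRESHOLD[key]
```

For example, is_completely_degraded("raid-dp", 2) is True, while a raid-dp group with a single failed disk is not yet at the threshold described above.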
Validate
Event Log
event log show -severity * -message-name callhome*
[node1: statd: callhome.shutdown.pending:alert]: Call home for SHUTDOWN PENDING (degraded mode)
event log show -severity * -message-name monitor.brokenDisk*
[node1: statd: monitor.brokenDisk.notice:info]: When two disks are broken in raid_dp volume, the system shuts down automatically every 24 hours to encourage you to replace the disk. If you reboot the system it will run for another 24 hours before shutting down. (The 24 hour timeout may be increased by altering the "raid.timeout" value using the "options" command.)
[node1: statd: monitor.shutdown.brokenDisk.pending:notice]: two data disks in RAID group "/aggregate_name/plex0/rg0" are broken. Halting system in 24 hours.
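When many such lines have to be filtered, it can help to split them into fields. The sketch below assumes the "[node: source: event:severity]: text" layout seen in the sample lines above; the pattern and the parse_ems_line helper are hypothetical and are not part of ONTAP.

```python
import re

# Assumed layout, taken from the sample lines above:
#   [node: source: event.name:severity]: free text
EMS_LINE = re.compile(
    r"^\[(?P<node>[^:]+): (?P<source>[^:]+): "
    r"(?P<event>[^:\]]+):(?P<severity>[^:\]]+)\]: (?P<text>.*)$"
)


def parse_ems_line(line: str) -> dict:
    """Split one event log line into node, source, event, severity and text."""
    match = EMS_LINE.match(line.strip())
    if match is None:
        raise ValueError("line does not match the assumed layout")
    return match.groupdict()


parsed = parse_ems_line(
    "[node1: statd: callhome.shutdown.pending:alert]: "
    "Call home for SHUTDOWN PENDING (degraded mode)"
)
print(parsed["event"], parsed["severity"])
```

Lines that do not follow this layout raise ValueError rather than being guessed at.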
Command line
To verify aggregate status, run storage aggregate show-status
RAID group /aggregate_name/plex0/rg1 (double degraded, block checksums)

RAID Disk Device   HA SHELF BAY  CHAN Pool Type RPM   Used (MB/blks)     Phys (MB/blks)
--------- -------- ------------- ---- ---- ---- ----- ------------------ ------------------
dparity   0b.07.12 0b 7     12   SA:B 0    SAS  10000 1713523/3509295616 1716957/3516328368
parity    0b.07.13 0b 7     13   SA:B 0    SAS  10000 1713523/3509295616 1716957/3516328368
data      FAILED   N/A                                1713523/ -
data      0b.07.15 0b 7     15   SA:B 0    SAS  10000 1713523/3509295616 1716957/3516328368
data      FAILED   N/A                                1713523/ -
data      0b.07.21 0b 7     21   SA:B 0    SAS  10000 1713523/3509295616 1716957/3516328368
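To count the failed disks in a captured RAID group listing, a short script can look for FAILED in the device column. This is a minimal sketch that assumes the output has been saved with one disk per line, as in the rows of the sample above; count_failed_disks is a hypothetical helper, not an ONTAP command.

```python
def count_failed_disks(raid_group_rows: str) -> int:
    """Count rows whose second field (the device column) reads FAILED."""
    failed = 0
    for row in raid_group_rows.splitlines():
        fields = row.split()
        # Assumed row shape: "<position> <device> ..."; a failed disk shows
        # FAILED in place of the device name.
        if len(fields) >= 2 and fields[1] == "FAILED":
            failed += 1
    return failed


sample_rows = """\
dparity 0b.07.12 0b 7 12 SA:B 0 SAS 10000 1713523/3509295616 1716957/3516328368
parity 0b.07.13 0b 7 13 SA:B 0 SAS 10000 1713523/3509295616 1716957/3516328368
data FAILED N/A 1713523/ -
data 0b.07.15 0b 7 15 SA:B 0 SAS 10000 1713523/3509295616 1716957/3516328368
data FAILED N/A 1713523/ -
data 0b.07.21 0b 7 21 SA:B 0 SAS 10000 1713523/3509295616 1716957/3516328368
"""
print(count_failed_disks(sample_rows))  # prints 2
```

Two failed disks in a raid-dp group match the "double degraded" state shown in the sample header.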
Run storage failover show to verify whether the aggregate containing the disk that needs to be reconstructed or replaced is in a partial giveback state:
::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
Node-1         Node-2         true     Connected to Node-2, Partial giveback
Node-2         Node-1         true     Connected to Node-1.
Resolution
- Check for unassigned disks. Assign them to the node which requires spares to start reconstruction (the status should disappear once reconstructions start):
::> storage disk show -container-type unassigned
::> storage disk assign -disk <stackID>.<shelfID>.<bayID> -owner <node name>
- If the node is in a Partial giveback state, complete the giveback. Refer to: Disk does not reconstruct or evacuate when in the partial giveback state
- Replace any failed drives. To check your Part Status, refer to: DISK FAILED - AutoSupport message
Workaround
- Check if the HA partner node has additional available spare disks of the same type. Follow: How to reassign spare disks from HA or DR partner node
Additional Information
- Spares Low Resolution guide
- Starting with ONTAP 9.12.1, the default system behavior is changed so that the system does not halt if an aggregate is completely degraded. See Bug ID 944990 - System no longer halts if a RAID aggregate remains degraded for 24 hours.
- If you wish to preserve the previous behavior, set the raid.timeout option to a nonzero value in order for the system to shut down on expiry of the timeout period.
- How does ONTAP select spares for aggregate creation, aggregate addition and failed disk replacement?
- Node shutdown with "monitor.shutdown.brokenDisk:EMERGENCY" error