SHUTDOWN PENDING (degraded mode) CRITICAL - AutoSupport message

Last updated

Jan 10, 2024


Category:: ontap-9

Specialty:: hw


Applies to

ONTAP 9
callhome.shutdown.pending
monitor.shutdown.brokenDisk
HA Group Notification from node_name (SHUTDOWN PENDING (degraded mode)) ALERT

Event Summary

callhome.shutdown.pending

This message occurs when an automatic shutdown sequence is initiated because a degraded RAID group cannot be reconstructed for lack of suitable spare disks; that is, the RAID group is completely degraded.

The definition of "degraded" depends on the RAID group types used by the aggregate:

raid4 - RAID group has one missing or failed disk
raid-dp - RAID group has two missing or failed disks
raid-tec - RAID group has three missing or failed disks
A mirrored aggregate is considered "degraded" if both plexes of the aggregate have missing or failed disks in the same positional RAID group.
In ONTAP versions earlier than 9.12.1, if the system runs in completely degraded mode for the defined timeout interval, it halts automatically to prevent a RAID group integrity failure and possible loss of data.
- The default timeout is 24 hours.
If a spare drive becomes available while the system is running in degraded mode, the system immediately begins rebuilding the failed drive.
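
As an illustration only, the degraded thresholds and the halt timeout described above can be expressed as a small Python sketch. The function names are hypothetical, the timeout is assumed to be expressed in hours, and a timeout of 0 is assumed to mean "do not halt"; none of this is part of ONTAP itself.

```python
from datetime import datetime, timedelta
from typing import Optional

# Number of missing or failed disks at which a RAID group counts as
# "degraded" in this article, keyed by RAID type as written above.
DEGRADED_THRESHOLD = {"raid4": 1, "raid-dp": 2, "raid-tec": 3}


def is_completely_degraded(raid_type: str, failed_disks: int) -> bool:
    """Return True when the failed-disk count equals the article's threshold."""
    return failed_disks == DEGRADED_THRESHOLD[raid_type]


def halt_deadline(degraded_since: datetime,
                  timeout_hours: int = 24) -> Optional[datetime]:
    """Return when the automatic halt would occur, or None if disabled.

    Assumption: timeout_hours mirrors the timeout interval in hours
    (default 24), and 0 means the system does not halt.
    """
    if timeout_hours == 0:
        return None
    return degraded_since + timedelta(hours=timeout_hours)
```

For example, a raid-dp RAID group with two failed disks that entered degraded mode at 09:00 on Jan 10 would reach the default deadline at 09:00 on Jan 11.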

Validate

Event Log

event log show -severity * -message-name callhome*

[node1: statd: callhome.shutdown.pending:alert]: Call home for SHUTDOWN PENDING (degraded mode)

event log show -severity * -message-name monitor.brokenDisk*

[node1: statd: monitor.brokenDisk.notice:info]: When two disks are broken in raid_dp volume, the system shuts down automatically every 24 hours to encourage you to replace the disk. If you reboot the system it will run for another 24 hours before shutting down. (The 24 hour timeout may be increased by altering the "raid.timeout" value using the "options" command.)

[node1: statd: monitor.shutdown.brokenDisk.pending:notice]: two data disks in RAID group "/aggregate_name/plex0/rg0" are broken. Halting system in 24 hours.
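
When many such lines have been collected (for example from a saved log), a short script can pull out the node, event name and severity. The following Python sketch assumes each entry sits on a single line in the bracketed form shown above; the function name parse_ems_line is hypothetical and the regular expression is derived only from these samples.

```python
import re
from typing import Optional

# Matches lines such as:
# [node1: statd: callhome.shutdown.pending:alert]: Call home for ...
EMS_LINE = re.compile(
    r"^\[(?P<node>[^:]+):\s*(?P<source>[^:]+):\s*"
    r"(?P<event>[^:\]]+):(?P<severity>[^\]]+)\]:\s*(?P<message>.*)$"
)


def parse_ems_line(line: str) -> Optional[dict]:
    """Split one event log line into its parts, or return None if it does not match."""
    match = EMS_LINE.match(line.strip())
    return match.groupdict() if match else None
```

Filtering the parsed entries for an event of callhome.shutdown.pending or monitor.shutdown.brokenDisk.pending then identifies the nodes that reported a pending shutdown.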

Command line

To verify aggregate status, run storage aggregate show-status

RAID group /aggregate_name/plex0/rg1 (double degraded, block checksums)

      RAID Disk    Device     HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      ---------    ------     ------------- ---- ---- ---- ----- --------------    --------------
      dparity      0b.07.12   0b    7   12  SA:B   0   SAS 10000 1713523/3509295616 1716957/3516328368
      parity       0b.07.13   0b    7   13  SA:B   0   SAS 10000 1713523/3509295616 1716957/3516328368
      data         FAILED             N/A                        1713523/ -
      data         0b.07.15   0b    7   15  SA:B   0   SAS 10000 1713523/3509295616 1716957/3516328368
      data         FAILED             N/A                        1713523/ -
      data         0b.07.21   0b    7   21  SA:B   0   SAS 10000 1713523/3509295616 1716957/3516328368
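
On systems with many RAID groups, counting the FAILED rows per group by hand is error-prone. The Python sketch below tallies them from saved command output. It assumes the layout shown above (a "RAID group <path> (...)" header followed by one row per disk, with FAILED in the Device column); the function name is hypothetical.

```python
import re

# Header line such as:
# RAID group /aggregate_name/plex0/rg1 (double degraded, block checksums)
GROUP_HEADER = re.compile(r"^\s*RAID group (\S+) \(")


def failed_disks_by_group(output: str) -> dict:
    """Count rows whose Device column reads FAILED, per RAID group."""
    counts = {}
    current = None
    for line in output.splitlines():
        header = GROUP_HEADER.match(line)
        if header:
            current = header.group(1)
            counts[current] = 0
            continue
        fields = line.split()
        # Disk rows start with the RAID role (data, parity, dparity),
        # followed by the device name or the word FAILED.
        if current is not None and len(fields) >= 2 and fields[1] == "FAILED":
            counts[current] += 1
    return counts
```

Applied to the sample output above, this returns a count of 2 for /aggregate_name/plex0/rg1, which matches the "double degraded" status reported for that RAID group.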

Run storage failover show to check whether the aggregate containing the disk that needs to be reconstructed or replaced is in a partial giveback state:

::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
Node-1         Node-2         true     Connected to Node-2, Partial giveback
Node-2         Node-1         true     Connected to Node-1.
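
The same check can be scripted against saved output. This Python sketch assumes one row per node with the columns Node, Partner, Takeover Possible and State Description, and that the state description is not wrapped onto a second line; the function name is hypothetical.

```python
def nodes_in_partial_giveback(output: str) -> list:
    """Return the nodes whose state description mentions a partial giveback."""
    nodes = []
    for line in output.splitlines():
        # Node, Partner, Takeover Possible, then the free-text description.
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[2] in ("true", "false"):
            if "partial giveback" in fields[3].lower():
                nodes.append(fields[0])
    return nodes
```

For the output above, the result is a list containing only Node-1.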

Resolution

Check for unassigned disks. Assign them to the node that requires spares so that reconstruction can start (the degraded status should clear once reconstruction starts):

::> storage disk show -container-type unassigned

::> storage disk assign -disk <stackID>.<shelfID>.<bayID> -owner <node name>

If the node is in a partial giveback state, complete the giveback. Refer to Disk does not reconstruct or evacuate when in the partial giveback state
Replace any failed drives. To check your part status, refer to the KB article DISK FAILED - AutoSupport message

Workaround

Check if the HA partner node has additional available spare disks of the same type. Follow: How to reassign spare disks from HA or DR partner node

For further assistance:

Additional Information

Spares Low Resolution guide
Starting with ONTAP 9.12.1 (Bug ID 944990 - System no longer halts if a RAID aggregate remains degraded for 24 hours), the system by default no longer halts when an aggregate is completely degraded.
- To preserve the previous behavior, set the raid.timeout option to a nonzero value so that the system shuts down when the timeout period expires.
How does ONTAP select spares for aggregate creation, aggregate addition and failed disk replacement?
Node shutdown with "monitor.shutdown.brokenDisk:EMERGENCY" error