Shared storage pool is unhealthy
Applies to
ONTAP
Answer
The following message is displayed when a storage pool becomes unhealthy:
Dec 14 04:06:21 [cluster01-01:raid.sp.unhealthy:notice]: Storage pool sp1 is unhealthy. Reason: One of the aggregates belonging to the storage pool is not in normal state.
When this error occurs, check if there are any SSD disk failures in the storage pool:
cluster01::> storage pool show -storage-pool sp1 -instance
Storage Pool Name: sp1
UUID of Storage Pool: 84afe3e1-a215-11e5-ac48-00a09854bc10
Nodes Sharing the Storage Pool: cluster01-01, cluster01-02
Number of Disks in Storage Pool: 22
Allocation Unit Size: 1023GB
Allocation Unit Data Size for RAID4: 976.6GB
Allocation Unit Data Size for RAID-DP: 930.1GB
Allocation Unit Data Size for RAID-TEC: 883.6GB
Storage Type: SSD
Storage Pool Usable Size: 2.00TB
Storage Pool Total Size: 4.00TB
Is Pool Healthy?: false
State of the Storage Pool: degraded
Reason for Storage Pool Being Unhealthy: One of the aggregates belonging to the storage pool is not in normal state.
Job ID of the Currently Running Operation: -
Is Allocation Unit Broken?: false
cluster01::>
Also check the state of the affected aggregate. In this example, a RAID reconstruction is in progress in one of its RAID groups:
cluster01::storage pool*> run local aggr status -r aggr1
Aggregate aggr1 (online, raid_dp, reconstruct, hybrid) (block checksums)
Plex /aggr1/plex0 (online, normal, active, pool0)
RAID group /aggr1/plex0/rg0 (normal, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 2b.64 2b 4 0 FC:B 0 FCAL 15000 272000/557056000 280104/573653840
parity 2a.50 2a 3 2 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 2a.34 2a 2 2 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 2a.18 2a 1 2 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 2a.65 2a 4 1 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
RAID group /aggr1/plex0/rg1 (reconstruction 74% completed, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0a.30.4P1 0a 30 4 SA:A 0 SSD N/A 47619/97525248 47627/97541632
parity 0a.30.11P1 0a 30 11 SA:A 0 SSD N/A 47619/97525248 47627/97541632
data 0b.10.22P1 0b 10 22 SA:B 0 SSD N/A 47619/97525248 47627/97541632 (reconstruction 74% completed)
data 0a.30.5P1 0a 30 5 SA:A 0 SSD N/A 47619/97525248 47627/97541632
The unhealthy state should change to normal when the reconstruction completes.
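Both views can be re-checked periodically until the repair finishes; for example (output omitted, and values such as the reconstruction percentage will differ on each system):
cluster01::> run local aggr status -r aggr1
cluster01::> storage pool show -storage-pool sp1 -instance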
However, if the previously failed disk is reinserted into the system while it is still partitioned, the storage pool remains unhealthy and reports the following state:
cluster01::storage pool*> show -storage-pool sp1 -instance
Storage Pool Name: sp1
UUID of Storage Pool: 84afe3e1-a215-11e5-ac48-00a09854bc10
Nodes Sharing the Storage Pool: cluster01-01, cluster01-02
Number of Disks in Storage Pool: 22
Allocation Unit Size: 1023GB
Allocation Unit Data Size for RAID4: 976.6GB
Allocation Unit Data Size for RAID-DP: 930.1GB
Allocation Unit Data Size for RAID-TEC: 883.6GB
Storage Type: SSD
Storage Pool Usable Size: 2.00TB
Storage Pool Total Size: 4.00TB
Is Pool Healthy?: false
State of the Storage Pool: degraded
Reason for Storage Pool Being Unhealthy: Storage pool has more number of disks than expected.
Job ID of the Currently Running Operation: -
Is Allocation Unit Broken?: false
cluster01::storage pool*>
The following message is displayed:
cluster01::storage pool*> Dec 14 04:39:44 [cluster01-01:raid.sp.unhealthy:notice]: Storage pool sp1 is unhealthy. Reason: Storage pool has more number of disks than expected.
The storage pool will remain in this state until the previously failed and replaced SSD is either physically removed from the system or unpartitioned.
Additional Information
Perform the following steps to manually unpartition the SSD that was replaced out of the storage pool and then reinserted.
Note: A disk can be removed from the storage pool only if it is not in use by any aggregate.
Exercise extra caution or contact NetApp Technical Support before performing the following steps.
- Run the storage pool show-disks -storage-pool <sp name> command to list the drives that currently belong to the storage pool.
- Select the drive that was replaced by reconstruction and needs to be removed from the storage pool.
- Run the storage disk show -disk <disk name> -fields diskpathnames,owner command to determine the owner node of the shared drive and the local name of the drive on that node. The diskpathnames field reports names in the hostname:localname format.
- Run the storage disk partition show -container-disk <disk name> -fields owner-node-name command to list all partitions on the shared drive and their owner nodes.
- For partitions whose owner differs from the disk owner, change ownership by running the storage disk partition removeowner -partition <partition name> and storage disk partition assign -partition <partition name> -owner <disk owner> commands. After this step, the disk and all of its partitions should have the same owner.
- Drop to the node shell of the owner node.
- Run the disk unpartition <disk name> command. An illustrative example session is shown after this list.
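The following is a minimal example session putting these commands together, assuming the replaced disk is owned by cluster01-01. The disk name 1.0.23, partition name 1.0.23.P1, and local name 0a.30.23 are hypothetical placeholders and must be replaced with the values reported by the show commands on your system; output is omitted. Repeat the removeowner/assign pair for every partition whose owner differs from the disk owner. Here the node shell is reached with run -node <node>, the clustershell alias for system node run.
cluster01::> storage pool show-disks -storage-pool sp1
cluster01::> storage disk show -disk 1.0.23 -fields diskpathnames,owner
cluster01::> storage disk partition show -container-disk 1.0.23 -fields owner-node-name
cluster01::> storage disk partition removeowner -partition 1.0.23.P1
cluster01::> storage disk partition assign -partition 1.0.23.P1 -owner cluster01-01
cluster01::> run -node cluster01-01
cluster01-01> disk unpartition 0a.30.23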
Once the disk is unpartitioned, it comes back as a spare disk, and the storage pool should then report a healthy state.
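As a final check, the disk and pool state can be confirmed with the same commands used earlier (the disk name 1.0.23 is again a hypothetical placeholder; output omitted):
cluster01::> storage disk show -disk 1.0.23
cluster01::> storage pool show -storage-pool sp1 -instance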