Solaris host support considerations in a MetroCluster configuration

Last updated

Jul 23, 2021

Category:: metrocluster

Specialty:: metrocluster

Last Updated:: 7/23/2021, 1:23:42 PM

Applies to

MetroCluster
ONTAP 9

Answer

By default, the Solaris OS can survive an 'All Paths Down' (APD) condition for up to 20 seconds; this is controlled by the fcp_offline_delay parameter.
For Solaris hosts to continue without disruption during all MetroCluster workflows, such as Negotiated Switchover, Switchback, Tiebreaker Unplanned Switchover, and Automated Unplanned Switchover, set fcp_offline_delay to 120s.

Important MetroCluster Support Considerations:

Host response to local HA failover	When the fcp_offline_delay value is increased, application service resumption time increases during a local HA failover (such as a node panic followed by the surviving node taking over the panicked node). For example, with fcp_offline_delay = 120s, a Solaris client can take up to 120s to resume the application service.
FCP error handling	With the default value of fcp_offline_delay, when the initiator port connection fails, the fcp driver takes 110s to notify the upper layers (MPxIO). Once fcp_offline_delay is increased to 120s, the total time taken by the driver to notify the upper layers (MPxIO) is 210s; this may cause an I/O delay. Refer to Oracle Doc ID 1018952.1. When a fibre channel port fails, an additional 110 second delay may be seen before the device is offlined.
Co-existence with third-party arrays	The fcp_offline_delay parameter is global, so changing it may affect the interaction with all storage connected to the FCP driver.
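
The two notification times quoted under FCP error handling (110s at the default 20s, 210s at 120s) are consistent with a fixed 90s on top of fcp_offline_delay. That 90s figure is only inferred from those two numbers and is not documented in this article, so treat the sketch below as a rough estimate. It assumes a POSIX shell (for example ksh or bash); the function name is hypothetical.

```shell
#!/bin/sh
# Assumption: fixed overhead inferred from the two figures in this article
# (110 - 20 = 90 and 210 - 120 = 90). It is not an Oracle-documented constant.
FCP_FIXED_OVERHEAD_S=90

# Hypothetical helper: estimate how long the fcp driver takes to notify
# MPxIO after an initiator port failure, for a given fcp_offline_delay.
estimate_notify_time() {
    echo $(( $1 + FCP_FIXED_OVERHEAD_S ))
}

echo "fcp_offline_delay=20s  -> about $(estimate_notify_time 20)s"
echo "fcp_offline_delay=120s -> about $(estimate_notify_time 120)s"
```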

How to modify the setting for the fcp_offline_delay.

For Solaris 10u8, 10u9, 10u10 and 10u11:
fcp_offline_delay can be set in the /kernel/drv/fcp.conf file. Adding the following line will change the timer to 120s.
fcp_offline_delay = 120;
The host should be rebooted for the setting to take effect.
Once the host is up, check if the kernel has the parameters set:
# mdb -k
> fcp_offline_delay/D
fcp_offline_delay:
fcp_offline_delay:              120
> Ctrl_D
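
The mdb check above can also be run non-interactively, because mdb reads commands from standard input. The following is a minimal sketch; the function name parse_fcp_offline_delay is hypothetical, and it assumes the output has the same two-line shape as the session shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `mdb` "/D" output on standard input and print
# the integer value of fcp_offline_delay.
# Assumed input format (as in the session above):
#   fcp_offline_delay:
#   fcp_offline_delay:              120
parse_fcp_offline_delay() {
    awk '$1 == "fcp_offline_delay:" && NF == 2 { value = $2 }
         END { if (value != "") print value }'
}

# Possible usage on a Solaris host, as root (not executed here):
#   current=$(echo 'fcp_offline_delay/D' | mdb -k | parse_fcp_offline_delay)
#   [ "$current" = "120" ] && echo "fcp_offline_delay is 120s"
```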

For Solaris 11:
fcp_offline_delay can be set in the /etc/driver/drv/fcp.conf file. Adding the following line will change the timer to 120s.
fcp_offline_delay = 120;
The host should be rebooted for the setting to take effect.
Once the host is up, check if the kernel has the parameters set:
# mdb -k
> fcp_offline_delay/D
fcp_offline_delay:
fcp_offline_delay:              120
> Ctrl_D
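
The Solaris 10 and Solaris 11 procedures differ only in the path of fcp.conf, so the edit can be sketched as one helper that takes the path as an argument. This is an illustrative sketch, not a supported tool: the function name set_fcp_offline_delay is hypothetical, and the pattern uses POSIX character classes, which on Solaris 10 may require /usr/xpg4/bin/grep rather than /usr/bin/grep.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solaris or ONTAP): set fcp_offline_delay
# in a driver configuration file, replacing any existing entry so the
# parameter appears exactly once.
set_fcp_offline_delay() {
    conf_file=$1
    delay=$2

    # Accept only a non-empty string of digits.
    case $delay in
        ''|*[!0-9]*)
            echo "delay must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    tmp_file="${conf_file}.new.$$"

    # Copy every line except an existing fcp_offline_delay assignment.
    # grep exits non-zero when it prints nothing, so its status is ignored.
    if [ -f "$conf_file" ]; then
        grep -v '^[[:space:]]*fcp_offline_delay[[:space:]]*=' "$conf_file" > "$tmp_file" || true
    else
        : > "$tmp_file"
    fi

    echo "fcp_offline_delay = ${delay};" >> "$tmp_file"

    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
}

# Possible usage as root (not executed here); back up the file first:
#   cp -p /kernel/drv/fcp.conf /kernel/drv/fcp.conf.bak      # Solaris 10
#   set_fcp_offline_delay /kernel/drv/fcp.conf 120
#   set_fcp_offline_delay /etc/driver/drv/fcp.conf 120       # Solaris 11
```

A reboot is still required afterwards, as described above.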

Host Recovery example:

If a disaster failover or an unplanned Switchover takes abnormally long (more than 120s), host applications may fail. Review the example below before remediating the host applications:

Zpool Recovery:

Ensure all the LUNs are online.

Run the following commands:

# zpool list
NAME            SIZE   ALLOC  FREE   CAP  HEALTH   ALTROOT
n_zpool_site_a  99.4G  1.31G  98.1G  1%   OFFLINE  -
n_zpool_site_b  124G   2.28G  122G   1%   OFFLINE  -

Check the individual pool status:

# zpool status n_zpool_site_b
  pool: n_zpool_site_b
 state: SUSPENDED    <== POOL SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
  scan: none requested
config:

NAME                                     STATE     READ  WRITE  CKSUM
n_zpool_site_b                           UNAVAIL      1  1.64K      0  experienced I/O failures
  c0t600A098051764656362B45346144764Bd0  UNAVAIL      1      0      0  experienced I/O failures
  c0t600A098051764656362B453461447649d0  UNAVAIL      1     40      0  experienced I/O failures
  c0t600A098051764656362B453461447648d0  UNAVAIL      0     38      0  experienced I/O failures
  c0t600A098051764656362B453461447647d0  UNAVAIL      0     28      0  experienced I/O failures
  c0t600A098051764656362B453461447646d0  UNAVAIL      0     34      0  experienced I/O failures
  c0t600A09805176465657244536514A7647d0  UNAVAIL      0  1.03K      0  experienced I/O failures
  c0t600A098051764656362B453461447645d0  UNAVAIL      0     32      0  experienced I/O failures
  c0t600A098051764656362B45346144764Ad0  UNAVAIL      0     34      0  experienced I/O failures
  c0t600A09805176465657244536514A764Ad0  UNAVAIL      0  1.03K      0  experienced I/O failures
  c0t600A09805176465657244536514A764Bd0  UNAVAIL      0  1.04K      0  experienced I/O failures
  c0t600A098051764656362B45346145464Cd0  UNAVAIL      1      2      0  experienced I/O failures

The above pool is suspended because its devices experienced I/O failures.

Run the following command to clear the pool status:

# zpool clear n_zpool_site_b

Check the pool again:

# zpool status n_zpool_site_b
  pool: n_zpool_site_b
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: none requested
config:

NAME                                     STATE     READ  WRITE  CKSUM
n_zpool_site_b                           ONLINE       0      0      0
  c0t600A098051764656362B45346144764Bd0  ONLINE       0      0      0
  c0t600A098051764656362B453461447649d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447648d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447647d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447646d0  ONLINE       0      0      0
  c0t600A09805176465657244536514A7647d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447645d0  ONLINE       0      0      0
  c0t600A098051764656362B45346144764Ad0  ONLINE       0      0      0
  c0t600A09805176465657244536514A764Ad0  ONLINE       0      0      0
  c0t600A09805176465657244536514A764Bd0  ONLINE       0      0      0
  c0t600A098051764656362B45346145464Cd0  ONLINE       0      0      0

errors: 1679 data errors, use '-v' for a list

Check the pool status again; here a disk in the pool is degraded.

[22] 05:44:07 (root@host1) / # zpool status n_zpool_site_b -v
cannot open '-v': name must begin with a letter
  pool: n_zpool_site_b
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Dec 4 05:44:17 2015
config:

NAME                                     STATE     READ  WRITE  CKSUM
n_zpool_site_b                           DEGRADED     0      0      0
  c0t600A098051764656362B45346144764Bd0  ONLINE       0      0      0
  c0t600A098051764656362B453461447649d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447648d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447647d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447646d0  ONLINE       0      0      0
  c0t600A09805176465657244536514A7647d0  DEGRADED     0      0      0  too many errors
  c0t600A098051764656362B453461447645d0  ONLINE       0      0      0
  c0t600A098051764656362B45346144764Ad0  ONLINE       0      0      0
  c0t600A09805176465657244536514A764Ad0  ONLINE       0      0      0
  c0t600A09805176465657244536514A764Bd0  ONLINE       0      0      0
  c0t600A098051764656362B45346145464Cd0  ONLINE       0      0      0

errors: No known data errors

Clear the disk error by running the following command:

# zpool clear n_zpool_site_b c0t600A09805176465657244536514A7647d0

[24] 05:45:17 (root@host1) / # zpool status n_zpool_site_b -v
cannot open '-v': name must begin with a letter
  pool: n_zpool_site_b
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Dec 4 05:44:17 2015
config:

NAME                                     STATE     READ  WRITE  CKSUM
n_zpool_site_b                           ONLINE       0      0      0
  c0t600A098051764656362B45346144764Bd0  ONLINE       0      0      0
  c0t600A098051764656362B453461447649d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447648d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447647d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447646d0  ONLINE       0      0      0
  c0t600A09805176465657244536514A7647d0  ONLINE       0      0      0
  c0t600A098051764656362B453461447645d0  ONLINE       0      0      0
  c0t600A098051764656362B45346144764Ad0  ONLINE       0      0      0
  c0t600A09805176465657244536514A764Ad0  ONLINE       0      0      0
  c0t600A09805176465657244536514A764Bd0  ONLINE       0      0      0
  c0t600A098051764656362B45346145464Cd0  ONLINE       0      0      0

errors: No known data errors

Or export and import the zpool:

# zpool export n_zpool_site_b
# zpool import n_zpool_site_b

The pool is now online.
If the above steps do not recover the pool, reboot the host.
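
When a host has many pools, the first step of the recovery above (finding which pools need attention) can be scripted. The following is a minimal sketch, assuming the scripted output mode of zpool list (-H with -o name,health), which prints one tab-separated line per pool. The function name pools_needing_attention is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "name<TAB>health" lines on standard input and
# print the names of pools whose health is anything other than ONLINE.
pools_needing_attention() {
    awk 'NF >= 2 && $2 != "ONLINE" { print $1 }'
}

# Possible usage on the host (not executed here). Confirm that all LUNs are
# online before clearing any pool:
#   zpool list -H -o name,health | pools_needing_attention |
#   while read -r pool; do
#       zpool status "$pool"
#   done
```

The sketch only reports pools; running zpool clear is left as a manual step so that each pool's status can be reviewed first.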

Solaris Volume Manager (SVM) (metaset)
Ensure all the LUNs are online, reboot the system, and then mount the Solaris Volume Manager (SVM) metaset.

Additional Information

N/A