What are the details of Network File System (NFS) Lock Recovery and Network Status Monitor?

Last updated

Feb 12, 2024


Category:: data-ontap-7

Specialty:: 7dot


Applies to

Data ONTAP 7 and earlier

Answer

Network Status Monitor (NSM) problems inhibit NFS services from starting after reboot or cluster failover.

Error message: [sm_recover]: no address for host [nfs_client1]

Error message: [sm_recover]: get RPC port for [host=unix1,prog=100024,ver=1,prot=17] failed

NFS lock recovery and Network Status Monitor

NFS versions 2 and 3 depend on the Network Lock Manager (NLM) protocol for file locking. A separate RPC protocol, the Network Status Monitor (NSM), is used to notify clients that lock state has been lost because of a server reboot. When an NFS server grants a lock to a client, it must maintain a record of the client that owns the lock. This record is kept on disk, but the individual lock state itself is not persistent: if the server reboots, the lock is lost. The client must be notified so that it can re-establish the lock when the NFS server is available again. The storage system NSM maintains its information as files in /etc/sm:

state:   state of the NSM
monitor: list of hosts currently being monitored
notify:  list of hosts being notified after a reboot

Upon reboot or cluster takeover, the filer reads the /etc/sm/monitor file to determine which clients held NLM file locks before the reboot or takeover. The clients to be notified are then copied into the /etc/sm/notify file, which is used as the notification list. The storage system notifies these clients via NSM that it has rebooted and lost all locks. Clients running an NSM daemon (rpc.statd/statd) then issue lock reclaim requests to rebuild the lock state that was lost during the storage system reboot. After the storage system reboots, there is an NLM grace period of 45 seconds during which the storage system does not honor any new lock requests; it honors only reclaim requests. The grace period gives all NFS clients that were holding locks the opportunity to reclaim them.
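The grace-period rule described above can be illustrated with a minimal Python sketch. The LockRequest type and the should_honor function are hypothetical names used only for illustration; they are not part of Data ONTAP. Behavior after the grace period is simplified here (conflict checking is left out).

```python
from dataclasses import dataclass

GRACE_PERIOD_SECONDS = 45  # NLM grace period stated in this article


@dataclass
class LockRequest:
    client: str
    reclaim: bool  # True when the client is re-establishing a lock it held before the reboot


def should_honor(request: LockRequest, seconds_since_restart: float) -> bool:
    """Apply the grace-period rule to a single lock request (illustrative model only)."""
    if seconds_since_restart < GRACE_PERIOD_SECONDS:
        # During the grace period only reclaim requests are honored.
        return request.reclaim
    # Assumption of this sketch: normal lock processing resumes afterwards.
    return True
```

For example, a new (non-reclaim) request arriving 10 seconds after the restart is refused, while a reclaim request arriving at the same moment is honored.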

Client or network problems may prevent the storage system NSM from notifying all the monitored clients. Each client that cannot be contacted will delay the startup of NFS file services after the reboot. The storage system will attempt to contact all the clients in the notify list before NFS services are completely started. The maximum timeout value for each client is 10 seconds.
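As a rough illustration of how these timeouts add up, the sketch below computes an upper bound on the startup delay. It assumes that clients are contacted one after another and that every unreachable client consumes its full 10-second timeout; the function name is hypothetical.

```python
PER_CLIENT_TIMEOUT_SECONDS = 10  # maximum timeout per client, as stated above


def worst_case_notify_delay(unreachable_clients: int,
                            timeout: int = PER_CLIENT_TIMEOUT_SECONDS) -> int:
    """Upper bound, in seconds, on the startup delay caused by unreachable clients.

    Assumes serial notification with a full timeout per unreachable client
    (an assumption of this sketch, not a documented guarantee).
    """
    if unreachable_clients < 0:
        raise ValueError("unreachable_clients must not be negative")
    return unreachable_clients * timeout


print(worst_case_notify_delay(12))  # 12 clients * 10 s = 120
```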

The following issues can prevent the storage system from notifying clients:

The client is down or no longer available on the network.
The client is not running an NSM daemon (rpc.statd on Linux, statd on Solaris).
There is a network connectivity or network equipment outage.
The storage system cannot resolve client hostnames because it cannot contact the DNS or NIS services.

Error Messages

Error message: [sm_recover]: no address for host [nfs_client1]

Error message: [sm_recover]: get RPC port for [host=unix1,prog=100024,ver=1,prot=17] failed
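When many such messages are logged, it can help to collect the affected host names in one pass. The following Python sketch does this with regular expressions modeled on the two messages quoted above. The exact log format is an assumption: the square brackets around the host name in the first message are treated as optional because they may be placeholders, and the function name and result keys are hypothetical.

```python
import re

# Patterns modeled on the two messages quoted in this article (format assumed).
NO_ADDRESS = re.compile(r"\[sm_recover\]: no address for host \[?([^\]\s]+)\]?")
NO_RPC_PORT = re.compile(r"\[sm_recover\]: get RPC port for \[host=([^,\]]+),.*?\] failed")


def hosts_with_nsm_errors(log_text: str) -> dict:
    """Group client host names by the kind of sm_recover failure logged."""
    return {
        # Host name could not be resolved to an address.
        "unresolvable": sorted(set(NO_ADDRESS.findall(log_text))),
        # The client's RPC port lookup (for example for program 100024) failed.
        "no_rpc_port": sorted(set(NO_RPC_PORT.findall(log_text))),
    }
```

The hosts in the first group are candidates for a DNS or NIS check; the hosts in the second group are candidates for a portmapper or statd check on the client.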

Checking for unavailable hosts in the /etc/sm/monitor file

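The Perl script read_monitor.pl below can be used to list the clients in a monitor file.

#!/usr/bin/perl
# Read the whole monitor file from STDIN in binary mode.
binmode(STDIN);
$file = join "", <STDIN>;

# Each match: a 2-byte status, a 2-byte name length, then a name ending in a NUL byte.
while ($file =~ /(..)(..)(.*?)\000/gs) {
    $status  = unpack("S", $1);
    $namelen = unpack("S", $2);
    print "$3\n" if $status == 1;   # print only entries marked as in use
}

To use it, pipe the monitor file to the script on STDIN: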
cat etc/sm/monitor | ./read_monitor.pl

It will display a list of all clients which have in_use set to one.
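For environments where Perl is not available, the same parsing can be sketched in Python. This is a direct port of the regular expression used by read_monitor.pl, not an independent description of the monitor file format; the function name is hypothetical.

```python
import re
import struct

# Same pattern as the Perl script: two 2-byte fields, then a name ending in a NUL byte.
ENTRY = re.compile(rb"(..)(..)(.*?)\x00", re.DOTALL)


def in_use_clients(data: bytes) -> list:
    """Return host names from entries whose first 16-bit field equals 1."""
    names = []
    for match in ENTRY.finditer(data):
        # "=H" is an unsigned 16-bit value in native byte order, like Perl's "S".
        (status,) = struct.unpack("=H", match.group(1))
        if status == 1:
            names.append(match.group(3).decode("ascii", errors="replace"))
    return names
```

A typical call would read the file in binary mode and pass its contents to the function, for example in_use_clients(open("etc/sm/monitor", "rb").read()).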
For each client in the list, check the following:
- The host responds to a ping from the filer.
- The client portmapper is functioning (rpcinfo -p hostname).
- The client rpc.statd/statd is running (rpcinfo -p hostname).
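The two rpcinfo checks can be automated by scanning the command output for the relevant RPC program numbers. The sketch below assumes the usual tabular output of rpcinfo -p (a header row followed by rows that start with a program number). Program 100024 is the status service named in the error message above, and 100000 is the standard portmapper program number; the function names are hypothetical.

```python
PORTMAPPER_PROG = 100000  # standard RPC program number for the portmapper
STATUS_PROG = 100024      # NSM (statd) program number, as seen in the error message


def registered_programs(rpcinfo_output: str) -> set:
    """Collect the RPC program numbers listed in `rpcinfo -p` output."""
    programs = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric program number; the header row does not.
        if fields and fields[0].isdigit():
            programs.add(int(fields[0]))
    return programs


def client_can_be_notified(rpcinfo_output: str) -> bool:
    """True when both the portmapper and the status service are registered."""
    programs = registered_programs(rpcinfo_output)
    return PORTMAPPER_PROG in programs and STATUS_PROG in programs
```

The output text could be captured on an administration host with subprocess.run(["rpcinfo", "-p", hostname], capture_output=True, text=True).stdout and passed to client_can_be_notified.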

If a client fails any of the checks above, correct the problem. If the client is permanently unavailable, remove it from the monitor list with the storage system advanced mode command sm_mon:
Enter priv set advanced
Enter sm_mon -u [client_name]
Enter priv set admin

Vfiler

The configuration of vfilers can affect the Network Status Monitor client notification process. Each vfiler maintains its own set of information files under the vfiler /etc/sm directory. After a reboot of the storage system, NFS services are restarted on each vfiler. Each vfiler must notify the NFS clients in its /etc/sm/notify file so that the locks can be reclaimed. A client that held locks on multiple vfilers must be notified via NSM by each vfiler, so a non-responding client that is shared among vfilers incurs a 10-second timeout for each vfiler. This can lengthen the overall startup time before all NFS file services are fully operational.

Storage System Cluster

In a storage system cluster configuration, a partner takeover or giveback operation is the same as a reboot of the cluster partner. The NFS clients that held locks on the affected storage system must be notified via the Network Status Monitor (NSM) as detailed in the sections above.

Because each service has an upper limit of 5 minutes to start during a cluster failover/takeover, delays in contacting the NLM clients can further slow the startup of the failed filer and can reduce the overall availability of the cluster.
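To gauge whether notification delays could approach the 5-minute limit, the worst case can be estimated per vfiler and summed. The sketch below assumes that timeouts accumulate serially across all vfilers and that the total is what counts against the limit; both are assumptions of this example. The vfiler names and function names are hypothetical.

```python
SERVICE_START_LIMIT_SECONDS = 5 * 60  # upper limit on service startup during failover/takeover
PER_CLIENT_TIMEOUT_SECONDS = 10       # maximum timeout per unreachable client


def takeover_delay(unreachable_per_vfiler: dict) -> int:
    """Worst-case notification delay in seconds, summed over all vfilers."""
    return sum(unreachable_per_vfiler.values()) * PER_CLIENT_TIMEOUT_SECONDS


def exceeds_start_limit(unreachable_per_vfiler: dict) -> bool:
    """True when the estimated delay is longer than the 5-minute startup limit."""
    return takeover_delay(unreachable_per_vfiler) > SERVICE_START_LIMIT_SECONDS


# Hypothetical counts of unreachable clients in each vfiler's notify list.
counts = {"vfiler0": 12, "vfiler1": 12, "vfiler2": 8}
print(takeover_delay(counts))       # 32 clients * 10 s = 320
print(exceeds_start_limit(counts))  # True, since 320 > 300
```

Under these assumptions, roughly 30 unreachable notify entries in total are enough to reach the limit, which is why stale clients are worth removing from the monitor list.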

Note: When you run the sm_mon commands, it is useful to verify that the locks have indeed been released. Use the lock command, which displays locks held through all protocols, that is, NLM, Common Internet File System (CIFS), NFSv4, and FIO. Depending on the Data ONTAP version, the command is lock status; in older versions of Data ONTAP, it is available as lock_dump.
