What are the details of Network File System (NFS) Lock Recovery and Network Status Monitor?
Applies to
Data ONTAP 7 and earlier
Answer
Network Status Monitor (NSM) problems inhibit NFS services from starting after reboot or cluster failover.
Error message: [sm_recover]: no address for host [nfs_client1]
Error message: [sm_recover]: get RPC port for [host=unix1,prog=100024,ver=1,prot=17] failed
NFS lock recovery and Network Status Monitor
NFS versions 2 and 3 depend on the Network Lock Manager (NLM) protocol for file locking. Another RPC protocol, the Network Status Monitor (NSM), is used to notify clients of a loss of lock state caused by a server reboot. When an NFS server grants a lock to a client, it must maintain a record of the client that owns the lock. This record is maintained on disk. The individual lock state itself is non-persistent: if the server reboots, the lock is lost. The client needs to be notified so that it can re-establish the lock when the NFS server is available again. The storage system NSM maintains its information as files in /etc/sm:
- state: state of the NSM
- monitor: list of hosts currently being monitored
- notify: list of hosts being notified after a reboot
Upon rebooting or cluster takeover, the filer reads the /etc/sm/monitor file to determine which clients held NLM file locks prior to the reboot or cluster takeover. These clients are then copied into the /etc/sm/notify file, which is used as the list of clients to notify. The storage system notifies the clients via NSM that it has rebooted and lost all locks. Clients running an NSM daemon (rpc.statd/statd) will issue lock reclaim requests to rebuild the lock state that was lost during the storage system reboot. When the storage system reboots, there is an NLM grace period of 45 seconds during which the storage system will not honor any new lock requests; it will only honor reclaim requests. The grace period gives all NFS clients that were holding locks the opportunity to reclaim their locks.
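The recovery sequence described above can be sketched as a small model. This is only an illustration of the described behavior, not the storage system's implementation: the function names are hypothetical, and what happens to requests after the grace period is labeled "normal processing" because the article does not describe it.

```python
GRACE_PERIOD_SECONDS = 45  # NLM grace period stated in the article


def build_notify_list(monitored_clients):
    """Model copying the monitor list into the notify list at restart.

    Order is preserved and duplicate host names are dropped (an assumption
    made for this sketch; the on-disk handling is not described).
    """
    return list(dict.fromkeys(monitored_clients))


def classify_lock_request(is_reclaim, seconds_since_restart):
    """Model the grace-period rule for an incoming NLM lock request."""
    if seconds_since_restart < GRACE_PERIOD_SECONDS:
        # During the grace period only reclaim requests are honored.
        return "honored" if is_reclaim else "not honored"
    # The article does not describe behavior after the grace period.
    return "normal processing"


notify = build_notify_list(["unix1", "nfs_client1", "unix1"])
print(notify)
print(classify_lock_request(is_reclaim=False, seconds_since_restart=10))
```

In this model, a new lock request that arrives 10 seconds after the restart is not honored, while a reclaim request from a client in the notify list is.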
Client or network problems may prevent the storage system NSM from notifying all the monitored clients. Each client that cannot be contacted will delay the startup of NFS file services after the reboot. The storage system will attempt to contact all the clients in the notify list before NFS services are completely started. The maximum timeout value for each client is 10 seconds.
The following issues can prevent the storage system from notifying clients:
- The client is down or no longer available on the network.
- The client is not running an NSM daemon (rpc.statd on Linux, statd on Solaris).
- There is a network connectivity or network equipment outage.
- The storage system cannot resolve client hostnames because it cannot contact the DNS or NIS services.
Error Messages
Error message: [sm_recover]: no address for host [nfs_client1]
Error message: [sm_recover]: get RPC port for [host=unix1,prog=100024,ver=1,prot=17] failed
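When many such messages are logged, it can help to extract the affected host names programmatically. The following is a minimal sketch, assuming the messages appear in the form shown above (the square brackets are treated as optional, since they may be placeholders). The function name and the returned dictionary keys are hypothetical. RPC program 100024 is the status monitor (NSM) program, and protocol number 17 is UDP.

```python
import re

RPC_PROGRAMS = {100024: "status (NSM)"}
IP_PROTOCOLS = {6: "tcp", 17: "udp"}

PORT_FAILED = re.compile(
    r"\[?sm_recover\]?: get RPC port for "
    r"\[?host=([^,\]]+),prog=(\d+),ver=(\d+),prot=(\d+)\]? failed"
)
NO_ADDRESS = re.compile(r"\[?sm_recover\]?: no address for host \[?([^\]\s]+)\]?")


def parse_sm_recover(line):
    """Return a dict describing an sm_recover message, or None if no match."""
    match = PORT_FAILED.search(line)
    if match:
        prog, ver, prot = (int(match.group(i)) for i in (2, 3, 4))
        return {
            "kind": "rpc_port_lookup_failed",
            "host": match.group(1),
            "program": RPC_PROGRAMS.get(prog, str(prog)),
            "version": ver,
            "transport": IP_PROTOCOLS.get(prot, str(prot)),
        }
    match = NO_ADDRESS.search(line)
    if match:
        return {"kind": "no_address", "host": match.group(1)}
    return None


print(parse_sm_recover("[sm_recover]: no address for host [nfs_client1]"))
```

A "no address" result most likely points to a host name resolution problem, and a failed port lookup to a client that is down or not running its NSM daemon, but both should be confirmed with the checks in the next section.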
Checking for unavailable hosts in the /etc/sm/monitor file
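- The Perl script read_monitor.pl below can be used to list the clients in a monitor file:

    #!/usr/bin/perl
    # Read the whole monitor file from STDIN as raw bytes.
    binmode(STDIN);
    $file = join "", <STDIN>;
    # The pattern treats each record as a 2-byte status, a 2-byte name
    # length, and a NUL-terminated host name.
    while ($file =~ /(..)(..)(.*?)\000/gs) {
        $status  = unpack("S", $1);
        $namelen = unpack("S", $2);    # decoded but not otherwise used
        print "$3\n" if $status == 1;  # print only hosts with in_use set
    }

- To use it, send the monitor file to the script on STDIN:

    cat etc/sm/monitor | ./read_monitor.pl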
The script displays a list of all clients that have in_use set to one.
- Check each client in the list for the following:
  - The host responds to ping from the filer.
  - The client portmapper is functioning (rpcinfo -p hostname).
  - The client rpc.statd/statd is running (rpcinfo -p hostname).
  If a client fails any of the checks above, the problem should be corrected. If the client is permanently unavailable, it can be removed from the monitor file using the sm_mon command.
- The storage system advanced mode command sm_mon can be used to remove a host from the monitor list:
Enter priv set advanced
Enter sm_mon -u [client_name]
Enter priv set admin
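The portmapper and statd checks above both rely on the output of rpcinfo -p. As a rough sketch, the output can be scanned for the relevant RPC program numbers: 100000 for the portmapper and 100024 for the status monitor. The function names and the sample output below are illustrative assumptions, and real port numbers will differ per client.

```python
PORTMAPPER_PROGRAM = 100000  # RPC program number of the portmapper
STATUS_PROGRAM = 100024      # RPC program number of the status monitor (statd)

# Illustrative "rpcinfo -p hostname" output; not captured from a real client.
SAMPLE_OUTPUT = """\
   program vers proto   port  service
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  32768  status
    100024    1   tcp  32768  status
"""


def registered_programs(rpcinfo_output):
    """Collect the RPC program numbers listed in rpcinfo -p output."""
    programs = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Skip the header row and blank lines; data rows start with a number.
        if fields and fields[0].isdigit():
            programs.add(int(fields[0]))
    return programs


def check_client(rpcinfo_output):
    """Report whether the portmapper and statd appear to be registered."""
    programs = registered_programs(rpcinfo_output)
    return {
        "portmapper": PORTMAPPER_PROGRAM in programs,
        "statd": STATUS_PROGRAM in programs,
    }


print(check_client(SAMPLE_OUTPUT))
```

A client for which "statd" is reported as False is a candidate for either starting its NSM daemon or, if it is permanently gone, removal with sm_mon -u as described above.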
Vfiler
The configuration of vfilers can impact the Network Status Monitor client notification process. Each vfiler maintains its own set of information files under the vfiler /etc/sm directory. After a reboot of the storage system, NFS services are restarted on each vfiler. Each vfiler must notify the NFS clients in its /etc/sm/notify file so that the locks can be reclaimed. A non-responding client that is shared among vfilers incurs a 10 second timeout for each vfiler. A client that held locks on multiple vfilers must be notified via NSM by each vfiler. This could increase the overall time before all NFS file services are fully operational.
Storage System Cluster
In a storage system cluster configuration, a partner takeover or giveback operation is the same as a reboot of the cluster partner. The NFS clients that held locks on the affected storage system must be notified via the Network Status Monitor (NSM) as detailed in the sections above.
Because of the upper limit of 5 minutes on the startup of each service during a cluster failover/takeover, the delay in contacting the NLM clients could further delay the startup of the failed filer and could impact the overall availability of the cluster.
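The combined effect of the per-client timeout, multiple vfilers, and the 5 minute limit can be estimated with simple arithmetic. The sketch below assumes that unreachable clients are contacted one after another and that every timeout counts against the same limit. The article does not state either point explicitly, so treat the result as a worst-case estimate. The function names and vfiler names are hypothetical.

```python
CLIENT_TIMEOUT_SECONDS = 10             # maximum wait per unreachable client
SERVICE_STARTUP_LIMIT_SECONDS = 5 * 60  # per-service limit on failover/takeover


def notification_delay(unreachable_by_vfiler):
    """Worst-case delay in seconds.

    unreachable_by_vfiler maps a vfiler name to the set of unreachable
    clients in that vfiler's notify list. A client shared by several
    vfilers is counted once per vfiler.
    """
    total_timeouts = sum(len(clients) for clients in unreachable_by_vfiler.values())
    return total_timeouts * CLIENT_TIMEOUT_SECONDS


def exceeds_startup_limit(unreachable_by_vfiler):
    """True if the estimated delay is longer than the startup limit."""
    return notification_delay(unreachable_by_vfiler) > SERVICE_STARTUP_LIMIT_SECONDS


example = {
    "vfiler0": {"unix1", "nfs_client1"},
    "vfiler1": {"unix1"},  # unix1 is shared, so it times out twice
}
print(notification_delay(example))      # 30
print(exceeds_startup_limit(example))   # False
```

Under these assumptions, more than 30 timeouts in total would exceed the 5 minute limit, which is why stale entries are worth removing from the monitor file before a planned takeover or giveback.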
Note: When you run the sm_mon commands, it is useful to verify that the locks have indeed been released. The lock command displays locks held through all protocols, i.e. NLM, Common Internet File System protocol (CIFS), NFSv4, and FIO. Depending on the Data ONTAP version, this is available as lock status; in older versions of Data ONTAP, it is available in the form of lock_dump.