SUMMARY: Strange NFS Server Behavior

From: Rick Fincher (rnf@spitfire.tbird.com)
Date: Wed Jul 02 1997 - 16:26:17 CDT


Hi Managers,

The original problem was repeated lines like the following on an NFS client:

  NFS Server <server> not responding
  NFS Server ok
  
I'm pretty ignorant of NFS troubleshooting techniques since we have a small net and
rarely have problems. So, all the replies were informative.

Several replies said to look for hardware problems. We eliminated that possibility
but the replies might be useful to you. They are listed first below.

Several very informative replies on NFS follow the net troubleshooting replies,
including an excerpt from a Sun FAQ on NFS that is only available on sunsolve to
folks on a maintenance contract.

I was a little surprised at the suggestion to increase the number of NFS daemons on
the server. When we transitioned from SunOS to Solaris I was told that this wasn't
necessary under Solaris. I guess I was told wrong. Solaris automatically starts the
daemons as necessary but there is a default limit that can be increased.

Our problem turned out to be the client rather than the server or the net. Using some
of the techniques below we found the server to be in good shape.

Something got hosed in the client machine's system (the net hardware was OK) and
rebooting solved the problem. Inelegant, but it worked. The client was a Sparc2 used
primarily as an X-Terminal, so rebooting was not a problem.

The giveaway was that we were only seeing the messages on one client system instead
of all of the clients of the NFS server, like you would normally see with a server
problem.

Thanks to:

bismark@alta.Jpl.Nasa.Gov (Bismark Espinoza)
"Bullock, Marty" <Marty.Bullock@sea.siemens.com>
Kevin.Sheehan@uniq.com.au (Kevin Sheehan {Consulting Poster Child})
Leonard Sitongia <sitongia@jabba.hao.ucar.edu>

Rick Fincher

The responses follow:

----------------

...run snoop and look for network problems including bad transceivers or wires.
----------------------

...try nfsstat -c
and look at the retrans/badxid fields. If badxid is any significant portion
of retrans, it means the server is slow getting back with requests. If
not, it's dropped packets, which could be either a bad wire or an overloaded
client network interface.
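
The retrans/badxid rule of thumb above can be written down as a small
check. This is only a sketch: the function name and the 50% cutoff for
"any significant portion" are assumptions made for illustration, not
values taken from nfsstat or from Sun. The two counters are read by hand
from the nfsstat -c output.

```python
def diagnose_retrans(retrans, badxid, significant=0.5):
    """Apply the retrans/badxid rule of thumb to counters from `nfsstat -c`.

    `significant` is an assumed cutoff for "a significant portion".
    """
    if retrans == 0:
        return "no retransmissions"
    if badxid / retrans >= significant:
        # Duplicate replies arrived: the server did answer, just too late.
        return "slow server"
    # Requests or replies vanished: suspect the wire or the client interface.
    return "dropped packets"


print(diagnose_retrans(retrans=200, badxid=180))  # slow server
print(diagnose_retrans(retrans=200, badxid=3))    # dropped packets
```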

One major thing that I did to help out was increase the number of nfsd daemon
processes. If you edit the /etc/init.d/nfs.server file, you'll find
that the default is 16. Look for the following line:
        /usr/lib/nfs/nfsd -a 16
Sun recommends 2-3 processes per user. You can increase this number to
several hundred with little CPU overhead. I've been running this at 128
and just recently moved it up to 256 (just to see what impact it has).
My server (an Ultra 170e) is currently serving 20-25 SGI workstations
which mount NFSv3, and about a hundred PCs.
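
As a rough illustration of the 2-3 processes per user rule of thumb, the
arithmetic looks like this. The helper name is made up, and counting each
client machine as one user (about 125 here) is an assumption.

```python
def nfsd_range(users):
    """Return the (low, high) nfsd count for the 2-3 per user rule of thumb."""
    return 2 * users, 3 * users


low, high = nfsd_range(125)  # roughly 25 workstations plus 100 PCs
print(f"/usr/lib/nfs/nfsd -a {low}   (rule of thumb allows up to {high})")
```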

I've also noticed that a lot of the SGIs have a default timeout value
of 3 in the /etc/fstab files. I increased this to 11 (1.1 seconds) and
most of my problems have disappeared. I still have a couple of IRIX 6.2
workstations that just refuse to work properly with NFSv3, so I had to
force them to mount version 2 instead until I work out the bugs.

------------------------

Q: Why do I get the following error message:

  NFS Server <server> not responding
  NFS Server ok

  Note, this error will occur when using HARD mounts.
  This troubleshooting section applies to HARD or SOFT mounts.

A1: If this problem is happening intermittently, while some NFS
traffic is occurring, though slowly, you have run into the performance
limitations of either your current network setup or your current NFS
server. This issue is beyond the scope of what SunService can support.
Consult sections 7.4 & 7.5 for some excellent references that can help you
tune NFS performance. Section 9.0 can point you to where you can get
additional support on this issue from Sun.

A2: If the problem lasts for an extended period of time, during which
no NFS traffic at all is going through, it is possible that your NFS
server is no longer available.

You can verify that the server is still responding by running the commands:

  # ping server
and
  # ping -s server 8000 10
(this will send 10 8k ICMP Echo request packets to the server)

If your machine is not available by ping, you will want to check the
server machine's health, your network connections and your routing.

If the ping works, check to see that the NFS server's nfsd and
mountd are responding with the "rpcinfo" command:

   # rpcinfo -u server nfs

program 100003 version 2 ready and waiting

   # rpcinfo -u server mountd

program 100005 version 1 ready and waiting
program 100005 version 2 ready and waiting

If there is no response, go to the NFS server and find out why
the nfsd and/or mountd are not working over the network. From
the server, run the same commands. If they work OK from the
server, the network is the culprit. If they do NOT work,
check to see if they are running. If not, restart them and
repeat this process. If either nfsd or mountd IS running but
does not respond, then kill it and restart it and retest.
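
The checks in A2 form a small decision tree. The sketch below only
restates that sequence; the function and argument names are hypothetical,
and each argument is the result of running the corresponding ping or
rpcinfo command by hand.

```python
def next_step(ping_ok, rpc_ok_from_client, rpc_ok_on_server, daemons_running):
    """Walk the A2 checks in order and return the suggested action."""
    if not ping_ok:
        return "check server health, network connections and routing"
    if rpc_ok_from_client:
        return "nfsd and mountd answer over the network; look elsewhere"
    if rpc_ok_on_server:
        # rpcinfo works locally on the server but not from the client.
        return "the network is the culprit"
    if not daemons_running:
        return "restart nfsd/mountd and repeat the checks"
    # A daemon is running but does not answer rpcinfo.
    return "kill the hung daemon, restart it and retest"


print(next_step(ping_ok=True, rpc_ok_from_client=False,
                rpc_ok_on_server=True, daemons_running=True))
# the network is the culprit
```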

A3: Some older bugs might have caused this symptom. Make sure that you
have the most up-to-date Core NFS patches on the NFS server.
These are listed in Section 5.0 below. In addition, if you are running
quad ethernet cards on Solaris, install the special quad
ethernet patches listed in Section 5.4.

A4: Try cutting down the NFS read and write size with the NFS mount
options: rsize=1024,wsize=1024. This will eliminate problems with
packet fragmentation across WANs, routers, hubs, and switches in a
multivendor environment, until the root cause can be pin-pointed.
THIS IS THE MOST COMMON RESOLUTION TO THIS PROBLEM.
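
To see why a 1024-byte transfer size avoids fragmentation, here is a
back-of-the-envelope estimate. It assumes NFS over UDP on IPv4, an
Ethernet MTU of 1500, an 8192-byte transfer size for comparison, and a
128-byte allowance for RPC and NFS headers; none of these figures come
from the FAQ itself.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options
UDP_HEADER = 8   # bytes


def ip_fragments(transfer_size, mtu=1500, rpc_overhead=128):
    """Estimate the IPv4 fragments needed for one NFS-over-UDP transfer.

    rpc_overhead is an assumed allowance for RPC and NFS headers.
    """
    datagram = UDP_HEADER + rpc_overhead + transfer_size
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(datagram / per_fragment)


for size in (8192, 1024):
    print(size, ip_fragments(size))
# 8192 6
# 1024 1
```

Under these assumptions an 8K transfer is lost if any one of its six
fragments is dropped along the way, while a 1K transfer fits in a single
packet.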

A5: If the NFS server is Solaris 2.3 or 2.4, 'nfsreadmap' occasionally
caused the "NFS server not responding" message on Sun and non-Sun
NFS clients. You can resolve this by adding the following entry to
your /etc/system file on the NFS server:

set nfs:nfsreadmap=0

Then reboot the machine. The nfsreadmap function was removed in 2.5
because it really didn't work.

A6: If you are using FDDI on Solaris, you must enable fragmentation
with the command:
ndd -set /dev/ip ip_path_mtu_discovery 0

Add this to /etc/init.d/inetinit, after the other ndd command on line 18.

A7: Another possible cause applies if the NFS SERVER is Ultrix, old AIX,
Stratus, or older SGI, and you get this error ONLY on Solaris 2.4 and 2.5
clients while the 2.3 and 4.X clients are OK.

The NFS Version 2 and 3 protocols allow for the NFS READDIR request to be
1048 bytes in length. Some older implementations incorrectly thought the
request had a max length of 1024. To work around this, either mount
those problem servers with rsize=1024,wsize=1024 or add the following
to the NFS client's /etc/system file and reboot:

set nfs:nfs_shrinkreaddir=1

A8: Oftentimes NFS SERVER NOT RESPONDING is an indication of another problem
on the NFS server, particularly on the disk subsystem. If you have a
SPARCStorage Array, you must verify that you have the most recent
firmware and patches due to the volatility of that product.

Another general method that can be tried is to look at the output
from iostat -xtc 5 and check the svc_t field. If this value goes
over 50.0 (50 msec) for a disk that is being used to serve NFS requests,
you might have found your bottleneck. Consult the references in
Section 7 of this PSD for other possible NFS Server tuning hints.
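
The 50 msec check can be automated along these lines. This is a sketch
only: the function name is made up, the sample text imitates the general
shape of plain iostat -x output with invented numbers, and the exact
layout on a given Solaris release is an assumption. The columns are
located by the "device" and "svc_t" header names rather than by fixed
position.

```python
def slow_disks(iostat_text, limit_ms=50.0):
    """Return (device, svc_t) pairs whose svc_t exceeds limit_ms."""
    slow = []
    dev_i = svc_i = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if "device" in fields and "svc_t" in fields:
            # Header row; it repeats at every reporting interval.
            dev_i, svc_i = fields.index("device"), fields.index("svc_t")
            continue
        if svc_i is None or len(fields) <= svc_i:
            continue  # banner, blank line, or a row that is too short
        try:
            svc_t = float(fields[svc_i])
        except ValueError:
            continue  # not a data row
        if svc_t > limit_ms:
            slow.append((fields[dev_i], svc_t))
    return slow


# Invented sample in the assumed layout.
SAMPLE = """\
                  extended device statistics
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd0       1.2  0.4    9.6    3.2  0.0  0.0   12.5   0   2
sd3      48.0 21.7  384.0  173.6  1.9  2.8   67.3  31  88
"""
print(slow_disks(SAMPLE))  # [('sd3', 67.3)]
```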

NOTE: NFS Server performance tuning services are only available
on a Time and Materials basis.


