SUMMARY: NFS timeouts over WAN

From: dansf@gte.net
Date: Mon Jan 12 1998 - 14:29:30 CST


Just wanted to thank everyone who responded to my question. My original
question (summary below):
> Hello,
> Was wondering if someone had encountered a similar problem or knows of
> a solution for this problem.
>
> I am experiencing NFS timeouts running applications over a WAN.
>
> The NFS server is an Ultra2 running Solaris 2.5.1 w/latest Recommended
> Patches.
>
> The NFS client is Sparc20 running Solaris 2.4 w/a set of recommended
> patches (kernel patch 101945-41).
>
> The client and server have different domains, but each is using NIS+
> (although we check files for automounts).
>
> Things seem to be okay when the client gets the file from a Solaris 2.4
> box. But strangely enough, clients on the LAN (with the server) have no
> trouble accessing files from the Solaris 2.5.1 server.
>
> The two machines communicate with one another across a T1, which I have
> been told, is functioning correctly. When trying to start an
> application (even xclock!) on the client off the server we get:
> NFS server not responding
> NFS server ok
> NFS server not responding
> NFS server ok
>
> It takes about 30 minutes to bring up xclock.
>
> What I have done:
>
> I have run "snoop client" on the server to see what is happening to the
> packets. It looks like
>
> client -> server NFS C READ2 FH=2232 at 131072 for 8192
> server -> client NFS R READ2 OK (8192 bytes)
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
> client -> server NFS C READ2 FH=2232 at 131072 for 8192 (retransmit)
> server -> client NFS R READ2 OK (8192 bytes)
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
> server -> client UDP continuation ID=12570
>
> It seems we get quite a few "retransmits". We are also getting some
> "ICMP Time exceeded (in reassembly)" messages.
>
> It appears we are only utilizing about 5% of our T1 Bandwidth.
>
> I have bumped up the number of NFS threads from 64 to 128. Did not
> help.
>
> Telnet and ftp seem to be working ok. We get good response from the
> client and server with these programs. Perhaps the problem is UDP
> related and can be fixed with a patch. Didn't see anything good in
> SunSolve.
>
> Sorry for the long post.
>
**************************************************
I received many responses. Sorry for the late summary; I wanted to make
sure the problem was completely solved before posting. Most replies
suggested a quick workaround: decrease the read and write sizes on the
NFS mounts with "rsize=1024" and "wsize=1024". This helped but did not
solve the problem. The real solution, as suggested by some replies, was
to get the T1 people to hammer out the problem. I finally convinced
them that the problem was with the T1. They ended up taking 4 dB of
padding out of the NIU at the Telco demarc (on the other end of the
T1). That fixed the problem. Aaargh! I'll list the responses received
for your viewing pleasure.

Thanks again,
Dan Freedman
*****************************************************
Subject:
        Re: NFS timeouts over WAN
  Date:
        Tue, 6 Jan 1998 16:31:11 -0800 (PST)
  From:
        David Wolfskill <david@xtend.net>
    To:
        dansf@gte.net

Check the routers between the server & client. I did some work at
a client site where they were using an ATM backbone between buildings
and encountering similar symptoms. Turns out (in that case) that the
8K NFS packets were getting fragmented, then a router would drop a
frag, causing a time-out & re-transmission (of the whole 8K packet),
which exacerbated the problem.
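
As a rough illustration of why one dropped fragment hurts so much, here
is a minimal Python sketch (not part of the original replies). It
assumes an Ethernet MTU of 1500, a 20-byte IP header with no options,
and roughly 128 bytes of RPC/NFS header on a READ reply; the function
names and the 128-byte figure are assumptions for illustration only.

```python
import math

IP_HEADER = 20      # bytes, assuming no IP options
UDP_HEADER = 8      # bytes
RPC_OVERHEAD = 128  # bytes, assumed RPC/NFS reply header size


def fragment_count(read_size, mtu=1500):
    """Number of IP fragments needed for one NFS-over-UDP READ reply."""
    # Fragment payloads must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    datagram = read_size + RPC_OVERHEAD + UDP_HEADER
    return math.ceil(datagram / per_fragment)


def delivery_probability(read_size, fragment_loss, mtu=1500):
    """Chance that every fragment of one reply arrives.

    Losing any single fragment discards the whole datagram, so the
    per-fragment survival rate is raised to the fragment count.
    """
    return (1.0 - fragment_loss) ** fragment_count(read_size, mtu)


if __name__ == "__main__":
    for size in (8192, 2048, 1024):
        print(size, fragment_count(size), delivery_probability(size, 0.02))
```

Under these assumptions an 8192-byte read becomes 6 fragments, which is
consistent with the one reply plus five "UDP continuation" lines in the
snoop trace, while a 1024-byte read fits in a single packet.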

If you can use NFS V3, that might help a little (because of the ability
to use TCP as the transport layer, with its "reliable stream" model).

For that matter, check any and all devices between the client & server.

david
-------
Subject:
        Re: NFS timeouts over WAN
  Date:
        Tue, 6 Jan 1998 18:43:32 -0600 (CST)
  From:
        David Dhunjishaw <dave@colltech.com>
    To:
        dansf@gte.net

This sounds analogous to a problem I've had in the past with MTUs and
packet reassembly.

In those cases, servers were connected to the network using FDDI, which
had an MTU of 4000-something (don't remember exact number). When NFS
clients, whose Ethernet interfaces had an MTU of 1500, tried to connect
to the servers, we'd get the same problems you are having. Reducing the
MTU of the FDDI interfaces on the servers to match the Ethernet MTU of
1500 solved our problem.

Low bandwidth protocols like telnet produce small packets (smaller than
1500), so that service would work normally. And I believe ftp only
generates packets of a couple of hundred bytes at a time.

I wonder if dropping the MTU on your server would help to solve the
problem, even if it's Ethernet. I can imagine a situation where large
UDP packets would have trouble traveling over a WAN. To do this, you
can use an ndd command, the ifconfig command, or an /etc/system
parameter, but I forget the syntax. Try searching the archives for mtu.

Hope this helps.

Dave
----------
This smells like a networking problem. If your bandwidth isn't
being utilized, then there may be a problem on a hub somewhere, or
other LAN gear. Check everything between the server and the client
just to make sure. Also, run netstat and get a collision rate for
both client and server, and perhaps some unrelated hosts on each
segment.

>
> It appears we are only utilizing about 5% of our T1 Bandwidth.
>
> I have bumped up the number of NFS threads from 64 to 128. Did not
> help.

If you run netstat -s and grep -i flows, you should see whether or
not you are having input overflows. If you see a large number of
these, you need more threads, otherwise you don't.

>
> Telnet and ftp seem to be working ok. We get good response from the
> client and server with these programs. Perhaps the problem is UDP
> related and can be fixed with a patch. Didn't see anything good in
> SunSolve.

UDP messages get broken up into fragments and then reassembled. If
there is a problem in reassembly, then all of the associated packets
need to be resent, thus the high number of retransmit requests.
With TCP (such as ftp or telnet), packets are sequenced, and only
the missing or bad fragment needs to be retransmitted. If you are
having a network problem (likely here), then UDP traffic will
compound this by increasing the load exponentially, causing more
retransmit requests, causing more load, more errors, more
retransmits, etc.

One way to try to address this is to force the read and write size
of the nfs mounts to be small enough to fit into one packet, thus no
fragmentation and reassembly issues will occur. The MTU for most
le devices is 1500, so leaving enough room for encapsulation
overhead, etc. would probably make 1024 a good size. The real
solution is to find the network problem though. Make the WAN/LAN
guys recheck everything from end to end.
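
The sizing argument above can be sketched in a few lines of Python.
This is only an illustration: the 128 bytes allowed for RPC/NFS headers
and the function name are assumptions, and the IP and UDP header sizes
assume no IP options.

```python
def largest_unfragmented_size(mtu=1500, rpc_overhead=128):
    """Largest power-of-two rsize/wsize that fits in one packet.

    rpc_overhead is an assumed allowance for RPC/NFS headers.
    """
    ip_header = 20   # bytes, no IP options
    udp_header = 8   # bytes
    budget = mtu - ip_header - udp_header - rpc_overhead
    if budget < 512:
        raise ValueError("MTU too small for a 512-byte transfer size")
    size = 512
    # Double until the next step would no longer fit in the budget.
    while size * 2 <= budget:
        size *= 2
    return size


if __name__ == "__main__":
    print(largest_unfragmented_size())  # Ethernet MTU of 1500
```

With an MTU of 1500 this arrives at the same 1024 suggested above.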

Good luck.

Cheers,

Richard

--------

Subject:
        Re: NFS timeouts over WAN
  Date:
        Wed, 7 Jan 1998 08:37:07 +0000 (GMT)
  From:
        "Mark.Parry" <mark.parry@research.natpower.co.uk>
    To:
        dansf@gte.net

Dan,

Have been briefly looking at your email, but rather than twiddle with
the server, I had a thought - why not use cachefs on your remote
clients? Sun recommend using the cacheOS client in this situation, but
if you didn't want to go the whole hog, perhaps you may want to be
selective about the filesystems you cache.

Just a thought. Alternatively, a colleague suggests modifying the NFS
blocksize or timeout. Also, you could u/g the client to 2.5 (or
higher), which will then use NFS v.3 automatically - this is TCP rather
than UDP based, and so may circumvent the problem entirely if it is a
singular UDP problem.

Hope this helps...

Mark Parry

---------
Subject:
        Re: NFS timeouts over WAN
  Date:
        Wed, 07 Jan 1998 06:00:22 -0500
  From:
        "Brian T. Wightman" <wightman@acm.org>
    To:
        dansf@gte.net

NFS is _extremely_ sensitive to small time lags. For example, on a
network I was administering in a previous life, we had a few
workstations on a different part of campus, still on the same lan, no
routers between, but the logical distance + collision rate on the
segment some of the hosts were on caused enough of a delay to make
the clients behave like you are seeing. These clients that were
seeing the problems were on the same segment as a heavily used
PC/Novell population, which when removed, solved the problem.

If possible, you may want to use TCP instead of UDP for your
transport over the WAN (see the man pages for mount_nfs for details).
You may also want to increase your timeout (again see the man
pages).
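
To get a feel for what raising the timeout does, here is a small Python
sketch (not from the original reply). It assumes the client doubles its
wait after every retransmission up to a cap; the doubling, the
20-second cap, the default values, and the function name are all
assumptions for illustration, so check the mount_nfs man page for what
your release actually does. timeo is given in tenths of a second.

```python
def retransmit_schedule(timeo_tenths=11, retrans=5, cap_seconds=20.0):
    """Waits (in seconds) before each retransmission of one request.

    Assumes simple doubling backoff with a cap; this is a model for
    illustration, not the documented behaviour of any NFS client.
    """
    wait = timeo_tenths / 10.0
    schedule = []
    for _ in range(retrans):
        schedule.append(min(wait, cap_seconds))
        wait *= 2
    return schedule


if __name__ == "__main__":
    for timeo in (11, 30):
        waits = retransmit_schedule(timeo_tenths=timeo)
        print(timeo, waits, sum(waits))
```

In this model a larger timeo gives a slow link more time to deliver the
reply before the client resends the request and adds to the load.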

Hope this helps.

Brian
---------
You might want to play with the rsize/wsize values for the client
mounting. 1024 or 2048 might solve the problem. Chances are that there
is a problem in re-assembling the UDP packets somewhere with the
default rsize/wsize value.

Regards
Ravi
---------
Subject:
        Re: NFS timeouts over WAN
  Date:
        Wed, 7 Jan 1998 11:30:26 EST
  From:
        Kevin.Sheehan@uniq.com.au (Kevin Sheehan {Consulting Poster Child})
    To:
        dansf@gte.net

nfsstat -c will give you a good idea of what is going on with NFS.
I'd get NFS and NIS by Hal Stern, as it goes into great detail, but
basically you have:

1) retrans (number of rexmitted requests)
2) badxid (number that came back okay, but we had already rexmitted)

If 2 is any significant percentage of 1, then you should relax
the timeouts (timeo) on the NFS mount. It is probably retransmissions
killing you. Two common sources on WAN connections are too aggressive
a timeout for rexmits, and not having the router configured to
handle the number of large packets going down the wire. The former
shows up as retrans+badxid (timeout) and the latter shows up as
retrans alone (dropped packets).
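
The rule of thumb above could be sketched in Python roughly as follows.
The counter values would be read off "nfsstat -c" by hand; the function
name and both thresholds are assumptions for illustration, since the
text only says "any significant percentage".

```python
def diagnose_retransmissions(calls, retrans, badxid,
                             retrans_ratio=0.05, badxid_ratio=0.5):
    """Classify NFS client counters using the retrans/badxid heuristic.

    retrans_ratio and badxid_ratio are assumed cut-offs, not values
    taken from nfsstat or its documentation.
    """
    if calls == 0 or retrans / calls < retrans_ratio:
        return "ok"
    if badxid / retrans >= badxid_ratio:
        # Replies did arrive, but only after we had already resent.
        return "timeout too aggressive: raise timeo"
    # Requests were resent and the first reply never showed up.
    return "packets dropped: check routers and the link"


if __name__ == "__main__":
    print(diagnose_retransmissions(calls=10000, retrans=900, badxid=850))
    print(diagnose_retransmissions(calls=10000, retrans=900, badxid=20))
```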
------
Subject:
        Re: NFS timeouts over WAN
  Date:
        Wed, 7 Jan 98 08:54:34 PST
  From:
        bismark@alta.Jpl.Nasa.Gov (Bismark Espinoza)
    To:
        dansf@gte.net
    CC:
        bismark@alta.Jpl.Nasa.Gov

First, determine the NFS version, protocol, and buffer size the link is
using by running "nfsstat -m" on the client.

Then, start experimenting with different read buffer, write buffer,
timeout, and retransmission numbers.
---------

NFS should work pretty well over a properly-functioning T1 without
any tweaks. Are you sure the T1 is working well? Do rcp and ftp
give you the expected transfer rates (80-90 kbyte/s), with good
ping -s times (<100ms) during the transfers?

Perhaps other unrelated traffic already has the T1 saturated.

You can probably make NFS happier by dropping the buffer sizes. Try
rsize=2048,wsize=2048 for a start and then experiment (the default is
8192).

Jay Lessert jay_lessert@latticesemi.com
Lattice Semiconductor Corp. (voice)1.503.681.0118
Hillsboro, OR, USA (fax)1.503.693.0540


