Summary: nfsd daemons eating CPU time

From: Robert Davies (rob@hasler.ascom.ch)
Date: Thu Apr 08 1993 - 13:54:36 CDT


Firstly, thanks for all the replies. I received the first ones 36 hours
after mailing, but 24 hours after solving the problem :)/:(.

    Response Performance Summary -- sent at 21:30 MET Wednesday

Within      Replies received   Aid rapid sol.   Metoo   Accuracy

12 hours           0
24 hours           0
36 hours           4                50%                    75%
48 hours          19                70%           5%       90%
60 hours         ???

Most responses would have isolated the problem, some were more long term,
as they involved collecting and installing PD S/W, which takes time...
Some were red herrings.

I hope that's interesting for the curious, more passive readers. Now
YOU KNOW what to expect. I'd hoped for ideas next morning, as we have
a direct mail feed and I've often had replies from the USA in 10mins.

sun-managers is for when you're really stuck, not when you're tired,
distracted, worried and hassled, with no time to think. Go home and sleep
instead!!

I flame myself, for not doing that! On the other hand I can still write
a useful summary, as some people came up with some v. interesting points.

    Conclusion : 2,000 heads are better than 1

        Summary nfsd daemons eating CPU time

It would appear that this problem occurs regularly, where there are v.
many clients, or perhaps often with SunOS 4.0.3, so I'll try to summarise
for those who've not yet experienced this problem. Some non-Sun
client workstations may cause this by use of find, or file monitor
programs, which some user out there starts up innocently, not understanding
the consequences.

My previous experience was that idle time <10% is indicative of infinitely
looping S/W, which misled me into concentrating on the server, suspecting that
the nfsd's weren't waiting for client responses, or that the server was talking
to itself, rather than looking at the system as a whole.

Those who have seen this problem before gave the best answers, that is
that the clients are causing it. There is nothing wrong with the server!
Use rpc.etherd & traffic or etherfind to find out which ones.
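
Finding out "which ones" amounts to counting packets per source host. A
minimal sketch in Python, assuming the capture has already been reduced to
(source, destination) pairs, one per packet. The host names and counts are
made up for illustration, and parsing real etherfind output is left out
because its format is not shown in this summary.

```python
from collections import Counter

def top_talkers(packets, n=5):
    """Return the n busiest source hosts as (host, packet_count) pairs.

    packets: iterable of (source, destination) host names, one per packet.
    """
    return Counter(src for src, _dst in packets).most_common(n)

# Hypothetical capture: one client is far busier than the rest.
capture = (
    [("client7", "server")] * 900
    + [("client2", "server")] * 40
    + [("server", "client7")] * 60
)

for host, count in top_talkers(capture):
    print(f"{host:10s} {count:6d} packets")
```

A host that dominates this list is the first one to log in to and inspect.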

Symptoms :

> This morning the NFS daemons were using lots of CPU time, I noticed
> as performance was very bad.

> root 137 36.4 0.0 28 0 ? R 19:03 19:50 (nfsd)
> root 139 31.7 0.0 28 0 ? R 19:03 19:38 (nfsd)

> I rebooted tonight and the problem is just as bad, 12 nfsd's using
> CPU like no one else needed CPU. I didn't change anything yesterday.

> I also observed v. high net activity, using netstat.

What the users say : File server performance is very bad, all clients seem
                     slow, the net seems sluggish.

What the users didn't say : I can't log in on my (NFS diskless client) computer.

What the Sysadmin sees :

    V. low 0-10% idle time, and found that the nfsd's were the culprits
    High net activity
    Problem seems to be getting worse with time (as the net, server is less
                                                 busy)
    It just took you 10 minutes to log in to your file server
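
The "nfsd's were the culprits" check can be sketched in Python by adding up
the %CPU column for the nfsd lines of ps output. This assumes the layout of
the two lines quoted under Symptoms (third field is %CPU, last field is the
command); other ps variants may order their columns differently.

```python
def nfsd_cpu(ps_lines):
    """Return (number of nfsd processes, their total %CPU)."""
    count, total = 0, 0.0
    for line in ps_lines:
        fields = line.split()
        # Assumed layout: USER PID %CPU ... COMMAND, with nfsd shown as "(nfsd)".
        if len(fields) < 4 or fields[-1] != "(nfsd)":
            continue
        total += float(fields[2])
        count += 1
    return count, total

# The two lines quoted in the Symptoms section.
sample = [
    "root 137 36.4 0.0 28 0 ? R 19:03 19:50 (nfsd)",
    "root 139 31.7 0.0 28 0 ? R 19:03 19:38 (nfsd)",
]

count, total = nfsd_cpu(sample)
print(f"{count} nfsd's using {total:.1f}% CPU between them")
```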

Solution :

1) Run etherfind (snoop in SunOS 5, or tcpdump on BSD & Ultrix).

2) Or run rpc.etherd on a machine, putting the net interface into
     promiscuous mode (packet filter) and collecting stats.

    [ Configure at least 2 machines per segment to support etherd & etherfind.
      The Sun GENERIC kernel supports this; roll-your-own kernels require
      streams NIT, i.e. pseudo devices snit, pf & nbuf. This is well
      commented in the Sun kernel config files. ]
    
    If you haven't prepared machines for this, use NFS or X.25 servers,
    as they will already have been configured for this.

    Run traffic for a graphical display.

    [ If you don't know traffic or etherfind already, take time out to
      play with them, the pay back period is v. short ]

3) The rogue clients will be generating lots of packets

4) Log in to the clients and kill any processes that could be causing
    NFS traffic.

        i) Many have suffered from find(1) started by cron on many
            clients, or by users, without restriction to the local disks.

        ii) Users who write looping programs calling rwho(1), or other
            programs which reference a shared file system.
            ( I would have thought the client cache would avoid that )

        iii) A running executable whose disk image has been rebuilt or
             deleted.

            [ I believe this to be the cause in my case. Gold medal to
              Russ Poffenberger, who says this was common in 4.0.3.
              I economised on client disk space, using links. However,
              not all clients were affected, so I'm not sure. ]

        iv) SGIs with IRIX 4.0.[1-4] running famd "File Alteration
             Monitor Daemon" [Marc Rinfret]

     In my case I had many processes labelled <defunct>, and as they
     were diskless clients, rebooting seemed the easiest solution.
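
For item 4 i), the restriction to local disks can be sketched in Python as a
tree walk that refuses to cross onto another file system by comparing device
numbers, which is roughly what find's -xdev option does where it is available.
This is an illustration only, not what any of the respondents ran, and
walk_local is a name made up for this example.

```python
import os

def same_device(path, device):
    """True if path exists and lives on the given device number."""
    try:
        return os.lstat(path).st_dev == device
    except OSError:
        return False  # vanished or unreadable: do not descend

def walk_local(top):
    """Yield file paths under top without descending into other file systems."""
    top_dev = os.lstat(top).st_dev
    for dirpath, dirnames, filenames in os.walk(top):
        # Prune in place so os.walk skips directories on other devices,
        # e.g. NFS mount points.
        dirnames[:] = [d for d in dirnames
                       if same_device(os.path.join(dirpath, d), top_dev)]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

A cron job built on something like walk_local("/") stays on the client's own
root file system instead of generating NFS traffic on every mounted one.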

I would like to give honourable mention to Tommy Reingold for his traffic
based answer, which would have allowed an inexperienced Sun manager to
solve the problem. Also Rens Troost for the most specific etherfind
solution, with its important caveat about kernel support for NIT.

Suggestions :

1)
    Install nfswatch, recommended by 4-5 people, archive sites include :

        cs.dal.ca /pub/comp.archives/nfswatch
        phloem.uoregon.edu /pub/Sun4/bin/nfswatch
        src.doc.ic.ac.uk /usenet/comp.archives/nfswatch
    
    This program apparently tells you everything you always wanted to
    know about NFS.

    I didn't install it, I hope to one day, as I've never been very
    happy with information available from nfsstat.

2)
    Use tcpdump, xnetmon. Why bother, when rpc.etherd/traffic
    and etherfind do the job and are Sun supported?

3)
   For patch fans Christian Lawrence's 1 line answer takes some beating :

       install NFS jumbo patch 100173-10
    
   and nothing else...

   Hal Stern also suggested the NFS jumbo patch, so if you have NFS
   problems with 4.1.3, you might want to look into it.

4)
    Network or HW problems

    Some noticed that the collision rate is relatively high, 1-2%,
    but this is normal for our net (we have a lot of different kit, some
    old, connected up through a cabling hub based system).

    I don't believe that Net or Ethernet Port problems could cause
    the observed symptoms.
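
For reference, a collision rate like the 1-2% above is usually worked out
from the Collis and Opkts counters that netstat -i reports, as collisions
divided by output packets. A small Python sketch of that arithmetic, with
made-up counter values:

```python
def collision_rate(collisions, output_packets):
    """Collisions as a percentage of packets sent on an interface."""
    if output_packets == 0:
        return 0.0
    return 100.0 * collisions / output_packets

# Hypothetical counters, not taken from the machine discussed here.
print(f"{collision_rate(1500, 100000):.1f}% collisions")
```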

Conclusion :

The net is fast enough for client requests to use 100% of the server's CPU time

Learn to use etherfind or rpc.etherd/traffic, and use them when you
see high net activity (easy to forget when you're on a false trail)

Think about installing nfswatch

If you can't cure the cause, and someone absolutely must use the file
server, killing nfsd's is a tolerable stop-gap measure.

Roll of Heroes at Thu Apr 8 20:31:10 MET DST 1993

mp@allegra.att.com (Mark Plotnick) - use etherfind
deb@beaux.atwc.teradyne.com (MOTHER DAEMON) - colls. high? Net prob.
tommy@boole.att.com Tommy Reingold - use rpc.etherd/traffic
agw@math.canterbury.ac.nz - use etherfind
"Marc P. Rinfret" <Marc.Rinfret@eng.canadair.ca> - get nfswatch (IRIX info)
Tasuki Hirata <sukes@eng.umd.edu> - use etherfind
Richard Elling <Richard.Elling@eng.auburn.edu> - use etherfind/snoop (SunOS 5)
Dan Transue <odt@dcs.bellcore.com> - get nfswatch
gpr@proteon.com (Gary Richardson) - colls. high? Clients
Postmaster <Piete.Brooks@cl.cam.ac.uk> - tcpdump, rwho 100+ clients
trinkle@cs.purdue.edu (Daniel Trinkle) - clients doing find
poffen@sj.ate.slb.com (Russ Poffenberger) - use etherfind
strombrg@hydra.acs.uci.edu - nfswatch,xnetmon,tcpdump Net
sid@ingres.com - use etherfind
Christian Lawrence <cal@soac.bellcore.com> - use NFS jumbo patch 100173-10
stern@sunne.east.sun.com (Hal Stern) - check clients, NFS patch
jamest@sybase.com (James Terry) - rpc.mountd dead? etherfind
red@thumper.bellcore.com (Ram Reddy) - use etherfind
Perry_Hutchison.Portland@xerox.com - check clients


