SUMMARY solutions & suggestions for processes hung in disk wait state

From: ram@jasper.cs.orst.edu
Date: Thu Feb 15 1990 - 15:53:35 CST


A few days back I asked

> We are having problems with some processes going to disk wait state
> (ps -aux shows status D) on our SUNs (4/280, 4/330 or 3/50's). The
> processes most frequently affected are the nfs daemons and other daemon
> processes. When a process is in the disk wait state, it cannot be
> killed either. The only solution seems to be to reboot the system
> to make it usable. Is there a better solution than this? It is
> difficult to convince the users on the system that it has to be rebooted
> in the middle of the day because some process is hung.

and I received a few pointers indicating that this bug has been
reported before and that Sun has a patch tape for it.
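
For anyone who wants to check a machine for the symptom in my question, here is a minimal sketch (in Python, not part of any reply I received) that picks the disk-wait lines out of saved "ps aux" output. The function name disk_wait_processes is my own, and it assumes a BSD-style listing whose header line contains a STAT column and whose earlier columns have not run together.

```python
def disk_wait_processes(ps_output):
    """Return the lines of BSD-style `ps aux` output whose STAT begins with D.

    Assumes the first line is a header containing the word STAT and that
    the columns before STAT are separated by whitespace on every line.
    Raises ValueError if the header has no STAT column.
    """
    lines = ps_output.splitlines()
    if not lines:
        return []
    stat_col = lines[0].split().index("STAT")
    hung = []
    for line in lines[1:]:
        fields = line.split()
        # D as the first status letter marks disk wait (e.g. "D" or "DW").
        if len(fields) > stat_col and fields[stat_col].startswith("D"):
            hung.append(line)
    return hung

# Possible use on a live system (not run here):
#   import subprocess
#   text = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
#   for line in disk_wait_processes(text):
#       print(line)
```

A process can pass through disk wait briefly during normal I/O, so a single hit means little; it is the same daemon showing D across repeated listings that matches the problem described here.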

Here is a brief summary:

The bug IDs are 1017518 and 1017893. A description of them was given by
Dennis Michael <dennis@jessica.Stanford.EDU>:

  Occasionally on NFS server machines the nfsd daemons have been
  reported to get into a disk wait ("DW") state, as shown in a
  listing of "ps aux". This condition causes all client requests
  to the server to fail. The problem descriptions in Sun bugIds
  1017518 and 1017893 identify at least two distinct causes of
  this problem, described below:

  Case 1017518:
        On the server system, processes go into DW state
        and don't return. This problem is related to VM
        and may happen even in non NFS instances. The
        core dump will show _sleep, _cv_wait, _page_cv_wait,
        and _page_wait at the top of the stack trace. Basically
        the process is blocked waiting for the keep count on the
        page it wants to go to zero (meaning that it is available)
        but somehow it didn't get decremented correctly and will
        never go to zero.

  Case 1017893:
        This is a server problem similar to the client problem
        in bugId 1018954. The process is blocked waiting for an
        mbuf structure to be released back to NFS, but it is
        never being released. The core dump for this problem
        shows the hung process with a stack trace of _svc_sendreply,
        _svckudp_send(0x7hexdigits,0x7hexdigits) + 2C, _sleep.
        The routine svckudp_send is trying to send a reply to the
        client, but is blocked waiting for the mbuf structure
        pointed to by the first 0x7hexdigits argument above.
        Actually, the first 0x7hexdigits argument to svckudp_send
        is a SVCXPRT pointer, not an mbuf. However, it's possible
        to derive the mbuf's address given this argument.
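
The two stack-trace signatures above can be turned into a rough lookup. The following is only a sketch: the names SIGNATURES and match_bug_ids are hypothetical, and it assumes the stack trace has been saved as text with the symbol names spelled with a leading underscore, exactly as in the descriptions above. A match is a hint about which case applies, not a diagnosis.

```python
import re

# Symbols the descriptions above associate with each Sun bug ID.
SIGNATURES = {
    "1017518": {"_sleep", "_cv_wait", "_page_cv_wait", "_page_wait"},
    "1017893": {"_svc_sendreply", "_svckudp_send", "_sleep"},
}

def match_bug_ids(trace_text):
    """Return the bug IDs whose complete symbol set appears in trace_text."""
    # Collect every underscore-prefixed identifier found in the trace.
    found = set(re.findall(r"_[A-Za-z0-9_]+", trace_text))
    return sorted(bug for bug, syms in SIGNATURES.items() if syms <= found)
```

Both signatures include _sleep, so that symbol alone matches neither case; every symbol listed for a bug ID has to be present before it is reported.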

  There currently are two patches available for this case:

        1) an adb patch which sets nfsreadmap to 0:

                # adb -w /vmunix -
                nfsreadmap?W 0
                $q
 
           This eliminates most of the code that increments and
           decrements the keep count.

        2) The included patched ufs_bmap.o files, which fix a
           bug in bmap() where "softlocked" pages were never
           released after a failure to extend the original block.

        It may not be necessary to apply both patches. It is
        recommended that the ufs_bmap.o patch be tried first, and
        the adb patch added only if the problem persists.
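
The failure mode in the keep-count description (a count is taken, an error path returns without releasing it, and a waiter then sleeps until a zero that never arrives) can be illustrated with a toy model. This is not SunOS kernel code: the Page class, its method names, and use_page are all made up for illustration, and the waiter is given a timeout only so that the example terminates. The real process has no such timeout, which is why it sits in disk wait until reboot.

```python
import threading

class Page:
    """Toy stand-in for a VM page with a keep count (hypothetical)."""

    def __init__(self):
        self.keepcnt = 0
        self.cv = threading.Condition()

    def keep(self):
        with self.cv:
            self.keepcnt += 1

    def unkeep(self):
        with self.cv:
            self.keepcnt -= 1
            if self.keepcnt == 0:
                self.cv.notify_all()   # wake anyone waiting for the page

    def wait_until_free(self, timeout):
        """Return True if keepcnt reached zero within timeout seconds."""
        with self.cv:
            return self.cv.wait_for(lambda: self.keepcnt == 0, timeout)

def use_page(page, fail):
    """Take a keep on the page; the buggy error path never releases it."""
    page.keep()
    if fail:
        return            # bug: early return without calling unkeep()
    page.unkeep()

ok = Page()
use_page(ok, fail=False)
print(ok.wait_until_free(0.1))      # True: count went back to zero

leaked = Page()
use_page(leaked, fail=True)
print(leaked.wait_until_free(0.1))  # False: count is stuck at 1
```

In this picture, a fix like the ufs_bmap.o one corresponds to adding the missing release on the failure path.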

Sun has a patch tape called "nfsd_dw_hanging". I have requested it,
and hopefully the problems will disappear once I install the
patches.

Thanks to:
 
         Richard Elling <relling@eng.auburn.edu>
         Chris Barry <cbarry@BBN.COM>
         rackow@antares.mcs.anl.gov
         Dennis Michael <dennis@jessica.Stanford.EDU>
         Rob ten Kroode <roberto@cwi.nl>
         halstern@Sun.COM (Hal Stern - Consultant)

for some useful pointers.

ram
--------
Janakiram Cherala                  Internet: ram@cs.orst.edu
Sun System Administrator           UUCP    :
Computer Science Department        UUCP    : hplabs!hp-pcd!orstcs!ram
Oregon State University, Corvallis, OR 97330    (503) 737-3273


