SUMMARY Sparcserver 490 hangs in fork(?) and load increases up to 40

From: Matthias Ernst (maer@nmr.lpc.ethz.ch)
Date: Fri Nov 27 1992 - 13:55:45 CST


Thanks to everybody who responded to my question. The original question, the
full answers and the names of the people can be found after my summary:

There were many different suggestions about what to do and why this
might have happened. Some of them did not apply to our system for
various reasons. I have listed the suggestions here, with my
comments:

* install nfs jumbo patch to get rid of system hangs:
  I don't think nfs was involved in our case, because the machine
  kept serving the whole time with good response.

* enable savecore and check after the reboot which processes were
  active, how much memory was used ...
  This is the first thing I did, but the machine hasn't hung since
  then, so I can't give any new information. I didn't know that
  ps and pstat can read from core files.

* install patch 100330-06
  this patch is a jumbo patch for system hang due to kernel map
  From the README:
  Topic: Jumbo Patch for the system hang problems due to kernelmap, running out
         of mbufs (100126), along with SS2 crash with screen going blank (100232),
         and other problems of ie driver (100570) such as misaligned frames,
         lost interrupt net jammed messages on the sun4 platform and kernel
         panic iesynccmd. This patch includes patches 100126, 100570, 100232.

  I think this will be the next thing we try if we don't upgrade
  the machine to 4.1.3.

* problems with large swap space requests.
  This is very unlikely, because the machine was dead for almost 5 hours.

* SYBASE can cause the system to hang.
  Did not apply to us, because we are not running SYBASE.

* corrupted NIS-server maps can hang the system.
  NIS-maps are all good.

* problems with disk access (disk controller)
  This is unlikely, because the system kept working as an nfs server on
  all disks.

We will probably wait for one more occurrence of this problem and then
either install patch 100330-06 or upgrade the machine from 4.1.1 to 4.1.3,
since we would like to have full support for an EXABYTE 8500 tape drive.

Thanks to

rodo@auspex.com (Rod Livingood)
jkays@msc.edu
wallen@cogsci.UCSD.EDU (Mark R. Wallen)
ups!upstage!glenn@fourx.Aus.Sun.COM (Glenn Satchell)
dennett@Kodak.COM (Charles R. Dennett)
davek@lonfs01.lsi-logic.co.uk (Dave King)
Christian Lawrence <cal@soac.bellcore.com>
carmine@usb.ve (LDC - Carmine Di Biase Cardone)
steve@seattle.Avcom.COM (Steve Lee)
tekbspa!edward@uunet.UU.NET (Edward Chien)

for their responses.

This was my original question:

-> Short description:
->
-> We have a serious problem with our Sparcserver 490. Twice during the last
-> 48 hours it stopped starting new processes, although all running
-> processes kept working fine. This caused the load of the system to
-> increase up to 40, but the response to running jobs was still very
-> good. Of course it prevented logins, and even /etc/halt did not work.
-> The only thing we could do was STOP-A and reboot the machine.
->
-> Details:
-> System: Sparcserver 490, 64 MByte RAM, 4 * 1 GByte IPI Disks, 5 GByte
-> Exabyte Tape, 2 Magnetooptical Disks, 1 ALM-2
-> SunOS 4.1.1 with only the security patches applied
-> Software: FrameMaker, Matlab, Mathematica, DevGuide, a lot of
-> applications written in GNU C++, a lot of license servers for
-> commercial applications, various NMR processing packages.
-> The machine is mainly used as a file server for ~15 (diskful)
-> workstations (mainly Sparc-1 and Sparc-2) and to a lesser extent
-> for a large variety of computational jobs.
->
-> The symptoms are the following: The load of the system starts to climb
-> continuously, while the cpu usage is below 10%. This is caused by processes
-> which are started up by users trying to log in, or by users trying to
-> start up a process from their (running) shells. These new processes do
-> not start up properly, but stay in a wait status and are probably
-> counted by the load measurement as running jobs. All processes which
-> are already running, like the nfs daemons, continue to run and work fine.
-> The system still answers ping and rup, but not finger or telnet,
-> because these need to start a new process. Once I was logged
-> in as root at the console, and even then it was impossible to stop
-> and reboot the machine with /etc/halt.
-> I have now written a small program which dumps the process structure
-> periodically to disk so that I can see the next time which of the
-> programs is the first to stop working, but besides this I have no
-> idea what else to do.
->
-> Question:
-> Has anybody seen such a problem on his machines or does anybody know
-> whether this is a known bug and how to fix it?
-> I checked the sun-managers archives, but could find only one similar
-> problem with no solution.
->
-> Any help will be appreciated. I will summarize the responses.
->
-> Thanks a lot
->
-> Matthias
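
A periodic process-table dump of the kind mentioned in the question could
look roughly like the sketch below. This is not the program that ran on the
490 (that program is not shown anywhere in this thread). It is a minimal
Python illustration under stated assumptions: a ps that accepts "aux", and
a made-up output directory and interval. Because it starts ps as a new
process, it may itself stall once the machine stops starting processes; in
that case the last file written marks the last known-good state.

```python
import subprocess
import time
from pathlib import Path


def capture_ps(timeout=30):
    """Return the text of `ps aux`, or a marker line if ps does not finish."""
    try:
        result = subprocess.run(
            ["ps", "aux"], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        # ps was started but never finished: record that fact instead.
        return "ps did not finish within %d seconds\n" % timeout


def write_snapshot(text, out_dir, stamp):
    """Write one snapshot to out_dir/ps-<stamp>.txt and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / ("ps-%s.txt" % stamp)
    path.write_text(text)
    return path


def run_forever(out_dir="/var/tmp/ps-snapshots", interval=60):
    """Dump the process table every `interval` seconds.

    The directory and the interval are assumptions for illustration only.
    """
    while True:
        stamp = time.strftime("%Y%m%d-%H%M%S")
        write_snapshot(capture_ps(), out_dir, stamp)
        time.sleep(interval)
```

Calling run_forever() from a long-running shell starts the loop. Comparing
the last few snapshot files after a hang shows which processes appeared or
changed state last.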

Here are the full answers:

#######################################################################################

Matthias,

  I have seen this exact problem under SunOS 4.1 and 4.1.1, but have yet
to find an answer. Please post a Summary or send me email directly if you
would be so kind.

Regards,

Rod Livingood
rodo@auspex.com (Rod Livingood)
-------------------------------------
Matthias,

  What does the 4/490 mount via nfs? Is there another machine that serves
nfs to the 4/490? If so can you take a working login and try to change
directories to the nfs mounted partitions. I believe that this is an nfs
client-side bug.

  Please let me know what you find.

Regards,

Rod Livingood

#######################################################################################

I don't know if this is the same problem, but we had a similar problem
where processes like login and sendmail would get into a wait state and
lock up. I do not remember exactly what we did to fix it, but I think
we installed the NFS Jumbo patch (100173-07 & 100424-01). We noticed
that when huge NFS transfers were being done, it would chew up all
the system memory and bring everything to a halt. In my case, I was
unable to login, or telnet in, but I was able to do a rsh in. If
this sounds like the same problem you are having, let me know and I can
try to determine exactly what we did to fix the problem.

  jeff

-- 

Jeff Kays                          E-Mail: jkays@msc.edu
Minnesota Supercomputer Center     Phone:  (612) 626-1824
1200 Washington Avenue South       Fax:    (612) 624-6550
Minneapolis, Minnesota 55415

"May fortune favor the foolish"

-----------------------------------

One more thing. When I was having this similar problem, if I tried to login or telnet in, I would get the login prompt, enter my userid, and then immediately get "Login incorrect.", without having the system ask me for my password. But I could rlogin in. If you are experiencing this, it is probably caused by a bug in telnetd and how it deals with pty's. rlogind is written correctly, so we have hacked in.telnetd in a similar manner to get around this. The problem is if you have a background process holding a pty open, the controlling tty is not reset, so if a new process attempts to open the pty, it will get an error, or even get killed by the kernel.

I don't know if Sun has any patches to this; as I said, we hacked it ourselves. good luck!

jeff

--

Jeff Kays                          E-Mail: jkays@msc.edu
Minnesota Supercomputer Center     Phone:  (612) 626-1824
1200 Washington Avenue South       Fax:    (612) 624-6550
Minneapolis, Minnesota 55415

"May fortune favor the foolish"

#######################################################################################

My guess is that there is something wrong with your channel to disk. I
have seen this before on old UNIX systems. Programs that were loaded in
memory would seem to be fine, but any attempt to access the disk (starting
a new program, a write from an editor, etc.) would cause the apparently
working process to freeze/hang. Next time it happens, you might want to
peek at the disk access lights to see if the drive(s) are active *at
all*. Another thing to do is turn on savecore in /etc/rc.local. At the
next hang, do your STOP-A and then type g0 at the monitor prompt. That
should force a memory dump. Then reboot. If you then do a
ps auxk vmunix.X vmcore.X and find everything is in D status, your
disks/SCSI controller are suspect.
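
The "is everything in D status?" check can be automated once the ps output
taken from the dump has been saved as text. The sketch below is an
illustration and not part of the original advice. It assumes the output
starts with a header line that contains a STAT column, and that fields up
to that column are separated by whitespace; the function name is made up.

```python
def d_state_fraction(ps_text):
    """Return the fraction of listed processes whose STAT starts with 'D'.

    'D' is the ps state for a process in uninterruptible (disk) wait.
    Returns 0.0 if there are no process lines.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return 0.0

    # Locate the STAT column by name rather than by a fixed position.
    stat_col = lines[0].split().index("STAT")

    states = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) > stat_col:
            states.append(fields[stat_col])

    if not states:
        return 0.0
    blocked = sum(1 for s in states if s.startswith("D"))
    return blocked / len(states)
```

A result near 1.0 would match the "everything is a D status" case described
above; a handful of D entries among many sleeping processes would not.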

Mark Wallen cogsci, UCSD mwallen@ucsd.edu

#######################################################################################

Two things to do:

1) uncomment the savecore commands in /etc/rc.local so that you can create a core dump when the system hangs. You can then do useful things like "ps auxwwk vmunix.0 vmcore.0" which will show the processes that were running at the time. Also check "pstat -T vmunix.0 vmcore.0" to check how much swap was being used at the time of the crash.

2) My best guess is to install patch 100330, which deals with a number of system hang problems.

Has anything changed on this system lately? A new application? An extra nfs client or something?

Good luck tracking this down.

regards,
--
Glenn Satchell    ups!glenn@fourx.Aus.Sun.COM        |
Uniq Professional Services Pty Ltd ACN 056 279 335   | "The answer is no,
PO Box 70, Paddington, NSW 2021, (Sydney) Australia  | and I'll negotiate
Phone: +61-2-360-7434  Fax: +61-2-331-2572           | from there."
"Sun Accredited System Consultants"                  |

#######################################################################################

Matthias,

This sounds very similar to the problem we had. We never quite figured out what it was. We suspected we were filling up the swap space. We've increased the swap space by quite a bit and the problem went away. We also had a problem at the same time with a 1/2" tape controller. We removed it and the problems it was causing went away, but the problem you describe still happened. One time I did happen to catch the swap space at close to 100% full. You can check swap space in two ways: pstat -T or pstat -s.
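
Watching swap usage over time is easier if the pstat -s summary is turned
into numbers. The sketch below is an illustration only. It assumes the
one-line summary format "Nk allocated + Nk reserved = Nk used, Nk
available", which is not shown in this thread and may differ between
releases; the function name and the sample figures in the usage note are
made up.

```python
import re

# Assumed pstat -s summary format; adjust the pattern if your output differs.
_PSTAT_S = re.compile(
    r"(\d+)k allocated \+ (\d+)k reserved = (\d+)k used, (\d+)k available"
)


def parse_pstat_s(line):
    """Parse one pstat -s summary line into kilobyte counts and a used fraction."""
    match = _PSTAT_S.search(line)
    if match is None:
        raise ValueError("unrecognized pstat -s output: %r" % line)
    allocated, reserved, used, available = (int(g) for g in match.groups())
    total = used + available
    return {
        "allocated": allocated,
        "reserved": reserved,
        "used": used,
        "available": available,
        # Fraction of swap in use, 0.0 if no swap is configured at all.
        "used_fraction": used / total if total else 0.0,
    }
```

For example, parse_pstat_s("8040k allocated + 2556k reserved = 10596k used,
58784k available") gives a used_fraction of roughly 0.15. A value close to
1.0 would correspond to the "close to 100% full" case described above.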

Do you happen to run a package called Mathematica? I've seen it chew up all available swap space on a system before.

If you do find a solution, I'd be curious to know. Please email me as I'm going out of town on business and will be shutting off my sun-managers email until I return.

Charles Dennett            | Rochester Distributed Computer Services
Mail Stop 01925            | Customer Technical Support Services
Eastman Kodak Company      | ---------------------------------------
Rochester, NY 14650-1925   | Internet: dennett@Kodak.COM

#######################################################################################

Our SparcServer 490 used to just hang. The reason for ours was the SYBASE database which we had installed. However, after installing patch: 100159-01 (from Sun) we don't hang any more.

We didn't have the symptoms you have.. ours just hung all of a sudden, but the patch fixes a tcp/ip problem.

Hope this helps,

Regards,

Dave.

--
| Systems Programmer    | Lsi-Logic Ltd                | Phone: 0344 426544 |
| uknet!lsieur!davek    | Grenville Place, The Ring    | Ext: 3363          |
| davek@lsi-logic.co.uk | Bracknell, Berks., RG12 1BP. | Fax: 0344 413354   |

#######################################################################################

1 possible scenario :

there is a pretty big process in memory which is being swapped out. since it is so large, the Q for the swap disk(s) is overloaded. this transfer has a high priority and since it is edging on the upper bound of "normal" system limits the rest of the system appears hung but eventually (minutes later) it does recover.

if you could force this to happen & run vmstat ahead of time you would
see lots of I/O via the disk field. about the only things you can do are
add memory (might avoid swapping) and/or add swap across different disks
to balance disk load.

another possible scenario :

the system seems pretty busy, the paging activity is fairly high but disk
activity is zilch (seen from the vmstat sr and disk fields respectively).
In this case, there is a VERY LARGE file locked in memory which the system
is trying to swap out, but the pagedaemon can't "steal" locked pages. the
file could be a UFS file or a swap file (i.e. a big process). In this
situation there is basically a deadlock (tsk, tsk). starting new processes
is impossible, running processes are OK because there's enough room
already *OR* they're basically blocked. the system will not recover by
itself but can recover if the *pig* process exits or the file is closed.

the answer here is to split the swap into several smaller chunks (again preferably across disks). the pagedaemon will still not be able to steal but this will effectively reduce the size of the locked object to the size of the swap file so that pages can then be successfully obtained.
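
The signature of the second scenario (high sr, no disk activity) can be
checked mechanically from saved vmstat output. The sketch below is an
illustration only. It assumes a vmstat header line that names a column
"sr" and a column "in", with the per-disk columns sitting between them;
the function names and the threshold of 100 are made up and would need
tuning on a real system.

```python
def parse_vmstat(text):
    """Return a list of (sr, [disk ops...]) tuples from vmstat output."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    # The column-name line is the first one naming both "sr" and "in".
    header_idx = next(i for i, f in enumerate(rows) if "sr" in f and "in" in f)
    header = rows[header_idx]
    sr_col = header.index("sr")
    in_col = header.index("in")

    samples = []
    for fields in rows[header_idx + 1:]:
        # Skip repeated headers and anything that is not a full data row.
        if len(fields) != len(header) or not fields[0].isdigit():
            continue
        sr = int(fields[sr_col])
        disks = [int(x) for x in fields[sr_col + 1:in_col]]
        samples.append((sr, disks))
    return samples


def looks_deadlocked(samples, sr_threshold=100):
    """True if every sample shows a high scan rate with no disk activity."""
    if not samples:
        return False
    return all(sr >= sr_threshold and sum(disks) == 0 for sr, disks in samples)
```

Feeding it a few minutes of vmstat samples captured during a hang would
show whether the pagedaemon was scanning hard while the disks sat idle. It
says nothing about which file or process is holding the locked pages.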

hope this helps.

P.S. FYI - the statement "... such a problem on his machines ..." will likely incense some very intelligent females on this list ...

Christian Lawrence <cal@soac.bellcore.com>

#######################################################################################

Matthias:

We had a similar problem several weeks ago.

The key problem -in our case- was a kind of hang up with the NIS server, associated with a corrupted password file (which contained duplicated entries and strange control chars like ^ESC)

NIS processed and pushed the file without signalling errors, but finger, telnet and login waited forever for an answer, mainly when the password entry they wanted was a duplicate (the entries with funny characters also caused trouble sometimes, but we couldn't detect a pattern)

Solution: clean up the NIS files (by hand :-( )

Yours,

Carmine Di Biase        Simon Bolivar University
carmine@usb.ve          Computer Lab

#######################################################################################

I have seen a similar (but not identical) problem on a SPARC 2 network. The problem manifested itself in a similar fashion: the load increases, processes are stuck in a wait state, and no new logins are allowed, until only an L1-A will stop it.

In my customer's case, it was due to their Maxtor hard disk drive/controller (SCSI) getting hung. Some of the time they would see it when dumping, sometimes just in use. I finally figured out that by power cycling the drive the SCSI bus would wake up again, and then things would progress at least for a while. We ended up replacing their drives, and the problem went away.

Hope this helps.

Steven Lee
Director of Technology
Northwest Region
AVCOM Systems, Inc.
550 Kirkland Way, Suite 100
Kirkland, WA 98033
206.828.2725 Voice
206.828.8171 FAX

steve@seattle.Avcom.COM

#######################################################################################

I had a similar problem and requested patch 100330-06 (jumbo patch for system hang due to kernelmap). It's been working fine so far.

Edward Chien uunet!tekbspa!edward or edward@tss.com Teknekron Software Systems, Inc.

#######################################################################################

+-----------------------------------------------------------------------------+
| Matthias Ernst                         Domain: maer@nmr.lpc.ethz.ch         |
| Institut fuer physikalische Chemie     Bitnet: maer@czheth5a.bitnet         |
| Eidgenoessische Technische Hochschule  UUCP:   ...!mcsun!chx400!ethz!maer   |
| ETH-Zentrum                            Phone:  +41/1/256-4374               |
| CH-8092 Zuerich                        Fax:    +41/1/252-3402               |
+-----------------------------------------------------------------------------+



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:06:53 CDT