SUMMARY: Tracking down the cause of system hangs

From: Gerald Combs (gerald@unicom.net)
Date: Thu May 16 1996 - 12:48:09 CDT


  I recently posted the following message:

> We have a SPARC 20 running Solaris 2.4 + the latest cluster patch that
> has hung a couple of times in the past three weeks. The system logs and
> accounting files don't show anything unusual going on before the system hangs.
> After the first hang, I had perfmeter log its output to a file. This log
> shows me how busy the system was (very much so before the hang and not at
> all during), but not much else.
>
> Are there any other tools that I can use to track down why this is
> happening? I've started a 'top' session from a telnet on another machine
> in the hopes that I'll know what was running if the system hangs again.
> Is there anything else I can do?

  I received three replies, which are included below. I installed patch
102001-10 last Friday, and haven't had any problems so far. I also
pulled down Proctool, but I haven't had time to look at it closely yet.
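
The 'top' session mentioned in the original post only helps if someone is watching it when the machine hangs. A minimal sketch of the same idea as an unattended logger follows: it appends a timestamped process listing to a file at a fixed interval, so the last record shows what was running shortly before a hang. This is an illustration rather than anything used on the system described here; the function names, the default `ps -ef` command, the 60-second interval, and the availability of Python 3.7 or later on the host are all assumptions.

```python
import os
import subprocess
import time


def format_snapshot(timestamp, listing):
    """Return one log record: a timestamp header followed by the listing."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return "=== %s ===\n%s\n" % (stamp, listing.rstrip("\n"))


def log_snapshots(path, interval=60, count=None, command=("ps", "-ef")):
    """Append the output of `command` to `path` every `interval` seconds.

    count=None keeps going until the process is interrupted.
    """
    taken = 0
    while count is None or taken < count:
        result = subprocess.run(command, capture_output=True, text=True)
        with open(path, "a") as log:
            log.write(format_snapshot(time.time(), result.stdout))
            # Push the record to disk right away; a hang would otherwise
            # lose whatever is still sitting in buffers.
            log.flush()
            os.fsync(log.fileno())
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)


# Example call (hypothetical log path):
#   log_snapshots("/var/tmp/ps-snapshots.log", interval=60)
```

Writing to a local file rather than a remote terminal means the record survives even if the network goes away first, at the cost of a small amount of disk I/O per snapshot.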

=============================================================================

>From nino@well.ox.ac.uk Fri May 10 12:30:44 1996
Date: Wed, 8 May 1996 09:18:21 -0100 (BST)
From: Nino Margetic <nino@well.ox.ac.uk>
To: Gerald Combs <gerald@unicom.net>

Gerald,

We had a LOT of grief with system (Sparc 10/512MP w/ 128M RAM) hanging over
the last few months both with 2.4 and 2.5. Since with 2.4 it wasn't *that*
often (maybe 1 crash in 3-4 weeks) and I knew we were going to upgrade to 2.5
soon, I didn't bother to try to fix the problem. However, after we moved to
2.5 it got much worse - we had 3 hangs/crashes within 18 hours and 5 crashes
within a five-day period.

Most of the hangs were during the backup session, when Solstice Backup was
either writing to tape or waiting for a new tape to be inserted. There was
NEVER any indication in any of the system logs what was causing it, and all
attempts to force the system to coredump after it hung - such as unplugging
the keyboard, and when the system eventually fell into the boot monitor,
typing sync - were completely unsuccessful because the system would hang
after sync and we couldn't recover it at all....

In any case, it transpired to have NOTHING to do with backups. The prime
culprit for problems was the le (ethernet) driver!!! For 2.5 there is a patch
(103244-01) which fixed the problem for us (effectively a new
/kernel/drv/le). I am not sure what the state of affairs is for 2.4. Hope
this helps.

Regards,

--Nino

---
Nino Margetic <nino@well.ox.ac.uk>
The Wellcome Trust Centre for Human Genetics, University of Oxford.
Tel: +44 1865 740 005		Fax: +44 1865 742 196
---

=============================================================================

>From ahill@lanser.net Fri May 10 12:30:49 1996
Date: Wed, 8 May 1996 08:02:58 -0500
From: Alan Hill <ahill@lanser.net>
To: Gerald Combs <gerald@unicom.net>

Try a copy of Proctool; it can help you trace memory leaks as well as CPU hogs.

=============================================================================

>From davez@phil.mop.com Fri May 10 12:30:55 1996
Date: Wed, 8 May 96 09:21:17 EDT
From: Dave Zarnoch <davez@phil.mop.com>
To: gerald@unicom.net
Cc: davez@phil.mop.com
Subject: Re:Tracking...

Gerald,

Try patch 102001-09

Keywords: be diskless bigmac fast-ethernet MII hang BMAC ethernet le qe
Synopsis: SunOS 5.4: jumbo patch for be, qe, le drivers

davez@mop.com
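
Sun patch IDs have the form base-revision, so the 102001-10 mentioned in the summary is a later revision of the 102001-09 suggested here. A minimal sketch of checking whether a system already has at least the suggested revision follows. It assumes the patch listing comes from `showrev -p` and that each entry begins with `Patch: NNNNNN-RR`; the sample text, the placeholder ID 999999-01, and the function names are made up for illustration and were not captured from a real machine.

```python
import re

# Matches the start of a patch entry such as "Patch: 102001-10 ...".
PATCH_LINE = re.compile(r"^Patch:\s*(\d{6})-(\d{2})\b", re.MULTILINE)


def split_patch_id(patch_id):
    """Split '102001-09' into the base '102001' and the revision 9."""
    base, revision = patch_id.strip().split("-")
    return base, int(revision)


def installed_revision(listing, base):
    """Return the highest listed revision of patch `base`, or None."""
    revisions = [int(rev) for found, rev in PATCH_LINE.findall(listing)
                 if found == base]
    return max(revisions) if revisions else None


def satisfies(listing, wanted):
    """True if `listing` shows `wanted` or a later revision of it."""
    base, wanted_revision = split_patch_id(wanted)
    have = installed_revision(listing, base)
    return have is not None and have >= wanted_revision


# Illustrative sample only; 999999-01 is a placeholder, not a real patch.
SAMPLE = """\
Patch: 102001-10 Obsoletes:  Packages: (elided)
Patch: 999999-01 Obsoletes:  Packages: (elided)
"""

print(satisfies(SAMPLE, "102001-09"))
```

On a real system the listing would be read from the command output instead of the sample string.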

=============================================================================

---------------------------------------------------------------------------
*****   *****  Gerald Combs                          gerald@unicom.net
 ***     ***   System Administrator             http://www.unicom.net
  *       *    Unicom Communications, Inc.             fyi@unicom.net
    *****      7223 W. 95th St., Ste 325      (913)383-1983 Ext. 101
     ***       Overland Park, KS 66212   (913)383-8466 Client Support
      *                                             (913)383-1998 Fax
       "I can't hear you - I'm using the scrambler." -- Repo Man
---------------------------------------------------------------------------



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:11:00 CDT