SUMMARY - SS5 panic

From: Christopher M. Murphy (murphyc@synapse.bms.com)
Date: Wed Dec 18 1996 - 12:04:03 CST


--- Thanks to the following for responding to my question:

Tim Carlson <tim@santafe.edu>
Stephen Harris <sweh@mpn.com>
Avi J. Levin <alevin@ltcm.com>
Kevin.Sheehan@uniq.com.au
Jens Fischer <jefi@kat.ina.de>
raju@hoho.ecologic.net

---- Response summary ----

The majority of responses indicated that this was a SIMM-related hardware
problem. A few responses suggested other places to look if replacing the
memory did not fix it. We have replaced the SIMM pointed to by the memory
address, and the machine has stayed up for three days now.
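
As a rough illustration of how a fault address can point at one SIMM, the
sketch below maps the MFAR value from the panic in the original posting to a
slot index. It assumes that each SIMM slot owns a fixed 32 MB window of
physical address space starting at address 0. That layout is an assumption
made for illustration, not something stated in the responses, so check the
machine's service manual for the real slot map. The names SLOT_WINDOW_BYTES
and simm_slot_for_address are hypothetical.

```python
# Hypothetical helper: map a physical fault address to a SIMM slot index.
# ASSUMPTION: each slot owns a contiguous 32 MB window starting at address 0.
SLOT_WINDOW_BYTES = 32 * 1024 * 1024


def simm_slot_for_address(addr, window=SLOT_WINDOW_BYTES):
    """Return the zero-based index of the window that contains addr."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    return addr // window


# MFAR value copied from the panic message in the original posting.
mfar = int("4430d80", 16)
slot = simm_slot_for_address(mfar)
print(f"MFAR=0x{mfar:x} falls in window {slot} under the assumed layout")
```

If the real slot windows differ, only SLOT_WINDOW_BYTES (or the mapping
function) needs to change; the address arithmetic stays the same.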

------- Responses received -------

From: Tim Carlson <tim@santafe.edu>
Subject: Re: SS5 panic

I think your service guy is on the right track. And I wouldn't guess that
it is a SIMM error, but a true error on the motherboard.

I had similar messages occur on an Ultra a few months back and they
replaced the motherboard.

-----------

From: Stephen Harris <sweh@mpn.com>
Subject: Re: SS5 panic

When my SS5 did this, it _was_ a SIMM problem. Check the manager archives
for around a year ago.

rgds
Stephen

--------

From: alevin@ltcm.com (Avi J. Levin)
Subject: Re: SS5 panic
To: murphyc@synapse.bms.com

Looks to me like your DMA chip on the main board has a problem. I'd bet you
need a new motherboard!

Avi

---------------------

From: Kevin.Sheehan@uniq.com.au (Kevin Sheehan {Consulting Poster Child})
Subject: Re: SS5 panic

Almost certainly memory or a SIMM. The fact that it happens only on DMA
(which is pretty stressful and a large source of reads anyway) is kind of
strange, but timing is slightly different there.

Does the location of the fault move around much, or is that pretty much
the same?

It could be the DMA engine or something too, but almost certainly hardware
in any case.

                   l & h,
                   kev

----------------

From: Jens Fischer <jefi@kat.ina.de>
Subject: Re: SS5 panic

Hi Christopher,

The asynchronous memory fault clearly indicates a memory
problem. The other messages are quite normal for a memory error:
DMA means direct memory access, and if there is a memory error,
there may be DMA errors, too.

Kind Regards - Jens Fischer

---------------

From: raju@hoho.ecologic.net
Subject: Re: SS5 panic

There is a hardware problem with SS5s that causes crashes when you have
more than one SIMM and they are not identical (read: same part number). There
is a patch for SunOS that many people claim works, but I have never been
able to find a corresponding patch for Solaris. We had the same problem,
which caused a crash a couple of times a day. I installed patch 101945-43
(for Solaris 2.4) and it *seemed* to reduce the problem (although it
could have been coincidence; I was too frustrated to pursue it). The
machine would then crash only once a week or so. Finally, we ended up
replacing all the SIMMs with identical ones. One other thing to try, which
might help, is to make sure all the SIMMs are in consecutive slots. Again,
I'm not sure whether this will help, since this bug is difficult
to reproduce ...

--raju

-------- Original posting -------------

>We have a SPARCstation 5 running as a DNS server that has started crashing
>several times per week. Our hardware support tech is approaching this
>problem as a failure in the memory subsystem (i.e., a SIMM gone bad).
>To me, it looks like the error could be related to something failing on the
>SCSI bus based on the panic message in the messages file. Can anyone
>tell me if the following error points directly to memory or something on
>the SCSI bus?
>
>First some config info:
>
> System Configuration: Sun Microsystems sun4m
> Memory size: 192 Megabytes
> SUNW,SPARCstation-5
>
> AVAILABLE DISK SELECTIONS:
> 0. c0t1d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
> /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@1,0
> 1. c0t3d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
> /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@3,0
>
> ---
>
> Panic message:
>
> NOTE: the numbers for the Target and LUN are different in some of the
> other panics, so it doesn't point to just one device.
>
>
> @0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):
> dma error: current esp state:
> esp: State=DATA_DONE Last State=DATA
> esp: Latched stat=0x0 intr=0x0 fifo 0x80
> esp: last msg out: <unknown msg>; last msg in: IDENTIFY
> esp: DMA csr=0xa4240212<EN,INTEN,ERRPEND>
> esp: addr=fc012d65 dmacnt=1400 last=fc012000 last_cnt=1400
> esp: Cmd dump for Target 1 Lun 0:
> esp: cdblen=10, cdb=[ 0x2a 0x0 0x0 0x3f 0x51 0xd4 0x0 0x0 0xa 0x0 ]
> esp: pkt_state=0x7<CMD,SEL,ARB> pkt_flags=0x4000 pkt_statistics=0x1
> esp: cmd_flags=0xc22 cmd_timeout=60
> unix: WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000
> (esp0):
> unix: Unrecoverable DMA error on dma
> unix: panic: asynchronous memory fault: MFSR=81004040 MFAR=4430d80
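
To see what the controller was doing when the DMA error hit, the command
block in the "cdb=[...]" line above can be decoded. This is a minimal sketch,
assuming the standard 10-byte SCSI CDB layout (opcode, flags, 4-byte
big-endian logical block address, group byte, 2-byte big-endian transfer
length, control byte). The helper decode_cdb10 is hypothetical, and both the
512-byte block size and the reading of "dmacnt" as hexadecimal are
assumptions.

```python
# Hypothetical helper: decode a 10-byte SCSI command descriptor block.
# ASSUMPTION: standard 10-byte layout with big-endian LBA and length fields.
OPCODE_NAMES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(cdb):
    """Return opcode name, logical block address, and block count."""
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcode = cdb[0]
    lba = int.from_bytes(bytes(cdb[2:6]), "big")      # bytes 2..5
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")   # bytes 7..8
    return {
        "opcode": OPCODE_NAMES.get(opcode, hex(opcode)),
        "lba": lba,
        "blocks": blocks,
    }


# CDB bytes copied from the "esp: cdblen=10, cdb=[...]" line of the panic.
cdb = [0x2A, 0x0, 0x0, 0x3F, 0x51, 0xD4, 0x0, 0x0, 0xA, 0x0]
info = decode_cdb10(cdb)
print(info)

BLOCK_SIZE = 512  # ASSUMPTION: 512-byte disk blocks
print("transfer bytes:", hex(info["blocks"] * BLOCK_SIZE))
```

Under those assumptions the command is a 10-block WRITE(10), and 10 x 512
bytes is 0x1400, which lines up with the "dmacnt=1400" field if that field is
hexadecimal. A disk write means the DMA engine was reading from main memory
at the time, which fits the memory-fault reading of the panic but does not by
itself rule out the DMA hardware.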

-- 
Christopher M. Murphy		email: murphy@bms.com
Bristol Myers Squibb		phone: (609) 252-5741
Scientific Information Systems	fax: (609) 252-6163
Princeton  NJ


