SUMMARY: SS1000 Intermittent panics on cpus

From: Mark Diaz (mark@radius.com)
Date: Tue Sep 17 1996 - 14:53:02 CDT


We finally appear to have solved the problem with intermittent panics on
different cpus on our SparcServer 1000. After replacing a couple of cpus, the
power supply, the backplane, and two of the four boards, Sun replaced all eight
cpus.

Thanks to:

Maryellen.Yager@Eng.Sun.COM (MaryEllen Yager)
        "replace the power supply"
blymn@awadi.com.au (Brett Lymn)
        "Sun has a kernel hack to help you get a crash dump;
        consider upgrading to Solaris 2.5."
keith@oz.health.state.mn.us (Keith Willenson)
        "Check the fans. We had a problem with a power supply not supplying
        enough power to the fans, causing the equipment to overheat.
        Taking the cover off helped a lot :)"
jay_gallivan@PaineWebber.com
        "verify your cpus are the same speed"

The original problem is below...

Thanks,
Mark Diaz
mark@radius.com
408-541-5407

>I've been having problems with our Sparc 1000 since the West coast power
>outage on August 10th and I hope you can help. Sun has come out to replace
>some hardware but I'm still seeing daily panics on different cpus, causing me
>to power-cycle the machine to get it back up...
>
>After the power outage, the machine booted and "ran fine" for a couple
>days. The following Monday I noticed problems on /var and I tried to reboot
>it but got an "Error for command 'write' Error level Fatal". When I tried to
>reboot off CD-ROM, I got
>
>BAD TRAP: cpu_id:7 type=7 <Memory address alignment> addr=0 rw=0 rp=e70513bc
>panic[cpu7]/thread=0xf5b12800:interrupt:mutex fixup failed
>
>At this point, I wasn't able to boot the machine. Sun replaced the disk for
>/var, replaced board #1, and replaced a couple of the cpus and got the
>machine up and running.
>
>However since then, the machine has been freezing intermittently with
>increasing frequency. Yesterday it froze twice. The freezes occur at
>different times, sometimes when it is idle. This is what was on the screen
>when it froze last night:
>
>srmmu_pagesync+0x1d4 @ 0xe0019f68, fp=0xe70d5b08
> args=e3ad6028 e6ca2fdc 1 f7a4s8 e3e27ac4 e0cc822c
>hat_pagesync+0x44 @ 0xe00511dc, fp=0xe70d5b68
> args=e3ad6028 e0cc822c e6ca2fdc 1 f545a01c 1
>pvn_getdirty+0xd4 @ 0xe009bed0, fp=0xe70d5bc8
> args=e0cc822c 1 b1f340ff 1 18000 e6ca2fdc
>Sysbase+0x163af0 @ 0xf5563af0, fp=0xe70d5c28
> args=e0cc822c 0 0 f6450c6e e01030bc 0
>Sysbase+0x16082c @ 0xf556082c, fp=0xe70d5c98
> args=f6450c08 0 e0cc822c 2000 0 f616e480
>Sysbase+0x16006c @ 0xf556006c, fp=0xe70d5d60
> args=f6450c00 fe1ca000 0 e70d5e34 400 1
>rw+0x288
>syscall+...
>.syscall+...
>(unknown)+...
>End traceback...
>oracle: Data fault
>kernel write fault at addr=0xf55ca000, pte=0x47d26c
>MMU sfsr=0x100b6: ft=<Access bus error> at=<supv data store> level=0
>MMU sfsr=0x100b6<CS,FAV>
>srmmu_tlbflush+0x5c, pid=945, pc=0xee001db84, sp=0xe70db1e0,
>psr=0x40000dc1, context=0
>g1-g7: f558093c, 8000000, ffffffff, 18, fb6404800, 1, f63e6800
>panic[cpu6]/thread=0xf63e6800: trap: unexpected MMU trap
>Dump Aborted
>Type 'go' to resume.
>Type help for more information
>
>I typed sync at the ok prompt and got
>
>panic[cpu6]/thread=0xf63e6800: zero
>Dump Aborted
>
>At this point the machine doesn't respond to Stop-A or to disconnecting and
>reattaching the keyboard, so I power-cycle it. Sometimes it doesn't respond
>at all, and all I can do is power-cycle it. It fixes some filesystem
>problems and then boots up and runs until the next crash. The cpu panics
>occur on a different cpu each time (not just cpu6 and cpu7).
>
>Last weekend I reinstalled Solaris 2.3 and the suggested patches (including
>101318-77), and restored /etc. (Is there a list of files one should restore
>following a reinstall?) I'm concerned there are other patches I need
>that aren't on the suggested list and plan to hunt around this afternoon...
>
>The Sparc 1000 has 8 cpus, 1782 MB RAM, and is running Solaris 2.3 and
>Oracle 7.0.16. It also has Prestoserve NVSIMMs and a Cisco fddi card.
>
>I'm waiting for the next crash so I can send another core dump to Sun
>kernel support but if anyone can give me any tips in the meantime, I'd
>appreciate any help.
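
The panic lines quoted above have a regular shape, so a saved console log
can be tallied per cpu to check whether the panics really do move between
cpus rather than clustering on one. What follows is only a rough sketch in
Python, assuming the console output has been captured as a list of text
lines. The function name tally_panics and the sample lines are illustrative
and are not taken from any Sun tool.

```python
import re
from collections import Counter

# Matches console lines such as:
#   panic[cpu6]/thread=0xf63e6800: zero
# A leading ">" from quoted mail is tolerated because search() is used.
PANIC_RE = re.compile(r"panic\[cpu(\d+)\]/thread=0x[0-9a-fA-F]+:")


def tally_panics(lines):
    """Return a Counter mapping cpu number -> number of panic lines seen."""
    counts = Counter()
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical capture, built from the panic lines in this message.
    sample = [
        "panic[cpu7]/thread=0xf5b12800:interrupt:mutex fixup failed",
        "panic[cpu6]/thread=0xf63e6800: trap: unexpected MMU trap",
        "Dump Aborted",
        "panic[cpu6]/thread=0xf63e6800: zero",
    ]
    for cpu, count in sorted(tally_panics(sample).items()):
        print(f"cpu{cpu}: {count}")
```

If the counts pile up on one cpu, that points at a single module or board.
An even spread, as in our case, suggests something shared between them.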


