SUMMARY: E250 reboot

From: Qi Chen (qchen@insight.jpl.nasa.gov)
Date: Fri Dec 10 1999 - 13:49:43 CST


Original message: (See bottom)

Thank you to all of the people who sent me answers:

Christian Pinheiro <pinheiro@veritel.com.br>
Bruce Cheng <bcheng@corio.com>
Ulan Mamytov <tgr@ns2.kyrnet.kg>
John Chrisoulakis <john.chrisoulakis@antdiv.gov.au>
Mohammed <abusakit@pop.dnvr.uswest.net>
Salman Farooq TNG <Salman2@wipro.co.in>
Sue <Sue_Thielen@psdi.com>
Mike Watts <mikewatts@traverse.com>
H.S. Yann <yann@veritel.com.br>

Summary of the suggested solutions:
-----------------------------------

* Disconnect, clean, and reseat the CPUs and memory modules.
* It may be a CPU hardware problem; check the logs in /var/adm carefully
  with grep "cpu". Also upgrade to the latest kernel patches ("uname -a"
  shows the current patch version).
* Check the DIMMs. ECC errors may be causing the problem.
* Try Solstice DiskSuite 4.1 patch 104172. Without the patch, the
  system may crash if the root partition is mirrored.
* Read the core dump to analyze the problem.
* Make sure the power supply is OK. Also pay attention to power-saving mode.
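
For anyone who wants to script the grep "cpu" suggestion above, here is
a minimal Python sketch. It assumes the logs are plain text under
/var/adm/messages* as on our machine; the function names grep_lines and
scan_logs are made up for illustration and are not part of any Sun tool.

```python
import glob


def grep_lines(lines, needle="cpu"):
    """Return the lines containing `needle`, ignoring case (like grep -i)."""
    needle = needle.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_logs(pattern="/var/adm/messages*", needle="cpu"):
    """Scan every file matching `pattern`.

    Returns a list of (filename, line) pairs. The default pattern is an
    assumption based on where our machine keeps its logs.
    """
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log holds stray bytes.
        with open(path, errors="replace") as fh:
            hits.extend((path, line) for line in grep_lines(fh, needle))
    return hits


# Typical use (run as a user who can read the logs):
#   for path, line in scan_logs():
#       print(f"{path}: {line}")
```

The match is case-insensitive on purpose, because the entries we found
use both "cpu=0" and "CPU0".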

The solution for our case:
--------------------------

It was a CPU hardware problem. We found the following entries in the
/var/adm/messages.* files (more than one occurrence). Most of the time,
however, the machine just hung without logging any message.

Dec 3 17:57:31 hostnm unix: BAD TRAP: cpu=0
     type=0x10 rp=0x30437898 addr=0x6188567c mmu_fsr=0x0
Nov 22 11:39:50 hostnm unix: panic[cpu0]/thread=0x30023e80:
     CPU0 Ecache SRAM Data Parity Error: AFSR 0x00000000
     80400500 AFAR 0x00000000 000fff60
     
Sun says "CPU0 Ecache SRAM Data Parity Error" is a hardware problem,
and they shipped us a new CPU. We replaced CPU-0, and the problem is
fixed. So far the machine has been stable.
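
On a multi-CPU box it helps to see which CPU the entries keep pointing
at. The following Python sketch tallies CPU numbers from log lines. The
three patterns are modelled only on the messages quoted above, so other
Solaris releases or hardware may word them differently; the names
CPU_PATTERNS and tally_cpus are made up for illustration.

```python
import re
from collections import Counter

# Patterns derived only from the message shapes quoted in this summary.
CPU_PATTERNS = [
    re.compile(r"BAD TRAP: cpu=(\d+)"),
    re.compile(r"panic\[cpu(\d+)\]"),
    re.compile(r"CPU(\d+) Ecache SRAM Data Parity Error"),
]


def tally_cpus(lines):
    """Count how often each CPU number appears in trap/panic messages."""
    counts = Counter()
    for line in lines:
        for pattern in CPU_PATTERNS:
            for match in pattern.finditer(line):
                counts[int(match.group(1))] += 1
    return counts
```

Fed the five log lines quoted above, tally_cpus returns
Counter({0: 3}): one BAD TRAP, one panic, and one Ecache parity error,
all on CPU 0, which matches the CPU that Sun replaced.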

Original question:
------------------
>I have an Ultra Enterprise 250 with 2 CPUs, Solaris 2.6, and
>Sun Solstice DiskSuite 4.1 installed. The Sun 2.6
>Recommended patches and the Y2K patches are also
>installed.
>
>The problem is that it sometimes reboots by itself. After
>the reboot, there is no error message in the /var/adm/messages
>file and none in the /var/log/* files. We called
>Sun, and Sun shipped us a new motherboard and new CPUs.
>After the replacement, the situation got much better,
>but the machine still sometimes reboots or hangs by itself.
>
>Sometimes, just after a reboot, I can see an error message
>when I type the command 'dmesg', such as:
>
>panic[cpu0]/thread=0x30023e80:
>


