SUMMARY: V880 crashes

From: Grzegorz Bakalarski <G.Bakalarski_at_icm.edu.pl>
Date: Thu Nov 11 2004 - 06:48:39 EST
Dear ALL

Problem solved (I hope).
I got 15 answers (most of them on Saturday afternoon and Sunday ;-) ).
Thanks to all - your input was very helpful in solving the problem.
Almost everyone suggested it is a hardware problem. So on Monday I opened
a ticket with the local reseller's tech support and provided extended
logs to them. The next day I got a call from SUN, and a SUN engineer
came and replaced two memory DIMMs (J2900 & J8100).
In addition to what is written in the full summary (see below),
I found out the following:
* the best way to diagnose such errors is from OBP (Solaris is
  multithreaded and memory is interleaved, so from the OS it is sometimes
  very hard to find the right DIMM)
* single-bit errors are (usually) no problem, and even two-bit errors can
  be cured (I had two-bit errors on the J8100 DIMM). Errors of more than
  two bits are usually fatal (I had four-bit errors on the J2900 DIMM) -
  especially at low addresses (where the kernel is loaded)
* for better diagnostics one needs to set these OBP variables:
  setenv diag-switch? true
  setenv diag-level max
  or even turn the key switch to the diagnostic position.
* it may help to log console messages to a file (from a serial console
  and xterm using script; or with HyperTerminal with logging to file).
  It is a good idea to leave console logging on permanently in such cases -
  this may help to catch the right info.
* in an urgent case one could just take the faulty board out of the machine
  and let it run in a smaller configuration. Another temporary
  workaround may be to use the .asr commands (OBP) to disable
  particular DIMMs or CPUs
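
Since it is hard to spot the right DIMM from the OS side, it can at
least help to tally which DIMM groups the [AFT1] messages name and how
many bits each error ID reported. The following is only a rough sketch,
not a Sun tool: the names summarize, SLOT_RE, BITS_RE and SAMPLE are
made up for illustration, the sample lines are copied from the
/var/adm/messages excerpts in the original query below, and the
patterns assume the message layout seen there.

```python
import re
from collections import Counter

# Continuation lines that name a DIMM group, e.g.
# "Fault_PC 0x1035ec4 Esynd 0x0152 Slot A: J7900 J7901 J8001 J8000"
SLOT_RE = re.compile(r"Esynd\s+0x[0-9a-fA-F]+\s+Slot\s+(\w+):\s+((?:J\d+\s*)+)")
# Bit-count summary lines, e.g. "errID 0x... Two Bits were in error"
BITS_RE = re.compile(
    r"errID\s+(0x[0-9a-fA-F]+\.[0-9a-fA-F]+)\s+(.+?)\s+Bits were in error"
)


def summarize(lines):
    """Return (Counter of (slot, dimms) groups, dict of errID -> bit count text)."""
    banks = Counter()
    bits = {}
    for line in lines:
        m = SLOT_RE.search(line)
        if m:
            banks[(m.group(1), tuple(m.group(2).split()))] += 1
        m = BITS_RE.search(line)
        if m:
            bits[m.group(1)] = m.group(2)
    return banks, bits


# Lines copied from the two crashes quoted in the original query.
SAMPLE = [
    "Nov  4 21:00:01 v880_sol9     Fault_PC 0x1177184 Esynd 0x0152",
    "Nov  4 21:00:01 v880_sol9     Fault_PC 0x1035ec4 Esynd 0x0152 Slot A: J7900 J7901 J8001 J8000",
    "Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 673850 kern.notice] [AFT1] errID 0x0030e2ee.7ce1b288 More than four Bits were in error",
    "Nov  4 21:00:01 v880_sol9     Fault_PC 0x1090154 Esynd 0x00b6 Slot A: J7900 J7901 J8001 J8000",
    "Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 196182 kern.notice] [AFT1] errID 0x0030e2ee.7ce38f18 Three Bits were in error",
    "Nov  6 12:53:19 v880_sol9     Fault_PC 0x117bb00 Esynd 0x00e2 Slot A: J8100 J8101 J8201 J8200",
    "Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 300719 kern.notice] [AFT1] errID 0x00008286.8959bc20 Two Bits were in error",
]

if __name__ == "__main__":
    banks, bits = summarize(SAMPLE)
    for (slot, dimms), count in banks.most_common():
        print(f"Slot {slot}: {' '.join(dimms)}  reported {count} time(s)")
    for err_id, severity in bits.items():
        print(f"{err_id}: {severity} bits in error")
```

To run it against a real log, one could pass open("/var/adm/messages")
to summarize() instead of SAMPLE. The tally only narrows things down to
a group of four DIMMs; picking the single bad one still needs the OBP
diagnostics described above.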

Again BIG thanks to all. All the best!

Grzegorz

P.S. Original query and full summary follows....


--------------------- Original Query --------------------------
Dear Gurus

Our production server (SUN Fire V880, 6x900MHz, 12GB, Solaris 9)
crashed twice during the last 48 hours. The first time it panicked and
successfully rebooted itself. The second time it panicked and died
(I had to power the machine off and on).

Could anyone tell me what the problem is? Is it hardware or software?
Might the recommended patches help somehow?

On the other hand, I started the machine in diagnostic mode and there
were no errors. Also prtdiag does not show any failures.

The machine is 2 years old, so it is still under hardware warranty ...
What is strange is that the events occurred when the load was low (less
than 1; during daytime the load can be up to 40).

Thanks for any help

Grzegorz

>info from  /var/adm/messages
====================================== 1 ===============================
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 360866 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x0030e2ee.7ce2a5e4
Nov  4 21:00:01 v880_sol9     AFSR 0x00000008<EDU>.00000152 AFAR 0x000000a0.3db88550
Nov  4 21:00:01 v880_sol9     Fault_PC 0x1177184 Esynd 0x0152
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 360866 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x0030e2ee.7ce2a5e4
Nov  4 21:00:01 v880_sol9     AFSR 0x00000008<EDU>.00000152 AFAR 0x000000a0.3db88550
Nov  4 21:00:01 v880_sol9     Fault_PC 0x1177184 Esynd 0x0152
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 606810 kern.notice] [AFT1] errID 0x0030e2ee.7ce2a5e4 More than four Bits were in error
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 465517 kern.info] [AFT2] errID 0x0030e2ee.7ce2a5e4 PA=0x000000a0.3db88540
Nov  4 21:00:01 v880_sol9     E$tag 0x00000280.f6020000 E$state_5 Modified
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00000300.08f1f440 0x00000000.00000000 ECC 0x0a3
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x10) 0x07000000.00000000 0xf0ff0fff.ffffffff ECC 0x100 *Bad* Esynd=0x152
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x80000002.00000000 0x00000000.00000000 ECC 0x099
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x30) 0xffffffff.00000000 0x01002000.00000000 ECC 0x1d5 *Bad* Esynd=0x071
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
Nov  4 21:00:01 v880_sol9 unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x000000a0.3db88000
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 209006 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x0030e2ee.7ce1b288
Nov  4 21:00:01 v880_sol9     AFSR 0x00500000<DUE,PRIV>.00000152 AFAR 0x000000a0.3db88550
Nov  4 21:00:01 v880_sol9     Fault_PC 0x1035ec4 Esynd 0x0152 Slot A: J7900 J7901 J8001 J8000
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 673850 kern.notice] [AFT1] errID 0x0030e2ee.7ce1b288 More than four Bits were in error
Nov  4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 630565 kern.warning] WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU3 Privileged Data Access at TL=0, errID 0x0030e2ee.7ce38f18
Nov  4 21:00:01 v880_sol9     AFSR 0x00100004<PRIV,UE>.000000b6 AFAR 0x000000a0.2e5ea340
Nov  4 21:00:01 v880_sol9     Fault_PC 0x1090154 Esynd 0x00b6 Slot A: J7900 J7901 J8001 J8000
Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 196182 kern.notice] [AFT1] errID 0x0030e2ee.7ce38f18 Three Bits were in error
Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 828748 kern.info] [AFT2] errID 0x0030e2ee.7ce38f18 PA=0x000000a0.2e5ea340
Nov  4 21:00:02 v880_sol9     E$tag 0x00000280.b9010000 E$state_5 Exclusive
Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x00) 0x00000318.7da154b0 0x0c007000.00000000 ECC 0x07a *Bad* Esynd=0x0b6
Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x00000700.10c6f5d0 0x03007300.3c50a108 ECC 0x074
Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00000300.09542328 0x03007002.baddcafe ECC 0x1b9
Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x30) 0x00000000.00000000 0x03006332.15542280 ECC 0x0a9 *Bad* Esynd=0x149
Nov  4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
Nov  4 21:00:02 v880_sol9 unix: [ID 836849 kern.notice] 
Nov  4 21:00:02 v880_sol9 ^Mpanic[cpu3]/thread=30003671520: 
Nov  4 21:00:02 v880_sol9 unix: [ID 640582 kern.notice] [AFT1] errID 0x0030e2ee.7ce38f18 UE Error(s)
Nov  4 21:00:02 v880_sol9     See previous message(s) for details
Nov  4 21:00:02 v880_sol9 unix: [ID 100000 kern.notice] 
Nov  4 21:00:02 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004969c0 SUNW,UltraSPARC-III+:cpu_aflt_log+5c0 (2a100496acb, 1, 2a100496cd8, 10, 117d180, 117d1a8)
Nov  4 21:00:02 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 0000000001222d04 0000000000000010 0000000000000003 000002a100496cd8
Nov  4 21:00:02 v880_sol9   %l4-7: 000000a02e5ea340 0000000000000000 000002a100496c08 000002a100496a7e
Nov  4 21:00:02 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100496c10 SUNW,UltraSPARC-III+:cpu_deferred_error+4d4 (0, 1, 40100004032000b6, 40100004, a0, 6bc)
Nov  4 21:00:02 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 000002a100496cd8 0000000400000000 40100004032000b6 000003000367d928
Nov  4 21:00:02 v880_sol9   %l4-7: 0000000000000001 000002a100497220 0000030000010300 0000000080000000
Nov  4 21:00:02 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497170 unix:ktl0+48 (30002f1b298, 0, 20, 0, 7092c300, 0)
Nov  4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000005 0000000000001400 0000000080001604 0000000001171800
Nov  4 21:00:03 v880_sol9   %l4-7: 0000000001446800 0000000001410478 0000000000000000 000002a100497220
Nov  4 21:00:03 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004972c0 genunix:dnlc_purge_vfsp+8c (30002f1b298, 2a100497370, 144f400, 1495000, 2a100497440, 2a100497446)
Nov  4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 00000300093ae008 0000000000000000 00000301f50e6540 0000000000000000
Nov  4 21:00:03 v880_sol9   %l4-7: 0000000000000000 0000030002f1b288 0000030008b665b0 0000000001443ee0
Nov  4 21:00:03 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004973b0 genunix:dounmount+c (30008b665b0, 0, 300003a5f28, 0, 30003671520, 0)
Nov  4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 000003003cc565c0 000003000396d5e0 0000000000000000 0000000000000000
Nov  4 21:00:03 v880_sol9   %l4-7: 000003000b3be100 0000030009387ab0 000003000b3be182 0000030009387b08
Nov  4 21:00:03 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497460 namefs:nm_umountall+a8 (781ad4a0, 300003a5f28, 20, 2a1004975bc, 30003671520, 4)
Nov  4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 00000300045a3308 0000030008b665b0 0000000000000000 00000300038aa8c0
Nov  4 21:00:03 v880_sol9   %l4-7: 0000000000000000 00000000781ad488 0000000000000088 00000000781ad540
Nov  4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497510 namefs:nm_unmountall+10 (300038aa8c0, 300003a5f28, 20, 7bf, 0, 0)
Nov  4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 00000300038aa8c0 00000300003a5f28 0000000000000001 0000000001499508
Nov  4 21:00:04 v880_sol9   %l4-7: 0000000000000001 0000000000000000 0000030003963e38 000002a100497ba0
Nov  4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004975c0 unix:stubs_common_code+70 (300038aa8c0, 300003a5f28, 20, 7bf, 0, 0)
Nov  4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000000 0000000000000000 0000030009386910 0000000000000000
Nov  4 21:00:04 v880_sol9   %l4-7: 00000000000000b0 0000000001410a10 0000030003963ce0 0000030009387b38
Nov  4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497670 fifofs:fifo_close+2d8 (30003963dd0, 300038aa8ae, 1, 0, 300003a5f28, 3000367151c)
Nov  4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 00000300038aa8a0 00000300038aa8c0 0000000000000003 0000000000000000
Nov  4 21:00:04 v880_sol9   %l4-7: 00000300038aa9c0 000003000366f188 00000300038aa9c0 0000000000000000
Nov  4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497720 genunix:closef+54 (3000932d378, 0, 1, 0, 100c6ac, 0)
Nov  4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 0000000001340550 0000000000000001 00000300038aa9c0 000000000000000f
Nov  4 21:00:04 v880_sol9   %l4-7: 0000000001495000 0000000000000000 000000000140e000 0000000000000001
Nov  4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004977d0 genunix:closeall+30 (300036d1d10, 30003671520, 20, 0, 7092c300, 0)
Nov  4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 00000300092faea8 0000000000000004 0000030000010680 0000000000000000
Nov  4 21:00:05 v880_sol9   %l4-7: 0000030000010558 0000000001410478 0000030003671520 000000000000fffd
Nov  4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497880 genunix:proc_exit+310 (3023596f798, 149c280, 30003671520, 300003a5f28, 0, 0)
Nov  4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 000003000b3d84a0 00000300036d1440 000003000366f188 0000000000000002
Nov  4 21:00:05 v880_sol9   %l4-7: 000000000000000f 0000000000000002 000000000000000f 0000000000000000
Nov  4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497930 genunix:exit+8 (2, f, 300036d1554, 0, 30003671520, 0)
Nov  4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 000000000000000f 0000000000000002 0000000000004000 00000300036d1440
Nov  4 21:00:05 v880_sol9   %l4-7: 0000000000000000 000000000000000f 0000000000000070 0000000000000000
Nov  4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004979e0 genunix:post_syscall+3e0 (2a100497ba0, 3, 0, 1, 30003671520, 4)
Nov  4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000004 00000300036d1440 000003000366f188 0000000000000000
Nov  4 21:00:05 v880_sol9   %l4-7: 0000000000000000 0000000000000000 0000000000000004 00000000ffbffdf8
Nov  4 21:00:06 v880_sol9 unix: [ID 100000 kern.notice] 
Nov  4 21:00:06 v880_sol9 genunix: [ID 672855 kern.notice] syncing file systems...
Nov  4 21:00:06 v880_sol9 unix: [ID 836849 kern.notice] 
Nov  4 21:00:06 v880_sol9 ^Mpanic[cpu3]/thread=30003671520: 
Nov  4 21:00:06 v880_sol9 unix: [ID 340138 kern.notice] BAD TRAP: type=31 rp=1437f90 addr=a0 mmu_fsr=0 occurred in module "genunix" due to a NULL pointer dereference
Nov  4 21:00:06 v880_sol9 unix: [ID 100000 kern.notice] 
Nov  4 21:00:06 v880_sol9 genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c1t0d0s1, offset 644022272, content: kernel
Nov  4 21:01:30 v880_sol9 genunix: [ID 409368 kern.notice] ^M100% done: 160398 pages dumped, compression ratio 2.45, 
Nov  4 21:01:31 v880_sol9 genunix: [ID 851671 kern.notice] dump succeeded
Nov  4 21:02:16 v880_sol9 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 Version Generic_117171-02 64-bit

================================== 2 =======================================
Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 621593 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x00008286.8959bc20
Nov  6 12:53:19 v880_sol9     AFSR 0x00500000<DUE,PRIV>.000000e2 AFAR 0x000000a0.6c7ec0c0
Nov  6 12:53:19 v880_sol9     Fault_PC 0x117bb00 Esynd 0x00e2 Slot A: J8100 J8101 J8201 J8200
Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 300719 kern.notice] [AFT1] errID 0x00008286.8959bc20 Two Bits were in error
Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 978170 kern.info] [AFT2] errID 0x00008286.8959bc20 PA=0x000000a0.6c7ec0c0
Nov  6 12:53:19 v880_sol9     E$tag 0x00000281.b1000001 E$state_3 Invalid
Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00000000.00000000 0x00714fb0.00000000 ECC 0x123
Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00714fb0.00000000 0x00000039.00000000 ECC 0x185
Nov  6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x00717188.00718018 0xff2fa7e8.00000000 ECC 0x032
Nov  6 13:42:53 v880_sol9 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 Version Generic_117171-02 64-bit

========================== ANSWERS =========================================

************** ANSWER 1
Date: Sat, 06 Nov 2004 10:59:25 -0500
From: Bill Voight <bvoight at patriot.net>

Some of the errors appear hardware related.  First thing is to patch to 
current levels.  Second, call support.  They may reseat memory and 
CPUs.  That might be tough with a production server, but it's worth a 
try.  They also have diagnostic software that might be useful.  If you 
don't run explorer, you might install it.  It can send Sun info they 
need to diagnose the problem. 

We've had some mysterious 880 problems like yours and in one case, 
patching did the trick.  The problem recently resurfaced, but we may 
have traced it to a shaky Oracle table.  Let me know how you do.

BV

**************** ANSWER 2
Date: Sat, 6 Nov 2004 17:07:37 +0100
From: "joe_fletcher" <joe_fletcher at btconnect.com>

This sort of thing is quite common on V880s. Since it's
indicating there's a bank of DIMMs with errors you may be
looking at a replacement system board. Log it with SUN and
they will either try replacing just the DIMMS or they will
do the whole board.

****************** ANSWER 3
Date: Sat, 6 Nov 2004 17:48:42 +0100
From: Stephane Tsacas  <stephane.tsacas at gmail.com>

I could be wrong, but I think that:

- you have a hardware problem, probably memory related (either memory
itself or bus).

=> remove cards, push memory with thumbs, put cards back in, power on
and see what happens.

=> put some cpus offline. It's possible that only one CPU is causing
the problem. However, the machine itself should have removed it if it
detected a bad cpu, but who knows.
I'd start by disabling cpu3 and cpu0, see error message.

=> still crashing ? call Sun ASAP.

Good luck ;)
Stephane


*************** ANSWER 4
Date: Sat, 6 Nov 2004 09:44:30 -0800
From: Webpro <aielloster at gmail.com>

Looks similar to a problem I had with some bad memory.  I left a
console connected and sent the output to Sun who came out and replaced
a memory module.

Joe
-- 
"Despite the hight cost of living, it still remains popular!"

********************** ANSWER 5
Date: Sat, 6 Nov 2004 11:27:16 -0800
From: "Jon Hudson" <jon.hudson at finisar.com>

I would say a cpu/cache issue. While it  complains about memory

Fault_PC 0x1035ec4 Esynd 0x0152 Slot A: J7900 J7901 J8001 J8000

it's unlikely that so many parts would actually fail.

If you want to test it without opening a ticket with sun, pull the board with cpu0 on it and see if the problem returns. If so, then it could be something deeper, if not then it's safe to say it's cpu0 and/or cpu0 components. 

I would just open up a case with sun, they can debug that error dump a lot more carefully than any of us can.

******************* ANSWER 6
Date: Sat, 6 Nov 2004 13:10:47 -0800 (PST)
From: "sunsa_tx at yahoo.com"

You have to open a case with SUN and give them the
core dump file or the log you included in this email.
I looked at your log and it looks like SUN needs to
replace the DIMMs J7900 J7901 J8001 J8000 as they had
multiple-bit errors. They may need to replace the
system board too.

*************** ANSWER 7
Date: Sun, 07 Nov 2004 08:27:33 +1100
From: Tim Tuck <tim.tuck at penrith.net>

You have faulty memory  on the primary system board in locations

J7900 J7901 J8001 J8000


Tim


************ ANSWER 8
Date: Sun, 07 Nov 2004 08:58:34 -0200
From: "Ghassan Qanzu'a" <ghassan at sts.com.ps>

It seems that you have bad memory at J7900 J7901 J8001 J8000 J8100 J8101 J8201 J8200
on the CPU0 board (the first one counting from the bottom).
    To be sure of that, you can check it by removing this board, replacing it with the third board, and starting the system with just 4 procs and 8 GB. Observe the behavior, and if it does not crash
within two days then the diagnosis above is right.

Ghassan


***************** ANSWER 9
Date: Sun, 07 Nov 2004 20:44:14 +1100
From: Jeff Allison <jeff.allison at allygray.2y.net>
Don't know what the dump means but we have 2 V880s that have crashed 
due to dodgy memory (Samsung) if I remember correctly. Call them out 
and get it checked.

Jeff

**************** ANSWER 10
Date: Sun, 07 Nov 2004 21:51:36 -0500
From: Prasanth Mudundi <Prasanth_mudundi at comcast.net>

Looks like there are memory errors... but they seem to come from
different memory DIMMs.
I would start with the most common one, then move on to replacing the
entire bank; if that does not work, replace the system board.

Try running VTS for memory/CPU. It will not fail right away (it may run
for days without an issue), but when it crashes while VTS is running
you have your bad boy. Since you have warranty, let Sun do the analysis
for you.
prasanth

******************** ANSWER 11
From: "Michael Horton" <Michael.Horton at acntv.com>
Date: Mon, 8 Nov 2004 07:46:36 -0500

Since the V880 is still under warranty support, call Sun support for
help.
At first glance, you have cpu0 reporting memory errors in a specific
bank of memory slots.

****************** ANSWER 12
Date: Mon, 08 Nov 2004 09:46:19 -0500
From: Tim Chipman <chipman at ecopiabio.com>

If the machine is still under sun warranty/support, get them involved 
ASAP.  The type of failure you describe is consistent with "fairly 
serious hardware failure" although it isn't inconceivable it is a 
software issue.  Usually a trivial way to distinguish the two would be, 
boot from an installer CDRom and leave the machine thrashing (copying 
junk data back and forth between 2 slices in an infinite loop or 
something) for a few hours.  If it crashes thus, booted from a clean OS 
of the installer CD, it would support the "hardware failure" 
hypothesis.  However, I expect you already have enough hints in the logs 
below for sun support to have a strong candidate "smoking gun".

Tim
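
The "copy junk data back and forth" test from the answer above could
look roughly like the sketch below. It is only an illustration under
stated assumptions: thrash and sha256 are made-up names, the two
directories stand in for mount points on two different slices, and the
checksum comparison is an addition that the answer did not ask for (a
memory fault of the kind logged here would more likely show up as a
panic than as a silent mismatch).

```python
import hashlib
import os
import shutil
import tempfile


def sha256(path):
    """Checksum a file in 1 MB chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def thrash(dir_a, dir_b, size_bytes=1 << 20, passes=10):
    """Copy one file of random data back and forth between two directories.

    Returns the number of the first pass whose copy does not match the
    original checksum, or None if every pass matched.
    """
    src = os.path.join(dir_a, "junk.bin")
    dst = os.path.join(dir_b, "junk.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(size_bytes))
    expected = sha256(src)
    for n in range(1, passes + 1):
        shutil.copyfile(src, dst)
        if sha256(dst) != expected:
            return n
        src, dst = dst, src  # next pass copies in the other direction
    return None


if __name__ == "__main__":
    # Temporary directories are used here only so the sketch runs anywhere;
    # on a real test they would be paths on two different slices.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        bad = thrash(a, b, size_bytes=1 << 20, passes=5)
        print("all passes matched" if bad is None else f"mismatch on pass {bad}")
```

For the hours-long run the answer suggests, passes would be set very
high and the file size made large enough to keep the disks and memory
busy.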

************ ANSWER 13
Date: Mon, 8 Nov 2004 11:35:57 -0500
From: "Eric Paul" <epaul at profitlogic.com>

This is a hardware problem.  Contact Sun immediately for service.


******************** ANSWER 14
Date: Mon, 8 Nov 2004 10:07:58 -0800 (PST)
From: David Foster <foster@ncmir.ucsd.edu>
Reply-To: David Foster <foster at ncmir.ucsd.edu>

Install most recent recommended patch cluster from SunSolve,
in particular latest kernel updates. Install latest PROM patch

112186-15 (OBP 4.13.2)

(or later)

Download SUN VTS, install it and run it to check for hardware
errors. Look at /var/adm/messages* files and check for error
messages. If you have support download and install Sun Explorer
program and run it, then open hardware case with Sun tech support
and email them the output (a .tar.gz file)...they can check it
for config and hardware problems.

Dave Foster

****************** ANSWER 15
From: "Loukinas, Jeremy" <Jeremy.Loukinas at evenflo.com>
Date: Mon, 8 Nov 2004 13:27:11 -0500 

You probably just need to upgrade your Openboot version...

prtdiag -v | grep OBP
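
Once the OBP line has been grepped out, it still has to be compared
with the revision named in Answer 14 (OBP 4.13.2 from patch 112186-15).
A small sketch of that comparison follows. The function names and the
sample line are hypothetical, and the assumed "OBP <version> <date>"
layout of the prtdiag line should be checked against real output.

```python
import re

# Minimum revision taken from Answer 14: patch 112186-15 (OBP 4.13.2).
MIN_OBP = (4, 13, 2)


def obp_version(prtdiag_text):
    """Return the OBP version as a tuple of ints, or None if no OBP line is found."""
    m = re.search(r"\bOBP\s+(\d+(?:\.\d+)+)", prtdiag_text)
    if m is None:
        return None
    return tuple(int(part) for part in m.group(1).split("."))


def needs_update(prtdiag_text, minimum=MIN_OBP):
    """True if an OBP version was found and it is older than the minimum."""
    version = obp_version(prtdiag_text)
    return version is not None and version < minimum


if __name__ == "__main__":
    # Hypothetical sample line; real text would come from `prtdiag -v`.
    sample = "OBP 4.5.16 2002/06/10 16:22"
    print(obp_version(sample), "needs update:", needs_update(sample))
```

The versions are compared as integer tuples because a plain string
comparison would rank "4.5.16" above "4.13.2".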


----------------------------------- END OF SUMMARY --------------------------
Received on Thu Nov 11 06:49:02 2004