SUMMARY: E4500 mystery crash

Date: Mon Apr 12 2004 - 09:14:28 EDT
Thanks to those who responded, especially those out of the office.  All who
sent real replies said it's a CPU and pointed out [...]  As we didn't have
much information, we couldn't pin down any one part.  We ended up moving the
applications to another system (I love fibre channel).  Once the old system
was freed, Sun came in and checked every component. They found things like a
missing SCSI terminator on the first I/O board, really old SAMBRA modules,
DIMMs that were suspect, and really old CPU/memory boards.  Every board had
all of its cards reseated or retorqued, suspect and old components have been
replaced, and the firmware is at current levels.  We ran VTS (yeah I know,
nice exerciser, so-so diagnostic tool) all weekend with no errors. At this
point we'll probably re-deploy the system for something else.


Original message:
Several times in the past 6 or so weeks one of our E4500s has either hung,
requiring a power off/on, or suddenly rebooted. The only indication we got is
the following console messages:

TL=0000.0000.0000.0005 TT=0000.0000.0000.0068
TL=0000.0000.0000.0004 TT=0000.0000.0000.0034
TL=0000.0000.0000.0003 TT=0000.0000.0000.0068
TL=0000.0000.0000.0002 TT=0000.0000.0000.0030
TL=0000.0000.0000.0001 TT=0000.0000.0000.0068

Software Power ON
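
For anyone who wants to compare dumps like this across several crashes, here
is a minimal Python sketch that turns the TL/TT lines into integer pairs. It
assumes each value is a 64-bit number printed as four dot-separated groups of
four hex digits, which is my reading of the output and not something Sun
confirmed. The names parse_trap_dump and CONSOLE are made up for this example.
The script does not interpret the trap types; what each TT code means has to
be looked up in the processor documentation.

```python
import re
from collections import Counter

# Assumption: each field is a 64-bit value shown as four dot-separated
# groups of four hex digits, e.g. 0000.0000.0000.0068 == 0x68.
LINE_RE = re.compile(r"TL=([0-9A-Fa-f.]+)\s+TT=([0-9A-Fa-f.]+)")


def parse_trap_dump(text):
    """Return a list of (trap_level, trap_type) integer pairs, in input order."""
    pairs = []
    for match in LINE_RE.finditer(text):
        # Drop the dots, then read what is left as one hexadecimal number.
        tl, tt = (int(field.replace(".", ""), 16) for field in match.groups())
        pairs.append((tl, tt))
    return pairs


# The console output quoted in the original message.
CONSOLE = """\
TL=0000.0000.0000.0005 TT=0000.0000.0000.0068
TL=0000.0000.0000.0004 TT=0000.0000.0000.0034
TL=0000.0000.0000.0003 TT=0000.0000.0000.0068
TL=0000.0000.0000.0002 TT=0000.0000.0000.0030
TL=0000.0000.0000.0001 TT=0000.0000.0000.0068

Software Power ON
"""

pairs = parse_trap_dump(CONSOLE)

# Count how often each trap type appears, to make repeat offenders obvious.
tt_counts = Counter(tt for _, tt in pairs)

for tl, tt in pairs:
    print(f"TL={tl} TT=0x{tt:03x}")
for tt, count in tt_counts.most_common():
    print(f"TT=0x{tt:03x} seen {count} time(s)")
```

Running the same parser over the console log from each incident would show
whether the sequence of trap types is the same every time or varies.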

The OS is 2.6, with Sun Cluster 2.2 and Resource Manager. The other node in the
cluster is running just fine so I'm inclined to say it is a hardware issue
rather than software, but I'm not completely ruling that out either. Our
suspicion is a CPU but without any further information Sun won't venture a
real guess. As this is a mission critical (actual dollars earned) 24x7 system
we cannot afford to have an extensive outage for hardware testing for
1+ weeks. We are implementing a contingency in case the beast dies again,
but I'd prefer to fix the current system. Has anyone seen similar symptoms?
If so, was there a viable solution?
