SUMMARY E4500 Crashing.

From: Cian O'Sullivan <cian_at_parthus.com>
Date: Mon Jul 02 2001 - 12:13:34 EDT
Lads,

Many thanks to those who responded.

Seth Rothenberg
Nick Hedley
Mike Kiernan
Shannon Ward

and, most importantly,

Joseph Herpers (joeh@stsolutions.com).

The original post is below.  It involved some very awkward crashing with
inconsistent memory errors.

It was obvious from the original posting that there was a hardware issue.
Some respondents assumed that the errors were automatically indicative of a
memory fault.  That is a very dangerous approach, as an error in a memory
write can be caused by processing, bus transfer, I/O management etc.  If the
request is already bad by the time it reaches the RAM, the RAM will choke on
it because it cannot make sense of what it is being asked to do, even though
the memory itself may be fine.
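
One rough way to check this against the syslog excerpts in the original
post below is to group the CE errors by ECC syndrome and see whether the
same syndrome turns up on more than one board.  The sketch below is only
an illustration: the function names (boards_by_syndrome,
suspicious_syndromes) are made up for this example, and the regular
expressions are an assumption that fits the lines seen in this post, not
a general parser for Solaris messages.  A syndrome shared across boards
is a hint to look beyond a single SIMM, not a diagnosis.

```python
import re
from collections import defaultdict

# Sample lines copied from the syslog excerpts in this post, with the
# mail client's line wrapping undone.
SAMPLE = """\
Jun 27 02:49:30 dublin232 unix: CPU8 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000000 638ed060, SIMM Board 2 J3200
Jun 27 02:49:30 dublin232 unix:  Syndrome 0x2c, Size 3, Offset 0 UPA MID 8
Jun 27 02:50:01 dublin232 unix: CPU12 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000001 92ab30a0, SIMM Board 4 J3200
Jun 27 02:50:01 dublin232 unix:  Syndrome 0x2c, Size 3, Offset 0 UPA MID 12
""".splitlines()

# Assumed patterns, based only on the message layout shown in this post.
CE_RE = re.compile(r"CPU(\d+) CE Error:.*SIMM Board (\d+) (J\d+)")
SYN_RE = re.compile(r"Syndrome (0x[0-9a-fA-F]+)")


def boards_by_syndrome(lines):
    """Map each ECC syndrome to the set of (board, SIMM) locations reporting it."""
    result = defaultdict(set)
    pending = None  # location taken from the most recent "CE Error" line
    for line in lines:
        m = CE_RE.search(line)
        if m:
            pending = (int(m.group(2)), m.group(3))
            continue
        m = SYN_RE.search(line)
        if m and pending is not None:
            # The Syndrome line follows the CE Error line it belongs to.
            result[m.group(1).lower()].add(pending)
            pending = None
    return dict(result)


def suspicious_syndromes(lines):
    """Return only the syndromes that were reported by more than one board."""
    return {
        syndrome: locations
        for syndrome, locations in boards_by_syndrome(lines).items()
        if len({board for board, _ in locations}) > 1
    }


if __name__ == "__main__":
    for syndrome, locations in suspicious_syndromes(SAMPLE).items():
        print(syndrome, sorted(locations))
```

Run against the sample lines, this reports syndrome 0x2c on both Board 2
and Board 4, which is the sort of pattern that should make you doubt a
single bad SIMM.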

Joe pointed out a tool that can be used to look up the meaning of an error.
The software is called the ON-Line Detective for Sun; you can see info on it
at www.sundetective.com.  One of the results of searching for my particular
error indicated that a failure from a DMA write request stems from a defect
in the Enterprise Server Board.  Sun engineers confirmed this (after coming
out for the second time) and have replaced the board, and we are now off to
the races.  Many thanks for this list, and to those who responded.

Cian O'Sullivan


-----Original Message-----
From: sunmanagers-admin@sunmanagers.org
[mailto:sunmanagers-admin@sunmanagers.org]On Behalf Of Cian O'Sullivan
Sent: Monday, July 02, 2001 9:54 AM
To: sunmanagers@sunmanagers.org
Subject: e4500 Crashing.


Lads,

I have an e4500 that is crashing without explanation.  Brief outline:

The symptoms are that you boot it into extended diagnostics and it gives
wildly differing SIMM errors every time; sometimes it boots to the OS,
sometimes (as now) it doesn't even boot to the OBP.

If you boot it off a single cpu/mem board at a time, it comes up fine;
as soon as you start adding boards in, it goes wonky again. A quick poll
of the board temps on the other adjacent e4500s shows that the cpu/mem
boards are within the operating env limits (just ... i.e. below 40
degrees).

Sun engineers came in and have given all cpu/mem and i/o
boards a full health check, run extended diagnostics and made some OS /
operating environment recommendations, which have now been implemented.
The system has been stress tested overnight and appeared stable. However,
it crashed again.

Here are some segments from the syslog.  Any comments would be most
appreciated, as we are now at our wits' end.




Piece 1.


Jun 27 02:49:26 dublin232 unix: CE Error queue wrapped
Jun 27 02:49:26 dublin232 last message repeated 1 time
Jun 27 02:49:29 dublin232 unix: Multiple Softerrors:
Jun 27 02:49:29 dublin232 unix: Seen 4 Intermittent and 2 Corrected Softerrors
Jun 27 02:49:29 dublin232 unix: from SIMM Board 2 J3200
Jun 27 02:49:30 dublin232 unix:  Enabling verbose CE messages.
Jun 27 02:49:30 dublin232 unix: Softerror: Intermittent ECC Memory Error SIMM Board 2 J3200
Jun 27 02:49:30 dublin232 unix:  ECC Data Bit 45 was corrected
Jun 27 02:49:30 dublin232 unix: CPU8 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000000 638ed060, SIMM Board 2 J3200

Piece 2

Jun 27 02:49:30 dublin232 unix:  Syndrome 0x2c, Size 3, Offset 0 UPA MID 8
Jun 27 02:50:01 dublin232 unix: CPU12 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000001 92ab30a0, SIMM Board 4 J3200
Jun 27 02:50:01 dublin232 unix:  Syndrome 0x2c, Size 3, Offset 0 UPA MID 12
Jun 27 02:50:01 dublin232 unix: Softerror: Intermittent ECC Memory Error SIMM Board 4 J3200
Jun 27 02:50:02 dublin232 unix:  ECC Data Bit 45 was corrected
Received on Mon Jul 2 17:13:34 2001
