Re: V880 panic'ing - urgent help, pls [FollowUP or maybe SUMMARY]

From: Grzegorz Bakalarski <G.Bakalarski_at_icm.edu.pl>
Date: Mon Jun 19 2006 - 08:08:10 EDT
Hi ALL,

Last Friday I posted the query attached below.
I got many out-of-office responses (are you all watching the
MUNDIAL, the Football Championship in Germany???) and a few
very helpful pieces of advice. I followed the advice from stoyan.genov,
and Hicheal Morton also gave a nice ruleset to follow.
Thanks to all.
Here are my findings and the current status of the case.

> * How to set up OBP in order to display diagnostic messages
>   to rsc-console?

Generally one should issue this command in OBP:

ok> diag-console rsc

[note: it is not an OBP variable, just a command]
and this should send diagnostic messages to the RSC ...
But if one sets the system key to the diagnostic position,
all diagnostic output is sent to ttya (i.e. the serial
console). All I can see is:
OBP Alert: Diagnostic/system console is directed to ttya/screen.
This is also noted in the RSC release notes ...
So possibly there is no way to do this.

> * Is my set up (diag-level max AND diag-switch? true) really the maximum
>   level of diagnostics?

Possibly yes. One can also set verbosity to max:

ok> setenv verbosity max

if you like to see a lot of text ...

> * Is it safe to just remove the system board in Slot B from the machine
>   (I can still live with 4x900MHz & 8Gig RAM) for the weekend?

Yes, it is safe; however, CPU/memory slots have to be populated
from bottom to top, so I had to move the CPU/memory board from slot
C to slot B.
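
The bottom-to-top rule can be written down as a tiny check. This is only a sketch in Python: the slot order A, B, C is an assumption taken from the move described above (C had to go down to B), and the names SLOT_ORDER and population_ok are made up for illustration.

```python
# Assumed bottom-to-top order of the CPU/memory board slots.
SLOT_ORDER = ["A", "B", "C"]

def population_ok(populated):
    """Return True if the populated slots form a gap-free run from the bottom slot."""
    slots = set(populated)
    # With n boards installed, they must occupy exactly the first n slots.
    return slots == set(SLOT_ORDER[:len(slots)])

print(population_ok(["A", "C"]))  # board B pulled, board C left in place
print(population_ok(["A", "B"]))  # board C moved down to slot B
```

The first call describes the layout that is not allowed, the second the layout I ended up with.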

============= current status ======================

First I removed the CPU/memory board from slot B,
moved the board from slot C down into its place, and rebooted.
The machine came up without problems. I ran the diagnostic tests
from OBP and some stress tests. All was OK, so
I left it running overnight. On Saturday I played with the
faulty CPU/memory board (I had some experience: we had a
similar memory error 1.5 years ago, and I watched what the
Sun engineer was doing and also talked with him. I also
knew where to look for more system/technical info).
I ran the diagnostic tests (OBP) with the faulty CPU/memory board
in slot C and the error appeared there.
I also got clear evidence:

4:0>ERROR: TEST = Block Memory
4:0>H/W under test = CPU4 Bank 3 Dimm 3, J3200 side 2
4:0>Repair Instructions: Replace items in order listed by 'H/W under test' above.
4:0>MSG = DIMM failure Bank 3 DIMM 3 Pin 13
4:0>END_ERROR

Note: CPU4 is in slot C.
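
If you capture the POST output to a file, the failing part can be pulled out automatically. Below is a minimal Python sketch; it assumes the 'H/W under test' line always looks like the one above (I have only this one sample), and the function name and pattern are my own illustration, not part of any Sun tool.

```python
import re

# Pattern for the 'H/W under test' line, modelled on the single sample above.
HW_LINE = re.compile(
    r"H/W under test = CPU(?P<cpu>\d+) Bank (?P<bank>\d+) "
    r"Dimm (?P<dimm>\d+), (?P<socket>J\d+)"
)

def failing_parts(post_log):
    """Return one dict per 'H/W under test' line found in the POST log text."""
    parts = []
    for line in post_log.splitlines():
        match = HW_LINE.search(line)
        if match:
            parts.append({
                "cpu": int(match.group("cpu")),
                "bank": int(match.group("bank")),
                "dimm": int(match.group("dimm")),
                "socket": match.group("socket"),
            })
    return parts

sample = """\
4:0>H/W under test = CPU4 Bank 3 Dimm 3, J3200 side 2
4:0>MSG = DIMM failure Bank 3 DIMM 3 Pin 13
"""
print(failing_parts(sample))
```

Lines that do not match (such as the MSG line) are simply skipped.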

So I removed all 4 DIMMs from this bank (optional group A1: J3100, J3101, J3200, J3201)
and moved into their place the 4 DIMMs from the higher bank (optional group B1:
J8100, J8101, J8200, J8201). Then I rebooted and ran the diagnostic
tests from OBP. No serious error appeared, so I booted Solaris.
I ran some stress tests and also the SunVTS processor & memory tests.
All seemed OK. So I now have 11GB of memory and 6 CPUs, running
for 48h without errors.
(I do have a possibly persistent memory error on the J2900 DIMM in slot A, but it is a
single-bit error and is ECC correctable) ... Possibly until the next period of high
sunspot activity ...

Again, many thanks for your great help!

Grzegorz

P.S. Attached full query and full responses:

************** Q U E R Y ****************************
On Fri, Jun 16, 2006 at 03:27:43PM +0200, Grzegorz Bakalarski wrote:
> Dear All,
> 
> My V880 6x900MHz 12Gig server suddenly started
> to reboot itself after a few to about 60 minutes.
> It seems to be a memory error; I sometimes see an error
> message before the hang (sometimes
> it reboots with this error message & sometimes it just
> hangs) - SEE LOG AT THE END OF E-MAIL.
> I'm trying to learn more:
> I set OBP:
>  diag-level              max
>  diag-switch?            true
> 
> But it runs only medium diagnostics (I had memory issues on this
> server more than a year ago and I remember the Sun engineer ran more tests).
> I tried to set the system KEY (at the front of the machine) to the diag position,
> but then I can't see any messages besides:
> 
> OBP Alert: Host System is initializing in Service Mode.
> OBP Alert: Diagnostic/system console is directed to ttya/screen.
> 
> I now use rsc-console (when I first had memory issues I used
> only the serial port, which is not connected currently because
> "everything can be done from the rsc console").
> 
> HERE are my QUERIES:
> 
> * How to set up OBP in order to display diagnostic messages
>   to rsc-console?
> * Is my set up (diag-level max AND diag-switch? true) really the maximum
>   level of diagnostics?
> * Is it safe to just remove the system board in Slot B from the machine
>   (I can still live with 4x900MHz & 8Gig RAM) for the weekend?
> 
> IMPORTANT: Machine is NOT on maintenance!
> 
> TIA for any fast response!
> 
> GB
> 
> PS1: OBP level 4.18.2 (patched at the end of 2005)
> 
> PS2: LOGS FROM CONSOLE FOLLOW:
> ==================================
> 
> Jun 16 14:51:22 server1 SUNW,UltraSPARC-III+: WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU6 in Privileged mode at TL>0, errID 0x00000067.2d774d60
> Jun 16 14:51:22 server1     AFSR 0x00200004<ME,UE>.000001b5 AFAR 0x000000b0.e5b42b80
> Jun 16 14:51:22 server1     Fault_PC 0x1180e4c Esynd 0x01b5 Slot B: J3100 J3101 J3201 J3200
> Jun 16 14:51:22 server1 SUNW,UltraSPARC-III+: [AFT1] errID 0x00000067.2d774d60 Two Bits were in error
> Jun 16 14:51:22 server1 unix: NOTICE: Scheduling clearing of error on page 0x000000b0.e5b42000
> [AFT0] errID 0x00000067.32a8ab58 Corrected Memory Error on Slot B: J3201 is Intermittent
> [AFT0] errID 0x00000067.32a8ab58 Data Bit 118 was in error and corrected
> [AFT2] errID 0x00000067.32a8ab58 PA=0x000000b0.e5b42080
>     E$tag 0x000002c3.96000124 E$state_2 Modified
> [AFT2] E$Data (0x00) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
> [AFT2] E$Data (0x10) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
> [AFT2] E$Data (0x20) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
> [AFT2] E$Data (0x30) 0xbcfcbcbc.bdbdbdbd 0xbcbcbcbc.bcbcbcbc ECC 0x0cd
> [AFT2] D$ data not available
> [AFT2] I$ data not available
> WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU1 Privileged Data Access at TL=0, errID 0x00000067.32af9ddc
>     AFSR 0x00100004<PRIV,UE>.0000009d AFAR 0x000000b0.ea305f80
>     Fault_PC 0x10cd7e4 Esynd 0x009d Slot B: J3100 J3101 J3201 J3200
> [AFT1] errID 0x00000067.32af9ddc Three Bits were in error
> 
> panic[cpu1]/thread=2a1000cbd40: BAD TRAP: type=34 rp=1437f00 addr=d mmu_fsr=0
> syncing file systems... 19 9 done
> dumping to /dev/dsk/c1t0d0s1, offset 644022272, content: kernel
> 
> ERROR: CPU2 RED State Exception
> 
> 
> System State (CPU2 reporting)
> 
> [...]
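
A quick way to see which DIMM sockets the AFT messages keep naming is to tally them. The Python sketch below assumes the "Slot X: Jnnnn ..." wording seen in the console log above; the function name and regular expression are illustrative only.

```python
import re
from collections import Counter

# "Slot B: J3100 J3101 J3201 J3200" or "Slot B: J3201 is Intermittent"
SLOT_RE = re.compile(r"Slot ([A-Z]): ((?:J\d+[ \t]*)+)")

def tally_dimms(log_text):
    """Count how often each (slot, socket) pair is named in the log text."""
    counts = Counter()
    for slot, sockets in SLOT_RE.findall(log_text):
        for socket in re.findall(r"J\d+", sockets):
            counts[(slot, socket)] += 1
    return counts

sample = """\
Fault_PC 0x1180e4c Esynd 0x01b5 Slot B: J3100 J3101 J3201 J3200
[AFT0] errID 0x00000067.32a8ab58 Corrected Memory Error on Slot B: J3201 is Intermittent
Fault_PC 0x10cd7e4 Esynd 0x009d Slot B: J3100 J3101 J3201 J3200
"""
for (slot, socket), count in tally_dimms(sample).most_common():
    print(slot, socket, count)
```

Such a tally is only a hint. The uncorrectable errors name the whole group of four DIMMs, and in this case the OBP diagnostics described in the summary above pointed at J3200, not at the socket named most often in the log.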



*************************** R E S P O N S E S ***************************
1.
Purely a guess, but perhaps you have two CPU's failing simultaneously.
:-(
Pull the boards for CPU 1 and CPU 6 and leave just the one board for
CPU's 3 & 4.
Boot with that board only, and observe the results.
Dave H

===========
2.
what is diag-out-console set to in OBP?
Aaron Lineberger

================================
3.
The last time we crashed with a "red state exception" Sun immediately
replaced the CPU. To do this they replace the board, transferring the
memory to the new board. Do you have a service contract?
     Deborah Crocker, Ph.D. 

=================================
4.
Most likely problem is bad memory. Less likely problem is CPU
not fitting well on CPU/mem board. If you are going to stop
and open the machine anyway, best would be to remove the faulty
DIMMs. Can't tell off the top of my head whether J3100 J3101 J3201 J3200
are all in the same group and if this group is a required one, but you
can take out the four DIMMs (J3100 J3101 J3201 J3200) and if any of them
is from a required group, take out a DIMM from an optional group and put
it into the required group DIMM slot. In the end, you have to have 4
DIMMs less memory, and the empty DIMM slots should all be one of the
optional memory groups on the CPU/mem board.
If problem disappears, then it was memory; if not -- it's the CPU board.
You can't remove board B and leave board C in place -- if you remove
board B, board C must be moved to board B's place.
On the OK (OBP) prompt, setting:
setenv diag-level max
setenv diag-switch? true
and then power-cycling the machine should give you full hardware testing
(and will eat hours of time).
Hope this helps.
--sdg stoyan.genov

======================
5.
if you have a service contract, call your service provider.
if you can do some troubleshooting, try this.
1. remove the offending systemboard and reboot.
   (this confirms the error is in the removed systemboard)
2. swap the memory with the systemboard in #1.
  (this confirms the status of the memory modules)
  if errors return, the memory modules are suspect.
  if no errors, the systemboard is suspect.
3. try the offending systemboard with the known good memory
  if in the same systemboard slot returns errors try the
  systemboard in another slot.
  (if the error returns and stays with the slot, the backplane is suspect.)
  (if the error returns and stays with the systemboard,
  the systemboard is suspect.)
Hicheal Morton
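
One possible reading of the swap procedure above, written as a small Python decision function. This is only a sketch: the function and argument names are hypothetical, and the mapping of outcomes to suspects follows my interpretation of steps 2 and 3.

```python
def suspect(memory_errors_elsewhere,
            board_errors_same_slot=None,
            board_errors_other_slot=None):
    """Name the suspect part from the outcomes of the swap tests.

    memory_errors_elsewhere: step 2, errors return after the memory
        is moved to the other systemboard.
    board_errors_same_slot: step 3, the offending board with known good
        memory still fails in its original slot (None = not tested).
    board_errors_other_slot: step 3, the same board also fails in a
        different slot (None = not tested).
    """
    if memory_errors_elsewhere:
        return "memory modules"
    if not board_errors_same_slot:
        # Step 3 not run or error not reproduced: step 2's verdict stands.
        return "systemboard"
    if board_errors_other_slot is None:
        return "systemboard or backplane"
    # The error either moved with the board or stayed with the slot.
    return "systemboard" if board_errors_other_slot else "backplane"

print(suspect(True))
print(suspect(False, True, False))
```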

======================
6.
We had exactly the same problem over the last two days, and what the SUN tech did was to change the I/O board!
We lost the first OS disk in that case and had many system files to restore, and the second disk ....
It can be due to a memory module error (MMI miss), or to CPU or motherboard trouble.
Call your SUN tech soon ...
Good luck
Cordialement/Regards
Yann
 yann geneste

=======================
7.
Okay, yes it is a memory error, the bad dimm is Slot B: J3201
Second:  Diags should be going out via the RSC serial connection. When you
log in, you just have to type console and it should display all console
output (BTW the logs the system mentions are available by doing a
showlogs via the RSC)
Third:  No that isn't the max but anything above that will give you a bunch
of worthless testing information and it will cause your system to come up
more slowly; in some cases that I've seen, two hours
Fourth:  Yeah you can remove the system board (bring the system down
first).  You'll lose processing power and memory but you said you can deal
with it so it's not an issue.
musa.williams

============================
8.
that looks like an e-cache error.  I thought they had fixed that by now.  I
had the same problem an hour ago.  they're sometimes caused by sunspots, but
we're not getting any at the moment AFAIK.
You should be OK to pull the CPU module and run like that.
some-one (007)

============================
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Mon Jun 19 09:19:10 2006
