SUMMARY: V880 memory error, how to react

From: Stoyan Genov <stoyan.genov_at_sun-fish.com>
Date: Sun May 30 2004 - 20:43:00 EDT
Hi again,

Thanks a lot to Joe Fletcher, Peter Ondruska, Prasanth Mudundi,
Mike Ekholm and Eugene Schmidt for their support and advice.

In short, the machine was shut down and the faulty DIMM was
removed along with its three "siblings" from the memory group,
saving the day at the price of 1GB less for Oracle and 10
minutes of downtime.

In long form, I comment in detail on my own post below:

---- Stoyan Genov on 2004-05-30 17:17:04:34 CEST (Sunday): ----
> Hi,
> 
> I have one Sun Fire V880, 4 CPU/mem boards, 8 CPUs x 900MHz,
> 16 GB RAM (4 boards x 16 mem slots x 256MB DIMMs)
> 
> The system runs an Oracle database server under a decent load
> (uses all memory).
> 
> For a couple of days I have been seeing this in /var/adm/messages,
> reported approximately twice per minute:
> 
> May 30 16:38:56 v880server SUNW,UltraSPARC-III: [ID 354446 kern.info] [AFT0] errID 0x001a85ab.36a898c4 Corrected Memory Error on Slot D: J8201 is Sticky
> May 30 16:38:56 v880server SUNW,UltraSPARC-III: [ID 220268 kern.info] [AFT0] errID 0x001a85ab.36a898c4 Data Bit 122 was in error and corrected
> May 30 16:38:56 v880server unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module Slot D: J8201
> 
> (each report is actually longer, but I think the part above
> shows the problem)
> 
> As far as I understand (and from what I found through Google),
> this is a soft (ECC-correctable) memory error in memory bank J8201 in
> slot D.
> 
> I have the following questions:
> 
> Part 1: Diagnosis.
> 1. Am I right? Is this really an ECC-correctable memory error?

Yes. All respondents agreed that this is an ECC-correctable (soft)
memory error.
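
For anyone who wants to keep an eye on how often such reports recur
while waiting for a replacement, here is a minimal Python sketch that
counts "Corrected Memory Error" lines per memory module. The regular
expression is an assumption derived only from the three sample lines
quoted above; other Solaris releases or platforms may word the
messages differently. The name count_corrected_errors is made up for
this illustration.

```python
import re
from collections import Counter

# Pattern assumed from the sample [AFT0] lines in this post only;
# it is not taken from any official description of the log format.
CE_PATTERN = re.compile(
    r"errID (?P<errid>0x[0-9a-f]+\.[0-9a-f]+) "
    r"Corrected Memory Error on (?P<module>Slot \w+: J\d+)"
)


def count_corrected_errors(lines):
    """Return a Counter mapping 'Slot X: Jnnnn' to the number of reports."""
    counts = Counter()
    for line in lines:
        match = CE_PATTERN.search(line)
        if match:
            counts[match.group("module")] += 1
    return counts


# The three lines quoted in the original post form one report.
SAMPLE = [
    "May 30 16:38:56 v880server SUNW,UltraSPARC-III: [ID 354446 kern.info] "
    "[AFT0] errID 0x001a85ab.36a898c4 Corrected Memory Error on "
    "Slot D: J8201 is Sticky",
    "May 30 16:38:56 v880server SUNW,UltraSPARC-III: [ID 220268 kern.info] "
    "[AFT0] errID 0x001a85ab.36a898c4 Data Bit 122 was in error and corrected",
    "May 30 16:38:56 v880server unix: [ID 752700 kern.warning] WARNING: "
    "[AFT0] Sticky Softerror encountered on Memory Module Slot D: J8201",
]

sample_counts = count_corrected_errors(SAMPLE)
```

To run it against the real log, pass an open file instead of SAMPLE,
for example count_corrected_errors(open("/var/adm/messages")). Only
the first line of each report is counted, so one report adds one to
the module's total.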

> 2. Is slot D the last (topmost) CPU/memory slot?

Yes. It is in the documentation, at
http://sunsolve.sun.com/handbook_pub/Systems/SunFire880/component.right_open.html

> 3. If I have someone (I'm thousands of miles away from the machine)
>    open the machine, take off and open the memory board,
>    will he find J8201 written somewhere (so he can spot exactly
>    the faulty memory chip)?

Yes. It is in the documentation, at
http://sunsolve.sun.com/handbook_pub/Devices/System_Board/SYSBD_SunFire_CPU.html#7028
J8201 is the third DIMM of the fourth group on the memory board.

> 
> Part 2: How to react?
> 1. Is it correct that it's possible to remove the group of four memory DIMMs
>     in which the offending chip is, and plug back the CPU/memory board
>     so the CPUs and the rest of memory are used again?

Yes. We removed the faulty DIMM along with its three "siblings" from
the same memory group. That group was the fourth, optional one. Had it
been one of the "always present" groups, we would have had to swap
DIMMs so that the "always present" groups stayed full.

> 2. Is it possible that I switch off somehow usage of this group of DIMMs
>     or the entire CPU/memory board from the openboot environment,
>     so that no physical intervention is required until we get the replacement
>     DIMMs?

Uncertain. Because the V880 doesn't support dynamic reconfiguration for the
CPU/memory board, it is impossible to switch the board off while the OS
is running. I also couldn't find a way to take it offline ("offline" meaning
"physically present on the main board, but not used by the system") from
the OpenBoot environment.

> 3. What would you do in this situation, considering that downtime is possible,
>     but highly undesirable?

The overall opinion was "replace the faulty chip ASAP, but don't stop the
machine right now because the error is still correctable".

We don't have an active support contract for this machine, and we didn't
have any spare 256MB DIMMs. There was also no way to "offline" the CPU
board without stopping the OS. We didn't want to wait until uncorrectable
errors occurred, so we decided to go for removal, which of course means
another downtime when we get the new DIMMs.
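
For the record, the memory cost of the removal can be checked with a
few lines of Python. This is only the arithmetic, using the
configuration quoted in the original post (4 boards x 16 slots x 256MB
DIMMs) and the group size of four DIMMs; the function names are made
up for this sketch.

```python
DIMM_MB = 256          # size of each DIMM, from the original post
SLOTS_PER_BOARD = 16   # memory slots per CPU/memory board
BOARDS = 4             # CPU/memory boards in this machine
DIMMS_PER_GROUP = 4    # the faulty DIMM plus its three "siblings"


def total_memory_mb():
    """Memory with every slot on every board populated."""
    return BOARDS * SLOTS_PER_BOARD * DIMM_MB


def memory_after_removing_groups(groups_removed):
    """Memory left after pulling whole groups of DIMMs."""
    return total_memory_mb() - groups_removed * DIMMS_PER_GROUP * DIMM_MB


print(total_memory_mb())                 # 16384 MB, i.e. 16 GB
print(memory_after_removing_groups(1))   # 15360 MB, i.e. 1 GB less
```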

Thank you once again.

Regards,
Stoyan Genov
Received on Sun May 30 20:42:52 2004