Summary: Hardware failure?

From: Meg Wall <meg991_at_yahoo.com>
Date: Mon Apr 07 2003 - 23:12:29 EDT
Hi managers,

Thanks to everyone who replied; there were too many of you to list
here! I got all kinds of suggestions, all of them very informative,
so I decided to call Sun. Sun's explanation was very similar to Colin
Bigam's email, so he's the winner! They did mention the "replace the
DIMM after 3 persistent memory errors in 24 hours" rule. Thanks!!

Meg

--- Colin Bigam <colin.bigam@west.gecems.com> wrote:
> Hi Meg;
> 
> We see errors like this quite often on various systems, especially
> on the newer Sun Fires.
> 
> Memory errors come in three types: intermittent, persistent, and
> sticky. When the computer detects a single-bit memory error, it
> refetches the data from memory. If the data is correct the second
> time, the error was intermittent (transient), occurring randomly in
> the path from memory to CPU. If it is still in error, the error is
> persistent, and the system recalculates the data and rewrites it to
> memory. Then it reads the data again; if the read-after-rewrite is
> still in error, the error is labelled sticky.
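
The re-read and rewrite sequence Colin describes can be sketched in a few lines of Python. This is only an illustration, not Solaris code: `classify_error`, `read_is_clean`, `rewrite_corrected` and `make_reader` are hypothetical names, and the hardware reads are faked with canned results.

```python
from typing import Callable, Iterable


def classify_error(read_is_clean: Callable[[], bool],
                   rewrite_corrected: Callable[[], None]) -> str:
    """Classify a single-bit error using the steps described above.

    read_is_clean: hypothetical callback, True if a fresh read has no error.
    rewrite_corrected: hypothetical callback that writes corrected data back.
    """
    # Step 1: refetch. A clean second read means the fault happened in
    # transit between memory and CPU, not in the stored data.
    if read_is_clean():
        return "intermittent"
    # Step 2: write the corrected data back, then read once more.
    rewrite_corrected()
    if read_is_clean():
        return "persistent"
    # Still wrong after a rewrite: the memory cell itself is suspect.
    return "sticky"


def make_reader(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Fake hardware: return the given read results one call at a time."""
    it = iter(outcomes)
    return lambda: next(it)


if __name__ == "__main__":
    for outcomes in ([True], [False, True], [False, False]):
        print(outcomes, classify_error(make_reader(outcomes), lambda: None))
```

The three sample runs cover one clean re-read, one clean read after the rewrite, and no clean read at all.
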
> 
> Sticky errors indicate bad memory, and the DIMM should be replaced.
> Sun's recommendation for persistent errors is to replace the DIMM if
> you get more than three persistent errors in 24 hours, or if there's
> a steady trend of increasing persistent errors. Intermittent errors
> are almost completely random (usually caused by cosmic rays!), and
> are not serious unless you start to get them steadily, in which case
> you probably have a bad system board.
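
The "more than three persistent errors in 24 hours" rule is easy to check mechanically once you have the timestamps of the persistent errors for one DIMM. A rough sketch follows; the function name `exceeds_persistent_limit` and its parameters are my own, and treating two events exactly 24 hours apart as falling in different windows is an assumption. It does not try to detect the "steady trend" case.

```python
from datetime import datetime, timedelta
from typing import Iterable


def exceeds_persistent_limit(timestamps: Iterable[datetime],
                             limit: int = 3,
                             window: timedelta = timedelta(hours=24)) -> bool:
    """Return True if more than `limit` events fall inside any `window`."""
    times = sorted(timestamps)
    start = 0
    for end, t in enumerate(times):
        # Slide the left edge forward until times[start] is inside the
        # window that ends at t (events exactly `window` apart are excluded).
        while t - times[start] >= window:
            start += 1
        # Number of events currently inside the window.
        if end - start + 1 > limit:
            return True
    return False


if __name__ == "__main__":
    base = datetime(2003, 4, 7, 12, 0)
    burst = [base + timedelta(hours=h) for h in (0, 2, 4, 6)]
    spread = [base + timedelta(days=d) for d in (0, 1, 2, 3)]
    print(exceeds_persistent_limit(burst))   # four errors within six hours
    print(exceeds_persistent_limit(spread))  # four errors, one per day
```
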
> 
> There are also numerous patches that apply to memory errors. Make
> sure you have a fairly recent patch cluster for your OS version on
> the box, and in particular the 'memory scrubber' patch (you can
> search for it on sunsolve.sun.com).
> 
> Hope this helps,
> Colin
> --
> | Colin Bigam, Senior Unix analyst
> 
> ----- Original Message -----
> From: Meg Wall <meg991@yahoo.com>
> Date: Monday, April 7, 2003 11:32 am
> Subject: Hardware failure?
> 
> > Hi managers,
> > 
> > I just got the following messages. Do I have a
> > hardware failure here? What do these mean? Thanks!!
> > I will work on my summaries today.
> > 
> > Meg
> > 
> > Apr  7 12:12:02 server32 pcipsy: [ID 854591 kern.info] NOTICE: correctable error detected by pci0 (upa mid 1f) during
> > Apr  7 12:12:02 server32       DVMA read transaction
> > Apr  7 12:12:02 server32 pcipsy: [ID 750218 kern.info]        AFSR=40230000.7f800000 AFAR=00000000.1c9d0a58,
> > Apr  7 12:12:02 server32       double word offset=3, Memory Module U0701 id 31.
> > Apr  7 12:12:02 server32 pcipsy: [ID 916270 kern.info] syndrome bits 23
> > Apr  7 12:12:02 server32 SUNW,UltraSPARC-II: [ID 354824 kern.info] [AFT0] errID 0x0003cc92.01cb251e Corrected Memory Error on U0701 is Intermittent
> > Apr  7 12:12:02 server32 SUNW,UltraSPARC-II: [ID 376402 kern.info] [AFT0] errID 0x0003cc92.01cb251e ECC Data Bit 33 was in error and corrected
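
If you want to keep an eye on how often each module shows up, the [AFT0] "Corrected Memory Error on ... is ..." line carries both the module name and the error type. Here is a rough Python sketch that tallies them. The regular expression is based only on the message wording quoted above, so other Solaris releases may need a different pattern. It also assumes each syslog entry is on a single line (not wrapped by a mail client). The name `count_corrected_errors` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "Corrected Memory Error on U0701 is Intermittent".
AFT_PATTERN = re.compile(r"Corrected Memory Error on (\S+) is (\w+)")


def count_corrected_errors(lines: Iterable[str]) -> Counter:
    """Count corrected-memory-error messages per (module, error type)."""
    counts: Counter = Counter()
    for line in lines:
        match = AFT_PATTERN.search(line)
        if match:
            module, kind = match.groups()
            counts[(module, kind)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Apr  7 12:12:02 server32 SUNW,UltraSPARC-II: [ID 354824 kern.info] "
        "[AFT0] errID 0x0003cc92.01cb251e Corrected Memory Error on U0701 "
        "is Intermittent",
        "Apr  7 12:12:02 server32 SUNW,UltraSPARC-II: [ID 376402 kern.info] "
        "[AFT0] errID 0x0003cc92.01cb251e ECC Data Bit 33 was in error and "
        "corrected",
    ]
    for (module, kind), n in count_corrected_errors(sample).items():
        print(module, kind, n)
```

Feeding it /var/adm/messages line by line would give a per-DIMM tally to compare against Sun's replacement guidance.
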
> > _______________________________________________
> > sunmanagers mailing list
> > sunmanagers@sunmanagers.org
> > http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Mon Apr 7 23:15:34 2003
