[SUMMARY #2] UltraSparcII Ecache parity errors ["CBI event on CPU1" / "*Bad* PSYND=0x0004"]

From: David Foster <foster_at_dim.ucsd.edu>
Date: Fri Nov 22 2002 - 16:56:37 EST
I received some helpful followups from my original summary
regarding ecache parity errors on UltraII cpu's. Most notably:

1. The problem was actually caused by faulty SRAMs made by IBM.
   Two vendors supplied the SRAM for this L2 cache, so it just
   depends on which one you got (you can't tell by looking at them).

2. This problem actually wasn't kept that quiet; it made front-page
   news in EE Times and Electronic Business, IIRC.
   
3. Given the manufacturing capacity of the two vendors of SRAM,
   it would have been impossible for Sun to do a complete recall.
   (Consensus seems to be that their response was still inadequate.)

4. According to a Sun service engineer, best practice is to replace
   a cpu after two failures in 6 months.
   
5. Someone noted that Sun recommends the following /etc/system
   settings to "reduce the ecache bug's effect". I have not verified
   this with Sun, and I have not tried these settings; they increase 
   the scrubbing rate of the ecache:
   
   *eCache Scrubbing
   set ecache_scrub_enable = 1
   set ecache_scan_rate=1000
   set ecache_calls_a_sec=100
   *End eCache Settings

6. It is important to keep up to date on the kernel patch, as changes
   have been made to mitigate this problem and reduce false-positive
   reports in the logs.
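
For anyone who wants to check whether a machine already carries the
/etc/system lines quoted in item 5, here is a minimal Python sketch. It
assumes only what is shown above: "set name = value" lines and comment
lines starting with "*". The function names are hypothetical, and the
values are the ones quoted in this post, not settings verified with Sun.

```python
import re

# Tunables named in item 5 of this post; nothing else is assumed to exist.
ECACHE_TUNABLES = ("ecache_scrub_enable", "ecache_scan_rate", "ecache_calls_a_sec")

# Matches "set name = value" and "set module:name = value".
SET_LINE = re.compile(r"^\s*set\s+(?:(\w+):)?(\w+)\s*=\s*(\S+)")


def read_ecache_settings(text):
    """Return {tunable: value string} for the ecache tunables found in the text."""
    found = {}
    for line in text.splitlines():
        # A line starting with '*' is a comment in /etc/system.
        if line.lstrip().startswith("*"):
            continue
        m = SET_LINE.match(line)
        if m and m.group(2) in ECACHE_TUNABLES:
            found[m.group(2)] = m.group(3)
    return found


def missing_settings(text):
    """Return the tunables from item 5 that the text does not set."""
    found = read_ecache_settings(text)
    return [name for name in ECACHE_TUNABLES if name not in found]


if __name__ == "__main__":
    with open("/etc/system") as fh:
        contents = fh.read()
    print("set:", read_ecache_settings(contents))
    print("missing:", missing_settings(contents))
```

The script only reads the file; it does not change anything, and a
reboot is still needed for any /etc/system edit to take effect.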
   
You can view Sun's "Best Practices" document on the ecache parity
problem at:

	ftp://ncmir.ucsd.edu/outgoing/foster/BP_Ecache_10-16-01.pdf

I've attached a reply from a Sun service engineer regarding the
"CBI event", which is way more than I wanted to know about this!

Thanks to:

Jed Dobson
Jay Lessert
Donaldson, Mark
Scott Howard

> My apologies, the Manager's List archives were down so I couldn't
> tell that there are many posts about this.
>
> This is an Ecache parity error on the CPU, a known problem with
> the UltraII cpu's. Can happen when the cpu is under heavy load,
> extremely intermittently, but if it happens multiple times then
> Sun will replace the cpu under contract support. Just heard from a
> Sun engineer that "best practices" is to wait for 3 occurrences.
> It's happened once; they recommended upgrading to the latest kernel
> (108528-17 for Solaris 8) and see if it presents itself again.
> Apparently rev -16 included some fixes to prevent spurious cpu
> errors.
>
> Apparently this usually hits cpu's with 8 meg cache, but sometimes
> 4 meg as well.
>
> Rant (source anonymous)
>
>    It never ceases to amaze me how well SUN kept the UltraII design
>    problems quiet. In effect virtually a whole year's
>    production of chips was broken. A shortcut in the design
>    (using parity instead of ECC on the cache) meant that
>    thousands of these things had to be replaced. Never
>    quite made the news though and how loud did they
>    shout about the first Pentium being unable to add up.
>
> Thanks to:
>
> steven.ruby
> Ryan Bishop
> Will Enestvedt
> rene_casalme
> Tim Chipman
> joe.fletcher
>
> >
> > Can anyone help with this, it doesn't look good...
> >
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 672871 kern.info] NOTICE: [AFT2] errID 0x000644be.021b33e1 CBI event on CPU1
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 192776 kern.info] [AFT2] errID 0x000644be.021b33e1 PA=0x00000000.00565000
> > Nov 18 17:31:44 cressida     E$tag 0x00000000.0e40000a E$State: Shared E$parity 0x07
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x08): 0x00000000.00080000 *Bad* PSYND=0x0004
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000
> >
> > Dave

------------- End Forwarded Message -------------



   << All opinions expressed are mine, not the University's >>

  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
   David Foster    National Center for Microscopy and Imaging Research
    Programmer/Analyst     University of California, San Diego
    dfoster@ucsd.edu       Department of Neuroscience, Mail 0608
    (858) 534-7968         http://ncmir.ucsd.edu/
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   "The reasonable man adapts himself to the world; the unreasonable one
   persists in trying to adapt the world to himself.  Therefore, all progress
   depends on the unreasonable."   -- George Bernard Shaw

Reply from the Sun service engineer regarding the "CBI event":

A CBI event is an Ecache error on a cache line that can occur without the system
panicking. CBI stands for Clean Bad Idle. Clean means that the cache line is
clean, that is, it has not been modified. If it had been modified, it would be a
dirty line, which would have required flushing the changes out to memory. Idle
indicates that this cache line was not in use by the cpu at the time. Bad means
that an error was detected on the line.
 
This is a corrected ("scrubbed") Ecache event. It should be handled just like
any other Ecache event; that is, swap the cpu on the second event only.
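
The "second event" rule above, combined with the "two failures in 6
months" figure from the summary, can be sketched in Python as follows.
This is only an illustration: the function name is hypothetical, the
183-day window is an assumed approximation of "6 months", and the
events are supplied by hand rather than read from any log.

```python
from datetime import datetime, timedelta


def cpus_due_for_replacement(events, threshold=2, window=timedelta(days=183)):
    """Return sorted CPU ids with at least `threshold` events within `window`.

    events: iterable of (timestamp, cpu_id) pairs.
    threshold and window are assumptions taken from this post, not Sun policy.
    """
    by_cpu = {}
    for when, cpu in events:
        by_cpu.setdefault(cpu, []).append(when)

    due = []
    for cpu, times in by_cpu.items():
        times.sort()
        # Compare each event with the one (threshold - 1) positions earlier.
        for i in range(threshold - 1, len(times)):
            if times[i] - times[i - threshold + 1] <= window:
                due.append(cpu)
                break
    return sorted(due)


if __name__ == "__main__":
    events = [
        (datetime(2002, 11, 18, 17, 31, 44), 1),  # the event reported in this thread
        (datetime(2003, 2, 1, 9, 0, 0), 1),       # hypothetical second event
    ]
    print(cpus_due_for_replacement(events))
```

A sliding comparison is used instead of a plain count so that two
events a year apart on the same cpu do not trigger a swap.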

It appears that Ecache error reporting has changed (again). Solaris 8 kernel
patch 108528-13 introduces the changes detailed in bug 4385694. E$ errors seem
to be reported as "xBy events", where x is C for "clean" or D for "dirty", and y
is I for "idle" or B for "busy" (so DBI event, CBB event and so on), reflecting
the state of the cache line when the error was detected. So basically, a CBI
event is telling us that the scrubbing algorithm has found a bad line of Ecache
data and scrubbed it.
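
The naming scheme just described can be decoded mechanically. The
Python sketch below follows the x/y letters exactly as given in this
reply and pulls the event name and cpu number out of a syslog line like
the one in the original question. The function names and the regular
expression are illustrative assumptions, not anything taken from Solaris
itself.

```python
import re

LINE_STATE = {"C": "clean", "D": "dirty"}   # x: state of the cache line
CPU_ACTIVITY = {"I": "idle", "B": "busy"}   # y: whether the line was in use

# Matches e.g. "CBI event on CPU1" anywhere in a log line.
EVENT_RE = re.compile(r"\b([CD])B([IB]) event on CPU(\d+)")


def decode_event(code):
    """Turn a three-letter code such as 'CBI' into 'clean, bad, idle'."""
    if (len(code) != 3 or code[0] not in LINE_STATE
            or code[1] != "B" or code[2] not in CPU_ACTIVITY):
        raise ValueError("not an xBy event code: %r" % code)
    return "%s, bad, %s" % (LINE_STATE[code[0]], CPU_ACTIVITY[code[2]])


def parse_event_line(line):
    """Return (code, cpu_number, meaning), or None if the line has no xBy event."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    code = m.group(1) + "B" + m.group(2)
    return code, int(m.group(3)), decode_event(code)


if __name__ == "__main__":
    line = ("Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 672871 kern.info] "
            "NOTICE: [AFT2] errID 0x000644be.021b33e1 CBI event on CPU1")
    print(parse_event_line(line))
```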
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Fri Nov 22 16:59:36 2002
