SUMMARY (part 2): ecache parity error?

From: <Dan_Kelley_at_ssmhc.com>
Date: Fri Apr 26 2002 - 15:21:05 EDT
Sorry for a second summary, but this one was good, so I thought I would 
forward it along as well.  Thank you for the great explanation, Buddy. I'm 
sure a lot of people will benefit from this knowledge.

 - Dan

----- Forwarded by Dan Kelley/IC/SSMHC on 04/26/2002 02:20 PM -----


"Lumpkin, Buddy" <Buddy.Lumpkin@nordstrom.com>
04/25/2002 03:50 PM

 
        To:     "'Dan_Kelley@ssmhc.com'" <Dan_Kelley@ssmhc.com>
        cc: 
        Subject:        RE: ecache parity error?


Hi,

I used to repair Sun systems to the component level, and I would like to 
make a distinction ...

There is the famed E-Cache error that you can read about on ZDNet and 
other news sources and that everyone knows about. It affects mid-range to 
high-end Sun servers (E3500 and later) running 400+ MHz CPUs. The problem 
is intermittent E-Cache errors on otherwise perfectly good CPUs under 
certain circumstances. Sun has addressed this with two solutions. The 
first solution, after repeated problems, is to replace the modules with 
modules from a different manufacturer. Sun calls these CTO modules.

The next step, if you still experience problems, is to replace these 
modules with special ones. The special modules are "hacked" in such a way 
that the E-Cache chips are actually mirrored. The effect is that if an 
error occurs on a read from one chip, the read is retried from the mirror. 
Sun code-named these modules "Sombra" modules. With customers they refer 
to them as the "Mirrored E-Cache" modules.

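As a rough illustration of the mirrored-read idea described above, here is 
a minimal Python sketch. It is a toy model only, not Sun's hardware 
design: the class name, the single parity bit per entry, and the 
"try primary, then mirror" order are all assumptions made for illustration.

```python
# Toy model of a mirrored cache read. All names here are hypothetical;
# the real modules do this in hardware and the details are assumptions.

def parity(value: int) -> int:
    """Return 1 if value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") & 1


class MirroredCache:
    def __init__(self):
        # Each copy maps address -> (data, stored parity bit).
        self.primary = {}
        self.mirror = {}

    def write(self, addr: int, data: int) -> None:
        entry = (data, parity(data))
        self.primary[addr] = entry
        self.mirror[addr] = entry

    def read(self, addr: int) -> int:
        # Try the primary copy first; on a parity mismatch, retry the mirror.
        for copy in (self.primary, self.mirror):
            data, stored = copy[addr]
            if parity(data) == stored:
                return data
        raise RuntimeError(f"parity error in both copies at {addr:#x}")


cache = MirroredCache()
cache.write(0x3D41FA68, 0x10041EB0)

# Simulate a single-bit flip in the primary copy only.
data, stored = cache.primary[0x3D41FA68]
cache.primary[0x3D41FA68] = (data ^ 0x80, stored)

recovered = cache.read(0x3D41FA68)
print(hex(recovered))
```

In this model a single-bit error in one copy is hidden from the reader, 
and only an error in both copies surfaces as a failure.
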
The second part of the distinction is that E-Cache errors are a very 
common way that CPU chip errors manifest. The 4-400 VME-style Sun systems 
actually had separate chips for E-Cache, along with an individual MMU, 
Page Map, Seg Map, Region Map, Integer Unit (the heart of what we call a 
CPU these days), and a floating point unit. It's the last of its kind. Any 
modern system has one big monolithic CPU with all of these other parts 
built in. Well, that's not entirely true: the E-Cache is still external, 
but it sits on the module that you plug into the board.

These are still among the more common parts on the board to fail because 
they are the most expensive parts. E-Cache is usually the fastest memory 
that you can buy on the market (6-nanosecond access times or better these 
days), so these parts are designed to run on the bleeding edge.

The E-Cache errors you're experiencing on your Ultra 5 or 10 are in fact a 
symptom of a failing CPU, but are not the famed E-Cache blunder made by 
Sun that everyone talks about.

Sorry for the long-winded digression.

--Buddy

-----Original Message-----
From: Dan_Kelley@ssmhc.com [mailto:Dan_Kelley@ssmhc.com]
Sent: Wednesday, April 24, 2002 10:35 AM
To: sunmanagers@sunmanagers.org
Subject: ecache parity error?


Hello, all.

We have a machine that keeps crashing, and I think it is the ecache parity 
error.  I have been waiting for it to happen again before I sent an e-mail 
to this list, though.  Could anyone look at this and tell me if they think 
it is the ecache error?  If not, any clues as to what it is?  Thanks in 
advance!  I will summarize.

 - Dan


uname -a:
SunOS netdev 5.8 Generic_108528-14 sun4u sparc SUNW,Ultra-5_10

I have tracked two crashes. Here is the info for the first one (note that 
the two are slightly different):

echo '$c' | adb -k unix.1 vmcore.1:

physmem 173a7
panicsys(104234b0,1040c198,10050068,78002000,57542400,c) + 44
vpanic(10050068,1040c198,16e76a3d8cac,10,30000689ea8,30000068438) + cc
panic(10050068,804,1,1041a798,fffd,20) + 1c
sync_handler(1041a980,10400000,0,0,0,2) + 150
prom_rtt(10000000,16,f0000000,16e7332a6da9,0,2)
client_handler(f0066d2c,2a10007d6e8,1,104283d8,1,1041a980) + 2c
prom_enter_mon(0,6,b,2a10004bd40,2a10007dd40,0) + 28
debug_enter(0,16e73315c8c5,16e73315c8c9,0,30000ddf1e8,0) + d0
kbdinput(1045a400,4d,30000689d68,300001b5000,0,1013dd4c) + 304
kbdrput(30000adabe8,30000f7e340,30000ad3a98,30000f7e340,30000689d68,30000ad3a20) + 13c
putnext(30000adae48,30000ad9a90,30000adb0a8,30000f7e340,0,0) + 1cc
async_softint(30000f7e340,1,ffff,20000,0,30000adae48) + 568
asysoftintr(3000017a008,30000b7e000,1,2a10007dd40,10180,1026fba8) + 70
intr_thread(2a10001fd40,1041b180,10423890,10423890,0,0) + a4
idle(1040f864,0,0,1041b180,3000005d6c8,0) + 54
thread_start(0,0,0,0,0,0) + 4
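
When comparing two dumps like these, it can help to reduce each `$c` 
listing to its function names and offsets. The following Python sketch 
does that for the first few frames copied from the output above. The 
regular expression and the helper name `parse_frames` are my own 
assumptions about the line format, not part of adb.

```python
import re

# First frames copied from the adb output above (most recent call first).
TRACE = """\
panicsys(104234b0,1040c198,10050068,78002000,57542400,c) + 44
vpanic(10050068,1040c198,16e76a3d8cac,10,30000689ea8,30000068438) + cc
panic(10050068,804,1,1041a798,fffd,20) + 1c
sync_handler(1041a980,10400000,0,0,0,2) + 150
prom_rtt(10000000,16,f0000000,16e7332a6da9,0,2)
"""

# Assumed line shape: name(hex,hex,...) optionally followed by "+ hexoffset".
FRAME_RE = re.compile(
    r"^(?P<func>\w+)\((?P<args>[^)]*)\)(?:\s*\+\s*(?P<off>[0-9a-f]+))?$"
)


def parse_frames(text):
    """Return a list of (function, [args], offset-or-None) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if not m:
            continue  # skip lines such as "physmem 173a7"
        args = [int(a, 16) for a in m.group("args").split(",") if a]
        off = int(m.group("off"), 16) if m.group("off") else None
        frames.append((m.group("func"), args, off))
    return frames


frames = parse_frames(TRACE)
print(" <- ".join(name for name, _, _ in frames))
```

Running the same function over both listings and comparing the name 
sequences is a quick way to confirm whether two panics took the same path.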

/var/adm/messages from this one:

Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 932869 kern.warning] 
WARNING: [AFT1] EDP event on CPU0 Data access at TL=0, errID 
0x00015289.afcae2ba
Apr 12 17:59:18 netdev     AFSR 0x00000000.80400080<PRIV,EDP> AFAR 
0x00000000.3d41fa68
Apr 12 17:59:18 netdev     AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00 
Fault_PC 0x10031cc8
Apr 12 17:59:18 netdev     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 
UDBL.ESYND 0x00
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 683009 kern.info] [AFT2] 
errID 0x00015289.afcae2ba PA=0x00000000.3d41fa68
Apr 12 17:59:18 netdev     E$tag 0x00000000.0003cf50 E$State: Modified 
E$parity 0x03 Badlines found=6
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] 
E$Data (0x00): 0x00000000.10041eb0
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] 
E$Data (0x08): 0x00000000.10041eb4
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] 
E$Data (0x10): 0x00000000.0247e008
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] 
E$Data (0x18): 0x00000000.10423890
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] 
E$Data (0x20): 0x00000000.10041eb0
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 989652 kern.info] [AFT2] 
E$Data (0x28): 0x80000000.00000000 *Bad* PSYND=0x0080
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] 
E$Data (0x30): 0x00000000.00000000
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] 
E$Data (0x38): 0x000002a1.000b7d20
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 601312 kern.info] [AFT2] 
errID 0x00015289.afcae2ba AFAR was derived from E$Tag
Apr 12 17:59:18 netdev unix: [ID 836849 kern.notice] 
Apr 12 17:59:18 netdev ^Mpanic[cpu0]/thread=2a10007dd20: 
Apr 12 17:59:18 netdev unix: [ID 455523 kern.notice] [AFT1] errID 
0x00015289.afcae2ba EDP Error(s)
Apr 12 17:59:18 netdev     See previous message(s) for details
Apr 12 17:59:18 netdev unix: [ID 100000 kern.notice] 
Apr 12 17:59:18 netdev genunix: [ID 723222 kern.notice] 000002a10007d200 
SUNW,UltraSPARC-IIi:cpu_aflt_log+4e0 (2a10007d2be, 1, 101483a0, 
2a10007d448, 2a10007d30b, 101483c8)
Apr 12 17:59:19 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000000000000000 000002a10007d510 0000000000000003 0000000000000010
Apr 12 17:59:19 netdev   %l4-7: 0000000000200000 0000000000400000 
0000000000000000 000002a10001f9c0
Apr 12 17:59:19 netdev genunix: [ID 723222 kern.notice] 000002a10007d450 
SUNW,UltraSPARC-IIi:cpu_async_error+868 (1, 2a10007d510, 80400080, 0, 
640000080400080, 2a10007d6d0)
Apr 12 17:59:19 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000000000000001 0000000000000032 0000000000000000 0000000000000000
Apr 12 17:59:19 netdev   %l4-7: 0000000000000219 0000000000000000 
000003000005d748 0000000000000000
Apr 12 17:59:19 netdev genunix: [ID 723222 kern.notice] 000002a10007d620 
unix:prom_rtt+0 (300001b2000, 8000000000000000, a, a, 0, 0)
Apr 12 17:59:19 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000000000000001 0000000000001400 0000000000001600 000000001013fb54
Apr 12 17:59:19 netdev   %l4-7: 0000030000697ea0 0000000000000001 
000000000000000a 000002a10007d6d0
Apr 12 17:59:19 netdev genunix: [ID 723222 kern.notice] 000002a10007d770 
genunix:callout_schedule_1+4 (300001b2000, 10443508, 300001b5000, 
10072cf4, 0, 101424b0)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000000000000008 0000000000000002 0000000000000001 000000001041b718
Apr 12 17:59:20 netdev   %l4-7: 000000001041b338 0000000000000016 
000000001041baf8 000002a10007d7b0
Apr 12 17:59:20 netdev genunix: [ID 723222 kern.notice] 000002a10007d820 
genunix:callout_schedule+54 (104391fc, 1, 10439178, 8, 1, 300000683c8)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
00000000100d312c 0000030000cec000 0000030000d79602 0000030000cec000
Apr 12 17:59:20 netdev   %l4-7: 000003000188f040 0000000000000000 
000003000148af00 000002a10051dba0
Apr 12 17:59:20 netdev genunix: [ID 723222 kern.notice] 000002a10007d8d0 
genunix:clock+474 (1045a800, 1041b338, 1042dc00, 94f476874837, 0, 0)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000000000000000 0000000000000001 000002a10007dd20 0000000000000000
Apr 12 17:59:20 netdev   %l4-7: 000000001045a000 000000003b9aca00 
000000001041baf8 00000000fed3a004
Apr 12 17:59:20 netdev genunix: [ID 723222 kern.notice] 000002a10007d9a0 
genunix:cyclic_softint+a4 (1041b338, 30000057928, 1, 3, 30000068478, 
10073f0c)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000030000057930 800000000237f894 0000000000000000 0000030000068478
Apr 12 17:59:20 netdev   %l4-7: 00000300000578c8 000003000068dea8 
0000000000000000 000003000068ded0
Apr 12 17:59:21 netdev genunix: [ID 723222 kern.notice] 000002a10007da60 
unix:cbe_level10+8 (0, 803, 1041b338, 2a10007dd20, 10060, 1000b34c)
Apr 12 17:59:21 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
00000000102e4934 0000000000000001 0000000000000001 0000030000070ed8
Apr 12 17:59:21 netdev   %l4-7: 0000000000000000 0000000000000000 
0000000000000000 0000000000000000
Apr 12 17:59:21 netdev unix: [ID 100000 kern.notice] 
Apr 12 17:59:21 netdev genunix: [ID 672855 kern.notice] syncing file 
systems...
Apr 12 17:59:21 netdev genunix: [ID 904073 kern.notice]  done
Apr 12 17:59:22 netdev genunix: [ID 353387 kern.notice] dumping to 
/dev/dsk/c0t0d0s1, offset 322174976
Apr 12 17:59:22 netdev uata: [ID 606412 kern.warning] WARNING: timeout: 
reset bus chno = 0 targ = 0
Apr 12 17:59:38 netdev genunix: [ID 409368 kern.notice] ^M100% done: 8116 
pages dumped, compression ratio 3.96, 
Apr 12 17:59:38 netdev genunix: [ID 851671 kern.notice] dump succeeded


And now for the second crash:

echo '$c' | adb -k unix.0 vmcore.0:

physmem 173a7
panicsys(104234b0,1040c198,10050068,78002000,39ff00,c) + 44
vpanic(10050068,1040c198,faabfb648,10,30000689ea8,30000068438) + cc
panic(10050068,804,1,1041a798,fffd,20) + 1c
sync_handler(1041a980,10400000,0,0,0,2) + 150
prom_rtt(10000000,16,f0000000,f810ca9c6,0,2)
client_handler(f0066d2c,2a10007d6e8,1,104283d8,1,1041a980) + 2c
prom_enter_mon(0,6,b,2a10004bd40,2a10007dd40,0) + 28
debug_enter(0,f80db6987,f80db698a,0,30001092020,0) + d0
kbdinput(1045a400,4d,30000689d68,300001b5000,0,1013dd4c) + 304
kbdrput(30000adabe8,3000108f080,30000ad3a18,3000108f080,30000689d68,30000ad39a0) + 13c
putnext(30000adae48,30000ad9a90,30000adb0a8,3000108f080,0,0) + 1cc
async_softint(3000108f080,1,ffff,20000,0,30000adae48) + 568
asysoftintr(3000017a008,30000b7e000,1,2a10007dd40,10180,1026fba8) + 70
intr_thread(2a10001fd40,1041b180,10423890,10423890,0,0) + a4
idle(1040f864,0,0,1041b180,3000005d6c8,0) + 54
thread_start(0,0,0,0,0,0) + 4


/var/adm/messages leading up to the reboot:

Apr 24 12:20:07 netdev SUNW,UltraSPARC-IIi: [ID 370172 kern.warning] 
WARNING: [AFT1] EDP event on CPU0 Instruction access at TL=0, errID 
0x0001d01e.baad443a
Apr 24 12:20:07 netdev     AFSR 0x00000000.004000f0<EDP> AFAR 
0xffffffff.ffffffff
Apr 24 12:20:07 netdev     AFSR.PSYND 0x00f0(Score 45) AFSR.ETS 0x00 
Fault_PC 0x97560
Apr 24 12:20:07 netdev     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 
UDBL.ESYND 0x00
Apr 24 12:20:07 netdev SUNW,UltraSPARC-IIi: [ID 798591 kern.info] [AFT2] 
errID 0x0001d01e.baad443a No error found in ecache (No fault PA available)
Apr 24 12:20:07 netdev unix: [ID 836849 kern.notice] 
Apr 24 12:20:07 netdev ^Mpanic[cpu0]/thread=3000165a440: 
Apr 24 12:20:07 netdev unix: [ID 424580 kern.notice] [AFT1] errID 
0x0001d01e.baad443a EDP Error(s)
Apr 24 12:20:07 netdev     See previous message(s) for details
Apr 24 12:20:08 netdev unix: [ID 100000 kern.notice] 
Apr 24 12:20:08 netdev genunix: [ID 723222 kern.notice] 000002a1005dd6d0 
SUNW,UltraSPARC-IIi:cpu_aflt_log+4e0 (2a1005dd78e, 1, 101483a0, 
2a1005dd918, 2a1005dd7db, 101483c8)
Apr 24 12:20:08 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000000000000000 000002a1005dd9e0 0000000000000003 0000000000000010
Apr 24 12:20:08 netdev   %l4-7: 0000000000200000 0000000000400000 
0000000000000001 0000000000000080
Apr 24 12:20:08 netdev genunix: [ID 723222 kern.notice] 000002a1005dd920 
SUNW,UltraSPARC-IIi:cpu_async_error+868 (1, 2a1005dd9e0, 4000f0, 0, 
1400000004000f0, 2a1005ddba0)
Apr 24 12:20:08 netdev genunix: [ID 179002 kern.notice]   %l0-3: 
0000000000000001 000000000000000a 0000000000000000 0000000000000000
Apr 24 12:20:08 netdev   %l4-7: 0000000000004208 0000000000000000 
00000000007fbdd0 0000000000000084
Apr 24 12:20:08 netdev unix: [ID 100000 kern.notice] 
Apr 24 12:20:08 netdev genunix: [ID 672855 kern.notice] syncing file 
systems...
Apr 24 12:20:09 netdev genunix: [ID 733762 kern.notice]  1
Apr 24 12:20:10 netdev genunix: [ID 904073 kern.notice]  done
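
Because the mail client wrapped the long syslog lines, pulling the [AFT1] 
events out of excerpts like these takes an unwrapping step first. The 
Python sketch below rejoins continuation lines and then extracts the event 
type, access type, and errID. The two sample records are copied from the 
excerpts above; the regular expressions and helper names are assumptions 
about the format, not a documented interface.

```python
import re

# Two wrapped records copied from the excerpts above.
RAW = """\
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 932869 kern.warning]
WARNING: [AFT1] EDP event on CPU0 Data access at TL=0, errID
0x00015289.afcae2ba
Apr 24 12:20:07 netdev SUNW,UltraSPARC-IIi: [ID 370172 kern.warning]
WARNING: [AFT1] EDP event on CPU0 Instruction access at TL=0, errID
0x0001d01e.baad443a
"""

# A new record is assumed to start with a syslog timestamp.
STAMP_RE = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")
EVENT_RE = re.compile(
    r"\[AFT1\] (?P<kind>\w+) event on CPU(?P<cpu>\d+) "
    r"(?P<access>\w+) access at TL=(?P<tl>\d+), errID (?P<errid>\S+)"
)


def unwrap(text):
    """Join wrapped continuation lines back onto their record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if STAMP_RE.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def aft1_events(text):
    """Return (timestamp, kind, access, errID) for each [AFT1] event."""
    events = []
    for rec in unwrap(text):
        m = EVENT_RE.search(rec)
        if m:
            events.append(
                (rec[:15], m.group("kind"), m.group("access"), m.group("errid"))
            )
    return events


events = aft1_events(RAW)
for event in events:
    print(event)
```

Fed the two records above, this reports one EDP event on a data access and 
one on an instruction access, which is the difference between the two 
crashes noted earlier.
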
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Fri Apr 26 18:20:33 2002
