SUMMARY: E420R unexplained panic after UE error

From: Tony van Lingen <tony.vanlingen_at_epa.qld.gov.au>
Date: Wed Feb 11 2004 - 01:10:54 EST
G'day all.

Sorry for the time it took to post a summary: I wanted to wait until we
had worked it out completely. Thanks to Surender Dinkar, Gene Beaird,
Bill Voight, Kevin Buterbaugh, Jay Lessert and Joe Fletcher for their
help. In short: 1) it is a CPU fault, or at least a fault in the memory
mounted on the CPU board, and 2) apparently Sun's world-wide policy is
to sit it out. If a panic happens twice within a short period (< 1 year)
on the same CPU, Sun will replace the module.
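
Since the replacement rule hinges on repeat panics on the same CPU, it
may help to count UE events per CPU in the messages log. The following
is only a rough sketch: the regular expression is modeled on the single
[AFT1] excerpt quoted in the original question below, the function names
are my own, and it does not check the one-year window because syslog
timestamps carry no year.

```python
import re
from collections import defaultdict

# Matches the [AFT1] warning that opens an uncorrectable-error report.
# Assumption: the wording is taken from the one excerpt in this post;
# other kernel patch levels may phrase the message differently.
UE_RE = re.compile(
    r"\[AFT1\] Uncorrectable Memory Error on CPU(\d+).*?"
    r"errID (0x[0-9a-f]+\.[0-9a-f]+)",
    re.IGNORECASE,
)


def ue_events_by_cpu(log_text):
    """Return {cpu_number: set of errIDs} for UE events found in log_text."""
    # Undo mail-style wrapping and "> " quoting so each report is contiguous.
    flat = re.sub(r"\s*\n>?\s*", " ", log_text)
    events = defaultdict(set)
    for cpu, err_id in UE_RE.findall(flat):
        # Grouping by errID keeps the follow-up panic lines, which repeat
        # the same errID, from being counted as separate events.
        events[int(cpu)].add(err_id)
    return dict(events)


def cpus_with_repeat_ue(log_text, threshold=2):
    """Return the sorted CPU numbers with at least `threshold` distinct UE events."""
    by_cpu = ue_events_by_cpu(log_text)
    return sorted(cpu for cpu, ids in by_cpu.items() if len(ids) >= threshold)
```

Fed the contents of the messages files, cpus_with_repeat_ue() lists the
CPUs that would appear to qualify; the dates still have to be checked by
hand.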

Several people pointed to the cache, which they say uses parity rather
than ECC memory. A fully patched box (such as ours) should have a
workaround installed, which reduces the incidence but does not
eliminate it. Gene, Bill and Kevin suggested re-seating the memory banks
and CPU boards, as this sometimes helps. Unfortunately, under our
maintenance contract we are not allowed to physically touch the box, and
Sun did not want to send an engineer to do even that... it would
introduce new unknown variables. It seems we can do nothing but wait for
the inevitable...

Nobody had an answer about the differences between UE errors with
different Syndromes. As I said, the references I found referred to
Syndrome 0x03, whereas our error showed Syndrome 0x77. Surender sent a
summary by Buddy Limpkin, posted to this list in April 2002, with an
excellent discussion of the E-Cache error and the Sun CTO and Sombrero
modules. The subject of that summary, however, was an EDP error rather
than a UE error. I guess the question remains: does anybody know what
the different syndromes mean, if anything?

As to the qla error messages in the log, Kevin reinforced my opinion
that force-loading the drivers is not necessary: none of these devices
contain boot partitions. In the meantime we have been able to trace at
least some of those messages to a faulty UPS that the storage array is
plugged into (the panicky server is not plugged in there, though).
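
For anyone wanting to read the raw bytes in the "check condition sense
data" line of the log quoted below, here is a small sketch that decodes
them as fixed-format SCSI sense data. The helper names are my own, and
the input string is the dump with the mail wrapping removed. The decoded
ASC 0x29 matches what the sd driver printed (power on, reset, or bus
reset occurred), which is at least consistent with a power problem on
the array side, though it proves nothing by itself.

```python
def parse_sense_dump(dump):
    """Convert a dump such as '70h  0h  6h' into a list of integers."""
    return [int(token.rstrip("h"), 16) for token in dump.split()]


def decode_fixed_sense(data):
    """Pick the sense key, ASC and ASCQ out of fixed-format sense data."""
    # Response codes 0x70/0x71 mark fixed-format sense data.
    if data[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "sense_key": data[2] & 0x0F,  # low nibble of byte 2
        "asc": data[12],              # additional sense code
        "ascq": data[13],             # additional sense code qualifier
    }


# Bytes copied from the log in this post, unwrapped onto one line.
dump = "70h  0h  6h  0h  0h  0h  0h  6h  0h  0h  0h  0h 29h  0h  0h  0h  0h 20h"
sense = decode_fixed_sense(parse_sense_dump(dump))
print(sense)  # sense key 6 is Unit Attention; ASC 0x29 prints as 41
```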

Cheers,

-- 
Tony van Lingen
Technical Consultant

The original question follows.
==============================

Dear All,

Last Monday we experienced an unexplained panic that seems to be due
to a memory fault. We've reported it to Sun Support, who basically
advise us to sit back and hope it won't happen again. Obviously this
is not a nice prospect, since it is on our main intranet and mail
server. The extended error message in the message log was:

> Feb  2 12:11:28 Slarty SUNW,UltraSPARC-II: [ID 940907 kern.warning] 
> WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access 
> at TL=0, errID 0x000a1658.78a2360f
> Feb  2 12:11:28 Slarty     AFSR 0x00000001<ME>.80300000<PRIV,UE,CE> 
> AFAR 0x00000000.7f8b4900
> Feb  2 12:11:28 Slarty     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 
> Fault_PC 0x100b4904
> Feb  2 12:11:28 Slarty     UDBH 0x0108<CE> UDBH.ESYND 0x08 UDBL 
> 0x0377<UE,CE> UDBL.ESYND 0x77
> Feb  2 12:11:28 Slarty     UDBL Syndrome 0x77 Memory Module U1302 
> U0302 U1301 U0301
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 482194 kern.info] 
> [AFT2] errID 0x000a1658.78a2360f PA=0x00000000.7f8b4900
> Feb  2 12:11:29 Slarty     E$tag 0x00000000.0c400ff1 E$State: Shared 
> E$parity 0x06
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info] 
> [AFT2] E$Data (0x00): 0xd05fa7f7.80a22000
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 989652 kern.info] 
> [AFT2] E$Data (0x08): 0x036d0e15.01000000 *Bad* PSYND=0x00ff
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info] 
> [AFT2] E$Data (0x10): 0xd25c2010.80a26000
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info] 
> [AFT2] E$Data (0x18): 0x12600007.82aa0509
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info] 
> [AFT2] E$Data (0x20): 0x7ffffe26.92100010
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 989652 kern.info] 
> [AFT2] E$Data (0x28): 0xc55fadf7.19880d0d *Bad* PSYND=0x00ff
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info] 
> [AFT2] E$Data (0x30): 0xc4742010.0268000a
> Feb  2 12:11:29 Slarty SUNW,UltraSPARC-II: [ID 359263 kern.info] 
> [AFT2] E$Data (0x38): 0x0309070d.911e050f
> Feb  2 12:11:29 Slarty unix: [ID 836849 kern.notice]
> Feb  2 12:11:29 Slarty ^Mpanic[cpu1]/thread=3000ad426c0:
> Feb  2 12:11:29 Slarty unix: [ID 159042 kern.notice] [AFT1] errID 
> 0x000a1658.78a2360f UE Error(s)
> Feb  2 12:11:29 Slarty     See previous message(s) for details
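
The UDBH and UDBL values in the message above can be split into their
flags and syndrome with a few bit masks. This is only a sketch: the bit
layout is inferred from the log itself (0x0108 is printed as <CE> with
ESYND 0x08, and 0x0377 as <UE,CE> with ESYND 0x77), and decode_udb is my
own name. It shows where the syndrome value comes from, not what a
particular syndrome means.

```python
def decode_udb(value):
    """Split a UDB error register value into flags and ECC syndrome.

    Assumed layout, inferred from the values in the log:
    bits 7..0 = ECC syndrome, bit 8 = CE, bit 9 = UE.
    """
    flags = []
    if value & 0x200:
        flags.append("UE")
    if value & 0x100:
        flags.append("CE")
    return {"flags": flags, "esynd": value & 0xFF}


for name, raw in (("UDBH", 0x0108), ("UDBL", 0x0377)):
    decoded = decode_udb(raw)
    print(f"{name} 0x{raw:04x} <{','.join(decoded['flags'])}> "
          f"ESYND 0x{decoded['esynd']:02x}")
```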


Via Google I found some references to the UE messages, saying that UDBL
Syndrome 0x03 was not a hardware failure. There was nothing on the
Syndrome 0x77 reported by our box. The Sun engineer said that (quote):

> These appear to be the usually error messages I would expect to see 
> due to a uncorrectable memory error.


Well, how about that. I wonder if any of you have a more detailed
reaction to the above error. Apart from the panic, the message log
showed a large number of errors on the qlogic card, which connects to a
brand new Dell SAN with Clariion CX400 storage arrays:

> Feb  2 12:11:29 Slarty unix: [ID 100000 kern.notice]
> Feb  2 12:11:29 Slarty genunix: [ID 723222 kern.notice] 
> 000002a100be72e0 SUNW,UltraSPARC-II:cpu_aflt_log+568 (2a100be739e, 1, 
> 101517a8, 2a100be7528, 2a100be73eb, 101517d0)
> Feb  2 12:11:29 Slarty genunix: [ID 179002 kern.notice]   %l0-3: 
> 0000000000000000 0000000000000003 000002a100be75f0 0000000000000010
> Feb  2 12:11:29 Slarty   %l4-7: 0000000000000000 0000000000000000 
> 0000000000000000 0000000000000000
> Feb  2 12:11:29 Slarty genunix: [ID 723222 kern.notice] 
> 000002a100be7530 SUNW,UltraSPARC-II:cpu_async_error+868 (1046a270, 
> 2a100be75f0, 180300000, 0, 15bba1180300000, 2a100be77b0)
> Feb  2 12:11:29 Slarty genunix: [ID 179002 kern.notice]   %l0-3: 
> 000000001040db3c 000000000000000a 0000000000000377 0000000000000108
> Feb  2 12:11:29 Slarty   %l4-7: 000000007f8b4900 0000000000400000 
> 0000000000400000 0000000000000001
> Feb  2 12:11:29 Slarty genunix: [ID 723222 kern.notice] 
> 000002a100be7700 unix:prom_rtt+0 (0, 2f, 30006887de8, 0, 3000bea8000, 
> 3000ac5d100)
> Feb  2 12:11:30 Slarty genunix: [ID 179002 kern.notice]   %l0-3: 
> 0000000000000005 0000000000001400 0000004480001604 0000000010148e04
> Feb  2 12:11:30 Slarty   %l4-7: 00000000fd3619d8 000002a100cf7af0 
> 0000000000000000 000002a100be77b0
> Feb  2 12:11:30 Slarty genunix: [ID 723222 kern.notice] 
> 000002a100be7850 genunix:pcache_insert+e8 (2a100be7a0c, 1, 
> 3000c7374e8, 0, 3000c6205f8, 2f)
> Feb  2 12:11:30 Slarty genunix: [ID 179002 kern.notice]   %l0-3: 
> 000003000b462210 0000030001c41000 000003000af70700 0000000000000001
> Feb  2 12:11:30 Slarty   %l4-7: 0000000000000000 0000000000000000 
> 0000000000000000 0000000000000000
> Feb  2 12:11:30 Slarty genunix: [ID 723222 kern.notice] 
> 000002a100be7910 genunix:pcacheset_resolve+25c (1, 3000c7374e0, 1, 3, 
> 30001c41000, 0)
> Feb  2 12:11:30 Slarty genunix: [ID 179002 kern.notice]   %l0-3: 
> 000003000a6f7b68 000003000ac5d100 0000000000000002 0000000000000008
> Feb  2 12:11:30 Slarty   %l4-7: 0000000000000003 000003000a6f7b60 
> 000003000c7374e8 0000000000000001
> Feb  2 12:11:30 Slarty genunix: [ID 723222 kern.notice] 
> 000002a100be7a10 genunix:poll+32c (30001c40100, 20, 3000c7374e0, 1, 
> 314, 5fa074)
> Feb  2 12:11:30 Slarty genunix: [ID 179002 kern.notice]   %l0-3: 
> 0000000000000004 0000030001c41000 000000000000000a 000002a100be7ac8
> Feb  2 12:11:30 Slarty   %l4-7: 0000030001c41010 0000000000000001 
> 000003000c6205f8 0000000000000000
> Feb  2 12:11:30 Slarty unix: [ID 100000 kern.notice]
> Feb  2 12:11:30 Slarty genunix: [ID 672855 kern.notice] syncing file 
> systems...
> Feb  2 12:11:31 Slarty qla2300: [ID 467028 kern.info] qla2300(1): 
> isp_firmware, firmware load needed
> Feb  2 12:11:31 Slarty qla2300: [ID 693156 kern.info] qla2300(1): 
> fw_ready, waiting firmware state=1h, wait_timer=24, min_wait=10
> Feb  2 12:11:32 Slarty qla2300: [ID 693156 kern.info] qla2300(1): 
> fw_ready, waiting firmware state=1h, wait_timer=23, min_wait=10
> Feb  2 12:11:33 Slarty qla2300: [ID 693156 kern.info] qla2300(1): 
> fw_ready, waiting firmware state=1h, wait_timer=22, min_wait=10
> Feb  2 12:11:34 Slarty qla2300: [ID 693156 kern.info] qla2300(1): 
> fw_ready, waiting firmware state=1h, wait_timer=21, min_wait=10
> Feb  2 12:11:35 Slarty qla2300: [ID 302519 kern.info] qla2300(1): 
> async_event, 8030h Point to Point Mode received
> Feb  2 12:11:35 Slarty qla2300: [ID 996118 kern.info] qla2300(1): 
> Fibre Channel Loop is Down (8030)
> Feb  2 12:11:35 Slarty qla2300: [ID 225349 kern.info] qla2300(1): 
> async_event, 8011h Loop Up received
> Feb  2 12:11:35 Slarty qla2300: [ID 935615 kern.info] qla2300(1): 
> async_event, 8014h Port Database Update received
> Feb  2 12:11:35 Slarty qla2300: [ID 567540 kern.info] qla2300(1): 
> Fibre Channel Loop is Up (8014)
> Feb  2 12:11:35 Slarty qla2300: [ID 873664 kern.info] qla2300(1): 
> configure_fabric, Re-login of device, tgt=2, wwpn=500601681020d58ah
> Feb  2 12:11:35 Slarty qla2300: [ID 818760 kern.info] qla2300(1): 
> fabric_login, loop_id=0h, mb[1]=0h, wwpn=500601681020d58ah
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=0h, lun=0, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=0h, lun=1, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=0h, lun=2, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=0h, lun=3, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=0h, lun=4, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 873664 kern.info] qla2300(1): 
> configure_fabric, Re-login of device, tgt=1, wwpn=500601601020d58ah
> Feb  2 12:11:35 Slarty qla2300: [ID 818760 kern.info] qla2300(1): 
> fabric_login, loop_id=1h, mb[1]=0h, wwpn=500601601020d58ah
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=1h, lun=0, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=1h, lun=1, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=1h, lun=2, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=1h, lun=3, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 518765 kern.info] qla2300(1): 
> cfg_lun, configured loop_id=1h, lun=4, type=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 148734 kern.info] qla2300(1): 
> fcport_bind, exiting tgt=2, loop_id=0h
> Feb  2 12:11:35 Slarty qla2300: [ID 148734 kern.info] qla2300(1): 
> fcport_bind, exiting tgt=1, loop_id=1h
> Feb  2 12:11:35 Slarty qla2300: [ID 175527 kern.info] qla2300(1): 
> configure_loop, 2 gigabit data rate connection
> Feb  2 12:11:35 Slarty qla2300: [ID 467028 kern.info] qla2300(1): 
> configure_loop, F-PORT connection
> Feb  2 12:11:35 Slarty qla2300: [ID 465925 kern.info] qla2300(1): 
> status_entry, check condition sense data t1d0
> Feb  2 12:11:35 Slarty 70h  0h  6h  0h  0h  0h  0h  6h  0h  0h  0h  0h 
> 29h  0h  0h  0h  0h 20h
> Feb  2 12:11:35 Slarty scsi: [ID 107833 kern.warning] WARNING: 
> /pci@1f,4000/fibre-channel@2/sd@1,0 (sd557):
> Feb  2 12:11:35 Slarty  Error for Command: write                   
> Error Level: Retryable
> Feb  2 12:11:35 Slarty scsi: [ID 107833 kern.notice]    Requested 
> Block: 1664                      Error Block: 1664
> Feb  2 12:11:35 Slarty scsi: [ID 107833 kern.notice]    Vendor: 
> DGC                                Serial Number: 0000006CCCCL
> Feb  2 12:11:35 Slarty scsi: [ID 107833 kern.notice]    Sense Key: 
> Unit Attention
> Feb  2 12:11:35 Slarty scsi: [ID 107833 kern.notice]    ASC: 0x29 
> (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
> Feb  2 12:12:00 Slarty unix: [ID 836849 kern.notice]
> Feb  2 12:12:00 Slarty ^Mpanic[cpu1]/thread=3000ad426c0:
> Feb  2 12:12:00 Slarty unix: [ID 715357 kern.notice] panic sync timeout
> Feb  2 12:12:00 Slarty unix: [ID 100000 kern.notice]
> Feb  2 12:12:00 Slarty genunix: [ID 353387 kern.notice] dumping to 
> /dev/md/dsk/d1, offset 837681152


After which the system rebooted:

> Feb  2 12:31:38 Slarty genunix: [ID 540533 kern.notice] ^MSunOS 
> Release 5.8 Version Generic_108528-27 64-bit
> Feb  2 12:31:38 Slarty genunix: [ID 913632 kern.notice] Copyright 
> 1983-2003 Sun Microsystems, Inc.  All rights reserved.
> Feb  2 12:31:38 Slarty genunix: [ID 723599 kern.warning] WARNING: 
> Driver alias "pci1077,2200" conflicts with an existing driver name or 
> alias.
> Feb  2 12:31:38 Slarty unix: [ID 389951 kern.info] mem = 2097152K 
> (0x80000000)
> Feb  2 12:31:38 Slarty unix: [ID 930857 kern.info] avail mem = 2051686400
> Feb  2 12:31:38 Slarty rootnex: [ID 466748 kern.info] root nexus = Sun 
> Enterprise 420R (2 X UltraSPARC-II 450MHz)
> Feb  2 12:31:38 Slarty rootnex: [ID 349649 kern.info] pcipsy0 at root: 
> UPA 0x1f 0x4000
> Feb  2 12:31:38 Slarty genunix: [ID 936769 kern.info] pcipsy0 is 
> /pci@1f,4000
> Feb  2 12:31:38 Slarty rootnex: [ID 349649 kern.info] pcipsy1 at root: 
> UPA 0x1f 0x2000
> Feb  2 12:31:38 Slarty genunix: [ID 936769 kern.info] pcipsy1 is 
> /pci@1f,2000


The qla and scsi errors still occur, especially when a lot of disk
activity takes place (e.g. the daily backup). There is also a message
about a conflicting driver alias when the system rebooted. Could these
errors have anything to do with the panic? What could be causing them?
And would force-loading the device drivers (advised by the Sun engineer)
solve these transport problems?




_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Wed Feb 11 01:10:49 2004
