SUMMARY: Ultra SPARC: panic on CPU, memory faults, SCSI problems

From: Stefan (s.voss@terradata.de)
Date: Fri Apr 25 1997 - 04:33:22 CDT


Hello

Thanks to all who replied.

The answers fell into two groups:

        1. SCSI problems: check cables, termination and so on...
        2. CPU problems: replace the CPU

We discussed the problems with our hardware vendor, and he agreed that we have
a defective CPU. We will replace it ASAP. Hopefully this will also clear up the
SCSI trouble (but I don't think so...).

Unfortunately, I have deleted the answers from
   "Jeff J. Dingbaum" <dingbaum@hep.net>
   Weldon S Godfrey 3 <weldon@excelsus.com>
The other answers are listed below.

Jeff Dingbaum pointed me in the right direction: some older 200 MHz Ultras have
a problem with their cache. Modules with revision -04 must be replaced by
modules with revision -05.

Thanks a lot,

                        Stefan

                                      ,,,
                                     (o o)
 --------------------------------o0Oo-(_)-oO0o---------------------------------

Stefan Voss Phone: # 49 (0) 5139-9908-51
Software & System Support Fax: # 49 (0) 5139-9908-10
TerraData Geophysical Services GmbH e-mail: s.voss@terradata.de
Ehlbeck 15 a
D - 30938 Burgwedel, Germany

*******************************************************************************

ANSWERS:

From: robin.landis@imail.exim.gov

Hi. I thought you had to install at least Solaris 2.5.1 for ultrasparc to
boot at 64 bit. Is it possible that you really have just 2.5 installed and
that's causing this problem? ....Robin

-------------------------------------------------------------------------------

From: jam@cdi.cdicad.com (James Musso)

Stefan,
I believe you have SCSI devices with different asynchronous data rates. To fix
this, change the file /etc/system as follows:

set scsi_options = 0x58 (not 378)
Then reboot. This will slow the transfer rate down.

hope this helps

-------------------------------------------------------------------------------

From: Jay Lessert <jayl@latticesemi.com>

You may or may not have a SCSI setup problem.

You *do* have a CPU module problem. Hopefully you're either under
warranty or maintenance contract. Fix that, and then worry about
the SCSI.

-------------------------------------------------------------------------------

From: "Rick von Richter" <rickv@mwh.com>

Have you tried getting a crash dump and analyzing it yourself, or just sending
it to Sun so they can tell you what's up? To enable crash dumps, do the
following. Find a partition on your system that has enough space to hold ~50%
of total memory. Let's use /opt as an example. Then edit /etc/init.d/sysetup
and uncomment the bottom lines shown below.

##
## Default is to not do a savecore
##
#if [ ! -d /var/crash/`uname -n` ]
#then mkdir -m 0700 -p /var/crash/`uname -n`
#fi
# echo 'checking for crash dump...\c '
#savecore /var/crash/`uname -n`
# echo ''

Replace /var with /opt (or whatever partition you are going to use) and that's
it.

When the system panics again, get to the OK prompt (L1-A) and type 'sync'. When
it reboots, it will dump the memory core into the above directory as a couple
of files, which you can tar and send to Sun.
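
The sizing rule above (room for roughly 50% of total memory) can be checked
before a directory is chosen. This is only a minimal sketch, assuming a machine
where Python 3 is available. The function name and the idea of passing the
memory size in by hand are illustrative assumptions, not part of savecore or
Solaris.

```python
import shutil


def has_room_for_crash_dump(directory, total_memory_bytes, fraction=0.5):
    """Return True if `directory` has at least `fraction` of RAM free.

    `total_memory_bytes` must be supplied by the caller; this sketch does
    not try to detect the installed memory itself.
    """
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= total_memory_bytes * fraction


if __name__ == "__main__":
    # Hypothetical example: a machine with 128 MB of RAM, checking /opt.
    ram = 128 * 1024 * 1024
    print(has_room_for_crash_dump("/opt", ram))
```

The 0.5 default simply mirrors the "~50% of total memory" rule of thumb; a
larger fraction is the safer choice if space allows.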

Hope this helps,

-------------------------------------------------------------------------------

From: bismark@alta.Jpl.Nasa.Gov (Bismark Espinoza)

I think you have a bad cpu module:

Apr 23 11:49:15 kybele unix: panic[cpu0]/thread=0x511507e0: CPU0 Ecache SRAM Data Parity Error: AFSR 0x00000000 00408000 AFAR 0x000001c2 00000000
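
To find every occurrence of this panic in a saved log, a line like the one
above can be matched with a regular expression. The following is a rough
sketch: the pattern and the function name are made up for illustration, and
joining the two printed 32-bit words into one 64-bit value is an assumption
about how the registers are printed. It does not try to interpret the AFSR
bits.

```python
import re

# Pattern written for the panic message format quoted in this thread only.
PANIC_RE = re.compile(
    r"panic\[cpu(?P<cpu>\d+)\].*?Ecache SRAM\s+Data Parity Error:\s*"
    r"AFSR 0x(?P<afsr_hi>[0-9a-fA-F]{8}) (?P<afsr_lo>[0-9a-fA-F]{8})\s+"
    r"AFAR 0x(?P<afar_hi>[0-9a-fA-F]{8}) (?P<afar_lo>[0-9a-fA-F]{8})"
)


def parse_ecache_panic(line):
    """Return cpu, AFSR and AFAR from an Ecache parity panic line, or None."""
    match = PANIC_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        # Assumption: the two hex words are the high and low halves.
        "afsr": int(match.group("afsr_hi") + match.group("afsr_lo"), 16),
        "afar": int(match.group("afar_hi") + match.group("afar_lo"), 16),
    }


if __name__ == "__main__":
    import sys

    # Usage: python parse_panic.py < /var/adm/messages
    for log_line in sys.stdin:
        info = parse_ecache_panic(log_line)
        if info is not None:
            print(f"cpu{info['cpu']} AFSR={info['afsr']:#x} AFAR={info['afar']:#x}")
```

If the AFSR value stays the same while the AFAR changes from panic to panic,
as it does in the two panics from the original posting, that is worth
mentioning when the log is handed to the vendor.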

-------------------------------------------------------------------------------

From: David Schiffrin <daves@adnc.com>

Hi-

It looks to me like you've got more than one problem, and they aggravate
each other.

It appears that target 0 is not syncing to the SCSI bus as it should.

Is target 0 the internal drive from Sun? If so, have it replaced. It shouldn't
need the scsi-options=0x78. This turns off fast/wide and tagged command
queuing, both of which are supported on the Sun drive.

If target 0 is NOT internal or a Sun drive, the above may not apply, and you
may need to check cables, termination, adapters...

The other error (Ecache SRAM) indicates a problem with some early UltraSPARC
processor modules. Contact Sun and ask for warranty replacement service.

Good luck, and feel free to reply if this leaves you with more questions.

-dave

-------------------------------------------------------------------------------

From: Angel Lopez Luengo <alopez@mayor.dia.fi.upm.es>

Hi Stefan,

We had a similar problem with a SCSI Quantum Atlas XP 34300W in a SPARCserver
1000 running Solaris 2.5. Our solution was to switch SCSI to the slower
asynchronous data rate by adding:

   set scsi_options = 0x58

to the /etc/system file, and then reboot the system.

If the above doesn't work, you can try other combinations:

    0x8 --> global disconnect/reconnect
    0x10 --> global linked commands
    0x20 --> global synchronous xfer capability
    0x40 --> global parity support
    0x80 --> global tagged command support
    0x100 --> global FAST scsi support
    0x200 --> global WIDE scsi support

You have to "add" the ones you've selected and put the result in /etc/system.
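
The "adding" of option bits described above can be sketched in a few lines of
Python. The bit values and descriptions are copied from the list in this
message; the function names are made up for illustration and are not part of
any Solaris tool.

```python
# Bit values exactly as listed in the message above.
SCSI_OPTION_BITS = {
    0x8: "disconnect/reconnect",
    0x10: "linked commands",
    0x20: "synchronous xfer capability",
    0x40: "parity support",
    0x80: "tagged command support",
    0x100: "FAST scsi support",
    0x200: "WIDE scsi support",
}


def decode_scsi_options(value):
    """Return the names of the option bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(SCSI_OPTION_BITS.items()) if value & bit]


def encode_scsi_options(names):
    """Combine the named options into one scsi_options value."""
    lookup = {name: bit for bit, name in SCSI_OPTION_BITS.items()}
    value = 0
    for name in names:
        value |= lookup[name]
    return value


if __name__ == "__main__":
    for value in (0x58, 0x78):
        print(f"set scsi_options = {value:#x}: {', '.join(decode_scsi_options(value))}")
```

Decoding 0x58 gives disconnect/reconnect, linked commands and parity support.
The 0x20 synchronous bit is absent, which is why that setting drops the bus to
asynchronous transfers.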

If this doesn't solve your problem, you could consider replacing the
problematic SCSI device, if you've got more than one, with one that has
technical characteristics similar to the others. It's drastic, but that was
one of my first possible solutions.

Hope this can help you in some manner.

*******************************************************************************

EXCERPT FROM ORIGINAL POSTING:

>Apr 23 11:49:11 kybele unix: cpu0: SUNW,UltraSPARC (upaid 0 impl 0x10 ver 0x40 clock 200 MHz)
>Apr 23 11:49:11 kybele unix: SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
>.
>.
>.
>Apr 23 11:49:12 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000 (esp0):
>Apr 23 11:49:12 kybele unix: Connected command timeout for Target 0.0
>Apr 23 11:49:13 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000 (esp0):
>Apr 23 11:49:13 kybele unix: Target 0.0 reducing sync. transfer rate
>Apr 23 11:49:13 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0 (sd0):
>Apr 23 11:49:13 kybele unix: SCSI transport failed: reason 'timeout': retrying command
>Apr 23 11:49:13 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0 (sd0):
>Apr 23 11:49:13 kybele unix: SCSI transport failed: reason 'reset': retrying command
>Apr 23 11:49:13 kybele unix: panic[cpu0]/thread=0x50ef6120: CPU0 Ecache SRAM Data Parity Error: AFSR 0x00000000 00408000 AFAR 0x00000000 37effff0
>Apr 23 11:49:13 kybele unix: syncing file systems... 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 done
>Apr 23 11:49:13 kybele unix: 5271 static and sysmap kernel pages
>Apr 23 11:49:13 kybele unix: 47 dynamic kernel data pages
>Apr 23 11:49:13 kybele unix: 65 kernel-pageable pages
>Apr 23 11:49:13 kybele unix: 0 segkmap kernel pages
>Apr 23 11:49:13 kybele unix: 0 segvn kernel pages
>Apr 23 11:49:13 kybele unix: 667 current user process pages
>Apr 23 11:49:13 kybele unix: 6050 total pages (6050 chunks)
>Apr 23 11:49:13 kybele unix: dumping to vp 5022eee4, offset 2002116
>
>
>Apr 23 11:49:13 kybele unix: cpu0: SUNW,UltraSPARC (upaid 0 impl 0x10 ver 0x40 clock 200 MHz)
>.
>.
>.
>Apr 23 11:49:15 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000 (esp0):
>Apr 23 11:49:15 kybele unix: Connected command timeout for Target 0.0
>Apr 23 11:49:15 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000 (esp0):
>Apr 23 11:49:15 kybele unix: Target 0.0 reducing sync. transfer rate
>Apr 23 11:49:15 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0 (sd0):
>Apr 23 11:49:15 kybele unix: SCSI transport failed: reason 'timeout': retrying command
>.
>.
>.
>Apr 23 11:49:15 kybele unix: panic[cpu0]/thread=0x511507e0: CPU0 Ecache SRAM Data Parity Error: AFSR 0x00000000 00408000 AFAR 0x000001c2 00000000
>Apr 23 11:49:15 kybele unix: syncing file systems... 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 done
>Apr 23 11:49:15 kybele unix: 5046 static and sysmap kernel pages
>Apr 23 11:49:15 kybele unix: 40 dynamic kernel data pages
>Apr 23 11:49:15 kybele unix: 81 kernel-pageable pages
>Apr 23 11:49:15 kybele unix: 0 segkmap kernel pages
>Apr 23 11:49:15 kybele unix: 0 segvn kernel pages
>Apr 23 11:49:15 kybele unix: 4753 current user process pages
>Apr 23 11:49:15 kybele unix: 9920 total pages (9920 chunks)
>Apr 23 11:49:15 kybele unix: dumping to vp 5022eee4, offset 1940196
>
>Apr 23 11:49:18 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000 (esp0):
>Apr 23 11:49:18 kybele unix: Connected command timeout for Target 0.0
>Apr 23 11:49:18 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000 (esp0):
>Apr 23 11:49:18 kybele unix: Target 0.0 reducing sync. transfer rate
>Apr 23 11:49:18 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0 (sd0):
>Apr 23 11:49:18 kybele unix: SCSI transport failed: reason 'timeout': retrying command
>Apr 23 11:49:18 kybele unix: WARNING: /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0 (sd0):
>Apr 23 11:49:18 kybele unix: SCSI transport failed: reason 'reset': retrying command
>
>


