Summary: nfs server not responding - SCSI transport failed

From: Dan Penrod (penrod@kazoo.er.usgs.gov)
Date: Mon Feb 12 1996 - 14:51:16 CST


Here is my original query, the solution is below...

>Sun Managers:
>
>I'm having a problem with our nfs server named 'elroy' and Sun Support is,
>as usual, totally worthless. Machines remotely using elroy's disk get the
>following message.
>
> NFS server elroy not responding still trying
> NFS server elroy ok
> NFS server elroy not responding still trying
> NFS server elroy ok
>
>...resulting in abysmally slowwww performance. I've tried rebooting both
>elroy and its client machines with no improvement. The SunSolve database
>shows no such known problem under Solaris 2.4, which elroy currently runs.
>
>Looking at elroy:/var/adm/messages I notice the following error messages...
>
> Feb 5 14:01:03 elroy unix: eout for Target 0.0
> Feb 5 14:01:03 elroy unix: WARNING: /iommu@f,e0000000/sbus@f,e0001000/
> espdma@f,400000/esp@f,800000/sd@0,0 (sd0):
> Feb 5 14:01:03 elroy unix: SCSI transport failed: reason 'timeout':
> retrying command
> Feb 5 14:01:03 elroy unix: WARNING: /iommu@f,e0000000/sbus@f,e0001000/
> espdma@f,400000/esp@f,800000 (esp0):
> Feb 5 14:01:03 elroy unix: Disconnected tagged cmds (3) timeout for Target
> 0.0
> Feb 5 14:01:03 elroy unix: WARNING: /iommu@f,e0000000/sbus@f,e0001000/
> espdma@f,400000/esp@f,800000/sd@0,0 (sd0):
> Feb 5 14:01:03 elroy unix: SCSI transport failed: reason 'timeout':
> retrying command
>
>This might explain the inability to nfs serve the disk at target 0.
>The SunSolve database shows a reported bug #1194263 which appears to be
>identical. I've attached that html document. It offers no fix but does
>suggest one possible workaround...
>
> "set sd:sd_max_throttle=10"
>
>Any idea where I can make this configuration change?
>Anyone know if this problem is hardware or software?
>
>Thanks,
>-dan
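
For anyone facing a similar log, a quick tally of which device the transport
failures are reported against can help narrow things down before touching
cables. The sketch below is Python and is not part of the original thread;
the function name and regular expressions are hypothetical, and it assumes
each syslog entry sits on one line (the excerpt above is wrapped by the
mailer).

```python
import re
from collections import Counter

# Hypothetical patterns, modelled on the messages quoted in the query.
WARNING_RE = re.compile(r"WARNING: \S+ \((\w+)\):")
FAILURE_RE = re.compile(r"SCSI transport failed: reason '([^']+)'")


def count_transport_failures(lines):
    """Count (device, reason) pairs for 'SCSI transport failed' entries.

    A failure is attributed to the device named in the most recent
    WARNING line, since the driver logs the device path first.
    """
    counts = Counter()
    last_device = "unknown"
    for line in lines:
        warning = WARNING_RE.search(line)
        if warning:
            last_device = warning.group(1)
            continue
        failure = FAILURE_RE.search(line)
        if failure:
            counts[(last_device, failure.group(1))] += 1
    return counts


# The entries from the query, joined back onto single lines.
SAMPLE = [
    "Feb 5 14:01:03 elroy unix: WARNING: /iommu@f,e0000000/sbus@f,e0001000/"
    "espdma@f,400000/esp@f,800000/sd@0,0 (sd0):",
    "Feb 5 14:01:03 elroy unix: SCSI transport failed: reason 'timeout': "
    "retrying command",
    "Feb 5 14:01:03 elroy unix: WARNING: /iommu@f,e0000000/sbus@f,e0001000/"
    "espdma@f,400000/esp@f,800000 (esp0):",
    "Feb 5 14:01:03 elroy unix: Disconnected tagged cmds (3) timeout for "
    "Target 0.0",
    "Feb 5 14:01:03 elroy unix: WARNING: /iommu@f,e0000000/sbus@f,e0001000/"
    "espdma@f,400000/esp@f,800000/sd@0,0 (sd0):",
    "Feb 5 14:01:03 elroy unix: SCSI transport failed: reason 'timeout': "
    "retrying command",
]

if __name__ == "__main__":
    for (device, reason), n in count_transport_failures(SAMPLE).items():
        print(device, reason, n)
```

On a live system the same function could be fed the lines of
/var/adm/messages instead of SAMPLE.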

The answer is that the changes can be made to the /etc/system file. After
you edit the file you must reboot. I received a lot of different suggestions
as to what to put in that configuration file which I will describe below.

The other answer is that it's hardware and software. Configurations to
software can change the way hardware is accessed. No, there doesn't seem
to be a patch... there is a hardware solution.

dotty@tgivan.wimsey.bc.ca (Dotty Pon) writes... "shorten your scsi cables."
I tried that... no go.

bismark@alta.jpl.nasa.gov (Bismark Espinoza) writes... "check network load
and NFS parameters, also tagged command queueing." Well, it's not the
network, it's definitely a scsi problem. Later I discuss how to handle
command queueing.

mrs@cadem.mc.xerox.com (Mike Salehi) writes... "You can put those changes
in /etc/system and reboot." Right. Thanks.

James.E.Coby.Jr@cdc.com (James Coby) writes... "take a look at /etc/system
file." Right again. Thanks.

Casper Dik <casper@holland.Sun.COM> writes...
>The solaris FAQ says:
>
>3.29) I have all kinds of problems with SCSI disks under Solaris 2.x
> They worked fine under SunOS 4.x.
>
> Append this line to /etc/system and reboot:
>
> set scsi_options & ~0x80
>
> This turns off Command Queuing, which upsets rather a lot
> of SCSI drives.
>
> In Solaris 2.4 and later you can set those options per SCSI
> bus. See isp(7) and esp(7).
>
> For some disks, all you need to do is decrease the maximum number of
> queued commands:
>
> forceload: drv/esp
> set sd:sd_max_throttle=10
He also says to check the scsi-cables and terminators. This was a really
good answer so I copied the whole thing here.

baldwinj@mailbox.ne.tpd.eds.com (John Baldwin) writes...
>This message indicates that the system sent data over the SCSI bus, but the
>data never reached its destination because of a bus reset.
>I have seen this problem before when you mix different devices on the same
>scsi chain ex. fast 10 MB/s and 5 MB/s scsi disks or devices...check and make
>sure the devices on the chain are consistent, with controller...check length
>of cable, check termination...make sure your power supply is consistent,
>check target addresses for conflicts...I am also sending you additional info
>in a separate e-mail...The configuration that you mentioned should be added
>to the /etc/system file and then reboot....
Yea, good point. I'm sure this is common these days.

Kent R Arnott <karnott@falcon.tamucc.edu> writes... "I'm getting the same
problems. If you find a solution let me know. I have tried several things
and can not get them to work..."
Here you go, Kent. Hope one of these things helps you.

Glenn.Satchell@Uniq.com.au (Glenn Satchell) writes...
>If possible the fast scsi-2 devices (ie most disks less than two years
>old) should be the first things on the bus, and slower scsi-1 devices
>at the far end (relative to the cpu).
>
>The possible workaround goes in the file /etc/system, and the system
>then needs to be rebooted for it to take effect. But I'd investigate
>the cables and hardware first.
I tried changing the order of devices. No good. I've also swapped out all
the cables and terminators. Nope.

kwthomas@wizard.nssl.uoknor.edu (Kevin W. Thomas) also writes... "check
cables and termination." Nope, that's not it.

"Daniel M. Quinlan" <danq@jspc.colorado.edu> writes...
>Well, I'd take a look at the length of the scsi chain that disk is
>on and also what other scsi devices are on that chain. I believe
>you're not supposed to mix certain kinds of scsi devices, and that might
>be some of the problem. There's a very interesting program "scsiinfo"
>which you can get from ftp.cdf.toronto.edu which might tell you something
>useful about the other things on the chain. Another possibility of
>course is that you're just looking at a hardware failure.
I did download the scsiinfo utility. It's a small text based unix command
which I found to be very interesting. I recommend everyone keep a copy
of scsiinfo on every machine. Thanks Daniel.

Henry Katz <hkatz@panix.com> writes...
>the set command goes in /etc/system and configures a kernel parameter
>in the sd driver, you may also want to turn off SCSI tagged command queueing:
>set scsi_options = 0x378
This is interesting. I wonder how this is different from...
 set scsi_options & ~0x80 ??? They both claim to do the same thing. Hmmm.

Jens Fischer <jefi@kat.ina.de> writes... "turn off tagged command queuing in
/etc/system. Can't remember the entry..." Thanks Jens... Apparently it's
either...
set scsi_options = 0x378 or
set scsi_options & ~0x80

vahsenr@ce.philips.nl (Vahsen Rob) writes... "we installed patches but they
didn't help..." Well Vahsen, maybe something here will help. Good luck.

Anderson McCammont <and@morgan.com> writes... "configure in /etc/system...
check cables and terminators..." Thanks.

Roger Salisbury <rogers@ttmc.com> writes...
>It may be your scsi_options have the drive set to tagged fast scsi-2
>which is the default in 2.4, does the drive support tagged queuing??
>Does it tell the O.S. it does when in fact it doesn't do the sun tagged
>queuing?? You can disable tagQ with set scsi_options = 0X178 in /etc/system
>The FAQ has a more in depth explanation.
Huh? What? ...set scsi_options = 0X178. I thought it was 0x378 or ~0x80...
Ok. I'd still like to know the difference.
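
One way to see the difference among the three suggestions is to decode the
values bit by bit. The sketch below is Python and rests on two assumptions
that are not stated anywhere in this thread: the SCSI_OPTIONS_* bit layout
described in esp(7), and a default scsi_options value of 0x3f8. Check both
against the man page for your release before relying on it.

```python
# Assumed bit layout, following the SCSI_OPTIONS_* flags in esp(7).
SCSI_OPTION_BITS = {
    0x008: "DR (disconnect/reconnect)",
    0x010: "LINK (linked commands)",
    0x020: "SYNC (synchronous transfer)",
    0x040: "PARITY",
    0x080: "TAG (tagged command queuing)",
    0x100: "FAST (fast SCSI)",
    0x200: "WIDE (wide SCSI)",
}

# Assumption: the default has all seven bits above switched on.
ASSUMED_DEFAULT = 0x3F8


def decode(value):
    """Return the names of the option bits that are set in value."""
    return [name for bit, name in sorted(SCSI_OPTION_BITS.items())
            if value & bit]


def disabled(value, baseline=ASSUMED_DEFAULT):
    """Return the names of the bits set in baseline but cleared in value."""
    return decode(baseline & ~value)


if __name__ == "__main__":
    suggestions = [
        ("set scsi_options & ~0x80", ASSUMED_DEFAULT & ~0x80),
        ("set scsi_options = 0x378", 0x378),
        ("set scsi_options = 0X178", 0x178),
    ]
    for label, value in suggestions:
        print(label, "->", hex(value), "turns off:", disabled(value))
```

Under those assumptions, "& ~0x80" and "= 0x378" end at the same value, with
only the tagged queuing bit cleared, while 0x178 also clears the wide bit.
The "&" form leaves every other bit as it was, whereas the "=" form
overwrites the whole word.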

seanm@sybase.com (Sean McInerney) writes..."It's a cable or terminator"
Good guess but wrong. Thanks anyway.

Ashish Parikh <ashish@Savantage.Com> writes... "I ran into a similar problem
last Friday. The problem was that my SCSI cables were not properly seated."
I guess this is very common. It wasn't my problem though.

vitec!jsutton@uu.psi.com (John R. Sutton (214-997-4123)) writes...
"I have seen the same problem corrected by adding the following line to
/etc/system: set scsi_options & ~0x80 which turns off command queuing"
Another vote for ~0x80. Okay.

vitec!jsutton@uu.psi.com (John R. Sutton (214-997-4123)) writes...
"The 'set sd.......' line needs to be added to the /etc/system file"
Apparently I was the only one who didn't know about /etc/system. Not anymore!

Akile Sahin <akile@bornova.ege.edu.tr> writes...
>You asked a question about the above subject on 7th February. You haven't
>sent a SUMMARY yet. I am looking forward to the solution of this problem,
>because I have come across the same problem on our system.
Sorry for the delay. Hope something here helps you out.

nobroin@esoc.esa.de (Niall O Broin - Gray Wizard) writes...
>You set the variable in /etc/system, and to answer your question, the
>problem can be both hard and software i.e. your OS (soft) and the disk(hard)
>are not happy together. The variable change suggested may help to bring them
>to harmony - I've seen similar suggestions for various SCSI problems with SCSI
>under Solaris before now.
Yea, since Solaris has come out the old problems are gone and new problems
have replaced them! Figures.

I found that I had a Sony SCSI CD-ROM burner on the scsi chain and that when
I removed it from the chain all my problems went away. I haven't bothered
to try putting it back yet. I'm happy to leave it off. I may try turning
off the command queuing later if/when I need to put it back on. On another
machine with similar problems my solution was to add a second SBus SCSI
card to distribute the load, shorten the scsi cable, and isolate problems.
This has helped. Both nfs servers that have given me similar trouble have
two 9GB single-partition disk drives on them. This must be a contributor,
which is sort of a side-effect of Solaris since SunOS doesn't support the
beasts.

Thanks Sun Managers!

-dan

 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| _/ _/ _/_/_/_/ _/_/_/ _/_/_/_/ | Dan Penrod - Unix Administrator |
| _/ _/ _/ _/ _/ | USGS Center for Coastal Geology |
| _/ _/ _/_/_/_/ _/ _/_/ _/_/_/_/ | St. Petersburg, FL 33701 |
| _/ _/ _/ _/ _/ _/ | (813)893-3100 ext.3043 |
|_/_/_/_/ _/_/_/_/ _/_/_/_/ _/_/_/_/ | penrod@whiplash.er.usgs.gov |
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


