SUMMARY: Help with NFS problem

From: Greg Roberts (gregr@ee.uwa.edu.au)
Date: Thu Apr 30 1998 - 09:01:59 CDT


Hi. I've managed to solve my problem, so here's the summary for those who
are interested. I've included the original question to help the summary
make more sense.

> Hi all. I'm currently experiencing a problem that I have no idea how to
> remedy. The situation is this. I have a lab with a main server machine
> (Sun UltraSPARC 1), a SPARCStation-20, four X-Terminals and two DEC
> AlphaStations. The Suns are running Solaris 2.5.1 and the DECs are
> running DU4.0 flat. The problem started when I installed three new Quantum
> Fireball ST 6.4GB drives into the server machine. Every now and then, you
> will be working on something on any of these Unix machines, and the system
> will hang for around a minute, and then return with the following error
> message:
>
> NFS3 RFS3_GETATTR failed for server <servername> : RPC : timeout
>
> This can appear at any time without warning, and it can't be generated
> deliberately. I have been working with one of the new disks for a while
> now, and I can do whatever I want with the disk with no problems. The main
> reason why this problem needs to be resolved is that the lab is used to run
> simulations that can take days to complete, and the above error comes along
> and kills the process, rendering the system quite useless for the users'
> needs. The funny thing is that as soon as the error appears, the disk you
> were trying to access does become available straight away. So if it didn't
> kill the processes running, I guess this could be lived with. This isn't a
> Sun-DEC incompatibility because it can happen on the server machine as
> well. I've just moved the disks into a new 4-bay SCSI disk box, which is
> connected to the UltraSPARC by a DB50-Honda68 cable. The UltraSPARC has
> two internal disks, the three new disks plus a CDROM. I have checked the
> messages log and I'm getting SCSI timeout and retry errors. Here's an
> output snippet of what's going on.
>
> Apr 9 10:52:46 <server> unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x4d 0x88 0xf0 0x0 0x0 0x8 0x0 ]
> Apr 9 10:52:46 <server> unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x2b 0x21 0x50 0x0 0x0 0x8 0x0 ]
> Apr 9 10:52:46 <server> unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x23 0x48 0x90 0x0 0x0 0x8 0x0 ]
> Apr 9 10:52:46 <server> unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x38 0x30 0x90 0x0 0x0 0x8 0x0 ]
> Apr 9 10:52:46 <server> unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x35 0x95 0x50 0x0 0x0 0x8 0x0 ]
> Apr 9 10:52:46 <server> unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x2d 0xbc 0x90 0x0 0x0 0x8 0x0 ]
> Apr 9 10:52:46 <server> unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x2a 0xb0 0xf0 0x0 0x0 0x8 0x0 ]
> Apr 9 10:52:46 <server> unix: WARNING: /sbus@1f,0/SUNW,fas@e,8800000 (fas0):
> Apr 9 10:52:46 <server> unix: Disconnected tagged cmd(s) (8) timeout for Target 3.0
> Apr 9 10:52:46 <server> unix: WARNING: /sbus@1f,0/SUNW,fas@e,8800000/sd@3,0 (sd3):
> Apr 9 10:52:46 <server> unix: SCSI transport failed: reason 'timeout': retrying command
> Apr 9 10:52:46 <server> unix: WARNING: /sbus@1f,0/SUNW,fas@e,8800000/sd@3,0 (sd3):
> Apr 9 10:52:46 <server> unix: SCSI transport failed: reason 'reset': retrying command
>
> So what I'd like to know is, is it the disks that are faulty or is it
> something to do with the system configuration or hardware setup that is
> causing these problems? If it's the disks, I need to get them back to the
> supplier ASAP for a trade in. If it's the system, what changes do I need to
> make to fix this problem? Any help with this problem will be greatly
> appreciated.

Believe it or not, the solution to all of the above was a single-line entry
in the /etc/system file on the server machine. The entry is:

                        set sd:sd_max_throttle=8

Since this was entered and the system rebooted, I haven't seen a single
SCSI error entry like the above, or had any complaints from users about NFS
timeouts. The disks I bought apparently didn't cope well with tagged
queueing, so limiting the number of commands queued to each disk has solved
my problem.
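
For anyone curious what the cdb=[ ... ] lines in the log above are saying,
here is a minimal Python sketch that decodes one of them. It is not part of
the fix and was not in any of the replies. It assumes the standard 10-byte
SCSI CDB layout (opcode in byte 0, big-endian block address in bytes 2-5,
big-endian transfer length in bytes 7-8). The function name decode_cdb10 and
the small opcode table are made up for illustration.

```python
import re

# Opcode names for the two most common 10-byte block commands.
# Anything else is reported as a raw hex value.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(line):
    """Decode the 10-byte CDB found in a log line containing 'cdb=[ ... ]'."""
    match = re.search(r"cdb=\[([^\]]*)\]", line)
    if match is None:
        raise ValueError("no cdb=[...] field found")
    # int(tok, 16) accepts tokens with a leading "0x".
    cdb = [int(tok, 16) for tok in match.group(1).split()]
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": OPCODES.get(cdb[0], hex(cdb[0])),
        # Bytes 2-5: logical block address, most significant byte first.
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        # Bytes 7-8: number of blocks to transfer.
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
    }


# First cdb line from the log above, with the wrapped pieces rejoined.
sample = "unix: fas: 3.0: cdb=[ 0x2a 0x0 0x0 0x4d 0x88 0xf0 0x0 0x0 0x8 0x0 ]"
print(decode_cdb10(sample))
```

Run against the log above, every cdb line decodes as a WRITE(10) of 8 blocks
to target 3, so the commands that timed out were all small queued writes to
one of the new disks.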

I'd like to thank the following for their help/input:

Kevin Sheehan <Kevin.Sheehan@uniq.com.au>
Stacy Lindberg <grebdnil@cheetah.spots.ab.ca>
Jim Robertori <jimr@lucent.com>
Seela Balkissoon <seela@cs.yorku.ca>
Eddy Fafard <eddy@slimepuppy.apple.com>
Bismark Espinoza <bismark@alta.Jpl.Nasa.Gov>
------------
Greg Roberts
Computer Systems Officer
Dept. of Electrical & Electronic Engineering
The University of Western Australia
NEDLANDS WA 6907 Australia

Ph : +61-08-9380-7366
Fax : +61-08-9380-1065
Email : gregr@ee.uwa.edu.au


