SUMMARY: SCSI problems

From: Susan Thielen (thielen@irus.rri.uwo.ca)
Date: Tue Mar 31 1992 - 18:30:13 CST


Well, my SCSI problems still aren't over, although they
have lessened a great deal. I took the /tmp area off the
last disk in the chain, put it back on the internal
disk of the machine, and reseated all the cables. I
am still getting the following errors, albeit much less
frequently:

sd2: SCSI transport failed: reason 'reset': retrying command
esp0: No command for reconnect of Target 2 Lun 0
esp0: Proxy abort succeeded for Target 2 Lun 0
esp0: Disconnected command timeout for Target 2 Lun 0
sd2: SCSI transport failed: reason 'timeout': retrying command

Thanks to all who responded:

prime_s!blaes@ernohb.uucp Rainer Blaes
kalli!kevin@fourx.Aus.Sun.COM Kevin Sheehan
jumper@spf.trw.com Greg Jumper
hargen@sybus.com Bill Hargen
ames!amdcad!sjsca4!poffen@gatech.uucp Russ Poffenberger
Perry_Hutchison.Portland@xerox.com Perry Hutchison
fetrow@biostat.washington.edu David Fetrow
cdr@kpc.com Carl Rigney
Zbyslaw@europarc.xerox.com Alex Zbyslaw
phil@cns-howden.co.uk Phil Male

What follows are the main points of the responses, preceded by
my original posting.

------------------------------------------------------------------------------
I am having some difficulties with a couple of SCSI disks that are attached
to a Sparc II running 4.1.1. The disk crashed inexplicably a few weeks
ago, but after booting it up, it has been sort of fine... except that
it has been spouting out error messages like

Mar 14 08:08:45 antares vmunix: esp0: unexpected SCSI bus reset
Mar 16 07:46:00 antares vmunix: smt0: Media change
Mar 16 08:05:30 antares vmunix: sd1: SCSI transport failed: reason 'incomplete': retrying command
Mar 16 08:05:38 antares last message repeated 29 times
Mar 16 08:05:38 antares vmunix: sd1: disk not responding to selection
Mar 16 09:18:03 antares vmunix: sd2: SCSI transport failed: reason 'timeout': retrying command
Mar 16 09:21:50 antares vmunix: smt0: SCSI transport failed: reason 'unexpected_bus_free'

Now I've read some of TFM but I can't seem to find anything that
gives me a clue as to what this all means!! I've done the format
analyze on the disks, but I don't find any extra bad blocks... What
should I do next?? The disks are both Delta Microsystems SS-1002D's.
------------------------------------------------------------------------------
From: prime_s!blaes@ernohb.uucp (Rainer Blaes)
did you install the following patch:

Patch-ID# 100343-04
Keywords: 1GB greater, disk, gigabyte
Synopsis: SunOS 4.1.1: sd.o patch to access scsi drive capacity beyond 1GB.
Date: 11/05/91
 
SunOS release: 4.1.1
 
Topic: 1.3GB Disk Drive Enhancement
 
BugId's fixed with this patch: 1058682,1045586,1045071
                               1049417,1046580,1048141,1046305(see Addendum)

Architectures for which this patch is available: sun4c

Patches which may conflict with this patch: 100243

Note: This patch conflicts with Online Disk Suite and Backup: Copilot.

Obsoleted by: 4.1.2

Problem Description:

format functions are limited to 1GB (2^21 blocks, i.e. a 21-bit block
address). Reassigning a block beyond 1GB wraps around to the lower 21 bits
of the address. As a result, on a 1.3GB drive, for example, the top .3GB
will be inaccessible without this patch.

Using the format "repair" function I reassigned block # 2676800 (28D840 hex).
The Flexstar tester reports block # 579648 (08D840 hex) to be reassigned.
Using the Ancot bus analyzer I verified that the driver is issuing a
reassign block command for block 08D840 hex, indicating that it is
truncating the most significant bits of the address and supports only a 21
bit address in this section of the code.
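
The wraparound described above can be reproduced with a little arithmetic.
This is only an illustrative sketch, not the sd driver's code; the 512-byte
block size is an assumption, and the names are made up for the example.

```python
# Sketch of a 21-bit address field truncating a larger block number.
# Illustrative only: names are hypothetical, block size is assumed.
ADDR_BITS = 21
MASK = (1 << ADDR_BITS) - 1            # 0x1FFFFF
ASSUMED_BLOCK_SIZE = 512               # bytes per block (assumption)


def truncated_block(block):
    """Return the block number that survives a 21-bit address field."""
    return block & MASK


requested = 0x28D840                   # block passed to format "repair"
issued = truncated_block(requested)    # block seen on the bus analyzer

print(hex(requested), requested)
print(hex(issued), issued)
print((1 << ADDR_BITS) * ASSUMED_BLOCK_SIZE, "bytes addressable")
```

With 512-byte blocks, 2^21 blocks is exactly 1GB, which matches the limit
the patch description gives.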

User data will remain intact, but since the block in error is not being
reassigned it will continue to fail, which could eventually lead to loss
of data. Good blocks will be reassigned unnecessarily, which can adversely
affect performance.

INSTALL:

A) FOR SUN4C :

1) As root:

mv /sys/sun4c/OBJ/sd.o /sys/sun4c/OBJ/sd.o.fcs
cp sun4c/sd.o /sys/sun4c/OBJ/sd.o
mv /sys/sun4c/OBJ/esp.o /sys/sun4c/OBJ/esp.o.fcs
cp sun4c/esp.o /sys/sun4c/OBJ/esp.o

2) If the customer's system has 4.1.1 installed and NOT 4.1.1RevB, copy
the format.dat file.

   cp sun4c/format.dat /etc/format.dat

You will then have to re-run config and make on your kernel.
Please refer to the System and Network administration manual
for information on building and installing a custom kernel.

B) FOR SUN4 :

 If the customer's system has 4.1.1 installed and NOT 4.1.1RevB, copy
the format.dat file. As root:

cp sun4/format.dat /etc/format.dat

There is no need to remake the kernel in a sun4 architecture.

ADDENDUM :

Other Bugs fixed are

Bugid 1058682: Reassign block (format "repair") malfunctions
beyond 1GB (6-byte address)

Bugs 1045586 and 1045071 : format parameters problem
modify sd_maptouscsi to handle i/o past 1 GB, bugid 1045071

Bugid 1049417: Hang in selection phase

Bug id 1046580: rework several issues dealing with the running of
proxy commands;

Bugid 1048141: if a data overrun is detected, set e_weak for that target.

Bugid 1046305: in esp_commoncap the sync and disconnect stuff was reversed
for the target-only case.

This patch is being resubmitted to correct the source files and the README.
This patch is being resubmitted to include the format.dat file to make
the whole patch self-contained.

I think using it also means rerunning 'format' (partition set-up).
GOOD LUCK

Rainer Blaes, MBB/ERNO 2800 Bremen 1 GERMANY
-------------------------------------------------------------------------------
From: kalli!kevin@fourx.Aus.Sun.COM (Kevin Sheehan {Consulting Poster Child})

A quick intro - in order to do a SCSI xfer, you have to go thru
various bus phases:

bus free - nobody doing anything
arbitration - getting the bus
selection - selecting another device (and reselection, coming back after
                a disconnection)
command - sending out the actual command bytes
message - sending control information for the most part
data - actual useful bytes winging their way down the wire.

>
> Mar 14 08:08:45 antares vmunix: esp0: unexpected SCSI bus reset

Somebody felt it necessary to reset the bus, or there was noise on the
reset line.

> Mar 16 07:46:00 antares vmunix: smt0: Media change

That one is from a driver with which I am not familiar - it indicates
that the media changed in some device (like changing floppies or tapes)
and the driver is warning you via the sense information it receives.

> Mar 16 08:05:30 antares vmunix: sd1: SCSI transport failed: reason 'incomplete': retrying command

This means that the host adaptor was thwarted somewhere in the process
of issuing a command. It had the bus, but couldn't get the command
or messages out.

> Mar 16 08:05:38 antares last message repeated 29 times
> Mar 16 08:05:38 antares vmunix: sd1: disk not responding to selection

Really bad - means the disk didn't see a wire wiggle.

> Mar 16 09:18:03 antares vmunix: sd2: SCSI transport failed: reason 'timeout': retrying command

A command was issued, the device disconnected, but it never came back. Most
likely a missing reselection. Again, a wire wiggling and someone missed it.

> Mar 16 09:21:50 antares vmunix: smt0: SCSI transport failed: reason 'unexpected_bus_free'

The driver saw a weird phase where it didn't expect it. Most likely another
missed wire wiggle.

>
>
> Now I've read some of TFM but I can't seem to find anything that
> gives me a clue as to what this all means!! I've done the format
> analyze on the disks, but I don't find any extra bad blocks... What
> should I do next?? The disks are both Delta Microsystems SS-1002D's.

You should probably get the shortest cables you can find and restring
the bus. Remember to keep the last (and only the last) device on the
bus terminated. All of the problems above can be attributed to long
cables, and the resulting signal degradation.
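
To see which device and failure reason dominate before and after
restringing the bus, the log can be tallied with a short script. This is a
rough sketch: the regular expressions are guesses based only on the sample
lines in the original posting, and the hint text paraphrases the
explanations above.

```python
import re
from collections import Counter

# Paraphrased from the explanations above; not official driver documentation.
HINTS = {
    "incomplete": "host adaptor had the bus but could not get the command out",
    "timeout": "device disconnected and never reselected",
    "unexpected_bus_free": "bus phase changed where the driver did not expect it",
}

# Patterns are assumptions derived from the sample log lines only.
FAILURE = re.compile(r"(\w+): SCSI transport failed: reason '(\w+)'")
REPEAT = re.compile(r"last message repeated (\d+) times")


def tally(lines):
    """Count transport failures per (device, reason), honoring syslog repeats."""
    counts = Counter()
    last = None
    for line in lines:
        rep = REPEAT.search(line)
        if rep:
            if last is not None:
                counts[last] += int(rep.group(1))
            continue
        m = FAILURE.search(line)
        last = (m.group(1), m.group(2)) if m else None
        if last is not None:
            counts[last] += 1
    return counts


sample = [
    "Mar 14 08:08:45 antares vmunix: esp0: unexpected SCSI bus reset",
    "Mar 16 08:05:30 antares vmunix: sd1: SCSI transport failed: reason 'incomplete': retrying command",
    "Mar 16 08:05:38 antares last message repeated 29 times",
    "Mar 16 08:05:38 antares vmunix: sd1: disk not responding to selection",
    "Mar 16 09:18:03 antares vmunix: sd2: SCSI transport failed: reason 'timeout': retrying command",
    "Mar 16 09:21:50 antares vmunix: smt0: SCSI transport failed: reason 'unexpected_bus_free'",
]

counts = tally(sample)
for (dev, reason), n in sorted(counts.items()):
    print(dev, reason, n, "-", HINTS.get(reason, "no hint"))
```

Feeding it the contents of the messages file instead of `sample` gives a
quick per-device picture of whether a cabling change actually helped.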
--------------------------------------------------------------------------
From: jumper@spf.trw.com (Greg Jumper)
We have seen very similar error messages from one of our Maxtor 669MB hard
disks attached to a SPARCstation 2. I posted a query to
"comp.sys.sun.hardware" a few weeks ago and only got a few responses. One
person had seen a similar problem on a Sun 3 SCSI shoebox drive; it turned out
one of the wires to the SCSI connector was loose in his case. Another person
said they had had problems with Maxtor disks before; after Maxtor upgraded the
firmware, everything was fine.

I'm somewhat curious: I have seen our problem just with the one disk, but I
have had it attached to two different SPARC 2's, running both 4.1.1 and 4.1.2,
and using different types of internal drives. I also switched SCSI cables and
terminators. Thus, I concluded that the problem was definitely with the disk.
However, since you are seeing the same sort of problem with a different brand
of disk, maybe there is more going on.

If you get any useful responses, please send me a copy, or summarize back to
the list.
--------------------------------------------------------------------------
From: hargen@sybus.com (Bill Hargen)

As I'm sure other people will tell you, this sounds like a problem with the
SCSI bus. Either it is too long, not properly terminated (i.e. only at the
end with power to the terminators...), or there is a connector problem. One
thing you can try if you have synchronous disks is to disable the sync SCSI
support in Sun's driver. Problems in synchronous mode seem to be the first
indication of bus problems. To disable sync SCSI, do the following and
reboot:
> # adb -k /vmunix /dev/mem
> scsi_options/X
> _scsi_options: 78
> $W
> scsi_options?W 58
> scsi_options/W 58
> $q

(A more permanent change can be made in /usr/sys/scsi/scsi_confdata.c.)
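
The arithmetic behind the adb patch is a single cleared bit. In this small
sketch only the values 78 and 58 (hex) come from the session above; the
name given to bit 0x20 is an assumption for illustration.

```python
# Bit arithmetic behind the adb session above.
# ASSUMED_SYNC_BIT is a hypothetical name; only 0x78 and 0x58 are from the text.
ASSUMED_SYNC_BIT = 0x20

before = 0x78                          # value adb reported for scsi_options
after = before & ~ASSUMED_SYNC_BIT     # clear the presumed sync bit
changed = before ^ 0x58                # bits that differ between old and new

print(hex(after))
print(hex(changed))
```

Clearing one bit rather than writing a fixed value leaves the other option
bits as they were, which matters if the running kernel's default is not 78.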
----------------------------------------------------------------------------
From: ames!amdcad!sjsca4!poffen@gatech.uucp (Russ Poffenberger)

What kind of disks are they internally? The Delta is not the disk manuf. They
just package them.

There is a known problem with Fujitsu 2266SA's on SS-2's. The solution is
to get the latest firmware PROM. Contact Delta if you have these Fuji drives.

It could also be SCSI cable problems. Make the cables as short as possible,
and make sure only the last drive is terminated.

---------------------------------------------------------------------------
From: Perry_Hutchison.Portland@xerox.com

I'm guessing that these are external disks -- are there any other
external SCSI devices, and are they working properly? This sounds like
loose connections, or flakey power, or possibly a termination problem.
Maybe some cables got shoved around while dealing with the crash.

----------------------------------------------------------------------------
From: David Fetrow <fetrow@biostat.washington.edu>

 I once had similar problems due to a bad cable. What made it interesting
was it was a bad cable INSIDE the case; it had been lightly rubbing the
fan hub and one-by-one the ribbon lines were being worn off.

 It's unlikely, but it's such a cheap fix I thought I'd mention it.
----------------------------------------------------------------------------
From: cdr@kpc.com (Carl Rigney)

We've seen errors exactly like that on Hitachi DK-516C disks with old
ROMs, especially NO24, on SS2's. After we upgraded to NO28 the problem
went away. We also saw similar problems with Maxtor 8760's. A
probe-scsi from the monitor prompt should tell you what kind of disks &
ROMS you have.

>What should I do next?? The disks are both Delta Microsystems SS-1002D's.

Have you called Delta & asked them about it? What did they say?
---------------------------------------------------------------------------
From: Alex Zbyslaw <Zbyslaw@europarc.xerox.com>

We had similar messages from a Sun3 and its SCSI controller board needed
replacing.

I'm afraid I don't know where the SCSI controller is on a Sparc II, but
suspect it's on the motherboard, so the whole thing *might* need replacing.
---------------------------------------------------------------------------
From: Phil Male <phil@cns-howden.co.uk>

hmmmm - glad someone else is seeing this :-)

We are running 4.1.2 on a few SS2's - disks are Maxtor P1-17S cyl 1766 alt 2
hd 19 sec 85 - and we are getting the same error when the disks are heavily
loaded. Everything pauses for a while, then the SCSI command works ok and
everything carries on as normal. Interested in any comments you receive :-)
------------------------------------------------------------------------------

 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Susan KJ Thielen Application Programmer, System Manager
Advanced Imaging Lab
Robarts Research Institute Phone: (519) 663-3833
PO Box 5015, 100 Perth Drive Fax: (519) 663-3789
London, ON N6A 5K8 E-mail: thielen@irus.rri.uwo.ca
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


