SUMMARY: Conner Archive 4586NP/SCSI reset problems

From: Nancy Voorhis (nancyv@curtech.com)
Date: Tue Oct 13 1998 - 18:38:57 CDT


Hi,

The original question was about a Conner Archive 4586NP tape drive and
the inability to read tapes with SCSI reset errors showing up in the logs.
The full text of the original problem is below.

Thanks!!!! to everyone who responded:
 Eric D. Pancer, Roger Fujii, Rick Kelly, Dwight Peters, Heidi Burgiel,
 Bruce R. Zimmers, Phil Kao, Chris O'Neal, Bismark Espinoza
and anyone else I missed.

The answers ranged from "yes, your tapes are all bad," to checking SCSI
cabling and termination, to isolating the tape drive on a bus without the
bad disks or anything else, to the lack of proper definitions for the tape
drive in st_conf.c and stdef.h. The answers also included moral support for
experiencing the "system administrator's nightmare", which I was glad to
get even if it hasn't helped restore the data.

The summary is kind of long, so read on if you are interested. If not,
my reminder is that if the #1 job of a sys adm is doing backups, then the #2
job is probably checking that you can do a restore. We all say it, but how
many out there have actually done it recently? If not, then here's
my encouragement to do it. Sooner rather than later! My moral is: don't
trust the dump logs.

And now for the summary...I have some answers, and still some
unknowns:

I believe I eliminated any SCSI cabling/termination problems through
numerous changes of enclosures, termination, and cabling, and the
tape drive was isolated from the bad drives from the beginning.

Most people said yes, try moving the drive to a Solaris 2.x box.
Several people also pointed out that the improper identification of the
drive by the 'mt' command (as Exabyte 8mm) was due to the lack of the proper
kernel configuration and modifications to st_conf.c and stdef.h.

These people were right about the kernel not being configured properly.
In addition, without the proper kernel configuration, SunOS treats
the drive as a 1/4" tape drive and writes with 512-byte blocks.
A tape written with this block size cannot be read by a Solaris system
with the ufsrestore command. (The error is:
        I/O error: tape block size 512 is not a multiple of 1024).
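
The arithmetic behind that error message can be sketched as follows. This is
only an illustration of the divisibility check implied by the message text,
not ufsrestore's actual code; the names RESTORE_UNIT and check_block_size are
hypothetical, and the 1024 figure is taken from the error quoted above.

```python
# Unit named in the quoted ufsrestore error message (assumed, not read from source code).
RESTORE_UNIT = 1024


def check_block_size(block_size: int, unit: int = RESTORE_UNIT) -> str:
    """Describe whether block_size is a whole multiple of unit."""
    if block_size <= 0:
        raise ValueError("block size must be positive")
    if block_size % unit == 0:
        return f"ok: {block_size} = {block_size // unit} x {unit}"
    # Same wording as the error reported in the summary above.
    return f"tape block size {block_size} is not a multiple of {unit}"
```

A 512-byte block fails the check because 512 is smaller than, and so cannot be
a whole multiple of, 1024.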

Typically, one *can* read a tape written with /etc/dump on SunOS 4.x with
ufsrestore on Solaris 2.x and there are no special parameters to use
except a tape device (/dev/rmt/0m, etc) that is correct.

I moved the tape drive to a Solaris 2.5 box, as the single device on the
SCSI chain aside from the system disk, and it was recognized "properly" as
  st4: <Archive Python 4mm Helical Scan>
  st4 at esp0: target 4 lun 0
  st4 is /iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000/st@4,0

Under this configuration, the tapes still cannot be read. With the
tape drive under Solaris 2.x and having learned the block size, I used
'dd if=/dev/rst4 ibs=512 of=dumpfile' to try to read the tape back
to disk and then ran the SunOS 4.x /etc/restore command on the file.

In all cases, I can get a partial read of the tape before a "read I/O error"
and a SCSI message on the console with the block number that it failed on.
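
A minimal sketch of the same salvage idea, assuming fixed 512-byte records as
in the dd command above: copy records until the first read error, keep what
was read, and report how many records came through. The function name
salvage and its return shape are hypothetical, and this has not been run
against a real tape driver.

```python
def salvage(src, dst, record_size=512):
    """Copy fixed-size records from src to dst until EOF or the first read error.

    Returns (records_copied, error); error is None if EOF was reached cleanly.
    """
    copied = 0
    while True:
        try:
            record = src.read(record_size)
        except OSError as exc:
            # Keep everything written so far and report where the read failed.
            return copied, exc
        if not record:
            # An empty read means end of file (or a tape file mark).
            return copied, None
        dst.write(record)
        copied += 1
```

With a real drive, src would be the raw tape device opened unbuffered, for
example open("/dev/rst4", "rb", buffering=0) with the device name taken from
the dd command above, and dst an ordinary file opened with "wb". The record
count at the failure point roughly corresponds to the block number in the
console message.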

The conclusion is that the tape drive did not write the tapes properly,
independent of the misconfiguration, even though the dump logs showed no
errors. Presumably the heads on the tape drive became misaligned, or were
not aligned properly in the first place.

My second conclusion is that it might be worth running dump with the 'v'
(verify) switch. I wasn't using it and I suspect it would add to the already
long backup times, but it might be worth it. In my case, the dump logs
were entirely misleading.
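
The general idea of verifying by reading back can be sketched like this. It
is not how dump's verify switch works internally; it only illustrates
comparing what was written against the source, using SHA-256 digests. The
names digest and verify_copy are hypothetical.

```python
import hashlib


def digest(stream, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of everything readable from stream."""
    h = hashlib.sha256()
    # Read in chunks so large backups do not need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def verify_copy(source, readback):
    """True if the data read back matches the source byte for byte (by digest)."""
    return digest(source) == digest(readback)
```

A mismatch, or an I/O error while reading the copy back, is the signal that a
clean write log would not have given.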

The story hasn't been concluded yet. I am still working on restores,
reading whatever partial restores I can. The most critical disk has
been sent to the "Data Recovery Clinic" (www.datarecoveryclinic.com),
which specializes in retrieving data from damaged disks. No results on
that yet, but if anyone wants feedback on their service and what the
results were, feel free to email me directly.

I have not yet tried the tape drive under a properly configured SunOS 4.x
kernel, but I have no reason to believe it will behave differently than
on Solaris 2.x with its proper detection of the device. I also have not
tried reading the tapes on a different tape drive on SunOS, although
they have been tried on a different tape drive under a different o/s (SCO).
Again, I have no reason to believe this will result in a full read of
the tapes. Both of these I will probably try anyway in desperation :)

Again, thanks to everyone who responded. Your answers were all valuable
and the collective experience out there is one of the best resources
around.

If anyone wants to correct any of my conclusions they feel to be incorrect,
please feel free to email. I am not out of the woods yet!

Nancy Voorhis
nancyv@curtech.com
System Administrator
Current Technologies, Inc.
Durham, NH 03824

(aka nancyv@voortech.com of VoorTech Consulting)

====================================
Original posting:
====================================

From sun-managers-relay@sunmanagers.ececs.uc.edu Tue Oct 6 02:22:17 1998
...
From: nancyv@curtech.com (Nancy Voorhis)
Message-Id: <199810060103.VAA00443@sunapee.curtech.com>
Subject: Conner Archive 4586NP/SCSI reset problems
To: sun-managers@sunmanagers.ececs.uc.edu
Date: Mon, 5 Oct 1998 21:03:13 -0400 (EDT)

Hi,

I have one of those sys adm nightmare scenarios on my hands...

I have a Conner Archive 4586NP Auto Loader (takes 4/12 tape cartridges
for 4mm tapes) on a SPARCstation 1+ (4/65) running SunOS 4.1.3.
There are 3 other devices on the SCSI bus (all disks - 1 internal, 2 external).

It does our full backups. Now, with 2 disk drives down due to a failed
power supply on an enclosure, it is failing when trying to do restores with the
following (SCSI) errors:

Oct 5 20:11:10 jp vmunix: esp0: Disconnected command timeout for Target 4 Lun 0
Oct 5 20:11:10 jp vmunix: st0: transport completed with timeout
Oct 5 20:11:10 jp vmunix: st0: attempting a device reset
Oct 5 20:11:10 jp vmunix: st0: attempting a bus reset
Oct 5 20:11:10 jp vmunix: st0: SCSI transport failed: reason 'timeout': giving up
Oct 5 20:24:07 jp vmunix: esp0: Disconnected command timeout for Target 4 Lun 0
Oct 5 20:24:07 jp vmunix: st0: transport completed with timeout
Oct 5 20:24:07 jp vmunix: st0: attempting a device reset
Oct 5 20:24:07 jp vmunix: esp0: polled command timeout
Oct 5 20:24:07 jp vmunix: State=MSG_OUT_DONE Last State=MSG_OUT
...
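
Log excerpts like the one above can be tallied with a short script. This is
a rough sketch that assumes the line layout shown in the excerpt (timestamp,
host, "vmunix:", device, message); the pattern LINE and the function
summarize are hypothetical names, and lines that do not fit the layout are
simply skipped.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "Oct 5 20:11:10 jp vmunix: st0: message"
LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+vmunix:\s+"
    r"(?P<dev>\w+):\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count occurrences of each (device, message) pair in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            counts[(match["dev"], match["msg"])] += 1
    return counts
```

Calling summarize on the lines above groups the repeated timeouts and reset
attempts by device, which makes it easier to see whether the failures always
involve the same target.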

It has always been a bit temperamental but with judicious cleaning it
has been reporting dump logs without errors. I am getting the same
errors for just about any tape I put in (I've tried about 10; 10 weeks worth).
I have gotten the vendor to send me a 2nd drive (thinking it can't possibly
be *all* the tapes) and I am getting the same behaviour with the replacement
drive (supposedly refurbished) :( I have finally gotten the 2nd drive to
read a tape written about 4 months ago. This is good, but still nearly
disastrous unless we can get our data back from more recent tapes.

The SunOS box always reports the drive as an Exabyte 8mm for the
mt status cmd and, reading the man pages, I don't see anything about
4mm drive support (sorry, I don't have the actual mt status output on hand).

An answer to any of the following may help me out
of the disastrous predicament of losing 4 months (OUCH) of data:

- is there support for this type of drive under SunOS 4.1.3?
  and should I be doing any special configuration/kernel settings?
- is it worth trying the drive on a Solaris 2.x box, possibly getting
  drivers that will support the tape drive better?
- can I read my dump tapes under Solaris 2.x and what options are needed
  (I briefly tried this but was not able to read them)?
- is it likely/possible that all my tapes are bad even though dump showed
  logs with no errors upon writing them?
- if anyone has suggestions on things to try, configure, etc. I am happy to
  listen and try...

Thanks for any help in advance,
Nancy Voorhis
Systems Administrator
Current Technologies, Inc.
Durham, NH 03824


