SUMMARY: Backup problems Exabyte/SLC

From: Paul Hostrup-Jessen (phj06@bk.dk)
Date: Sat Mar 19 1994 - 01:01:55 CST


Sun managers,

This is the fourth time I have sent out this summary; none of the earlier
attempts got through. (Have there been problems with the list server
recently?)

----------------------------------------------------------------------------

It took me longer than expected before I could get round to trying out
all suggestions concerning my backup problems encountered with Exabyte 8200s
on SLC workstations.

And I haven't really solved it yet, unfortunately - so it is not quite a
summary, but some of the tips listed might help at other sites.

There are indications that the SLCs combined with older Exabytes are known to
produce these problems. SCSI-cabling, termination and the physical order of the
peripheral devices play a major role, but see for yourself in what follows.

What is really weird is that my original setup used to work at one stage. Then,
from one day to the next, it ceased to work without anyone touching the equipment.
Network traffic has gone up recently owing to a lot of new installations
carried out at our site, but that is the only thing I can think of right now.

On our Sun 3/60s and SPARCstation 1s, 2s and 10s, with the same cables,
terminators and Exabytes, we have no problems at all!

My own comments follow after each "COMMENTS" and I will continue to investigate
the problem in due course.

This was my original question posted:
------------------------------------------------------------------------------
I have a serious problem getting any of our four Exabyte 8200 8mm backup units
to make a valid backup when it is attached to any of our many SPARCstation SLCs.

The SPARCstation SLC is running SunOS 4.1.3 (Solaris 1.1) and only one local
disk and the backup unit are attached to the SCSI-bus. Only the last peripheral
is terminated - in this case the Exabyte.

The dump parameters are like the following example where "arabia" is the
remote host being backed up to "haldir" which has the Exabyte:

rsh arabia /etc/dump 0dsbfu 54000 6000 126 haldir:/dev/nrst1 /dev/rsd1g

If the partition is small the backup will succeed. However, if it is a large
partition it will give up slightly more than halfway through. The write
errors occur on any tape - new or used:

Small dump OK:

  DUMP: Dumping /dev/rsd0g (/usr) to /dev/nrst1 on host haldir
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 174066 blocks (84.99MB) on 0.05 tape(s).
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 71.52% done, finished in 0:01
  DUMP: level 0 dump on Sun Feb 20 01:03:49 1994
  DUMP: Tape rewinding
  DUMP: 174044 blocks (84.98MB) on 1 volume
  DUMP: DUMP IS DONE

Big dump not OK:

  DUMP: Date of this level 0 dump: Sun Feb 20 01:11:30 1994
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/rsd1g (/pcapp) to /dev/nrst1 on host haldir
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 873410 blocks (426.47MB) on 0.23 tape(s).
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 16.07% done, finished in 0:26
  DUMP: 32.55% done, finished in 0:20
  DUMP: 49.18% done, finished in 0:15
  DUMP: 65.58% done, finished in 0:10
  DUMP: write: I/O error

  DUMP: write: I/O error

  DUMP: Tape write error 984 feet into tape 1
  DUMP: fopen on /dev/tty fails
  DUMP: The ENTIRE dump is aborted.
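
The block counts in both runs are consistent with 512-byte blocks, which can
be checked with a little arithmetic. The following Python sketch is only an
illustration: the helper name is made up here, and the 512-byte unit is an
assumption inferred from the figures dump printed above.

```python
# Assumption: dump reports sizes in 512-byte blocks and "MB" means 2**20 bytes.
BLOCK_BYTES = 512


def dump_blocks_to_mb(blocks):
    """Convert a dump block count to megabytes (hypothetical helper)."""
    return blocks * BLOCK_BYTES / 2**20


# Block counts taken from the two dump runs shown above.
for blocks in (174066, 174044, 873410):
    print(blocks, "blocks =", round(dump_blocks_to_mb(blocks), 2), "MB")
```

Under that assumption the three counts come out as 84.99MB, 84.98MB and
426.47MB, matching the sizes dump reported.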

These are the error messages on the console indicating that the SCSI-bus has
problems owing to an overload:

Feb 20 01:36:01 haldir vmunix: esp0: data transfer overrun
Feb 20 01:36:01 haldir vmunix: State=DATA Last State=DATA_DONE
Feb 20 01:36:01 haldir vmunix: Latched stat=0x10<XZERO> intr=0x10<BUS> fifo 0x0
Feb 20 01:36:01 haldir vmunix: last msg out: <unknown msg 0xff>; last msg in: IDENTIFY
Feb 20 01:36:01 haldir vmunix: DMA csr=0x80000000
Feb 20 01:36:01 haldir vmunix: addr=fff129a1 last=fff11ed1 last_count=acf
Feb 20 01:36:01 haldir vmunix: Cmd dump for Target 5 Lun 0:
Feb 20 01:36:01 haldir vmunix: cdb=[ 0xa 0x0 0x0 0xfc 0x0 0x0 ]
Feb 20 01:36:01 haldir vmunix: pkt_state 0xf<XFER,CMD,SEL,ARB> pkt_flags 0x0 pkt_statistics 0x1
Feb 20 01:36:01 haldir vmunix: cmd_flags=0x23 cmd_timeout 119
Feb 20 01:36:01 haldir vmunix: Mapped Dma Space:
Feb 20 01:36:01 haldir vmunix: Base = 0x2da0 Count = 0xfc00
Feb 20 01:36:01 haldir vmunix: Transfer History:
Feb 20 01:36:01 haldir vmunix: Base = 0x2da0 Count = 0xfc00
Feb 20 01:36:01 haldir vmunix: current phase 0x25=DATAOUT stat=0x0 0xacf
Feb 20 01:36:01 haldir vmunix: current phase 0x1b=RESEL stat=0x7 0x5 0x0
Feb 20 01:36:01 haldir vmunix: current phase 0x5=MSG_IN stat=0x7 0x4
Feb 20 01:36:01 haldir vmunix: current phase 0x28=DISCONNECT stat=0x7 0xfc00
Feb 20 01:36:01 haldir vmunix: current phase 0x2c=SAVEDP stat=0x7 0xfc00
Feb 20 01:36:01 haldir vmunix: current phase 0x25=DATAOUT stat=0x10 0xfc00
Feb 20 01:36:01 haldir vmunix: current phase 0x20=SELECT stat=0x10 0x5 0x0
Feb 20 01:36:01 haldir vmunix: current phase 0x1=CMD_START stat=0x10 0xa 0x20
Feb 20 01:36:01 haldir vmunix: current phase 0xb=CMD_CMPLT stat=0x17 0xfc00
Feb 20 01:36:01 haldir vmunix: current phase 0x27=STATUS stat=0x17 0x0
Feb 20 01:36:01 haldir vmunix: current phase 0xb=CMD_CMPLT stat=0x13
Feb 20 01:36:01 haldir vmunix: current phase 0x25=DATAOUT stat=0x10 0xfc00
Feb 20 01:36:01 haldir vmunix: current phase 0x20=SELECT stat=0x10 0x5 0x0
Feb 20 01:36:01 haldir vmunix: current phase 0x1=CMD_START stat=0x10 0xa 0x20
Feb 20 01:36:01 haldir vmunix: current phase 0xb=CMD_CMPLT stat=0x17 0xfc00
Feb 20 01:36:01 haldir vmunix: current phase 0x27=STATUS stat=0x17 0x0
Feb 20 01:36:01 haldir vmunix: st1: transport completed with data_ovr
Feb 20 01:36:01 haldir vmunix: st1: attempting a device reset
Feb 20 01:36:01 haldir vmunix: st1: attempting a bus reset
Feb 20 01:36:01 haldir vmunix: st1: SCSI transport failed: reason 'data_ovr': giving up
Feb 20 01:36:04 haldir vmunix: st1: Error for command 'write file mark', Error Level: 'Fatal'
Feb 20 01:36:04 haldir vmunix: Block: 5132
Feb 20 01:36:04 haldir vmunix: Sense Key: Unit Attention
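
The "cdb=" line in the log can be decoded by hand. A minimal Python sketch,
assuming the standard 6-byte SCSI WRITE layout for tape devices (opcode 0x0A,
a "fixed" bit in byte 1, a 24-bit transfer length in bytes 2-4, a control
byte last) and assuming the drive is in variable-block mode so that the
length is in bytes. The function name is hypothetical.

```python
def decode_write6(cdb):
    """Decode a 6-byte SCSI WRITE command block for a tape device."""
    if len(cdb) != 6:
        raise ValueError("expected a 6-byte CDB")
    opcode, flags, len_hi, len_mid, len_lo, control = cdb
    if opcode != 0x0A:
        raise ValueError("not a WRITE(6) command")
    return {
        "fixed": bool(flags & 0x01),  # 0 = variable-length block mode
        "transfer_length": (len_hi << 16) | (len_mid << 8) | len_lo,
        "control": control,
    }


# CDB bytes as printed in the console log for Target 5 Lun 0.
info = decode_write6([0x0A, 0x00, 0x00, 0xFC, 0x00, 0x00])
print(info["transfer_length"], info["transfer_length"] // 512)  # 64512 126
```

The transfer length of 0xfc00 (64512) bytes equals 126 x 512, i.e. one write
at the blocking factor of 126 given on the dump command line, and it matches
the "Count = 0xfc00" lines in the same log.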

If I move the tape unit to a SPARCstation 10 also running SunOS 4.1.3 - a
much more powerful workstation than the SLC - the backup works perfectly.

It must have something to do with the traffic on the SCSI-bus as far as I can
gather. Would it be an idea to slow down the backup process by writing fewer
blocks at a time?
------------------------------------------------------------------------------
------------------------------------------------------------------------------

Phil Hubbard <phubbard@baosc.com> wrote this one:

We're fighting similar problems here with SCSI bus errors, and we haven't been
able to establish exactly what is causing them. Please keep me posted, and
I'll pass any information we come up with along to you.

COMMENTS: Phil, we're not the only ones who encounter this problem - see some
          of the following answers.
------------------------------------------------------------------------------

Steve Swaney <swanes@etswwmd.eq.gs.com> wrote this one:

Looks like you're running out of tape. The parameters I use (based on info from
Delta Microsystems) are:

for an Exabyte 8500 (5GB) if the device is st1

        /usr/etc/dump 0usbf 226552 126 /dev/nrst9 /<filesystem>

for 2.3GB if the device is st1

        /usr/etc/dump 0usbf 113276 126 /dev/nrst1 /<filesystem>

Note the use of /dev/nrst9 for 5GB and /dev/nrst1 for 2.3GB as well as the
different "s" parameters.

COMMENTS: Thank you for your advice, Steve. My machine is an 8200 and it
          doesn't have the option of making 5GByte dumps.
------------------------------------------------------------------------------

Bob Izenberg <bobi@vswr.sps.mot.com> wrote this one:

>rsh arabia /etc/dump 0dsbfu 54000 6000 126 haldir:/dev/nrst1 /dev/rsd1g

This command could have the arguments in an incorrect order. What happens
if you try this:

rsh arabia "/etc/rdump 0ubdsf 126 6250 54000 haldir:/dev/nrst1 /dev/rsd1g"

COMMENTS: I have followed the original documentation concerning the order.
          However, I haven't tried the "rdump" command. Is there any
          significant difference between "dump" and "rdump" ?
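
On the question of argument order, here is a minimal Python sketch. It
assumes the usual dump convention that each key letter which takes a value
consumes the next argument, in the order the letters appear. Only the four
value-taking letters used in this thread are listed, and the function name
is hypothetical.

```python
# Assumed subset of dump key letters that take a value.
KEYS_WITH_ARGS = {
    "b": "blocking factor",
    "d": "density (bpi)",
    "s": "tape size (feet)",
    "f": "dump file",
}


def pair_dump_args(keys, args):
    """Pair each value-taking key letter with the next argument, in order."""
    remaining = list(args)
    paired = {}
    for letter in keys:
        if letter in KEYS_WITH_ARGS:
            if not remaining:
                raise ValueError("missing argument for key " + letter)
            paired[letter] = remaining.pop(0)
    return paired, remaining


original = pair_dump_args(
    "0dsbfu", ["54000", "6000", "126", "haldir:/dev/nrst1", "/dev/rsd1g"])
suggested = pair_dump_args(
    "0ubdsf", ["126", "6250", "54000", "haldir:/dev/nrst1", "/dev/rsd1g"])
print(original[0])
print(suggested[0])
```

Under that assumption both commands pair up consistently, with b=126 and the
same dump file. They differ in the values: the original gives d=54000 and
s=6000, the suggested one d=6250 and s=54000.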
------------------------------------------------------------------------------

Glenn Satchell <glenn@uniq.com.au> wrote this one:

I think this is a problem with some of the earlier SLC's which was
fixed in a later rev of the CPU board and scsi controller hardware. I
saved this message (don't really know why) from an old sun-managers
posting. The SS10 will certainly have a better SCSI interface since it
can support "fast scsi", i.e. 10MB/sec.

Otherwise it is a case of checking your terminator and cables on the
SLC, making sure that they are the round shielded type and as short as
possible. Use an active terminator - these have a small green LED in
the back - they are much better for providing noise immunity on the
bus.

COMMENTS: I get the impression, too, that the problem is really related to
          the SLC itself. I do use shielded and short SCSI-cables, but use
          the built-in internal termination of the device. In my experience,
          active terminators and short SCSI-cables are absolutely necessary
          on SCSI-busses which support SCSI-2 because of the much faster
          transfer rate. Unfortunately, the active terminators we already
          have in use are not interchangeable because of different types of
          sockets.

----------------------------------------------------------------------------
           
Yuval Tamir <tamir@cs.ucla.edu> wrote this one:

There have been several messages on the net regarding problems with the SCSI
port on the SLC. These problems seem to be with unexpected resets on the SCSI
bus.

Some people report that this is harmless. Another report said that the resets
were causing an Exabyte drive to rewind and eject the tape unexpectedly in the
middle of dumps. Another report is that the problem occurs with Fujitsu disks
when their read-ahead cache is enabled.

One theory is that the SCSI host adaptor on the SLC is extremely sensitive to
the impedance of your daisy chain. Some people have been able to eliminate the
problem by using a very short (1 meter) cable between the SLC and the disk.
Others have bypassed the problem by using an external terminator instead of
the internal ones. (There is a Sun reference number 560762 for this problem).
 
Our Sun salesperson says that this has been a problem with only 12 systems in
the entire country. He further says that the problem happens only with "certain
third-party SCSI devices" or when using 3 or more SCSI devices on the same SLC.

Does anybody have more info on this ? It is not very appealing to purchase the
workstations and later find out that there are problems with the particular
third party disk you have and/or you cannot use the tape drive you want.

Is there anybody from Sun who would care to comment ?

Any estimate on when (if?) a fix can be expected ?

COMMENTS: Thanks for the detailed information, Yuval. I have forwarded the
          mail to our local Sun representative for their comments. I would
          like to ask the Exabyte Corporation, too, what their opinion is.
          Does anyone have an email inquiry address for Exabyte?

------------------------------------------------------------------------------

Birger A. Wathne <birger@vest.sdata.no> wrote this one:

Especially older 2GByte Exabytes were picky about the SCSI chain. Cables and
terminators must be ok. And these older Exabytes must be located closer to the
host than faster external units. Don't ask me why, but I have confirmed it in
a lot of cases. So you should connect the exabyte as the first external unit,
then the disk. If it still won't work, try adding Sun's CD player (possibly
before the Exabyte). This CD player seems to calm down most SCSI chains.

I know that this shouldn't make any sense, as the SCSI chain is a bus, and
physical location shouldn't matter, but it does. Having internal disks before
the Exabyte seems to be ok. But not external disks before older Exabytes. So
it has to be something with cabling, reflections, etc....

COMMENTS: Thank you, Birger, for the detailed information. I will experiment
          with cables and reversing the order of the devices on the SCSI-chain.
          I also have a CD-player which I can install for test purposes.
-------------------------------------------------------------------------------

Mike Frizzell <friz@ms3.dseg.ti.com> wrote this one:

If this is a 5 GByte drive you need to change the tape length from 6000 to
13000.

COMMENTS: Thanks Mike, but it is a 2.3 GByte drive - the model 8200.
-------------------------------------------------------------------------------

Andy Feldt <feldt@phyast.nhn.uoknor.edu> wrote this one:

My first guess would be a cabling or termination problem. Double check even the
obvious. Use a totally different cable, for example. Check total cable length.
Try it with only the tape on the SCSI bus (or whatever minimum of other devices
on that you can test with.) Make sure the SCSI chain is terminated at the end
of the external chain and never is terminated internally by any of the external
devices on the bus.

COMMENTS: Thanks, Andy. Like the other suggestions, it seems to be very much
          related to termination and cables.
------------------------------------------------------------------------------

Dave Weitzel <weitzel@burke.com> wrote this one:

If you are on the machine haldir and issuing the statement :
rsh arabia /etc/dump 0dsbfu 54000 6000 126 haldir:/dev/nrst1 /dev/rsd1g

To run on arabia, then you probably need to run this statement :
rsh arabia /bin/rdump 0dsbfu 54000 6000 126 haldir:/dev/nrst1 /dev/rsd1g

Please let me know if this clears up your problem. You also might need to
check /dev on arabia and see if a LARGE file named nrst1 was created. If it
did not exist on Arabia when you initiated the dump command originally sent
with this message, this might have happened.

COMMENTS: Thanks, Dave. I checked /dev/nrst1 on the remote host, and it appears
          to be the right size.

-------------------------------------------------------------------------------
Ron Zinnato <zinnato@NADC.NADC.NAVY.MIL> wrote this one:

I'm not sure if this will help, but we had similar problems a few years ago
and it turned out that we had the default swap partition in the /etc/fstab
file. It seems that as of 4.1.3, partition b of the boot device is the default
swap partition and SHOULD NOT be in the fstab. We didn't know that, and had the
same problem of small dumps working, but big ones dying. Hope this helps.

COMMENTS: Generally we always use the b-partition for swapping, but thanks for
          your suggestion, Ron.
------------------------------------------------------------------------------

Robert J Wolf <Robert.Wolf@dciem.dnd.ca> wrote this one:

We have replaced our Exabyte 8500 tape unit three times in the last 2 years.
They are the most flakey hardware I have ever encountered. Has the tape unit
ever worked on the slower machine? If yes then I would suspect the tape unit
and if not then I would suspect the scsi controller in the older machine.

COMMENTS: Thanks, Robert. The Exabyte is a 8200 model. So far, I'm rather
          tempted to think that it is the SLC rather than the Exabyte.

------------------------------------------------------------------------------

Ted Rodriguez-Bell <ted@ssl.Berkeley.EDU> wrote this one:

I was told when I was having trouble with ours that it should be as close to
the beginning of the bus as possible, and that an active terminator would be
a good idea on our system (which was an LX). Try running these ideas past the
people who sold you the things.

COMMENTS: Will do, Ted. The LX workstation supports the fast SCSI-2 standard
          and in these cases short SCSI-cables and active terminators are a
          must.
------------------------------------------------------------------------------

Sean Ward <seanw@amgen.com> wrote this one:

What type of disk drives are being backed up? I remember a problem about a year
ago with Maxtor 1.2GB and 1.7GB (unformatted) disk drives when backing up
onto Exabytes. The problem ended up being the read-ahead cache on the Maxtor
drives. We received a program from Maxtor to turn the cache off, and the
problem went away.

COMMENTS: Thanks Sean. We use Micropolis disk drives for data in general, but
          do use smaller Maxtor drives as local system drives.

------------------------------------------------------------------------------
Jerry Symanski <symanski@gold.nosc.mil> wrote this one:

I see similar messages from dump on my 4/370 but I don't get the console
messages. I will get a simple "... st1: write error" message.

I am beginning to wonder if my 4/370 is too slow. I will be interested in
seeing your summary. These backups should JUST WORK. But that is not the
case...!

COMMENTS: I agree, Jerry. Backups should just work! You will always need one,
          when it fails ..... All the suggestions are here for you to read.
------------------------------------------------------------------------------

Curt Vincent <vincec@uslss2.eq.gs.com> wrote this one:

You are overrunning the SCSI bus. I've been to this movie, so to speak. I
have found that more than two tape drives on an IPX or SPARC II scsi bus
results in "broken pipes" and tape errors. Same thing for more than one drive
attached to IPC or SLC class machine.

The SPARC 10 does not have this problem as it uses a fast SCSI II.

COMMENTS: Thanks, Curt. I only have one Exabyte and one local disk attached
          to the SLC. The SLC has no internal drive.
----------------------------------------------------------------------------
John DeRosa <derosa@marble.rtsg.mot.com> wrote this one:

I have not been having any rdump problems to a SS10 running 4.1.3 with a Sun
8mm drive. My invocation is:

/usr/etc/rdump 0dsbfu 110000 11000 126 backups@dumphost:/dev/nrst9

I would suspect a bad scsi port on the Sun or the tape drive.

COMMENTS: Thanks, John. The SPARCstation 10s work well - also with "dump". It
          is the SPARCstation SLC which causes problems. I'm not sure about
          the parameters. I got my dump parameters from the original
          documentation accompanying SunOS 4.1.3 for the standard 2.3 GByte
          Helical Scan tape drives.
------------------------------------------------------------------------------
Steve Young <syoung@cedar.buffalo.edu> wrote this one:

I have seen this before on SLC's - it is an OS bug, you need to update the
4.1.3_U1 running on the SLC to the newest version. There might also be a patch
available. To carry your tests further, I found I was able to use tar but not
dump, and I was able to dump disk to disk but not disk to tape. The 8500 was
not originally supported in the 4.1.3_U1.

I don't have any machines in house currently to tell you what version does work
but I know the recent announcements have been saying that Solaris 1.1.1 now
supports the 8500.

COMMENTS: Thanks, Steve. I'm not using SunOS 4.1.3_U1, but I could.
          SunOS 4.1.3 supports the Exabyte 8200, however - and we only have
          these tape drives.

-----------------------------------------------------------------------------

That's all for now, folks!

Paul Hostrup-Jessen
System Administrator

Bruel & Kjaer A/S
DK-2850 Naerum
DENMARK

email: phj06@bk.dk

tel. : +45 42 80 78 55 + 2433 (direct dial from touch-tone telephones)
fax : +45 42 80 14 05


