SUMMARY: Replace drive or not?

From: Brian Whiting (
Date: Wed May 12 1993 - 02:38:32 CDT

I wrote:

: Hi all,
: About 3 weeks ago, I experienced a number of write errors on my Seagate
: ST42100N partition b, which has a swap file and some spool dirs. On the
: advice of the vendor, I backed up all files on the disk, then reformatted.
: I ran format's analyze utility and it did not find any defects
: other than those listed in the mfr's table. So I went over my partition
: table with a fine-tooth comb, then restored the files. I saw a few more
: write and some read errors in the next few days, including some on another
: partition. This led me to ask for a replacement disk, which I now have.
: Now here's the rub: Since that initial flurry of errors, the drive has
: been stable-- no errors to console or any other of the usual places. So
: I am unsure whether it is worthwhile to go ahead and swap the drive or
: not, since I don't know whether the errors were real or not. Please advise!
: Facts:
: Sparc IPX, 64 Mb RAM, SunOS 4.1.3, Sun 427 internal (less than 5 months old)
: sd1: <SEAGATE cyl 2575 alt 2 hd 15 sec 96>
: partitioned as follows:
: partition a - starting cyl 0, # blocks 741600 (515/0/0)
: partition b - starting cyl 515, # blocks 741600 (515/0/0)
: partition c - starting cyl 0, # blocks 3708000 (2575/0/0)
: partition d - starting cyl 0, # blocks 0 (0/0/0)
: partition e - starting cyl 0, # blocks 0 (0/0/0)
: partition f - starting cyl 0, # blocks 0 (0/0/0)
: partition g - starting cyl 1545, # blocks 1483200 (1030/0/0)
: partition h - starting cyl 1030, # blocks 741600 (515/0/0)
: NB: drive is less than 4 months old.
: also on the SCSI bus are Sun CD-ROM, 8 mm Exabyte
: Finally, here's a sample of the errors I saw:
: Apr 13 11:55:31 lithos vmunix: sd1b: Error for command 'write'
: Apr 13 11:55:31 lithos vmunix: sd1b: Error Level: Fatal
: Apr 13 11:55:31 lithos vmunix: sd1b: Block 289664, Absolute Block: 1031317
: Apr 13 11:55:31 lithos vmunix: sd1b: Sense Key: Hardware Error
: Apr 13 11:55:31 lithos vmunix: sd1b: Vendor 'SEAGATE' error code: 0x3
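
As a rough cross-check of where the logged error falls, the Python sketch below converts the absolute block from the sample message into a cylinder/head/sector position using the label geometry (15 heads, 96 sectors per track) and looks it up in the posted partition table. It assumes that absolute block numbers map linearly onto that geometry, which the original message does not confirm; the function and variable names are illustrative only.

```python
# Geometry from the disk label: <SEAGATE cyl 2575 alt 2 hd 15 sec 96>
HEADS = 15
SECTORS_PER_TRACK = 96
BLOCKS_PER_CYL = HEADS * SECTORS_PER_TRACK  # 1440 blocks per cylinder

# Posted partition table: name -> (starting cylinder, length in cylinders).
# Partition c (whole disk) and the empty d/e/f are left out.
POSTED = {"a": (0, 515), "b": (515, 515), "g": (1545, 1030), "h": (1030, 515)}


def locate(abs_block):
    """Split an absolute block number into (cylinder, head, sector).

    Assumption: a simple linear block-to-geometry mapping.
    """
    cyl, rest = divmod(abs_block, BLOCKS_PER_CYL)
    head, sector = divmod(rest, SECTORS_PER_TRACK)
    return cyl, head, sector


def partition_of(cyl, table=POSTED):
    """Return the name of the partition containing this cylinder, or None."""
    for name, (start, length) in sorted(table.items()):
        if start <= cyl < start + length:
            return name
    return None


cyl, head, sector = locate(1031317)  # absolute block from the log sample
print(cyl, head, sector, partition_of(cyl))
```

Under these assumptions the block lands on cylinder 716, inside partition b, which agrees with the sd1b prefix in the log.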

I haven't really gotten a handle on what (if anything) is wrong with
the drive, but I am sending it back as an ounce of prevention. Here
are excerpts from the responses I received (thanks all around for the
help). NB: as a bonus, the new drive is quieter than the one it replaced!

>From: (Bruce Cogan)

On general principles, I'd replace the drive. Perhaps the
errors were some transient thing. But suppose you
don't replace the drive, and they recur in a few months,
then you're back to square one.

If you do replace the drive, and the same errors recur,
then you may suspect a SCSI bus problem.

>From: (Peter Watkins)

This sounds very much like an error I had on my Seagate ST42100N a few months ago.
The symptoms were intermittent and unpredictable read/write errors that gave much
the same error messages as you seem to have. In my case I have the drive in its own
desktop unit and I assume that you have the same arrangement.

Like you I asked for a replacement and exchanged the drives in the desktop - the
problem persisted!! After some heated telephone calls and a lot of reformatting, etc.,
I eventually discovered the problem. It had nothing at all to do with the disc itself
but was a fault in the desktop unit. The SCSI selector switch on the desktop was
connected to the disc by a really poor quality pin connector and it was poor contacts
here that were causing the problem.

One solution clearly is to get the connector replaced with a better quality unit and
this has been done on my box. Even easier is to disconnect the connector from the
selector switch, in which case the drive defaults to SCSI device number 0. The latter
works well for me but is not really a solution; it may, however, confirm where your
problem lies.

If this is indeed your problem then I'm surprised because I made quite a fuss about
the defect. Our distributor here in Holland assured me that he had passed on the
comments to SEAGATE so they should know about it.

>From: (Bruce Cogan)

The analyze in Sun's format will not catch all the errors. For a SCSI disk, what you have to do is repair the bad blocks that show up and then reformat. Many times there are some locations on the disk that are "good" enough to get by analyze, but after use they show up as "bad". If only a few errors show up at a time, then you should be OK with just a repair followed by a format.

Bill Barnes

PS. For anything but SCSI, all you have to do is repair, but SCSI requires that you follow a repair with a reformat.

>From: cyrix!led!markm@texsun.Central.Sun.COM (Mark McDermott)

You need to check the power to the SCSI termination resistors. We had this
problem until the jumpers were set to supply termination power from the disk.

>From: (Christopher Kelly)

We had a few errors on a new 2.3 Gbyte disk a month ago, very similar
configuration to yours. Ours was a _brand new_ disk. The only time we
saw the errors was during a dump to the 8mm backup tape, and only on a
particular large partition. At the advice of the vendor we replaced
the disk and have had no more such errors. They looked similar to
yours, except there were more messages like SCSI this... and SCSI that...
and something that seemed to resemble an "overflow" message. It was a
hardware error and not a defect on the disk, so it could not have been
fixed by any amount of reformatting.

My advice: if the vendor is cooperative, why not swap it?

>From: admiral!xhaque@uunet.UU.NET (Amanul Haque)

I have seen similar "phantom" problems on my disks. Only a newfs or a
replacement of the drive has fixed the problem. Would you send me a
copy/summary of the insights you get? I did not want to flood the net with
the same requests.


Virtual errors are just as bad as the real ones! ;^) A drive this new should
not yield errors of ANY type. If the replacement offer is free, accept it
gratefully, and avoid hassles down the road. The "bad" drive will probably
get a factory checkout and be recycled into the replacement pool, so no one
is really out of pocket. Besides, your vendor wants your system running at
100% -- and your future business, right? ;^) Hope this helps!

>From: (Michael Sullivan)

I advise you to replace the disk. There must be some real problem with the
disk or you wouldn't have gotten the errors in the first place. Just because
the errors have stopped doesn't mean they won't come back. For instance,
the disk drive may be overly sensitive to vibration or temperature due
to a cold solder joint or a marginally functional component.


>From: (System Manager)

Problem with your partitions (see the corrected table below):

| #  | part | start cyl | blocks  | cyls | size     | mount point  |
| 1. | a    |         0 |  741600 |  515 |  370.800 | /            |
| 2. | b    |       515 |  741600 |  515 |  370.800 |              |
| 3. | c    |         0 | 3708000 | 2575 | 1854.000 |              |
| 4. | d    |         0 |       0 |    0 |    0.000 | /export      |
| 5. | e    |         0 |       0 |    0 |    0.000 | /export/swap |
| 6. | f    |         0 |       0 |    0 |    0.000 |              |
| 7. | g    |      1030 | 1483200 | 1030 |  741.600 | /usr         |
| 8. | h    |      2060 |  741600 |  515 |  370.800 | /home        |
------------------------------- Partitions ------------------------------------
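
The block counts in this table follow from the label geometry (cylinders x 15 heads x 96 sectors). The Python sketch below reproduces them, and also the size figures (370.800, 741.600, 1854.000), on the assumption that blocks are 512 bytes and that the size is simply kilobytes divided by 1000. That reading of the size figures is an inference from the numbers, not something the sender stated, and the helper names are illustrative.

```python
HEADS, SECTORS_PER_TRACK = 15, 96  # from the disk label
BYTES_PER_BLOCK = 512              # assumption: standard 512-byte disk blocks


def blocks_for(cylinders):
    """Number of blocks in a partition spanning this many whole cylinders."""
    return cylinders * HEADS * SECTORS_PER_TRACK


def size_figure(blocks):
    """Size as it appears in the table: kilobytes divided by 1000 (assumed)."""
    kbytes = blocks * BYTES_PER_BLOCK // 1024
    return kbytes / 1000


for name, cyls in [("a", 515), ("g", 1030), ("c", 2575)]:
    blocks = blocks_for(cyls)
    print(name, blocks, f"{size_figure(blocks):.3f}")
```

For example, 515 cylinders gives 515 x 15 x 96 = 741600 blocks, matching partitions a, b and h.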

>From: polaris1!support!lv@uunet.UU.NET (Luis Vallejo)

        It seems that your partition table is not correct. You should start the g partition at cylinder 1030 with a length of 1030 cylinders, so that it keeps the same size. Then add up a+b+g (515 + 515 + 1030 = 2060) and start the h partition at cylinder 2060 with a length of 515 cylinders.

the partition should be:
        partition a - starting cyl 0, # blocks 741600 (515/0/0)
        partition b - starting cyl 515, # blocks 741600 (515/0/0)
        partition c - starting cyl 0, # blocks 3708000 (2575/0/0)
        partition d - starting cyl 0, # blocks 0 (0/0/0)
        partition e - starting cyl 0, # blocks 0 (0/0/0)
        partition f - starting cyl 0, # blocks 0 (0/0/0)
        partition g - starting cyl 1030, # blocks 1483200 (1030/0/0)
        partition h - starting cyl 2060, # blocks 741600 (515/0/0)
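
A layout like this can be sanity-checked mechanically. The Python sketch below walks the non-empty partitions (leaving out c, which spans the whole disk) in order of starting cylinder and reports any overlap or gap. It assumes each partition occupies cylinders start through start + length - 1 on a 2575-cylinder disk; `layout_problems` and the two table names are hypothetical, made up for this illustration.

```python
TOTAL_CYLS = 2575  # from the disk label

# name -> (starting cylinder, length in cylinders)
POSTED = {"a": (0, 515), "b": (515, 515), "g": (1545, 1030), "h": (1030, 515)}
SUGGESTED = {"a": (0, 515), "b": (515, 515), "g": (1030, 1030), "h": (2060, 515)}


def layout_problems(table, total_cyls=TOTAL_CYLS):
    """Return a list of human-readable overlap/gap problems (empty if none)."""
    problems = []
    next_free = 0  # first cylinder not yet claimed by any partition
    ordered = sorted((start, length, name) for name, (start, length) in table.items())
    for start, length, name in ordered:
        if start < next_free:
            problems.append(f"{name} overlaps an earlier partition at cylinder {start}")
        elif start > next_free:
            problems.append(f"gap before {name}: cylinders {next_free}-{start - 1}")
        next_free = max(next_free, start + length)
    if next_free > total_cyls:
        problems.append(f"layout runs past the last cylinder ({total_cyls - 1})")
    elif next_free < total_cyls:
        problems.append(f"unused cylinders {next_free}-{total_cyls - 1}")
    return problems


print("posted:   ", layout_problems(POSTED))
print("suggested:", layout_problems(SUGGESTED))
```

Under these assumptions both calls print an empty list: the table as originally posted and the table suggested here each cover cylinders 0 to 2574 with no overlap or gap, and differ only in whether g or h comes first on the disk.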

If you are using this disk as the primary drive, you should create a /var partition and reduce the / partition; this way the system will see some improvement in performance.

>From: (Dan Christensen)

        Go into the format command and look at repair. This will then ask you for a block to repair. You would then type in the block that you are having trouble with. Then quit out of the format command and reboot your system.


|  Brian Whiting           |
|  Post-Doc                        V: 606-257-7214         |
|  Geology Dept., UKy              F: 606-258-1938         |
