SUMMARY: Strange SCSI disk errors on Wren IV: Sparc SunOS 4.1.1

From: Ted Nolan SRI Ft Bragg (ted@usasoc.soc.mil)
Date: Thu Jan 23 1992 - 15:14:08 CST


Hello again folks,

If you'll recall, I asked a question a week or so ago about strange behavior
on Wren IV SCSI drives.

A brief restatement of the problem is that I have had problems with several
Wren IV drives on Sparcstations. The drives start getting bad block messages
under Unix, especially for block 6736, and format is unable to deal with
the problem. Sometimes such a drive will pass a format and test absolutely
cleanly; other times format will find errors which it cannot fix; and sometimes
it will go off into never-never land, complaining about sectors way off the
end of the drive somewhere. I currently have such a drive just back from
repair, which invariably gets:

                sd3a: Error for command 'write'
                sd3a: Error Level: Fatal
                sd3a: Block 6736, Absolute Block: 6736
                sd3a: Sense Key: Hardware Error
                sd3a: Vendor 'CDC' error code: 0x9

when I try to newfs the 'a' partition, even after a perfect format/verify.
After getting this error, format will refuse to deal with the drive anymore
until some (variable) combination of poweroffs and manually entering the
format.dat data, after which it will again pronounce the drive perfect.

Anyway, on with the responses. (I've reformatted or edited these in places):

=============================================================================
Jan Berger Henriksen from Norway (jan@eik.ii.uib.no) observes:

>
>This is a Wren IV problem on SPARCs under SunOS 4.1.X. I've got a bunch
>sitting in our repair room, our vendor has a few, and none of us has
>a fix yet. And we first observed this last summer....
>
>There have been a few articles on this subject in sun-managers during fall '91.
>No solutions.

It's nice to know I'm not alone, at least.

=============================================================================
Marc Hansen (yangtze!nile!mhansen@snowbird.Central.Sun.COM) says:

>
>I have had this EXACT problem on seven different Wren IVs (six purchased
>third party, one from Sun).
>
>The first two were under warranty and I sent them in for repair and
>they have worked fine ever since (26 months).
>
>The others were not under warranty when they failed. Because of the
>frequency with which I saw the problem and the fact that disk analysis
>never seemed to find the "bad" blocks, I called the factory technical
>support people (CDC sold Imprimis; they are now a division of Seagate).
>After 26 phone calls and eight weeks, I couldn't find anybody at the
>factory who could tell me that they even suspected that a total drive
>rebuild would help the problem. At this point I decided that the
>problem is a design flaw or incompatibility bug. (The Wren IV was designed
>well before the first SPARCstation.)
>
>The decision we finally made was to replace the drives. Repair of the
>drive was $500 and we can buy a Wren 6 for $1500. Considering the
>amount of time we were putting into this problem it didn't seem
>economical to mess with it anymore.
>
>We have installed 14 Wren 6 and Wren 7 drives and have never seen this
>problem on any of them.
>

I agree that I wouldn't recommend this configuration to anyone else..

=============================================================================
Michael Sullivan (trdlnk!mike) has some comments on format itself:

>
>I don't know what the actual problem is with your disks, but I can tell
>you that format's tests are brain-damaged: they silently ignore any
>kind of error except media errors, which explains why you could run
>multiple tests in format successfully, but get a hardware error as soon
>as you really try to use the disk. I once had a disk that was unable
>to write at all pass multiple passes of format's compare test without
>any errors reported!

That's nice to know, but not very reassuring..

=============================================================================
John Baird (ACSnet: john@toshiba.tic.OZ.AU) thinks I may have hurt myself
by using a dd on the 'c' partition to copy the disk from another Wren IV
at one point:

>
>Just a note. Be very careful when using dd to copy whole disks as you
>can very easily copy the sector defect map from one drive to another.
>When format runs it maps out all bad sectors and everything is alright. When
>you "dd" however, you over-write this map with the one from the source drive,
>possibly causing previously good blocks to become bad (ok), and previously
>bad blocks to become good (not ok).
>
>Why the numbers are being screwed around I don't know, and this is only a
>guess, but dd is not a really safe way of doing things. Backup | restore
>is much slower, but much safer.
>
>+ This [the dd] completed with no errors, and I have done this before
>+ with no problems (although maybe not under 4.1.1).
>
>I used to use this too until I ran into big problems. Most of the time you
>can get away with it, but eventually it will catch up with you.
>

What's the scoop here? I'm almost positive you can't overwrite the
original factory list (retrieved by the "original" command in
format/defect). What about the list that format writes back out? I
had assumed that that was written in the spare cylinders beyond the end
of the 'c' partition. Is this not the case?

=============================================================================
Marcel Bernards (bernards@ECN.NL) has had similar problems with a Micropolis:

>
>I had a Sun Shoebox sd0: <Micropolis 1558 cyl 1218 alt 2 hd 15 sec 35>
>which worked OK until I upgraded from 4.0.3 to 4.1.1 on a 3/50
>which showed the same symptoms.
>
>Since I had no time for restoring, I left the home partition preserved;
>since that time the problems began.
>
>+Block 1835008 (4432/3/22), Fatal non-media error (hardware error)
>Hey! That 4432/3/22 is the same as the old 327 Shoebox disk barfed at me.
>
>That's even weirder....
>
>Sun replaced the disk and we reformatted the disk, no problems, but I ran
>the format>analyze>read test and it began to produce
>"Warning: unable to pinpoint" messages on some random block numbers
>ranging from 1100/0/0 to 1217/14/34.
>Repairing did not help, and other block numbers appeared too..
>
>format did not complain after the repair.
>The strange thing was, when I used another (Sun) cable, the number of
>messages seemed to decrease, but they did not disappear.
>The cable was that long old Sun cable, so I'm gonna try a much shorter cable.
>Has this something to do with the Sync SCSI stuff? Dunno.
>
>I suspect there is something wrong with the SCSI/ESDI controller,
>or some obscure CPU board revision level in combination with SunOS 4.1.1,
>but I'm not sure (yet).
>

I have seen the "unable to pinpoint", and "repair failed" messages during
some manifestations of this problem. At this point though, we have tried
several different cables, and have had the drives on several different
machines (so if it's a controller problem, it's broken as designed..). We
have had these problems under 4.0.3 and 4.1.1, and as I recall 4.0.3 did
not have the sync SCSI stuff.
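
The cylinder/head/sector form of these block numbers can be checked with a
little arithmetic. What follows is only a minimal sketch: the Wren IV geometry
used (1545 data cylinders, 9 heads, 46 sectors per track) is an assumption and
is not stated in any of the messages, and the names Geometry, block_to_chs and
on_disk are purely illustrative.

```python
from typing import NamedTuple, Tuple


class Geometry(NamedTuple):
    """Logical disk geometry, as it would appear in a format.dat entry."""
    cylinders: int  # data cylinders
    heads: int      # tracks per cylinder
    sectors: int    # sectors per track


def block_to_chs(block: int, geom: Geometry) -> Tuple[int, int, int]:
    """Convert an absolute block number to (cylinder, head, sector)."""
    per_cylinder = geom.heads * geom.sectors
    cylinder, remainder = divmod(block, per_cylinder)
    head, sector = divmod(remainder, geom.sectors)
    return cylinder, head, sector


def on_disk(block: int, geom: Geometry) -> bool:
    """True if the block lies within the data cylinders of the drive."""
    return 0 <= block < geom.cylinders * geom.heads * geom.sectors


# ASSUMPTION: Wren IV geometry; not taken from the messages in this summary.
WREN_IV = Geometry(cylinders=1545, heads=9, sectors=46)

for block in (6736, 1835008):
    c, h, s = block_to_chs(block, WREN_IV)
    print(f"block {block} -> {c}/{h}/{s}, on disk: {on_disk(block, WREN_IV)}")
```

Under that assumed geometry, block 1835008 works out to 4432/3/22, matching
the quoted message, and lies well past the last cylinder, which fits the
"sectors way off the end of the drive" behavior. Block 6736 works out to
16/2/20, which is on the disk.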

=============================================================================
Kevin Sheehan (kalli!fourx!toad!kevin@fourx.Aus.Sun.COM) suggests a trip
to TFM to resolve the funny format block numbers:

>
>Part of the problem is that it tries to reconstruct the block from
>the sense data, and failing that, it tries to reconstruct it from the
>data it has around. In the case of a format command, I think what it
>is trying to interpret is *not* the block address. You should check
>the manual and find out what the 'hardware error' is...
>

We are getting a "Vendor 'CDC' error code: 0x9"; unfortunately, we don't
have a manual for the drive itself...

=============================================================================
I-Teh Hsieh (hsieh@crayfish.UCSD.EDU) asks:

>does the drive sound like a spring doing a boi-oi-ing when this happens?

I replied that it did not, but I may have been wrong since I was
listening when format failed, not when I got the block 6736 error.

Apparently he is getting exactly the same error on block 6704, with a
boi-oi-ing sound, and comments:

>I tried the drive in 2 different SUN EXP shoeboxes and I always get
>a reading below 11.2 VDC for the 12V line, which I think is kinda low...
>the drive is draining quite a bit of juice.
>I'm going to try to find a power supply closer to 12V and try again.

=============================================================================
Larry Steury (lesteury@hou.amoco.com) writes that he is having the
same problem on a 4.1.1 IPC:

>And yes, it's always block 6736 with the same message that you get.
>When this error is about to happen, the disk makes the old well-known
>disk "clunking" noise exactly 5 times.
>
>This problem does not seem to be fatal, just annoying. It's been going
>on for at least 6 months on my disk, but my disk otherwise is working
>OK. I have tried re-formatting, with similar results to yours - format
>doesn't find any problems.

While we have run systems with this problem for a while, my experience is
that it eventually turns fatal. Since the users of these machines are
not at all technical, and work with Interleaf TPS4, which does not let
you redirect messages away from the console, it's more than annoying for them..

=============================================================================

Thanks to everyone for their responses; it's nice to know that I'm not the
only one having this problem, and that there isn't some obvious solution
that I'm a bonehead for not having seen right off.

So what am I going to do?

Well, on a long shot, I'm going to ask our techs to open the shoebox and
check to make sure the jumpers are as described by jay@Princeton.EDU
in his response to Jon Stone's question of a few weeks ago about moving
a Wren IV from a 386i to a Sparc (although the drive has always been on
Sparcs). If they look ok, or it doesn't help, I'm going to suggest we send it
back to repair, and say it was still broken when we got it back, as we have
had some come back and work. In the long run, I will certainly suggest any
new disks we get not be Wren IVs..

                                
                                Ted Nolan
                                ted@usasoc.soc.mil


