Subject: SUMMARY: What are "header not found" disk errors?
I thank you all for your thoughts on this problem.  The explanations offered
ranged from failing disks, controllers, and cables to a known controller ECO.
A number of people have seen these errors on Fujitsu and CDC disk drives.
I noticed later that all the errors were on head #11, implying a real disk
problem.  Another drive is reporting a few errors each day, mostly on head
#6.  A third drive has errors on 5 different heads at different parts of
the disk, so I'm going to replace that one.
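For anyone who wants to check whether their own errors cluster on one head,
here is a rough sketch of how the "abs blk" numbers in the syslog lines could
be converted to cylinder/head/sector and tallied per head.  It assumes a simple
linear block layout with no spare sectors.  The geometry constants NHEAD and
NSECT are placeholders, not the real values for the drives discussed here, and
the function names are made up for illustration; take the real geometry from
your disk label.

```python
import re
from collections import Counter

# Placeholder geometry (assumed, NOT taken from the drives in this thread).
NHEAD = 15   # number of data heads
NSECT = 67   # sectors per track

# Matches lines such as:
#   xd0c: write retry (header not found) -- blk #718071, abs blk #718071
LINE_RE = re.compile(
    r"(?P<dev>\w+): (?P<op>read|write) retry \(header not found\) -- "
    r"blk #(?P<blk>\d+), abs blk #(?P<abs>\d+)"
)

def block_to_chs(abs_blk, nhead=NHEAD, nsect=NSECT):
    """Convert an absolute block number to (cylinder, head, sector),
    assuming a plain linear layout with no spare sectors."""
    cyl, rem = divmod(abs_blk, nhead * nsect)
    head, sect = divmod(rem, nsect)
    return cyl, head, sect

def tally_heads(lines, nhead=NHEAD, nsect=NSECT):
    """Count "header not found" retries per (device, head)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            _, head, _ = block_to_chs(int(m.group("abs")), nhead, nsect)
            counts[(m.group("dev"), head)] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage: python tally_heads.py < /var/adm/messages
    for (dev, head), n in sorted(tally_heads(sys.stdin).items()):
        print(f"{dev} head {head}: {n} error(s)")
```

If nearly all the counts land on a single head, that points toward the drive
rather than the controller or cables.  With the wrong geometry the head
numbers will be meaningless, so verify NHEAD and NSECT first.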
I had reported:
me> I've seen problems with disks which produced the following kinds of error
me> messages:
me> 
me> Jan  6 03:15:42 hao vmunix: xd0c: write retry (header not found) -- blk #718071, abs blk #718071
me> Jan  6 03:16:03 hao vmunix: xd0c: read retry (header not found) -- blk #500910, abs blk #500910
Here are excerpts from the replies:
--------------------------------------------------
Date: Tue, 08 Jan 91 11:48:49 PST
From: "Dean S. Messing" <deanm%medulla.labs.tek.com@RELAY.CS.NET>
dean> Leonard,
dean>     We have been experiencing the problem you describe
dean> along with a host of other (possibly related) problems
dean> for months.   Sometimes reformatting fixes the problem,
dean> often for days or weeks.  Then, head header error messages,
dean> bad block messages,  or some other message (e.g. inodes full)
dean> will begin again.  Sometimes the system ends up crashing.
dean> We are running a CDC SMD 9720-850 - a disk almost identical
dean> to yours except for size.  We have replaced cables, controller,
dean> drive boards, and even the drive itself.  CDC was good enough to
dean> loan us a spare for a month.  The problem did not go away,
dean> although on the loaner disk we ran flawlessly for almost
dean> 4 weeks!  After the loaner was returned, CDC (Seagate) did
dean> extensive checks on their disk and found no problems.
dean> 
dean>     The thing we did learn from all our pain is that the
dean> disk was often (but not always) going off-line when these
dean> problems occurred.  One day I just happened to be sitting
dean> near the disk when an error occurred and when I looked at
dean> the disk's front panel, the on-line light was irregularly
dean> blinking on and off.  After this, we noticed the same
dean> behaviour very often when disk problems were happening.
dean> The light never blinked when all was well.
From: curt@ecn.purdue.edu (Curt Freeland)
Subject: Re:  What are "header not found" disk errors?
curt> I was seeing the same thing on some of our XD disk controllers.  We have
curt> been seeing this for 2 years now!  I recently got my hands on a Xylogics
curt> Field Change Notice that says (in part):
curt> 
curt> 	Date: 10/10/89  ECO No: 1757  FCN 753-011
curt> 	Title:  Busy hang - disk bus loading problem
curt> 
curt> 	A bad head address is put out by the 753 during head tags.  Symptoms
curt> 	reported due to this occurring include:  "disk sequencer errors",
curt> 	"drive off cylinder", "header not found", and in many cases the 
curt> 	controller will hang busy.  One of the most common failures is 
curt> 	during the verify pass of Sun's format - verify will stop running
curt> 	and the controller's busy LED will be on solid.  This condition is
curt> 	caused by a D.C. loading problem on the 753's internal disk bus.
curt> 
curt> The fix is to pull out a SIPP resistor pack, and replace a PAL chip.
curt> You can check chip location D5 and see if the chip label has the number
curt> "1085" or "180-001-085" on it, and the SIPP resistor RP11 should be 
curt> missing from the board.  You should also make sure you have the large 
curt> metallic heat-sink with the diodes in it if you have a 753 controller.
curt> Without the heatsink, you could burn up your controller among other things.
Date: Tue, 08 Jan 91 17:24:59 EST
From: trinkle@cs.purdue.edu
trinkle>      What you have is a media failure on the disk.  Most likely it is
trinkle> a head crash.  This means there is physical damage to part of the
trinkle> recording surface of the disk and/or one of the read/write heads.
trinkle> Once one area of the surface is damaged, there are usually some
trinkle> particles (dust) floating around in the sealed drive as a result of
trinkle> the abrasion of the head against the surface.  This dust will then
trinkle> cause more abrasion between the head and other areas of the surface.
trinkle> If the head is badly damaged, then even without dust, the damage to
trinkle> the head may cause physical damage to the surface in other areas.
From: era@niwot.scd.ucar.EDU (Ed Arnold)
Date: Tue, 8 Jan 91 15:30:59 MST
ed> ...get the HDA replaced.
Date: Tue, 08 Jan 91 16:11:28 -0600
From: Gordon C. Galligher <oddjob!oconnor!trevise!gorpong@ncar.UCAR.EDU>
gordon> You have lost, or are losing, your controller, NOT your drive.  We see these
gordon> errors all the time with the Xylogics 450/451 controller cards.  Replace the
gordon> controller, and things should be fine.  Beware that when replacing a
gordon> controller card, it is a "good idea" to reformat the drive.  If the drive
gordon> contains data which you cannot do without, then I suggest bringing the system
gordon> up single user mode and dump'ing what you need, and then reformatting.  It is
gordon> an extra step, but you are then guaranteed of a clean system.
Date: Tue, 8 Jan 91 19:08:38 PST
From: aldrich@sunrise.Stanford.EDU (Jeff Aldrich)
jeff> Similar problems I've had in the past have been due to flaky disk
jeff> controllers or, more rarely, bad cabling or bad connector.  Lots of
jeff> luck!
Date: Wed, 09 Jan 91 09:57:23 +0000
From: James Pearson <jcpearso@ps.ucl.ac.uk>
james> I had a very similar problem about a year ago with one of our Eagles
james> (bad blocks appearing all over the disk, reformatting occasionally
james> working and finding new bad blocks, disk being OK for a couple of days
james> then failing with the same problem etc).
james> 
james> It turned out to be a cable problem. I replaced all the cables and the
james> problem went away. 
From: mailrus!umich!samsung!uunet!anagld.analytics.com!rcsmith@ncar.UCAR.EDU (Ray Smith)
Date: Wed, 9 Jan 91 7:12:45 EST
ray> Leonard,
ray> 	I can't answer your question directly from first hand experience
ray> but I did run your error message through my full-text archives
ray> of sun-spots, sun-managers, sun-nets and sun-flash. I came up with the
ray> following messages which appeared in August 1990. 
ray> 
ray> I hope they help,
ray> Ray
ray> 
me> I haven't included this text, because you have probably already
me> seen it and it is available in the archives.
Date: Wed, 9 Jan 91 08:56:21 -0500
From: eap@bu-pub.bu.edu (Eric A Pearce)
eric> It wasn't clear to me from your letter, but it sounded like you have
eric> replaced a disk drive and the new one failed in the same manner as
eric> the last (?).   If this is the case, I would look elsewhere for the
eric> problem - such as replacing the disk controller and/or cables.
eric> Large fluctuations in room temperature can also cause errors.
eric> We have many of the drives you mention and they run for years without
eric> errors.
Date: Wed, 9 Jan 91 14:29:39 GMT
From: dit@kc.aberdeen.ac.uk
david> We have had two CDC drives go the same way, and a third disk away for checking at
david> the moment. My understanding is that each disk surface is divided into tracks, and
david> each track divided into sectors. Each sector is written with certain information, 
david> such as sector number, checksum etc, and a space left for data. This corresponds to
david> the header followed by the sector size you expect. A 512 byte sector may actually
david> be between 600 and 700 bytes once you allow for the rest of the junk required. 
david> 
david> I am told the surface of our disks started to 'flake' or 'peel', and in any case
david> get thinner. This leads to problems reading the information, but not always, hence
david> the intermittent nature of the problem. This generates the 'header not found' errors.
david> Reformatting writes new (seemingly stronger) information to the disk which can be
david> read OK. The final symptoms are the 'flakes' of disk surface contaminating other
david> areas of the disk in a catastrophic manner.
david> 
david> As an aside, I have also had disks damaged by being moved suddenly. This can either
david> destroy the disk or the head, obliterate part of the surface, or just make bits of
david> the disk unreliable, but formatting usually fixes all but the destroyed disk head.
david> 
david> David Tock     dit@uk.ac.aberdeen.kc
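David's estimate of per-sector overhead can be turned into a quick
calculation of how much raw track space ends up holding user data.  This is
only a sketch: the 600 and 700 byte raw sector sizes are his rough figures,
not a specification for any particular drive, and the function name is made
up for illustration.

```python
def formatted_fraction(data_bytes, raw_sector_bytes):
    """Fraction of raw sector space that holds user data."""
    if raw_sector_bytes < data_bytes:
        raise ValueError("raw sector cannot be smaller than its data field")
    return data_bytes / raw_sector_bytes

# Rough bounds quoted above for a 512-byte sector (assumed, not measured).
for raw in (600, 700):
    print(f"{raw}-byte raw sector: {formatted_fraction(512, raw):.1%} usable")
```

With those figures, somewhere between roughly 73% and 85% of the raw space is
data; the rest is header, checksum, and gaps.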
Date: Thu, 10 Jan 91 11:46:41 -0500
From: mike@park.bu.edu (Michael Cohen)
mike> I usually see this stuff with a winchester with degenerating media.
mike> I would run format on the disks in question after backing them up.