Ladies and Gentlemen:
 
Many thanks for the replies.  The answer is to use icheck and ncheck
to get from block number to inode and thus to file name.  Both are in
the man pages but I had missed finding them with both AnswerBook searches
and man -k probes; I just never had a suitable guess for my searches.
 
A couple of responses said that such messages are normal on a busy
IPI disk but to become concerned if the number of messages starts to
get large.
 
Several of the respondents attached a previous summary (11/30/91) from
this list, so it is included here after the list of respondents
and the original query.
 
Thanks to:
 
blymn@awadi.com.AU (Brett Lymn)
"Andrew Luebker" <aahvdl@eye.psych.umn.edu>
geertj@ica.philips.nl (Geert Jan de Groot)
Steve Elliott <se@computing.lancaster.ac.uk>
Eckhard.Rueggeberg@ts.go.dlr.de
bill@aloft.att.com
etnibsd!dwy@uunet.UU.NET (David Young)
jdschn@nicsn1.monsanto.com (John D Schneider)
Torsten Metzner <tom@uni-paderborn.de>
Dave Wilmot <dawi@is-rocker.gwl.com>
era@niwot.scd.ucar.EDU (Ed Arnold)
 
 ----------------------------------------------------------------
|  Hap Hinrichs                            G W Hinrichs, III     |
|                                          Research Director     |
|  gwh@ecliptic.stat.nielsen.com           A. C. Nielsen Co.     |
|  voice: (708) 498-6320 x2430             Nielsen Plaza         |
|  fax:   (708) 205-4014                   Northbrook, IL 60062  |
 ----------------------------------------------------------------
<><><><>original message<><><><>
<> 
<>Ladies and Gentlemen:
<>
<>We have started seeing messages like:
<>
<>Jul 24 14:04:14 ecliptic vmunix: id000g: block 80 (471824 abs): read: Conditional Success. Data Retry Performed.
<>
<>on one of our IPI disk partitions.  I am planning to do some work on this
<>with format but being a belt AND suspenders type I was wondering if there is
<>some way to find out in advance which files or inodes are involved.  Of
<>course this is not a little used partition (/usr) so I will also be doing a dump
<>of it before I touch it, but I'm trying to find out as much as possible
<>before I go at it.
<>
<>If useful, the system is a 4/490 running 4.1.1.
<>
<>Thanks in advance, summary later
<>
*********************************************************************
*                Previous summary from this list follows            *
*********************************************************************
>From sun-managers-relay@delta.eecs.nwu.edu Sat Nov 30 02:04:40 1991
Date: Fri, 29 Nov 91 16:47:06 GMT
From: Dave Mitchell <D.Mitchell@dcs.sheffield.ac.uk>
Message-Id: <9111291647.AA18794@dcs.sheffield.ac.uk>
To: sun-managers@eecs.nwu.edu
Subject: SUMMARY: finding file associated with disk block
My original query:
> I recently had to repair a bad block on a disk. Unfortunately,
> I now have a file with a block of zeros embedded somewhere in it.
> I know the block number, I need to find which file is using that block.
> I asked sun, they said that there's no command that gives that info.
> Has anyone got a program that can scan the i-nodes to find the block?
> The machine is running 4.0.3, but I have another disk playing up on a 3.4
> machine as well (I know - not even 3.5 !!!)
> 
The answer, as many of you pointed out, is
icheck -b <bad_block#> /dev/r...
This lists (amongst other things) the i-node that refers to that block.
You can then do
find /mount/point -xdev -inum nnnn -ls
or
ncheck -i nnnn /dev/r..
to find the file name(s) associated with that inode.
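For illustration only, the find step can be sketched in Python for a machine where a scripted search is handier. The function name paths_for_inode is made up for this sketch; it imitates "find /mount/point -xdev -inum nnnn" by walking one file system and comparing inode numbers, and makes no attempt to reproduce the -ls listing.

```python
import os

def paths_for_inode(mount_point, inode):
    """Return every path under mount_point whose inode number matches.

    Stays on the file system that mount_point lives on (like -xdev) and
    may return several names when the inode has hard links.
    """
    root = os.lstat(mount_point)
    matches = []
    if root.st_ino == inode:
        matches.append(mount_point)
    for dirpath, dirnames, filenames in os.walk(mount_point):
        # Do not descend into directories that sit on another file system.
        dirnames[:] = [d for d in dirnames
                       if os.lstat(os.path.join(dirpath, d)).st_dev == root.st_dev]
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_dev == root.st_dev and st.st_ino == inode:
                matches.append(path)
    return sorted(matches)
```

Like find, this needs read permission on every directory it visits, so it is best run as root on the affected file system.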
BTW, by a bizarre coincidence, the file I eventually associated with
an (intermittent) bad block was /bin/find itself!
Finally, Keith Farrar <keith%markets@net.uu.uunet> sent me a document
which I have included at the end, on the grounds that it might be
of interest to many people, even though it's quite long.
Thanks to the countless people who replied (too many to list!)
Dave.
----- Begin Included Message -----
                     What File Has The Disc Error?
                             by John Walker
                   Revision 0 -- December 21st, 1989
                                ABSTRACT
                                ========
        When a single block or contiguous area on  a  Sun  (or
        other  Unix) system's hard disc fails, one of the most
        obvious  and  immediately  important  questions   that
        arises is "What file contains the error?".  Amazingly,
        there is no simple, standard utility that answers this
        question, leaving the user knowing that some data have
        been destroyed, but not what.  If backups are current,
        the  user  doesn't know what files to reload after the
        failed area is reassigned to  an  alternate  track  or
        made  unavailable for allocation.  This paper presents
        a cookbook procedure, based on information provided by
        Bob  Elman,  for determining which file contains a bad
        disc block.
                              INTRODUCTION
                              ============
When my hard disc presented me with its  latest  holiday  surprise,  I
ended  up  with  100% repeatable errors on a specific track, head, and
sector.  Immediately after the error occurred, I  ran  an  incremental
backup which, naturally, encountered read errors.  At that point I had
a current set of backups from which I was perfectly willing to  reload
or  rebuild  any  files  that  occupied  the area of the disc that had
failed, but I didn't know which files were involved.  DUMP didn't tell
me, when it so kindly reported an error during the backup; even though
it clearly knows the INODE  number  it  was  dumping  when  the  error
occurred, it didn't deign to print it.
Bob Elman explained the procedure one uses to find what file  contains
a given disc block, and it worked just fine, telling me that the error
was in an executable file I could simply re-link after I'd  fixed  the
disc  by  reformatting  the track that failed.  Since the procedure is
less than obvious and nowhere explained in the Unix manuals I've seen,
I  decided  to write it down so I'd have it at hand the next time this
happened, and to help the next poor sucker victimised by a  hard  disc
failure.  You might want to print this message on a piece of paper and
file it in your system administration manual--when you  need  it,  you
may not be able to get it from a file on your disc.
                            FINDING THE FILE
                            ================
We start out knowing that a hard disc contains one or more bad blocks.
The  first  symptom  that  something  is wrong is usually Unix console
messages reporting I/O errors on the drive.  Most of these  I/O  error
messages  give  the  block number that failed but since Unix reads and
writes large buffers, these numbers should  be  considered  as  giving
only  the  general area of the actual error.  The first step, then, is
to identify the actual blocks that contain the errors.
What Blocks Are Bad?
--------------------
(Sun specific.) Initially, note the drive number from the  disc  error
message.  In a typical message like:
xd1c: write failed (header not found) -- blk #1317140, abs blk #1317140
the  drive  name  is  "xd1c".   To  find  out  what  file  system this
corresponds to, type "df", which will print something like:
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/xd0a              15502    1946   12005    14%    /
/dev/xd0h             514106  430020   32675    93%    /usr
/dev/xd1c             659242  569911   23406    96%    /usr2
/dev/xd0g              42406    8554   29611    22%    /var
In this case, you can  see  that  "xd1c"  is  mounted  as  your  /usr2
filesystem.   (The file systems mounted by default are listed  in  the
file /etc/fstab; those currently mounted are recorded in /etc/mtab.
You can display either file with "cat".)
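As a rough sketch of this lookup, the Python below pulls the device and block numbers out of an error line and finds the mount point in saved "df" output. It assumes the xd-style message format and the six-column df layout shown above; the function names are invented for the example, and other drivers (or syslog-prefixed lines) would need a different pattern.

```python
import re

# Matches lines like:
#   xd1c: write failed (header not found) -- blk #1317140, abs blk #1317140
ERROR_RE = re.compile(
    r"^(?P<dev>[a-z]+\d+[a-h]):.*blk #(?P<blk>\d+), abs blk #(?P<abs>\d+)")

def parse_disk_error(line):
    """Return (device, relative_block, absolute_block), or None if no match."""
    m = ERROR_RE.match(line.strip())
    if m is None:
        return None
    return m.group("dev"), int(m.group("blk")), int(m.group("abs"))

def mount_point_for(device, df_output):
    """Return the 'Mounted on' column for /dev/<device>, or None."""
    for row in df_output.splitlines():
        fields = row.split()
        if len(fields) >= 6 and fields[0] == "/dev/" + device:
            return fields[5]
    return None

DF_SAMPLE = """\
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/xd0a              15502    1946   12005    14%    /
/dev/xd0h             514106  430020   32675    93%    /usr
/dev/xd1c             659242  569911   23406    96%    /usr2
/dev/xd0g              42406    8554   29611    22%    /var
"""

dev, blk, abs_blk = parse_disk_error(
    "xd1c: write failed (header not found) -- blk #1317140, abs blk #1317140")
print(dev, blk, mount_point_for(dev, DF_SAMPLE))
```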
Shut down your system and bring it up single user  with  "b  -s".   In
single  user mode, run "format".  When you fire up format, it asks you
to choose the disc you want to work on; pick the one  from  the  error
message.  For example:
throop# format 
Searching for disks...done
 
AVAILABLE DISK SELECTIONS:
        0. xd0 at xdc0 slave 0
           xd0: <CDC 9720-850 cyl 1358 alt 2 hd 15 sec 66>
        1. xd1 at xdc0 slave 1
           xd1: <Fujitsu-M2372K cyl 743 alt 2 hd 27 sec 67>
Specify disk (enter its number): 1
selecting xd1: <Fujitsu-M2372K>
[disk formatted, defect list found]
Here,  I've entered "1" to choose "xd1".  (The "c" in the device  name
from the error message is a partition name, but at this level  format
is working on the whole disc.)
Next,  we  want  to  get the physical disc address of the block number
reported in the error message.  Enter the "show" command, and type  in
the error block number:
format> show
Enter a disk block: 1317140
Disk block = 1317140 = 0x141914 = (728/2/54)
This  tells  us that the block where Unix encountered the error was on
cylinder 728, head 2, sector 54 (format prints addresses as
cylinder/head/sector; "track" in this note means the cylinder number).
Since we don't know precisely where the error was, we'll sniff around
the two surrounding tracks for errors.
Enter the surface analysis command:
format> analyze
and then enter "setup" to specify the parameters for the analysis:
analyze> setup
Analyze entire disk [yes]? no
Enter starting block number [0, 0/0/0]: 727/0/0
Enter ending block number [1347704, 744/26/66]: 729/$/$
Loop continuously [no]? 
Enter number of passes [2]: 1
Repair defective blocks [yes]? no    <========= INCREDIBLY IMPORTANT!!!! <===
Stop after first error [no]? 
Use random bit patterns [no]? 
Enter number of blocks per transfer [126, 0/1/59]: 1
Verify media after formatting [yes]? 
Enable extended messages [no]? 
Restore defect list [yes]? 
Restore disk label [yes]? 
Here we've set up to scan from the start of track 727 through the  end
of track 729 (the "$" means "the highest number valid in this field"),
reading single sectors.  If we  were  to  use  larger  blocks,  the
precise  location  of  the  errors  would  be  indeterminate.   IT  IS
ABSOLUTELY ESSENTIAL, SURPASSINGLY SO, THAT YOU  ANSWER  *NO*  TO  THE
"REPAIR  DEFECTIVE  BLOCKS" PROMPT.  If you fail to do this, the so-called
"read-only" test will go ahead  and  "repair"  blocks  on  your  disc,
possibly  causing  loss  of  data  in  files.   So much for reasonable
defaults!
Now select the read-only surface analysis:
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? yes
This will scan the tracks you've specified.  Since we're only  looking
at  a few tracks, the comment about taking a long time is another lie.
This command should report the individual sectors with errors.  If  it
doesn't,  welcome  to the world of transient disc errors.  If it does,
note the track, head, and sector numbers of  all  failing  sectors  on
paper, then leave the analyze command:
analyze> q
You  can  then  convert those addresses back to block numbers with the
"show" command:
format> show
Enter a disk block: 728/2/22
Disk block = 1317108 = 0x1418f4 = (728/2/22)
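The arithmetic behind "show" can be sketched as follows. This assumes the three numbers format prints are cylinder/head/sector and that the geometry is the "hd 27 sec 67" reported for xd1 above; the Geometry type and function names are invented for the example. The values agree with the transcripts in this note.

```python
from typing import NamedTuple

class Geometry(NamedTuple):
    heads: int    # "hd" in format's disk description
    sectors: int  # "sec": sectors per track

def chs_to_block(geom, cyl, head, sector):
    """Convert a cylinder/head/sector address to an absolute block number."""
    return (cyl * geom.heads + head) * geom.sectors + sector

def block_to_chs(geom, block):
    """Convert an absolute block number back to (cylinder, head, sector)."""
    track, sector = divmod(block, geom.sectors)
    cyl, head = divmod(track, geom.heads)
    return cyl, head, sector

XD1 = Geometry(heads=27, sectors=67)  # Fujitsu-M2372K entry shown by format

print(chs_to_block(XD1, 728, 2, 54))   # 1317140, the block in the error message
print(block_to_chs(XD1, 1317108))      # (728, 2, 22)
print(chs_to_block(XD1, 744, 26, 66))  # 1347704, the last block offered by setup
```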
Once you have the failing block numbers  in  hand,  you're  done  with
format.  This example has been for a disc with a single partition that
fills it entirely.  If your disc has multiple partitions, you'll  have
to  convert  these absolute block numbers to relative numbers based on
your partitioning of the disc.  The partition/print command will  show
the  current  partitioning, which you can use to bias the cylinder numbers
into their partition-relative addresses.
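A minimal sketch of that conversion, assuming the partition table gives each partition's starting cylinder and that partitions begin on cylinder boundaries. The starting cylinder of 700 used below is purely hypothetical (the example disc has a single partition starting at 0), and the function name is made up.

```python
def partition_relative(abs_block, start_cyl, heads, sectors):
    """Convert an absolute block number to a partition-relative one.

    start_cyl is the partition's first cylinder as shown by partition/print;
    heads and sectors come from the disc geometry.
    """
    offset = start_cyl * heads * sectors  # first absolute block of the partition
    if abs_block < offset:
        raise ValueError("block lies before the start of this partition")
    return abs_block - offset

# Single whole-disc partition, as in the example above: nothing changes.
print(partition_relative(1317108, 0, 27, 67))    # 1317108

# Hypothetical partition starting at cylinder 700 on the same geometry.
print(partition_relative(1317108, 700, 27, 67))  # 50808
```

The log line in the original query, "block 80 (471824 abs)", appears to show the same relationship, with the partition starting 471744 blocks into the disc.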
What I-Node Owns That Block?
-----------------------------
On  Unix, there is no one-to-one mapping of file names to areas on the
disc, since "hard links" can result in a given disc area belonging  to
any  number  of  named  files.   The  Unix  object  that  most closely
corresponds to the notion of a  file  in  most  operating  systems  is
called an "I-Node", and it's  expressed  as  a  number.   The  utility
"icheck",  which  was  part  of the semi-automatic assault guru-driven
file recovery facilities of Unix later largely supplanted  by  "fsck",
has  the ability to determine what I-Node points to a given block.  If
you know, for example, that blocks 1317108 and 1317110 on disc  "xd1c"
contain errors, use the command:
/usr/etc/icheck -b 1317108 1317110 /dev/rxd1c
Bizarre, isn't it?  It just scans numbers until it hits the "/" at the
start  of the disc name.  We specified "rxd1c" because naming the "raw
device" makes icheck run faster.
Icheck will crunch for some time, and if the specified blocks are part
of a file, it will print a line that gives, among  other  things,  the
I-node of the file(s) that contain the given blocks.  Note the I-nodes
on your paper, next to the block numbers.  If no I-nodes were reported
by  this  procedure,  the  error  block  is  not part of any currently
existing file.
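If the icheck run is driven from a script, the argument layout described above can be assembled like this. It is only a sketch: the helper name is invented, it uses just the -b form shown in this note, and it builds the argument list without running anything.

```python
def icheck_command(blocks, raw_device, icheck="/usr/etc/icheck"):
    """Build the argument list for `icheck -b block... raw_device`."""
    numbers = [int(b) for b in blocks]
    if not numbers:
        raise ValueError("at least one block number is required")
    if any(n < 0 for n in numbers):
        raise ValueError("block numbers must be non-negative")
    if not raw_device.startswith("/"):
        # The block list ends at the first argument beginning with "/".
        raise ValueError("device must be a path such as /dev/rxd1c")
    return [icheck, "-b"] + [str(n) for n in numbers] + [raw_device]

print(" ".join(icheck_command([1317108, 1317110], "/dev/rxd1c")))
```

The resulting list could be handed to subprocess.run on a system that still ships icheck; its output format is not shown in this note, so no parsing is attempted here.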
What File Name(s) Correspond To That I-Node?
-------------------------------------------- 
With the I-Node number in hand, we can finally find out what file  was
hit.  If "icheck" has told us the error is in I-Node 87055, we use the
command:
/usr/etc/ncheck -i 87055 -a /dev/rxd1c
to find the file name.  After a while, this will print something like:
/dev/rxd1c:
87055   /usr2/kelvin/acadexe/acad
and at last, the inscrutable is  unscrewed!   The  error  was  in  the
AutoCAD  executable  file,  which  I  can simply re-link.  If the file
hadn't been one so easily  recreated,  it  would  have  to  have  been
reloaded from the most recent valid backup.  Note that if a backup was
made after the error occurred,  and  that  file  was  present  on  the
backup,  an  earlier  backup  should  be  used  since  the copy on the
post-error backup is almost certainly bad.
You can use "ncheck" to search for multiple I-nodes on one pass.   For
example:
/usr/etc/ncheck -i 4142 4131 4102 -a /dev/rxd0g
/dev/rxd0g:
4102    /tmp/vm_fonts-n0
4131    /tmp/tty.txt.a00444
4142    /tmp/rmail
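Output in the shape shown above is easy to turn into a lookup table. The sketch below assumes exactly that shape (a device line ending in ":" followed by "inode path" lines); the function name is made up, and an inode with several hard links would simply collect several paths.

```python
def parse_ncheck(output):
    """Map inode number -> list of path names from ncheck-style output."""
    names = {}
    for line in output.splitlines():
        fields = line.strip().split(None, 1)
        # Skip blank lines and the "/dev/...:" header line.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        names.setdefault(int(fields[0]), []).append(fields[1])
    return names

SAMPLE = """\
/dev/rxd0g:
4102    /tmp/vm_fonts-n0
4131    /tmp/tty.txt.a00444
4142    /tmp/rmail
"""

for inode, paths in sorted(parse_ncheck(SAMPLE).items()):
    print(inode, paths)
```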
Repairing And Reloading
-----------------------
After the location and scope of the damage are established, you should
repair the disc errors and restore the damaged  files.   Since  repair
procedures  are  highly  system-dependent  and,  even  on Sun systems,
differ depending on the type of disc controller and  drive  installed,
you  must  refer to the hardware documentation for your system for the
appropriate procedures.
Note that the Sun documentation talks about "repairing"  sectors  with
errors.   Nobody  I  know  can say for sure precisely what this means:
whether it's a process of assigning that sector's address  to  another
sector on an alternate track, clearing its  availability  bit  in  the
current  bad  spot  list,  marking  it in the original defect list, or
what.  In addition, the problems I encounter most frequently  on  hard
discs  are  destroyed  headers due to failed writes (for example, when
the power fails during a write), which are best fixed by  reformatting
the  area  containing  the errors rather than discarding sectors which
have no physical defects.
In any case, after you've repaired the problem with the disc, you need
to delete all the files containing destroyed data and reload them from
their most recent backups.  As noted above, don't use any  backups  of
error-containing files made after the error occurred, as they probably
contain the same errors as the disc controller was complaining  about.
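The backup rule can be written down as a small sketch. The backup labels and dates below are hypothetical, and the function name is invented; the only logic is "take the newest backup made strictly before the first error".

```python
from datetime import datetime

def last_backup_before(backups, error_time):
    """Return the label of the newest backup taken before error_time.

    backups maps a label (tape name, dump level, ...) to the time it was made.
    Returns None when every backup postdates the error.
    """
    earlier = [(when, label) for label, when in backups.items()
               if when < error_time]
    if not earlier:
        return None
    return max(earlier)[1]

# Hypothetical dump schedule and error time, for illustration only.
backups = {
    "full-week1": datetime(1992, 7, 17, 23, 0),
    "incr-mon":   datetime(1992, 7, 20, 23, 0),
    "incr-fri":   datetime(1992, 7, 24, 23, 0),
}
first_error = datetime(1992, 7, 24, 14, 4)
print(last_backup_before(backups, first_error))  # incr-mon
```

A file that was already damaged before the error was first logged would need an even earlier backup, so treat the result as a starting point rather than a guarantee.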
----------------------- End Included Text -------------------------------------
______________________________________________________________________
| Keith Farrar                                                       |
| AMIX Corporation                                                   |
| Palo Alto, CA                 "Apple is like the Chinese Cultural  |
| (415) 856-1234 x217            Revolution conducted by people in   |
|                                      three-piece suits."           |
| DOMAIN:  keith@markets.amix.com               -John Perry Barlow   |
| UUCP: {uunet|sun|xanadu!}markets!keith                             |
----------------------------------------------------------------------
----- End Included Message -----