Summary: critical: Losing Filesystem

From: V.Sander (zdv123@zam092.zam.kfa-juelich.de)
Date: Thu Aug 24 1995 - 03:10:12 CDT


Hi Sun managers,
My original question was:

----- Begin Included Message -----

Hi SUN-Gurus,

We have a critical problem with a Sparc 20/512 running
Solaris 2.4. The filesystem we use for non-Sun software
(mounted as /usr/local) gets destroyed without any disk error
being logged to /var/adm/messages. I should mention that we
export this filesystem to many clients via NFS/cachefs and
that it is on the second SCSI interface.

What I see is that the
root directory of the filesystem becomes corrupted:
    cd /usr/local
    ls -l

 fails with: cannot read .

After rebooting, fsck asks about every inode
number (starting with inode 2, the root directory). Running fsck -y
mostly results in a clean (meaning empty!) filesystem.
Today the filesystem was not completely empty,
but every directory under /usr/local
(and some more) was moved to lost+found.

Now some more specific information:

    I installed the following patches with
    a public-domain Perl script, fastpatch (!!!!!!):

101753-01 101933-01 102038-01
101829-01 101959-03 102044-01
101878-01 101979-03 102057-13
101879-01 102001-03 102057-14
101880-03 102002-01 102062-03
101902-01 102003-01 102066-04
101905-01 102007-01 102070-01
101907-02 102011-02 102079-01
101920-01 102020-02 102112-01
101921-04 102030-04 102137-01
101922-04 102035-01 102216-01
101923-03 102036-01 102292-01
101925-01 102037-01 102922-01

and last but not least 101945-27!

I believe that the problem is caused by patches that
the fastpatch script did not install correctly.
showrev -p and installpatch -p show everything as fine, but
reinstalling a client that had printer problems showed
that they were gone!!!
I tried to reinstall the patches with installpatch -u -d,
but this does not work.

Any ideas or suggestions about the problem or about reinstalling
the patches???
(Today I replaced the disk.)

----- End Included Message -----
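The patch list quoted above contains patch 102057 at two revisions, -13 and -14. The following is a minimal Python sketch (not part of fastpatch or installpatch) that groups such a list by base number and reports any base that appears more than once. It assumes only that Solaris patch IDs have the form NNNNNN-RR; the function name is hypothetical:

```python
import re
from collections import defaultdict

# Solaris patch IDs: a six-digit base number, a dash, a two-digit revision.
PATCH_ID = re.compile(r"^(\d{6})-(\d{2})$")


def multiple_revisions(patch_ids):
    """Hypothetical helper: return {base: sorted revisions} for bases listed more than once."""
    revisions = defaultdict(list)
    for pid in patch_ids:
        match = PATCH_ID.match(pid)
        if match is None:
            raise ValueError(f"not a patch ID: {pid!r}")
        base, rev = match.groups()
        revisions[base].append(rev)
    return {b: sorted(r) for b, r in revisions.items() if len(r) > 1}


# The patches listed in the original message, including 101945-27.
installed = """
101753-01 101933-01 102038-01
101829-01 101959-03 102044-01
101878-01 101979-03 102057-13
101879-01 102001-03 102057-14
101880-03 102002-01 102062-03
101902-01 102003-01 102066-04
101905-01 102007-01 102070-01
101907-02 102011-02 102079-01
101920-01 102020-02 102112-01
101921-04 102030-04 102137-01
101922-04 102035-01 102216-01
101923-03 102036-01 102292-01
101925-01 102037-01 102922-01
101945-27
""".split()

print(multiple_revisions(installed))  # -> {'102057': ['13', '14']}
```

This only inspects the list itself; it says nothing about whether both revisions were really applied or whether that did any harm.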

First, let me thank everyone who has responded so far.
Most replies suggested that overlapping disk partitions caused
the problem.
format reports that they do not overlap.
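For reference, the overlap check the replies asked about comes down to comparing the start and size of each slice. The Python sketch below is only an illustration: the slice table is hypothetical and typed in by hand (start and size in the same unit, cylinders or sectors), the helper name is made up, and slice 2 is skipped because on Solaris it conventionally describes the whole disk:

```python
from itertools import combinations


def overlapping_slices(slices, whole_disk_slice=2):
    """Hypothetical helper: return pairs of slice numbers whose ranges intersect.

    `slices` holds (slice number, start, size) tuples in one consistent unit.
    The whole-disk slice and empty slices are ignored.
    """
    used = [(n, start, start + size)  # half-open range [start, end)
            for n, start, size in slices
            if n != whole_disk_slice and size > 0]
    overlaps = []
    for (a, a_start, a_end), (b, b_start, b_end) in combinations(used, 2):
        if a_start < b_end and b_start < a_end:
            overlaps.append((a, b))
    return overlaps


# Made-up slice table in sectors, not taken from the disk in question.
example = [
    (0, 0, 204_800),
    (1, 204_800, 409_600),
    (2, 0, 4_096_000),  # whole disk, skipped
    (6, 614_400, 3_481_600),
]
print(overlapping_slices(example))  # -> []
```

Two slices [a, a+n) and [b, b+m) overlap exactly when a < b+m and b < a+n, which is the single comparison in the loop.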

At the moment I am considering three possible reasons:

1.) The filesystem was exported with write access and root access to some
     clients that I administer. If a root process ran unlink() on the
     root directory of that filesystem, the effect would be the same.
     There are two possible candidates for this unlink:

           nfsfind runs find on /usr without the -xdev option, so it
                   also descends into this filesystem, which is mounted
                   under /usr/local.

           cachefs is used on the clients because /usr/local is rarely
                   updated. It could be a cache coherency problem.

      I have stopped root access!

2.) When the disk was first used under SunOS, its label was destroyed
     by a power failure during formatting. I reconstructed the label
     from the information that format's "current" command reported for
     another disk of the same type. Maybe something went wrong while
     formatting the new disk. (I have swapped the disk.)

3.) I figured out a method for reinstalling the patches:
     deleting SUNW_PATCHID=.... from /var/sadm/pkg/..../pkginfo and then
     running installpatch works!

     I have seen several mysterious effects on Solaris 2.4 systems
     patched with fastpatch.
     I no longer use fastpatch.
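Candidate 1 above depends on find crossing into the filesystem mounted under /usr/local. As a minimal Python sketch of what find's -xdev option prevents (an illustration, not the nfsfind script itself; the function name is hypothetical), the walk below does not descend into any directory whose st_dev differs from that of the starting directory:

```python
import os


def walk_one_filesystem(top):
    """Hypothetical helper: yield paths under `top` without leaving its filesystem.

    A directory whose st_dev differs from top's is the mount point of another
    filesystem; it is neither yielded nor entered (roughly find -xdev).
    """
    root_dev = os.lstat(top).st_dev
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        keep = []
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_dev == root_dev:
                keep.append(name)
                yield path
        dirnames[:] = keep  # prune in place so os.walk skips other filesystems
        for name in filenames:
            yield os.path.join(dirpath, name)
```

For example, list(walk_one_filesystem("/usr")) would stay on the /usr filesystem. Unlike find -xdev, this sketch does not list the foreign mount points themselves; it leaves them out entirely.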
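The workaround in item 3 edits a package's pkginfo file by hand. The Python sketch below performs the same kind of edit on a string, assuming each entry is a KEY=VALUE line and that SUNW_PATCHID sits on a line of its own; the helper name, package name and patch ID are made up. Keep a copy of the real file before changing it:

```python
def drop_key(pkginfo_text, key="SUNW_PATCHID"):
    """Hypothetical helper: return (new_text, removed_lines) without `key=...` lines.

    Assumes a plain KEY=VALUE-per-line file; every other line is kept unchanged.
    """
    kept, removed = [], []
    for line in pkginfo_text.splitlines(keepends=True):
        if line.split("=", 1)[0].strip() == key:
            removed.append(line.rstrip("\n"))
        else:
            kept.append(line)
    return "".join(kept), removed


# Hypothetical pkginfo contents; the package name and patch ID are made up.
sample = (
    "PKG=SUNWexample\n"
    "NAME=Example package\n"
    "SUNW_PATCHID=123456-01\n"
    "VERSION=1.0\n"
)
new_text, removed = drop_key(sample)
print(removed)  # -> ['SUNW_PATCHID=123456-01']
```

Working on a copy of the text and returning the removed lines makes it easy to check what would change before touching anything under /var/sadm/pkg.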

Thanks,
Volker



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:10:32 CDT