SUMMARY: How to force fsck?

From: Mike van der Velden (mvanderv@yahoo.com)
Date: Thu Dec 23 1999 - 16:38:01 CST


Hello,

Last week I asked how to force the system to require an
interactive fsck at the next reboot, as part of a break/fix
hardware test for prospective sysadmins. (Full text of
original message is at the end).

In the end, mgmt decided that the test would not be
necessary (whew), so I am now able to offer you this
summary. It should be noted that I did not try any
of these suggestions, as the test was cancelled.

My thanks to the following people for their speedy replies.

   Kris (Unixboy)
   Darren Dunham
   Kevin Sheehan {Consulting Poster Child}
   Michael Stapleton
   Seferino Gardner
   Rich Jankowski
   Thad MacMillan
   Dan Lowe
   Dan Lorenzini
   Annette Lee
   Jerry Lu
   Gary Jenson
   David B. Harrington
   Dan Brown
   Toby A. Rider
   Brett Lymn
   Ian MacPhedran

   
Here are some of the utilities mentioned that can be used to
corrupt a filesystem:
   fsdb(1M)
   unlink(1M)
   clri(1M)
   dd(1M)
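
Of these, dd is the easiest to rehearse safely. Below is a minimal
sketch, assuming a scratch file rather than a real slice; the path
/tmp/scratch.img, the helper name zero_blocks and the block numbers
are made up for illustration and were not part of anyone's reply.

```shell
#!/bin/sh
# Sketch only: practise targeted damage on a scratch FILE, never a real slice.
# IMG, the block numbers and the sizes are illustrative assumptions.
IMG=${IMG:-/tmp/scratch.img}

# Build a 64-block (32 KB) image filled with non-zero bytes so damage shows up.
dd if=/dev/zero bs=512 count=64 2>/dev/null | tr '\000' 'A' > "$IMG"

# zero_blocks FILE START COUNT
# Overwrite COUNT 512-byte blocks starting at block START.
# conv=notrunc keeps the rest of the file intact.
zero_blocks() {
    dd if=/dev/zero of="$1" bs=512 seek="$2" count="$3" conv=notrunc 2>/dev/null
}

# Wipe four blocks starting at block 16.
zero_blocks "$IMG" 16 4

# Count how many bytes are still non-zero afterwards.
remaining=$(tr -d '\000' < "$IMG" | wc -c | tr -d ' ')
echo "non-zero bytes left: $remaining"
```

The same seek= and count= arguments are what limit the damage on a
raw device; without them dd keeps writing until the slice is full.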

   
Here are their suggestions and comments:

 1. You can destroy the primary superblock on a partition then run
    "fsck" to restore it from a backup superblock (block 32 is
    the traditional first choice):

      i. newfs /dev/rdsk/c?t?d?s? (create a new file system on a
         slice if you don't wanna use an existing partition)
     ii. fsck /dev/rdsk/c?t?d?s?
    iii. dd if=/kernel/genunix of=/dev/rdsk/c?t?d?s? count=32 bs=512
         (destroy the primary superblock on this partition)
     iv. fsck /dev/rdsk/c?t?d?s? (you'll see errors now)
      v. fsck -o b=32 /dev/rdsk/c?t?d?s?
        (repair the superblock from block 32)

    If you reboot the machine right after step iv, you can force
    the fsck to run.

 2. Just change the reference to your raw device in the
    /etc/vfstab. This will cause the fsck to fail, as the device
    won't exist and this will drop you to the shell level. This
    is a much safer method than intentionally corrupting a
    filesystem. Whoever you interview will have to figure out
    the correct device and eventually fsck the correct one. Good
    way to test a junior person on fsck, format and basic file
    system principles.
   
 3. To do it in a repeatable way, look at fsdb(1M) and try your
    hand at manually manipulating the filesystem. Great
    opportunity, when you actually *want* to mess up the FS!

 4. If you're looking for a real munge, grab 'fastfs', put your
    /usr filesystem into the fast state and make changes.
    create/delete many files in a directory and pull the plug.
    It's not as repeatable, but you'll sure have the filesystem
    in a very unclean state when it comes back.

 5. I would create a toy file system on another partition first
    (a bit safer) and then hit either a directory or inode part of
    a cylinder group with (eg.)
   
    dd if=/dev/zero of=/dev/rdsk/c0t0d0s4
   
    This messes up the FS, but crudely; it won't give you the
    finer-grained damage like duplicate inodes, etc.

 6. Use the unlink command to cause lost files. It removes the
    directory entry but does not delete the file.
   
 7. Make a directory somewhere in /usr then make another directory
    in the one you just created. Get the inode for the first
    directory (ls -il) then run clri(1M) on that inode. Make sure
    this is on a test system, as there's always the possibility
    that it'll really clobber /usr. You probably will still have
    to just turn the power off to make sure the FS is dirty so
    the fsck will happen on boot.
   
 8. Tell me if I'm off base here, but can't you just make an rc file
    to do this before /usr is mounted? The following files may be
    places to put something like that: /etc/rcS.d/S30rootusr.sh,
    /etc/rcS.d/S40standardmounts.sh

    [It's not really what I was after. I wanted there to be a
    problem to fix, which requires the admin to run fsck, not to
    have fsck run automatically at reboot time. -- Mike]
   
 9. If it's ugly you want, jerking on the power cord and then
    jerking on it again in the middle of reboot ought to do the
    trick. Of course, controlled mayhem is probably available as
    a package from Sun. (hehehehe)
   
10. You might go to your hardware guys and see if they have a bad
    hard drive somewhere that you can put in the test machine, and
    let them play with it that way. Of course, they should never
    be able to successfully fsck that drive, but that's real world.

11. Repartition the drive, overlap something on the /usr partition.
    Newfs the new partition or use it for swap. Guaranteed to hose
    whatever data is there. Don't ask me how I know :-)

12. Perhaps bonus points for someone who can repair a system with
    a broken dynamic library system without resorting to booting
    from CD - another machine available to copy stuff from helps.
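
The superblock drill in suggestion 1 can be rehearsed as a dry run
first. This is only a sketch: superblock_drill is a hypothetical
helper name, it prints the commands instead of running them, and the
slice c0t0d0s4 is just the example device from suggestion 5.

```shell
#!/bin/sh
# Dry run only: prints the commands from suggestion 1, executes nothing.
# superblock_drill is a hypothetical helper; pass it a scratch raw slice.
superblock_drill() {
    dev=$1
    backup=${2:-32}   # backup superblock; 32 is the traditional first choice
    echo "newfs $dev"                                      # i.   fresh filesystem
    echo "fsck $dev"                                       # ii.  should be clean
    echo "dd if=/kernel/genunix of=$dev count=32 bs=512"   # iii. clobber superblock
    echo "fsck $dev"                                       # iv.  should now complain
    echo "fsck -o b=$backup $dev"                          # v.   repair from backup
}

superblock_drill /dev/rdsk/c0t0d0s4
```

If block 32 turns out not to be usable, "newfs -N" on the same slice
should list the backup superblock locations without creating
anything; pass one of those as the second argument.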
    
       
   
Plus some comments on the validity of the test in general:

- This is kind of a difficult test, how are you going to base
  pass/fail? What if you trash one machine to the point of it
  being unrecoverable, and another where fsck recovers without
  incident? Then you'll be hiring people based on which machine
  they got and not their skills.

- If you have to have a hands on test for prospective employees,
  why not give them a machine and have them bring up networking?
  Or change the subnet/router/nameserver/nfs and have them
  reconfigure by hand. Do something like move the libraries, so
  they have to use the static binaries to recover the filesystem.
  This would probably give you a better idea of their problem
  solving skills.
  
- Wow -- this sounds like THE acid test for wanna-bes. If
  somebody had shown me just a glimpse 20 years ago of all the
  nasty things that might (and did) happen, I would have probably
  chosen another vocation.
  
- I have always considered fsck to be a simple task interactively,
  provided you remember to umount the drive. And I had a 36GB hard
  drive I needed to fsck, where I found out about the '-y'
  parameter (after 5 minutes of hitting y, Enter).
  
  

Several people asked for copies of the complete test once I was
done with it. Here it is. You'll notice where I have incorporated
some of the above suggestions into it. I didn't come up with a more
thorough test, as it was cancelled on me.

------------- begin -------------

Screen is blank

  - output device is ttya instead of screen. admin needs to
    fix eeprom setting.

System doesn't boot up, tries to boot from net.

  - the obvious suspect is the eeprom setting "boot-device"
    being set to "net" rather than "disk". What sometimes
    happens after an error, though, is that "diag-switch?"
    ends up true rather than the default "false", so the
    PROM looks at "diag-device" to decide where to boot
    from, and "diag-device" is usually set to "net".
    To fix, change the eeprom setting "diag-switch?"
    from true to false.
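
    The selection rule described above can be written down as a
    small sketch. effective_boot_source is a hypothetical helper
    and the values passed to it are examples, not output from a
    real PROM.

```shell
#!/bin/sh
# effective_boot_source DIAG_SWITCH BOOT_DEVICE DIAG_DEVICE
# Mirrors the rule described above: when diag-switch? is true the PROM
# consults diag-device, otherwise it uses boot-device.
# Hypothetical helper for illustration; it does not touch the eeprom.
effective_boot_source() {
    if [ "$1" = "true" ]; then
        echo "$3"
    else
        echo "$2"
    fi
}

effective_boot_source true disk net    # the broken case
effective_boot_source false disk net   # after the fix
```

    On a running system the real values can be read with
    eeprom 'diag-switch?' and changed with
    eeprom 'diag-switch?=false'; at the ok prompt the equivalent
    is setenv diag-switch? false.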

System can't find boot block

  - often happens after restoring the root filesystem, and
    forgetting to install the bootblock on the disk. Admin
    needs to boot from cdrom, and run the "installboot"
    command on the appropriate disk. I can simulate this
    condition by doing a dump and restore.

/usr needs an fsck

  - the system can't boot up if it can't mount /usr, which it
    can't do if there are errors in the filesystem. I can
    manually corrupt the filesystem so it requires an fsck.
    I want to ensure that /usr is separate from / so that it
    doesn't conflict with the previous test.

  - we can be really sneaky and also have the system try to mount
    the wrong partition. This happens with a typo in the
    vfstab file. Will need to use the "format" command to
    find the correct filesystem to mount.
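
    A sketch of how the vfstab "typo" could be prepared, working
    on a copy of the file. break_fsck_device, the sample path and
    the slice names are assumptions for illustration; on Solaris
    nawk would probably be needed in place of awk, since -v is
    used.

```shell
#!/bin/sh
# Sketch: operate on a COPY of vfstab, not the live file.
# break_fsck_device is a hypothetical helper that replaces the
# "device to fsck" column (field 2) for one mount point (field 3),
# leaving comments and all other entries alone.
break_fsck_device() {
    # $1 = vfstab copy, $2 = mount point, $3 = bogus raw device
    awk -v mp="$2" -v raw="$3" '
        /^#/ || NF < 7 { print; next }   # comments and short lines pass through
        $3 == mp       { $2 = raw }      # swap in the wrong raw device
        { print }
    ' "$1"
}

# Made-up sample entries in vfstab's seven-column layout.
SAMPLE=/tmp/vfstab.sample
cat > "$SAMPLE" <<'EOF'
#device device mount FS fsck mount mount
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s6 /dev/rdsk/c0t0d0s6 /usr ufs 1 no -
EOF

break_fsck_device "$SAMPLE" /usr /dev/rdsk/c0t0d0s7
```

    Keeping the untouched original alongside makes it easy to
    reset the machine between candidates.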

system hangs after a reboot

  - bad entry in the /etc/system file. set maxusers=0.
    Admin will have to boot from cdrom to fix.

can't login as root -- no shell

  - this is a common problem when people improperly change
    root's shell from /sbin/sh. Need to boot from cdrom
    to fix.

user cannot login as root from any terminal

  - caused by a blank CONSOLE= setting in the
    /etc/default/login file, boot -s to fix.

convert a system from complete standalone to fully networked.

  - given the list of required info, such as NIS domainname,
    NIS server name and IP, interface names and IPs, subnet
    mask(s), default router, DNS domain name and name servers,
    have the admin bring up networking from the ground up.

------------- end -------------

Finally, here is my original query:

--- On Dec. 15, 1999, Mike van der Velden <mvanderv@yahoo.com> wrote:

> Hello,
>
> I have to design a small "hands-on" test for some prospective
> system administrators. This test will include troubleshooting
> boot-up problems among other things.
>
> One thing I want to do is corrupt the /usr filesystem enough that
> an interactive fsck is required. I know that just powering off a
> system without a proper shutdown will require an fsck, but this
> usually happens automatically at the next bootup and fsck is able
> to fix it without much fuss.
>
> I want some fuss. Simply removing or renaming certain files will
> undoubtedly impair the bootup process, but it doesn't corrupt the
> filesystem in any way. I want the prospects to have to run fsck.
>
> Does anyone have a command that I can use to accomplish this? How
> can I create duplicate inodes or lost files? Or maybe you know of
> a bug in the OS that will trigger something like this.
>
> Or, perhaps I'm way off base, and I'm better off testing something
> else. Let me know.
>
> Thanks in advance for all your comments. I'll summarize after the
> tests are complete, just in case any of the prospects read this
> list as well. :)
>
> Thanks in advance.
> Mike van der Velden
 

Mike van der Velden
Insurance Corporation of British Columbia

=====
9 days to Y2k. 345 days until the new millennium.