SUMMARY: boot hangs after patch cluster, but works if things are started manually from single-user mode

From: Jeffrey P. Elliott <>
Date: Mon Jun 23 2003 - 12:50:50 EDT
Apologies for the late summary - I had been hoping I would have a 
maintenance window to be 100% sure of the solution, but it has not 
worked out that way. However, I am assured by Sun that things should be 
back to normal  ;)

In the haze of 1:30 a.m. and not thinking clearly, I neglected to check 
the state of the metadevices. It turns out that all slices were in need 
of repair; according to metastat -t, this system had been in need of 
repair for some time. Unfortunately, no one had set up any of the 
suggested cron jobs to monitor for this issue. So the system was 
stopping after the check of the metadevices and doing a metasync -r, as 
called for in the lvm.sync script. (This check is skipped when booting 
into single-user mode.)
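The monitoring cron job mentioned above can be as simple as scanning 
metastat output for the "Needs maintenance" state. Here is a minimal 
sketch; the helper name check_metastat, the install path, and the mail 
recipient are my own assumptions, not part of the original setup:

```shell
#!/bin/sh
# Sketch of a DiskSuite health check suitable for cron.  The function
# name and the crontab entry below are illustrative assumptions.

check_metastat() {
        # Read `metastat` output on stdin; print the name of each
        # top-level metadevice with a "Needs maintenance" state line.
        awk '/^[^ \t]/ { dev = $1; sub(/:.*/, "", dev) }
             /State:.*Needs maintenance/ && !seen[dev]++ { print dev }'
}

# Intended crontab entry (commented out; requires Solstice DiskSuite):
# 0 * * * * /usr/sbin/metastat | /usr/local/bin/check_metastat | \
#       mailx -s "`hostname`: metadevice needs maintenance" root
```

Piping real metastat output through the function hourly and mailing any 
hits would have flagged this box long before the patch-cluster reboot.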

After bringing up the system from single-user mode, we ran metasync -r 
to sync up the filesystems, and all appears to be well. The d30 volume 
is very large (over 30 GB) and took a LOT of time to sync -- which is 
apparently what was happening during a normal boot.

Thanks to everyone who sent in a suggestion.  (To the 13 of you who let 
me know that you were out of the office that day,  I could have done 
without that information.)

Darren Dunham and Jay Lessert sent along ideas on how to make the init 
process a bit more verbose, so that these types of issues are easier 
to find. I can imagine the reasons Sun wants a less chatty boot process, 
but considering all of the things that can go wrong during boot, I 
wouldn't mind a bit more feedback while my systems are booting.

From Darren:
What I will do sometimes is to modify /sbin/rc2 temporarily.  Keep a
backup and add the two "echo" lines below in the appropriate place.

if [ $_INIT_PREV_LEVEL != 2 -a $_INIT_PREV_LEVEL != 3 -a -d /etc/rc2.d ]; then

        for f in /etc/rc2.d/S*; do
                if [ -s $f ]; then
                        echo "Starting to run $f"
                        case $f in
                                *.sh)   .        $f ;;
                                *)      /sbin/sh $f start ;;
                        esac
                        echo "Completed running $f"
                fi
        done
fi

Thanks again!


Original Message:

> Hello All,
> This past weekend, we applied the latest 8_Recommended cluster to an 
> E220R (which appeared to be an original Sol 8 install, and had never 
> been patched before - lucky me). After the installation and reboot, 
> the system hangs after checking the filesystems, i.e.
> ...
> /dev/dsk/md/d20 is clean
> /dev/dsk/md/d30 is stable
> and just stops here. The longest I let it go was probably 20 minutes, 
> just to see if it would eventually do anything. If we boot into 
> single-user mode, and start up all of the things we need by hand, 
> however, the system works just fine, as do all services. (it's a 
> Real/Helix streaming server).
> I'm guessing that there is probably an issue with an rc script, since 
> I can mount the file systems and start services by hand, including an 
> NFS mount.  I'm not familiar enough with the boot sequence to know 
> exactly the route to take from rcS to rc2 (or even rc3) to have walked 
> through the required scripts.
> I don't know if this will help, but here is the vfstab, just in case 
> (and yes, I am also not a fan of these mount points, but I inherited 
> the box):
> fd      -       /dev/fd fd      -       no      -
> /proc   -       /proc   proc    -       no      -
> /dev/dsk/c0t0d0s1       -       -       swap    -       no      -
> /dev/md/dsk/d0  /dev/md/rdsk/d0 /       ufs     1       no      -
> /dev/md/dsk/d10 /dev/md/rdsk/d10        /var    ufs     2       yes     -
> /dev/md/dsk/d20 /dev/md/rdsk/d20        /usr/local/     ufs     
> 3       yes     -
> /dev/md/dsk/d30 /dev/md/rdsk/d30        
> /usr/local/RealServer/Content/  ufs     4       yes     -
> swap    -       /tmp    tmpfs   -       yes     -
>         -       /home           nfs     -       yes     
> soft,quota,bg
> I'm wondering if anyone might have an idea, based on where the boot is 
> hanging, which scripts I can check for problems. I realize that there 
> could be mounting issues with the /usr/local items if done out of 
> order - however, the boot sequence shows them being checked in order, 
> so I am assuming (incorrectly, maybe?) that they would be mounted in 
> that order. (and again, they mount fine by hand).
> Oh, I should also mention that it appears that some services are 
> starting, as the box will respond to a ping from a different subnet, 
> so it must be getting route/network. dmesg confirms this. So does this 
> indicate that parts of the system are hitting rc2.d/S69inet and 
> S72inetsvc? It never makes it to any of the other network-related 
> services, though, such as ssh or the Helix server.
> It also shows a dump to swap that I am unsure about.
> Jun  7 23:04:36 nova genunix: [ID 936769] hme0 is 
> /pci@1f,4000/network@1,1
> Jun  7 23:04:40 nova hme: [ID 517527] SUNW,hme0 : Internal 
> Transceiver Selected.
> Jun  7 23:04:40 nova hme: [ID 517527] SUNW,hme0 :   100 Mbps 
> Full-Duplex Link Up
> Jun  7 23:04:42 nova genunix: [ID 454863] dump on 
> /dev/dsk/c0t0d0s1 size 1000 MB
> Any helpful pointers/suggestions/ideas appreciated.
> Thanks
> jef 
sunmanagers mailing list
Received on Mon Jun 23 12:53:45 2003

This archive was generated by hypermail 2.1.8 : Thu Mar 03 2016 - 06:43:15 EST