SUMMARY: need backup script to detect time-out

From: Dan Penrod (
Date: Fri Aug 04 1995 - 09:59:08 CDT

Many thanks to the sun managers...

To briefly recap the problem -
I have a backup script that runs on my backup server. The
script runs a series of 'rsh <remotehost> dump' commands, piping the dumps
all back to my local 8mm tape robot. If one of the remote hosts happens
to crash while I'm backing it up the backup script hangs and the rest of
the machines never get backed up. I need to find some way to monitor the
remote machine and detect the system as down and/or be able to time-out
after an unreasonable period of time. Due to the sequential nature of
most languages, I couldn't figure out how to check the remote machine at
the same time as I was running a dump command. My script is written in Perl.

I received a bunch of replies... let's see here... I guess about 25.  There
were a number of totally different approaches but the most common approach
was to spawn (fork) a child process to run the dump command and then have the
parent loop while it monitored the child, which would 1) finish normally,
2) crash, or 3) time-out (sort of a default failure).

Here's my Perl test code. Hopefully tomorrow I'll be able to drop it into my
actual backup script. This chunk of code fits my scenario really well and is
pretty illustrative, and... it works! ;-) Let me say right now that my code
here is strongly modeled after the excellent solution provided to me by
Arthur Blais. If you have perl you should be able to cut, paste, and run the
code below after tweaking nothing but the $remote_host and maybe $time_limit
variables...

---
#!/usr/local/bin/perl
# Sample/Test program to show how to run dumps and monitor for hangs

$remote_host = "aesop";
$time_limit  = 30;      # seconds, for test purposes

# spawn the child and wait for dump to complete
unless ($pid = fork) {
    exec("rsh $remote_host ufsdump 0f /dev/null / > /dev/null 2>&1");
}

# babysit the child until it stops or must be killed - unix is so violent!
$done = &babysit($pid, $remote_host, $time_limit);

# deal with the various exit possibilities
CASE: {
    if ($done == 0) {
        print("Terminated Normally\n");
        last CASE;
    }
    if ($done == 1) {
        &kill_kill($pid);
        print("Terminated Due To Loss Of Contact\n");
        last CASE;
    }
    if ($done == 2) {
        &kill_kill($pid);
        print("Terminated Due To Timeout\n");
        last CASE;
    }
}
exit;

# function babysit(): This function babysits the child process
sub babysit {
    local($id, $rhost, $timeout) = @_;
    local($stop_time) = $timeout + time;
    local($now_time);
    local($are_you_there);

    while (1) {             # Loop forever - three possible ways out...
        sleep 10;
        print("Babysitting pid $id...\n");
        $running = `ps ax | grep $id | grep -v grep`;
        print $running;
        if ($running =~ /<defunct>/) {
            return 0;       # *** Child has finished <kludge> ***
        }
        unless ($running =~ /$id/) {
            return 0;       # *** Child has finished normally ***
        }
        $are_you_there = `ping $rhost`;
        unless ($are_you_there =~ /alive/) {
            return 1;       # *** Remote host is dead ***
        }
        $now_time = time;
        printf("Now time:%d  Stop time:%d\n", $now_time, $stop_time);
        unless ($now_time < $stop_time) {
            return 2;       # *** Can't wait any longer ***
        }
    }
}

# function kill_kill(): This function kills the child process and its children
sub kill_kill {
    local($id) = @_;

    local($user) = `whoami`;
    chop($user);            # strip the trailing newline from `whoami`
    local(@ps) = `ps auwx | grep $user`;
    foreach $process (@ps) {
        if (($process =~ /rsh/) && ($process =~ /dump/)) {
            local($a, $pid, $b) = split(" ", $process);
            kill(9, $pid);
        }
    }
    printf("Aborting dump\n");
    kill(9, $id);
    sleep 10;
    return;
}

--- The code is really pretty simple. The things I did differently from
Arthur were mostly a matter of personal taste. Just a couple of notes... The
dump command I'm using in this sample is designed to just burn up some cpu
for testing purposes. You would, of course, plug in your own dump command.
The case statement near the end of the main() function is basically some
printf() statements. You would, of course, have to plug in your own actions
in response to successful or unsuccessful dumps. In the babysit() function I
have one case that I label as a kludge where I'm testing for the child
process to be <defunct>. I usually found that the children wouldn't die
until the parent died. I don't think this is actually a problem; I just
tested for that case and moved on. In the kill_kill() function, I'm
specifically looking for the pids of the children 'rsh dump' processes and
killing them. Pretty simple really.

That about covers it for the summary. There were a bunch of alternative solutions. If you want to see the credits and/or alternatives then read on below. But I'm warning you now, it's pretty darn long (and possibly boring). Otherwise, consider this the end of my summary and get back to work.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| _/    _/  _/_/_/_/    _/_/_/  _/_/_/_/ | Dan Penrod - Unix Administrator |
| _/    _/  _/        _/        _/       | USGS Center for Coastal Geology |
| _/    _/  _/_/_/_/  _/  _/_/  _/_/_/_/ | St. Petersburg, FL 33701        |
| _/    _/        _/  _/    _/        _/ | (813)893-3100 ext.3043          |
|_/_/_/_/  _/_/_/_/    _/_/_/  _/_/_/_/  |                                 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many thanks to the sun manager list in general and especially to those listed below...

From: (Perry Hutchison)
>I have no knowledge of perl, but if I needed to solve this sort of
>problem in sh I would probably run the "rsh dump" in the background
>capturing its process id, then enter a loop consisting of "sleep 10"
>followed by a test to see if the process is still around, repeated
>some appropriate number of times.  Exit from the loop if the process
>is gone; if the loop count expires the timeout has been exceeded and
>you can kill it.

Yea, that's the general idea. Unix shell scripts, C, and Perl all provide
functionality for this kind of multiprocessing.
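For anyone who wants to try it at the sh level, here's a rough sketch of
that loop. This is my own illustration, not Perry's code: the `sleep 100`
is just a stand-in for the real 'rsh <remotehost> dump' pipeline, and the
timeout is shortened so you can watch it fire.

```shell
#!/bin/sh
# Background the long-running job, then poll it with a deadline.
sleep 100 &                      # stand-in for: rsh $host dump 0f - / ...
pid=$!

limit=4                          # seconds to wait before giving up
elapsed=0
result=ok
while kill -0 $pid 2>/dev/null   # kill -0: is the process still around?
do
    if [ $elapsed -ge $limit ]; then
        kill -9 $pid 2>/dev/null # timeout exceeded - nuke it
        result="timed out"
        break
    fi
    sleep 2
    elapsed=`expr $elapsed + 2`
done
echo "$result"
```

If the dump finishes on its own, `kill -0` starts failing and the loop
falls through with result still "ok". (Older scripts in this thread test
`ps $pid | wc -l` instead of `kill -0`; either works.)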

--- From: (Jim Barbas)
>I am faced with the same situation. I believe intr requires that the command
>be run from a terminal. Is that true? Let me know if you find a solution.

Nobody ever responded to my question in the original query regarding the
intr utility. It's not a popular utility. I don't think Sun really wants to
advertise it since they apparently haven't carried it over to Solaris. The
man page specifically says that its sole purpose is for use in /etc/rc.local
scripts to avoid boot hangs. So I don't know; I decided to avoid it.

--- From: Gene Rackow <>
>take a look at the "amanda" backup package from  It
>does what you want plus more.  It works great.

I've looked at amanda. I decided there were things about it I didn't care for and decided to write my own. I know there are a lot of people out there that think amanda is better than mom's chicken soup, but it's not for me.

--- From: (Sangamesh Biradar HCMC Vietnam)
>If you receive any info, could you pls. passon to me.
>
>My setup :
>OS-SunOS 4.1.3_U1
>System : Sparc10,20

You got it, Sangamesh. Hope this is helpful.

--- From: (Rick Pluta)
>One way to solve the problem would be to set an alarm before executing
>the rsh command as follows:
>
>$timeout = 60;                # set timeout as desired (in seconds)
>$SIG{ALRM} = "alarm_handler"; # set up handler for ALRM signal
>$alarm_flag = 0;              # set flag to "no timeout occured"
>alarm($timeout);              # set the alarm clock
>rsh host ufsdump ...;         # do the rsh
>alarm(0);                     # turn off alarm clock
>if ($alarm_flag) {            # if a timeout occured, do whats req'd
>    # do what you need to for timeout
>}
>
>.
>.
>.
>
>#
># Handler for alarm signal
>#
>sub alarm_handler {
>    $alarm_flag = 1;          # indicate that an alarm has occured
>    return;
>}
>
>
>Notes:
>1. The above is just an example which you can expand as needed.
>2. If your running Solaris 2.3, make sure to load patch 101318 which
>   fixes a bug which was keeping signals from interrupting the socket
>   system.
>3. Be sure to run a test.  I use this method from "C", however, I see
>   no reason why it would not work from perl.

This is a great example of using Unix's system function alarm(). It's a great idea. I chose the forking solution over this because I felt like I was getting a little better control. Personal decision.

--- From: Admin|Paul Crutchley X1452 <>
>We do a similar thing in bourne shell, but only wait around a bit, then
>kill the remote shell using bourne shell function (enclosed). If you
>want to keep pinging, edit "hangon()" to suit.
>
>##########################################
>## The following function is to get round those proc's that hang
>##  - give them 120(default) seconds grace then nuke
>
>hangon()
>{
>    elapsed=0
>
>    pid=${1}
>    secs=${2-120}
>
>    while [ `ps ${pid} | wc -l` -gt 1 ]
>    do
>        sleep 2
>
>        elapsed=`expr ${elapsed} + 2`
>
>        if [ ${elapsed} -gt ${secs} ]
>        then
>            kill ${pid} 2>/dev/null
>
>            sleep 1
>
>            kill -9 ${pid} 2>/dev/null
>        fi
>    done
>}
>
>
>... etc ....
>
>##########################################
>## Get OS cause they bloody moved it for Solaris ...
>
>rsh ${workstation} uname -r > /tmp/newdumpOS${$} &
>rshpid=$!
>
>hangon ${rshpid}

Good example of doing it at the Bourne shell level.

--- From: Christopher Weaver <>
>Have you looked at expect?

Yea, expect is great for a lot of things, but when you use it, as the name implies, you really need to know what to expect in terms of dialog from the host. I could probably do it with enough experimentation but the solution I picked is probably better suited to my problem.

--- From: John A. Murphy <>
>I'd recommend moving to amanda. You can get it from
>The README follows.
>
>WHAT IS AMANDA?
>
>This is a beta-test release of Amanda, the Advanced Maryland Automated
>Network Disk Archiver. Amanda is a backup system designed to archive...

It's a cult, isn't it?

--- From: (Peter Allan)
>If the script detects a timeout & kills the rsh (and dump processes)
>won't you be left with an unfinished dump on tape ?
>Are you happy you will succeed in reading past it ?
>Otherwise there's no point in proceeding to the next machine
>in your list.  OK- I have just experimented and found it easy.
>
>How about farming the process out to another script
>(or recursive call of same one) using at(1) to kill its pid
>after an appropriate time ?
>
>Don't do it from the main script of course because all the work will stop.
>After killing the script kill -KILL any remaining dumps after a suitable time,
>and again a minute or two later.
>(because of the way dump runs may processes)
>
>You could be really smart and list the machines you failed on for retrying.
>
>I'm currently trying to back up disks onto tapes that don't always hold them.
>I estimate the size of the file to save (using bru rather than dump in Irix)
>and select a best choice of files that do fit on, with not much wasted tape.
>It also warns me so that I can adjust things for next time that backup is
>attempted.

Sounds like generally the same idea, although using at(1) sounds a little
tricky. By the way... here's MY Perl function to estimate the size of a
backup. Notice that Solaris built an 'S' switch into its dump command to
report the dump size.

>#backup_size(): h=host, l=level, f=filesystem, v=os version d=dump command
># - Returns, in Kilo Bytes, size required for backup
>sub backup_size {
>    local($h,$l,$f,$v,$d)=@_;
>    # --- Solaris (SunOS 5.x) provides a shortcut ---
>    if($v >= 5) {
>        $cmdline=sprintf("rsh %s %s %dS %s",$h,$d,$l,$f);
>        $size = `$cmdline`;
>        $size = $size / 1024;
>    } else {  # --- Use the SunOS Kludge ---
>        local($fld1,$fld2,$fld3,$fld4);
>        $cmdline=sprintf("rsh %s %s %df %s %s 2>&1|grep -i estim|",
>                         $h,$d,$l,"/dev/null",$f);
>        open(filevar,$cmdline);
>        while ($line = <filevar>) {
>            ($fld1,$fld2,$fld3,$fld4) = split(/\s+/,$line);
>        }
>        close(filevar);
>        $size = $fld4 * 512 / 1024;
>    }
>    $retval = $size;
>}

--- From: (Jerry Weber CIC-2)
>Here is our script and it seems to do the right thing when a machine goes
>down or a tape error is encountered.....

I looked over your csh script. I can see that it detects dump failure by
checking the c-shell's $status variable. The problem is that $status only
gets set if the dump returns. In my scenario the dump was never returning.

--- From: "SYSTEM SUPPORT" <>
>I know of a way to do this in the Bourne Shell. There is a "trap" command
>which you can use to tell the script to stop running this command when a
>certain signal is received (i.e. - HUP, kill, TERM, etc.). There are several
>different signals that the script can handle, and I'm sure that the system
>shutdown is included. The only problem is that I don't have a Bourne shell
>programming book here to tell you what signal to catch, and how to use it
>(syntax). I thought I'd let you know that there is a way to do it, and that
>is through the "trap" command. Sorry I couldn't be more help.

That's a novel idea. I wonder if it really works. What would send the shutdown signal? Would it be the shutdown command maybe? In other words can you detect a signal on a machine that's just crashed? Can you detect a signal on a remote machine? Food for thought.
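Since nobody had the syntax handy, here's a minimal demo of trap that I put
together (my own sketch, not from the mail above). One caveat worth
spelling out: trap catches signals delivered to the script itself - say, an
operator killing the backup run - so it can't detect a crash on a remote
machine by itself.

```shell
#!/bin/sh
# Minimal demo of Bourne shell "trap": a child shell traps TERM (15),
# we signal it, and the trap action runs instead of the sleep finishing.
out=`sh -c 'trap "echo caught; exit 0" 15; kill -15 $$; sleep 5'`
echo "$out"     # -> caught
```

In a real backup script the trap action would be something like
'kill $rshpid; exit 1' so a killed run doesn't leave a hung rsh behind.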

--- From: (Robert H. Moyer 733-0208)
>If you want to try an alternative to hacking your own programs take a look
>at amanda the free backup utility developed and maintained by the University
>of Maryland. The real advantages of this program are:
>
>1) Easy to setup and maintain.
>2) Supported by a large community of users.
>3) Runs on most Unix platforms.
>4) Fast - Spawns multiple processes to communicate with multiple clients
>   simultaneously.
>5) Handles problems with timeouts gracefully.

Groupies, all of you. Where's your hacking pride!? ;-)

--- From: (Al Venz)
>Similar problems have been approached by having the rsh forked rather than
>just executed. Then your parent process just waits around for a period of
>time, and either starts the next dump or kills the current one.
>
>You could have it ping every 5 minutes or so, and kill based on that, and
>have it move on to the next whenever a child returns or is killed...

That's the spirit! I'm pinging and timing.

--- From: (Kohler R. P. (Robert))
>I have the same problems, we have SunOS_4.1.3 using sh shell scripts, using
>rsh and dump. It hangs when the remote system locks up or reboots.
>
>Let me know if you hear anything good...
>
>One suggestion I have thought about, but haven't implemented, would be to
>spawn a child just before the dump command that would monitor it. Since
>the dump process doesn't show anything relevant on status, I would look
>at the "rmt" processes generated on the "tape host" machine. If you
>ps -aux them, probably ps -elf in Solaris, you can see the /etc/rmt process
>spawns 2 children "rmt" processes which both increase time at least every
>minute. If they lock up for say 5 minutes without time increases (CPU), then
>I'd say they are stuck, thus you could either 1) send a signal to the dump
>process to abort it, 2) kill the dump process and continue.
>
>I'm not sure what I would do to my tape then, since I put multiple dumps
>on one tape, which I already pre-sorted. I will probably take the rest of
>the dumps scheduled and put them on the next tape in my list.
>
>I've written C code using SIGNAL to alarm you after a time-out, thus I
>could probably set the alarm for X minutes, then system the dump command,
>thus putting a timeout on the dump command also. Then in the shell scripts,
>I could just call the C executable passing the command/timeout I want to
>run.
>
>It sounds like we have alot of stories in common, I'd be interested in
>hearing about your solution...

Here's your opportunity to take care of that little bug. You had the right idea, now here's the code to do it. Get at it! By the way, I find that if I kill the local 'rsh dump' command everything else dies nicely, both on the remote and local host. The SIGNAL's a great idea but I think the forking and looping may be more reliable and offer more specific control; but again, it's a personal choice.

--- From: fmicos!pongo! (Arthur Blais)
>Since you are using perl you can fork a process and have the parent
>watch the child process. I allow about 2 hours per gigabyte for the
>child to complete. If the child doesn't complete in the expected
>time the program kills it and goes on to the next filesystem.
>
>art blais
>
>-------------------------------------------------------------------
>
>    # fork the process and wait for the dump to finish
>    unless ($child_pid = fork) {
>        system("($dump | $bdd) 2>> $logfile");
>        exit 0;    # terminate the child process.
>    }
>
>    $done = &watch_dump($child_pid, $fs_used);
>
>    if ($done) {
>        # Dump finished ok
>        &remove_from_list($target, $backup_list);
>    }
>    else {
>        # dump timed out or didn't execute.
>        # stop the dump by killing all the processes
>        &stop_dump($child_pid);
>    }
>
>
># sub watch_dump
>#
># subroutine watches the child process for normal termination.
># if normal termination, returns 1.  Returns 0, if the dump doesn't
># complete in expected amount of time.
>#
>sub watch_dump
>{
>    local($c_pid, $fs_size) = @_;
>
>    local($curr_time) = time;
>    local($time_len)  = &get_time_len($fs_size);
>    local($stop_time) = $curr_time + $time_len;
>
>    while ($curr_time < $stop_time) {
>        $running = `ps auwx | grep $c_pid | grep -v grep`;
>        if ($running =~ /$0/) {
>            sleep 60;
>            $curr_time = time;
>        }
>        else {
>            return 1;
>        }
>    }
>
>    return 0;
>}
>
># sub stop_dump()
>#
># subroutine kills all dump processes that did not complete in a normal
># amount of time.
>#
>sub stop_dump
>{
>    local($c_pid) = @_;
>
>    local($user) = &get_user;
>    local(@ps)   = `ps auwx | grep $user`;
>    foreach $process (@ps) {
>        if (($process =~ /\/etc\/dump/) || ($process =~ /\/bin\/bdd/)) {
>            local($a, $pid, $b) = split(" ", $process);
>            kill(9, $pid);
>        }
>    }
>
>    print LOGFILE "\nDUMP aborted - exceeded time limit.\n";
>    kill(9, $c_pid);
>    sleep 30;
>    return;
>}

This was, in my opinion, the best answer. Very complete and clear code. This
is what I based my solution on. Thanks a lot!

--- From: (Steve Ehrhardt)
>I haven't done this, but there's a solution that immediately comes
>to mind.
>
>(I'm assuming Bourne shell from here on.)
>
>Run the rsh in background. Start another script in background, passing
>it the PID of the rsh, which will wait for the specified timeout, then
>check to see if the rsh is still running. The second script will
>kill the PID of the rsh if it's still running after the timeout.
>
>The original script issues a wait on the PID of the rsh, then can
>determine whether it worked based on the exit status.
>
>There are a few hazards to this approach, but I can't see any reason
>why it wouldn't work in most cases. Something similar could almost
>certainly be done in PERL as well.
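This scheme is easy to sketch. Again, this is my own illustration of the
idea, not Steve's code: the `sleep 60` stands in for the real rsh dump, and
the watchdog's timeout is shortened so it fires quickly.

```shell
#!/bin/sh
# Steve's scheme: rsh in background, a second background job as the
# watchdog, then wait on the rsh's PID and inspect the exit status.
sleep 60 &                                   # stand-in for: rsh $host dump ...
rshpid=$!

( sleep 2; kill -9 $rshpid 2>/dev/null ) &   # the "second script"

wait $rshpid
status=$?
# status 0 means the dump completed in time; a status over 128 means
# the child died of a signal (the watchdog fired)
echo "status=$status"
```

The nice part is that the main script's wait gives you a real exit status
to branch on, instead of grepping ps output.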

--- From: (Rich Holland) Here's a snippet of perl that will kill a process after the specified timeout period. I haven't implemented it in my backup scheme yet due to lack of time. My scheme is similar to yours where I queue a list of machines to dump and save the data to tape via an rsh pipeline. This code is from a perl script that lets you execute a specified command on a list of hosts with a timeout which kills the rsh process after the specified timeout period.

A couple notes on variables:
  $MAXSH - the max number of shells spawned simultaneously
  @hosts - a list of hosts to execute the command on
  @ARGV  - holds the command to be executed (options have been shifted off)

Take a look at the main while loop and the start() and kill() procedures
especially. Good luck! :)
=============================================================================
[...]
while (1) {
    if (($i <= $#hosts) && ($shells < $MAXSH)) {
        &start;
        $shells++;
    }
    while (($reaped = waitpid(-1, &WNOHANG)) > 0) {
        &completed($reaped);
        $shells--;
    }
    while (($j <= $i) && (!$start[$j])) {
        exit 0 if ++$j > $#hosts;
    }
    if (($start[$j] + $MAXWAIT) <= time) {
        &kill($pid[$j]);
        $shells--;
        $start[$j] = 0;
    }
}
exit 1;

sub start {
    unless ($file) {
        $_ = $hosts[$i];
        split;
        $hosts[$i] = $_[1];
    }
    $name = $hosts[$i];

    $fh = "FH$i";
    if ($command) {
        $argv = "@ARGV";
        $argv =~ s/{}/$name/g;
        $pid[$i] = open($fh, "exec $argv 2>&1 |");
    }
    else {
        $pid[$i] = open($fh, "exec $RSH $name -n \"@ARGV\" 2>&1 |");
    }
    $start[$i] = time;
    $pidslot{$pid[$i]} = $i;
    $i++;
    1;
}

sub completed {
    local($pid) = shift @_;
    local($fh, $slot);

    $ch = STDOUT;
    $ch = STDERR if $?;

    $slot = $pidslot{$pid};
    $fh = "FH$slot";
    @output = <$fh>;
    $output = join('', @output);
    close $fh;
    chop $output;
    &output($slot);
    $start[$slot] = 0;
    undef $pidslot{$reaped};
    1;
}

sub kill {
    local($pid) = shift @_;

    $ch = STDERR;
    kill 15, $pid;
    kill 9, $pid;
    $fh = "FH$pidslot{$pid}";
    close $fh;
    waitpid($pid, 0);
    $output = "rsh to host timed out.";
    &output($j);
    undef $pidslot{$pid};
    1;
}

sub output {
    local($slot) = shift @_;

    if ($list) {
        printf $ch "%s\n", $hosts[$slot];
    }
    else {
        printf $ch "%14s %2d : %s%s\n", $hosts[$slot],
               time - $start[$slot], $nl, $output;
    }
    1;
}

Perl's an amazing language isn't it?!

--- From: (Frank Greco)
>Is a C pgm wrapper helpful to you?

Here's a little ansi-c thing I just whipped up... (I needed to break the boredom of fixing someone's makefiles).

You might have to hack the execle() line to fit your needs though (i.e.,
other cmd-line args to rsh). And you might have to exit(NN) in the timeout()
routine to give your higher-level shell script a return code that indicates
the rsh timed out.

Frank G.

==============================================================
/*
 * Usage:  pgm nodename pgm
 *
 * ex:  pgm godzilla ls
 *
 * If you set the NAPTIME env var to a timeout value (in seconds)
 * and rerun this pgm, this pgm will die unless it completes within
 * the NAPTIME period.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <signal.h>     /* added: needed for signal() and SIGALRM/SIGKILL */
#include <sys/types.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <sys/wait.h>

extern void timeout();

int pid;

int main(int argc, char **argv, char **arge)
{
    int naptime;
    int err;
    int statusp;
    char *cp;

    naptime = atoi((cp = getenv("NAPTIME")) ? cp : "0");

    switch (pid = fork()) {
    case -1:
        perror("Could not fork()");
        exit(1);
    case 0:             /* The Child */
        execle("/usr/ucb/rsh", "rsh", "-n", argv[1], argv[2],
               (char *)NULL, arge);
        perror("Could not exec().");
        exit(0);
    default:
        alarm(naptime);
        signal(SIGALRM, timeout);
        while ((err = wait(&statusp)) != pid)
            ;
    }
}

/****************************************************************/
void timeout()
{
    fprintf(stderr, "timeout: Killing process...");

    if (kill(pid, SIGKILL) == -1)       /* kill the rsh */
        perror("Kill failure");

    fprintf(stderr, "Done.\n");
}

This is a useful snippet.

--- From: Nick Barrowman <>
>I saw your message on Sun Managers. Sorry, I don't have any suggestions.
>However, I am interested in your perl dump script. Could you send me a copy?

Sure. I'll send it out after I finish with this summary.

--- From: (Jeremy Hunt - Optimation)
>why not take a different approach.
>
>Check your dumps over some period and guesstimate the time taken to backup
>each machine. Rewrite your scripts to fire off a seperate process for each
>dump. The main script is just a time manager. If, when the time comes for
>the next process to start, the previous dump process is still running, then
>maybe give it a grace period of say 15 mins then kill it. There are shell
>code examples of finding a named process and killing it in your startup
>files in init.d. You may want to write the expected finish time of the
>current process to a file to keep track of dumps starting later than
>expected.
>
>Obviously this is not as elegant as what you suggest, but on the whole more
>dumps will be done of more machines with this approach.
>
>Good Luck, ... if there is an interrupt style mechanism I would be glad to
>hear of it, ... Jeremy Hunt

That's not a backup script, that's an operating system! Just kidding. ;-)

--- From: Jamie Bubenicek <>
>I don't know perl, but couldn't you just write out a temp file with the
>ip address of the site you are currently dumping, have another script
>that comes by every minute or so and gets the ip address of the machine
>that the other perl script is backing up and ping it. If it's dead, ps,
>grep, cut the process id and kill it (not the perl script but the dump
>command).

Yea, but it sounds a little untrustworthy, don't you think?

--- From: (Kevin Sheehan {Consulting Poster Child})
>You can ping - run the dump in background and keep pinging there. If you
>see the system go down, you can "kill $!"

Ping and Kill; Ping and Kill! Yea. Yea! YEA! Actually, what makes this idea
interesting is that it's reversed from the others. Here we're forking the
babysitter instead of the dump job. I wonder if there are any advantages to
that approach. I'll have to give it additional thought.

--- From: Mark <>
>The easiest thing to do is fork off the child process with fork() and loop
>until the child is finished or the time is up. Otherwise use alarm() in
>the child to signal when the time is up.

Yea, you could use the alarm mechanism in combination with the fork solution but that's a bit too exotic for my tastes. If you're already doing the fork you don't need to use the alarm. It's too easy to just time in a loop.

--- From: (Mattias Zhabinskiy)
>I think You can try to use perl function fork
>to fork your remote dump and then ping from
>time to time remote node from your parent
>process and, if it hangs, kill the child process,
>at the same time You've to wait for a end signal
>from the child for a reasonable amount of time
>and if by some reason dump process hang, but node
>is responding, kill the child anyway.

Is there really some sort of 'END' signal I can check for? As I understand
it, I have to choose one of two ways to detect that the child is done:
1) use the wait() command - but then I can't be looping because I'm busy
waiting, so I'm not really running two processes at the same time, am I?
2) check the child process in my loop by executing 'ps' commands or pings.
That's the one I chose, and quite frankly, it seems a little clumsy to me.
Is it really possible to check for a SIGNAL?

--- From: Daniel_V._D'
>They set a timed watchdog reset out there on the base machine to blow out the
>system tape drive in the event of a system hang. In your case, what you may
>want to do is to fire off a parallel process that keeps pinging the node in
>question until you get some failed replies, then turn around on your machine
>and kill the process for that machine back on board your box.

That seems to be the general gist.

--- Again, many thanks. All the help is greatly appreciated. I hope this
SUMMARY is of use to others.


This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:10:30 CDT