[SUMMARY] Re: Help: why are remote dumps taking so long from crontab?!

From: Michael Fuller (msf@rdt.monash.edu.au)
Date: Mon May 13 1996 - 20:48:46 CDT

        After changing dump hosts from a DECstation 5000
to a SPARCstation 20 running Solaris 2.4, dumps of a Solbourne Sparc 2
clone running 4.1B (== SunOS 4.1.2), run via an rsh/dump/dd pipeline,
became very slow (from approx. 2 1/2 hours to 6 hours).
Further, if the dumps were kicked off from crontab on the Solaris box,
dumps made so little progress that they would have taken days to complete.
I also asked about preferred tape options to use with a 90m/5G DAT unit.

        There is a known performance bug between Solaris 2.x and SunOS
4.1.x that causes very low throughput for rsh connections between
the two. Sun apparently have no plans to rectify the problem.

The solution that I chose was to modify dump scripts to run 'rdump'
on the remote machine, specifying the Solaris box as the tape host.
Simple example:
        rsh remotehost /etc/rdump 0f tapehost:/dev/rmt/0n /dev/sd0a
This requires that each machine be in the other's .rhosts file.
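
A minimal sketch of how a dump script might wrap this for several
filesystems. The host names, the device list and the DRY_RUN switch are
illustrative assumptions, not the actual script:

```shell
#!/bin/sh
# Sketch only: REMOTE, TAPEHOST, TAPEDEV and the filesystem list are placeholders.
REMOTE=remotehost
TAPEHOST=tapehost
TAPEDEV=/dev/rmt/0n          # no-rewind device, so successive dumps follow each other on tape
DRY_RUN=${DRY_RUN:-1}        # hypothetical switch: 1 = only print the commands

# Print the rsh command that runs rdump on the remote machine for one filesystem.
rdump_cmd() {
    echo "rsh $REMOTE /etc/rdump 0f $TAPEHOST:$TAPEDEV $1"
}

for fs in /dev/sd0a /dev/sd0g; do
    cmd=$(rdump_cmd "$fs")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        # Left unquoted on purpose so the string splits into command and arguments.
        $cmd || echo "dump of $fs failed" >&2
    fi
done
```

With DRY_RUN left at 1 the script only prints the commands it would run,
which is a cheap way to check the .rhosts setup by hand first.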

I received no applicable response about the difference between
crontab and non-crontab performance.

With respect to tape options, the sun-managers archive provided
a Summary item that answered the question. It suggested:
tape size = 5000 (fictitious tape size to fool dump)
blocking = 126 (63k: maximum for SCSI bus)
density = 61000 (real tape density)
I have chosen to use a tape size of 15360 (also suggested in the
summary) to try to convince dump that it can fit more than 2G on a tape.

See http://www-dccs.stanford.edu:80/lists/sun-managers/hyper94/0692.html
for more details.
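
As a sketch of how these options might be passed, assuming the remote
dump accepts the usual b (blocking), d (density) and s (size) key letters
and takes their arguments in key-letter order. Host and device names are
placeholders:

```shell
#!/bin/sh
# Sketch only: values from the summary above; host/device names are placeholders.
BLOCKING=126      # b: 126 x 512-byte blocks = 63k per write
DENSITY=61000     # d: tape density
SIZE=15360        # s: fictitious tape size, so dump believes more fits on one tape

# Key letters that take an argument (b, d, s, f) consume the following
# words in the same order as the letters appear.
tape_dump_cmd() {
    echo "/etc/rdump 0ubdsf $BLOCKING $DENSITY $SIZE $1 $2"
}

# Print the command rather than running it.
tape_dump_cmd tapehost:/dev/rmt/0n /dev/sd0a
```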

Thank you very much to all that responded; your advice has been very helpful:
        Andrew J Cosgriff Andrew R. Tefft Brian Desmond
        David Abbott Glenn Satchell Kenneth Simpson
        Paul Eggert Peter Bestel Thomas Buehlman
                                 Will Nelson

Michael Fuller
Systems Administrator, msf@dgs.monash.edu.au
Department of Digital Systems, Monash University, Ph: +61 3 9905 3218
Clayton 3168, Victoria, Australia. Fax: +61 3 9905 3574

----------------- Original article and edited responses -----------------------
My original posting:
>Date: Wed, 24 Apr 1996 12:06:29 +1000
>From: Michael Fuller <msf>
>Subject: Help: why are remote dumps taking so long from crontab?!
>Summary: dumps run from crontab of remote Solbourne to DAT are *very* slow
>Keywords: dump,Solbourne,Solaris 2.4,backup,DAT
>Newsgroups: aus.computers.sun,alt.sys.sun,comp.unix.solaris,comp.sys.sun.admin,comp.unix.admin
>Mailed-To: sun-managers@ra.mcs.anl.gov
>Followup-To: poster
>Reply-To: msf@dgs.monash.edu.au

A number of months ago we acquired a new Sparc 20 running Solaris 2.4,
to take over as our main file-server. At the same time, we purchased
a 5G 4mm DAT unit to handle our local backups.

Unfortunately, whilst dumps of filesystems local to the Sparc20 run fine,
as do remote dumps of Ultrix and OSF/1 boxes, remote dumps of our Solbourne
S4000 machine (Sparc 2 clone, running Solbourne OS 4.1B == SunOS 4.1.2) take
forever. And remote dumps of the Solbourne that are started from crontab
take ten times as long!

Local ufsdumps are fine: 600-800K/sec
Dumps of our DECstation are fine: 400M in 20-odd minutes
But dumps of the Solbourne: 760M in *over* *four* *hours* !!

And that's *only* if I run the dump from an interactive shell.
If I fire off the dump from cron, such dumps would take (I project)
at least *two* *days* !!

Previously, before moving backup responsibilities to the new Sparc,
dumps of the Solbourne would take an hour or two overnight. They now
take 6 1/2 hours, and then only if I run the dump script manually from
an interactive shell before leaving in the evening!

A typical command for our remote dumps (fired off by a complex shell script):
rsh remotemachine.dgs.monash.edu.au /etc/dump 0uf - /dev/rz0d | dd ibs=8k obs=8k of=/dev/rmt/0n

Can anyone provide any suggestions as to what the problem is and how to
combat it?

[Side question: can anyone suggest a more appropriate block size to use for
the DAT unit? 8K is way too small, I'm sure ...]

Michael Fuller
Systems Administrator, msf@dgs.monash.edu.au
Department of Digital Systems, Monash University, Ph: +61 3 9905 3218
Clayton 3168, Victoria, Australia. Fax: +61 3 9905 3574

[msf: Here are the responses I received. I have edited them for brevity;
      I trust that no one objects.]

>>From: brian@chimera.psych.unimelb.edu.au (Brian Desmond Rm 927 Ext 4208)

I hope you get a good answer. In the meantime, here is something that
has helped me achieve better transfer rates with remote dumps:

/usr/sbin/ufsdump 9uf - /filesystem | rsh -l <user> <hostname> \
                        '/usr/local/bin/buffer -s <blocksize> > /dev/rmt/0cn'

I can dump /usr (about 175 Mb) from a Sparc 10 to another Sparc 10, both
running Solaris 2.4, in 5 minutes using this.

[msf: man page for "buffer" deleted. I've run across buffer before: it's
      a specialised program for doing fast re-blocking. It should be locatable
      via archie, but I didn't bother.]

>>From: David Abbott <dwa@sybase.com>

Isn't this the Solaris <-> SunOS rsh performance bug recently mentioned here -

>From - Thu Apr 25 13:11:36 1996
Path: sybase.com!halon!uunet!in2.uu.net!newsfeed.internetmci.com!info.ucla.edu!news.cs.ucla.edu!twinsun!not-for-mail
From: eggert@twinsun.com (Paul Eggert)
Newsgroups: comp.unix.solaris,comp.sys.sun.admin,comp.unix.admin
Subject: Re: Help: why are remote dumps taking so long from crontab?!
Date: 24 Apr 1996 03:28:16 -0700
Organization: Twin Sun Inc, El Segundo, CA, USA
Lines: 25
Distribution: inet
Message-ID: <4lkvo0$s59@bird.twinsun.com>
References: <4lk2b1$6p4@harbinger.cc.monash.edu.au> <4lksp6$3dj@gcsin3.geccs.gecm.com>
NNTP-Posting-Host:
Xref: sybase.com comp.unix.solaris:17344 comp.sys.sun.admin:11606 comp.unix.admin:6834

Dave Miller <dave.m.miller@gecm.com> writes:

> I would suggest that your dump command is wrong.
> First use rdump as opposed to piping dump into dd

That's good advice, but I'm afraid it glosses over an important point. There's a horrible performance bug when using rsh to transfer data from a SunOS 4.1.x host to a Solaris 2.x host.

Here is a simple way to reproduce the problem. Try this command on a Solaris 2.5 host:

time rsh -n H exec dd if=/dev/zero bs=64k count=64 > /dev/null

where H is a SunOS 4.1.4 host. This command should take about 8 seconds real time on an otherwise idle 10 Mb/s Ethernet. (And it will take 8 seconds if you run it on a SunOS 4.1.4 host.) But it takes 120 seconds due to the horrible performance bug.

The only workaround is to not use rsh; in the case we're talking about, this can be done by using rdump instead of dump | rsh, but in other cases the workarounds are considerably more painful.

Sun knows about the problem but isn't planning to fix it; they haven't even issued a bugid for it, as far as I know.

[msf: And indeed that seems to have been the cause of the slowdown in dump
      speed after the move to the Solaris tapehost. It still doesn't explain
      the incredible drop in throughput between dumps run from cron and
      dumps not run from cron.]
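
For scale, the figures in Paul Eggert's test above can be turned into
rough throughput numbers. This is only integer arithmetic on the quoted
values (64 blocks of 64k, 8 seconds expected, 120 seconds observed); the
function name is made up for the sketch:

```shell
#!/bin/sh
# 64 blocks of 64k, as in the dd test above.
BYTES=$(( 64 * 64 * 1024 ))

# Integer throughput in KB/s for a transfer of BYTES taking $1 seconds.
kb_per_sec() {
    echo $(( BYTES / $1 / 1024 ))
}

echo "total:            $(( BYTES / 1024 / 1024 )) MB"
echo "expected (8 s):   $(kb_per_sec 8) KB/s"
echo "observed (120 s): $(kb_per_sec 120) KB/s"
```

That is roughly 512 KB/s against 34 KB/s, a factor of 15.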

>>From: Peter.Bestel@uniq.com.au (Peter Bestel)

Ensure that you use blocking with the dump command and don't try to re-block
the input using dd.

Try the following:

rsh remotemachine.dgs.monash.edu.au /etc/dump 0ubf 126 - /dev/rz0d | dd obs=8k of=/dev/rmt/0n

[msf: I do now specify the blocksize for dump to use.]

>>From: Will.Nelson@Eng.Sun.COM (Will Nelson)

This combination of using rsh to do a dump to stdout and then piping through
dd has way too much overhead in it.

I would think that kicking the backup off from the remote machine and using rdump would be better:

ufsdump 0uf localmachine:/dev/rmt/0n /dev/rz0d

Am I missing something here?

[msf: This is what I now do for our non-Ultrix boxes, avoiding the 'rsh' bug.]

[msf: Glenn originally responded:]

>>From: Glenn.Satchell@uniq.com.au (Glenn Satchell - Uniq Professional Services)

Is the Solbourne separated by a bridge from the Sparc 20? This may be
impacting the backup time. Have you tried running snoop to see why the
packets aren't getting through? You may need to get a network sniffer onto
your ethernet to see what is going on, but it would seem that there is some
sort of network problem. What is the throughput like for ftp and NFS between
the Solbourne and the Sparc20?

As for your block sizes, what about something like this:

rsh somehost.what.ever /usr/etc/dump 0bf 96 - /mount/point | dd of=/dev/rmt/0n bs=48k

You need to specify the block size in the dump command as well as the dd.
48k seems to be the block size most used for 4mm tapes. For 8mm tapes use
63k (126 blocks).

[msf: I then asked him:]
> This is a fair point, as the Solbourne is on thin ethernet on the
> other side of a DMPR with the Sparc on utp on this side. But the
> question I have is perhaps not relevant? Remember, the same script
> happily dumped over the same network configuration when the tape
> unit (then an Exabyte) was on a DECstation, and also that dumps
> of another Sparc in an equivalent network setup (different but "parallel"
> thin ethernet loop) run fine.
[msf: and]
> Ta. bs=96 for the dump but 48 for the dd? I also have a feeling that
> it's necessary to explicitly set both ibs & obs for dd under Solaris
> for some reason.
[msf: to which he responded:]

>>From: Glenn.Satchell@uniq.com.au (Glenn Satchell - Uniq Professional Services)

I'd still be suspicious. What are the collision rates like on the thinnet?
As you have two similar nets that's handy because you can run different
tests on both systems and compare results to see if there's much difference.
I think you might find ftp and NFS throughput comparisons useful.

dump expects its blocking factor in 512 byte blocks, ie 96 * 512 = 48k; for
dd I specified bs=48k so that's in kilobytes already. Same goes for ibs and
obs if you need to use them.

[msf: When I have a chance, I will look into network performance as I
      suspect that it is poorer for the thinnet part of our network cf. the
      UTP portion. He also provided some feedback on block sizes when I
      queried his initial response.]
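
Glenn's point about units can be checked with a little shell arithmetic.
A small sketch (the function name is made up) that converts a dump
blocking factor into the matching dd block size:

```shell
#!/bin/sh
# dump's b argument counts 512-byte blocks; dd's bs=NNk counts kilobytes.
dump_blocks_to_k() {
    echo $(( $1 * 512 / 1024 ))
}

# The two blocking factors mentioned in the responses above.
for b in 96 126; do
    echo "dump b=$b  <->  dd bs=$(dump_blocks_to_k $b)k"
done
```

So b=96 pairs with bs=48k, and b=126 with bs=63k.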

>>From: Paul Eggert <eggert@twinsun.com>

rsh from SunOS 4 to Solaris 2 has terrible performance. Sun knows about the problem, but they haven't issued a bugid for it (and I suspect they never will).

For dumps use rmt instead, e.g.

rsh remotemachine /etc/dump 0uf localmachine:/dev/rmt/0n /dev/rz0d

You'll have to give up on your fancy blocking; too bad.

[msf: Again, spot on about the rsh bug. Still doesn't explain cron/non-cron
      performance discrepancy.]

>>From: buehlman <buehlman@iwf.bepr.ethz.ch>

I am not sure whether I can help you, but I can definitely give some hints
on how you could go on with further searches...

a) if you really have SunOS 4.1.2 over there, how about using rsh followed
   by ufsdump or rdump (whichever existed in SunOS 4.1.2, I believe it is
   the latter.) The point is that they are able to dump directly to tapes
   across the net with the same syntax (for options) as dump/ufsdump.
b) A short check of the manpage reveals that the unit size for dd options
   is bytes. (I know the example shows 8k) but are you sure this is really
   understood as 8192 bytes?
c) lastly you should check whether /dev/rmt/0nb would be helpful.

[msf: a) a good idea, b) yes, 8k does mean 8192 bytes to dd, c) didn't try it]

>>From: Andrew J Cosgriff <ajc@bing.apana.org.au>

Any reason you don't use /etc/rdump 0uf backuphost:/dev/rmt/0n ? That's what
I do here for our backups - all our machines are running Solaris 2.3 or 2.5,
but were originally running SunOS 4.1.2 / 4.1.3 doing the same thing with
none of these speed problems.

[msf: Another vote for rdump]

>>From: ken_simpson@tmai.com (Kenneth Simpson)

Hi - lose the pipe to dd and use rdump instead.

>>From: ken_simpson@tmai.com (Kenneth Simpson)

Hi - we were running our dumps on SunOS 4.x using a pipe to dd. When we
switched to Solaris 2.x, the time for the dumps suddenly increased by a
factor of roughly 4. Same tape drive, same tape, same CPUs - different OS.
Haven't thought much about why - simply no time to explore.

[msf: Apparently because of the 'rsh' bug]

[msf: the only suggestion I received that related to the difference between
      performance when run/not run from crontab:]

>>From: teffta@erie.ge.com (Andrew R. Tefft)

A friend here found the answer. By default, for non-root users, cron jobs run at reduced priority. I never noticed because my dumps run out of root's crontab.

See 'man queuedefs'. Whoom! :-)

[msf: Unfortunately, my dumps ran as root, so this doesn't seem to be the
      answer]
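
For anyone who does want to check the cron priority setting, here is a
small sketch that pulls the nice value out of a queuedefs-style entry. It
assumes the entry format q.[njobj][nicen][nwaitw] described in the
queuedefs man page; the function name and the sample entries are
illustrative only:

```shell
#!/bin/sh
# Sketch: print the nice value from a queuedefs-style entry such as "c.100j2n60w".
# Assumed format: queue letter, ".", then optional <njob>j, <nice>n, <nwait>w fields.
# Prints nothing if the entry has no "n" field.
queue_nice() {
    echo "$1" | sed -n 's/^.*[.j]\([0-9][0-9]*\)n.*$/\1/p'
}

# Illustrative entries, not taken from a real system.
for entry in c.100j2n60w a.4j1n b.2j2n90w; do
    echo "$entry -> nice $(queue_nice "$entry")"
done
```

On a Solaris box the real entries would come from the queuedefs file that
the man page describes.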
