SUMMARY: large data repository sync over WAN

From: Rob Windsor <>
Date: Thu Feb 22 2007 - 12:47:24 EST
I received many responses; some pointed at tools (which is what I was 
looking for, honestly), but most had a common theme to them.  :)

Original Post:
> We need to sync 10TB of data in small files from one North American 
> coast to the other.
> Our tentative plans are to sneakernet the data and then use some form of 
> sync to catch up the delta.
> Aside from bandwidth constraints, we found that rsync quickly craps out 
> with large numbers of files.
> What tools have you used to do this?

Most popular question:
> Does all 10TB of it change daily?

No.  The data comes in two flavors:
* Oracle DBF files (yes, changes daily), less than a TB here
* Small static files, the files themselves don't change, their count
   simply increases
   - side note: These files are about 16 subdirs deep and heavily
     scattered (er.. I mean.. "distributed")

Other common questions/comments:
> You didn't specify how rsync craps out, but I'm guessing

I forget the specifics, but it was basically "out of memory" due to the 
number of files and subdirs it had to dig through.

> what version of rsync you're using

2.6.8 (looking at 2.6.9 now to see if it addresses any of the problems 
we've had)

> but you can often throw ram at the issue.

Not in this case, unfortunately.

> In addition, you can fire off rsync on a subtree so it has less work to
> do.

That's certainly a consideration.  It won't be easy (cf. "about 16 
subdirs deep" above).
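For what it's worth, the per-subtree idea can be sketched as a small shell wrapper -- one rsync invocation per top-level directory, so each run only has to hold a fraction of the file list in memory.  The function name, paths, and remote host below are my own placeholders, not anything from the original thread:

```shell
# Sketch: split one huge rsync into one run per top-level subtree,
# so each invocation builds a much smaller in-memory file list.
sync_subtrees() {
    local src=$1 dest=$2
    local dir sub
    for dir in "$src"/*/; do
        sub=$(basename "$dir")
        # -a preserves perms/times; trailing slashes sync contents
        rsync -av "$dir" "$dest/$sub/"
    done
}

# Usage (hypothetical paths):
# sync_subtrees /data remotehost:/data
```

With a tree that's 16 levels deep you'd likely want to descend a few levels before splitting, but the shape of the loop is the same.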

Then there were these:
> (Deborah Santomauro) Have you tried "rdist"?
> (Anthony D'Atri) rdist 6 with SSH as the transport works great for managing files.

Holy cow, now that's oldschool love!  I'll look into that.
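For anyone else digging into this: rdist 6 is driven by a "Distfile" describing what goes where, and its -P flag lets you swap in ssh as the transport.  The host, user, and paths below are invented for illustration, and the Distfile is a minimal sketch rather than a tested config:

```shell
# Write a minimal rdist Distfile: variable definitions, then a
# "sources -> hosts" rule.  "-oyounger" tells rdist to update only
# files whose mtime is younger on the source side.
write_distfile() {
    cat > "$1" <<'EOF'
HOSTS = ( syncuser@eastcoast.example.com )
FILES = ( /data )

${FILES} -> ${HOSTS}
	install -oyounger ;
EOF
}

# Usage (hypothetical host; -P picks ssh instead of the old rcmd):
# write_distfile Distfile
# rdist -P /usr/bin/ssh -f Distfile
```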

Brad Morrison mentioned:
> I think cpio has a flag to skip files with equal or newer mod dates,

Yeah, we've also considered something like:
    "rsync -av `find . -newer <somefile> -print` dest:/path"
just to limit the volume of files that rsync has to consider.
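One caveat with the backtick version above: with enough new files the command line blows past the argv limit.  A sketch of the same "only new files" idea that avoids this is to pipe find's output into rsync's --files-from option, which reads paths (relative to the source root) on stdin.  The /data root and stamp-file name here are my assumptions, not from the thread:

```shell
# Emit paths of files newer than a stamp file, relative to $root,
# one per line -- suitable for rsync --files-from=-.
collect_new_files() {
    local root=$1 stamp=$2
    (cd "$root" && find . -type f -newer "$stamp" -print)
}

# Usage (hypothetical paths):
# collect_new_files /data /data/.last-sync |
#     rsync -av --files-from=- /data dest:/path
# touch /data/.last-sync   # advance the stamp after a clean run
```

Since the static files only ever get added, never modified, an mtime stamp like this catches exactly the delta.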

There was mention of NetApp, zfs, VxFS/VxVM, which aren't options in 
this situation.  As much as I tried to get to zfs, it wasn't available 
at the time we upgraded the DB/file servers to Sol10.

Hutin Bertrand mentioned an app called "aide", which is an Intrusion 
Detection tool (think Tripwire) that you can use to spot files/subdirs 
that have changed.  Interesting find.
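The aide approach boils down to: snapshot the tree into a database once, then diff against it later to list what's new.  A rough sketch, with made-up paths and a config pared down to one watched tree (the "R" rule is aide's standard read-only check group):

```shell
# Write a minimal aide.conf: where the databases live, plus one
# watched tree.  Paths are placeholders.
write_aide_conf() {
    cat > "$1" <<'EOF'
database=file:/var/lib/aide/sync.db
database_out=file:/var/lib/aide/sync.db.new
/data R
EOF
}

# Usage (assumed paths; run where aide is installed):
# write_aide_conf /etc/aide-sync.conf
# aide --config=/etc/aide-sync.conf --init    # take the baseline
# aide --config=/etc/aide-sync.conf --check   # later: report added/changed files
```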

Gedaliah Wolosh pointed me at AFS, and Karl Rossing mentioned AVS.

AFS is quite an endeavor; we're not quite prepared to go that route.

AVS looks interesting, we might be able to do something with that, if 
heavy rsync-frobulation doesn't work out.

Thanks all!

Internet:                             __o
Life: Rob@Carrollton.Texas.USA.Earth                    _`\<,_
                                                        (_)/ (_)
"They couldn't hit an elephant at this distance."
   -- Major General John Sedgwick
sunmanagers mailing list
Received on Thu Feb 22 12:48:44 2007
