Running s3sync in parallel
s3sync is a great tool for synchronizing local data with Amazon S3, for backups or for whatever other reason you might want to put your data on S3. It is very simple to install (gem install s3sync) and use (s3sync -v -s -r --progress <source_dir> s3_bucket:<dir>); it runs very well and can easily be scripted to do regular backups or even to synchronize live data with S3. The only problem I found while using s3sync was that it can be very slow when uploading a lot of data (millions of files) to S3. This is partly because each transfer is slow, but also because it uploads a single file at a time and doesn’t do several uploads in parallel. I would have loved for s3sync to do this out of the box, but unfortunately it doesn’t. For my particular need I was able to work around this by running several s3sync commands at the same time. This will not apply directly to your data (unless it is structured the same way as here, which is very unlikely), but it might give you an idea of how to do this with your own data if it is structured in a suitable way.
Ok, for this particular upload I am sync’ing a few million files in folders structured like this:
000/000/files…
000/001/files…
…
999/999/files…
The process was taking days with a single s3sync running, so I put together a small script to run s3sync on several top-level folders at the same time. This reduced the time a lot and was a good workaround for our problem. Here is the script I used, in case it helps others:
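#!/bin/bash
# Keep up to 20 s3sync processes running, one per top-level folder (000..999).
cd /source_top_folder || exit 1
id=0
# -le (not -lt) so that the last folder, 999, is included.
while [ "$id" -le 999 ]; do
  sleep 10
  echo "."
  # Count the s3sync (ruby) processes currently running.
  running=$(ps -ef | grep s3sync | grep ruby | wc -l)
  if [ "$running" -lt 20 ]; then
    lid=$(printf "%03d" "$id")
    echo "starting a new s3sync - $lid"
    /usr/bin/s3sync -p --no-md5 -v -s -r --progress --delete "./$lid/" "my_bucket:$lid/" &
    id=$((id + 1))
  fi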
done
This basically keeps 20 s3sync instances running, starting a new one every time it is needed (whenever the number running drops below 20). I realize this is not perfect, but it has done its job for us on this particular project. Ideally s3sync would be able to run several parallel upload threads to be much faster, but until then you might use a similar solution if you have such a problem.
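One possible refinement of the same idea, as a rough sketch only: instead of polling ps every 10 seconds, the shell itself can track how many background jobs are running and start the next one as soon as a slot frees up. This assumes bash 4.3 or newer (for wait -n). The names SYNC_CMD, sync_folder and run_throttled are made up for this illustration and are not part of s3sync; the s3sync options are simply the ones from the script above.

```shell
#!/bin/bash
# Sketch: run one command per folder, at most N at a time, using `wait -n`.
# Assumes bash >= 4.3. SYNC_CMD, sync_folder and run_throttled are
# illustrative names, not part of s3sync.

# Command prefix run for each folder (same options as the original script).
SYNC_CMD=(/usr/bin/s3sync -p --no-md5 -v -s -r --progress --delete)

# Sync a single top-level folder, e.g. "007" -> ./007/ to my_bucket:007/
sync_folder() {
  "${SYNC_CMD[@]}" "./$1/" "my_bucket:$1/"
}

# run_throttled MAX FOLDER...
# Starts sync_folder for each FOLDER in the background, never keeping
# more than MAX of them counted as running at once.
run_throttled() {
  local max=$1
  shift
  local running=0
  local folder
  for folder in "$@"; do
    if [ "$running" -ge "$max" ]; then
      wait -n                      # block until one background job exits
      running=$((running - 1))
    fi
    sync_folder "$folder" &
    running=$((running + 1))
  done
  wait                             # let the remaining jobs finish
}
```

To cover the folder layout described above, this could be called from /source_top_folder as: run_throttled 20 $(seq -w 0 999). I have kept the limit of 20 from the original script; the right number will depend on your bandwidth and machine.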

12th August 2009, 19:27
Try something like this:
seq -w 0 999 | xargs -i -P 20 /usr/bin/s3sync -p --no-md5 -v -s -r --progress --delete ./{}/ my_bucket:{}/
13th August 2009, 15:42
I found the s3sync tool to be unusably slow and unable to handle very large files. If memory serves correctly, it is because it uses the horrible net-http library from ruby 1.8. When trying to move a lot of large files, it stored the files in memory and wasn’t freeing them after each file was transferred, ultimately crashing.
s3-bash was a bit clunkier to use but was WAY faster and more reliable, and didn’t eat up the system memory.
It’s been about 2 years since I’ve looked at this, so things might have changed some.
13th August 2009, 19:04
@Mark: besides the issue I described here (not running in parallel), I’ve found s3sync to run quite well (I haven’t seen any memory problems). I have used it for small files (many of them, as in this case) but also for very big files, without any issues. I am not aware of a better alternative that handles synchronization between a local directory and an S3 bucket well, but if such a tool exists I would love to try it out.