Running s3sync in parallel

s3sync is a great tool for synchronizing local data with Amazon S3, whether for backups or any other reason you might want your data on S3. It is very simple to install (gem install s3sync) and use (s3sync -v -s -r --progress source_dir s3_bucket:dir); it runs very well and can easily be scripted to do regular backups or even synchronize live data with S3. The only problem I found while using s3sync is that it can be very slow when uploading a lot of data (millions of files) to S3. This is partly because each upload takes time, but also because it processes a single file at a time and doesn't do several uploads in parallel. I would have loved for s3sync to do this out of the box, but unfortunately it doesn't. For my particular need, though, I was able to work around it by running several s3sync commands at the same time. The approach will not apply directly to your data (unless it is structured the same way as here, which is unlikely), but it might give you an idea of how to do this with your own data if it is structured in a suitable way.
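For reference, here is the basic single-process usage mentioned above (source_dir and s3_bucket:dir are placeholders, just like in the inline example):

# install the gem
gem install s3sync

# recursive sync of a local folder to a bucket "directory",
# with verbose output and a progress indicator
s3sync -v -s -r --progress source_dir s3_bucket:dir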

Ok, for this particular upload I am sync'ing a few million files, laid out in folders structured like this:

000/000/files...
000/001/files...
...
999/999/files...

The process was taking days with a single s3sync running, so I put together a small script to run several top-level-folder s3syncs at the same time. This reduced the time a lot and was a good workaround for our problem. Here is the script I used, in case it might help others:

#!/bin/bash

cd /source_top_folder

id=0
# loop over the 000..999 top-level folders (-le so folder 999 is included)
while [ $id -le 999 ]; do
        sleep 10
        echo "."
        # count the s3sync (ruby) processes currently running
        running=$(ps -ef | grep s3sync | grep ruby | wc -l)
        # keep at most 20 uploads going at any time
        if [ $running -lt 20 ]; then
                # zero-pad the id to match the folder names (e.g. 7 -> 007)
                lid=$(printf "%03d" $id)
                echo "starting a new s3sync - $lid"
                /usr/bin/s3sync -p --no-md5 -v -s -r --progress --delete ./$lid/ my_bucket:$lid/ &
                let id=id+1
        fi
done

# wait for the last batch of background uploads to finish
wait

This will basically keep 20 s3sync instances running, starting a new one whenever the count drops below 20. I realize this is not perfect, but it has done its job for us on this particular project. Ideally s3sync would run several parallel upload threads itself and be much faster, but until then you might use a similar solution if you run into such a problem ;).
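As a side note, if your system has GNU xargs, a similar fan-out can be done without the polling loop. This is just a sketch under the same assumptions as the script above (the 000-999 top-level layout and the my_bucket name):

# run one s3sync per top-level folder, at most 20 in parallel
cd /source_top_folder
printf "%03d\n" $(seq 0 999) | \
        xargs -P 20 -I {} /usr/bin/s3sync -p --no-md5 -v -s -r --progress --delete ./{}/ my_bucket:{}/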
