Parallel rsync’ing a huge directory tree

A few days ago I had to transfer a huge directory tree.

Huge means ~50TB with more than 10 million files, nested only 6 folders deep under the parent one.
As I have to do this kind of transfer more than 10 times, with the same folder layout each time, I decided to write a function that launches parallel rsyncs at a depth of my choosing.

The result is this little “pure bash” script (the only dependency is “screen”). You’ll notice that the main function, “sync_this()”, can be dropped into your own script by changing only 2 or 3 variables ;-)

#!/bin/bash
[ -z "$1" ] && echo "Usage: $0 /path/to/run" && exit 1

TARGET="$1"

[ ! -d "${TARGET}" ] && echo -e "${TARGET}\n not a directory" && exit 1

# Logs go next to this script, in a folder named after the target
LOGDIR="$(dirname "$0")/$(basename "${TARGET}")"
[ -d "${LOGDIR}" ] && echo "Cleanup" && rm -fr "${LOGDIR}"
mkdir -p "${LOGDIR}/transferlogs"

check_max_processes()
{
	# Block while more than MAXPARALEL rsync processes are running
	local MAXPARALEL=$1
	while [ "$(ps waux | grep -cE ":[0-9]{2} rsync")" -gt "${MAXPARALEL}" ] ; do
		printf "%s" .
		sleep 1
	done
}

sync_this()
{
	local MAXDEPTH=3
	local MAXPARALEL=20
	local FOLDER ITEM i x

	LAUCHRSYNC="/root/autosync/launch_rsync.sh"
	DIRLIST=()
	# Directories at exactly MAXDEPTH: each one gets its own recursive rsync.
	# The unquoted $(find ...) splits on whitespace, so paths must not contain spaces.
	for FOLDER in $(find "${TARGET}" -mindepth ${MAXDEPTH} -maxdepth ${MAXDEPTH} -type d) ; do
		DIRLIST+=("${FOLDER}")
	done

	echo "Copying files and directories NOT recursively"
	# x is counted across all depths so that log names (nr_0, nr_1, ...) never collide
	x=0
	for ((i=0; i<MAXDEPTH; i++)); do
		for ITEM in $(find "${TARGET}" -mindepth $i -maxdepth $i -type d) ; do
			check_max_processes ${MAXPARALEL}
			screen -S ${x} -d -m ${LAUCHRSYNC} -nr ${ITEM} nr_${x} ${LOGDIR}
			x=$((x+1))
			(( x % 100 == 0 )) && printf "\n%s\n" "$x directories copied non-recursively"
		done
		echo "Depth $i DONE, going deeper"
	done
	echo "Launching recursive rsyncs at depth ${MAXDEPTH}"
	for ((i=0; i<${#DIRLIST[@]}; i++ )); do
		printf "\n%s" "Launching rsync $((i+1)) of ${#DIRLIST[@]}"
		check_max_processes ${MAXPARALEL}
		screen -S ${i} -d -m ${LAUCHRSYNC} -r ${DIRLIST[$i]} r_${i} ${LOGDIR}
	done
}

sync_this ${TARGET}
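
The loops in sync_this() word-split the output of find, so a directory name containing whitespace would be cut into pieces. Here is a minimal sketch of a NUL-delimited alternative, assuming your find supports -print0 (GNU and BSD find both do). The helper name list_dirs_at_depth and the demo tree are made up for illustration; they are not part of the original script.

```shell
#!/bin/bash
# Print every directory exactly DEPTH levels below ROOT, NUL-delimited,
# so names containing spaces (or even newlines) survive intact.
list_dirs_at_depth() {
    local root=$1 depth=$2
    find "$root" -mindepth "$depth" -maxdepth "$depth" -type d -print0
}

# Demo on a throwaway tree (hypothetical layout, for illustration only).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/a/b 1/c" "$demo_root/a/b 2/d" "$demo_root/e/f/g"

# Read the NUL-delimited list into an array, one element per directory.
DIRLIST=()
while IFS= read -r -d '' dir; do
    DIRLIST+=("$dir")
done < <(list_dirs_at_depth "$demo_root" 3)

printf 'found %d directories at depth 3\n' "${#DIRLIST[@]}"
printf '  %s\n' "${DIRLIST[@]}"
rm -rf "$demo_root"
```

To use it in sync_this(), the array would replace the `for FOLDER in $(find ...)` loop, and every later use of the paths would need to be quoted.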

I use an additional script to launch each rsync (the ${LAUCHRSYNC} variable). Why? Simply to keep track of what each rsync is doing and how it ended. Here is the code of that script.

It is saved as /root/autosync/launch_rsync.sh (the value of ${LAUCHRSYNC} in the main script):
#!/bin/bash
# launch_rsync.sh
# Usage: launch_rsync.sh -nr|-r TARGET SCREENNAME LOGDIR
RECURSIVE=$(echo "$1" | tr '[:upper:]' '[:lower:]')
TARGET=$2
SCREENNAME=$3
LOGDIR=$4
DSTSERVER="1.1.1.1"

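# "> file 2>&1" sends both stdout and stderr to the log ("2>&1 > file" would not).
# A trailing slash replaces "/*" so that dotfiles are included and huge
# directories cannot overflow the command line.
if [[ "${RECURSIVE}" =~ ^\-{1,2}(nr|non-recursive)$ ]] ; then
	# -d copies the entries of this directory without descending into subdirectories
	rsync -cdlptgoDv --partial "${TARGET}/" "${DSTSERVER}:${TARGET}/" > "${LOGDIR}/transferlogs/${SCREENNAME}_NOTRECURSIVE.log" 2>&1
	RES=$?
elif [[ "${RECURSIVE}" =~ ^\-{1,2}(r|recursive)$ ]] ; then
	rsync -cazv --partial "${TARGET}/" "${DSTSERVER}:${TARGET}/" > "${LOGDIR}/transferlogs/${SCREENNAME}.log" 2>&1
	RES=$?
else
	echo "Usage: $0 -nr|-r|--non-recursive|--recursive TARGET SCREENNAME LOGDIR"
	exit 1
fi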

# Record the rsync exit code and the folder in the OK or FAIL list
if [ "${RES}" -eq 0 ] ; then
	echo "${RES} : ${TARGET}" >> "${LOGDIR}/${RECURSIVE//-/}_TRANSFERS.OK"
else
	echo "${RES} : ${TARGET}" >> "${LOGDIR}/${RECURSIVE//-/}_TRANSFERS.FAIL"
fi

If you don’t care about the result of the rsyncs, you can simply move the rsync lines from launch_rsync.sh into the main script and run them in the background.
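
A rough sketch of that variant, without screen or the helper script: each transfer runs as a background job of the main shell, and a small loop keeps the number of running jobs under a limit. The names transfer_one, wait_for_slot and MAXPARALLEL, as well as the directory names, are placeholders of mine; transfer_one only sleeps and echoes here, and is where the real rsync line would go.

```shell
#!/bin/bash
# Upper bound on simultaneous background transfers (hypothetical default).
MAXPARALLEL=${MAXPARALLEL:-3}

transfer_one() {
    # Stand-in for the real work, e.g.:
    #   rsync -az --partial "$1/" "${DSTSERVER}:$1/"
    sleep 0.2
    echo "done: $1"
}

wait_for_slot() {
    # "jobs -rp" lists only this shell's running background jobs,
    # so unrelated rsync processes on the host are not counted.
    while [ "$(jobs -rp | wc -l)" -ge "$MAXPARALLEL" ]; do
        sleep 0.1
    done
}

OUT=$(mktemp)
for dir in dir1 dir2 dir3 dir4 dir5 dir6; do
    wait_for_slot
    transfer_one "$dir" >> "$OUT" &
done
wait    # block until every background transfer has finished

COMPLETED=$(wc -l < "$OUT")
echo "completed $COMPLETED transfers"
rm -f "$OUT"
```

On bash 4.3 or later, `wait -n` could replace the polling loop. Unlike the screen-based version, these jobs die with the shell that started them, so run the whole script inside screen or nohup for long transfers.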
The main script creates a new folder named $(dirname $0)/$(basename ${TARGET}), in which you’ll find some important files:

Name                    Contents
(nr|r)_TRANSFERS.FAIL   Folders whose rsync has NOT finished OK (nr=non-recursive, r=recursive)
(nr|r)_TRANSFERS.OK     Folders whose rsync HAS finished OK (nr=non-recursive, r=recursive)
transferlogs            Folder holding one logfile for each rsync launched ;-)
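
Since every line in the FAIL files has the form "exit code : folder", the folders to retry can be pulled out with a few lines of bash. This is only a sketch: failed_dirs is a helper name I made up, and the FAIL file in the demo contains invented sample data, not real results.

```shell
#!/bin/bash
# Print the folder part of each "<rsync exit code> : <folder>" line
# in a *_TRANSFERS.FAIL file. Prints nothing if the file does not exist.
failed_dirs() {
    local failfile=$1 line
    [ -f "$failfile" ] || return 0
    while IFS= read -r line; do
        # Strip everything up to and including the first " : "
        printf '%s\n' "${line#* : }"
    done < "$failfile"
}

# Demo with a made-up FAIL file (sample data for illustration only).
demo_log=$(mktemp -d)
printf '%s\n' '23 : /data/projects/a' '12 : /data/projects/b c' \
    > "$demo_log/r_TRANSFERS.FAIL"

RETRY=$(failed_dirs "$demo_log/r_TRANSFERS.FAIL")
printf 'would retry:\n%s\n' "$RETRY"
rm -rf "$demo_log"
```

Each printed folder could then be passed again to launch_rsync.sh with -r (or -nr for the nr_ file) once the cause of the failure is fixed.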

EDIT:
You’ll find more info here:
https://wiki.ciberterminal.net/doku.php?id=linux:parallel_rsync
