Running Spark clusters as jobs on LSF

While Spark is often run on Mesos or YARN separate from a traditional HPC scheduler (and sometimes alongside them via a super-scheduler), our requirement was to run Spark clusters on top of an HPC cluster. Previous posts have discussed running Spark on top of a Univa Grid Engine cluster.

This article deals with running Spark clusters on top of IBM Spectrum LSF. Spark is configured in standalone mode, and individual masters and their workers are submitted as separate jobs to the LSF scheduler. From there, jobs can be submitted to LSF that act as Spark drivers connected to those Spark clusters.

LSF Settings

  1. Define the hostgroup in LSB_CONFDIR/cluster_name/configdir/lsb.hosts:
    Begin HostGroup
    GROUP_NAME  GROUP_MEMBER                          CONDENSE    #GROUP_ADMIN    # Key words
    spark       (c0[1-8]u0[1-9] c0[1-8]u[10-32])      Y
    End HostGroup

    This is optional, but it makes things easier in later config files. In this case, the spark hostgroup is defined as covering nodes c01u01 through c08u32.

  2. Define the queue in LSB_CONFDIR/cluster_name/configdir/lsb.queues:
    Begin Queue
    QUEUE_NAME   = spark
    priority     = 40
    HOSTS        = spark
    DESCRIPTION  = spark nodes
    EXCLUSIVE    = Y
    JOB_CONTROLS = TERMINATE[SIGTERM]
    POST_EXEC    = /PATH/TO/POSTSCRIPTS/postscript_spark_lsf.sh
    RUNLIMIT     = 8:00 730:00
    USERS        = allusers
    End Queue

    Note the JOB_CONTROLS section. By default, LSF sends SIGUSR2 to a job that is being killed. The Spark daemons generate core dumps when sent that signal, so they need to be sent a SIGTERM instead. I have also set a default runlimit of 8 hours, with a user-definable maximum of 730 hours.

  3. Add the following to $LSF_TOP/conf/lsf.conf:
    LSB_SUB_COMMANDNAME=Y

    This allows for substitution of the command name in a bsub via an esub.

esub configuration

Define an esub for spark in LSF_SERVERDIR/esub.spark:

#!/bin/bash
#$RCSfile$Revision$Date$

# Redirect stdout to stderr so echo can be used for error messages
exec 1>&2

if [ -z "$LSB_SUB_PARM_FILE" ]; then
    # if not set do nothing
    exit 0
fi

. $LSB_SUB_PARM_FILE

if [ "$LSB_SUB_ADDITIONAL" = "" ]; then
    exit 0
fi

case $1 in
    "master")
        TYPE="master"
        echo 'LSB_SUB_JOB_NAME = "master"' >> $LSB_SUB_MODIFY_FILE
        ;;
    "worker")
        TYPE="worker"
        if [[ $LSB_SUB_JOB_NAME != W* ]] ; then
            echo "Wrong job name for a Spark worker. Please submit with the job name W"
            exit $LSB_SUB_ABORT_VALUE
        fi
        ;;
    *)
        echo '"Invalid job type. Please use bsub -a "spark(master)" or -a "spark(worker)"'
        exit $LSB_SUB_ABORT_VALUE
        ;;
esac

case $2 in
    "current"|"")
        VERSION="current"
        ;;
    "rc")
        VERSION="rc"
        ;;
    "test")
        VERSION="test"
        ;;
    "2")
        VERSION="2"
        ;;
    *)
        exit $LSB_SUB_ABORT_VALUE
        ;;
esac

if [[ -z $SPARK_LOG_DIR ]]; then
    export SPARK_LOG_DIR=$HOME/.spark/logs/$(date +%H-%F)
fi
echo "SPARK_LOG_DIR = $SPARK_LOG_DIR" >> $LSB_SUB_MODIFY_ENVFILE
mkdir -p $SPARK_LOG_DIR/lsf

COMMANDLINE="/misc/local/spark-versions/bin/start_${TYPE}_lsf.sh $VERSION"
echo "LSB_SUB_COMMAND_LINE = \"$COMMANDLINE\"" >> $LSB_SUB_MODIFY_FILE
echo 'LSB_SUB_MAX_NUM_PROCESSORS = 16' >> $LSB_SUB_MODIFY_FILE
echo "LSB_SUB_OUT_FILE = \"$SPARK_LOG_DIR/lsf/%J.out\"" >> $LSB_SUB_MODIFY_FILE
echo 'LSB_SUB_QUEUE = "spark"' >> $LSB_SUB_MODIFY_FILE

exit 0

The two case statements interpret the options submitted in the -a flag: the first chooses between master and worker, and the second selects the Spark version.

The worker option also tests that the worker job is named appropriately for the launch scripts to work. The format is W<jobid of master>.

COMMANDLINE defines the actual command to be run. The bsub must include a command, but it can be an arbitrary string; the esub replaces it with the appropriate launch script.
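For example (the job ID here is hypothetical), these submissions show how the -a arguments map to $1 (job type) and $2 (version) in the esub:

# $1=master, $2 empty, so the version defaults to "current"
bsub -a "spark(master)" arbitrarystring

# $1=worker, $2=rc; the job name must be W followed by the master's job ID
bsub -a "spark(worker,rc)" -J W123456 arbitrarystring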

Site definitions and assumptions for this file:

  • Multiple versions are available and their launch scripts are named in the same way such that the version can be substituted in (e.g. start_master_lsf.sh 1_6)
  • Each node has 16 cores (LSB_SUB_MAX_NUM_PROCESSORS line)
  • STDOUT/STDERR is logged to a file named <jobid>.out under $SPARK_LOG_DIR/lsf, which defaults to a directory under the user's home directory

Launch Scripts

The launch scripts are referred to in the esub above. I have revised these significantly since my Grid Engine/sparkflex post, and they’re a lot smarter now.

start_master_lsf.sh:

#!/bin/bash                                                                                              
VERSION=$1
if [[ -n ${SPARK_LOG_DIR} ]]; then
  sparklogdir="${SPARK_LOG_DIR}"
else
  sparklogdir=~/.spark/logs/`date +%H-%F`
fi
mkdir -p $sparklogdir
export SPARK_HOME=/PATH/TO/spark-$VERSION    
. $SPARK_HOME/sbin/spark-config.sh    
. $SPARK_HOME/bin/load-spark-env.sh    
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master --host `hostname`

The VERSION is set by the esub (above). sparklogdir defines the log location for all logs for this job; it is set similarly in the esub and in the Spark conf directories. SPARK_HOME is the path to the Spark install; note that for the script to work properly, the version installs must be named consistently (spark-$VERSION). The sourced $SPARK_HOME/sbin and bin commands are scripts provided by the Spark install.

start_worker_lsf.sh:

#!/bin/bash                                                                                                                               
MasterID=`echo $LSB_JOBNAME | cut -c2-`
echo $MasterID

VERSION=$1

n=0
while [ $n -lt 15 ]; do
    n=$(($n + 1))
    MasterFQDN=`bjobs -r -noheader -X | awk "/$MasterID/ && /master/ "'{sub(/16\*/,""); print $6".fully.qualified.domain"}'`
    echo $MasterFQDN
    if [ ! -z $MasterFQDN ]; then
        break
    fi
    echo "No running master. Retrying..."
    sleep 120
done

if [ -z $MasterFQDN ]; then
    echo "No Master running. Please resubmit worker."
    exit 1
fi

MASTER="spark://${MasterFQDN}:7077"
export SPARK_HOME=/PATH/TO/spark-$VERSION
. $SPARK_HOME/sbin/spark-config.sh
. $SPARK_HOME/bin/load-spark-env.sh
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 $MASTER 2>&1

This script obtains the job ID of the master from the name of the worker job. It then searches the bjobs output for a job with that ID, finds the hostname of the node running it, and appends the domain. If the master has not yet started, the script retries 15 times at two-minute intervals and errors out after that. If the master is found, it sets the MASTER variable and then launches the worker with the Spark scripts, similar to the start_master_lsf.sh script above.
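As a sketch of what that parsing does (the job ID, host, and domain here are hypothetical):

# bjobs -r -noheader -X might show the running master like this:
# 123456  someuser  RUN  spark  loginhost  16*c03u14  master  Mar  1 12:00
# The awk matches the master's job ID, strips the leading "16*" from the
# EXEC_HOST column ($6), and appends the site domain:
bjobs -r -noheader -X | awk '/123456/ && /master/ {sub(/16\*/,""); print $6".fully.qualified.domain"}'
# -> c03u14.fully.qualified.domain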

Launching clusters

Launching a master is trivial:

bsub -a"spark(master,)" arbitrarystring

Note that arbitrarystring will be replaced by the esub above with the appropriate launch script depending on the inputs in the -a flag.

To launch a worker, it is necessary to have the master’s jobid in order to name the job correctly so that the script can associate the worker(s) with the master:

bsub -a "spark(worker,)"-J W  arbitrarystring

Repeat as needed for more workers.
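For example, a hypothetical loop that adds four workers to the cluster whose master is job 123456:

for i in $(seq 1 4); do
    bsub -a "spark(worker,current)" -J W123456 arbitrarystring
done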

To submit Spark jobs against this cluster, set the $MASTER variable to spark://<master FQDN>:7077, set the $SPARK_HOME variable appropriately, and bsub the command with either spark-submit or pyspark (etc.).
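For example, a hypothetical driver submission (the paths and master hostname are placeholders for your site):

export SPARK_HOME=/PATH/TO/spark-current
export MASTER=spark://c03u14.fully.qualified.domain:7077
bsub -o ~/sparkdriver.%J.out "$SPARK_HOME/bin/spark-submit --master $MASTER my_analysis.py"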

Later posts will cover a wrapper script that simplifies the process of setting up, running jobs on, and finally tearing down these Spark clusters for end users, as well as a tightly coupled version of Spark analogous to my first attempt to run Spark on Grid Engine.  I also plan to write a new post on the spark configuration options and tuning we currently use.

SparkFlex, aka running Spark jobs on a Grid Engine cluster, take II

I’ve left everyone who looks at my Spark posts hanging for months with a teaser about a new method I’ve been working on. This is the first of my posts about this.

The high-level idea is that it's better to run Spark as independent jobs rather than as a monolithic multi-node job. That way, if the HPC cluster is heavily used, at least part of the Spark cluster can get scheduled and work can begin (assuming that not all nodes are strictly necessary for the work to succeed). Additionally, nodes can be added to or removed from the cluster as needed.

This is fairly easily achieved by running the Spark master as one job, then getting the master's URL and submitting each worker as an independent job with the MASTER variable set (and things configured like my previous example), i.e.

export MASTER=spark://<mastername>:7077
export SPARK_HOME=/PATH/TO/SPARK
qsub -pe spark 1 -jc spark -N sparkworker -q spark -b y "$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER"

But wait! There's still a better way to do this, using job classes to a much greater extent. Note that you'll need to change the paths to the scripts, the name of the queue, and the exclusive complex name to fit your environment.

qconf -sjc sparkflex
jcname          sparkflex
variant_list    default worker master-rc worker-rc master-test worker-test
owner           NONE
user_lists      NONE
xuser_lists     NONE
A               {+}UNSPECIFIED
a               {+}UNSPECIFIED
ar              {+}UNSPECIFIED
b               TRUE
binding         {+}UNSPECIFIED
c_interval      {+}UNSPECIFIED
c_occasion      {+}UNSPECIFIED
CMDNAME         /PATH/TO/SCRIPTS/start-flexmaster.sh, \
                [worker=/PATH/TO/SCRIPTS/start-flexworker.sh], \
                [master-rc=/PATH/TO/SCRIPTS/start-flexmaster-rc.sh], \
                [worker-rc=/PATH/TO/SCRIPTS/start-flexworker-rc.sh], \
                [master-test=/PATH/TO/SCRIPTS/start-flexmaster-test.sh], \
                [worker-test=/PATH/TO/SCRIPTS/start-flexworker-test.sh]
CMDARG          {+}UNSPECIFIED
ckpt            {+}UNSPECIFIED
ac              {+}UNSPECIFIED
cwd             .
dl              {+}UNSPECIFIED
e               {+}UNSPECIFIED
h               {+}UNSPECIFIED
hold_jid        {+}UNSPECIFIED
hold_jid_ad     {+}UNSPECIFIED
i               {+}UNSPECIFIED
j               TRUE
js              {+}UNSPECIFIED
l_hard          {-~}spark_exclusive=1,{-~}h_rt=43200
l_soft          {+}UNSPECIFIED
masterl         {+}UNSPECIFIED
m               {+}UNSPECIFIED
mbind           {+}UNSPECIFIED
M               {+}UNSPECIFIED
masterq         {+}UNSPECIFIED
N               master,[{~}worker={~}W],[{~}worker-rc={~}W], \
                [{~}worker-test={~}W]
notify          {+}UNSPECIFIED
now             {+}UNSPECIFIED
o               $HOME/.spark/ugelogs
P               {+}UNSPECIFIED
p               {+}UNSPECIFIED
pe_name         sparkflex
pe_range        1
q_hard          spark.q
q_soft          {+}UNSPECIFIED
R               {+}UNSPECIFIED
r               {+}UNSPECIFIED
rou             {+}UNSPECIFIED
S               {+}UNSPECIFIED
shell           {+}UNSPECIFIED
t               {+}UNSPECIFIED
tc              {+}UNSPECIFIED
V               TRUE
v               {+}UNSPECIFIED

As you can see, the JC specifies quite a few things about the job that had previously been in the qsub, including a hard runtime, the pe, the queue, and even the scripts. With the variant list, a single JC can be used for multiple kinds of spark jobs, including the master (default), the worker, and a couple of other variants of spark installed separately.
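For example, hypothetical submissions that select the release-candidate variants:

# master from the rc install (runs start-flexmaster-rc.sh)
qsub -jc sparkflex.master-rc
# matching worker, named after the master's (hypothetical) job ID
qsub -jc sparkflex.worker-rc -N W1234567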

The master start scripts referenced in the CMDNAME line are pretty simple:


#!/bin/bash
# start-flexmaster.sh 
####CHANGE PATH TO SPARK INSTALL                                                                                            
export SPARK_HOME=/PATH/TO/SPARK   
. $SPARK_HOME/sbin/spark-config.sh    
. $SPARK_HOME/bin/load-spark-env.sh    
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master --host `hostname`


The worker is a bit more complicated, since it has to figure out where the master is from the job number provided:


#!/bin/bash  
# start-flexworker.sh                                                                                           
MasterID=`echo $JOB_NAME | cut -c2-`
echo $MasterID

n=0
while [ $n -lt 15 ]; do 
    n=$(($n + 1))
####ADJUST THE awk BELOW FOR YOUR NODES' DOMAIN AND USE THE CORRECT QUEUE NAME FOR YOUR ENVIRONMENT
    MasterFQDN=`qstat -s r | awk "/$MasterID/ && /master/ "'{sub(/spark.q@/,""); gsub(/\.i.*/,"."); print $8}'`
    echo $MasterFQDN
    if [ ! -z $MasterFQDN ]; then
        break
    fi
    echo "No running master. Retrying..."
    sleep 120
done

if [ -z $MasterFQDN ]; then
    echo "No Master running. Please resubmit worker."
    exit 1
fi

MASTER="spark://${MasterFQDN}:7077"
####USE YOUR PATH TO SPARK
export SPARK_HOME=/PATH/TO/SPARK
. $SPARK_HOME/sbin/spark-config.sh    
. $SPARK_HOME/bin/load-spark-env.sh    
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 $MASTER


So now the submission would be:


#MASTER
qsub -jc sparkflex.default
#WORKER
qsub -jc sparkflex.worker -N W<jobid of master>


Breaking this down, the start-flexworker.sh script first checks whether there is a master running, and will keep checking every 2 minutes for up to 30 minutes (that's the while loop at the top). It does this with the big awk statement, which looks for the master's job ID in the parsed qstat output and returns the master's FQDN. After that, it launches the worker as above, passing it the $MASTER variable.
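As a sketch of what that awk does (all values here are hypothetical):

# qstat -s r might show the running master like this:
# 1234567 0.55500 master someuser r 03/01/2018 12:00:00 spark.q@h07u01.int.example.org 16
# The awk matches the master's job ID, strips "spark.q@" from the queue@host
# column ($8), and trims the site-internal domain suffix, leaving the master's host:
qstat -s r | awk '/1234567/ && /master/ {sub(/spark.q@/,""); gsub(/\.i.*/,"."); print $8}'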

I have written a python script that makes this even easier and includes some error checking, etc, but that’s a post for another day.

Isilon: Remove or Add a Node from/to Multiple Network Pools

It’s a pain to change networks in bulk in an Isilon cluster, particularly if it’s a complex environment. Adding a new node that will be serving multiple network pools in the same subnet is particularly time-consuming. Similarly, tracking down all the interfaces and pools a node is in, in order to remove it for maintenance or other purposes, can be messy.

This script takes the node's number as the first argument and either add or remove as the second. It checks whether each interface is active before performing any operations on it.

#!/bin/bash
node=$1
operation=$2

#check if node number is valid
if [ "$(isi_nodes -n$node %{name})" != "clustername-$node" ]; then
 echo "Not a valid node"
 exit
fi

#check if operation is either add or remove
if [ "$operation" != add -a "$operation" != remove ]; then
 echo "Not a valid operation: $operation"
 exit
fi


#function to check if the interface's connection is active. 
check_ifaces_active(){
 isi_for_array -n$node "ifconfig $iface" | awk '/active/ {print 1}'
}

#function to perform the operation on the interface for a set of pools
operate_interfaces() {
 echo $isi_iface
 isi networks modify pool --$operation-ifaces=$node:$isi_iface subnet2:pool4-synciq
 sleep 5
 isi networks modify pool --$operation-ifaces=$node:$isi_iface subnet2:pool0
 sleep 5
 isi networks modify pool --$operation-ifaces=$node:$isi_iface subnet2:pool2
 sleep 5
}

#check both 10GbE interfaces
for iface in bxe0 bxe1; do
 if [ "$(check_ifaces_active)" = "1" ]; then
  if [ $iface = bxe0 ]; then
   isi_iface=10gige-1 #Isilon uses different interface naming schemes for different things...
  elif [ $iface = bxe1 ]; then
   isi_iface=10gige-2
  else
   print "Something went wrong"
   exit
  fi
  operate_interfaces
  isi_for_array -n$node "ifconfig $iface"
 fi
done
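Hypothetical usage (the script name is whatever you saved it as), removing node 12 from the pools for maintenance and adding it back afterwards:

./isi_pool_node.sh 12 remove
# ...perform the maintenance...
./isi_pool_node.sh 12 add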

Running Spark as a job on a Grid Engine HPC Cluster (part 3)

UPDATED GUIDE: SparkFlex, aka running Spark jobs on a Grid Engine cluster, take II

And now for the secret sauce–how to actually get Spark workers to run and talk to a master.

For a long time, I had my start script launching the master. Unfortunately, this resulted in very loose control by the scheduler/execd over the processes, and I had to do a ton of cleanup at the end to make sure there were no orphan processes running on nodes after the job closed or failed.

After some research, I realized that it made more sense to launch the master as part of the job submission (I’ll get to this later in the post), and that I could tightly couple the workers by launching them in the start script via the qrsh command.

qrsh places the processes that it runs under the control of the scheduler, unlike using ssh to do the same thing.

This also means that you MUST have passwordless ssh set up for any user who wants to use Spark in this fashion.
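If a user doesn't already have that, a minimal sketch of the usual setup (your site may manage keys differently):

# generate a passphrase-less key and authorize it for the same account
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys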

So here’s the start script referenced by the PE (see Part 2):


#!/bin/bash
echo "Starting SPARK PE from $HOSTNAME"
SPARK_HOME=/usr/local/spark-current

# Create the master file.
cat $PE_HOSTFILE | cut -f1 -d" " | head -1 > /scratch/spark/tmp/master

# Create all file.
cat $PE_HOSTFILE | cut -f1 -d" " | uniq > /scratch/spark/tmp/all

# Create the workers file.
grep -v `cat /scratch/spark/tmp/master` /scratch/spark/tmp/all > /scratch/spark/tmp/workers

# Logging/debugging info
mkdir -p ~/.spark/logs/$JOB_ID
cp /scratch/spark/tmp/workers ~/.spark/logs/$JOB_ID/workers.txt
cp -LR $SPARK_HOME/conf ~/.spark/logs/$JOB_ID/

# Start the workers
echo "Starting Spark workers" 
MASTER=$HOSTNAME
while read worker; do
    /sge/current/bin/lx-amd64/qrsh -inherit $worker "$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER:7077" &
done < /scratch/spark/tmp/workers

echo "Spark Master WebUI for this job: http://$MASTER:8080"

$PE_HOSTFILE is a list of all nodes involved in the job, which is provided by Grid Engine. In this case, we only want the slave nodes, which will run the workers, so we need to do a little parsing to get those out. I’m sure that section could be a little cleaner, but there we are…

After I get my list of worker nodes, I do a little logging, which can be extremely helpful for troubleshooting launch failures and the like.

Then we get to the worker launches. I use the standalone cluster mode of launching the workers via scripts provided in Spark. Note the ampersand at the end of the command; this makes for a much faster startup of the workers. By backgrounding the qrsh process for each worker, we are not held up by serially launching the worker across each node. That allows us to spin up a large cluster (>20 nodes) in a fraction of the time that would be required to do it serially.

The assumption in this script is that the node that runs it is the Spark Master, which is why the master launch must be done as part of the job submission.

Here is a very simplistic submission script. It does nothing other than launch the master, so any processing on the Spark cluster would need to be included in the script or run via an interactive session (more about this later)


#!/bin/bash
# sparkmaster.sh
export PATH=/usr/local/spark-current/bin:/usr/local/spark-current/sbin:/usr/local/python-2.7.6/bin:$PATH
export SPARK_HOME=/usr/local/spark-current
/usr/local/spark-current/sbin/start-master.sh
export MASTER=spark://$HOSTNAME:7077
sleep 3600
exit 0

The only thing of note here is the sleep 3600 line–that keeps the master up for an hour. You’ll probably want to modify this to fit your environment. If you change it to sleep infinity, then you will need to manually qdel any cluster started with this script (lacking further instructions for the spark cluster in the script before the exit 0).

The qsub for this looks something like this, assuming you’ve set up your environment exactly like mine (ha):

qsub -jc spark -pe spark 5 -q hadoop2 -j y -o ~/sparklogs -V -b y sparkmaster.sh

This will launch a Spark cluster with 1 master and 4 workers (1+4=5, dontcherknow), logging to sparklogs in your home directory.

Our users typically use Spark in an interactive fashion, so in addition to launching their clusters, they also do a qlogin and then define the $MASTER, $SPARK_HOME, and $PATH variables as above within that session. From that point, the qlogin session acts as the Spark driver. It is also possible to run the driver on the same node as the master, but you run the risk of running the master out of RAM.

The main Spark using group here has written a much more intelligent version of this (see this github repo), but here is a very basic script that can be run after the qlogin to set up the right variables (as long as the user isn’t running more than one Spark cluster…)


#!/bin/bash
export MASTER="spark://"`qstat | grep spark | awk '{print $8}' | cut -d@ -f2`":7077"
export PATH=$PATH:/usr/local/spark-current/bin:/usr/local/spark-current/sbin:/usr/local/python-2.7.6/bin
export SPARK_HOME=/usr/local/spark-current

The spark-pestop.sh script is very simple at this point (before using qrsh, I was having to do a TON of clean up in it); now all it has to do is remove the files created by the spark-pestart.sh script, otherwise future jobs by different users will fail:


#!/bin/bash
rm -rf /scratch/spark/tmp/workers
rm -rf /scratch/spark/tmp/all
rm -rf /scratch/spark/tmp/master
exit 0

Another little note here–for multiple users to run Spark in this fashion, that /scratch/spark directory MUST be read/write by all of them. In our environment, the easiest way to do this was to set the permissions on it to 777 and call it a day.
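A minimal sketch of that setup on each node (777 is the lazy option from the note above; tighten it if your site requires):

mkdir -p /scratch/spark/tmp
chmod -R 777 /scratch/spark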

Please let me know in the comments if you have any questions or run into trouble.

Breaking Down the Monster III

So, finishing this off.

It-sa bunch-a case lines!

Write first:


echo $1 $2 "filesize: "$3 "totalsize: "$4"G" "filesperdir: "$5
case $1 in
	write)
        if [ $2 = scality ]; then
            filecount=$totfilecount
            time scalitywrite
            exit 0
        fi
        

So if it’s a Scality (or other pure object storage), it’s simple. Just run the write and time it, which will output the info you need. OTHERWISE…

#Chunk file groups into folders if count is too high
	if [ $totfilecount -ge 10000 ]; then
	    for dir in `seq 1 $foldercount`; do
	        createdir $fspath/$dir
	    done
	    time for dir in `seq 1 $foldercount`; do
	        path=$fspath/$dir
		filecount=$(( $totfilecount / $foldercount ))
	        writefiles
	    done
	else
	    path=$fspath
            createdir $path
            filecount=$totfilecount
            time writefiles
	fi
	;;


Do what the comment says: chunk the files into folders, since when you write to a filesystem, the count of files per directory makes a big difference. Make sure you create the directories before you try to write to them, and then time how long it takes to write all of them. If the total is less than the critical file count, just write them and time it.

Neeeext….


read) #in order read
	sync; echo 1 > /proc/sys/vm/drop_caches
        if [ $2 = scality ]; then
            filecount=$totfilecount
            time scalityread
            exit 0
        fi
	if [ $totfilecount -ge 10000 ]; then
		time for dir in `seq 1 $foldercount`; do
			path=$fspath/$dir
			filecount=$(( $totfilecount / $foldercount ))
			readfiles
		done
	else
		path=$fspath
		filecount=$totfilecount
		time readfiles
	fi
	;;

That sync line is how you clear the filesystem cache (as root) on a Linux system. This is important for benchmarking, because let me tell you, 6.4GB/sec is not a speed that most network storage systems can reach. Again, we split it and time all of the reads, or we just straight up time the reads if the file count is low enough. This routine reads files in the order they were written.


	rm) #serial remove files
        if [ $2 = scality ]; then
            time for i in `seq 1 $totfilecount`; do
                curl -s -X DELETE http://localhost:81/proxy/bparc/$fspath/$i-$suffix > /dev/null
            done
            exit 0
        fi
		if [ $totfilecount -ge 10000 ]; then
			time for i in `seq 1 $foldercount`; do
				rm -f $fspath/$i/*-$suffix
				rmdir $fspath/$i
			done
		elif [ -d $fspath/$3 ]; then 
			time rm -f $fspath/*-$suffix
		fi
	;;

Similar to the other two routines: if it's object-based storage, do something completely different; otherwise remove based on the file path and count of files.


	parrm) #parallel remove files
		time ls $fspath | parallel -N 64 rm -rf $fspath/{}
	;;

This one is remarkably simple. Just run parallel against an ls of the top level directory, and pipe it into rm -rf. The {} is the replacement string GNU parallel fills in from the piped-in ls; the -N 64 hands up to 64 entries to each invocation (the number of simultaneous jobs is controlled separately, with -j).
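A quick way to see the batching (a hypothetical illustration, not part of the benchmark script):

printf '%s\n' a b c d | parallel -k -N 2 echo batch: {1} {2}
# batch: a b
# batch: c d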


This one’s kind of neat:

	shufread) #shuffled read
		sync; echo 1 > /proc/sys/vm/drop_caches
		if [ $totfilecount -ge 10000 ]; then
			folderarray=(`shuf -i 1-$foldercount`)
			time for dir in ${folderarray[*]}; do
				path=$fspath/$dir
				filecount=$(( $totfilecount / $foldercount ))
				shufreadfiles
			done
		else
			path=$fspath
			filecount=$totfilecount
			time shufreadfiles
		fi
	;;
	

I needed a way to do random reads over the files I'd written, in order to simulate that on filesystems with little caching (i.e., make the drives do a lot of random seeks).


At first, I tried writing the file paths to a file, then reading that, but that has waaaay too much latency when you’re doing performance testing. So, after some digging, I found the shuf command, which shuffles a list. You can pass an arbitrary list with the -i flag. I tossed this all into an array, and then it proceeds like the read section.
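For example, shuf with the -i flag generates and shuffles a numeric range:

shuf -i 1-5
# prints the integers 1 through 5 in random order, one per line, e.g. 3 1 5 2 4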


	*) usage && exit 1;;
esac
echo '------------------------'

Fairly self-explanatory. I tossed in an echo with some separator characters to keep the output clean if you're running the command inside a for loop.

And that’s it!

Breaking down that monster

Or should I use Beast? No, this isn’t an XtremIO. (sorry, I just got back from EMCWorld 2015. The marketing gobbledygook is still strong in me.)

So, first part of the script, like many others, is a function (cleverly called usage), followed by the snippet that calls the function:


usage () {
	echo "Command syntax is $(basename $0) [write|read|shufread|rm|parrm] [test|tier1|tier2|gpfs|localscratch|localssd|object]"
        echo "[filesizeG|M|K] [totalsize in GB] (optional) [file count per directory] (optional)"
}

if [ "$#" -lt 3 ]; then
	usage
	exit 1
fi

Not much to see here if you already know what functions are and how they're formatted in bash. Basically, if a name is followed by () { and closed with }, it's a function, and you can call it like a script inside the main script. The code is not executed until the function is called by name. You can even pass it input variables; more on that later.
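A trivial illustration of the pattern (not from the benchmark script itself):

greet () {
    echo "Hello, $1"   # $1 is the first argument passed to the function
}
greet world            # prints: Hello, world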

Next, we come to a case block:


case $2 in
	test) fspath=/mnt/dmtest/scicomp/scicompsys/ddcompare/$3 ;;
	tier1) fspath=/mnt/node-64-dm11/ddcompare/$3 ;;
	tier2) fspath=/mnt/node-64-tier2/ddcompare/$3 ;;
	gpfs) fspath=/gpfs1/nlsata/ddcompare/$3 ;;
        localscratch) fspath=/scratch/carlilek/ddcompare/$3 ;;
        localssd) fspath=/ssd/ddcompare/$3 ;;
        object) fspath=/srttest/ddcompare/$3 ;;
	*) usage && exit 1;;
esac

This checks the second variable and sets the base path to be used in the testing. Note that object will be used differently than the rest, because all of the rest are file storage paths. Object ain’t.

Then, we set the size of the files (or objects) to be written, read, or deleted:


case $3 in
	*G) filesize=$(( 1024 * 1024 * `echo $3 | tr -d G`));;
	*M) filesize=$(( 1024 * `echo $3 | tr -d M` ));;
	*K) filesize=`echo $3 | tr -d K`;;
	*) usage && exit 1;;
esac

Note that I should probably be using the newer $( ) command-substitution style here, rather than backticks. I'll get around to it at some point.

The bizarre $(( blah op blah )) setup is how you do math in bash. Really.
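For example, the two command-substitution styles are equivalent for the 4M case:

filesize=$(( 1024 * `echo 4M | tr -d M` ))     # backticks, as in the script
filesize=$(( 1024 * $(echo 4M | tr -d M) ))    # the newer $( ) style
echo $filesize                                 # 4096 (kilobytes)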

The next few bits are all prepping how many files to write to a given subdirectory, how big the files are, etc.


#set the suffix for file names
suffix=$3

#set the total size of the test set
if [ ! -z $4 ]; then
	totalsize=$(( 1024 * 1024 * $4 ))
else
	totalsize=52428800 #The size of the test set in kb
fi
	
#set the number of files in subdirectories
if [ ! -z $5 ]; then
	filesperdir=$5
else
	filesperdir=5120 #Number of files per subdirectory for large file counts
fi

#set up variables for dd commands
if [ $filesize -ge 1024 ]; then
	blocksize=1048576
else
	blocksize=$(( $filesize * 1024 ))
fi

#set up variables for subdirectories
totfilecount=$(( $totalsize / $filesize ))
blockcount=$(( $filesize * 1024 / $blocksize ))
if [ $filesperdir -le $totfilecount ]; then
	foldercount=$(( $totfilecount / $filesperdir ))
fi

OK, I’ll get into the meat of the code in my next post. But I’m done now.

The first of several benchmarking scripts

I’m currently a file storage administrator, specializing in EMC Isilon. We have a rather large install (~60 heterogeneous nodes, ~4PB) as well as some smaller systems, an HPC dedicated GPFS filer from DDN, and an object based storage system from Scality. Obviously, all of these things have different performance characteristics, including the differing tiers of Isilon.

I’ve been benchmarking the various systems using the script below. I’ll walk through the various parts of the script. To date, this is probably one of my more ambitious attempts with Bash, and it would probably work better in Python, but I haven’t learned that yet. 😉


#!/bin/bash
usage () {
	echo "Command syntax is $(basename $0) [write|read|shufread|rm|parrm] [test|tier1|tier2|gpfs|localscratch|localssd|object]"
        echo "[filesizeG|M|K] [totalsize in GB] (optional) [file count per directory] (optional)"
}

if [ "$#" -lt 3 ]; then
	usage
	exit 1
fi

#CHANGE THESE PATHS TO FIT YOUR ENVIRONMENT
#set paths
case $2 in
	test) fspath=/mnt/dmtest/scicomp/scicompsys/ddcompare/$3 ;;
	tier1) fspath=/mnt/node-64-dm11/ddcompare/$3 ;;
	tier2) fspath=/mnt/node-64-tier2/ddcompare/$3 ;;
	gpfs) fspath=/gpfs1/nlsata/ddcompare/$3 ;;
        localscratch) fspath=/scratch/carlilek/ddcompare/$3 ;;
        localssd) fspath=/ssd/ddcompare/$3 ;;
        object) fspath=/srttest/ddcompare/$3 ;;
	*) usage && exit 1;;
esac

#some math to get the filesize in kilobytes
case $3 in
	*G) filesize=$(( 1024 * 1024 * `echo $3 | tr -d G`));;
	*M) filesize=$(( 1024 * `echo $3 | tr -d M` ));;
	*K) filesize=`echo $3 | tr -d K`;;
	*) usage && exit 1;;
esac	

#set the suffix for file names
suffix=$3

#set the total size of the test set
if [ ! -z $4 ]; then
	totalsize=$(( 1024 * 1024 * $4 ))
else
	totalsize=52428800 #The size of the test set in kb
fi
	
#set the number of files in subdirectories
if [ ! -z $5 ]; then
	filesperdir=$5
else
	filesperdir=5120 #Number of files per subdirectory for large file counts
fi

#set up variables for dd commands
if [ $filesize -ge 1024 ]; then
	blocksize=1048576
else
	blocksize=$(( $filesize * 1024 ))
fi

#set up variables for subdirectories
totfilecount=$(( $totalsize / $filesize ))
blockcount=$(( $filesize * 1024 / $blocksize ))
if [ $filesperdir -le $totfilecount ]; then
	foldercount=$(( $totfilecount / $filesperdir ))
fi

#debug output
#echo $fspath
#echo filecount $totfilecount
#echo totalsize $totalsize KB
#echo filesize $filesize KB
#echo blockcount $blockcount
#echo blocksize $blocksize bytes

#defines output of time in realtime seconds to one decimal place
TIMEFORMAT=%1R

#creates directory to write to
createdir () {
	if [ ! -d $1 ]; then
		mkdir -p $1
	fi
}

#write test
writefiles () {
	#echo WRITE
	for i in `seq 1 $filecount`; do 
		#echo -n .
		dd if=/dev/zero of=$path/$i-$suffix bs=$blocksize count=$blockcount 2> /dev/null
	done
}

#read test
readfiles () {
	#echo READ
	for i in `seq 1 $filecount`; do 
		#echo -n .
		dd if=$path/$i-$suffix of=/dev/null bs=$blocksize 2> /dev/null
		#dd if=$path/$i-$suffix of=/dev/null bs=$blocksize
	done
}

#shuffled read test
shufreadfiles () {
	#echo SHUFFLE READ
	filearray=(`shuf -i 1-$filecount`)
	for i in ${filearray[*]}; do 
		#echo -n .
		#echo $path/$i-$suffix
		dd if=$path/$i-$suffix of=/dev/null bs=$blocksize 2> /dev/null
		#dd if=$path/$i-$suffix of=/dev/null bs=$blocksize
	done
}

#ObjectWrite
scalitywrite () {
    for i in `seq 1 $filecount`; do
        dd if=/dev/zero bs=$blocksize count=$blockcount 2> /dev/null | curl -s -X PUT http://localhost:81/proxy/bparc$fspath/$i-$suffix -T- > /dev/null
    done
}

#ObjectRead
scalityread () {
    for i in `seq 1 $filecount`; do
        curl -s -X GET http://localhost:81/proxy/bparc/$fspath/$i-$suffix > /dev/null
    done
}

#Do the work based on the work type

echo $1 $2 "filesize: "$3 "totalsize: "$4"G" "filesperdir: "$5
case $1 in
	write) 
        if [ $2 = scality ]; then
            filecount=$totfilecount
            time scalitywrite
            exit 0
        fi
        #Chunk file groups into folders if count is too high
	    if [ $totfilecount -ge 10000 ]; then
			for dir in `seq 1 $foldercount`; do
				createdir $fspath/$dir
			done
			time for dir in `seq 1 $foldercount`; do
				path=$fspath/$dir
				filecount=$(( $totfilecount / $foldercount ))
				writefiles
			done
		else
			path=$fspath
            createdir $path
			filecount=$totfilecount
			time writefiles
		fi
	;;
	read) #in order read
		sync; echo 1 > /proc/sys/vm/drop_caches
        if [ $2 = scality ]; then
            filecount=$totfilecount
            time scalityread
            exit 0
        fi
		if [ $totfilecount -ge 10000 ]; then
			time for dir in `seq 1 $foldercount`; do
				path=$fspath/$dir
				filecount=$(( $totfilecount / $foldercount ))
				readfiles
			done
		else
			path=$fspath
			filecount=$totfilecount
			time readfiles
		fi
	;;
	rm) #serial remove files
        if [ $2 = scality ]; then
            time for i in `seq 1 $totfilecount`; do
                curl -s -X DELETE http://localhost:81/proxy/bparc/$fspath/$i-$suffix > /dev/null
            done
            exit 0
        fi
		if [ $totfilecount -ge 10000 ]; then
			time for i in `seq 1 $foldercount`; do
				rm -f $fspath/$i/*-$suffix
				rmdir $fspath/$i
			done
		elif [ -d $fspath/$3 ]; then 
			time rm -f $fspath/*-$suffix
		fi
	;;
	parrm) #parallel remove files
		time ls $fspath | parallel -N 64 rm -rf $fspath/{}
	;;
	shufread) #shuffled read
		sync; echo 1 > /proc/sys/vm/drop_caches
		if [ $totfilecount -ge 10000 ]; then
			folderarray=(`shuf -i 1-$foldercount`)
			time for dir in ${folderarray[*]}; do
				path=$fspath/$dir
				filecount=$(( $totfilecount / $foldercount ))
				shufreadfiles
			done
		else
			path=$fspath
			filecount=$totfilecount
			time shufreadfiles
		fi
	;;
		
	*) usage && exit 1;;
esac
echo '------------------------'

I’ll break this all down in my next post.