Moving data between quota protected directories on Isilon, take II

My previous post about moving data between quota’d directories on Isilon is probably one of the most popular I have made, garnering up to 5 hits per week.

But that script has a big issue: it can’t handle special characters or, worse yet, spaces in file and directory names. So here’s an improved version:


#Tests whether there is a valid path
testexist () {
        if [ ! -r "$1" ] || [[ "$1" != *ifs* ]]; then
                echo "$1 is an invalid path. Please check the path and/or use a full path including /ifs."
                exit 1
        fi
}

#Iterates backwards through the path to find the most closely related quota
findquota () {
        QUOTADIR="$1"
        RIGHTPATH=0
        while [ $RIGHTPATH -eq 0 ]; do
                QUOTADIR=$(echo "$QUOTADIR" | sed 's|/[^/]*$||')
                isi quota list | grep -q "$QUOTADIR" && RIGHTPATH=1
        done
        echo "$QUOTADIR"
}

testquota () {
        if [ "$1" = "-" ] || [ -z "$1" ]; then
                echo "No hard directory quota on this directory. Just use mv."
                exit 0
        fi
}

if [[ $# -ne 2 ]]; then
        #Gets paths from user
        echo "Enter source: "
        read SOURCE
        echo "Enter target: "
        read TARGET
else
        SOURCE="$1"
        TARGET="$2"
fi

testexist "$SOURCE"
testexist "$TARGET"

#Verifies paths with user
echo -en "Moving $SOURCE to $TARGET\nIs this correct? (y/n): "
read ANSWER
if [ "$ANSWER" != 'y' ]; then
        exit 1
fi

#Defines quotas
SOURCEQUOTA=$(findquota "$SOURCE")
TARGETQUOTA=$(findquota "$TARGET")

#Gets size of hard threshold from quota

SOURCETHRESH=$(isi quota view --path="$SOURCEQUOTA" --type=directory | awk -F": " '$1~/Hard Threshold/ {print $2}' | sed 's/\.00//')
TARGETTHRESH=$(isi quota view --path="$TARGETQUOTA" --type=directory | awk -F": " '$1~/Hard Threshold/ {print $2}' | sed 's/\.00//')

testquota "$SOURCETHRESH"
testquota "$TARGETTHRESH"

echo -e "\nOriginal values:"
isi quota list | grep "$SOURCEQUOTA"
isi quota list | grep "$TARGETQUOTA"

#Removes the quotas, moves the data, then recreates the quotas
isi quota quotas delete --type=directory --path="$SOURCEQUOTA" -f
isi quota quotas delete --type=directory --path="$TARGETQUOTA" -f

mv "$SOURCE" "$TARGET"

isi quota quotas create "$SOURCEQUOTA" directory --hard-threshold="$SOURCETHRESH" --container=yes
isi quota quotas create "$TARGETQUOTA" directory --hard-threshold="$TARGETTHRESH" --container=yes

echo -e "\nNew values:"
isi quota list | grep "$SOURCEQUOTA"
isi quota list | grep "$TARGETQUOTA"

echo -e "\nls -l $TARGET"
ls -l "$TARGET"

This version uses a ton of quotes to handle all the crazy stuff SMB users put on Isilons.

I’ve also cleaned up the findquota () function. Instead of incrementing a counter, it now walks backwards through the path with a sed command. Each time it shortens the path, it checks the result against the quota list on the Isilon; when that grep succeeds, it sets the $RIGHTPATH variable to 1 and exits the loop.
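To see what that sed expression does in isolation, here’s a standalone check (the path is invented):

```shell
# 's|/[^/]*$||' deletes the final /component, walking up one level per pass
p=/ifs/data/projects/team/subdir
p=$(echo "$p" | sed 's|/[^/]*$||')
echo "$p"   # → /ifs/data/projects/team
p=$(echo "$p" | sed 's|/[^/]*$||')
echo "$p"   # → /ifs/data/projects
```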

The outputs at the end are also a little cleaner so you can tell what the heck it’s doing when it spews out its final output.

Running Spark as a job on a Grid Engine HPC Cluster (part 3)

UPDATED GUIDE: SparkFlex, aka running Spark jobs on a Grid Engine cluster, take II

And now for the secret sauce–how to actually get Spark workers to run and talk to a master.

For a long time, I had my start script launching the master. Unfortunately, this resulted in very loose control by the scheduler/execd over the processes, and I had to do a ton of cleanup at the end to make sure there were no orphan processes running on nodes after the job closed or failed.

After some research, I realized that it made more sense to launch the master as part of the job submission (I’ll get to this later in the post), and that I could tightly couple the workers by launching them in the start script via the qrsh command.

qrsh places the processes that it runs under the control of the scheduler, unlike using ssh to do the same thing.

This also means that you MUST have passwordless ssh set up for any user who wants to use Spark in this fashion.

So here’s the start script referenced by the PE (see Part 2):

echo "Starting SPARK PE from $HOSTNAME"

# Create the master file.
cat $PE_HOSTFILE | cut -f1 -d" " | head -1 > /scratch/spark/tmp/master

# Create all file.
cat $PE_HOSTFILE | cut -f1 -d" " | uniq > /scratch/spark/tmp/all

# Create the workers file.
grep -v `cat /scratch/spark/tmp/master` /scratch/spark/tmp/all > /scratch/spark/tmp/workers

# Logging/debugging info
mkdir -p ~/.spark/logs/$JOB_ID
cp /scratch/spark/tmp/workers ~/.spark/logs/$JOB_ID/workers.txt
cp -LR $SPARK_HOME/conf ~/.spark/logs/$JOB_ID/

# Start the workers
echo "Starting Spark workers" 
while read worker; do
    /sge/current/bin/lx-amd64/qrsh -inherit $worker "$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER:7077" &
done < /scratch/spark/tmp/workers

echo "Spark Master WebUI for this job: http://$MASTER:8080"

$PE_HOSTFILE is a list of all nodes involved in the job, which is provided by Grid Engine. In this case, we only want the slave nodes, which will run the workers, so we need to do a little parsing to get those out. I’m sure that section could be a little cleaner, but there we are…
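For reference, each line of a Grid Engine hostfile is host, slot count, queue instance, and processor binding. The hostnames below are invented, but the parsing matches the start script:

```shell
# Fake $PE_HOSTFILE; format: host slots queue@host binding
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node01 1 hadoop2@node01 UNDEFINED
node02 1 hadoop2@node02 UNDEFINED
node03 1 hadoop2@node03 UNDEFINED
EOF
master=$(cut -f1 -d" " "$hostfile" | head -1)
echo "master: $master"                                  # → master: node01
cut -f1 -d" " "$hostfile" | uniq | grep -v "$master"    # workers: node02, node03
```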

After I get my list of worker nodes, I do a little logging, which can be extremely helpful for troubleshooting launch failures and the like.

Then we get to the worker launches. I use the standalone cluster mode of launching the workers via scripts provided in Spark. Note the ampersand at the end of the command; this makes for a much faster startup of the workers. By backgrounding the qrsh process for each worker, we are not held up by serially launching the worker across each node. That allows us to spin up a large cluster (>20 nodes) in a fraction of the time that would be required to do it serially.
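The effect of that trailing ampersand can be sketched with plain subshells (sleep stands in for the qrsh launch, and the node names are made up): run serially this loop would take about three seconds, but backgrounded it finishes in about one.

```shell
while read worker; do
    ( sleep 1; echo "started $worker" ) &   # stand-in for the qrsh call
done <<'EOF'
node01
node02
node03
EOF
wait    # block until every backgrounded launch has finished
echo "all workers up"
```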

The assumption in this script is that the node that runs it is the Spark Master, which is why the master launch must be done as part of the job submission.

Here is a very simplistic submission script. It does nothing other than launch the master, so any processing on the Spark cluster would need to be included in the script or run via an interactive session (more about this later).

export PATH=/usr/local/spark-current/bin:/usr/local/spark-current/sbin:/usr/local/python-2.7.6/bin:$PATH
export SPARK_HOME=/usr/local/spark-current
export MASTER=spark://$HOSTNAME:7077
$SPARK_HOME/sbin/start-master.sh
sleep 3600
exit 0

The only thing of note here is the sleep 3600 line–that keeps the master up for an hour. You’ll probably want to modify this to fit your environment. If you change it to sleep infinity, then you will need to manually qdel any cluster started with this script (lacking further instructions for the spark cluster in the script before the exit 0).

The qsub for this looks something like this, assuming you’ve set up your environment exactly like mine (ha)

qsub -jc spark -pe spark 5 -q hadoop2 -j y -o ~/sparklogs -V -b y

This will launch a Spark cluster with 1 master and 4 workers (1+4=5, dontcherknow) and log to sparklogs in your home directory.

Our users typically use Spark in an interactive fashion, so in addition to launching their clusters, they also do a qlogin and then define the $MASTER, $SPARK_HOME, and $PATH variables as above within that session. From that point, the qlogin session acts as the Spark driver. It is also possible to run the driver on the same node as the master, but you run the risk of running the master out of RAM.

The main Spark using group here has written a much more intelligent version of this (see this github repo), but here is a very basic script that can be run after the qlogin to set up the right variables (as long as the user isn’t running more than one Spark cluster…)

export MASTER="spark://"`qstat | grep spark | awk '{print $8}' | cut -d@ -f2`":7077"
export PATH=$PATH:/usr/local/spark-current/bin:/usr/local/spark-current/sbin:/usr/local/python-2.7.6/bin
export SPARK_HOME=/usr/local/spark-current
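That $MASTER pipeline can be sanity-checked against a canned qstat line (the job details here are invented; the eighth field is the queue instance, queue@host):

```shell
line="42 0.55 spark someuser r 01/01/2020 10:00:00 hadoop2@node07 1"
echo "spark://"$(echo "$line" | awk '{print $8}' | cut -d@ -f2)":7077"
# → spark://node07:7077
```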

The stop script referenced by the PE is very simple at this point (before using qrsh, I was having to do a TON of cleanup in it); now all it has to do is remove the files created by the start script, otherwise future jobs by different users will fail:

rm -rf /scratch/spark/tmp/workers
rm -rf /scratch/spark/tmp/all
rm -rf /scratch/spark/tmp/master
exit 0

Another little note here: for multiple users to run Spark in this fashion, that /scratch/spark directory MUST be read/write by all of them. In our environment, the easiest way to do this was to set the permissions on it to 777 and call it a day.
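One tweak worth considering (my suggestion, not part of the setup above): mode 1777, like /tmp, keeps the directory world-writable while the sticky bit stops users from deleting each other’s files. A sketch on a throwaway directory:

```shell
d=$(mktemp -d)/spark    # throwaway stand-in for /scratch/spark
mkdir -p "$d"
chmod 1777 "$d"         # 777 plus the sticky bit
stat -c '%a' "$d"       # → 1777
```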

Please let me know in the comments if you have any questions or run into trouble.

Running Spark as a job on a Grid Engine HPC Cluster (part 2)

UPDATED GUIDE: SparkFlex, aka running Spark jobs on a Grid Engine cluster, take II

So the next thing you need to do to get Spark running in Grid Engine is to set up your queue, job class, and your parallel environment.

Our environment is normally run with users requesting slots, which roughly correspond to one cpu and ~8GB of RAM per host. So our normal Sandy Bridge nodes have 16 slots, and our Haswell nodes have 32.

Spark is much easier to run if you give it the whole machine, of course (although I suspect you can limit the workers in various ways), but in our environment, part of the reason Spark is used is to load huge datasets (>1TB) into RAM. So it doesn’t make sense to allocate Spark workers (or the master) based on slot count.

Therefore, our queue for spark (called hadoop2 for historical reasons) is set up to have only 1 slot per node. We also have a complex called hadoop_exclusive which forces jobs (or tasks) to be scheduled to whole nodes.

>qconf -sq hadoop2
qname                 hadoop2
hostlist              @h02 @h03 @h04 @h05 @h06 @h07 @h08
seq_no                0
load_thresholds       np_load_avg=3
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
qtype                 INTERACTIVE
ckpt_list             NONE
pe_list               hadoop,spark,spark-rc,spark-test
jc_list               hadoop.default,spark.default
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
d_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

There isn’t much here that’s non-default, other than the jc_list and pe_list. Note that you need to set up a separate PE and JC for every version of Spark you’re using.

Here’s the PE setup:

>qconf -sp spark
pe_name                spark
slots                  99999
user_lists             NONE
xuser_lists            NONE
start_proc_args        /usr/local/uge-hadoop/
stop_proc_args         /usr/local/uge-hadoop/
allocation_rule        1
control_slaves         TRUE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     TRUE
daemon_forks_slaves    FALSE
master_forks_slaves    FALSE

The start_proc_args and stop_proc_args are really the secret sauce; I’ll talk about those in part 3. The other options are fairly obvious: you want the master (ie, the job) to control the slaves, which are running the Spark workers. The accounting_summary allows you to track the load/memory usage on the slave nodes. job_is_first_task is set to FALSE because the head node, while it runs the Spark master process (and sometimes the driver, depending on how the job is created), does not run any of the Spark workers.

On to the JC:

>qconf -sjc spark
jcname          spark
variant_list    default
owner           NONE
user_lists      NONE
xuser_lists     NONE
A               {+}UNSPECIFIED
a               {+}UNSPECIFIED
ar              {+}UNSPECIFIED
b               {+}UNSPECIFIED
binding         {+}UNSPECIFIED
c_interval      {+}UNSPECIFIED
c_occasion      {+}UNSPECIFIED
ckpt            {+}UNSPECIFIED
ac              {+}UNSPECIFIED
cwd             {+}UNSPECIFIED
dl              {+}UNSPECIFIED
e               {+}UNSPECIFIED
h               {+}UNSPECIFIED
hold_jid        {+}UNSPECIFIED
hold_jid_ad     {+}UNSPECIFIED
i               {+}UNSPECIFIED
j               {+}UNSPECIFIED
js              {+}UNSPECIFIED
l_hard          hadoop_exclusive=1
l_soft          {+}UNSPECIFIED
masterl         {+}UNSPECIFIED
m               {+}UNSPECIFIED
mbind           {+}UNSPECIFIED
M               {+}UNSPECIFIED
masterq         {+}UNSPECIFIED
N               {+}UNSPECIFIED
notify          {+}UNSPECIFIED
now             {+}UNSPECIFIED
o               {+}UNSPECIFIED
P               {+}UNSPECIFIED
p               {+}UNSPECIFIED
pe_name         spark
pe_range        {+}UNSPECIFIED
q_hard          {+}UNSPECIFIED
q_soft          {+}UNSPECIFIED
R               TRUE
r               {+}UNSPECIFIED
rou             {+}UNSPECIFIED
S               {+}UNSPECIFIED
shell           {+}UNSPECIFIED
t               {+}UNSPECIFIED
tc              {+}UNSPECIFIED
V               {+}UNSPECIFIED
v               {+}UNSPECIFIED

The only modifications here are setting the hadoop_exclusive complex, setting the pe to spark (see above), and setting Reservation (R) to TRUE. Reservation holds idle nodes aside when the cluster load is high so that the Spark job can eventually start. For example, if the job requests 10 slots and there is only one free node in the cluster, that node is reserved for the Spark job and nothing is scheduled to it until all 10 nodes are available. Without this, Spark jobs would likely never run on a busy cluster.

Running Spark as a job on a Grid Engine HPC Cluster (part 1)

SEE THE UPDATED GUIDE TO RUNNING ON GRID ENGINE HERE: SparkFlex, aka running Spark jobs on a Grid Engine cluster, take II


Apache Spark has become a pretty big thing where I work. We were originally approached about running it on our HPC cluster about 3 years ago, and, knowing little to nothing about Big Data clusters, I agreed to set it up and get it rolling. Over the last 3 years, we have been gradually improving the process and getting it more and more under the control of execd, which makes it a bit more resilient to crashes and memory overruns.

The first thing that needs to be done is to make sure that all the prerequisites are present for Spark. We use Java 1.7.x, Scala 2.10.x, and Python 2.7.6. We do not use HDFS for the storage; we just use our usual Isilon NFS shares (mounted with nolock, of course) and occasionally GPFS. We use the precompiled version of Spark for Hadoop 1.x, since we happen to have CDH 3.x MapReduce installed on the cluster, although we never use it.

There are two main configuration files for Spark: spark-env.sh and spark-defaults.conf. Both need to be modified for the defaults to fit your environment. As an example, here is our spark-env.sh:

ulimit -n 65535
export SCALA_HOME=/usr/local/scala-2.10.3 
export SPARK_WORKER_DIR=/scratch/spark/work 
export JAVA_HOME=/usr/local/jdk1.7.0_67 
export SPARK_LOG_DIR=~/.spark/logs/$JOB_ID/ 
export SPARK_LOCAL_DIRS=/scratch/spark/tmp 
export PYSPARK_PYTHON=/usr/local/python-2.7.6/bin/python 
export SPARK_SLAVES=/scratch/spark/tmp/slaves 
export SPARK_SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=30"

Our nodes are 128GB/16 cores, so your mileage and settings may vary. /scratch is a local directory on the nodes. Note that you will want some kind of cleanup script for your work and tmp directories; Spark will normally clean up behind itself, but if it exits abnormally, all bets are off.
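A cron’d find is one way to handle that cleanup. This sketch runs against a throwaway directory, but the real targets would be /scratch/spark/work and /scratch/spark/tmp (the seven-day cutoff is an assumption, not something from our setup):

```shell
scratch=$(mktemp -d)                               # stand-in for /scratch/spark
mkdir -p "$scratch/work" "$scratch/tmp"
touch "$scratch/work/stale.log"
touch -d '8 days ago' "$scratch/work/stale.log"    # simulate an abandoned leftover
find "$scratch/work" "$scratch/tmp" -mindepth 1 -mtime +7 -delete
ls "$scratch/work" | wc -l                         # → 0
```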

As for spark-defaults.conf, these are, for the most part, settings I made many, many versions of Spark ago. Other than spark.local.dir, I’m afraid I don’t have much info about them. But that one is important: NEVER place spark.local.dir on locking storage, because the workers will collide with each other.

In the next article, I’ll talk about the setup of Grid Engine to accommodate running the Spark cluster.

Addendum to mongodb on rocksdb-Haswell

So I finally got around to doing the install on the physical box today, and on the initial make static_lib, I got this:

>/opt/src/rocksdb>make static_lib
 GEN util/
 CC db/c.o
 /tmp/ccActpEz.s: Assembler messages:
 /tmp/ccActpEz.s:14025: Error: no such instruction: `shlx %rsi,%r9,%rax'
 /tmp/ccActpEz.s:40750: Error: no such instruction: `shlx %rcx,%r11,%rsi'
 /tmp/ccActpEz.s:40764: Error: no such instruction: `shlx %rax,%r11,%rdx'
 /tmp/ccActpEz.s:41219: Error: no such instruction: `shlx %r8,%rsi,%r9'
 /tmp/ccActpEz.s:41231: Error: no such instruction: `shlx %rdx,%rsi,%rax'
 /tmp/ccActpEz.s:41497: Error: no such instruction: `shlx %rbx,%rsi,%rdi'
 /tmp/ccActpEz.s:41511: Error: no such instruction: `shlx %r14,%rsi,%rdx'

The machine I was installing on has Haswell CPUs, which is what was causing this error: the system assembler was too old to recognize the shlx (BMI2) instructions gcc was emitting. So we had to download, compile, and install binutils 2.25, and then add that to the path.

Installing MongoDB with RocksDB as a storage engine

I honestly don’t really understand most of the words I just wrote in that headline. Nevertheless, when our dba or our software team asks us to install something, install it we do.

This one was fairly tricky, and I’m hoping this post will help some people.

First, some words about our environment. We run Scientific Linux 6.x in our datacenter primarily, and we maintain a shared /usr/local directory (mounted via nfs) across most of the servers. This makes installing software somewhat tricky in two areas. First, most new/Big Data/open source/whatever software these days expects the latest and greatest code, which tends to show up in Ubuntu; unfortunately, Ubuntu is not all that well suited to some other stuff we do, and tends to not be as easy to manage in the datacenter. Second, we want to avoid the default install location of /usr/local for a lot of software, unless it’s being installed for our HPC cluster.

Aaaaanyway. So that basically meant I had to install a number of packages for this to work, as well as compiling with a non-standard location for gcc, g++, etc. And I also had to install the software itself in a non-standard location. I also had to call a newer version of Python than is standard on SL (and you DON’T want to install over the standard version!)

I started out from this blog post. I’m sure those instructions work for a standard install on Ubuntu 14.xx or later, but they need a little expanding for RHEL/CentOS/Scientific Linux.

Set up the path and install some prereqs:
export PATH=/usr/local/gcc/bin:/usr/local/python-2.7.8/bin:$PATH
yum install libmpc --enablerepo=epel
yum install snappy-devel zlib-devel bzip2-devel

Install the rocksdb libraries:

git clone
cd rocksdb
git checkout mongorocks

Change the INSTALL_PATH line in the Makefile to your preferred install location (/opt/rocksdb):

INSTALL_PATH ?= /opt/rocksdb

And make/make install

make static_lib
sudo make install

Download scons, which you will need to build mongo. Note that the version of scons available in the standard repos is not new enough, so you get to compile this from source, too:

curl -O

Install and add it to your path:

python setup.py install --prefix=/opt/scons
export PATH=/opt/scons/bin:$PATH

Download the latest version of Mongo that works with rocksdb:

git clone
cd mongo
git checkout v3.0-mongorocks

Here’s the tricky bit. Scons seems to be rather poorly documented, or at least it’s so flexible that documenting it isn’t helpful. These are the commands I figured out, bearing in mind that our version of GCC (4.9.1) is at /usr/local/gcc and I installed the rocks libraries at /opt/rocksdb:

scons all --cxx=/usr/local/gcc/bin/g++ --cc=/usr/local/gcc/bin/gcc --cpppath=/opt/rocksdb/include --libpath=/opt/rocksdb/lib --rocksdb=1 --prefix=/opt/mongo

scons --cxx=/usr/local/gcc/bin/g++ --cc=/usr/local/gcc/bin/gcc --cpppath=/opt/rocksdb/include --libpath=/opt/rocksdb/lib --rocksdb=1 --prefix=/opt/mongo install

So I’m not certain I need all that jazz for the install line, but it seemed to work, so better safe than sorry. This took about 4 hours to compile in the little VM I was running it in, so I didn’t feel like rebuilding over and over again once I had it actually functional.

Now, in my case, I found that because I couldn’t figure out how to make scons use static libraries and actually work, my dba will need to use this command before running mongo:

export LD_LIBRARY_PATH=/usr/local/gcc/lib64
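To spare the dba from remembering that, a tiny wrapper can set the path before launching mongod. The wrapper itself is my suggestion, not part of the build; the paths are the ones used above:

```shell
#!/bin/sh
# Prepend the gcc-4.9.1 runtime libs so mongod finds the newer libstdc++.
export LD_LIBRARY_PATH=/usr/local/gcc/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
# Launch mongod from its install prefix if it's there:
if [ -x /opt/mongo/bin/mongod ]; then
    exec /opt/mongo/bin/mongod "$@"
fi
```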

If anyone can figure out how to make the static libraries work with a non-standard location, please include it in the comments. I tried --static, but that caused the compile to throw errors about not being able to find -lpthread (??), which it finds just fine without the static flag.

Breaking Down the Monster III

So, finishing this off.

It-sa bunch-a case lines!

Write first:


echo $1 $2 "filesize: "$3 "totalsize: "$4"G" "filesperdir: "$5
case $1 in
	write) #in order write
        if [ $2 = scality ]; then
            time scalitywrite
            exit 0
        fi

So if it’s a Scality (or other pure object storage), it’s simple. Just run the write and time it, which will output the info you need. OTHERWISE…

#Chunk file groups into folders if count is too high
	if [ $totfilecount -ge 10000 ]; then
	    for dir in `seq 1 $foldercount`; do
	        createdir $fspath/$dir
	    done
	    time for dir in `seq 1 $foldercount`; do
	        filecount=$(( $totfilecount / $foldercount ))
	        writefiles
	    done
	else
	    createdir $path
	    time writefiles
	fi
	;;


Do what the comment says. Chunk the files into folders, since if you write to a filesystem, the count of files in a directory makes a big difference. Make sure you create the directories before you try to write to them, and then time how long it takes to write all of them. If it’s less than the critical file count, just write the files and time it.
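The per-folder count is just integer division (the numbers here are an example, not from the script):

```shell
totfilecount=1000000
foldercount=100
filecount=$(( totfilecount / foldercount ))
echo "$filecount files per folder"   # → 10000 files per folder
```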



read) #in order read
	sync; echo 1 > /proc/sys/vm/drop_caches
        if [ $2 = scality ]; then
            time scalityread
            exit 0
        fi
	if [ $totfilecount -ge 10000 ]; then
		time for dir in `seq 1 $foldercount`; do
			filecount=$(( $totfilecount / $foldercount ))
			readfiles
		done
	else
		time readfiles
	fi
	;;

That sync line is how you clear the filesystem cache (as root) on a Linux system. This is important for benchmarking, because let me tell you, 6.4GB/sec is not a speed that most network storage systems can reach. Again, we split it and time all of the reads, or we just straight up time the reads if the file count is low enough. This routine reads files in the order they were written.


	rm) #serial remove files
        if [ $2 = scality ]; then
            time for i in `seq 1 $totfilecount`; do
                curl -s -X DELETE http://localhost:81/proxy/bparc/$fspath/$i-$suffix > /dev/null
            done
            exit 0
        fi
		if [ $totfilecount -ge 10000 ]; then
			time for i in `seq 1 $foldercount`; do
				rm -f $fspath/$i/*-$suffix
				rmdir $fspath/$i
			done
		elif [ -d $fspath/$3 ]; then 
			time rm -f $fspath/*-$suffix
		fi
		;;

Similar to the other two routines, if it’s an object based, do something completely different, otherwise remove based on file path and count of files.


	parrm) #parallel remove files
		time ls $fspath | parallel -N 64 rm -rf $fspath/{}
		;;

This one is remarkably simple: just run parallel against an ls of the top-level directory and pipe it into rm -rf. The {} is the placeholder that parallel fills in with each input line. The -N 64 hands up to 64 entries to each rm invocation; the number of simultaneous jobs is controlled separately with -j.
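Where GNU parallel isn’t installed, xargs can approximate the same fan-out. This is a stand-in, not the post’s command: -P sets how many jobs run at once, and -n batches entries per rm, loosely analogous to parallel’s -N:

```shell
top=$(mktemp -d)                            # throwaway stand-in for $fspath
for i in $(seq 1 8); do mkdir "$top/d$i"; done
ls "$top" | ( cd "$top" && xargs -n 4 -P 4 rm -rf )
ls "$top" | wc -l                           # → 0
```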


This one’s kind of neat:

	shufread) #shuffled read
		sync; echo 1 > /proc/sys/vm/drop_caches
		if [ $totfilecount -ge 10000 ]; then
			folderarray=(`shuf -i 1-$foldercount`)
			time for dir in ${folderarray[*]}; do
				filecount=$(( $totfilecount / $foldercount ))
				shufreadfiles
			done
		else
			time shufreadfiles
		fi
		;;

I needed a way to do random reads over the files I’d written, in order to simulate that on filesystems with little caching (ie, make the drives do a lot of random seeks.)

At first, I tried writing the file paths to a file and then reading that back, but that added waaaay too much latency for performance testing. After some digging, I found the shuf command, which shuffles a list; you can pass it an arbitrary integer range with the -i flag. I tossed the result into an array, and from there it proceeds like the read section.
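shuf -i on its own shows why this gives a random visit order (the folder count here is arbitrary):

```shell
foldercount=5
folderarray=(`shuf -i 1-$foldercount`)
echo "visit order: ${folderarray[*]}"    # same five numbers, different order each run
echo "count: ${#folderarray[@]}"         # → count: 5
```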


	*) usage && exit 1;;
esac

echo '------------------------'

Fairly self explanatory. I tossed an echo with some characters in to keep the output clean if you’re running the command inside a for loop.

And that’s it!