New Spark configuration suggestions

I know that a fair number of people still look at the configuration suggestions I made way back when here. Since then, we’ve made a number of changes to improve performance and reliability. We’ve moved most of the memory and core settings out of the spark-env.sh script and into spark-defaults.conf, which lets advanced users override them and is generally a better place for them. We’re also working on limiting the number of cores and the amount of memory Spark uses, so there’s some headroom left for the OS and background processes (hopefully preventing the runaway load problems we’ve been fighting for a long time).

Upcoming changes will, I believe, provide even better performance and stability. We will be moving the Spark local directories from a single, slow HDD to 3x SSDs (consumer grade, but generally good performance per dollar). This should keep memory from filling up when the reduce phase outruns the disk. I strongly recommend making this change in your environment if at all possible.
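
If you want to do the same, the local dirs setting takes a comma-separated list of directories. We set ours through SPARK_LOCAL_DIRS in spark-env.sh (shown further down), but the equivalent spark-defaults.conf property would look something like this (the mount points here are just the ones from our env file; yours will differ):

spark.local.dir=/scratch-ssd1/sparklocaldir,/scratch-ssd2/sparklocaldir,/scratch-ssd3/sparklocaldir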

The new spark-defaults.conf:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#########################################################################
spark.akka.timeout=300s 
spark.rpc.askTimeout=300s 
spark.storage.blockManagerHeartBeatMs=30000
spark.rpc.retry.wait=30s 
spark.kryoserializer.buffer.max=1024m
spark.core.connection.ack.wait.timeout=600s
spark.driver.maxResultSize=0
spark.python.worker.memory=1536m
spark.driver.memory=70g
spark.executor.memory=25g
spark.executor.cores=5
spark.akka.frameSize=2047

The two identical timeouts are there because under Spark 1.6 and newer, Akka is deprecated (removed entirely in 2.0.0, I believe); without setting both the old and the new property, our workers would repeatedly drop out of the cluster.

It’s crucial to set spark.python.worker.memory to a relatively small value, because pyspark operates outside the JVM(s) that are configured with the spark.executor.memory value. So in our environment, with 128GB nodes, we need to keep the total memory usage below 120GB. The 1536MB isn’t a hard limit, but it is the point where the Python processes are instructed to start spilling their memory to disk. Depending on the application, it might be more advantageous to use less executor.memory and more python.worker.memory.

spark.executor.cores determines the number of cores allocated to each executor JVM and, thus, how many Spark executors will be running on the worker node. In this case, that works out to 3 executors, with one core left free for system processes.
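
As a rough sanity check (and assuming roughly one Python worker per executor core, which is about how it shakes out for us), the per-node math with these settings looks something like this:

# 16-core, 128GB worker nodes (3 x 5 cores for Spark, 1 core left for the OS)
#   executor heap:    3 x 25GB  (spark.executor.memory)       = 75GB
#   pyspark workers: 15 x 1.5GB (spark.python.worker.memory)  = ~22.5GB before they start spilling
#   total:           ~97.5GB, comfortably under the ~120GB ceiling, leaving room for JVM overhead and the OS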

The spark-env.sh file is now much trimmed down from the original (created back in the 0.7 days…):


#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

ulimit -n 65535
export SCALA_HOME=/usr/local/scala-2.10.3

export SPARK_WORKER_DIR=/scratch/spark/work
export JAVA_HOME=/usr/local/jdk1.8.0_45
export SPARK_LOG_DIR=~/.spark/logs/$(date +%H-%F)/
export SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true

###################################
#set disk for shuffle and spilling
#Local hdd 
export SPARK_LOCAL_DIRS=/scratch/spark/tmp

#3 local ssds
#export SPARK_LOCAL_DIRS=/scratch-ssd1/sparklocaldir,/scratch-ssd2/sparklocaldir,/scratch-ssd3/sparklocaldir

#################################

export PYSPARK_PYTHON=/usr/local/python-2.7.6/bin/python
export SPARK_SLAVES=/scratch/spark/tmp/slaves
export SPARK_SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=30"
export SPARK_PUBLIC_DNS=$HOSTNAME

## pull in the user's environment variables on the workers so that PYTHONPATH will transfer
source $HOME/.bash_profile

Basically all we’re doing here is selecting the versions of java, scala, and python to be used by default, setting the location of the slaves list (which may not even be necessary with sparkflex), and setting the various local directories. Note the choice between the local HDD and the 3 SSDs; these paths will of course be different in your environment. You’ll also need to make sure that the Spark log directory already exists.

Moving data between quota protected directories on Isilon, take II

My previous post about moving data between quota’d directories on Isilon is probably one of the most popular I have made, garnering up to 5 hits per week.

But that script has a big issue: it can’t handle special characters or–worse yet–spaces in file and directory names. So here’s an improved version:


#!/bin/bash

#Tests whether there is a valid path
testexist () {
        if [ ! -r "$1" ] || [[ "$1" != *ifs* ]]; then
                echo "$1 is an invalid path. Please check the path and/or use a full path including /ifs."
                exit 1
        fi
}

#Iterates through path backwards to find most closely related quota
findquota () {
        RIGHTPATH=0
        QUOTADIR="$1"
        while [ $RIGHTPATH -eq 0 ]; do
                QUOTADIR=$(echo "$QUOTADIR" | sed 's|/[^/]*$||')
                isi quota list | grep -q "$QUOTADIR" && RIGHTPATH=1
        done
        echo "$QUOTADIR"
}

testquota () {
        if [ "$1" = "-" ] || [ -z $1 ]; then
                echo "No hard directory quota on this directory. Just use mv."
                exit
        fi
}

if [[ $# -ne 2 ]]; then
        #Gets paths from user
        echo "Enter source: "
        read SOURCE
        echo "Enter target: "
        read TARGET
else
        SOURCE="$1"
        TARGET="$2"
fi

testexist "$SOURCE"
testexist "$TARGET"

#Verifies paths with user
echo -en "Moving $SOURCE to $TARGET\nIs this correct? (y/n): "
read ANSWER
if [ "$ANSWER" != "y" ]; then
        exit
fi

#Defines quotas
SOURCEQUOTA=$(findquota "$SOURCE")
TARGETQUOTA=$(findquota "$TARGET")

#Gets size of hard threshold from quota

SOURCETHRESH=$(isi quota view --path="$SOURCEQUOTA" --type=directory | awk -F": " '$1~/Hard Threshold/ {print $2}' | sed 's/\.00//')
TARGETTHRESH=$(isi quota view --path="$TARGETQUOTA" --type=directory | awk -F": " '$1~/Hard Threshold/ {print $2}' | sed 's/\.00//')
testquota "$SOURCETHRESH"
testquota "$TARGETTHRESH"

echo -e "\nOriginal values:"
echo "$SOURCEQUOTA" "$SOURCETHRESH"
echo "$TARGETQUOTA" "$TARGETTHRESH"

isi quota quotas delete --type=directory --path="$SOURCEQUOTA" -f
isi quota quotas delete --type=directory --path="$TARGETQUOTA" -f

mv "$SOURCE" "$TARGET"

isi quota quotas create "$SOURCEQUOTA" directory --hard-threshold="$SOURCETHRESH" --container=yes
isi quota quotas create "$TARGETQUOTA" directory --hard-threshold="$TARGETTHRESH" --container=yes

echo -e "\nNew values:"
isi quota list | grep "$SOURCEQUOTA"
isi quota list | grep "$TARGETQUOTA"

echo -e "\nls -l $TARGET"
ls -l "$TARGET"

This version uses a ton of quotes to handle all the crazy stuff SMB users put on Isilons.
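
Usage looks like this (the script name and paths here are made up); quoting the arguments is what keeps the spaces intact all the way through:

./quotamove.sh "/ifs/data/lab share/old project" "/ifs/data/lab share/archive"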

I’ve also cleaned up the findquota () function. It no longer bothers with incrementing anything; it just walks backward through the path using a sed command. Each time it shortens the path, it checks the result against the quota list on the Isilon. If the grep finds a match, it sets the $RIGHTPATH variable to 1 and exits the loop.
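
For example, each pass of that sed just lops off the last path component (made-up path here), and the loop stops as soon as grep finds the shortened path in the quota list:

$ echo "/ifs/data/lab share/project/subdir" | sed 's|/[^/]*$||'
/ifs/data/lab share/project
$ echo "/ifs/data/lab share/project" | sed 's|/[^/]*$||'
/ifs/data/lab share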

The output at the end is also a little cleaner, so you can tell what the heck it’s doing when it spews out its final results.

Breaking down the monster, part the second

Let’s do this thing.

This little snippet tells the bash builtin time to output in realtime seconds to one decimal place (like the comment says). Very handy modifier, and of course there are others.

#defines output of time in realtime seconds to one decimal place
TIMEFORMAT=%1R
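
For example, with that set, timing a command prints nothing but the elapsed wall-clock seconds:

$ time sleep 2
2.0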

Here's a function to create directories. Seems kind of silly, but if you don't check whether the directory exists first, a plain mkdir throws an error, and who wants that?

#creates directory to write to
createdir () {
	if [ ! -d "$1" ]; then
		mkdir -p "$1"
	fi
}

Now we get to the meat of the thing. There are 5 functions to do the writes, reads, and shuffled reads. But wait, that’s only 3 operations, you say! Well, object is different, so there are two separate functions for writing to and reading from the object store. Obviously these would need to be heavily modified for any object-based storage that’s not Scality, since we’re using their sproxyd method at this time. I haven’t looked into what Swift or S3 looks like, but it’s probably fairly similar, syntax-wise.

So, writes:

#write test
writefiles () {
	#echo WRITE
	for i in `seq 1 $filecount`; do 
		#echo -n .
		dd if=/dev/zero of=$path/$i-$suffix bs=$blocksize count=$blockcount 2> /dev/null
	done
}

Pretty straightforward here. All the variables, aside from i, are globally defined outside the function (see below for an example). The commented-out echo -n . just prints some pretty …….. stuff, if you’re into that sort of thing.
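
For reference, those globals look something like this at the top of the script (the values here are purely illustrative, not what I actually run):

filecount=100			#number of files per test
suffix=ddtest			#tag appended to each file name
blocksize=1M			#dd block size
blockcount=1024			#blocks per file, so 1GB files with the values above
path=/mnt/testvol/benchdir	#filesystem path for the POSIX tests
fspath=/benchdir		#path component used by the object store tests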

Same with reads:

#read test
readfiles () {
	#echo READ
	for i in `seq 1 $filecount`; do 
		#echo -n .
		dd if=$path/$i-$suffix of=/dev/null bs=$blocksize 2> /dev/null
		#dd if=$path/$i-$suffix of=/dev/null bs=$blocksize
	done
}

Now the shuffled reads. I played with a few ways of doing this (like writing the shuffled list out to a file, then reading that file back), but that extra file I/O is expensive, which gives bum results on the timing. So, an array it is, created by shuffling a list of numbers (shuf -i 1-$filecount). There’s still some debugging code commented out in here.

#shuffled read test
shufreadfiles () {
	#echo SHUFFLE READ
	filearray=(`shuf -i 1-$filecount`)
	for i in ${filearray[*]}; do 
		#echo -n .
		#echo $path/$i-$suffix
		dd if=$path/$i-$suffix of=/dev/null bs=$blocksize 2> /dev/null
		#dd if=$path/$i-$suffix of=/dev/null bs=$blocksize
	done
}
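
If you haven’t run into shuf before, the -i flag just emits the integers in that range in random order, one per line, and that’s what gets captured into the array:

$ shuf -i 1-5
3
5
1
2
4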

Now we get to the object stuff. I wanted to eliminate the need for reading from a file to do the writes, so I jiggered up curl to take stdin from dd. This is done with the -T- flag. It doesn’t work in some circumstances (which I will detail in a later post when I talk about Isilon RAN object access), but it does here with plain ol’ unencrypted http calls.

#ObjectWrite
scalitywrite () {
    for i in `seq 1 $filecount`; do
        dd if=/dev/zero bs=$blocksize count=$blockcount 2> /dev/null | curl -s -X PUT http://localhost:81/proxy/bparc$fspath/$i-$suffix -T- > /dev/null
    done
}

So there’s a lot of mess in here where I was trying to get rid of output from curl and dd. This is fairly difficult: dd writes its stats to stderr as a normal program does, but curl needs to be silenced (-s) as well as having its stdout sent to /dev/null. The other note is that Scality sproxyd can be written to in two ways: with a hash that contains some metadata about how to protect the files and where to put them, or as a path. The path is hashed by the system, and the object is written that way. Note that you CAN’T do a file list, and objects aren’t stored in the system by their paths. The full path can be retrieved, but not searched for.

The read is much simpler:

#ObjectRead
scalityread () {
    for i in `seq 1 $filecount`; do
        curl -s -X GET http://localhost:81/proxy/bparc/$fspath/$i-$suffix > /dev/null
    done
}

OK, in the next post, I’ll get to the heart of the script, where it calls all of these here functions.

Hideous powershell and bash scripts for comparing groups

I apparently started writing this post many moons ago. I have no idea what I was doing, but maybe someone will find it useful.

# Dump the members of every AD group under the SciComp OU into one file per group
$Groups = Get-ADGroup -Properties * -Filter * -SearchBase "OU=Groups,OU=SciComp,DC=hhmi,DC=org"
Foreach($G In $Groups)
{
 New-Item U:\Documents\adgroups\$G.Name -type file
 Add-Content U:\Documents\adgroups\$G.Name $G.Members
}
# Rename each dump file to the value between the first "=" and the first "," in its name (i.e. the CN out of a DN-style name); presumably run from inside adgroups/
for i in `ls`; do mv $i `echo $i | awk -F, '{print $1}' | awk -F= '{print $2}'`; done

# Reduce each line of every file (member DNs) down to its CN, writing the cleaned copies to mod/ (which needs to exist)
for i in `ls`; do cat $i | awk -F, '{print $1}' | awk -F= '{print $2}' > mod/$i; done

# Move the cleaned files back over the originals
mv mod/* .

# Pull the same groups out of LDAP (grouptest.sh is a local helper script) and sort them for comparison
for i in `ls adgroups`; do /root/grouptest.sh $i 2>/dev/null | sort > ldapgroups/$i; done
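
The actual comparison never made it into my notes, but presumably it ended with something along the lines of a per-group diff of the two directories:

for i in `ls adgroups`; do echo "=== $i"; diff adgroups/$i ldapgroups/$i; done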