Running Spark as a job on a Grid Engine HPC Cluster (part 2)

UPDATED GUIDE: SparkFlex, aka running Spark jobs on a Grid Engine cluster, take II

The next step in getting Spark running under Grid Engine is to set up your queue, your job class (JC), and your parallel environment (PE).

Users in our environment normally request slots, each of which corresponds roughly to one CPU core and ~8GB of RAM on a host. Our normal Sandy Bridge nodes therefore have 16 slots, and our Haswell nodes have 32.

Spark is much easier to run if you give it the whole machine, of course (although I suspect you can limit the workers in various ways). In our environment, part of the reason Spark is used is to load huge datasets (>1TB) into RAM, so it doesn’t make sense to allocate Spark workers (or the master) based on slot count.

Therefore, our queue for Spark (called hadoop2 for historical reasons) is set up with only 1 slot per node. We also have a complex called hadoop_exclusive, which forces jobs (or tasks) to be scheduled onto whole nodes.
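The definition of hadoop_exclusive isn’t shown in this post. In Grid Engine, a whole-node complex like this is usually a BOOL complex with the EXCL relational operator, attached to the execution hosts through complex_values; that is an assumption about our setup, not something the dumps here confirm. Under that assumption, a minimal sanity check against the output of `qconf -sc` could look like this:

```shell
# Reads `qconf -sc` output on stdin and succeeds only if the named complex
# exists and uses the EXCL relational operator (4th column: relop).
# Assumes the usual column order: name shortcut type relop requestable ...
is_exclusive_complex() {
    awk -v name="$1" '
        $1 == name && $4 == "EXCL" { found = 1 }
        END { exit !found }
    '
}

# Usage on a live cluster (not run here):
#   qconf -sc | is_exclusive_complex hadoop_exclusive && echo "exclusive complex found"
```

If the check fails, requesting hadoop_exclusive=1 in the job class won’t actually keep other jobs off the node.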

>qconf -sq hadoop2
qname                 hadoop2
hostlist              @h02 @h03 @h04 @h05 @h06 @h07 @h08
seq_no                0
load_thresholds       np_load_avg=3
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
qtype                 INTERACTIVE
ckpt_list             NONE
pe_list               hadoop,spark,spark-rc,spark-test
jc_list               hadoop.default,spark.default, \
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
d_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

Little here is non-default, other than the JC and PE lists. Note that you need to set up a separate PE and JC for every version of Spark you’re using.
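Since every Spark version needs its own PE and JC, one way to avoid retyping them is to dump an existing definition, rename it, and load the copy. The sketch below only handles the renaming; the name spark-new is hypothetical, and the `qconf` calls in the comments (`-Ap`, `-Ajc`, `-aattr`) should be checked against your local man page before use.

```shell
# Rewrites one field of a `qconf -sp` or `qconf -sjc` dump read from stdin.
# $1 = field to rewrite (e.g. pe_name or jcname), $2 = new value.
rename_field() {
    awk -v field="$1" -v new="$2" '
        $1 == field { printf "%-22s %s\n", field, new; next }
        { print }
    '
}

# Usage on a live cluster (not run here; spark-new is a hypothetical name):
#   qconf -sp spark | rename_field pe_name spark-new > /tmp/pe.spark-new
#   qconf -Ap /tmp/pe.spark-new
#   qconf -sjc spark | rename_field jcname spark-new \
#       | rename_field pe_name spark-new > /tmp/jc.spark-new
#   qconf -Ajc /tmp/jc.spark-new
#   qconf -aattr queue pe_list spark-new hadoop2
#   qconf -aattr queue jc_list spark-new.default hadoop2
```

The copied PE still points at the same start_proc_args and stop_proc_args, so those need editing by hand if the new version lives elsewhere.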

Here’s the PE setup:

>qconf -sp spark
pe_name                spark
slots                  99999
user_lists             NONE
xuser_lists            NONE
start_proc_args        /usr/local/uge-hadoop/
stop_proc_args         /usr/local/uge-hadoop/
allocation_rule        1
control_slaves         TRUE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     TRUE
daemon_forks_slaves    FALSE
master_forks_slaves    FALSE

The start_proc_args and stop_proc_args are really the secret sauce. I’ll talk about those in part 3. The other options are fairly obvious: you want the master (i.e., the job) to control the slaves, which are running the Spark workers. The accounting_summary option allows you to track the load/memory usage on the slave nodes. job_is_first_task is set to FALSE because the head node, while it contains the Spark master process (and sometimes the driver, depending on how the job is created), does not run any of the Spark workers.
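To make the master/worker split concrete: with allocation_rule 1, each granted slot is on a different host, and Grid Engine lists those hosts in the file named by $PE_HOSTFILE (one line per host, hostname in the first column). The helper below is only an illustration of picking out the non-master hosts from that file; it is not the start_proc_args script, and the function name is made up for this sketch.

```shell
# Prints every host in a PE hostfile except the master host.
# $1 = path to the hostfile (normally $PE_HOSTFILE), $2 = master hostname.
# The hostname must match the form used in the file (short name vs. FQDN).
worker_hosts() {
    awk -v master="$2" '$1 != master { print $1 }' "$1"
}

# Usage inside a running parallel job (not run here):
#   worker_hosts "$PE_HOSTFILE" "$(hostname)"
```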

On to the JC:

>qconf -sjc spark
jcname          spark
variant_list    default
owner           NONE
user_lists      NONE
xuser_lists     NONE
A               {+}UNSPECIFIED
a               {+}UNSPECIFIED
ar              {+}UNSPECIFIED
b               {+}UNSPECIFIED
binding         {+}UNSPECIFIED
c_interval      {+}UNSPECIFIED
c_occasion      {+}UNSPECIFIED
ckpt            {+}UNSPECIFIED
ac              {+}UNSPECIFIED
cwd             {+}UNSPECIFIED
dl              {+}UNSPECIFIED
e               {+}UNSPECIFIED
h               {+}UNSPECIFIED
hold_jid        {+}UNSPECIFIED
hold_jid_ad     {+}UNSPECIFIED
i               {+}UNSPECIFIED
j               {+}UNSPECIFIED
js              {+}UNSPECIFIED
l_hard          hadoop_exclusive=1
l_soft          {+}UNSPECIFIED
masterl         {+}UNSPECIFIED
m               {+}UNSPECIFIED
mbind           {+}UNSPECIFIED
M               {+}UNSPECIFIED
masterq         {+}UNSPECIFIED
N               {+}UNSPECIFIED
notify          {+}UNSPECIFIED
now             {+}UNSPECIFIED
o               {+}UNSPECIFIED
P               {+}UNSPECIFIED
p               {+}UNSPECIFIED
pe_name         spark
pe_range        {+}UNSPECIFIED
q_hard          {+}UNSPECIFIED
q_soft          {+}UNSPECIFIED
R               TRUE
r               {+}UNSPECIFIED
rou             {+}UNSPECIFIED
S               {+}UNSPECIFIED
shell           {+}UNSPECIFIED
t               {+}UNSPECIFIED
tc              {+}UNSPECIFIED
V               {+}UNSPECIFIED
v               {+}UNSPECIFIED

The only modifications here are requesting the hadoop_exclusive complex, setting the PE to spark (see above), and setting reservation (R) to TRUE. Reservation holds free nodes aside when the cluster load is high so that the Spark job can eventually start. For example, if the job requests 10 slots and there is only one free node in the cluster, that node is reserved for the Spark job and nothing else is scheduled to it until all 10 nodes are available. Otherwise, Spark jobs are likely never to run on a busy cluster.
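One thing worth checking alongside R TRUE: the scheduler only honors reservations if max_reservation in the scheduler configuration is greater than 0. Here is a small sketch that reads that value from the output of `qconf -ssconf`; the function names are made up for illustration, and the `qsub` line in the comments is an assumed invocation with a placeholder script name.

```shell
# Reads `qconf -ssconf` output on stdin and prints the max_reservation value.
max_reservation() {
    awk '$1 == "max_reservation" { print $2 }'
}

# Succeeds only if the scheduler may hold at least one reservation.
reservations_enabled() {
    n=$(max_reservation)
    [ -n "$n" ] && [ "$n" -gt 0 ]
}

# Usage on a live cluster (not run here):
#   qconf -ssconf | reservations_enabled || echo "reservations are disabled"
#   qsub -jc spark -pe spark 10 my_spark_job.sh   # hypothetical job script
```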


About kcarlile
Twitter: @overclockdlemon
