Hybrid MPI/Thread runs of Qbox on our Blue Gene Q.

Qbox is a scalable parallel implementation of first-principles molecular dynamics (FPMD) based on the plane-wave, pseudopotential formalism and is designed for operation on large parallel computers. The page http://eslab.ucdavis.edu/software/qbox/index.htm describes the application. The paper "Architecture of Qbox: A scalable first-principles molecular dynamics code" describes the design of an early version of the program.

Hybrid parallel computing refers to a method of parallelization in which multiple paradigms are used within the same code. Combining MPI and thread-based programming within the same application is not unusual. MPI, or message passing, can be used to communicate between tasks on the same node or between nodes. Threading can be used to coordinate the use of multiple cores on a node by a single task. OpenMP is a common API for thread-based parallel computing.
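
To make this concrete, the sketch below shows the usual shape of a hybrid MPI/OpenMP program. It is a generic illustration, not code from Qbox, and the file name hybrid_hello.cpp is ours: each MPI task requests a threading level from MPI and then opens an OpenMP parallel region on the cores assigned to it.

// hybrid_hello.cpp - minimal hybrid MPI/OpenMP sketch (illustration only, not Qbox code).
// Each MPI task opens an OpenMP parallel region and reports its rank/thread pair,
// which is enough to confirm that a hybrid launch behaves as intended.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided;
    // Request a threading level that allows OpenMP regions inside each MPI task.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // OMP_NUM_THREADS controls how many threads execute this region in each task.
    #pragma omp parallel
    {
        printf("MPI task %d of %d, OpenMP thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Compiled with an MPI C++ wrapper and the compiler's OpenMP flag, a program like this can be launched with the same srun and OMP_NUM_THREADS settings shown below to check how tasks and threads are distributed.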

The IBM Blue Gene/Q supercomputer (BGQ) has 16 cores per node, so a natural assumption is that the best way to program for it is to assign 16 MPI tasks per node. However, there are other important considerations when programming the BGQ, in particular that each core supports four hardware threads, so a node can run up to 64 hardware threads.

We ran one of the Qbox test cases distributed with the source, test/h2ocg, on our BGQ (Mc2), using a fixed number of nodes (4) while varying the number of MPI tasks per node and the number of threads per MPI task.

The normal command sequence to run this application with 8 MPI tasks per node and 4 threads per task, using the input file test.i, would be:

export OMP_NUM_THREADS=4
srun  --overcommit --ntasks-per-node=8 /opt/qbox/1.60.9/bin/qb test.i

We ran all cases in the set:

for MPI_TASKS_PER_NODE in [1, 2, 4, 8, 16, 32, 64]:
    for THREADS_PER_TASK in [1, 2, 4, 8, 16, 32, 64]:
        things = MPI_TASKS_PER_NODE * THREADS_PER_TASK
        if things <= 64:
            run qbox

This ran all combinations of thread and MPI task counts whose product did not exceed 64. Note that this allowed for up to a 4x oversubscription of the 16 cores on a node.

Our runscript for doing this is shown below.

The two charts below show the runtimes and speedups for the various numbers of tasks and threads. Speedup is measured relative to running 1 MPI task per node with 1 thread per task. The "green" chart (the first table below) is organized by task and thread counts; the "blue" chart (the second table) is sorted with the best speedup on top. The plot shows the same data graphically.
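
For reference, a speedup entry is just the baseline time divided by the time for a given configuration; using values from the tables below:

speedup(tasks, threads) = T(1 task, 1 thread) / T(tasks, threads)
e.g., 49.89 / 7.45 ≈ 6.70 for 8 tasks per node and 4 threads per task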

We make the following observations:

Our best runtime was obtained with 8 tasks per node and 4 threads per task, a "things" count of 32, or a 2x oversubscription. Slightly behind that was 8 tasks per node and 8 threads per task, a 4x oversubscription.

The plot is interesting. We see that the "sweet spot" for this test is in the box around 8 to 16 MPI tasks per node and 2 to 8 threads per task.

Qbox example runtimes
Various numbers of MPI tasks / node and Threads / task

Tasks / Node   Threads / Task    Time   Tasks * Threads   Speedup
      1               1         49.89          1            1.00
      1               2         39.12          2            1.28
      1               4         33.13          4            1.51
      1               8         30.21          8            1.65
      1              16         29.22         16            1.71
      1              32         32.48         32            1.54
      1              64         53.84         64            0.93
      2               1         26.85          2            1.86
      2               2         21.49          4            2.32
      2               4         18.44          8            2.71
      2               8         17.00         16            2.93
      2              16         16.85         32            2.96
      2              32         19.75         64            2.53
      4               1         14.99          4            3.33
      4               2         12.38          8            4.03
      4               4         10.90         16            4.58
      4               8         10.40         32            4.80
      4              16         10.84         64            4.60
      8               1          9.22          8            5.41
      8               2          7.98         16            6.25
      8               4          7.45         32            6.70
      8               8          7.57         64            6.59
     16               1         10.07         16            4.95
     16               2          9.27         32            5.38
     16               4          8.89         64            5.61
     32               1         15.21         32            3.28
     32               2         14.55         64            3.43
     64               1         26.06         64            1.91

Qbox example runtimes, sorted by speedup (best first)
Various numbers of MPI tasks / node and Threads / task

Tasks / Node   Threads / Task    Time   Tasks * Threads   Speedup
      8               4          7.45         32            6.70
      8               8          7.57         64            6.59
      8               2          7.98         16            6.25
     16               4          8.89         64            5.61
      8               1          9.22          8            5.41
     16               2          9.27         32            5.38
     16               1         10.07         16            4.95
      4               8         10.40         32            4.80
      4              16         10.84         64            4.60
      4               4         10.90         16            4.58
      4               2         12.38          8            4.03
     32               2         14.55         64            3.43
      4               1         14.99          4            3.33
     32               1         15.21         32            3.28
      2              16         16.85         32            2.96
      2               8         17.00         16            2.93
      2               4         18.44          8            2.71
      2              32         19.75         64            2.53
      2               2         21.49          4            2.32
     64               1         26.06         64            1.91
      2               1         26.85          2            1.86
      1              16         29.22         16            1.71
      1               8         30.21          8            1.65
      1              32         32.48         32            1.54
      1               4         33.13          4            1.51
      1               2         39.12          2            1.28
      1               1         49.89          1            1.00
      1              64         53.84         64            0.93

[figure_1c: plot of the runtime and speedup data above versus MPI tasks per node and threads per task]

Advice

The optimum number of MPI tasks per node and threads per task is going to be highly application and data set dependent. Users should try different combinations to see what works best for them.

It is expected that future HPC platforms will rely more heavily on hybrid programming to achieve peak performance. Application authors should be writing their programs in a hybrid programming style today.

Run Script

The runscript shown below was used to generate the data for this note. The application /opt/utility/phostname gives detailed mappings of tasks and threads to cores, and we include it in our script to ensure we are getting the mappings we expect.

#!/bin/bash
#SBATCH --job-name="atest"
#SBATCH --nodes=4
##SBATCH --ntasks-per-node=16
#SBATCH --time=00:30:00
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
#SBATCH --export=ALL
##SBATCH --mail-type=ALL
##SBATCH --mail-user=joeuser@mines.edu

#----------------------
cd $SLURM_SUBMIT_DIR

# Sweep all combinations of MPI tasks per node (TPN) and OpenMP threads
# per task whose product does not exceed 64.
for TPN in 1 2 4 8 16 32 64 ; do
 for OMP_NUM_THREADS in 1 2 4 8 16 32 64 ; do
  # Export so srun passes the thread count into the application's environment.
  export OMP_NUM_THREADS
  things=$((TPN * OMP_NUM_THREADS))
  if [ $things -le 64 ] ; then
   OUTPUT=set04_${TPN}_${OMP_NUM_THREADS}
   echo "OMP_NUM_THREADS=" $OMP_NUM_THREADS > $OUTPUT
   # Record the task/thread-to-core mapping, then run the Qbox test case.
   srun -N 4 --overcommit --ntasks-per-node=$TPN /opt/utility/phostname -F | sort >> $OUTPUT
   srun -N 4 --overcommit --ntasks-per-node=$TPN /opt/qbox/1.60.9/bin/qb test.i >> $OUTPUT
  else
   echo skipped $OMP_NUM_THREADS $TPN $things
  fi
 done
done