Threading for IBM Power8 nodes

The HPC platform, Mio, has two IBM Power8 nodes, ppc001 and ppc002. Each node has 256 Gbytes of memory and two K80 GPU boards. (Each K80 board actually contains two GPUs, so each node presents four GPUs.) The nodes have 20 cores each and each core can support 8 threads. This note discusses how to map MPI processes and threads to the cores. In particular, we will look at mappings for hybrid MPI/OpenMP applications. The methods described here are also applicable to pure threaded applications. These results are specific to using IBM's version of MPI along with the XL compilers.

Why might you be interested in specifying a specific mapping of threads to cores? Performance! By default, the thread scheduler may "stuff" threads on cores, filling a core with 8 threads and leaving other cores empty. Also, if you are using the GPUs you might want to have specific cores partnering with individual GPUs.

In particular, we will show that we can control mapping of threads to cores using the environmental variable MP_TASK_AFFINITY and with proper mapping we can greatly increase performance.

We will first discuss how cores are numbered on the Power8 nodes. We can see the configuration of the nodes using the command:

[tkaiser@mio001 ibm]$ /opt/utility/slurmnodes ppc002
ppc002
   NodeHostName ppc002
   CoresPerSocket 10
   CPULoad 0.70
   State IDLE
   Version 15.08
   TmpDisk 949020
   SlurmdStartTime 2016-10-21T17:32:53
   BootTime 2016-10-21T16:06:57
   RealMemory 257476
   Boards 1
   Gres gpu:kepler:4
   CPUAlloc 0
   NodeAddr ppc002
   FreeMem 249830
   AllocMem 0
   Features (null)
   Weight 1
   CPUTot 160
   ThreadsPerCore 8
   CapWatts n/a
   Owner N/A
   CPUErr 0
   Sockets 2
[tkaiser@mio001 ibm]$ 

Some things to note: There are 10 CoresPerSocket and 2 Sockets for a total of 20 physical cores. There are 8 ThreadsPerCore allowed. The odd feature is that these are combined to give an effective, or logical, CPUTot of 160. That is, there are 160 "slots" for active processes or threads. We are interested in mapping to these 160 logical cores.

The physical cores on the nodes are numbered 0-19 and the logical cores are numbered from 0-159. Logical cores 0-7 map to physical core 0, 8-15 map to physical core 1 and so on. The physical core for a particular logical core is int(logical_core/8).
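
As a quick check of this numbering, a couple of lines of Python (an illustrative helper, not part of the test codes) reproduce the mapping:

# Map a logical CPU number (0-159) to its physical core (0-19), assuming the
# 8 hardware threads per core reported by Slurm above.
def physical_core(logical_cpu, threads_per_core=8):
    return logical_cpu // threads_per_core

# Logical CPUs 0-7 live on physical core 0, 8-15 on core 1, and so on.
for cpu in (0, 7, 8, 15, 159):
    print(cpu, "->", physical_core(cpu))   # 0->0, 7->0, 8->1, 15->1, 159->19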

Testing

The first part of this document discusses running 20 or fewer MPI tasks/node. The second covers running 40, 80, and 160 MPI tasks/node.

We run a simple, predictable hybrid MPI/OpenMP program with various numbers of MPI tasks, setting environmental variables that affect the number of threads and the distribution of threads to logical and physical cores. We report the distribution of MPI tasks and threads and the relative performance.

The Basic Program

Our test program simply does a number of matrix inversions. MPI task 0 reads three integers. The first is the size of the matrix. The second is the number of matrices that will be generated and inverted by each MPI task. The third is an iteration count, that is, how many times the sets of matrices will be inverted. Our inputs are:

[tkaiser@ppc001 ibm]$ cat input
1024 160 10
.true.

The value .true. specifies that there is a synchronization between MPI tasks after each set of inversions.

There is a single 3d array that holds all of the 2d matrices. We have a pointer, twod, that is used to reference sections of the 3d array as a 2d matrix. The data type for our matrices is "b8", which maps via a parameter statement to an 8-byte real.

The code for generating the data is:

 allocate(tarf(msize,msize,nrays))
 ...
      tarf=1.0_b8
!$OMP PARALLEL DO PRIVATE(twod,i3,j)
      do i=1,nrays
        twod=>tarf(:,:,i)
        j=omp_get_thread_num()+1
           do i3=1,msize
             twod(i3,i3)=j+10.0_b8
           enddo
      enddo

Here msize is the matrix size and nrays is the number of matrices. As you can see, the matrix generation is threaded.

We use the LAPACK routine DGESV to do the matrix solves. The solves are also threaded:

!$OMP PARALLEL DO PRIVATE(twod)
      do i=1,nrays
        twod=>tarf(:,:,i)
        call my_clock(cnt1(i))
        CALL DGESV( N, NRHS, twod, LDA, IPIVs(:,i), Bs(:,i), LDB, INFOs(i) )
        call my_clock(cnt2(i))
        write(17,'(i5,i5,3(f12.3))')i,infos(i),cnt2(i),cnt1(i),real(cnt2(i)-cnt1(i),b8)
      enddo

There is an important, not obvious, point about the routine DGESV: it is not inherently threaded. That is, each call to DGESV will use only a single thread. However, calling DGESV inside a threaded region causes each invocation to be given its own thread. Thus we can have several instances of DGESV running independently in parallel.

A simple makefile for building the application is:

LPATH=/software/lib/lapack/3.5.0/xl
LIBS=-llapack -lblas

default: pointerMPI

pointerMPI: pointerMPInrc.f90
	mpfort  -compiler xlf90_r -L$(LPATH) -O3 -qsmp=omp  pointerMPInrc.f90 $(LIBS) -o pointerMPI
	rm -rf numz.mod

Where:

mpfort -compiler xlf90_r
Specifies the IBM version of MPI with the thread-safe Fortran compiler xlf90_r as the back end
-L$(LPATH) = -L/software/lib/lapack/3.5.0/xl
Path to LAPACK and BLAS directory (built from source)
-O3 -qsmp=omp
Optimization and enabling OpenMP
pointerMPInrc.f90
Source file
$(LIBS) = -llapack -lblas
Libraries for DGESV
-o pointerMPI
Specifies the name of our executable program

Getting environmental variables from within the program

There are a number of environmental variables that are important to our study.

OMP_NUM_THREADS
Sets the number of threads to use during execution
XLSMPOPTS
Runtime options affecting parallel processing.
MP_TASK_AFFINITY
Controls the placement of tasks in a parallel job

We can use the Fortran subroutine call GET_ENVIRONMENT_VARIABLE to get the value of these variables from within the program.

For more information about these variables see the first three references given below.

We also use the environmental variable SLURM_JOB_ID which gets a new value for each job to create a new set of output files for each test. Our output file names are a combination of SLURM_JOB_ID, the MPI task ID and the setting of the variable MP_TASK_AFFINITY.

      character(len=4096)::ont,MP_TASK_AFFINITY,XLSMPOPTS,SLURM_JOB_ID
...
...
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
...
...
      call GET_ENVIRONMENT_VARIABLE("OMP_NUM_THREADS",  &
                                     value=ont,status=is0)
      call GET_ENVIRONMENT_VARIABLE("MP_TASK_AFFINITY", &
                                     value=MP_TASK_AFFINITY,status=is1)
      if(is1 .ne. 0)write(MP_TASK_AFFINITY,"('none')")
      call GET_ENVIRONMENT_VARIABLE("XLSMPOPTS", &
                                     value=XLSMPOPTS,status=is2)
      if(is2 .ne. 0)write(XLSMPOPTS,"('none')")
      call GET_ENVIRONMENT_VARIABLE("SLURM_JOB_ID", &
                                     value=SLURM_JOB_ID,status=is3)
      if(is3 .ne. 0)write(SLURM_JOB_ID,"('xxxxx')")
      write(pname,"(i3.3,'_',i3.3,'_',a,'_',a)")myid,&
                                  omp_get_max_threads(),&
                                  trim(SLURM_JOB_ID), &
                                  trim(MP_TASK_AFFINITY)
      open(unit=17,file=trim(pname),status="unknown",POSITION='APPEND')
      write(17,*)"For ",myid," OMP_NUM_THREADS=",trim(ont)
      write(17,*)"For ",myid,"       XLSMPOPTS=",trim(XLSMPOPTS)

The file "007_016_1874756_core:2" was created by MPI task 7 running 16 threads as part of job 1874756 with MP_TASK_AFFINITY set to core:2.

Our example program, pointerMPInrc.f90, was run using the Slurm script shown here.

#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --partition=ppc
#SBATCH --overcommit
#SBATCH --exclusive
#SBATCH --nodes=1
##SBATCH --ntasks=5
#SBATCH --export=ALL
#SBATCH --out=%J.out

# Go to the directory from which our job was launched
cd $SLURM_SUBMIT_DIR

source /etc/profile
module purge

unset OMP_PROC_BIND


export MYID=$SLURM_JOBID
cat $0 > $MYID.script

export MP_RESD=poe
export MP_HOSTFILE=$MYID.list
printenv > $MYID.env
export MP_LABELIO=yes

./expands > $MP_HOSTFILE

export INPUT=input
cat $INPUT >> $MYID.out
export XLFRTEOPTS="buffering=disable_all"
export EXE=./pointerMPI
ulimit -c 0
for T in  1 2 4 8 16 20 40 80 160 ; do
	export OMP_NUM_THREADS=$T
	for S in core:1 core:2 core:4 core:8 core:10 core:20 ; do
		export MP_TASK_AFFINITY=$S
		echo $MP_TASK_AFFINITY >> $MYID.out
        poe $EXE  -procs $SLURM_NTASKS < $INPUT  >> $MYID.out
	done
done
unset MP_RESD
unset MP_HOSTFILE

The program is run a number of times in a nested "for" loop with various values for OMP_NUM_THREADS and MP_TASK_AFFINITY. Note that OMP_PROC_BIND is unset; this needs to be done for MP_TASK_AFFINITY to have the effects discussed below. Also, the script was run with various numbers of MPI tasks using the following command:

for ntasks in 1 2 4 5 8 10 20 ; do  sbatch -n $ntasks doit; done

After the jobs completed, the output files created by running with a particular number of MPI tasks were combined in a single directory. The post processing of the output files will be described in detail below. In short, we are interested in the number of matrix inversions per second as a function of the settings of the environmental variables.

Macro effects of the variables

The environmental variable MP_TASK_AFFINITY is set to core:N, where N takes a value from 1 to 20. This variable determines the number of physical cores allocated to each individual MPI task. If an MPI task launches fewer than N threads, some physical cores are left idle. If the number of threads is greater than N, physical cores run more than one thread. This can be seen in cases 62-69, where we have 2 MPI tasks and MP_TASK_AFFINITY=core:10. With a single thread, the first MPI task gets physical core 0 and the second task gets physical core 10. As we add threads, successive cores become occupied. With 16 threads per task (case 66) all cores are busy and 12 cores are oversubscribed, 6 for each task. With 20, 40, and 80 threads we have all cores in use with an equal load.
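
The per-core load implied by a given combination of tasks, threads, and core:N can be estimated with a few lines of Python. This is only a rough sketch, assuming the placement rule just described (each MPI task owns N physical cores and deals its threads out round robin across them); the helper name is hypothetical and not part of the test codes.

# Core usage implied by MP_TASK_AFFINITY=core:N on a 20-core node, assuming
# each task owns N physical cores and spreads its threads round robin.
def core_usage(tasks, threads_per_task, n, total_cores=20):
    assigned = min(tasks * n, total_cores)                          # cores reserved for the tasks
    busy = min(tasks * min(threads_per_task, n), total_cores)       # cores that get at least one thread
    load = threads_per_task / n                                     # average threads per assigned core
    return assigned, busy, load

# 2 MPI tasks with MP_TASK_AFFINITY=core:10 (cases 62-69 above)
for threads in (1, 16, 40):
    print(threads, core_usage(2, threads, 10))
    # 1 -> (20, 2, 0.1); 16 -> (20, 20, 1.6); 40 -> (20, 20, 4.0)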

You may note that our script does not set the variable XLSMPOPTS even though it is "collected" and printed by our program. XLSMPOPTS is an important variable because it determines the actual mapping of MPI tasks and OpenMP threads to cores. It could be set in the script, but that would not be particularly useful because each MPI task would have the same value and thus the same mapping.

Note that in our script the command poe is used to launch our program, taking the place of the more common srun or mpirun commands. The IBM poe command generates a different setting for XLSMPOPTS for each MPI task. The value of XLSMPOPTS is a function of the MPI task id, OMP_NUM_THREADS, and MP_TASK_AFFINITY. Going back to our example, the file 007_016_1874756_core:2 was created by MPI task 7 running 16 threads as part of job 1874756 with MP_TASK_AFFINITY set to core:2. Looking at the output files from the other MPI tasks of this run we get:

[tkaiser@mio001 08]$ grep XLSMPOPTS *_016_*
000_016_1874756_core:2: For  0        XLSMPOPTS=parthds=16:procs=0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15
001_016_1874756_core:2: For  1        XLSMPOPTS=parthds=16:procs=16,24,17,25,18,26,19,27,20,28,21,29,22,30,23,31
002_016_1874756_core:2: For  2        XLSMPOPTS=parthds=16:procs=32,40,33,41,34,42,35,43,36,44,37,45,38,46,39,47
003_016_1874756_core:2: For  3        XLSMPOPTS=parthds=16:procs=48,56,49,57,50,58,51,59,52,60,53,61,54,62,55,63
004_016_1874756_core:2: For  4        XLSMPOPTS=parthds=16:procs=64,72,65,73,66,74,67,75,68,76,69,77,70,78,71,79
005_016_1874756_core:2: For  5        XLSMPOPTS=parthds=16:procs=80,88,81,89,82,90,83,91,84,92,85,93,86,94,87,95
006_016_1874756_core:2: For  6        XLSMPOPTS=parthds=16:procs=96,104,97,105,98,106,99,107,100,108,101,109,102,110,103,111
007_016_1874756_core:2: For  7        XLSMPOPTS=parthds=16:procs=112,120,113,121,114,122,115,123,116,124,117,125,118,126,119,127
[tkaiser@mio001 08]$ 

This shows we have 8 MPI tasks, each with 16 threads. For task 0 the threads get mapped to logical processors 0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15, or physical cores 0 and 1. Task 1 has threads on physical cores 2 and 3 ... task 7 has threads on physical cores 14 and 15.

MP_TASK_AFFINITY mapping of tasks and threads and performance

As stated at the beginning of this note, we are interested in performance. Here we use a relative measure of performance because absolute performance would only be meaningful if we used a highly tuned matrix inversion algorithm. We only want to see the changes in performance caused by different task and thread mappings. Our measure of performance is the number of matrix inversions performed in one second.

Our example program produces an output file for each MPI task as shown above. Our post processing program postit.py takes the set of files produced by each run and pulls out the "XLSMPOPTS" line and the lines giving the run times for the collections of matrix inversions. It uses the XLSMPOPTS lines from all of the files to determine the number of threads running on each core. Statistics are gathered for the inversion run times: we calculate the minimum, maximum, average, and variance of the number of matrix inversions per second over all iterations and MPI tasks. This information is printed along with the number of MPI tasks, the number of threads, and the setting of MP_TASK_AFFINITY.
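
The fragment below is not the actual postit.py; it is only a sketch of the idea, assuming the per-inversion timing lines have the layout written by the Fortran code above (index, INFO, stop time, start time, elapsed time). The file pattern and variable names are illustrative.

import glob, statistics

# Gather per-inversion rates from a directory of output files. Lines that
# begin with "For" carry the environment settings and are skipped here;
# the remaining lines end with the elapsed time for one inversion.
rates = []
for fname in glob.glob("*core*"):
    with open(fname) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] == "For":
                continue
            elapsed = float(fields[-1])
            if elapsed > 0.0:
                rates.append(1.0 / elapsed)     # inversions per second

print("min", min(rates), "max", max(rates),
      "mean", statistics.mean(rates), "variance", statistics.variance(rates))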

We post process our data files by first creating directories 01, 02, 04, 05, 08, 10, and 20; the numbers correspond to the number of MPI tasks used for the runs. We then move the data files from each run to their respective directories. Then we run the script postit.py using the command:

../postit.py *core*

The results are shown in Figure 1.

Figure 1. MPI tasks and OpenMP threads are mapped to physical cores.

We first note that not all test cases implied by the nested for loops in our run script are represented in our output. The Power8 nodes are currently configured to allow at most 20 MPI tasks per node and at most 8 threads per core. If the combination of environmental variables implies a case outside of these limits, the run fails with an error message stating that the limits are violated.

There is a direct mapping between OMP_NUM_THREADS and the "parthds" portion of the XLSMPOPTS variable. The core:N setting then maps the threads from a task to N physical cores. For example, consider the cases where we had 4 MPI tasks, 8 threads, and MP_TASK_AFFINITY in the set (core:1, core:2, core:4). A sorted grep for "parthds=8" on the 4-MPI-task runs produces the following.

04/000_008_1874754_core:1: For  0  XLSMPOPTS=parthds=8:procs=0,1,2,3,4,5,6,7                 
04/001_008_1874754_core:1: For  1  XLSMPOPTS=parthds=8:procs=8,9,10,11,12,13,14,15           
04/002_008_1874754_core:1: For  2  XLSMPOPTS=parthds=8:procs=16,17,18,19,20,21,22,23         
04/003_008_1874754_core:1: For  3  XLSMPOPTS=parthds=8:procs=24,25,26,27,28,29,30,31         

04/000_008_1874754_core:2: For  0  XLSMPOPTS=parthds=8:procs=0,8,1,9,2,10,3,11               
04/001_008_1874754_core:2: For  1  XLSMPOPTS=parthds=8:procs=16,24,17,25,18,26,19,27         
04/002_008_1874754_core:2: For  2  XLSMPOPTS=parthds=8:procs=32,40,33,41,34,42,35,43         
04/003_008_1874754_core:2: For  3  XLSMPOPTS=parthds=8:procs=48,56,49,57,50,58,51,59         

04/000_008_1874754_core:4: For  0  XLSMPOPTS=parthds=8:procs=0,8,16,24,1,9,17,25             
04/001_008_1874754_core:4: For  1  XLSMPOPTS=parthds=8:procs=32,40,48,56,33,41,49,57
04/002_008_1874754_core:4: For  2  XLSMPOPTS=parthds=8:procs=64,72,80,88,65,73,81,89
04/003_008_1874754_core:4: For  3  XLSMPOPTS=parthds=8:procs=96,104,112,120,97,105,113,121

We see that the procs string (logical cores) has 8 entries. The core:N setting maps each task onto N physical cores, and each physical core carries 8/N threads. That is, the logical-core to physical-core mapping is:

04/000_008_1874754_core:1: For  0  XLSMPOPTS=parthds=8:procs=0,1,2,3,4,5,6,7
                                                       Physical cores: 0*8
04/001_008_1874754_core:1: For  1  XLSMPOPTS=parthds=8:procs=8,9,10,11,12,13,14,15           
                                                       Physical cores: 1*8
04/002_008_1874754_core:1: For  2  XLSMPOPTS=parthds=8:procs=16,17,18,19,20,21,22,23         
                                                       Physical cores: 2*8
04/003_008_1874754_core:1: For  3  XLSMPOPTS=parthds=8:procs=24,25,26,27,28,29,30,31         
                                                       Physical cores: 3*8

04/000_008_1874754_core:2: For  0  XLSMPOPTS=parthds=8:procs=0,8,1,9,2,10,3,11               
                                                       Physical cores: 0*4,1*4
04/001_008_1874754_core:2: For  1  XLSMPOPTS=parthds=8:procs=16,24,17,25,18,26,19,27         
                                                       Physical cores: 2*4,3*4
04/002_008_1874754_core:2: For  2  XLSMPOPTS=parthds=8:procs=32,40,33,41,34,42,35,43         
                                                       Physical cores: 4*4,5*4
04/003_008_1874754_core:2: For  3  XLSMPOPTS=parthds=8:procs=48,56,49,57,50,58,51,59         
                                                       Physical cores: 6*4,7*4

04/000_008_1874754_core:4: For  0  XLSMPOPTS=parthds=8:procs=0,8,16,24,1,9,17,25             
                                                       Physical cores: 0*2,1*2,2*2,3*2
04/001_008_1874754_core:4: For  1  XLSMPOPTS=parthds=8:procs=32,40,48,56,33,41,49,57
                                                       Physical cores: 4*2,5*2,6*2,7*2
04/002_008_1874754_core:4: For  2  XLSMPOPTS=parthds=8:procs=64,72,80,88,65,73,81,89
                                                       Physical cores: 8*2,9*2,10*2,11*2
04/003_008_1874754_core:4: For  3  XLSMPOPTS=parthds=8:procs=96,104,112,120,97,105,113,121
                                                       Physical cores: 12*2,13*2,14*2,15*2
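
The "Physical cores" annotations above can be generated directly from a procs string. A possible helper (illustrative only, not the actual postit.py) is:

from collections import Counter

# Condense an XLSMPOPTS procs list into the "physical_core*thread_count"
# notation used above, with 8 logical CPUs per physical core.
def summarize(procs, threads_per_core=8):
    counts = Counter(int(p) // threads_per_core for p in procs.split(","))
    return ",".join(f"{core}*{n}" for core, n in sorted(counts.items()))

print(summarize("0,8,1,9,2,10,3,11"))        # 0*4,1*4
print(summarize("0,8,16,24,1,9,17,25"))      # 0*2,1*2,2*2,3*2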

Next we note that we get the best performance (matrix inversion rate) for this program when we are using all of the cores. This can be seen by comparing cases 36-39, 66-69, 96-99, 113-116, and 117-120 to the other runs. We go from a minimum rate of 7.07 inversions/second (case 1), to 140 inversions/second when each core has a load of 1, to over 180 inversions/second when each core has a load of 4. The performance drops a bit when the load on each core increases to 8.

We next look at the mappings in more detail for the cases where we have 4 threads per core and all 20 cores are in use, cases 38, 68, 98, 115, and 119. The environmental variable settings for these cases are:

  1. MPI tasks 1 OMP_NUM_THREADS=80 MP_TASK_AFFINITY=core:20
  2. MPI tasks 2 OMP_NUM_THREADS=40 MP_TASK_AFFINITY=core:10
  3. MPI tasks 5 OMP_NUM_THREADS=16 MP_TASK_AFFINITY=core:4
  4. MPI tasks 10 OMP_NUM_THREADS=8 MP_TASK_AFFINITY=core:2
  5. MPI tasks 20 OMP_NUM_THREADS=4 MP_TASK_AFFINITY=core:1
Figure 2. Each MPI task gets a different "procs" string, mapping it to logical and physical cores.

For this particular program the performance is similar in all of these cases, but the mappings used to get 4 threads per core are different. For example, in case 38 we have a single MPI task with 80 threads. The threads are distributed over all of the cores: the first 20 threads are spread evenly over the 20 physical cores, and the remaining threads are then distributed in a round-robin manner. With multiple MPI tasks, say M tasks, the physical cores are first divided into M groups. The threads of each MPI task are then spread over its group of cores.
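
The procs strings shown earlier follow a regular pattern, and the rule can be written down compactly. The sketch below is inferred from the observed output, not taken from IBM documentation, and the function name is hypothetical:

# Reconstruct the logical-CPU list that poe appears to hand a task when
# MP_TASK_AFFINITY=core:N: the task's N physical cores are filled breadth
# first, one hardware thread at a time.
def procs_for_task(task, n, threads, threads_per_core=8):
    first = task * n * threads_per_core                    # first logical CPU owned by this task
    cores = [first + c * threads_per_core for c in range(n)]
    return [cores[i % n] + i // n for i in range(threads)]

print(procs_for_task(0, 2, 8))    # [0, 8, 1, 9, 2, 10, 3, 11]
print(procs_for_task(1, 4, 8))    # [32, 40, 48, 56, 33, 41, 49, 57]
print(procs_for_task(7, 2, 16))   # [112, 120, 113, 121, ..., 119, 127]  (compare 007_016_1874756_core:2)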

More than 20 MPI tasks/node

With more than 20 MPI tasks per node we need to use MP_TASK_AFFINITY=cpu:n, where n is an integer. We ran cases with 40 MPI tasks per node (with 1, 2, and 4 threads), 80 tasks (with 1 and 2 threads), and 160 tasks (with 1 thread). The results are reported below.

Figure 3. MPI tasks and OpenMP threads are mapped to physical cores with more than 20 MPI tasks/node.

For cases 122, 123, 126, 131, 133, and 134, the combination of the number of threads, the number of MPI tasks, and the given setting for MP_TASK_AFFINITY resulted in XLSMPOPTS being set to "none", with poor performance. This deserves further investigation.

Another method of spreading tasks and threads to cores, useful when you have 40 or 80 MPI tasks, is to set the environmental variable MP_BIND_MODE=spread and to set MP_TASK_AFFINITY=cpu. Note there is no number after cpu in this case. The results are shown in Figure 4. For these cases, the value of XLSMPOPTS contains "-1" entries in the procs string. For example, for case 138 the various MPI tasks get:

For  0        XLSMPOPTS=parthds=4:procs=0,-1,-1,-1
For  1        XLSMPOPTS=parthds=4:procs=4,-1,-1,-1
For  2        XLSMPOPTS=parthds=4:procs=8,-1,-1,-1
...
...
For  38        XLSMPOPTS=parthds=4:procs=152,-1,-1,-1
For  39        XLSMPOPTS=parthds=4:procs=156,-1,-1,-1

By using the ssh and ps commands discussed in the notes below, we can see that the threads for each MPI task get mapped to the same logical core instead of to different logical cores on the same physical core.

Figure 4. MPI tasks and OpenMP threads are mapped to physical cores with more than 20 MPI tasks/node, using MP_BIND_MODE=spread and MP_TASK_AFFINITY=cpu.

Conclusion and recommendations

We have shown that it is possible to map MPI tasks and OpenMP threads to various logical, and thus physical, cores on the Power8 nodes. This is done using the environmental variables OMP_NUM_THREADS and MP_TASK_AFFINITY (with OMP_PROC_BIND unset). Setting these variables causes poe to set XLSMPOPTS specifically for each MPI task. With proper mapping, the performance of our example program increased greatly.

The performance of your applications may behave differently. It is recommended that you perform experiments to determine the best mappings.

Future work

More work needs to be done exploring running with more than 20 MPI tasks per node. In particular, we were not able to successfully run the combinations noted above (cases 122, 123, 126, 131, 133, and 134) that resulted in XLSMPOPTS being set to "none".

Work needs to be done documenting the mappings for non-IBM versions of MPI and OpenMP.

Notes and references

Assuming you have an application running on Power8 node ppc002, the following command, run from the head node, will show the mappings of tasks and threads to cores.

while true ; do  ssh ppc002 ' ps -mo pid,tid,%cpu,psr,comm' | grep -v " 0.0 " ; sleep 5; done 

This command will log on to the node every 5 seconds and get a list of running applications and the mappings of their threads to cores.

The output file names contain the ":" character. This is a special character in the macOS operating system. (Historically it was used as a directory separator and is still used for some system-level calls, in particular in AppleScript.) When interacting with the files in the Finder you will see the ":" replaced with a "/".

The program phostone.c, included in the tarball below, is a glorified hybrid MPI/OpenMP "Hello World". It is useful for seeing the mappings of MPI tasks and threads to processors and can be used to print environmental variables from within the program. See the source for command line options.

The OpenMP API specification for parallel programming
http://www.openmp.org
IBM Knowledge Center Reference: XLSMPOPTS (environment variables for parallel processing)
http://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/compiler_ref/env_var_xlsmpopts.html
IBM Knowledge Center Reference: Using MP_TASK_AFFINITY to control the placement of tasks
http://www.ibm.com/support/knowledgecenter/SSFK3V_2.2.0/com.ibm.cluster.pe.v2r2.pe100.doc/am102_tskaff_affinity.htm
Implementing an IBM High-Performance Computing Solution on IBM Power System S822LC
http://www.redbooks.ibm.com/abstracts/sg248280.html
List of links to xlf (Fortran) documentation
http://www-01.ibm.com/support/docview.wss?uid=swg27036672#Table1552
List of links to xlc (C) documentation
http://www-01.ibm.com/support/docview.wss?uid=swg27036675
File containing source, scripts, results, *html
stuff.tgz

Publication Information:

First published
January 10, 2017
Updated with additional results
January 12, 2017
Timothy H. Kaiser, Ph.D.
Director: Research and High Performance Computing
HPC Questions:  hpcinfo@mines.edu
                http://hpc.mines.edu
Direct Email:   tkaiser@mines.edu