The HPC platform, Mio, has two IBM Power8 nodes, ppc001 and ppc002. Each node has 256 Gbytes of memory and 2 K80 GPU boards. (Each K80 board actually contains 2 K40 GPUs.) The nodes have 20 cores each and each core can support 8 threads. This note discusses how to map MPI processes and threads to the cores. In particular, we will look at mappings for hybrid MPI/OpenMP applications. The methods described here are also applicable to pure threaded applications. These results are specific to using IBM's version of MPI along with the XL compilers.
Why might you be interested in specifying a specific mapping of threads to cores? Performance! By default, the thread scheduler may "stuff" threads on cores, filling a core with 8 threads and leaving other cores empty. Also, if you are using the GPUs you might want to have specific cores partnering with individual GPUs.
In particular, we will show that we can control mapping of threads to cores using the environmental variable MP_TASK_AFFINITY and with proper mapping we can greatly increase performance.
We will first discuss how cores are numbered on the Power8 nodes. We can see the configuration of the nodes using the command:
[tkaiser@mio001 ibm]$ /opt/utility/slurmnodes ppc002
ppc002
   NodeHostName ppc002
   CoresPerSocket 10
   CPULoad 0.70
   State IDLE
   Version 15.08
   TmpDisk 949020
   SlurmdStartTime 2016-10-21T17:32:53
   BootTime 2016-10-21T16:06:57
   RealMemory 257476
   Boards 1
   Gres gpu:kepler:4
   CPUAlloc 0
   NodeAddr ppc002
   FreeMem 249830
   AllocMem 0
   Features (null)
   Weight 1
   CPUTot 160
   ThreadsPerCore 8
   CapWatts n/a
   Owner N/A
   CPUErr 0
   Sockets 2
[tkaiser@mio001 ibm]$
Some things to note: There are 10 CoresPerSocket and 2 Sockets for a total of 20 physical cores. There are 8 ThreadsPerCore allowed. The odd feature is that these are combined to give an effective, or logical, CPUTot of 160. That is, there are 160 "slots" for active processes or threads. We are interested in mapping to these 160 logical cores.
The physical cores on the nodes are numbered 0-19 and the logical cores are numbered from 0-159. Logical cores 0-7 map to physical core 0, 8-15 map to physical core 1 and so on. The physical core for a particular logical core is int(logical_core/8).
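This mapping can be written as a one-line helper (a small Python sketch for illustration; the function name is ours, not part of the test code):

```python
def physical_core(logical_core, threads_per_core=8):
    """Map a logical core number (0-159) to its physical core (0-19)."""
    return logical_core // threads_per_core

# Logical cores 0-7 all live on physical core 0; logical core 159 is on core 19.
```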
The first part of this document discusses running 20 or fewer MPI tasks/node. The second covers running 40, 80, and 160 MPI tasks/node.
We run a simple, predictable hybrid MPI/OpenMP program with various numbers of MPI tasks, setting environmental variables that affect the number of threads and the distribution of threads to logical and physical cores. We report on the distribution of MPI tasks and threads and relative performance.
Our test program simply does a number of matrix inversions. MPI task 0 reads three integers. The first is the size of the matrix. The second is the number of matrices that will be generated and inverted by each MPI task. The third is an iteration count, that is, how many times the sets of matrices will be inverted. Our inputs are:
[tkaiser@ppc001 ibm]$ cat input
1024
160
10
.true.
The value .true. specifies that there is a synchronization between MPI tasks after each set of inversions.
There is a single 3d array that holds all of the 2d matrices. We have a pointer, twod, that is used to reference sections of the 3d array as a 2d matrix. The data type for our matrices is "b8", which maps via a parameter statement to an 8 byte real.
The code for generating the data is:
allocate(tarf(msize,msize,nrays))
...
tarf=1.0_b8
!$OMP PARALLEL DO PRIVATE(twod,i3,j)
do i=1,nrays
   twod=>tarf(:,:,i)
   j=omp_get_thread_num()+1
   do i3=1,msize
      twod(:,i3)=j+10.0_b8
   enddo
enddo
Here msize is the matrix size and nrays is the number of matrices. As you can see, the matrix generation is threaded.
We use the LAPACK routine DGESV to do the matrix solves. The solves are also threaded:
!$OMP PARALLEL DO PRIVATE(twod)
do i=1,nrays
   twod=>tarf(:,:,i)
   call my_clock(cnt1(i))
   CALL DGESV( N, NRHS, twod, LDA, IPIVs(:,i), Bs(:,i), LDB, INFOs(i) )
   call my_clock(cnt2(i))
   write(17,'(i5,i5,3(f12.3))')i,infos(i),cnt2(i),cnt1(i),real(cnt2(i)-cnt1(i),b8)
enddo
There is an important, not obvious, point about the routine DGESV. It is not inherently threaded. That is, each call to DGESV will only use a single thread. However, calling DGESV inside a threaded region will cause each invocation to be given its own thread. Thus we can have several instances of DGESV running independently in parallel.
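The same pattern — many independent, single-threaded solves issued from a parallel region — can be sketched in Python with a thread pool. This is a hypothetical analog of the Fortran loop above, with a toy 2x2 Cramer's-rule solver standing in for DGESV:

```python
from concurrent.futures import ThreadPoolExecutor

def solve2x2(a, b):
    """Solve a 2x2 system a*x = b by Cramer's rule.
    Each call is single-threaded, like one invocation of DGESV."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - b[0] * a21) / det]

# Many independent systems, solved concurrently.  The parallelism comes from
# running separate solver instances at once, not from threading inside the solver.
systems = [([[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0]) for _ in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: solve2x2(*s), systems))
```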
A simple makefile for building the application is:
LPATH=/software/lib/lapack/3.5.0/xl
LIBS=-llapack -lblas

default: pointerMPI

pointerMPI: pointerMPInrc.f90
	mpfort -compiler xlf90_r -L$(LPATH) -O3 -qsmp=omp pointerMPInrc.f90 $(LIBS) -o pointerMPI
	rm -rf numz.mod
There are a number of environmental variables that are important to our study.
We can use the Fortran subroutine call GET_ENVIRONMENT_VARIABLE to get the value of these variables from within the program.
For more information about these variables see the first three references given below.
We also use the environmental variable SLURM_JOB_ID, which gets a new value for each job, to create a new set of output files for each test. Our output file names are a combination of SLURM_JOB_ID, the MPI task ID, and the setting of the variable MP_TASK_AFFINITY.
character(len=4096)::ont,MP_TASK_AFFINITY,XLSMPOPTS,SLURM_JOB_ID
...
...
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
...
...
call GET_ENVIRONMENT_VARIABLE("OMP_NUM_THREADS", &
                              value=ont,status=is0)
call GET_ENVIRONMENT_VARIABLE("MP_TASK_AFFINITY", &
                              value=MP_TASK_AFFINITY,status=is1)
if(is1 .ne. 0)write(MP_TASK_AFFINITY,"('none')")
call GET_ENVIRONMENT_VARIABLE("XLSMPOPTS", &
                              value=XLSMPOPTS,status=is2)
if(is2 .ne. 0)write(XLSMPOPTS,"('none')")
call GET_ENVIRONMENT_VARIABLE("SLURM_JOB_ID", &
                              value=SLURM_JOB_ID,status=is3)
if(is3 .ne. 0)write(SLURM_JOB_ID,"('xxxxx')")
write(pname,"(i3.3,'_',i3.3,'_',a,'_',a)")myid,&
      omp_get_max_threads(),&
      trim(SLURM_JOB_ID), &
      trim(MP_TASK_AFFINITY)
open(unit=17,file=trim(pname),status="unknown",POSITION='APPEND')
write(17,*)"For ",myid," OMP_NUM_THREADS=",trim(ont)
write(17,*)"For ",myid," XLSMPOPTS=",trim(XLSMPOPTS)
The file "007_016_1874756_core:2" was created by MPI task 7 running 16 threads as part of job 1874756 with MP_TASK_AFFINITY set to core:2.
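The Fortran format string above, (i3.3,'_',i3.3,'_',a,'_',a), builds that name roughly as follows (a Python rendering for illustration; the function name is ours):

```python
def output_name(task_id, max_threads, job_id, affinity):
    """Mimic the Fortran format (i3.3,'_',i3.3,'_',a,'_',a) used for
    the per-task output file names: zero-padded task id and thread
    count, then the job id and MP_TASK_AFFINITY setting."""
    return "%03d_%03d_%s_%s" % (task_id, max_threads, job_id, affinity)

# Task 7, 16 threads, job 1874756, MP_TASK_AFFINITY=core:2:
name = output_name(7, 16, "1874756", "core:2")   # "007_016_1874756_core:2"
```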
Our example program, pointerMPInrc.f90, was run using the slurm script shown here.
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --partition=ppc
#SBATCH --overcommit
#SBATCH --exclusive
#SBATCH --nodes=1
##SBATCH --ntasks=5
#SBATCH --export=ALL
#SBATCH --out=%J.out

# Go to the directory from which our job was launched
cd $SLURM_SUBMIT_DIR

source /etc/profile
module purge
unset OMP_PROC_BIND

export MYID=$SLURM_JOBID
cat $0 > $MYID.script

export MP_RESD=poe
export MP_HOSTFILE=$MYID.list
printenv > $MYID.env
export MP_LABELIO=yes
./expands > $MP_HOSTFILE

export INPUT=input
cat $INPUT >> $MYID.out
export XLFRTEOPTS="buffering=disable_all"
export EXE=./pointerMPI
ulimit -c 0

for T in 1 2 4 8 16 20 40 80 160 ; do
    export OMP_NUM_THREADS=$T
    for S in core:1 core:2 core:4 core:8 core:10 core:20 ; do
        export MP_TASK_AFFINITY=$S
        echo $MP_TASK_AFFINITY >> $MYID.out
        poe $EXE -procs $SLURM_NTASKS < $INPUT >> $MYID.out
    done
done

unset MP_RESD
unset MP_HOSTFILE
The program is run a number of times in a nested "for" loop with various values for OMP_NUM_THREADS and MP_TASK_AFFINITY. Note that OMP_PROC_BIND is unset. This needs to be done for MP_TASK_AFFINITY to have the effects discussed below. Also, the script was run with various numbers of MPI tasks using the command:
for ntasks in 1 2 4 5 8 10 20 ; do sbatch -n $ntasks doit; done
After the jobs completed the output files created by running with a particular number of MPI tasks were combined in a single directory. The post processing of the output files will be described in detail below. In short, we are interested in the number of matrix inversions per second as a function of the settings of the environmental variables.
The environmental variable MP_TASK_AFFINITY is set to core:N, where N has a value from 1 to 20. This variable determines the number of physical cores that each individual MPI task is allocated. If each MPI task launches fewer than N threads, some physical cores will be left open. If the number of threads is greater than N, physical cores will run more than one thread. This can be seen in cases 62-69 where we have 2 MPI tasks and MP_TASK_AFFINITY=core:10. With a single thread, the first MPI task gets physical core 0 and the second task gets physical core 10. As we add threads successive cores become occupied. With 16 threads per task (case 66) all cores are busy and 12 cores are oversubscribed, 6 for each task. With 20, 40, and 80 threads we have all cores in use with an equal load.
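The per-core load for one task is simple arithmetic: the task's threads are spread round-robin over its N allocated cores. A sketch (the function name is ours; the behavior is inferred from the cases described above):

```python
def core_loads(nthreads, n_cores):
    """Threads per physical core when one task's nthreads threads are
    spread round-robin over its n_cores allocated cores."""
    return [nthreads // n_cores + (1 if c < nthreads % n_cores else 0)
            for c in range(n_cores)]

# One task with core:10 and 16 threads: 6 cores carry 2 threads, 4 carry 1,
# so each of the 2 tasks in case 66 oversubscribes 6 of its 10 cores.
loads = core_loads(16, 10)
```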
You may note that our script does not set the variable XLSMPOPTS even though it is "collected" and printed in our program. XLSMPOPTS is actually an important variable because it determines the actual mapping of MPI tasks and OpenMP threads to cores. It could be set in the script but that would not be particularly useful because each MPI task would have the same value and thus have the same mapping.
Note that in our script the command poe is used to launch our program, taking the place of the more common srun or mpirun commands. The IBM poe command generates a different setting for XLSMPOPTS for each MPI task. The value of XLSMPOPTS is a function of the MPI task id, OMP_NUM_THREADS, and MP_TASK_AFFINITY. Going back to the example above — the file 007_016_1874756_core:2, created by MPI task 7 running 16 threads as part of job 1874756 with MP_TASK_AFFINITY set to core:2 — and looking at the output files from the other MPI tasks of this run we get:
[tkaiser@mio001 08]$ grep XLSMPOPTS *_016_*
000_016_1874756_core:2: For 0 XLSMPOPTS=parthds=16:procs=0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15
001_016_1874756_core:2: For 1 XLSMPOPTS=parthds=16:procs=16,24,17,25,18,26,19,27,20,28,21,29,22,30,23,31
002_016_1874756_core:2: For 2 XLSMPOPTS=parthds=16:procs=32,40,33,41,34,42,35,43,36,44,37,45,38,46,39,47
003_016_1874756_core:2: For 3 XLSMPOPTS=parthds=16:procs=48,56,49,57,50,58,51,59,52,60,53,61,54,62,55,63
004_016_1874756_core:2: For 4 XLSMPOPTS=parthds=16:procs=64,72,65,73,66,74,67,75,68,76,69,77,70,78,71,79
005_016_1874756_core:2: For 5 XLSMPOPTS=parthds=16:procs=80,88,81,89,82,90,83,91,84,92,85,93,86,94,87,95
006_016_1874756_core:2: For 6 XLSMPOPTS=parthds=16:procs=96,104,97,105,98,106,99,107,100,108,101,109,102,110,103,111
007_016_1874756_core:2: For 7 XLSMPOPTS=parthds=16:procs=112,120,113,121,114,122,115,123,116,124,117,125,118,126,119,127
[tkaiser@mio001 08]$
This shows we have 8 MPI tasks, each with 16 threads. For task 0 the threads get mapped to logical processors 0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15, or physical cores 0 and 1. Task 1 has threads on physical cores 2 and 3, and so on; task 7 has threads on physical cores 14 and 15.
As stated at the beginning of this note we are interested in performance. Here we use a relative measure of performance because absolute performance would only be meaningful if we used a highly tuned matrix inversion algorithm. We only want to see the changes in performance from different task and thread mappings. Our measure of performance is the number of matrix inversions performed in one second.
Our example program produces an output file for each MPI task as shown above. Our post processing program postit.py takes the set of files produced by each run and pulls out the "XLSMPOPTS" line and the lines giving the run times for the collections of matrix inversions. It uses the XLSMPOPTS lines from all of the files to determine the number of threads running on each core. Statistics are gathered for the run times for the inversions. We calculate the minimum, maximum, average, and variance of the number of matrix inversions per second over all iterations and MPI tasks. This information is printed along with the number of MPI tasks, the number of threads, and the setting for MP_TASK_AFFINITY.
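A minimal version of the statistics step might look like this. This is a sketch, not the actual postit.py; it assumes timing lines formatted as in the Fortran write statement shown earlier, with the last field being the elapsed seconds for one inversion:

```python
import statistics

def inversion_stats(lines):
    """Pull per-inversion elapsed times (last field of each timing line)
    and report min/max/mean/variance of the inversions-per-second rate."""
    rates = []
    for line in lines:
        fields = line.split()
        if len(fields) == 5:            # i, info, cnt2, cnt1, elapsed
            elapsed = float(fields[-1])
            if elapsed > 0.0:
                rates.append(1.0 / elapsed)
    return (min(rates), max(rates),
            statistics.mean(rates), statistics.variance(rates))

sample = ["    1    0     12.500      12.000       0.500",
          "    2    0     13.100      12.850       0.250"]
lo, hi, avg, var = inversion_stats(sample)   # rates are 2.0 and 4.0 /second
```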
We post process our data files by first creating directories 01, 02, 04, 05, 08, 10, and 20. The numbers correspond to the number of MPI tasks used for the runs. We then move the data files from each run to their respective directory and run the script postit.py over each directory.
The results are shown in Figure 1.
Figure 1. MPI tasks and OpenMP threads are mapped to physical cores.
We first note that not all test cases implied by the nested for loops in our run script are represented in our output. The Power8 nodes are currently configured to allow at most 20 MPI tasks per node and at most 8 threads per core. If the combination of environmental variables suggests a case outside these limits, the run fails with an error message stating that the limits are violated.
There is a direct mapping between OMP_NUM_THREADS and the "parthds" portion of the XLSMPOPTS variable. The core:N setting then maps the threads from a task to N physical cores. For example, consider the cases where we had 4 MPI tasks, 8 threads, and MP_TASK_AFFINITY in the set (core:1, core:2, core:4). A sorted grep "parthds=8" on the 4 MPI task runs produces the following.
04/000_008_1874754_core:1: For 0 XLSMPOPTS=parthds=8:procs=0,1,2,3,4,5,6,7
04/001_008_1874754_core:1: For 1 XLSMPOPTS=parthds=8:procs=8,9,10,11,12,13,14,15
04/002_008_1874754_core:1: For 2 XLSMPOPTS=parthds=8:procs=16,17,18,19,20,21,22,23
04/003_008_1874754_core:1: For 3 XLSMPOPTS=parthds=8:procs=24,25,26,27,28,29,30,31
04/000_008_1874754_core:2: For 0 XLSMPOPTS=parthds=8:procs=0,8,1,9,2,10,3,11
04/001_008_1874754_core:2: For 1 XLSMPOPTS=parthds=8:procs=16,24,17,25,18,26,19,27
04/002_008_1874754_core:2: For 2 XLSMPOPTS=parthds=8:procs=32,40,33,41,34,42,35,43
04/003_008_1874754_core:2: For 3 XLSMPOPTS=parthds=8:procs=48,56,49,57,50,58,51,59
04/000_008_1874754_core:4: For 0 XLSMPOPTS=parthds=8:procs=0,8,16,24,1,9,17,25
04/001_008_1874754_core:4: For 1 XLSMPOPTS=parthds=8:procs=32,40,48,56,33,41,49,57
04/002_008_1874754_core:4: For 2 XLSMPOPTS=parthds=8:procs=64,72,80,88,65,73,81,89
04/003_008_1874754_core:4: For 3 XLSMPOPTS=parthds=8:procs=96,104,112,120,97,105,113,121
We see that the procs string (logical cores) has 8 entries. The core:N setting maps each task to N physical cores, and each physical core carries 8/N threads. That is, the logical cores to physical cores mapping is:
04/000_008_1874754_core:1: For 0 XLSMPOPTS=parthds=8:procs=0,1,2,3,4,5,6,7            Physical cores: 0*8
04/001_008_1874754_core:1: For 1 XLSMPOPTS=parthds=8:procs=8,9,10,11,12,13,14,15      Physical cores: 1*8
04/002_008_1874754_core:1: For 2 XLSMPOPTS=parthds=8:procs=16,17,18,19,20,21,22,23    Physical cores: 2*8
04/003_008_1874754_core:1: For 3 XLSMPOPTS=parthds=8:procs=24,25,26,27,28,29,30,31    Physical cores: 3*8
04/000_008_1874754_core:2: For 0 XLSMPOPTS=parthds=8:procs=0,8,1,9,2,10,3,11          Physical cores: 0*4,1*4
04/001_008_1874754_core:2: For 1 XLSMPOPTS=parthds=8:procs=16,24,17,25,18,26,19,27    Physical cores: 2*4,3*4
04/002_008_1874754_core:2: For 2 XLSMPOPTS=parthds=8:procs=32,40,33,41,34,42,35,43    Physical cores: 4*4,5*4
04/003_008_1874754_core:2: For 3 XLSMPOPTS=parthds=8:procs=48,56,49,57,50,58,51,59    Physical cores: 6*4,7*4
04/000_008_1874754_core:4: For 0 XLSMPOPTS=parthds=8:procs=0,8,16,24,1,9,17,25        Physical cores: 0*2,1*2,2*2,3*2
04/001_008_1874754_core:4: For 1 XLSMPOPTS=parthds=8:procs=32,40,48,56,33,41,49,57    Physical cores: 4*2,5*2,6*2,7*2
04/002_008_1874754_core:4: For 2 XLSMPOPTS=parthds=8:procs=64,72,80,88,65,73,81,89    Physical cores: 8*2,9*2,10*2,11*2
04/003_008_1874754_core:4: For 3 XLSMPOPTS=parthds=8:procs=96,104,112,120,97,105,113,121  Physical cores: 12*2,13*2,14*2,15*2
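The pattern in these listings suggests a simple rule: with MP_TASK_AFFINITY=core:N, task t owns physical cores t*N through t*N+N-1, and its thread k lands on logical core 8*(t*N + k mod N) + k/N. The following sketch reproduces the listings above (the rule is inferred from this output, not taken from IBM documentation):

```python
def procs_string(task, nthreads, n):
    """Reconstruct the poe-generated procs list for MP_TASK_AFFINITY=core:n.

    Thread k of task `task` is placed round-robin across the task's n
    physical cores, filling successive hardware-thread slots (8 per core).
    """
    return [8 * (task * n + k % n) + k // n for k in range(nthreads)]

# Matches the 4-task, 8-thread listings above, e.g. task 1 with core:4:
procs = procs_string(1, 8, 4)   # [32, 40, 48, 56, 33, 41, 49, 57]
```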
Next we note that we get the best performance (matrix inversion rate) for this program when we are using all of the cores. This can be seen by comparing cases 36-39, 66-69, 96-99, 113-116, and 117-120 to the other runs. We go from a minimum performance rate of 7.07 inversions/second (case 1), to 140 inversions/second when each core has a load of 1, to over 180 inversions/second when each core has a load of 4. The performance drops a bit when the load on each core increases to 8.
We next look at the mappings in more detail for the cases where we have 4 threads per core and all 20 cores are in use, cases 38, 68, 98, 115, and 119. The environmental variable settings for these cases are:
Figure 2. Each MPI task gets a different "procs" string, mapping it to logical and physical cores.
For this particular program the performance is similar in all of these cases, but the mappings that produce 4 threads per core are different. For example, in case 38 we have a single MPI task with 80 threads. The threads are distributed over all of the cores. The first 20 threads are distributed evenly over the 20 physical cores and after that they are distributed in a round robin manner. With multiple MPI tasks, say M tasks, the physical cores are first divided into M groups. Then the threads for an individual MPI task are spread over its group.
With more than 20 MPI tasks per node we need to use MP_TASK_AFFINITY=cpu:n, where n is an integer. We ran cases with 40 MPI tasks (1, 2, and 4 threads), 80 tasks (1 and 2 threads), and 160 tasks (1 thread). The results are reported below.
Figure 3. MPI tasks and OpenMP threads are mapped to physical cores with more than 20 MPI tasks/node.
For cases 122, 123, 126, 131, 133, and 134, the combination of the number of threads, the number of MPI tasks, and the given setting for MP_TASK_AFFINITY resulted in XLSMPOPTS being set to "none", with poor performance. This deserves further investigation.
Another method to spread tasks and threads to cores, useful when you have 40 or 80 MPI tasks, is to set the environmental variable MP_BIND_MODE=spread and to set MP_TASK_AFFINITY=cpu. Note there is no number after cpu in this case. The results are shown in Figure 4. For these cases, the value of XLSMPOPTS contains "-1" entries in the procs string. For example, for case 138, for the various MPI tasks we get:
For 0 XLSMPOPTS=parthds=4:procs=0,-1,-1,-1
For 1 XLSMPOPTS=parthds=4:procs=4,-1,-1,-1
For 2 XLSMPOPTS=parthds=4:procs=8,-1,-1,-1
...
...
For 38 XLSMPOPTS=parthds=4:procs=152,-1,-1,-1
For 39 XLSMPOPTS=parthds=4:procs=156,-1,-1,-1
By using the ssh and ps commands discussed in the notes below, we can see that the N threads for each MPI task get mapped to the same logical core instead of to different logical cores on the same physical core.
Figure 4. MPI tasks and OpenMP threads are mapped to physical cores with more than 20 MPI tasks/node with MP_BIND_MODE=spread and MP_TASK_AFFINITY=cpu.
We have shown that it is possible to map MPI tasks and OpenMP threads to various logical, and thus physical, cores on Power8 nodes. This is done using the environmental variables OMP_NUM_THREADS, MP_TASK_AFFINITY, and OMP_PROC_BIND. Setting these variables sets XLSMPOPTS specifically for each MPI task. With proper mapping, the performance of our example program increased.
The performance of your applications may behave differently. We recommend performing experiments to determine the best mappings.
More work needs to be done exploring running with more than 20 MPI tasks per node. In particular we were not able to successfully run
Work needs to be done documenting the mappings for non-IBM versions of MPI and OpenMP.
Assuming you have an application running on Power8 node ppc002 the following command, run from the head node, will show the mappings of tasks and threads to cores.
while true ; do ssh ppc002 ' ps -mo pid,tid,%cpu,psr,comm' | grep -v " 0.0 " ; sleep 5; done
This command will log on to the node every 5 seconds and get a list of running applications and the mappings of their threads to cores.
The output file names contain the ":" character. This is a special character in the macOS operating system. (Historically it was used as a directory separator and is still used for some system level calls, in particular in AppleScript.) Interacting with the files in the Finder, you will see the ":" replaced with a "/".
The program phostone.c, included in the tar ball below is a glorified hybrid MPI/OpenMP "Hello World". It is useful for seeing the mappings of MPI tasks and threads to processors and can be used to print environmental variables from within the program. See the source for command line options.
Timothy H. Kaiser, Ph.D.
Director: Research and High Performance Computing
HPC Questions: email@example.com
http://hpc.mines.edu
Direct Email: firstname.lastname@example.org