Welcome to the User's Guide for ra.mines.edu, the Golden Energy Computing Organization computer system for the advancement of energy science.
Materials from the summer 2008 workshop series are available at geco.mines.edu/workshop. These workshops cover a wide range of topics from introductory to advanced MPI, OpenMP, hybrid programming, and parallel IO. The materials include a number of examples, including compile and run scripts.
Ra is a Dell cluster containing approximately 2150 compute cores and is rated at 23 teraflops peak performance. A more complete hardware description can be found at: http://geco.mines.edu/hardware.html
Access to Ra is via the command:
ssh ra.mines.edu
Currently, you can only access Ra from other machines on campus or via VPN. If you have an account on another machine on campus, such as imagine.mines.edu, then you can access Ra by first logging on to that machine and then using the ssh command shown above.
The status of the queue on Ra is shown at ra.mines.edu/ganglia.
Ra is a cluster, that is, a collection of nodes, with each node containing 8 computing cores. The 8 compute cores on one node share the same memory. Memory is not shared across nodes.
Ra is designed primarily to run distributed memory applications. A distributed memory application is a collection of processes or tasks running on individual computing cores or processors. That is, each task in an application runs a separate copy of the same program and has its own memory. The various tasks of the application communicate via message passing. The normal method for tasks to pass messages is to use calls from the Message Passing Interface (MPI) library.
It is also possible to write programs on Ra that exploit the feature that the 8 compute cores on a node share memory. One method of writing such applications is to use threads. The OpenMP package is available on Ra to facilitate writing threaded applications. "OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer."
Ra runs a custom distribution of Linux known as Rocks. See the links given at the end of this page for additional information on Rocks.
The default login shell on Ra is /bin/bash. If you would like to change your shell, use the chsh command. To see a list of the available shells, type chsh --list-shells. To change your shell, type chsh; you will be prompted for your password and the path to your new shell.
Ra contains a very complete set of compilers, including the open source gcc and g95 compilers as well as C/C++ and Fortran compilers from Intel and Portland Group. All of these compilers are in the default execution path of Ra.
We will be creating short guides for using the various compilers. For now, the documentation for the Portland Group and Intel compilers can be found at:
The Intel and Portland Group debuggers are also available on Ra. The Portland Group debugger is pgdbg and the Intel debugger is idb.
The documentation is also available on Ra in the "doc" directory one level above where the executables for the compilers reside.
The documentation is also available from:
The primary versions of MPI on Ra are OpenMPI version 1.2.5 and MPICH version 1.2.7. (MPICH2 is also available but does not currently support communication over Infiniband, so it most likely should not be used.) Three copies of the libraries have been built, one each with the gcc, Intel, and Portland Group compilers.
The default version of MPI on Ra is OpenMPI built with the Intel compilers. You can change your MPI compiler suite by setting your PATH, MANPATH, and LD_LIBRARY_PATH environment variables in your .cshrc, .tcshrc, or .bashrc files. The correct additions to these files can be found below.
The commands for compiling MPI programs are:
Most jobs are run in batch mode. To run in batch mode you first create a run script. You submit your job to be run and it is put in a queue. When there are enough nodes available to run the job it starts. Your output is available after it completes.
Ra uses the Moab Workload Manager to schedule execution of parallel jobs. Moab, in turn, uses Torque for launching and managing jobs. Torque is a descendant of the open source version of PBS (Portable Batch System).
The batch scripts you create are called pbs scripts. They are actually Unix shell scripts, but they have two parts. The top part consists of a collection of lines that start with the token #PBS. Unix sees these lines as comments but they are significant to Torque. The bottom part consists of the ordinary shell commands that are run.
Below we have a simple pbs script. This script will run the application c_ex00 on 8 computing cores. The first line indicates that we will run this script using the "C" shell. The "-l" flag used on lines 2 and 3 indicates that we want some particular resources. In this case we are asking for a single node with 8 computing cores and we want it for 2 hours. The "-N" flag gives our job a name. The "-o" and "-e" options indicate where the normal and error output from our job will be placed.
As it states in the comments, the "-V" option can be very important. This causes the current shell environment to be passed to your parallel application. This includes such things as your current execution path "$PATH" and your dynamic library path "$LD_LIBRARY_PATH". The LD_LIBRARY_PATH is used to find the libraries used by your application.
The last two lines are the actual shell commands that are run.
| PBS script | Explanation |
|---|---|
| #!/bin/csh | We will run this job using the "C" shell |
| #PBS -l nodes=1:ppn=8 | We want 1 node with 8 processors for a total of 8 processors |
| #PBS -l walltime=02:00:00 | We will run for up to 2 hours |
| #PBS -N testIO | The name of our job is testIO |
| #PBS -o stdout | Standard output from our program will go to a file stdout |
| #PBS -e stderr | Error output from our program will go to a file stderr |
| #PBS -V | Very important! Exports all environment variables from the submitting shell into the batch shell. |
| #----------------------------------------------------- | Not important. Just a separator line. |
| cd $PBS_O_WORKDIR | Very important! Go to directory $PBS_O_WORKDIR, which is the directory from which the job was submitted |
| mpirun -n 8 c_ex00 | Run the MPI program c_ex00 on 8 computing cores. |
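The rows of the table above can be assembled into a run script by a small generator. This is a minimal sketch, not part of the original guide: the function name make_pbs_script and its argument order are hypothetical, and the output file names stdout and stderr are simply copied from the table.

```bash
# Hypothetical helper: print a PBS run script to standard output.
# Usage: make_pbs_script NODES PPN WALLTIME JOBNAME COMMAND...
make_pbs_script() {
    local nodes="$1" ppn="$2" walltime="$3" name="$4"
    shift 4
    # \$PBS_O_WORKDIR is escaped so that it is expanded when the job
    # runs, not when the script is generated.
    cat <<EOF
#!/bin/csh
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${walltime}
#PBS -N ${name}
#PBS -o stdout
#PBS -e stderr
#PBS -V
#-----------------------------------------------------
cd \$PBS_O_WORKDIR
$*
EOF
}
```

For example, make_pbs_script 1 8 02:00:00 testIO mpirun -n 8 c_ex00 > rpbs6 writes the script shown in the table to a file named rpbs6.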
| Variable | Meaning |
|---|---|
| PBS_JOBID | unique PBS job ID |
| PBS_JOBCOOKIE | job cookie |
| PBS_JOBNAME | user specified job name |
| PBS_MOMPORT | active port for mom daemon |
| PBS_NNODES | number of nodes requested |
| PBS_NODEFILE | file with list of allocated nodes |
| PBS_NODENUM | node offset number (see pbsdsh) |
| PBS_O_HOME | home dir of submitting user |
| PBS_O_HOST | host of currently running job |
| PBS_O_LANG | language variable for job |
| PBS_O_LOGNAME | name of submitting user |
| PBS_O_PATH | path to executables used in job script |
| PBS_O_SHELL | script shell |
| PBS_O_WORKDIR | job's submission directory |
| PBS_QUEUE | job queue |
| PBS_TASKNUM | number of tasks requested (see pbsdsh) |
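As an illustration of how these variables can be used inside a job, the sketch below summarizes the allocation listed in the file named by PBS_NODEFILE. The function name summarize_allocation is hypothetical. It assumes the node file holds one host name per allocated core, so a host appears once for each core granted on it.

```bash
# Summarize the allocation described by a Torque node file.
# Assumption: one host name per line, repeated once per allocated core.
# With no argument, the file named in $PBS_NODEFILE is used.
summarize_allocation() {
    local nodefile="${1:-$PBS_NODEFILE}"
    awk '{ cores++; if (!($1 in seen)) { seen[$1] = 1; nodes++ } }
         END { printf "nodes=%d cores=%d\n", nodes, cores }' "$nodefile"
}
```

Called from a run script after the cd $PBS_O_WORKDIR line, it prints a single line of the form nodes=N cores=M, which is a convenient record to keep in the job's standard output.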
There are several queues available on Ra. You do not normally specify a particular queue in which to run your jobs; the queue is selected automatically from the amount of memory and the time limit you request. Both limits are set in your run script. There are two memory sizes and four time buckets. The table below shows the generic queue names, the time limits, and an example line for your run script. The additions to your script to specify large memory nodes are given below.
| Queue | Time Limit | Script Example |
|---|---|---|
| short | 1 to 30 minutes | #PBS -l walltime=00:25:00 |
| medium1 | 30 minutes to 8 hours | #PBS -l walltime=04:00:00 |
| medium2 | 8 hours to 24 hours | #PBS -l walltime=12:00:00 |
| long | 24 hours to 400 hours | #PBS -l walltime=168:00:00 |
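The time buckets in the table can be expressed as a small lookup. The sketch below is only an illustration of the table, not the scheduler's actual logic: the function name queue_for_walltime is hypothetical, and it assumes each upper limit is inclusive, so a request of exactly 30 minutes counts as short.

```bash
# Map a walltime of the form HH:MM:SS to the generic queue name
# from the table. Upper limits are treated as inclusive (assumption).
queue_for_walltime() {
    printf '%s\n' "$1" | awk -F: '
        NF != 3 { print "invalid"; exit 1 }
        {
            total = $1 * 3600 + $2 * 60 + $3     # seconds requested
            if (total <= 0)             { print "invalid"; exit 1 }
            else if (total <= 1800)     print "short"      # up to 30 minutes
            else if (total <= 28800)    print "medium1"    # up to 8 hours
            else if (total <= 86400)    print "medium2"    # up to 24 hours
            else if (total <= 1440000)  print "long"       # up to 400 hours
            else                        { print "invalid"; exit 1 }
        }'
}
```

With the example lines from the table, queue_for_walltime 00:25:00 prints short and queue_for_walltime 168:00:00 prints long.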
To submit a job to the queue on Ra, use the msub command. For example:
msub rpbs6
Here rpbs6 is the name of your script.
| Command | Description |
|---|---|
| canceljob | cancel job |
| checkjob | provide detailed status report for specified job |
| mdiag | provide diagnostic reports for resources, workload, and scheduling |
| mjobctl | control and modify job |
| mrsvctl | create, control and modify reservations |
| mshow | displays various diagnostic messages about the system and job queues |
| msub | submit a job (Don't use qsub) |
| releasehold | release job defers and holds |
| releaseres | release reservations |
| sethold | set job holds |
| showq | show queued jobs |
| showres | show existing reservations |
| showstart | show estimates of when job can/will start |
| showstate | show current state of resources |
| Command | Description |
|---|---|
| tracejob | trace job actions and states recorded in TORQUE logs |
| pbsnodes | view/modify batch status of compute nodes |
| qalter | modify queued batch jobs |
| qdel | delete/cancel batch jobs |
| qhold | hold batch jobs |
| qrls | release batch job holds |
| qrun | start a batch job |
| qsig | send a signal to a batch job |
| qstat | view queues and jobs |
| pbsdsh | launch tasks within a parallel job |
| qsub | submit jobs (Don't use except for interactive runs) |
Ra has 184 nodes with 16 Gbytes of memory per node and 84 nodes with 32 Gbytes of memory per node. To run only on the nodes that have 32 Gbytes, add the option :fat to the line in your script that contains the number of nodes you are requesting. For example, if you want two nodes with 32 Gbytes, the line:
#PBS -l nodes=2:ppn=8
becomes
#PBS -l nodes=2:ppn=8:fat
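The same change can be made to an existing run script with a short filter. This is a sketch under the assumption that the node request sits on a single line beginning with #PBS -l nodes=; the function name request_fat_nodes is hypothetical.

```bash
# Read a run script on standard input and write it to standard output
# with :fat appended to the "#PBS -l nodes=" line. Lines that already
# request :fat, and all other lines, pass through unchanged.
request_fat_nodes() {
    awk '/^#PBS -l nodes=/ && !/:fat/ { $0 = $0 ":fat" } { print }'
}
```

For example, request_fat_nodes < rpbs6 > rpbs6.fat leaves the original script untouched and writes the large memory variant to a second file (the name rpbs6.fat is only an example).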
The default behavior of the queueing system on Ra is to fill unused processors. If you submit a job that uses fewer than 8 processors per node, then additional jobs might be scheduled on your nodes. To force exclusive access to your nodes, add the option
-l naccesspolicy=singlejob
to your job submission line. For example:
msub rpbs6 -l naccesspolicy=singlejob
where rpbs6 is the name of your script. You can also add the line
#PBS -l naccesspolicy=singlejob
to your run script.
Ra supports Multiple Instruction - Multiple Data (MIMD) or Multiple Program - Multiple Data (MPMD) programs when using the OpenMPI library. That is, each MPI task can be a different program. For example, one task can be a Fortran program and another a C program. This paradigm is not supported by the mvapich_0.9.9 system.
One simple method for doing MPMD under OpenMPI is described in the document mpmd.html.
The command qstat will show a list of all jobs in the queue. The status of a job is given under the "S" column as "Q", "R", or "C" for Queued, Running, or Completed. To see information about only your jobs, add the -u USERNAME option to the command. To see information about a particular job, add the job id number to the qstat command. You can get more detailed information about a job by using the checkjob command.
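The state letters can be tallied with a short filter. This is a sketch only: the function name count_job_states is hypothetical, and it assumes the default qstat listing, with two header lines followed by one job per line and the state letter in the fifth column. Listings produced with other options may use a different column layout, and job names containing spaces would shift the columns.

```bash
# Count jobs per state from a default qstat listing on standard input.
# Assumptions: two header lines, then one job per line with the state
# letter ("Q", "R", or "C") in the fifth whitespace-separated column.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ }
         END { printf "Q=%d R=%d C=%d\n", n["Q"], n["R"], n["C"] }'
}
```

Running qstat | count_job_states then prints one line giving the number of queued, running, and completed jobs.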
For a list of commands for controlling and monitoring jobs, see msub and PBS/Torque commands.
The command qstat will show a list of all jobs in the queue. Use the qdel JOBNUMBER command to delete a job.
For a list of commands for controlling and monitoring jobs, see msub and PBS/Torque commands.
"OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer."
The Portland Group and Intel compiler sets both support OpenMP for C and Fortran. OpenMP is off by default and must be enabled with a compile line option. The option is -openmp for the Intel compilers and -mp for the Portland Group compilers.
Assume we have the OpenMP Fortran and C programs omp_fft_join.f90 and invertc.c and we want to compile at optimization level 3. We could use the commands:
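The sketch below prints one plausible set of compile lines for these two programs. It is an assumption-laden reconstruction rather than the guide's own commands: the compiler driver names (ifort, icc, pgf90, pgcc) are assumed, the executable suffixes .it and .pg are taken from the run script that follows, and the function name openmp_compile_line is hypothetical. The function only prints each command; it does not run a compiler.

```bash
# Print (do not run) a compile line for an OpenMP source file.
# Usage: openmp_compile_line intel|pg SOURCE
# Driver names are assumptions; flags are those named in the text.
openmp_compile_line() {
    local vendor="$1" src="$2" compiler flag suffix base
    base="${src%.*}"                       # strip the file extension
    case "$vendor:$src" in
        intel:*.f90) compiler=ifort ;;
        intel:*.c)   compiler=icc ;;
        pg:*.f90)    compiler=pgf90 ;;
        pg:*.c)      compiler=pgcc ;;
        *) echo "unsupported: $vendor $src" >&2; return 1 ;;
    esac
    case "$vendor" in
        intel) flag=-openmp; suffix=it ;;
        pg)    flag=-mp;     suffix=pg ;;
    esac
    echo "$compiler -O3 $flag $src -o $base.$suffix"
}
```

For example, openmp_compile_line intel omp_fft_join.f90 prints ifort -O3 -openmp omp_fft_join.f90 -o omp_fft_join.it, and openmp_compile_line pg invertc.c prints pgcc -O3 -mp invertc.c -o invertc.pg.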
The following script runs these programs using different numbers of threads:
#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -N testIO
#PBS -o out.pbs
#PBS -e err.pbs
#PBS -r n
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
# Run the example omp_fft_join.f90 using 1-8 threads
# using both the Intel and Portland Group compilers
foreach NUM (1 2 3 4 5 6 7 8)
    setenv OMP_NUM_THREADS $NUM
    echo "intel"
    ./omp_fft_join.it
    echo " "
    echo "pg"
    ./omp_fft_join.pg
    echo " "
    echo " "
end
# Run the example invertc.c using 1, 2, and 4 threads
# using both the Intel and Portland Group compilers
foreach NUM (1 2 4)
    setenv OMP_NUM_THREADS $NUM
    echo "OMP_NUM_THREADS=" $OMP_NUM_THREADS
    echo "intel"
    ./invertc.it
    echo " "
    echo "pg"
    ./invertc.pg
    echo " "
    echo " "
end
Running a parallel interactive job is a two step process. You first run a qsub command to request interactive nodes. After some time you will be connected (logged in) to an interactive node. You then "cd" to the directory that contains your executable and run it with an mpirun command.
We have an example below. The commands we type are the text that follows the $ at the end of each prompt. We will run a simple "Hello World" MPI example. The source for the example can be obtained from geco.mines.edu/guideFiles/c_ex00.c
In our qsub command we request 1 node. This will give us 8 computational cores to use while running our parallel program. After we enter the qsub command we will get back a ready message. Next, we enter an mpirun command, specifying the number of MPI tasks using the "-n" option. After our job finishes we are free to run additional jobs. Note that in the second case we specified 16 MPI tasks, even though we have only asked for 8 computational cores. This is legal, and it might be useful in cases where you are just checking the correctness of an algorithm and don't care about performance.
After we are done with our runs we type exit to log out and release the nodes.
[tkaiser@ra ~/guide]$qsub -q INTERACTIVE -I -V -l nodes=1
qsub: waiting for job 1280.ra.mines.edu to start
qsub: job 1280.ra.mines.edu ready
[tkaiser@compute-9-8 ~]$cd guide
[tkaiser@compute-9-8 ~/guide]$mpirun -n 4 c_ex00
Hello from 0 of 4 on compute-9-8.local
Hello from 1 of 4 on compute-9-8.local
Hello from 2 of 4 on compute-9-8.local
Hello from 3 of 4 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$mpirun -n 16 c_ex00
Hello from 1 of 16 on compute-9-8.local
Hello from 0 of 16 on compute-9-8.local
Hello from 3 of 16 on compute-9-8.local
Hello from 4 of 16 on compute-9-8.local
Hello from 5 of 16 on compute-9-8.local
Hello from 8 of 16 on compute-9-8.local
Hello from 9 of 16 on compute-9-8.local
Hello from 10 of 16 on compute-9-8.local
Hello from 7 of 16 on compute-9-8.local
Hello from 11 of 16 on compute-9-8.local
Hello from 12 of 16 on compute-9-8.local
Hello from 13 of 16 on compute-9-8.local
Hello from 14 of 16 on compute-9-8.local
Hello from 15 of 16 on compute-9-8.local
Hello from 2 of 16 on compute-9-8.local
Hello from 6 of 16 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$exit
logout
qsub: job 1280.ra.mines.edu completed
[tkaiser@ra ~/guide]$
There are several versions of MPI available on Ra. You can change the version you use by changing your $PATH and $LD_LIBRARY_PATH environment variables.
You can use the command mpi-selector to set your version of MPI. The command
mpi-selector --list
will list the available versions of MPI. The command
mpi-selector --set <name>
will set your default version of MPI. <name> is one of the versions returned by the --list option.
You can also set your MPI version manually. The following two snippets allow you to easily change your settings. Use the first if your login shell is bash and the second if it is tcsh.
### Add the following to the end of your ~/.bashrc file.
###
### Then uncomment the line indicating which compiler
### you would like to use to build MPI programs
### and uncomment the line indicating which
### version of MPI you would like.
#BASECOMP="pgi"
#BASECOMP="gcc"
#BASECOMP="intel"
#MYMPI="openmpi-1.2.5"
#MYMPI="mvapich-0.9.9"
#MYMPI="mpich2-1.0.6p1"
#MYMPI="none"
export MYMPI
case $MYMPI in
"mvapich-0.9.9")
PATH=/usr/mpi/$BASECOMP/mvapich-0.9.9/bin:$PATH
LD_LIBRARY_PATH=/usr/mpi/$BASECOMP/mvapich-0.9.9/lib:/usr/mpi/$BASECOMP/mvapich-0.9.9/lib/shared:$LD_LIBRARY_PATH
MANPATH=/usr/mpi/$BASECOMP/mvapich-0.9.9/man:$MANPATH;;
"mpich2-1.0.6p1")
PATH=/usr/mpi/$BASECOMP/mpich2-1.0.6p1/bin:$PATH
LD_LIBRARY_PATH=/usr/mpi/$BASECOMP/mpich2-1.0.6p1/lib:$LD_LIBRARY_PATH
MANPATH=/usr/mpi/$BASECOMP/mpich2-1.0.6p1/man:$MANPATH;;
"openmpi-1.2.5")
PATH=/usr/mpi/$BASECOMP/openmpi-1.2.5/bin:$PATH
LD_LIBRARY_PATH=/usr/mpi/$BASECOMP/openmpi-1.2.5/lib:$LD_LIBRARY_PATH
MANPATH=/usr/mpi/$BASECOMP/openmpi-1.2.5/man:$MANPATH;;
esac
export PATH
export LD_LIBRARY_PATH
export MANPATH
### Add the following to end of your ~/.tcshrc file.
###
### Then uncomment the line indicating which compiler
### you would like to use to build MPI programs
### and uncomment the line indicating which
### version of MPI you would like.
#setenv BASECOMP pgi
#setenv BASECOMP gcc
#setenv BASECOMP intel
#setenv MYMPI openmpi-1.2.5
#setenv MYMPI mvapich-0.9.9
#setenv MYMPI mpich2-1.0.6p1
switch ($MYMPI)
case mvapich-0.9.9:
setenv LD_LIBRARY_PATH /usr/mpi/$BASECOMP/mvapich-0.9.9/lib:/usr/mpi/$BASECOMP/mvapich-0.9.9/lib/shared:$LD_LIBRARY_PATH
setenv MANPATH /usr/mpi/$BASECOMP/mvapich-0.9.9/man:$MANPATH
set path = ( /usr/mpi/$BASECOMP/mvapich-0.9.9/bin $path )
breaksw
case mpich2-1.0.6p1:
setenv LD_LIBRARY_PATH /usr/mpi/$BASECOMP/mpich2-1.0.6p1/lib:$LD_LIBRARY_PATH
setenv MANPATH /usr/mpi/$BASECOMP/mpich2-1.0.6p1/man:$MANPATH
set path = ( /usr/mpi/$BASECOMP/mpich2-1.0.6p1/bin $path )
breaksw
case openmpi-1.2.5:
setenv LD_LIBRARY_PATH /usr/mpi/$BASECOMP/openmpi-1.2.5/lib:$LD_LIBRARY_PATH
setenv MANPATH /usr/mpi/$BASECOMP/openmpi-1.2.5/man:$MANPATH
set path = ( /usr/mpi/$BASECOMP/openmpi-1.2.5/bin $path )
breaksw
default:
#echo "mpi not set"
breaksw
endsw
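After editing your startup file and logging in again, it is worth checking that the change took effect. The sketch below, for bash users, reports the first MPI bin directory found in PATH. It assumes the /usr/mpi/COMPILER/VERSION/bin layout used in the snippets above; the function name active_mpi_bin is hypothetical.

```bash
# Print the first directory in $PATH that looks like an MPI bin
# directory under /usr/mpi, or return 1 if there is none.
active_mpi_bin() {
    local IFS=: dir
    # With IFS set to ":", the unquoted $PATH splits into its entries.
    for dir in $PATH; do
        case "$dir" in
            /usr/mpi/*/bin) printf '%s\n' "$dir"; return 0 ;;
        esac
    done
    return 1
}
```

If you selected the Intel build of OpenMPI, active_mpi_bin should print /usr/mpi/intel/openmpi-1.2.5/bin. A different directory, or no output, suggests that the lines in your startup file were not uncommented or that the file was not re-read.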
The table given below shows the round trip message speed for a 1 Mbyte message using the various versions of MPI given above along with a custom built version. "On node" refers to two tasks running on the same node.
| MPI Version | On Node Bytes/sec | Off Node Bytes/sec |
|---|---|---|
| /usr/mpi/intel/openmpi-1.2.5 | 641105422 | 640782205 |
| /usr/mpi/intel/mvapich-0.9.9 | 600017739 | 599983407 |
| /usr/mpi/intel/mpich2-1.0.6p1 | 310298439 | 58692377 |
| ~/custom/mvapich2-1.0.2f | 310277780 | 58686875 |
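The raw figures are easier to compare after a little arithmetic. The sketch below converts a rate to megabytes per second and computes the ratio of two rates. The function names are hypothetical, and 1 megabyte is taken as 10^6 bytes, which is an assumption about units that the table does not state.

```bash
# Convert bytes/sec to megabytes/sec, taking 1 MB = 10^6 bytes (assumption).
to_mb_per_s() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1000000 }'
}

# Ratio of two rates, to one decimal place.
rate_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}
```

With the off node figures from the table, rate_ratio 640782205 58692377 prints 10.9. By this measure OpenMPI is roughly eleven times faster than MPICH2 between nodes, which is consistent with the earlier remark that MPICH2 does not communicate over Infiniband.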
Ra runs the Rocks distribution of RedHat Linux. Information on Rocks can be found at www.rocksclusters.org.
Ra has its own web page at ra.mines.edu. The most interesting link on this page is Cluster Status. The Cluster Status page contains information about Ra, including such things as load.