This document is under development and will
change dramatically over the next few days.

For content suggestions or corrections email
tkaiser@mines.edu

Overview

Welcome to the User's Guide for ra.mines.edu, the Golden Energy Computing Organization computer system for the advancement of energy science.

For a "Quick Start" see: geco.mines.edu/quickstart.html

Training Materials

Materials from the summer 2008 workshop series are available at geco.mines.edu/workshop. These workshops cover a wide range of topics from introductory to advanced MPI, OpenMP, hybrid programming, and parallel IO. The materials include a number of examples, including compile and run scripts.

Hardware description

Ra is a Dell cluster containing approximately 2150 compute cores and is rated at 23 teraflops peak performance. A more complete hardware description can be found at: http://geco.mines.edu/hardware.html

Access to Ra is via the command:

ssh ra.mines.edu

Currently, you can only access Ra from other machines on campus or via VPN. If you have an account on another machine on campus, such as imagine.mines.edu, then you can access Ra by first logging on to that machine and then using the ssh command shown above.

To see the status of the queue on Ra, see ra.mines.edu/ganglia.

Ra is a collection of nodes, with each node containing 8 computing cores. The 8 compute cores on one node share the same memory. Memory is not shared across nodes.

Ra is designed primarily to run distributed memory applications. In a distributed memory application there is a collection of processes, or tasks, running on individual computing cores or processors. Each task in an application runs a separate copy of the same program and has its own memory. The various tasks of the application communicate via message passing. The normal method for tasks to pass messages is to use calls from the Message Passing Interface (MPI) library.

It is also possible to write programs on Ra that exploit the feature that the 8 compute cores on a node share memory. One method of writing such applications is to use threads. The OpenMP package is available on Ra to facilitate writing threaded applications. "OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer."

Basic environment

Ra runs a custom distribution of Linux known as Rocks. See the links given at the end of this page for additional information on Rocks.

Changing your login shell

The default login shell on Ra is /bin/bash. If you would like to change your shell you can use the command chsh. To see a list of the available shells type chsh --list-shells. To change your shell type chsh. You will be prompted for your password and the path to your new shell.

Compilers

Ra contains a very complete set of compilers, including the open source gcc and g95 compilers as well as C/C++ and Fortran compilers from Intel and Portland Group. All of these compilers are in the default execution path on Ra.

Open Source

Intel

Portland Group

We will be creating short guides for using the various compilers. For now, the documentation for the Portland Group and Intel compilers can be found at:

Debuggers

The Intel and Portland Group debuggers are also available on Ra. The Portland Group debugger is pgdbg and the Intel debugger is idb.

The documentation is also available on Ra in the "doc" directory one level above where the executables for the compilers reside.

The documentation is also available from:

MPI libraries

The primary versions of MPI on Ra are OpenMPI version 1.2.5 and MPICH version 1.2.7. (MPICH2 is also available but does not currently support communication over Infiniband, so it most likely should not be used.) Three copies of the libraries have been built using the gcc, Intel, and Portland Group compilers.

The default version of MPI on Ra is OpenMPI built with the Intel compilers. You can change your MPI compiler suite by setting your PATH, MANPATH, and LD_LIBRARY_PATH environment variables in your .cshrc, .tcshrc, or .bashrc files. The correct additions to these files can be found below.

Compiling MPI programs

The commands for compiling MPI programs are:

C programs:
mpicc
C++ programs:
mpiCC
Fortran 90 programs:
mpif90
Fortran 77 programs:
mpif77
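As a small illustration of the wrapper names listed above, the following sketch picks a wrapper from a source file's extension. The function name mpi_wrapper_for and the extension mapping are assumptions made for this example; they are not something provided on Ra.

```shell
# Hypothetical helper: print the MPI compiler wrapper for a source file,
# chosen by file extension. The extension mapping is an assumption.
mpi_wrapper_for() {
    case "$1" in
        *.c)                   echo mpicc  ;;
        *.C|*.cc|*.cpp|*.cxx)  echo mpiCC  ;;
        *.f90|*.F90)           echo mpif90 ;;
        *.f|*.F|*.f77)         echo mpif77 ;;
        *) echo "unknown source type: $1" >&2; return 1 ;;
    esac
}

# Show the compile line for the c_ex00.c example without running it.
echo "$(mpi_wrapper_for c_ex00.c) c_ex00.c -o c_ex00"
```

The echo only prints the command. To compile for real, run the printed line on Ra with the MPI version you have selected.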

Running Parallel MPI programs

Most jobs are run in batch mode. To run in batch mode you first create a run script. You submit your job to be run and it is put in a queue. When there are enough nodes available to run the job it starts. Your output is available after it completes.

Ra uses the Moab Workload Manager to schedule execution of parallel jobs. Moab, in turn, uses Torque for launching and managing jobs. Torque is a descendant of the open source version of PBS (Portable Batch System).

The batch scripts you create are called pbs scripts. They are ordinary Unix shell scripts with two parts. The top part consists of a collection of lines that start with the token #PBS. Unix sees these lines as comments, but they are significant to Torque. The bottom part contains the shell commands that are run.

Below we have a simple pbs script. This script will run the application c_ex00 on 8 computing cores. The first line indicates that we will run this script using the "C" shell. The "-l" flag used on lines 2 and 3 indicates that we want some particular resources. In this case we are asking for a single node with 8 computing cores and we want it for 2 hours. The "-N" flag gives our job a name. The "-o" and "-e" options indicate where the normal and error output from our job will be placed.

As it states in the comments, the "-V" option can be very important. This causes the current shell environment to be passed to your parallel application. This includes such things as your current execution path "$PATH" and your dynamic library path "$LD_LIBRARY_PATH". The LD_LIBRARY_PATH is used to find the libraries used by your application.

The last two lines are the actual shell commands that are run.

PBS script                                              Explanation
#!/bin/csh                                              We will run this job using the "C" shell.
#PBS -l nodes=1:ppn=8                                   We want 1 node with 8 processors, for a total of 8 processors.
#PBS -l walltime=02:00:00                               We will run for up to 2 hours.
#PBS -N testIO                                          The name of our job is testIO.
#PBS -o stdout                                          Standard output from our program will go to a file named stdout.
#PBS -e stderr                                          Error output from our program will go to a file named stderr.
#PBS -V                                                 Very important! Exports all environment variables from the submitting shell into the batch shell.
#-----------------------------------------------------  Not important. Just a separator line.
cd $PBS_O_WORKDIR                                       Very important! Go to the directory $PBS_O_WORKDIR, the directory from which the job was submitted (normally where our script resides).
mpirun -n 8 c_ex00                                      Run the MPI program c_ex00 on 8 computing cores.
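For convenience, the script lines from the table above can be saved to a file without the explanations. The sketch below does this with a here-document. The file name rpbs6 is only an example name, and c_ex00 is assumed to have been built in the directory you submit from.

```shell
# Write the example batch script to a file named rpbs6.
# The quoted 'EOF' keeps $PBS_O_WORKDIR from being expanded now;
# it is expanded later, when the job runs.
cat > rpbs6 <<'EOF'
#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=02:00:00
#PBS -N testIO
#PBS -o stdout
#PBS -e stderr
#PBS -V
#-----------------------------------------------------
cd $PBS_O_WORKDIR
mpirun -n 8 c_ex00
EOF

# Show what was written.
cat rpbs6
```

The resulting file can then be submitted as described under Submitting jobs to the queue.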

Some useful PBS variables

Variable        Meaning
PBS_JOBID       unique PBS job ID
PBS_JOBCOOKIE   job cookie
PBS_JOBNAME     user specified job name
PBS_MOMPORT     active port for mom daemon
PBS_NNODES      number of nodes requested
PBS_NODEFILE    file with list of allocated nodes
PBS_NODENUM     node offset number (see pbsdsh)
PBS_O_HOME      home dir of submitting user
PBS_O_HOST      host of currently running job
PBS_O_LANG      language variable for job
PBS_O_LOGNAME   name of submitting user
PBS_O_PATH      path to executables used in job script
PBS_O_SHELL     script shell
PBS_O_WORKDIR   job's submission directory
PBS_QUEUE       job queue
PBS_TASKNUM     number of tasks requested (see pbsdsh)
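Inside a run script these variables can be used like any other environment variable. The sketch below assumes, as is usual for Torque, that the file named by PBS_NODEFILE holds one line per allocated core. The function name summarize_nodefile is made up for this example.

```shell
# Hypothetical helper: report how many cores and distinct nodes are listed
# in a node file. Assumes the file has one line per allocated core.
summarize_nodefile() {
    cores=$(awk 'END { print NR }' "$1")
    nodes=$(sort -u "$1" | awk 'END { print NR }')
    echo "$cores cores on $nodes nodes"
}

# PBS_NODEFILE is only set inside a batch job, so guard the call.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Job ${PBS_JOBID:-unknown}: $(summarize_nodefile "$PBS_NODEFILE")"
else
    echo "Not running inside a batch job."
fi
```

Run outside a job, the sketch simply reports that no node file is available.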

Available queues

There are several queues available on Ra. You do not normally specify a particular queue in which to run your jobs. The queue is selected automatically from the amount of memory you request and the time limit, both of which are set in your run script. There are two memory sizes and four time buckets. The table below shows the generic queue names, the time limits, and an example line for your run script. The additions to your script to specify large memory nodes are given below.

Queue     Time Limit                Script Example
short     1 to 30 minutes           #PBS -l walltime=00:25:00
medium1   30 minutes to 8 hours     #PBS -l walltime=04:00:00
medium2   8 hours to 24 hours       #PBS -l walltime=12:00:00
long      24 hours to 400 hours     #PBS -l walltime=168:00:00
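The scheduler makes this choice itself, but the mapping in the table can be mirrored in a few lines of shell to see which time bucket a given walltime falls into. This is a sketch only. The function name queue_for_walltime is hypothetical, and it assumes each upper limit is inclusive, which the table does not state.

```shell
# Hypothetical helper: print the generic queue name that the table above
# associates with a walltime given as HH:MM:SS.
# Assumption: each upper limit is inclusive (exactly 30 minutes is "short").
queue_for_walltime() {
    printf '%s\n' "$1" | awk -F: '{
        seconds = $1 * 3600 + $2 * 60 + $3
        if (seconds > 400 * 3600) {
            print "walltime exceeds the 400 hour limit" > "/dev/stderr"
            exit 1
        }
        queue = "long"
        if (seconds <= 24 * 3600) queue = "medium2"
        if (seconds <= 8 * 3600)  queue = "medium1"
        if (seconds <= 30 * 60)   queue = "short"
        print queue
    }'
}

# Try the example walltimes from the table.
for walltime in 00:25:00 04:00:00 12:00:00 168:00:00; do
    echo "$walltime -> $(queue_for_walltime "$walltime")"
done
```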

Submitting jobs to the queue

To submit a job to the queue on Ra use the msub command. For example:

msub rpbs6

where rpbs6 is the name of your script.

msub and PBS/Torque commands

msub commands

Command       Description
canceljob     cancel job
checkjob      provide detailed status report for specified job
mdiag         provide diagnostic reports for resources, workload, and scheduling
mjobctl       control and modify job
mrsvctl       create, control and modify reservations
mshow         displays various diagnostic messages about the system and job queues
msub          submit a job (Don't use qsub)
releasehold   release job defers and holds
releaseres    release reservations
sethold       set job holds
showq         show queued jobs
showres       show existing reservations
showstart     show estimates of when job can/will start
showstate     show current state of resources

PBS/Torque commands

Command     Description
tracejob    trace job actions and states recorded in TORQUE logs
pbsnodes    view/modify batch status of compute nodes
qalter      modify queued batch jobs
qdel        delete/cancel batch jobs
qhold       hold batch jobs
qrls        release batch job holds
qrun        start a batch job
qsig        send a signal to a batch job
qstat       view queues and jobs
pbsdsh      launch tasks within a parallel job
qsub        submit jobs (Don't use except for interactive runs)

Large memory per node jobs

Ra has 184 nodes with 16 Gbytes of memory per node and 84 nodes with 32 Gbytes of memory per node. To run only on the nodes that have 32 Gbytes, add the option :fat to the line in your script that contains the number of nodes you are requesting. For example, if you want two nodes with 32 Gbytes, the line:

#PBS -l nodes=2:ppn=8

becomes

#PBS -l nodes=2:ppn=8:fat

Exclusive access to nodes

The default behavior of the queueing system on Ra is to fill unused processors. If you submit a job that uses fewer than 8 processors per node, then additional jobs might be scheduled on your nodes. To force exclusive access to your nodes add the option
-l naccesspolicy=singlejob
to your job submission line. For example:

msub rpbs6 -l naccesspolicy=singlejob

where rpbs6 is the name of your script. You can also add the line

#PBS -l naccesspolicy=singlejob

to your run script.

Multiple Instruction - Multiple Data Programs

Ra supports Multiple Instruction - Multiple Data (MIMD) or Multiple Program - Multiple Data (MPMD) programs when using the OpenMPI library. That is, each MPI task can be a different program. For example, one task can be a Fortran program and another a C program. This paradigm is not supported by the mvapich_0.9.9 system.

One simple method for doing MPMD under OpenMPI is described in the document mpmd.html

Where is my output?

Checking on jobs

The command qstat will show a list of all jobs in the queue. The status of a job is given under the "S" column as "Q", "R", or "C" for Queued, Running, or Completed. To see information about only your jobs, add the -u USERNAME option to the command. To see information about a particular job, add the job id number to the qstat command. You can get more detailed information about a job by using the checkjob command.

For a list of commands for controlling and monitoring jobs see msub and PBS/Torque commands.

Deleting jobs

The command qstat will show a list of all jobs in the queue. Use the qdel JOBNUMBER command to delete a job.

For a list of commands for controlling and monitoring jobs see msub and PBS/Torque commands.

Running OpenMP applications

"OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer."

The Portland Group and Intel compiler sets both support OpenMP for C and Fortran. OpenMP is off by default and must be enabled with a compile line option. The option is -openmp for the Intel compilers and -mp for the Portland Group compilers.

Assume we have the OpenMP Fortran and C programs omp_fft_join.f90 and invertc.c and we want to compile at optimization level 3. We could use the commands:

Intel Fortran:
ifort -O3 -openmp omp_fft_join.f90 -o omp_fft_join.it
Portland Group Fortran:
pgf90 -O3 -mp omp_fft_join.f90 -o omp_fft_join.pg
Intel C:
icc -O3 -openmp invertc.c -o invertc.it
Portland Group C:
pgcc -O3 -mp invertc.c -o invertc.pg

The following script runs these programs using different numbers of threads:

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -N testIO
#PBS -o out.pbs
#PBS -e err.pbs
#PBS -r n
#PBS -V 
#-----------------------------------------------------
cd $PBS_O_WORKDIR

#run the example omp_fft_join.f90 using 1-8 threads
#using both the Intel and Portland Group compilers

foreach NUM (1 2 3 4 5 6 7 8) 
    setenv OMP_NUM_THREADS $NUM
    echo "intel"
    ./omp_fft_join.it
    echo " "
    echo "pg"
    ./omp_fft_join.pg
    echo " "
    echo " "
end

#run the example invertc.c using 1, 2, and 4 threads
#using both the Intel and Portland Group compilers

foreach NUM (1 2 4) 
    setenv OMP_NUM_THREADS $NUM
    echo "OMP_NUM_THREADS=" $OMP_NUM_THREADS
    echo "intel"
    ./invertc.it
    echo " "
    echo "pg"
    ./invertc.pg
    echo " "
    echo " "
end
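The script above is written for the C shell. For a run script that starts with #!/bin/bash instead, the second loop might be written as in the sketch below. The check for the executable is an addition for this example, so that the loop can be tried in a directory where the program has not been built.

```shell
# Run invertc.it with 1, 2, and 4 threads (sh/bash syntax).
for num in 1 2 4; do
    # export plays the role of csh's setenv
    export OMP_NUM_THREADS=$num
    echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
    if [ -x ./invertc.it ]; then
        ./invertc.it
    else
        echo "invertc.it not found in $(pwd); skipping"
    fi
    echo " "
done
```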

Running Parallel Interactive jobs

Running a parallel interactive job is a two step process. You first run a qsub command to request interactive nodes. After some time you will be connected (logged in) to an interactive node. You then "cd" to the directory that contains your executable and run it with an mpirun command.

We have an example below. The commands we type follow the shell prompts (the text ending in "$"). We will run a simple "Hello World" MPI example. The source for the example can be obtained from geco.mines.edu/guideFiles/c_ex00.c

In our qsub command we request 1 node. This gives us 8 computational cores to use while running our parallel program. After we enter the qsub command we get back a ready message. Next, we enter an mpirun command, specifying the number of MPI tasks with the "-n" option. After our job finishes we are free to run additional jobs. Note that in the second case we specify 16 MPI tasks, even though we have only asked for 8 computational cores. This is legal, and it can be useful when you are just checking the correctness of an algorithm and don't care about performance.

After we are done with our runs we type exit to logout and release the nodes.

[tkaiser@ra ~/guide]$qsub -q INTERACTIVE -I -V -l nodes=1
qsub: waiting for job 1280.ra.mines.edu to start
qsub: job 1280.ra.mines.edu ready

[tkaiser@compute-9-8 ~]$cd guide
[tkaiser@compute-9-8 ~/guide]$mpirun -n 4 c_ex00
Hello from 0 of 4 on compute-9-8.local
Hello from 1 of 4 on compute-9-8.local
Hello from 2 of 4 on compute-9-8.local
Hello from 3 of 4 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$mpirun -n 16 c_ex00
Hello from 1 of 16 on compute-9-8.local
Hello from 0 of 16 on compute-9-8.local
Hello from 3 of 16 on compute-9-8.local
Hello from 4 of 16 on compute-9-8.local
Hello from 5 of 16 on compute-9-8.local
Hello from 8 of 16 on compute-9-8.local
Hello from 9 of 16 on compute-9-8.local
Hello from 10 of 16 on compute-9-8.local
Hello from 7 of 16 on compute-9-8.local
Hello from 11 of 16 on compute-9-8.local
Hello from 12 of 16 on compute-9-8.local
Hello from 13 of 16 on compute-9-8.local
Hello from 14 of 16 on compute-9-8.local
Hello from 15 of 16 on compute-9-8.local
Hello from 2 of 16 on compute-9-8.local
Hello from 6 of 16 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$exit
logout

qsub: job 1280.ra.mines.edu completed
[tkaiser@ra ~/guide]$ 

Changing the MPI compiler suite

There are several versions of MPI available on Ra. You can change the version you use by changing your $PATH and $LD_LIBRARY_PATH environment variables.

You can use the command mpi-selector to set your version of mpi. The command

mpi-selector --list

will list the available versions of MPI. The command

mpi-selector --set <name>

will set your default version of MPI, where <name> is one of the versions returned by the --list option.

You can also set your MPI version manually. The following two snippets allow you to easily change your settings. Use the first if your login shell is bash and the second if it is tcsh.

If you use bash for your shell...

### Add the following to the end of your ~/.bashrc file.
###
### Then uncomment the line indicating which compiler
### you would like to use to build MPI programs
### and uncomment the line indicating which
### version of MPI you would like.


#BASECOMP="pgi"
#BASECOMP="gcc"
#BASECOMP="intel"

#MYMPI="openmpi-1.2.5"
#MYMPI="mvapich-0.9.9"
#MYMPI="mpich2-1.0.6p1"
#MYMPI="none"

export MYMPI

    case $MYMPI in

        "mvapich-0.9.9")
        PATH=/usr/mpi/$BASECOMP/mvapich-0.9.9/bin:$PATH
        LD_LIBRARY_PATH=/usr/mpi/$BASECOMP/mvapich-0.9.9/lib:/usr/mpi/$BASECOMP/mvapich-0.9.9/lib/shared:$LD_LIBRARY_PATH
        MANPATH=/usr/mpi/$BASECOMP/mvapich-0.9.9/man:$MANPATH;;

        "mpich2-1.0.6p1")
        PATH=/usr/mpi/$BASECOMP/mpich2-1.0.6p1/bin:$PATH
        LD_LIBRARY_PATH=/usr/mpi/$BASECOMP/mpich2-1.0.6p1/lib:$LD_LIBRARY_PATH
        MANPATH=/usr/mpi/$BASECOMP/mpich2-1.0.6p1/man:$MANPATH;;

        "openmpi-1.2.5")
        PATH=/usr/mpi/$BASECOMP/openmpi-1.2.5/bin:$PATH
        LD_LIBRARY_PATH=/usr/mpi/$BASECOMP/openmpi-1.2.5/lib:$LD_LIBRARY_PATH
        MANPATH=/usr/mpi/$BASECOMP/openmpi-1.2.5/man:$MANPATH;;

    esac

export PATH
export LD_LIBRARY_PATH
export  MANPATH
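
After logging in again, one way to confirm that the change took effect is to check where mpicc is found. The sketch below builds the directory the snippet above would have added to PATH and compares it with the mpicc actually found. The function name expected_mpi_bin is hypothetical, and the intel and openmpi-1.2.5 arguments are only example choices.

```shell
# Hypothetical helper: print the directory where the snippet above expects
# the MPI compiler wrappers for a given compiler and MPI version.
expected_mpi_bin() {
    echo "/usr/mpi/$1/$2/bin"
}

# Compare against the mpicc actually found on PATH, if any.
expected=$(expected_mpi_bin intel openmpi-1.2.5)
found=$(command -v mpicc 2>/dev/null || true)
case "$found" in
    "$expected"/*) echo "mpicc comes from $expected" ;;
    "")            echo "mpicc is not on PATH" ;;
    *)             echo "mpicc found at $found, expected it under $expected" ;;
esac
```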

If you use tcsh for your shell...

### Add the following to the end of your ~/.tcshrc file.
###
### Then uncomment the line indicating which compiler
### you would like to use to build MPI programs
### and uncomment the line indicating which
### version of MPI you would like.

#setenv BASECOMP pgi
#setenv BASECOMP gcc
#setenv BASECOMP intel

#setenv MYMPI openmpi-1.2.5
#setenv MYMPI mvapich-0.9.9
#setenv MYMPI mpich2-1.0.6p1

switch ($MYMPI)
            case mvapich-0.9.9:
                setenv LD_LIBRARY_PATH /usr/mpi/$BASECOMP/mvapich-0.9.9/lib:/usr/mpi/$BASECOMP/mvapich-0.9.9/lib/shared:$LD_LIBRARY_PATH
                setenv MANPATH /usr/mpi/$BASECOMP/mvapich-0.9.9/man:$MANPATH
                set path = (   /usr/mpi/$BASECOMP/mvapich-0.9.9/bin  $path )
                breaksw

            case mpich2-1.0.6p1:
                setenv LD_LIBRARY_PATH /usr/mpi/$BASECOMP/mpich2-1.0.6p1/lib:$LD_LIBRARY_PATH
                setenv MANPATH /usr/mpi/$BASECOMP/mpich2-1.0.6p1/man:$MANPATH
                set path = (   /usr/mpi/$BASECOMP/mpich2-1.0.6p1/bin  $path )
                breaksw

            case openmpi-1.2.5:
                setenv LD_LIBRARY_PATH /usr/mpi/$BASECOMP/openmpi-1.2.5/lib:$LD_LIBRARY_PATH
                setenv MANPATH /usr/mpi/$BASECOMP/openmpi-1.2.5/man:$MANPATH
                set path = (   /usr/mpi/$BASECOMP/openmpi-1.2.5/bin  $path )
                breaksw

            default:
                #echo "mpi not set"
                breaksw
endsw

The table given below shows the round trip message speed for a 1 Mbyte message using the various versions of MPI given above along with a custom built version. "On Node" refers to two tasks running on the same node.

MPI Version                     On Node (Bytes/sec)   Off Node (Bytes/sec)
/usr/mpi/intel/openmpi-1.2.5    641105422             640782205
/usr/mpi/intel/mvapich-0.9.9    600017739             599983407
/usr/mpi/intel/mpich2-1.0.6p1   310298439             58692377
~/custom/mvapich2-1.0.2f        310277780             58686875
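
To put these figures in perspective, the ratio of two table entries can be worked out directly. The sketch below uses awk only as a calculator; the choice of which two rows to compare is ours.

```shell
# Off-node rates from the table above, in bytes/sec.
openmpi_off_node=640782205
mpich2_off_node=58692377

# Ratio of the OpenMPI rate to the MPICH2 rate, to one decimal place.
ratio=$(awk -v a="$openmpi_off_node" -v b="$mpich2_off_node" \
    'BEGIN { printf "%.1f", a / b }')
echo "Off node, OpenMPI is about $ratio times faster than MPICH2 for 1 Mbyte messages"
```

A gap of this size is consistent with the earlier note that MPICH2 does not currently communicate over Infiniband.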

Intel based MPI libraries

PGI based MPI libraries

Some interesting links

Ra runs the Rocks distribution of RedHat Linux. Information on Rocks can be found at www.rocksclusters.org.

Ra has its own web page at ra.mines.edu. The most interesting link on this page is Cluster Status. The Cluster Status page contains information about Ra's current state, including such things as load.

Known Issues

Jobs submitted to the FAT queues using qsub don't run.
Jobs submitted to the FAT queues using msub don't always get fat nodes.
Solution:
To run only on the nodes that have 32 Gbytes, use msub, not qsub, to submit your runs and add the option :fat to the line in your script that contains the number of nodes you are requesting. For example, if you want two nodes with 32 Gbytes, the line:

#PBS -l nodes=2:ppn=8

should be

#PBS -l nodes=2:ppn=8:fat

Jobs appear to run but don't produce any output, either stdout or stderr.
Solution:
There is no perfect solution known. This problem comes and goes. Please let us know if you see it. You might want to redirect your output so your mpirun command will look something like:

mpirun -n 8 /lustre/home/tkaiser/mpiTests/ppong >& myout.$PBS_JOBID
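
The >& redirection above is C shell syntax. In a run script written for sh or bash the equivalent form is "> file 2>&1". The sketch below demonstrates that form with a stand-in command, since mpirun and the ppong program are only available on the cluster. The function name fake_job and the fallback job ID are made up for the example.

```shell
# Stand-in for the real mpirun command: writes one line to each stream.
fake_job() {
    echo "normal output"
    echo "error output" >&2
}

# PBS_JOBID is set by the batch system; use a placeholder outside a job.
jobid="${PBS_JOBID:-no-job-id}"

# Send both stdout and stderr to the same file.
fake_job > "myout.$jobid" 2>&1

cat "myout.$jobid"
```

In a real run script, fake_job would be replaced by the mpirun command line.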