RA User's Guide

This guide has information about running on the CSM
Golden Energy Computing Organization
HPC resource RA.MINES.EDU.

RA user Agreement
Usage Policy
File System Overview and Usage
Backups
Access to RA
Quick Start
Testing your environment
Building MPI programs
Running MPI Programs
Building OpenMP (Threaded) Programs
Running OpenMP (Threaded)
Mapping of parallel tasks to nodes
Some Advanced Scripts
Requesting specific types of nodes
Local Disk Space
Queue Times
Queue Related Commands
Compilers Documentation
Command Line Options for Debugging
Source Level Debugging
Runtime Error Messages
Exclusive access to nodes
Running Parallel Interactive jobs
Tutorials Link
Changing your login shell
Common Problems and Questions
Changing your MPI Version

RA user Agreement

RA users agree to the following:

  1. I will not give my password to anyone.
  2. I will not store it on any machine in plain text.
  3. I will not use a password that I use on another machine.
  4. I will not reuse the password or ssh keys that I had before September 10, 2009.
  5. I will not use a blank passphrase for any ssh key that I create to go to/from RA.
  6. I will not allow anyone else to use my account.
  7. I will not login to RA from someone else's account.
  8. I understand that I am responsible for my own data backups.

Copies of this agreement can be found at
http://geco.mines.edu/guide/agreement.shtml
and
http://geco.mines.edu/guide/agreement.pdf

Usage Policy

Ra is a collection of nodes, with each node containing 8 computing cores. The 8 compute cores on one node share the same memory. Memory is not shared across nodes.

Ra is designed primarily to run distributed memory applications. In a distributed memory application there is a collection of processes or tasks running on individual computing cores or processors. That is, each task in an application runs a separate copy of the same program and has its own memory. The various tasks of the application communicate via message passing. The normal method for tasks to pass messages is to use calls from the Message Passing Interface (MPI) library. The link http://www.open-mpi.org/doc/v1.4/ has a nice list with documentation for the calls of the MPI library.

It is also possible to write programs on Ra that exploit the feature that the 8 compute cores on a node share memory. One method of writing such applications is to use threads. The OpenMP package is available on Ra to facilitate writing threaded applications. "OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer."

While it is possible to enforce restrictions on users' memory use and process counts, such restrictions can cause undesired effects.

We have been relying on users to observe the following rules:

File System Overview and Usage

There are two parallel file systems on RA, /lustre/home and /lustre/scratch. As the name implies, /lustre/home contains users' home directories. Every user also has a directory in /lustre/scratch. Both file systems are available across all nodes.

/lustre/home
    Small application builds, scripts, and small data sets
    Community codes in /lustre/home/apps
/lustre/scratch
    The primary location for running applications and storing larger data sets

/lustre/scratch is about 10 times larger than /lustre/home, so it should be used for running applications. /lustre/scratch is also potentially faster.

Backups

/lustre/scratch is not backed up. It is too large for backups to be practical. Users are responsible for backing up their own data.
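
Since /lustre/scratch is not backed up, one way to keep a copy of results is to bundle a run directory into a compressed archive that can then be copied elsewhere. The following is only a sketch: the directory and archive names are hypothetical, and a small stand-in directory is created in a temporary location so the commands can be tried anywhere.

```shell
#!/bin/bash
# Sketch: archive a results directory so it can be copied off /lustre/scratch.
# All names below are hypothetical; a stand-in directory is created so that
# the example can be run anywhere.

# Stand-in for a run directory under your /lustre/scratch area (assumed layout)
demo_root=$(mktemp -d)
mkdir -p "$demo_root/myrun"
echo "sample output" > "$demo_root/myrun/results.dat"

# Create a compressed archive of the run directory
archive="$demo_root/myrun_backup.tar.gz"
tar -czf "$archive" -C "$demo_root" myrun

# List the archive contents to confirm what was saved
tar -tzf "$archive"
```

The resulting archive could then be copied to another machine, for example with scp.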

Access to RA

Access to Ra is via the command:

ssh ra.mines.edu

The only way to access Ra is by using ssh. Unix and Unix-like operating systems (OSX, Linux, Unicos...) have ssh built in. If you are using a Windows based machine to access RA then you must use a terminal package that supports ssh, such as putty, available from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

We have a description of how to connect to RA using ssh key-based access from both Unix and Windows based machines at: http://geco.mines.edu/ssh. Ssh key-based access will enable you to log into RA all day while only typing a passphrase one time.

Quick Start
Testing your environment

Testing your environment on RA is just a matter of building and running a parallel program. The method is the same on RA and Mio. For a Copy & Paste guide to building and running parallel MPI applications see: http://hpc.mines.edu/quick.html. Unlike RA, Mio does not require account numbers as part of your script.

If at any time you feel your environment is not working properly, the first test to try is to run through the quick start guide given above.

Building MPI programs

The commands for compiling MPI programs are:

C programs:           mpicc
C++ programs:         mpiCC
Fortran 90 programs:  mpif90
Fortran 77 programs:  mpif77

The MPI compilers are actually scripts that call "normal" C, C++, or Fortran compilers, adding in the MPI include files and libraries. Thus, any compile options that you would normally pass to the regular compilers can be added to the MPI compile lines. For example, -O3 provides a good level of optimization.
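
As an illustration of passing ordinary compiler options through the MPI wrappers, the sketch below writes a very small MPI program and compiles it with mpicc -O3. The file name hello_mpi.c is only an example, and the compile step is skipped if mpicc is not found in your environment.

```shell
#!/bin/bash
# Sketch: write a tiny MPI test program and compile it with mpicc.
# The file name hello_mpi.c is illustrative.

cat > hello_mpi.c << 'EOF'
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    printf("Hello from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Ordinary compiler options such as -O3 are passed straight through
if command -v mpicc > /dev/null 2>&1 ; then
    mpicc -O3 hello_mpi.c -o hello_mpi
else
    echo "mpicc not found; check your MPI environment"
fi
```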

The Quick Start guide has a make file that shows how the compilers are called within "make."

Running MPI Programs

The machine "RA" is actually a collection of roughly 268 individual nodes. Each node is a complete computer in its own right. In turn, each node contains 8 compute cores. The compute cores perform the computation. All of the cores in a node see the same memory. Cores in different nodes do not share memory. Instead, they communicate with each other by passing messages using the Message Passing Interface (MPI) library.

When you log in to RA you are logging into the "head" node. The head node should not be used to run parallel applications. Instead, you run a script that requests one or more of the other "compute" nodes of RA. The script then runs your program on the requested nodes.

All parallel applications, including MPI programs, must be run on compute nodes using the batch queuing system. Do not run an MPI application on the RA head node. This could hang the node, requiring a reboot and potentially killing other people's jobs. This will annoy others. People running parallel applications on the head node may have their accounts suspended.

The Quick Start guide has a fairly complete example of compiling and running a parallel MPI program. We will summarize here.

MPI programs are built using the MPI compilers as discussed above. After they are built, the programs are run by using a script. A script specifies the number of nodes required, the program to run, and the maximum run time for the program. The script is then submitted to the batch queuing system. The batch queuing system schedules a job to run on the number of nodes requested. The parallel job will not run until there are a sufficient number of nodes available to run the program. Often, small node count jobs with a short maximum run time will run sooner than large, long jobs. The scheduler may "fit in" a number of small short jobs on nodes it is setting aside for larger long jobs.

Batch scripts are just normal shell scripts and can have almost any normal shell script command. The lines in the batch scripts that begin with #PBS are special comments that are interpreted by the job scheduler.

Here is a simple example script. It requests a single node that contains 8 cores, nodes=1:ppn=8, for 10 minutes, walltime=00:10:00. The MPI program hello_mpi contained in the same directory as the script will be run on 8 processors, mpiexec -np 8.

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00 
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N testIO 
#PBS -o stdout 
#PBS -e stderr 
#PBS -r n 
#PBS -V 
#----------------------------------------------------- 
#Go to the directory that contains this script
cd $PBS_O_WORKDIR 

mpiexec -np 8 ./hello_mpi

The option #PBS -V is important. It causes the environment that you have defined on RA to be exported to the compute nodes. Without this option, most programs will not work properly.

The option #PBS -N testIO gives the job a name which can be seen in the commands that check job status.

Under some circumstances the job scheduler will try to put multiple jobs on a single node. The option #PBS -W x=NACCESSPOLICY:SINGLEJOB prevents this.

After the program runs the output will be put in the file stdout and any error information will be put in the file stderr.
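
The same script can be adapted to other node counts. The sketch below prints a batch script for a given number of nodes, wall time, and program. The helper name make_pbs_script is hypothetical (it is not a command installed on RA), and it assumes 8 cores per node with one MPI task per core, as in the example above.

```shell
#!/bin/bash
# Sketch: print a batch script for a given node count, wall time and program.
# make_pbs_script is a hypothetical helper, not a command installed on RA.
# It assumes 8 cores per node and one MPI task per core.

make_pbs_script() {
    local nodes=$1 walltime=$2 program=$3
    local ntasks=$(( nodes * 8 ))    # 8 cores per node
    cat << EOF
#!/bin/bash
#PBS -l nodes=${nodes}:ppn=8
#PBS -l walltime=${walltime}
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N testIO
#PBS -o stdout
#PBS -e stderr
#PBS -r n
#PBS -V
#-----------------------------------------------------
#Go to the directory that contains this script
cd \$PBS_O_WORKDIR

mpiexec -np ${ntasks} ${program}
EOF
}

# Example: 2 nodes (16 tasks) for 10 minutes
make_pbs_script 2 00:10:00 ./hello_mpi > myscript
cat myscript
```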

Assuming the name of the script is myscript, you submit it using the command

qsub myscript

This will return a job number. You can see the status of your job by running the command

qstat "job number"

Here "job number" is the numerical part of the output of the qsub command. The qstat command will show one of the following states:

Q   Waiting to run
R   The job is running
C   The job is finished
E   This is not seen very often but indicates that a job is in between one of the other states. It does not indicate an error.
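
The job number and state letters can also be handled in a script. The sketch below is a minimal illustration; job_number and describe_state are hypothetical helper names, and the sample job id follows the form shown elsewhere in this guide (1280.ra.mines.edu).

```shell
#!/bin/bash
# Sketch: helpers for working with qsub and qstat output.
# job_number and describe_state are hypothetical helper names.

# Strip everything after the first dot: 1280.ra.mines.edu -> 1280
job_number() {
    echo "${1%%.*}"
}

# Translate a one-letter qstat state into the descriptions listed above
describe_state() {
    case "$1" in
        Q) echo "Waiting to run" ;;
        R) echo "The job is running" ;;
        C) echo "The job is finished" ;;
        E) echo "In between states (not an error)" ;;
        *) echo "Unknown state: $1" ;;
    esac
}

job_number 1280.ra.mines.edu
describe_state R
```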

Building OpenMP (Threaded) Programs

OpenMP is the de-facto standard for parallel programming on shared memory systems using threading. The cores on individual nodes of RA and Mio share memory so OpenMP can be used to do node level parallelism, that is across the 8 (or up to 12 on Mio) cores on a node. For more information on OpenMP see: http://openmp.org.

Compiling an OpenMP program requires a command line option that is specific to the compiler vendor as shown below.

Intel:

All Intel compilers use the -openmp option to enable OpenMP. For example:

C        icc -openmp
C++      icpc -openmp
Fortran  ifort -openmp

Portland Group:

All Portland Group compilers use the -mp option to enable OpenMP. For example:

C           pgcc -mp
C++         pgCC -mp
Fortran 90  pgf90 -mp
Fortran 95  pgf95 -mp

Running OpenMP (Threaded)

All parallel applications, including OpenMP programs, must be run on compute nodes using the batch queuing system. Do not run a threaded application on the RA head node. This could hang the node, requiring a reboot and potentially killing other people's jobs. This will annoy others. People running parallel applications on the head node may have their accounts suspended.

The environment variable OMP_NUM_THREADS controls the number of threads used by an OpenMP program. This variable should be set in your script. For example, the following script can be used to run the OpenMP program hello_omp using 4 threads.

#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00 
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N testIO 
#PBS -o stdout 
#PBS -e stderr 
#PBS -r n 
#PBS -V 
#----------------------------------------------------- 
#Go to the directory that contains this script
cd $PBS_O_WORKDIR 

#Set the number of threads to use to 4
export OMP_NUM_THREADS=4

#Run my program using 4 threads
./hello_omp

Note that we do not use the mpiexec command to run OpenMP programs. mpiexec is normally only used for MPI programs.

Here is a "Hello World" program written in Fortran using OpenMP. The program writes the "Thread Number" for each thread.

program hello
    implicit none
    integer OMP_GET_MAX_THREADS,OMP_GET_THREAD_NUM
!$OMP PARALLEL
!$OMP CRITICAL
    write(*,fmt="(a,i2,a,i2)")" thread= ",OMP_GET_THREAD_NUM(), &
                              " of ",     OMP_GET_MAX_THREADS()
!$OMP END CRITICAL
!$OMP END PARALLEL
end program

The compile line for this program would be:

ifort -openmp hello_omp.f90 -o hello_omp

Next we have a slightly more complicated version of the script shown above. This script will run the program 4 times using 1, 2, 4, and 8 threads, setting OMP_NUM_THREADS in a loop and then running the program.

#!/bin/bash
#PBS -l nodes=1:ppn=8:compute
#PBS -l walltime=00:10:00 
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N testIO 
#PBS -o stdout 
#PBS -e stderr 
#PBS -r n 
#PBS -V 
#----------------------------------------------------- 
#Go to the directory that contains this script
cd $PBS_O_WORKDIR 

#Save a nicely sorted list of nodes 
sort -u $PBS_NODEFILE  > mynodes.$PBS_JOBID 

#Run my program 4 times using 1, 2, 4, and 8 threads
for NT in 1 2 4 8 ; do
  export OMP_NUM_THREADS=$NT
  echo OMP_NUM_THREADS=$OMP_NUM_THREADS
#Run my program using threads
  ./hello_omp
  echo
done

We would use the qsub command to run this script. For example

qsub myscript

After the script runs we would get the output:

OMP_NUM_THREADS=1
 thread=  0 of  1

OMP_NUM_THREADS=2
 thread=  0 of  2
 thread=  1 of  2

OMP_NUM_THREADS=4
 thread=  0 of  4
 thread=  1 of  4
 thread=  2 of  4
 thread=  3 of  4

OMP_NUM_THREADS=8
 thread=  0 of  8
 thread=  1 of  8
 thread=  2 of  8
 thread=  3 of  8
 thread=  4 of  8
 thread=  7 of  8
 thread=  5 of  8
 thread=  6 of  8

Mapping of parallel tasks to nodes

The scripts discussed under the Quick Start and Running MPI Programs sections assume that you want to have a single MPI task running on each core. In some cases you might want a different mapping of tasks to cores.

This type of operation is supported but doing the mappings from within a script can be a bit tricky. We have created a script match which makes such mappings of tasks to cores easier. The script is documented here.
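
To give a feel for what such a mapping involves, the sketch below builds a machine file that places a chosen number of tasks on each node. It is not the match script and makes no claim about how match works; make_machinefile is a hypothetical helper, and a stand-in node file is generated so the example runs outside a batch job. The stand-in assumes the node file lists each node once per requested core.

```shell
#!/bin/bash
# Sketch: build a machine file that places a chosen number of MPI tasks on
# each node. This is NOT the "match" script mentioned above, only an
# illustration of the idea. make_machinefile is a hypothetical helper.

make_machinefile() {
    local nodefile=$1 tasks_per_node=$2
    local node i
    # sort -u gives each node once, as in the examples in this guide
    for node in $(sort -u "$nodefile") ; do
        for (( i = 0; i < tasks_per_node; i++ )) ; do
            echo "$node"
        done
    done
}

# Stand-in for $PBS_NODEFILE so the example runs outside a batch job:
# two nodes, each listed once per core (assumed format)
fake_nodefile=$(mktemp)
for n in compute-9-8 compute-9-9 ; do
    for c in 1 2 3 4 5 6 7 8 ; do echo "$n" ; done
done > "$fake_nodefile"

# Two tasks per node instead of eight
make_machinefile "$fake_nodefile" 2 > mymachines
cat mymachines
```

With Open MPI, a file like this can typically be given to mpiexec with the -machinefile option; check the mpiexec man page for the version you have selected before relying on it.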

Additional advanced scripting techniques are discussed in the next section.

Some Advanced Scripts

TBD

We have a presentation on some advanced scripting techniques here. It will be expanded shortly.

Requesting specific types of nodes

Ra has 184 nodes with 16 Gbytes of memory per node and 84 nodes with 32 Gbytes of memory per node. To run only on the nodes that have 32 Gbytes, add the option :fat to the line in your script that contains the number of nodes you are requesting. For example, if you want two nodes with 32 Gbytes, the line in your batch script:

#PBS -l nodes=2:ppn=8

becomes

#PBS -l nodes=2:ppn=8:fat

Also, there are two types of "fat" nodes, pe1950 and pe6850. They have different processor types. (All of the thin nodes are pe1950 nodes but with less memory.) Most programs will run on either the pe1950 or the pe6850 nodes. If you receive an error message similar to:

Fatal Error: This program was not built to run on the processor in your 
system. The allowed processors are: Intel(R) Core(TM) Duo processors and 
compatible Intel processors with supplemental Streaming SIMD Extensions 3
(SSSE3) instruction support.

then the program must be run on the pe1950 nodes. To force your program to run on pe1950 nodes the line in your batch script would be of the form:

#PBS -l nodes=2:ppn=8:pe1950

or

#PBS -l nodes=2:ppn=8:pe1950:fat

Local Disk Space

Users should not write to the /tmp directory. Each of the compute nodes has a local disk, /state/partition1, which is writable by all users. Please use /state/partition1 instead of /tmp for temporary files. These temporary files should be deleted at the end of your job as part of your pbs script. Note that you cannot see these temporary files from Ra. If you want to keep these files, they must also be copied from the compute nodes to Ra as part of your pbs script. Click here to see a program and pbs script that creates files in /state/partition1 and then moves them to the working directory.
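
The pattern of writing to local disk, copying results back, and cleaning up might be sketched as follows. This is an illustration rather than the linked example: the directory and file names are hypothetical, and the fallback to a temporary directory exists only so the commands can be tried on a machine that has no /state/partition1.

```shell
#!/bin/bash
# Sketch: use node-local disk for temporary files, copy results back, and
# clean up. On RA the local disk is /state/partition1; the fallback to a
# temporary directory is only so the example can be tried elsewhere.

if [ -d /state/partition1 ] && [ -w /state/partition1 ] ; then
    local_root=/state/partition1
else
    local_root=$(mktemp -d)          # off-cluster stand-in (assumption)
fi

# In a batch job this would be $PBS_O_WORKDIR
work_dir=${PBS_O_WORKDIR:-$PWD}

# Private scratch directory for this job; the name is illustrative
job_tmp="$local_root/${USER:-user}_${PBS_JOBID:-test}_$$"
mkdir -p "$job_tmp"

# Stand-in for the real program writing its temporary output
echo "intermediate data" > "$job_tmp/output.dat"

# Copy what should be kept back to the working directory, then clean up
cp "$job_tmp/output.dat" "$work_dir/"
rm -rf "$job_tmp"
```

Note that this sketch only touches the local disk of the node on which the script itself runs. A job spanning several nodes would need to repeat the copy and cleanup on each node.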

The amount of space in /state/partition1 depends on the type of node as shown in the chart below.

Node type    Size (Gbytes) of /state/partition1    pbs option to select node
thin 1950    37                                    #PBS -l nodes=2:ppn=8:pe1950:thin
fat 1950     21                                    #PBS -l nodes=2:ppn=8:pe1950:fat

Queue Times

There are several available queues on Ra. You do not normally specify a particular queue in which to run your jobs. The queue is selected automatically based on the amount of memory you request and the time limit. The limits are set in your run script. The maximum time you can request for a job is 6 days or 144 hours. However, there is a limit on the number of jobs that can be running on the machine with requested time over 2 days or 48 hours. So if you submit a job for over 48 hours and there are already a number of jobs running with requested times over 48 hours, your job may not run until the other jobs finish. It is normally better not to submit jobs for over 48 hours.
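
The time limits above can be checked before a job is submitted. The sketch below converts a run time in whole hours into a walltime line, refusing requests over 144 hours and printing a note for requests over 48 hours. The helper name walltime_line is hypothetical.

```shell
#!/bin/bash
# Sketch: turn a run time in hours into a walltime line, with the limits
# described above. walltime_line is a hypothetical helper.

walltime_line() {
    local hours=$1
    if (( hours > 144 )) ; then
        echo "error: the maximum request is 144 hours" >&2
        return 1
    fi
    if (( hours > 48 )) ; then
        echo "note: jobs over 48 hours may wait longer to start" >&2
    fi
    printf '#PBS -l walltime=%02d:00:00\n' "$hours"
}

walltime_line 36
walltime_line 6
```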

Queue Related Commands

The web pages:

show the state of the nodes on RA and the jobs in the queue.

Command       Description
qsub          submit jobs
canceljob     cancel job
qdel          delete/cancel batch jobs
checkjob      provide detailed status report for specified job
checkjob -v   show why a job will not run on specific nodes
releasehold   release job defers and holds
releaseres    release reservations
sethold       set job holds
showq         show queued jobs
showres       show existing reservations
showstart     show estimates of when job can/will start
showstate     show current state of resources
tracejob      trace job actions and states recorded in batch logs
pbsnodes      view/modify batch status of compute nodes
qalter        modify queued batch jobs
qhold         hold batch jobs
qrls          release batch job holds
qsig          send a signal to a batch job
qstat         view queues and jobs

Compilers Documentation

Compiler                 Command   Turn off Optimization   "Good" Optimization   Turn on OpenMP
Intel Fortran            ifort     -O0                     -O3                   -openmp
Intel C                  icc       -O0                     -O3                   -openmp
Intel C++                icpc      -O0                     -O3                   -openmp
Portland Group Fortran   pgf90     -O0                     -fast                 -mp
Portland Group C         pgcc      -O0                     -fast                 -mp
Portland Group C++       pgCC      -O0                     -fast                 -mp
Click on the "command" to see a HTML version of the man page
Intel Compilers Full Documentation
http://ra.mines.edu/intel
Portland Group Compilers Full Documentation
http://ra.mines.edu/pg

The default Intel and Portland Group compilers on RA are rather old. You can set your environment to use the newest compiler versions by adding the following lines to your .bashrc file or .cshrc file.

Bash shell users add to .bashrc
source /opt/pgi/linux86-64/2012/pgi.sh
source /lustre/home/apps/compilers/intel/bin/compilervars.sh intel64
C shell users add to .cshrc
source /opt/pgi/linux86-64/2012/pgi.csh
source /lustre/home/apps/compilers/intel/bin/compilervars.csh intel64

Command Line Options for Debugging

There is an article here that describes a number of options that are available for debugging programs without using debuggers. This includes compiler and subroutine options for tracebacks and run time checking.

Source Level Debugging

The standard Unix debugger gdb is available in /lustre/home/apps/gdb-6.8

The Intel and Portland Group debuggers are also available on Ra. The Portland Group debugger is pgdbg and the Intel debugger is idb. See the compiler documentation pages for more information.

Runtime Error Messages

We also have a direct link to the Intel Fortran Run-time error messages here.

A direct link to the OpenMPI runtime error codes is also available here.

Exclusive access to nodes

The default behavior of the queueing system on Ra is to fill unused processors. If you submit a job that uses fewer than 8 processors per node, then additional jobs might be scheduled on your nodes. To force exclusive access to your nodes, add one of the following options to your batch script:

#PBS -W x=NACCESSPOLICY:SINGLEJOB
Allows only one job on a node
#PBS -W x=NACCESSPOLICY:SINGLEUSER
Allows more than one job on a node but only by a single user

Running Parallel Interactive jobs

Running a parallel interactive job is a two step process. You first run a qsub command to request interactive nodes. After some time you will be connected (logged in) to an interactive node. You then "cd" to the directory that contains your executable and run it with an mpiexec command.

We have an example below. The text following each prompt is what is typed into the terminal window. We will run a simple "Hello World" MPI example. The source for the example can be obtained from geco.mines.edu/guide/guideFiles/c_ex00.c

In our qsub command we request 1 node. This will give us 8 computational cores to use while running our parallel program. After we enter the qsub command we will get back a ready message. Next, we enter an mpiexec command, specifying the number of MPI tasks using the "-n" option. After our job finishes we are free to run additional jobs. Note that in the second case we specified 16 MPI tasks, even though we have only asked for 8 computational cores. This is legal and it might be useful in cases where you are just checking the correctness of an algorithm and don't care about performance.

After we are done with our runs we type exit to logout and release the nodes.

[tkaiser@ra ~/guide]$qsub -q INTERACTIVE -I -V -l nodes=1:ppn=8 \
-W x=NACCESSPOLICY:SINGLEUSER -l walltime=00:15:00
qsub: waiting for job 1280.ra.mines.edu to start
qsub: job 1280.ra.mines.edu ready

[tkaiser@compute-9-8 ~]$cd guide
[tkaiser@compute-9-8 ~/guide]$mpiexec -n 4 c_ex00
Hello from 0 of 4 on compute-9-8.local
Hello from 1 of 4 on compute-9-8.local
Hello from 2 of 4 on compute-9-8.local
Hello from 3 of 4 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$mpiexec -n 16 c_ex00
Hello from 1 of 16 on compute-9-8.local
Hello from 0 of 16 on compute-9-8.local
Hello from 3 of 16 on compute-9-8.local
Hello from 4 of 16 on compute-9-8.local
Hello from 5 of 16 on compute-9-8.local
Hello from 8 of 16 on compute-9-8.local
Hello from 9 of 16 on compute-9-8.local
Hello from 10 of 16 on compute-9-8.local
Hello from 7 of 16 on compute-9-8.local
Hello from 11 of 16 on compute-9-8.local
Hello from 12 of 16 on compute-9-8.local
Hello from 13 of 16 on compute-9-8.local
Hello from 14 of 16 on compute-9-8.local
Hello from 15 of 16 on compute-9-8.local
Hello from 2 of 16 on compute-9-8.local
Hello from 6 of 16 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$exit
logout

qsub: job 1280.ra.mines.edu completed
[tkaiser@ra ~/guide]$ 

You can also request more than one node. Then cat $PBS_NODEFILE to see which nodes you were given. For example:

[tkaiser@ra ~/guide]$qsub -q INTERACTIVE -I -V -l nodes=2:ppn=8 \
 -W x=NACCESSPOLICY:SINGLEUSER -l walltime=00:15:00
qsub: waiting for job 91420.ra5.local to start
qsub: job 91420.ra5.local ready

[tkaiser@fatcompute-12-2 ~]$cat $PBS_NODEFILE | sort -u
fatcompute-12-1
fatcompute-12-2
[tkaiser@fatcompute-12-2 ~]$exit

The options -W x=NACCESSPOLICY:SINGLEUSER -l walltime=00:15:00 in the above commands ensure that you have sole access to the node for 15 minutes.
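
Inside an interactive job, the node file can also be summarized to decide how many MPI tasks to start. The sketch below is a minimal illustration; summarize_nodes is a hypothetical helper, and it assumes the node file lists each node once per requested core, which is why the session above uses sort -u to see the distinct nodes. A stand-in file is generated so the example runs outside a job.

```shell
#!/bin/bash
# Sketch: summarize a node file. summarize_nodes is a hypothetical helper.
# It assumes the file lists each node once per requested core.

summarize_nodes() {
    local nodefile=$1
    local nodes slots
    nodes=$(sort -u "$nodefile" | wc -l)
    slots=$(wc -l < "$nodefile")
    # Arithmetic expansion strips any padding that wc adds
    echo "nodes=$(( nodes )) cores=$(( slots ))"
}

# Stand-in for $PBS_NODEFILE: 2 nodes with 8 cores each
demo_nodefile=$(mktemp)
for n in fatcompute-12-1 fatcompute-12-2 ; do
    for c in 1 2 3 4 5 6 7 8 ; do echo "$n" ; done
done > "$demo_nodefile"

summarize_nodes "$demo_nodefile"
```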

Tutorials Link

The link http://geco.mines.edu/workshop/aug2011/ leads to a number of recent tutorials on HPC and running on RA

Changing your login shell

The default login shell on Ra is /bin/bash. If you would like to change your shell you can use the command chsh. To see a list of the available shells type chsh --list-shells. To change your shell type chsh. You will be prompted for your password and the path to your new shell.

Common Problems and Questions

To be completed

Changing your MPI Version

There are several versions of MPI available on RA. The default version, OpenMPI 1.4.1, built with the Intel 11.1 compiler, is rather old. You can expect better performance using the newer version, OpenMPI 1.6, built with version 12.1 of the Intel compiler.

The command mpi-selector can be used to set your MPI version. Using the mpi-selector command is a simple, but multistep process. First, edit your .bashrc file to remove any reference to old versions of MPI. Then run the command mpi-selector --list. You should see a list similar to the one shown below.

[joeuser@ra5 bin]$ mpi-selector --list
intel_4.0.3_intel_12.1
mvapich2_gnu-1.4.1
mvapich2_intel-1.4.1
mvapich2_pgi-1.4.1
openmpi-1.3.2-gcc-i386
openmpi-1.3.2-gcc-x86_64
openmpi_1.6_intel_12.1
ra5_openmpi_gnu-1.4.1
ra5_openmpi_intel-1.4.1
ra5_openmpi_intel_debug-1.4.1
ra5_openmpi_pgi-1.4.1
ra5_openmpi_pgi_debug-1.4.1
[joeuser@ra5 bin]$ 

This gives a list of the versions of MPI available. Select one. Unless there is a reason not to do so, the version should be ra5_openmpi_intel-1.4.1 or openmpi_1.6_intel_12.1. To select the new version of MPI, run the command mpi-selector --set openmpi_1.6_intel_12.1 as shown below.

[joeuser@ra5 bin]$ mpi-selector --set openmpi_1.6_intel_12.1
Defaults already exist; overwrite them? (y/N) y
[joeuser@ra5 bin]$ 

Then log out. The next time you log in, run the command which mpicc to check that you now have the new MPI environment available. Your parallel programs should be rebuilt using your new version of MPI.

If you select openmpi_1.6_intel_12.1 as your MPI version you will automatically get the Intel version 12 compiler. You can check your MPI and base compiler versions as shown below.

[joeuser@ra5 ~]$ mpicc --showme:version
mpicc: Open MPI 1.6 (Language: C)

[joeuser@ra5 ~]$ icc -V
Intel(R) C Intel(R) 64 Compiler XE 
for applications running on Intel(R) 64,
Version 12.1.4.319 Build 20120410
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.

[joeuser@ra5 ~]$ mpif90 --showme:version
mpif90: Open MPI 1.6 (Language: Fortran 90)

[joeuser@ra5 ~]$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE 
for applications running on Intel(R) 64,
Version 12.1.4.319 Build 20120410
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.
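
The version check can also be scripted. The sketch below extracts the Open MPI version number from a line of --showme:version output; ompi_version is a hypothetical helper, and the sample line is the one from the session above. The final step only runs if mpicc is found in your environment.

```shell
#!/bin/bash
# Sketch: pull the Open MPI version out of the "--showme:version" output.
# ompi_version is a hypothetical helper; the sample line is the one shown
# in the session above.

ompi_version() {
    # Expects a line such as: mpicc: Open MPI 1.6 (Language: C)
    sed -n 's/^.*Open MPI \([0-9][0-9.]*\).*$/\1/p'
}

echo "mpicc: Open MPI 1.6 (Language: C)" | ompi_version

# On RA the real output could be checked the same way
if command -v mpicc > /dev/null 2>&1 ; then
    mpicc --showme:version 2>&1 | ompi_version
fi
```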

Programs linked with the old version of MPI will not work with the new version of mpiexec. You can run old programs if you specify the full path to the old mpiexec command, /opt/ra5_openmpi_intel/1.4.1/bin/mpiexec.