RA User's Guide
This guide has information about running on the CSM Golden Energy Computing Organization HPC resource RA.MINES.EDU.
RA User Agreement
RA users agree to the following:
- I will not give my password to anyone.
- I will not store it on any machine in plain text.
- I will not use a password that I use on another machine.
- I will not reuse the password or ssh keys that I had before September 10, 2009.
- I will not use a blank passphrase for any ssh key that I create to go to/from RA.
- I will not allow anyone else to use my account.
- I will not login to RA from someone else's account.
- I understand that I am responsible for my own data backups.
Ra is a collection of nodes, with each node containing 8 computing cores. The 8 compute cores on one node share the same memory. Memory is not shared across nodes.
Ra is designed primarily to run distributed memory applications. In a distributed memory application there is a collection of processes or tasks running on individual computing cores or processors. That is, each task in an application runs a separate copy of the same program and has its own memory. The various tasks of the application communicate via message passing. The normal method for tasks to pass messages is to use calls from the Message Passing Interface (MPI) library. The link http://www.open-mpi.org/doc/v1.4/ has a nice list with documentation for the calls of the MPI library.
It is also possible to write programs on Ra that exploit the feature that the 8 compute cores on a node share memory. One method of writing such applications is to use threads. The OpenMP package is available on Ra to facilitate writing threaded applications. "OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer."
While it is possible to enforce restrictions on users' memory and process counts, such restrictions can cause undesired effects.
We have been relying on users to observe the following rules:
- Do not run parallel applications on the RA frontend, either OpenMP or MPI.
- If you need to run parallel interactive jobs, reserve a node for doing so as discussed in the Running Parallel Interactive Jobs section of the RA User's Guide.
- Do not run memory intensive or long applications on the front end.
- The front end is designed primarily for edits and compiles.
- As a general guideline, if we notice your application running then it might be a problem. If we notice your application repeatedly then it is a problem. Any application that is taking too many resources will be killed.
- For the benefit of all users, repeat offenders will lose access.
- Do not create large data sets in your home directory. Large data sets should only be created in /lustre/scratch.
File System Overview and Usage
There are two parallel file systems on RA, /lustre/home and /lustre/scratch. As the name implies, /lustre/home contains users' home directories. Every user also has a directory in /lustre/scratch. Both file systems are available across all nodes.
- /lustre/home: small application builds, scripts, and small data sets. Community codes are in /lustre/home/apps.
- /lustre/scratch: the primary location for running applications and storing larger data sets.
/lustre/scratch is about 10 times larger than /lustre/home. So /lustre/scratch should be used for running applications. /lustre/scratch is also potentially faster.
/lustre/scratch is not backed up. It is too large for backups to be practical. Users are responsible for backing up their own data.
Access to RA
Access to Ra is via the command:
The only way to access Ra is by using ssh. Unix and Unix-like operating systems (OSX, Linux, Unicos...) have ssh built in. If you are using a Windows-based machine to access RA, then you must use a terminal package that supports ssh, such as PuTTY, available from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
We have a description of how to connect to RA using ssh key-based access from both Unix and Windows-based machines at: http://geco.mines.edu/ssh. Ssh key-based access will enable you to log into RA all day while typing a passphrase only one time.
Testing your environment
Testing your environment on RA is just a matter of building and running a parallel program. The method is the same for RA and Mio. For a copy-and-paste guide to building and running parallel MPI applications see: http://hpc.mines.edu/quick.html. Unlike RA, Mio does not require account numbers as part of your script.
If at any time you feel your environment is not working properly, the first test to try is to run through the quick start guide given above.
Building MPI programs
The commands for compiling MPI programs are:
- C programs:
- C++ programs:
- Fortran 90 programs:
- Fortran 77 programs:
The MPI compilers are actually scripts that call "normal" C, C++, or Fortran compilers, adding in the MPI include files and libraries. Thus, any compile options that you would normally pass to the regular compilers can be added to the MPI compile lines. For example, -O3 provides a good level of optimization.
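As a rough illustration of how ordinary compiler options pass through the MPI wrapper scripts, the sketch below picks a wrapper from a source file's extension and prints the resulting compile line without running it. The wrapper names used (mpicc, mpicxx, mpif90, mpif77) are the usual Open MPI ones and are an assumption here, not a copy of the list above. The function name mpi_compile_line and the file names are hypothetical.

```shell
#!/bin/bash
# Print (do not run) an MPI compile line for one source file.
# Assumption: the standard Open MPI wrapper names are on your PATH.
mpi_compile_line() {
    local src="$1"
    local opts="${2:--O3}"        # ordinary compiler options pass straight through
    local exe="${src%.*}"         # hello_mpi.c -> hello_mpi
    local wrapper
    case "$src" in
        *.c)            wrapper=mpicc  ;;
        *.cpp|*.cc|*.C) wrapper=mpicxx ;;
        *.f90)          wrapper=mpif90 ;;
        *.f)            wrapper=mpif77 ;;
        *) echo "unknown source type: $src" >&2; return 1 ;;
    esac
    echo "$wrapper $opts -o $exe $src"
}

mpi_compile_line hello_mpi.c
mpi_compile_line solver.f90 "-O0 -g"
```

Once the printed line looks right it can be typed or pasted at the prompt. With Open MPI, appending --showme to a wrapper command prints the underlying compiler invocation instead of compiling, which is a quick way to see what the wrapper adds.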
The Quick Start guide has a make file that shows how the compilers are called within "make."
Running MPI Programs
The machine "RA" is actually a collection of roughly 268 individual nodes. Each node is a complete computer in its own right. In turn, each node contains 8 compute cores. The compute cores perform the computation. All of the cores in a node see the same memory. Cores in different nodes do not share memory. Instead, they communicate with each other by passing messages using the Message Passing Interface (MPI) library.
When you login to RA you are logging into the "head" node. The head node should not be used to run parallel applications. Instead, you run a script that requests one or more of the other "compute" nodes of RA. The script then runs your program on the requested nodes.
All parallel applications, including MPI programs, must be run on compute nodes using the batch queuing system. Do not run an MPI application on the RA head node. This could hang the node, requiring a reboot and potentially killing other people's jobs. This will annoy others. People running parallel applications on the head node may have their accounts suspended.
The Quick Start guide has a fairly complete example of compiling and running a parallel MPI program. We will summarize here.
MPI programs are built using the MPI compilers as discussed above. After they are built, the programs are run by using a script. The script specifies the number of nodes required, the program to run, and the maximum run time for the program. The script is then submitted to the batch queuing system. The batch queuing system schedules a job to run on the number of nodes requested. The parallel job will not run until there are a sufficient number of nodes available to run the program. Often, small node count jobs with a short maximum run time will run sooner than large, long jobs. The scheduler may "fit in" a number of small short jobs on nodes it is setting aside for larger long jobs.
Batch scripts are just normal shell scripts and can have almost any normal shell script command. The lines in the batch scripts that begin with #PBS are special comments that are interpreted by the job scheduler.
Here is a simple example script. It requests a single node that contains 8 cores, nodes=1:ppn=8, for 10 minutes, walltime=00:10:00. The MPI program hello_mpi contained in the same directory as the script will be run on 8 processors, mpiexec -np 8.
#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N testIO
#PBS -o stdout
#PBS -e stderr
#PBS -r n
#PBS -V
#-----------------------------------------------------
#Go to the directory that contains this script
cd $PBS_O_WORKDIR
mpiexec -np 8 ./hello_mpi
The option #PBS -V is important. It causes the environment that you have defined on RA to be exported to the compute nodes. Without this option, most programs will not work properly.
The option #PBS -N testIO gives the job a name which can be seen in the commands that check job status.
Under some circumstances the job scheduler will try to put multiple jobs on a single node. The option #PBS -W x=NACCESSPOLICY:SINGLEJOB prevents this.
After the program runs the output will be put in the file stdout and any error information will be put in the file stderr.
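The relationship between the node request and the mpiexec task count can be sketched with a small generator: for a job that uses every core, the value given to -np is the number of nodes times 8. The function name make_mpi_script is hypothetical, and the sketch assumes 8 cores per node as described above. It prints a script in the same form as the example rather than submitting anything.

```shell
#!/bin/bash
# Print a PBS batch script for an MPI run that uses every core on each node.
# Assumption: 8 cores per node, as on RA's compute nodes.
make_mpi_script() {
    local nodes="$1" walltime="$2" program="$3"
    local cores_per_node=8
    local ntasks=$(( nodes * cores_per_node ))
    cat <<EOF
#!/bin/bash
#PBS -l nodes=${nodes}:ppn=${cores_per_node}
#PBS -l walltime=${walltime}
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N ${program}
#PBS -o stdout
#PBS -e stderr
#PBS -r n
#PBS -V
cd \$PBS_O_WORKDIR
mpiexec -np ${ntasks} ./${program}
EOF
}

# Example: 2 nodes (16 MPI tasks) for 30 minutes.
make_mpi_script 2 00:30:00 hello_mpi
```

Redirecting the output to a file gives a script that can be inspected and edited before it is submitted to the batch system.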
Assuming the name of the script is myscript, you submit it using the command
qsub myscript
This will return a job number. You can see the status of your job by running the command
qstat "job number"
Where "job number" is the numerical part of the output of the qsub command. Qstat will show:
- Waiting to run
- The job is running
- The job is finished
- This is not seen very often but indicates that a job is in between one of the other states. It does not indicate an error.
Building OpenMP (Threaded) Programs
OpenMP is the de-facto standard for parallel programming on shared memory systems using threading. The cores on individual nodes of RA and Mio share memory so OpenMP can be used to do node level parallelism, that is across the 8 (or up to 12 on Mio) cores on a node. For more information on OpenMP see: http://openmp.org.
Compiling an OpenMP program requires a command line option that is specific to the compiler vendor as shown below.
All Intel compilers use the -openmp option to enable OpenMP. For example:
- icc -openmp
- icpc -openmp
- ifort -openmp
All Portland Group compilers use the -mp option to enable OpenMP. For example:
- pgcc -mp
- pgCC -mp
- Fortran 90
- pgf90 -mp
- Fortran 95
- pgf95 -mp
Running OpenMP (Threaded)
All parallel applications, including OpenMP programs, must be run on compute nodes using the batch queuing system. Do not run a threaded application on the RA head node. This could hang the node, requiring a reboot and potentially killing other people's jobs. This will annoy others. People running parallel applications on the head node may have their accounts suspended.
The environment variable OMP_NUM_THREADS controls the number of threads used by an OpenMP program. This variable should be set in your script. For example, the following script can be used to run the OpenMP program hello_omp using 4 threads.
#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N testIO
#PBS -o stdout
#PBS -e stderr
#PBS -r n
#PBS -V
#-----------------------------------------------------
#Go to the directory that contains this script
cd $PBS_O_WORKDIR
#Set the number of threads to use to 4
export OMP_NUM_THREADS=4
#Run my program using 4 threads
./hello_omp
Note that we do not use the mpiexec command to run OpenMP programs. mpiexec is normally only used for MPI programs.
Here is a "Hello World" program written in Fortran using OpenMP. The program writes the "Thread Number" for each thread.
program hello
  implicit none
  integer OMP_GET_MAX_THREADS,OMP_GET_THREAD_NUM
!$OMP PARALLEL
!$OMP CRITICAL
  write(*,fmt="(a,i2,a,i2)")" thread= ",OMP_GET_THREAD_NUM(), &
                            " of ", OMP_GET_MAX_THREADS()
!$OMP END CRITICAL
!$OMP END PARALLEL
end program
The compile line for this program would be:
ifort -openmp hello_omp.f90 -o hello_omp
Next we have a slightly more complicated version of the script shown above. This script will run the program 4 times using 1, 2, 4, and 8 threads, setting OMP_NUM_THREADS in a loop and then running the program.
#!/bin/bash
#PBS -l nodes=1:ppn=8:compute
#PBS -l walltime=00:10:00
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -N testIO
#PBS -o stdout
#PBS -e stderr
#PBS -r n
#PBS -V
#-----------------------------------------------------
#Go to the directory that contains this script
cd $PBS_O_WORKDIR
#Save a nicely sorted list of nodes
sort -u $PBS_NODEFILE > mynodes.$PBS_JOBID
#Run my program 4 times using 1, 2, 4, and 8 threads
for NT in 1 2 4 8 ; do
  export OMP_NUM_THREADS=$NT
  echo OMP_NUM_THREADS=$OMP_NUM_THREADS
  #Run my program using threads (the executable built by the compile line above)
  ./hello_omp
  echo
done
We would use the qsub command to submit this script, giving the script's file name as its argument.
After the script runs we would get the output:
OMP_NUM_THREADS=1
thread= 0 of 1

OMP_NUM_THREADS=2
thread= 0 of 2
thread= 1 of 2

OMP_NUM_THREADS=4
thread= 0 of 4
thread= 1 of 4
thread= 2 of 4
thread= 3 of 4

OMP_NUM_THREADS=8
thread= 0 of 8
thread= 1 of 8
thread= 2 of 8
thread= 3 of 8
thread= 4 of 8
thread= 7 of 8
thread= 5 of 8
thread= 6 of 8
Mapping of parallel tasks to nodes
The scripts discussed under the Quick Start and Running MPI Programs sections assume that you want to have a single MPI task running on each core. In some cases you might want other mappings of tasks to cores. For example, you might want:
- Only 2 or 4 MPI tasks on a node
- Different numbers of tasks on each node
- Different MPI source programs running on different cores (MPMD)
- A hybrid MPI/OpenMP program with less than N MPI tasks per node
This type of operation is supported but doing the mappings from within a script can be a bit tricky. We have created a script match which makes such mappings of tasks to cores easier. The script is documented here.
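Independently of the match script, the basic idea behind the first case in the list (only 2 or 4 MPI tasks on a node) can be sketched as follows. The sketch reads a PBS-style node file, which has one line per allocated core, and writes an Open MPI hostfile with a fixed number of slots per distinct node. The function name hostfile_with_tasks_per_node and the node names are hypothetical, and this is not how match itself works.

```shell
#!/bin/bash
# Build an Open MPI hostfile that places a fixed number of tasks on each node.
# Input:  a PBS-style node file (one line per allocated core).
# Output: one "hostname slots=N" line per distinct node.
hostfile_with_tasks_per_node() {
    local nodefile="$1" tasks_per_node="$2"
    sort -u "$nodefile" | while read -r host; do
        echo "$host slots=$tasks_per_node"
    done
}

# Demonstration with a made-up node file (2 nodes x 8 cores).
demo_nodefile=$(mktemp)
for n in compute-1-1 compute-1-2; do
    for core in 1 2 3 4 5 6 7 8; do echo "$n"; done
done > "$demo_nodefile"

hostfile_with_tasks_per_node "$demo_nodefile" 2
rm -f "$demo_nodefile"
```

Inside a batch script the function would be given $PBS_NODEFILE, its output saved to a file, and that file passed to mpiexec with the --hostfile option together with a matching -np value (4 in this two-node case). How Open MPI combines a hostfile with the PBS allocation depends on how it was built, so treat this as a starting point to verify on a short test job.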
Additional advanced scripting techniques are discussed in the next section.
Some Advanced Scripts
We have a presentation on some advanced scripting techniques here. It will be expanded shortly.
Requesting specific types of nodes
Ra has 184 nodes with 16 Gbytes of memory per node and 84 nodes with 32 Gbytes of memory per node. To run only on the nodes that have 32 Gbytes, add the option :fat to the line in your script that contains the number of nodes you are requesting. For example, if you want two nodes with 32 Gbytes, the line in your batch script would change from
#PBS -l nodes=2:ppn=8
to
#PBS -l nodes=2:ppn=8:fat
Also, there are two types of "fat" nodes: pe1950 and pe6850. They have different processor types. (All of the thin nodes are pe1950 nodes but with less memory.) Most programs will run on either the pe1950 or the pe6850 nodes. If you receive an error message similar to:
Fatal Error: This program was not built to run on the processor in your system. The allowed processors are: Intel(R) Core(TM) Duo processors and compatible Intel processors with supplemental Streaming SIMD Extensions 3 (SSSE3) instruction support.
then the program must be run on the pe1950 nodes. To force your program to run on pe1950 nodes, the line in your batch script would be of the form:
#PBS -l nodes=2:ppn=8:pe1950
or, to request pe1950 nodes with 32 Gbytes:
#PBS -l nodes=2:ppn=8:pe1950:fat
Local Disk Space
Users should not write to the /tmp directory. Each one of the compute nodes has a local disk, /state/partition1, which is writable by all users. Please use /state/partition1 instead of /tmp for temporary files. These temporary files should be deleted at the end of your job as part of your pbs script. Note that you cannot see these temporary files from Ra. So if you actually want to keep these files, they must also be copied from the compute nodes to Ra as part of your pbs script. Click here to see a program and pbs script that creates files in /state/partition1 and then moves them to the working directory.
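The create, copy back, and delete pattern described above can be sketched as a small shell function. This is a minimal sketch, not the linked example: the function name run_in_local_scratch is hypothetical, and the demonstration uses throwaway directories in place of /state/partition1 and the job's working directory so that it can be tried anywhere.

```shell
#!/bin/bash
# Run a command inside a private directory on node-local disk, copy whatever
# it produced back to a results directory, then clean up.
# On an RA compute node, scratch_root would be /state/partition1.
run_in_local_scratch() {
    local scratch_root="$1" results_dir="$2"
    shift 2
    local workdir status
    workdir=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1
    ( cd "$workdir" && "$@" )
    status=$?
    cp -r "$workdir"/. "$results_dir"/     # keep the files before they are removed
    rm -rf "$workdir"                      # leave the local disk clean
    return $status
}

# Demonstration using temporary directories in place of the real paths.
demo_root=$(mktemp -d)
demo_results=$(mktemp -d)
run_in_local_scratch "$demo_root" "$demo_results" sh -c 'echo "temporary data" > out.dat'
cat "$demo_results/out.dat"
rm -rf "$demo_root" "$demo_results"
```

In a pbs script the call would take /state/partition1 and $PBS_O_WORKDIR as the first two arguments, followed by the full path to the program, since the command runs from inside the scratch directory. The sketch only handles the node on which the script itself runs; a multi-node job would need the same cleanup on every node.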
The amount of space in /state/partition1 depends on the type of node as shown in the chart below.
|Node type||Size (Gbytes) of /state/partition1||pbs option to select node|
|thin 1950||37||#PBS -l nodes=2:ppn=8:pe1950:thin|
|fat 1950||21||#PBS -l nodes=2:ppn=8:pe1950:fat|
There are several available queues on Ra. You do not normally specify a particular queue in which to run your jobs. The queue is chosen automatically based on the amount of memory you request and the time limit. The limits are set in your run script. The maximum time you can request for a job is 6 days or 144 hours. However, there is a limit on the number of jobs that can be running on the machine with requested time over 2 days or 48 hours. So if you submit a job for over 48 hours and there are already a number of jobs running with requested times over 48 hours, your job may not run until the other jobs finish. It is normally better not to submit jobs for over 48 hours.
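The two time limits above can be turned into a quick check before a job is submitted. The sketch below converts a run time in whole hours into the HH:MM:SS form used on the walltime line, warns above 48 hours, and refuses anything above 144 hours. The function name walltime_for_hours is hypothetical, and the check only mirrors the limits stated in this guide; the scheduler remains the authority.

```shell
#!/bin/bash
# Turn a run time in whole hours into a PBS walltime string and flag requests
# that run into the limits described above (48 hours soft, 144 hours maximum).
walltime_for_hours() {
    local hours="$1"
    if [ "$hours" -gt 144 ]; then
        echo "error: $hours hours exceeds the 144 hour maximum" >&2
        return 1
    fi
    if [ "$hours" -gt 48 ]; then
        echo "warning: jobs over 48 hours may wait longer to start" >&2
    fi
    printf '%02d:00:00\n' "$hours"
}

walltime_for_hours 36
```

The printed value is what goes after walltime= on the #PBS -l line of a batch script.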
Queue Related Commands
The web pages:
show the state of the nodes on RA and the jobs in the queue.
|qdel||delete/cancel batch jobs|
|checkjob||provide detailed status report for specified job|
|checkjob -v||show why a job will not run on specific nodes|
|releasehold||release job defers and holds|
|sethold||set job holds|
|showq||show queued jobs|
|showres||show existing reservations|
|showstart||show estimates of when job can/will start|
|showstate||show current state of resources|
|tracejob||trace job actions and states recorded in batch logs|
|pbsnodes||view/modify batch status of compute nodes|
|qalter||modify queued batch jobs|
|qhold||hold batch jobs|
|qrls||release batch job holds|
|qsig||send a signal to a batch job|
|qstat||view queues and jobs|
|Compiler||Command||Turn off Optimization||"Good" Optimization||Turn on OpenMP|
|Portland Group Fortran||pgf90||-O0||-fast||-mp|
|Portland Group C||pgcc||-O0||-fast||-mp|
|Portland Group C++||pgCC||-O0||-fast||-mp|
- Intel Compilers Full Documentation
- Portland Group Compilers Full Documentation
The default Intel and Portland Group compilers on RA are rather old. You can set your environment to use the newest compiler versions by adding the following lines to your .bashrc file or .cshrc file.
- Bash shell users add to .bashrc
source /opt/pgi/linux86-64/2012/pgi.sh
source /lustre/home/apps/compilers/intel/bin/compilervars.sh intel64
- C shell users add to .cshrc
source /opt/pgi/linux86-64/2012/pgi.csh
source /lustre/home/apps/compilers/intel/bin/compilervars.csh intel64
Command Line Options for Debugging
There is an article here that describes a number of options that are available for debugging programs without using debuggers. This includes compiler and subroutine options for tracebacks and run time checking.
Source Level Debugging
The standard Unix debugger gdb is available in /lustre/home/apps/gdb-6.8
The Intel and Portland Group debuggers are also available on Ra. The Portland Group debugger is pgdbg and the Intel debugger is idb. See the compiler documentation pages for more information.
Runtime Error Messages
We also have a direct link to the Intel Fortran Run-time error messages here.
A direct link to the OpenMPI runtime error codes is also available here.
Exclusive access to nodes
The default behavior of the queueing system on Ra is to fill unused processors. If you submit a job that uses fewer than 8 processors per node, then additional jobs might be scheduled on your nodes. To force exclusive access to your nodes, add one of the following options to your batch script:
- #PBS -W x=NACCESSPOLICY:SINGLEJOB
- Allows only one job on a node
- #PBS -W x=NACCESSPOLICY:SINGLEUSER
- Allows more than one job on a node but only by a single user
Running Parallel Interactive jobs
Running a parallel interactive job is a two-step process. You first run a qsub command to request interactive nodes. After some time you will be connected (logged in) to an interactive node. You then "cd" to the directory that contains your executable and run it with an mpiexec command.
We have an example below. The text that follows each $ prompt is what is typed into the terminal window. We will run a simple "Hello World" MPI example. The source for the example can be obtained from geco.mines.edu/guide/guideFiles/c_ex00.c
In our qsub command we request 1 node. This will give us 8 computational cores to use while running our parallel program. After we enter the qsub command we will get back a ready message. Next, we enter an mpiexec command, specifying the number of MPI tasks using the "-n" option. After our job finishes we are free to run additional jobs. Note that in the second case we specified 16 MPI tasks, even though we have only asked for 8 computational cores. This is legal, and it might be useful in cases where you are just checking the correctness of an algorithm and don't care about performance.
After we are done with our runs we type exit to logout and release the nodes.
[tkaiser@ra ~/guide]$qsub -q INTERACTIVE -I -V -l nodes=1:ppn=8 \
    -W x=NACCESSPOLICY:SINGLEUSER -l walltime=00:15:00
qsub: waiting for job 1280.ra.mines.edu to start
qsub: job 1280.ra.mines.edu ready
[tkaiser@compute-9-8 ~]$cd guide
[tkaiser@compute-9-8 ~/guide]$mpiexec -n 4 c_ex00
Hello from 0 of 4 on compute-9-8.local
Hello from 1 of 4 on compute-9-8.local
Hello from 2 of 4 on compute-9-8.local
Hello from 3 of 4 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$mpiexec -n 16 c_ex00
Hello from 1 of 16 on compute-9-8.local
Hello from 0 of 16 on compute-9-8.local
Hello from 3 of 16 on compute-9-8.local
Hello from 4 of 16 on compute-9-8.local
Hello from 5 of 16 on compute-9-8.local
Hello from 8 of 16 on compute-9-8.local
Hello from 9 of 16 on compute-9-8.local
Hello from 10 of 16 on compute-9-8.local
Hello from 7 of 16 on compute-9-8.local
Hello from 11 of 16 on compute-9-8.local
Hello from 12 of 16 on compute-9-8.local
Hello from 13 of 16 on compute-9-8.local
Hello from 14 of 16 on compute-9-8.local
Hello from 15 of 16 on compute-9-8.local
Hello from 2 of 16 on compute-9-8.local
Hello from 6 of 16 on compute-9-8.local
[tkaiser@compute-9-8 ~/guide]$exit
logout
qsub: job 1280.ra.mines.edu completed
[tkaiser@ra ~/guide]$
You can also request more than one node. Then cat $PBS_NODEFILE to see which nodes you were given. For example:
[tkaiser@ra ~/guide]$qsub -q INTERACTIVE -I -V -l nodes=2:ppn=8 \
    -W x=NACCESSPOLICY:SINGLEUSER -l walltime=00:15:00
qsub: waiting for job 91420.ra5.local to start
qsub: job 91420.ra5.local ready
[tkaiser@fatcompute-12-2 ~]$cat $PBS_NODEFILE | sort -u
fatcompute-12-1
fatcompute-12-2
[tkaiser@fatcompute-12-2 ~]$exit
The options -W x=NACCESSPOLICY:SINGLEUSER -l walltime=00:15:00 in the above commands ensure that you have sole access to the node for 15 minutes.
The link http://geco.mines.edu/workshop/aug2011/ leads to a number of recent tutorials on HPC and running on RA.
Changing your login shell
The default login shell on Ra is /bin/bash. If you would like to change your shell you can use the command chsh. To see a list of the available shells type chsh --list-shells. To change your shell type chsh. You will be prompted for your password and the path to your new shell.
Common Problems and Questions
To be completed
Changing your MPI Version
There are several versions of MPI available on RA. The default version, OpenMPI 1.4.1, built with the Intel 11.1 compiler, is actually rather old. You can expect better performance using the newer version, OpenMPI 1.6, built with version 12.1 of the Intel compiler.
The command mpi-selector can be used to set your MPI version. Using the mpi-selector command is a simple, but multistep process. First, edit your .bashrc file to remove any reference to old versions of MPI. Then run the command mpi-selector --list. You should see a list similar to the one shown below.
[joeuser@ra5 bin]$ mpi-selector --list
intel_4.0.3_intel_12.1
mvapich2_gnu-1.4.1
mvapich2_intel-1.4.1
mvapich2_pgi-1.4.1
openmpi-1.3.2-gcc-i386
openmpi-1.3.2-gcc-x86_64
openmpi_1.6_intel_12.1
ra5_openmpi_gnu-1.4.1
ra5_openmpi_intel-1.4.1
ra5_openmpi_intel_debug-1.4.1
ra5_openmpi_pgi-1.4.1
ra5_openmpi_pgi_debug-1.4.1
[joeuser@ra5 bin]$
This gives a list of the versions of MPI available. Select one. Unless there is a reason not to do so, the version should be ra5_openmpi_intel-1.4.1 or openmpi_1.6_intel_12.1. To select the new version of MPI, run the command mpi-selector --set openmpi_1.6_intel_12.1 as shown below.
[joeuser@ra5 bin]$ mpi-selector --set openmpi_1.6_intel_12.1
Defaults already exist; overwrite them? (y/N) y
[joeuser@ra5 bin]$
Then log out. The next time you log back in rerun the which mpicc command to check that you now have the MPI environment available. Your parallel programs should be rebuilt using your new version of MPI.
If you select openmpi_1.6_intel_12.1 as your MPI version you will automatically get the Intel version 12 compiler. You can check your MPI and base compiler versions as shown below.
[joeuser@ra5 ~]$ mpicc --showme:version
mpicc: Open MPI 1.6 (Language: C)
[joeuser@ra5 ~]$ icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.4.319 Build 20120410
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
[joeuser@ra5 ~]$ mpif90 --showme:version
mpif90: Open MPI 1.6 (Language: Fortran 90)
[joeuser@ra5 ~]$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.4.319 Build 20120410
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
Programs linked with the old version of MPI will not work with the new version of mpiexec. You can run old programs if you specify the full path to the old mpiexec command, /opt/ra5_openmpi_intel/1.4.1/bin/mpiexec.