Reservations are no longer required on Mio to evict other users from your nodes. In the past, a group would set a reservation on its nodes, which purged jobs belonging to users outside the group. Now you only need to submit the job to your group's partition. See Selecting Nodes on Mio and Running only on nodes you own below.
There are two ways to manually select the nodes on which to run: list them on the command line, or select a partition. The partition method is discussed in the next section.
Below is the section of the srun man page that describes how to specify a list of nodes on which to run:
Request a specific list of hosts. The job will contain at least these hosts. The list may be specified as a comma-separated list of hosts, a range of hosts (compute[1-5,7,...] for example), or a filename. The host list will be assumed to be a filename if it contains a "/" character. If you specify a max node count (-N1-2) and there are more than 2 hosts in the file, only the first 2 nodes will be used in the request list. Rather than repeating a host name multiple times, an asterisk and a repetition count may be appended to a host name. For example, "compute1,compute1" and "compute1*2" are equivalent.
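The filename form can be sketched as follows. This is a minimal illustration, assuming the file holds one host name per line; the temporary path is generated with mktemp and is not from the original page. The command is only printed, not submitted, so it can be inspected first.

```shell
#!/bin/bash
# Sketch: build a host file and the matching sbatch command.
# The temporary directory and file name are illustrative assumptions.
hostdir=$(mktemp -d)
hostfile="$hostdir/hosts.txt"   # contains "/", so Slurm treats it as a file name

# Assumed file layout: one host name per line.
printf '%s\n' compute001 compute002 compute003 > "$hostfile"

# Print the command for inspection rather than submitting it.
cmd="sbatch --nodelist=$hostfile myscript"
echo "$cmd"
```

A relative path such as ./hosts.txt would also qualify, since it contains a "/" character.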
Example: running the script myscript on compute001, compute002, and compute003...
[joeuser@mio001 ~]$ sbatch --nodelist=compute[001-003] myscript
Example: running the "hello world" program /opt/utility/phostname interactively on compute001, compute002, and compute003...
[joeuser@mio001 ~]$ srun --nodelist=compute[001-003] --tasks-per-node=4 /opt/utility/phostname
compute001
compute001
compute001
compute001
compute002
compute002
compute002
compute002
compute003
compute003
compute003
compute003
[joeuser@mio001 color]$
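The same node list can also be placed inside the batch script as an #SBATCH directive instead of on the sbatch command line. A minimal sketch, assuming the script runs the phostname utility from the example above; the job name is hypothetical:

```shell
#!/bin/bash
#SBATCH --job-name=nodelist_demo      # hypothetical job name
#SBATCH --nodelist=compute[001-003]   # the job will include at least these hosts
#SBATCH --tasks-per-node=4            # four tasks on each node

# srun runs inside the allocation, so it starts 3 x 4 = 12 tasks.
srun /opt/utility/phostname
```

With the directive in place, the script is submitted with a plain sbatch myscript.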
There are several generations of nodes on Mio, each with different "features." You can see the features by running the command:
[joeuser@mio001 ~]$ /opt/utility/slurmnodes -fAvailableFeatures
compute000 Features core8,nehalem,mthca,ddr
compute001 Features core8,nehalem,mthca,ddr
...
compute032 Features core12,westmere,mthca,ddr
compute033 Features core12,westmere,mthca,ddr
...
compute157 Features core24,haswell,mlx4,fdr
...
...
Features can be used to select subsets of nodes. For example, to run on nodes with 24 cores, add the option --constraint=core24 to your sbatch command line or script.
[joeuser@mio001 ~]$ sbatch --constraint=core24 simple_slurm
Submitted batch job 1289851
[joeuser@mio001 ~]$
Checking the queue shows where the job landed:
[joeuser@mio001 ~]$ squeue -u joeuser
JOBID   PARTITION NAME   USER    ST TIME NODES NODELIST(REASON)
1289851 compute   hybrid joeuser R  0:01 2     compute[157-158]
[joeuser@mio001 ~]$
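The same constraint can go in the batch script itself. A minimal sketch, assuming the feature names listed above; the node count and program name are hypothetical. Combining features with & (logical AND) is standard Slurm syntax, but it does not appear on the original page, so it is shown only as a commented alternative.

```shell
#!/bin/bash
#SBATCH --nodes=2                   # illustrative node count
#SBATCH --constraint=core24         # only nodes advertising the core24 feature
# Alternative (assumption): require two features at once.
##SBATCH --constraint="core24&fdr"

# my_program is a placeholder for your own executable.
srun ./my_program
```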
Every normal compute node on Mio (the exceptions are the GPU and PHI nodes) belongs to two partitions, or groupings: the compute partition and a partition assigned to a research group. That is, each research group has its own partition containing the nodes it owns. The GPU and PHI nodes are in their own partitions to prevent people from accidentally running on them.
You can see the partitions that you are allowed to use (compute, phi, gpu, and your group's partitions) by running the command sinfo. sinfo -N lists them in a node-oriented format, one line per node and partition. sinfo -a will show all partitions. sinfo -a --format="%P %N" shows a compact list of all partitions and nodes.
Add the option -p partition_name to your sbatch command to run in the named partition. The default partition is compute, which contains all of the normal nodes, so by default your job can end up on any of them. Specifying your group's partition restricts your job to "your" nodes.
You can also add the line
#SBATCH -p partition_name
to your batch script to run in the named partition.
Also, starting a job in your group's partition will purge any jobs on your nodes that were started under the default partition. Thus, it is not necessary to create a reservation to gain access to your nodes. If you do not run in your group's partition, your jobs may be deleted by the group that owns the nodes they land on.
There is a shortcut command that will show you the partitions in which you can run, /opt/utility/partitions. For example:
[joeuser@mio001 utility]$ /opt/utility/partitions
Partitions and their nodes available to joeuser
compute   compute[000-003,008-013,016-033,035-041,043-047,049-052,054-081,083-193]
phi       phi[001-002]
gpu       gpu[001-003]
joesgroup compute[056-061,160-167]
[joeuser@mio001 utility]$
We see that joeuser can run in the compute, phi, gpu, and joesgroup partitions. The partitions compute, phi, and gpu are available to everyone. Joe's group "owns" compute[056-061,160-167], and running in the joesgroup partition allows his jobs to preempt jobs that were started on those nodes under the default partition.
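For the example output above, a batch script that stays on the nodes owned by Joe's group might look like the following sketch. The partition name joesgroup comes from that output; the job name and node count are illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --job-name=group_demo   # hypothetical job name
#SBATCH -p joesgroup            # group partition: compute[056-061,160-167]
#SBATCH --nodes=2               # illustrative node count

# Print the host name of each task to confirm where the job ran.
srun /opt/utility/phostname
```

After submission, squeue -u joeuser should list joesgroup in the PARTITION column.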
Running threaded jobs and/or running with fewer than N MPI tasks per node
Slurm will try to pack as many tasks onto a node as it can so that there is at least one task or thread per core. So if you are running fewer than N MPI tasks per node, where N is the number of cores, Slurm may put additional jobs on your node.
You can prevent this by setting values for the flags --tasks-per-node and --cpus-per-task on your sbatch command line or in your Slurm script. The value of --tasks-per-node times --cpus-per-task should equal the number of cores on the node. For example, if you are running on two 16-core nodes and want 8 MPI tasks, you might say
--nodes=2 --tasks-per-node=4 --cpus-per-task=4
where 2*4*4=32, the total number of cores on the two nodes.
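The arithmetic can be sketched as a small shell helper. fill_node_flags is a hypothetical function written for this illustration; it is not a Slurm or Mio command. It assumes the MPI tasks divide evenly among the nodes and the cores divide evenly among the tasks on each node.

```shell
#!/bin/bash
# Sketch: derive flags that account for every core on each node.
# Usage: fill_node_flags NODES CORES_PER_NODE TOTAL_MPI_TASKS
fill_node_flags() {
    local nodes=$1 cores=$2 tasks=$3
    # Tasks must divide evenly among a positive number of nodes.
    if (( nodes <= 0 || tasks % nodes != 0 )); then
        echo "tasks must be a multiple of nodes" >&2
        return 1
    fi
    local per_node=$(( tasks / nodes ))
    # Cores must divide evenly among the tasks on a node.
    if (( per_node == 0 || cores % per_node != 0 )); then
        echo "cores per node must be a multiple of tasks per node" >&2
        return 1
    fi
    echo "--nodes=$nodes --tasks-per-node=$per_node --cpus-per-task=$(( cores / per_node ))"
}

# Two 16-core nodes, 8 MPI tasks in total.
fill_node_flags 2 16 8
```

For two 16-core nodes and 8 tasks this prints --nodes=2 --tasks-per-node=4 --cpus-per-task=4, the same flags as in the example.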
You can also prevent additional jobs from running on your nodes by using the --exclusive flag.
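Putting these options together, a batch script for a hybrid MPI/threaded job might look like the following sketch. It assumes an OpenMP program, which the original page does not specify, and my_hybrid_program is a placeholder. SLURM_CPUS_PER_TASK is an environment variable that Slurm sets when --cpus-per-task is given.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --exclusive            # keep other jobs off the allocated nodes

# One thread per core reserved for each task (assumes an OpenMP program).
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Pass the per-task core count to srun explicitly; my_hybrid_program
# is a placeholder for your own executable.
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" ./my_hybrid_program
```

Using both --exclusive and the core-filling flags is redundant but harmless; either one alone should be enough to keep other jobs off the nodes.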