Dependency and error checking in slurm

People use the --dependency option to chain jobs in Slurm. The start of one job can be held until a particular job completes. This can be done so as not to "hog" resources, or because the output of one job is needed as input for the next.

The first portion of this document describes Slurm dependencies in general. We then look at how job errors can be dealt with in Slurm jobs, in particular in relation to dependencies. Finally, we look at an example where we automate the dependency setup for a number of jobs.

Here is the portion of the sbatch man page that describes the --dependency option.


-d, --dependency=<dependency_list>
	  Defer the start of this job until the specified  dependencies  have  been  
	  satisfied  completed.   <dependency_list>  is  of  the  form
	  <type:job_id[:job_id][,type:job_id[:job_id]]>.   
	  Many  jobs  can  share the same dependency and these jobs may even belong 
	  to different users. The  value may be changed after job submission using 
	  the scontrol command.

	  after:job_id[:jobid...]
			 This job can begin execution after the specified jobs have begun 
			 execution.

	  afterany:job_id[:jobid...]
			 This job can begin execution after the specified jobs have terminated.

	  afternotok:job_id[:jobid...]
			 This job can begin execution after the specified jobs have terminated
			 in some failed state (non-zero exit  code,  node  failure, timed out, etc).

	  afterok:job_id[:jobid...]
			 This  job can begin execution after the specified jobs have successfully 
			 executed (ran to completion with an exit code of zero).

	  expand:job_id
			 Resources allocated to this job should be used to expand the specified 
			 job.  The job to expand must share the same QOS  (Quality of Service) 
			 and partition.  Gang scheduling of resources in the partition is also 
			 not supported.

	  singleton
			 This job can begin execution after any previously launched jobs sharing 
			 the same job name and user have terminated.
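
As the man page notes, a dependency can also be changed after a job has been submitted, using scontrol. A minimal sketch (the job ids are placeholders): here pending job 1234 is made to wait until job 1233 completes successfully.

scontrol update JobId=1234 Dependency=afterok:1233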

Errors in Slurm and dependencies

There have been some questions about the suboption afterok:job_id[:jobid...].

  1. How does slurm track errors?
  2. What is an error?
  3. How do you get a script to return an error?
  4. What happens to a job that depends, via the afterok suboption, on a job that fails?

The page http://slurm.schedmd.com/job_exit_code.html has good information about #1. It states:

     A job's exit code (aka exit status, return code and completion code) is 
     captured by Slurm and saved as part of the job record. For sbatch jobs, 
     the exit code that is captured is the output of the batch script. For 
     salloc jobs, the exit code will be the return value of the exit call 
     that terminates the salloc session. For srun, the exit code will be the 
     return value of the command that srun executes.

The sentence "For sbatch jobs, the exit code that is captured is the output of the batch script." is important. It means that if a command within a batch script fails but the script itself runs to completion, the error state will not be set, and thus the "afterok" dependency will not stop the next job from running.
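
If you want to see what exit code Slurm actually recorded for a job, sacct can report it. A minimal sketch (the job id is a placeholder):

sacct -j <jobid> --format=JobID,JobName,State,ExitCode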

Consider the following script "nocheck" that does three important things.

  1. It first sleeps for 30 seconds. This gives us time to start the dependent jobs.
  2. It does an ls of a nonexistent file. This command returns a non-zero exit code.
  3. It runs the "hello world" program phostname.

#!/bin/bash
#SBATCH --job-name="atest"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:02:00
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
#SBATCH --export=ALL

#----------------------
cd $SLURM_SUBMIT_DIR
date
srun -n 8 sleep 30
date

ls this_file_does_not_exist


srun -n 8 /opt/utility/phostname -F

Now we do the following: we submit this script, then submit it two more times, once using the afterany option and once using afterok.

[joeuser@aun002 bins]$ sbatch -p debug nocheck
Submitted batch job 36805
[joeuser@aun002 bins]$ sbatch --dependency=afterany:36805 -p debug nocheck 
Submitted batch job 36806
[joeuser@aun002 bins]$ sbatch --dependency=afterok:36805 -p debug nocheck 
Submitted batch job 36807

We can see what jobs are in the queue and the dependencies:

[joeuser@aun002 bins]$ squeue -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             36806     debug    atest  joeuser PD       0:00      1 (Dependency)
             36807     debug    atest  joeuser PD       0:00      1 (Dependency)
             36805     debug    atest  joeuser  R       0:22      1 node001

After about a minute we see the output produced:

[joeuser@aun002 bins]$ ls -lt *36806* *36807* *36805*
-rw-rw-r-- 1 joeuser joeuser 616 Mar 11 12:02 stdout.36807
-rw-rw-r-- 1 joeuser joeuser 616 Mar 11 12:02 stdout.36806
-rw-rw-r-- 1 joeuser joeuser  70 Mar 11 12:02 stderr.36807
-rw-rw-r-- 1 joeuser joeuser  70 Mar 11 12:02 stderr.36806
-rw-rw-r-- 1 joeuser joeuser 616 Mar 11 12:01 stdout.36805
-rw-rw-r-- 1 joeuser joeuser  70 Mar 11 12:01 stderr.36805

Below are the error output from job 36805 and the standard output from job 36807. The "ls" command in job 36805 produced an error, but job 36807, which was submitted with the afterok option, still ran. That is, as far as Slurm is concerned job 36805 did not produce an error, so job 36807 was allowed to run. Slurm did not see the ls failure because the script continued past it and ran to completion with a zero exit code.

[joeuser@aun002 bins]$ cat stderr.36805
ls: cannot access this_file_does_not_exist: No such file or directory
[joeuser@aun002 bins]$ 

[joeuser@aun002 bins]$ cat stdout.36807
Wed Mar 11 12:01:59 MDT 2015
Wed Mar 11 12:02:30 MDT 2015
0001      0000               node001        0000         0001
0004      0000               node001        0000         0004
0002      0000               node001        0000         0002
0003      0000               node001        0000         0003
0006      0000               node001        0000         0006
0005      0000               node001        0000         0005
0007      0000               node001        0000         0007
task    thread             node name  first task    # on node
0000      0000               node001        0000         0000
[joeuser@aun002 bins]$ 

This behavior, with job 36807 running despite the error, might not be what we want. We might want errors within the initial script to prevent the dependent job from running.

There are a few options.

You can add the line:

set -e 

to your script. This option causes the script to exit as soon as any command fails, and the script's exit code is then that of the offending command.

The above script gets the addition:

...
#----------------------
set -e
cd $SLURM_SUBMIT_DIR
...
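
One caveat, in case it applies to your scripts: set -e by itself does not catch a failure that happens in the middle of a pipeline, because only the status of the last command in the pipeline is checked. Adding set -o pipefail covers that case as well:

set -e
set -o pipefail    # a pipeline now fails if any command in it fails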

We do the runs again:

[joeuser@aun002 bins]$ sbatch -p debug nocheck
Submitted batch job 36816
[joeuser@aun002 bins]$ sbatch --dependency=afterany:36816 -p debug nocheck
Submitted batch job 36817
[joeuser@aun002 bins]$ sbatch --dependency=afterok:36816 -p debug nocheck
Submitted batch job 36818
[joeuser@aun002 bins]$ sq
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             36817     debug    atest  joeuser PD       0:00      1 (Dependency)
             36818     debug    atest  joeuser PD       0:00      1 (Dependency)
             36816     debug    atest  joeuser  R       0:28      1 node001

We get the output:

[joeuser@aun002 bins]$ ls -lt *36816 *36817 *36818
ls: cannot access *36818: No such file or directory
-rw-rw-r-- 1 joeuser joeuser 70 Mar 11 13:01 stderr.36817
-rw-rw-r-- 1 joeuser joeuser 58 Mar 11 13:01 stdout.36817
-rw-rw-r-- 1 joeuser joeuser 70 Mar 11 13:01 stderr.36816
-rw-rw-r-- 1 joeuser joeuser 58 Mar 11 13:01 stdout.36816
[joeuser@aun002 bins]$ 

There are two things to note about this output. First, the standard output is much shorter.

[joeuser@aun002 bins]$ cat stdout.36816
Wed Mar 11 13:00:52 MDT 2015
Wed Mar 11 13:01:22 MDT 2015
[joeuser@aun002 bins]$ 

This is because phostname was never run; the script exited as soon as the ls command failed.

The second thing to note is that there is no output from job 36818. That job never ran because its afterok dependency required job 36816 to complete successfully.

We can see its status with the command:

[joeuser@aun002 bins]$ sacct -j 36818
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
36818             atest      debug       test          0  CANCELLED      0:0 
[joeuser@aun002 bins]$ 

Another option is to check the exit status of individual commands within the script manually. The exit status of the most recent command is held in the variable $?. We can test it and force the script to exit if it is non-zero. For example, we can add the line

if [ $? -ne 0 ] ; then exit 1234 ; fi

after a command we want to check as shown below:

[joeuser@aun002 bins]$ cat check
#!/bin/bash
#SBATCH --job-name="atest"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:02:00
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
#SBATCH --export=ALL

#----------------------
cd $SLURM_SUBMIT_DIR
date
srun -n 8 sleep 30
date

ls this_file_does_not_exist
if [ $? -ne 0 ] ; then exit 1234 ; fi

srun -n 8 /opt/utility/phostname -F

Running this script as before, we get essentially the same result: the job for which we specified afterok does not run.

[joeuser@aun002 bins]$ sbatch -p debug check
Submitted batch job 36822
[joeuser@aun002 bins]$ sbatch --dependency=afterok:36822 -p debug check
Submitted batch job 36823
[joeuser@aun002 bins]$ sbatch --dependency=afterany:36822 -p debug check
Submitted batch job 36824
[joeuser@aun002 bins]$ squeue -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             36823     debug    atest  joeuser PD       0:00      1 (Dependency)
             36824     debug    atest  joeuser PD       0:00      1 (Dependency)
             36822     debug    atest  joeuser  R       0:20      1 node001
[joeuser@aun002 bins]$ squeue -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[joeuser@aun002 bins]$ ls -lt *36822 *36823 *36824
ls: cannot access *36823: No such file or directory
-rw-rw-r-- 1 joeuser joeuser 250 Mar 11 13:22 stderr.36824
-rw-rw-r-- 1 joeuser joeuser  58 Mar 11 13:22 stdout.36824
-rw-rw-r-- 1 joeuser joeuser 250 Mar 11 13:21 stderr.36822
-rw-rw-r-- 1 joeuser joeuser  58 Mar 11 13:21 stdout.36822
[joeuser@aun002 bins]$ sacct -j 36823
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
36823             atest      debug       test          0  CANCELLED      0:0 
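
If there are many commands to check, the manual test can be wrapped in a small shell function. This is only a sketch; the function name run_or_die is our own invention, not part of Slurm:

# Run a command; if it fails, report the failure and exit with its status.
run_or_die () {
  "$@"
  local status=$?
  if [ $status -ne 0 ] ; then
    echo "command failed with status $status: $*" >&2
    exit $status
  fi
}

run_or_die ls this_file_does_not_exist
run_or_die srun -n 8 /opt/utility/phostname -F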

Finally, we note that there is an option to the srun command that automatically performs this check and terminates the job if any task exits with an error. This could be used instead of the manual if [ $? -ne 0 ] ; then exit 1234 ; fi.

The option is:


   -K, --kill-on-bad-exit[=0|1]
		  Controls whether or not to terminate a job if any task exits with a 
		  non-zero exit code. 
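
On an srun line in a batch script this might look as follows (a sketch; phostname is the same test program used above):

srun --kill-on-bad-exit=1 -n 8 /opt/utility/phostname -F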

Automating dependency settings for a collection of jobs

In this example we run /opt/utility/phostname, which is a glorified "hello world" program.

Slurm has a dependency option that you can use to prevent a job from starting until the previous one finishes.

The tricky part is to automate it. Here we look at two scripts, chain3 and old_new.

Both scripts take advantage of another feature of Slurm: a batch job submitted with --export=ALL inherits environment variable settings from your login session. We use the environment variable OLD_DIR for the directory of the previous run and NEW_DIR for the directory of the current run.

The script chain3 sets values for OLD_DIR and NEW_DIR and then submits the Slurm script old_new. It first submits the initial run of old_new with OLD_DIR unset and NEW_DIR=job1.

old_new checks to see if NEW_DIR is set. If not, it uses a "default" of the SLURM_JOBID. It then creates the directory, checks to see if OLD_DIR is set, and if so copies the files from OLD_DIR to NEW_DIR.

Next, chain3 goes into a loop and submits four more jobs, with OLD_DIR set to the previous NEW_DIR and NEW_DIR set to job[2-5].

Note the extra "stuff" on the sbatch lines in chain3. For the first job the extra stuff simply captures the job id that sbatch normally prints when a job is submitted.

For jobs[2-5] we also add


--dependency=afterok:$jid

which states that this job should not start until the previous job has finished successfully.
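
As an aside, sbatch also has a --parsable option that prints just the job id, which would remove the need for the awk step. A sketch, assuming your Slurm version supports it:

jid=$(sbatch --parsable --dependency=afterok:$jid old_new)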

*********** chain3 ***********

#!/bin/bash
# Submit the first job with OLD_DIR unset and capture its job id.
unset OLD_DIR
export NEW_DIR=job1
jid=`sbatch old_new | awk '{print $NF }'`
echo $jid

# Submit the remaining jobs, each one depending on the previous job
# finishing successfully.
for job in job2 job3 job4 job5 ; do
  export OLD_DIR=$NEW_DIR
  export NEW_DIR=$job
  jid=`sbatch --dependency=afterok:$jid old_new | awk '{print $NF }'`
  echo $jid
done

*********** old_new ***********

#!/bin/bash
#SBATCH --job-name="atest"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --exclusive
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
#SBATCH --export=ALL

#----------------------
cd $SLURM_SUBMIT_DIR

# Make a directory for this run and go there.
# If NEW_DIR is defined then we use that for
# our directory name or we set it to SLURM_JOBID

if [ -z "$NEW_DIR" ]  ; then
  export NEW_DIR=$SLURM_JOBID
fi
mkdir $NEW_DIR

# If we have OLD_DIR defined then we copy old to new
if [ -n "$OLD_DIR" ]  ; then
  cp $OLD_DIR/* $NEW_DIR
fi

cd $NEW_DIR

# Here we just run the hello world program "phostname" and sleep 10 seconds
srun -n 8 /opt/utility/phostname -F >$SLURM_JOBID.out
sleep 10

*********** Typical output from running these scripts ***********

Note the dependencies at the beginning and how they drop away as jobs finish. The output from previous runs accumulates in each directory.

[joeuser@aun001 chain]$ ls
chain3  old_new
[joeuser@aun001 chain]$ ./chain3 
43677
43678
43679
43680
43681
[joeuser@aun001 chain]$ sq -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             43678       aun    atest  joeuser PD       0:00      1 (Dependency)
             43679       aun    atest  joeuser PD       0:00      1 (Dependency)
             43680       aun    atest  joeuser PD       0:00      1 (Dependency)
             43681       aun    atest  joeuser PD       0:00      1 (Dependency)
             43677       aun    atest  joeuser  R       0:05      1 node068
[joeuser@aun001 chain]$ sq -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             43679       aun    atest  joeuser PD       0:00      1 (Dependency)
             43680       aun    atest  joeuser PD       0:00      1 (Dependency)
             43681       aun    atest  joeuser PD       0:00      1 (Dependency)
             43678       aun    atest  joeuser  R       0:05      1 node068
[joeuser@aun001 chain]$ sq -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             43681       aun    atest  joeuser PD       0:00      1 (Dependency)
             43680       aun    atest  joeuser  R       0:03      1 node068
[joeuser@aun001 chain]$ sq -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             43681       aun    atest  joeuser  R       0:08      1 node068
[joeuser@aun001 chain]$ sq -u joeuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[joeuser@aun001 chain]$ ls
chain3  job2  job4  old_new       stderr.43678  stderr.43680  stdout.43677  stdout.43679  stdout.43681
job1    job3  job5  stderr.43677  stderr.43679  stderr.43681  stdout.43678  stdout.43680
[joeuser@aun001 chain]$ ls job*
job1:
43677.out

job2:
43677.out  43678.out

job3:
43677.out  43678.out  43679.out

job4:
43677.out  43678.out  43679.out  43680.out

job5:
43677.out  43678.out  43679.out  43680.out  43681.out
[joeuser@aun001 chain]$