Slurm
Commands
- salloc: obtain job allocations
- sbatch: submit a batch script for execution
- srun: obtain a job allocation and execute an application
- squeue: view information about jobs
- sinfo: view information about nodes and partitions
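A minimal end-to-end sketch using these commands; the resource values are placeholders to adapt to your cluster:
salloc --ntasks=1 --time=00:30:00   # request an allocation (opens a shell once granted)
srun hostname                       # run a command inside the allocation
exit                                # release the allocation
squeue --user=$USER                 # list your pending/running jobs
sinfo                               # list partitions and node states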
Run simple command
srun -n1 -l /bin/hostname
Notes:
- -n: number of tasks
- this runs on the default partition (see the example below for selecting a partition)
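For example, assuming a partition named debug exists, the same command with two labelled tasks on that partition:
srun -n2 -l --partition=debug /bin/hostname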
## Global information
sinfo
See also for a specific node:
scontrol show node nodename
Note:
- replace nodename with the actual node name
- you can also show information concerning a job (if running) or a partition, as shown below
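For example (jobid and partitionname are placeholders):
scontrol show job jobid
scontrol show partition partitionname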
Update the state of a node
scontrol update NodeName=nodename State=RESUME
Fix an unknown state (or an unexpected reboot)
scontrol update NodeName=nodename State=DOWN Reason='undraining'
scontrol update NodeName=nodename State=RESUME
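A related sketch for planned work on a node: DRAIN lets running jobs finish while preventing new ones from being scheduled there, and RESUME puts the node back in service (nodename is a placeholder):
scontrol update NodeName=nodename State=DRAIN Reason='maintenance'
scontrol update NodeName=nodename State=RESUME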
History of jobs
sacct --format=User,JobIDRaw,JobID,Jobname%30,state,elapsed,start,end,nodelist,partition,timelimit
Notes:
- filter by jobname with --name=jobname
- filter by starting date with --starttime 2020-07-25
- filter by state with --state=failed
- filters can be combined, as shown below
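A combined sketch (jobname and the date are placeholders):
sacct --name=jobname --starttime 2020-07-25 --state=failed --format=User,JobID,Jobname%30,state,elapsed,nodelist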
Jobnames only
sacct -X --starttime 2010-01-01 --format=Jobname%100 | uniq
Maintenance on node/partition
Set the state of the partition (or node) as down:
scontrol update Partition=debug State=down
Suspend all jobs that are running:
scontrol suspend jobid
Once maintenance is done, resume all jobs that were suspended, then set the partition (or nodes) back to up, as in the sketch below.
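A sketch of the whole sequence, reusing the debug partition from the example above; the commented line stands for the actual maintenance work:
scontrol update Partition=debug State=down
squeue --noheader --partition=debug --state=RUNNING --format="%A" | xargs -n 1 scontrol suspend
# ... perform the maintenance ...
squeue --noheader --partition=debug --state=SUSPENDED --format="%A" | xargs -n 1 scontrol resume
scontrol update Partition=debug State=up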
## sbatch example with pyxis
#!/bin/bash
#
#SBATCH --job-name=test
#
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=100
#
#SBATCH --distribution=cyclic:block
#
#SBATCH --output="/mnt/share/result_%A_%a.log"
#SBATCH --error="/mnt/share/result_%A_%a.log"
#SBATCH --array=[1-300]
#
#SBATCH --export=ENROOT_CONFIG_PATH=/etc/enroot
hostname
## --- pyxis required here (check also correct enroot export if auth is required) ---
srun --cpu-bind=cores --container-image="user@domain.tld#hpc/marymorstan-experiment:v1.0.0" --container-mount-home ls /
## -- end of pyxis usage ---
result="$?"
log_file="result_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.log"
if [ "${result}" != "0" ]; then
echo "TROUBLE SRUN" >>${log_file}
else
echo "SRUN OK" >>${log_file}
fi
# share_dir was not defined in the original snippet; /mnt/share (the SBATCH output path above) is assumed here
share_dir="${share_dir:-/mnt/share}"
if [ -f "${log_file}" ]; then
    mv "${log_file}" "${share_dir}"
fi
if [ "${result}" != "0" ]; then
exit ${result}
fi
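To submit and follow the script, assuming it is saved as pyxis_test.sbatch (a placeholder name):
sbatch pyxis_test.sbatch
squeue --user=$USER --format='%i %.50j %.15T %.15L'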
Note: you can use #SBATCH --requeue so that a failed job can be restarted (see below).
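For example, either in the script header or manually on an already submitted batch job (jobid is a placeholder):
#SBATCH --requeue
scontrol requeue jobid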
Tricks
Global view of the cluster
sinfo --Node --long
Cancel all jobs
squeue --noheader --format="%A" | xargs -n 1 scancel
Suspend all running jobs
squeue --format="%A" --state RUNNING | xargs scontrol suspend
Overall status of all jobs from sbatch
sbatch ... >>jobs.txt
sacct -X --format=User,JobIDRaw,JobID,Jobname%30,state,timelimit,start,end,elapsed,nodelist --job=$(cat jobs.txt | awk '{print $4}' | tr ' \n' ',')
Show unique jobname in the queue
squeue --format=%j | sort | uniq
Priority per job in the queue
squeue --format='%j %.15p' | sort | uniq
Squeue essentials
squeue --format='%i %.50j %.15T %.15L %.15p'
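In the format string, %i is the job id, %j the job name, %T the state, %L the time left and %p the priority; restricting to one user is often useful (username is a placeholder):
squeue --user=username --format='%i %.50j %.15T %.15L %.15p'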
Interactive use
srun ... --pty bash
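A fuller sketch; the resources, time limit and partition name are assumptions to adapt:
srun --ntasks=1 --cpus-per-task=4 --mem=4G --time=01:00:00 --partition=debug --pty bash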
Count CPUs used for a partition
sinfo --Node --partition=prod | tail -n +2 | awk '{print $1}' | xargs -I {} scontrol show node {} | grep Alloc= | cut -d"=" -f2 | cut -d" " -f1 | paste -sd+ | bc
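A lighter alternative is sinfo's %C format, which reports CPUs as allocated/idle/other/total for the partition:
sinfo --partition=prod --format="%C"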