Slurm
Commands
- salloc: obtain job allocations
- sbatch: submit a batch script for execution
- srun: obtain a job allocation and execute an application
- squeue: view information about jobs
- sinfo: view information about nodes and partitions
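A minimal end-to-end sketch using these commands; the resource values are placeholders to adapt to your cluster:
salloc --ntasks=1 --time=00:30:00   # request an allocation (opens a shell once granted)
srun hostname                       # run a command inside the allocation
exit                                # release the allocation
squeue --user=$USER                 # list your pending/running jobs
sinfo                               # list partitions and node states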
Run simple command
srun -n1 -l /bin/hostname
Notes:
- -n: number of tasks
- this runs on the default partition (see the example below for selecting a partition)
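For example, assuming a partition named debug exists, the same command with two labelled tasks on that partition:
srun -n2 -l --partition=debug /bin/hostname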
## Global information
sinfo
See also for a specific node:
scontrol show node nodename
Note:
- replace nodename with the actual node name
- you can also show information concerning a job (if running) or a partition, as shown below
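For example (jobid and partitionname are placeholders):
scontrol show job jobid
scontrol show partition partitionname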
Update the state of a node
scontrol update NodeName=nodename State=RESUME
Fix an unknown state (or an unexpected reboot)
scontrol update NodeName=nodename State=DOWN Reason='undraining'
scontrol update NodeName=nodename State=RESUME
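A related sketch for planned work on a node: DRAIN lets running jobs finish while preventing new ones from being scheduled there, and RESUME puts the node back in service (nodename is a placeholder):
scontrol update NodeName=nodename State=DRAIN Reason='maintenance'
scontrol update NodeName=nodename State=RESUME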
History of jobs
sacct --format=User,JobIDRaw,JobID,Jobname%30,state,elapsed,start,end,nodelist,partition,timelimit
Notes:
- filter by jobname with --name=jobname
- filter by starting date with --starttime 2020-07-25
- filter by state with --state=failed
- filters can be combined, as shown below
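A combined sketch (jobname and the date are placeholders):
sacct --name=jobname --starttime 2020-07-25 --state=failed --format=User,JobID,Jobname%30,state,elapsed,nodelist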
Jobnames only
sacct -X --starttime 2010-01-01 --format=Jobname%100 | uniq
Maintenance on node/partition
Set the state of the partition (or node) as down:
scontrol update Partition=debug State=down
Suspend all jobs that are running:
scontrol suspend jobid
Once maintenance is done, resume all jobs that were suspended, then set the partition (or nodes) back to up, as in the sketch below.
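A sketch of the whole sequence, reusing the debug partition from the example above; the commented line stands for the actual maintenance work:
scontrol update Partition=debug State=down
squeue --noheader --partition=debug --state=RUNNING --format="%A" | xargs -n 1 scontrol suspend
# ... perform the maintenance ...
squeue --noheader --partition=debug --state=SUSPENDED --format="%A" | xargs -n 1 scontrol resume
scontrol update Partition=debug State=up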
## sbatch example with pyxis
#!/bin/bash
#
#SBATCH --job-name=test
#
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=100
#
#SBATCH --distribution=cyclic:block
#
#SBATCH --output="/mnt/share/result_%A_%a.log"
#SBATCH --error="/mnt/share/result_%A_%a.log"
#SBATCH --array=[1-300]
#
#SBATCH --export=ENROOT_CONFIG_PATH=/etc/enroot
hostname
## --- pyxis required here (check also correct enroot export if auth is required) ---
srun --cpu-bind=cores --container-image="user@domain.tld#hpc/marymorstan-experiment:v1.0.0" --container-mount-home ls /
## -- end of pyxis usage ---
result="$?"
log_file="result_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.log"
if [ "${result}" != "0" ]; then
echo "TROUBLE SRUN" >>${log_file}
else
echo "SRUN OK" >>${log_file}
fi
# share_dir was not defined in the original snippet; /mnt/share (the SBATCH output path above) is assumed here
share_dir="${share_dir:-/mnt/share}"
if [ -f "${log_file}" ]; then
    mv "${log_file}" "${share_dir}"
fi
if [ "${result}" != "0" ]; then
exit ${result}
fi
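To submit and follow the script, assuming it is saved as pyxis_test.sbatch (a placeholder name):
sbatch pyxis_test.sbatch
squeue --user=$USER --format='%i %.50j %.15T %.15L'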
Note: you can use #SBATCH --requeue so that a failed job can be restarted (see below).
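For example, either in the script header or manually on an already submitted batch job (jobid is a placeholder):
#SBATCH --requeue
scontrol requeue jobid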
Tricks
Global view of the cluster
sinfo --Node --long
Cancel all jobs
squeue --noheader --format="%A" | xargs -n 1 scancel
Suspend all running jobs
squeue --format="%A" --state RUNNING | xargs scontrol suspend
Overall status of all jobs from sbatch
sbatch ... >>jobs.txt
sacct -X --format=User,JobIDRaw,JobID,Jobname%30,state,timelimit,start,end,elapsed,nodelist --job=$(cat jobs.txt | awk '{print $4}' | tr ' \n' ',')
Show unique jobname in the queue
squeue --format=%j | sort | uniq
Priority per job in the queue
squeue --format='%j %.15p' | sort | uniq
Squeue essentials
squeue --format='%i %.50j %.15T %.15L %.15p'
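In the format string, %i is the job id, %j the job name, %T the state, %L the time left and %p the priority; restricting to one user is often useful (username is a placeholder):
squeue --user=username --format='%i %.50j %.15T %.15L %.15p'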
Interactive use
srun ... --pty bash
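A fuller sketch; the resources, time limit and partition name are assumptions to adapt:
srun --ntasks=1 --cpus-per-task=4 --mem=4G --time=01:00:00 --partition=debug --pty bash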
Count CPUs used for a partition
sinfo --Node --partition=prod | tail -n +2 | awk '{print $1}' | xargs -I {} scontrol show node {} | grep Alloc= | cut -d"=" -f2 | cut -d" " -f1 | paste -sd+ | bc
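A lighter alternative is sinfo's %C format, which reports CPUs as allocated/idle/other/total for the partition:
sinfo --partition=prod --format="%C"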