A list of useful SLURM commands
source link: https://gist.github.com/TysonRayJones/34ebca7056cadc60c32dd3d138388a14
In addition to Harvard's fantastic list, we list some other convenient SLURM commands.
Delay an enqueued job from running
Useful for letting other enqueued jobs run without having to kill/re-run already running jobs. To delay for 7 days:
scontrol update JobID=<JOB ID> StartTime=now+7days
The NODELIST(REASON) field reported by squeue for the delayed jobs will become (BeginTime).
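To confirm that a delay took effect, it can help to print a job's state, expected start time and reason side by side. The following is a minimal sketch: the helper names show_delay and undelay are made up for illustration, while %i, %T, %S and %R are squeue's format specifiers for job ID, state, start time and reason/nodelist.

```shell
# show_delay is a hypothetical helper name, not part of SLURM.
# Prints job ID, state, expected start time and reason for one job.
show_delay() {
    squeue -j "$1" -o "%.18i %.10T %.20S %R"
}

# undelay is likewise hypothetical: it moves the start time back to "now",
# which should make the job eligible to be scheduled again.
undelay() {
    scontrol update JobID="$1" StartTime=now
}
```

Calling show_delay with the job ID right after the update should report a pending job whose reason is (BeginTime).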
Requeue and immediately delay running jobs
when suspend and hold don't seem to do anything!
You may want to stop running jobs and requeue them further down the queue (i.e. avoid immediately re-running them). This is useful for freeing up nodes to let other jobs run without having to resubmit your running jobs.
To requeue a job and delay it for one day:
(export jobid=<JOB ID>; scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day)
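If you do this often, the one-liner can be wrapped in a small function. This is only a sketch: requeue_delay is a hypothetical name, and the default delay of 1day is an assumption. It differs from the one-liner in one respect, chaining with && so the start time is only updated when the requeue succeeded.

```shell
# requeue_delay is a hypothetical helper name, not part of SLURM.
# Usage: requeue_delay <JOB ID> [DELAY]    (DELAY defaults to 1day)
# The body runs in a subshell ( ... ) so jobid and delay do not leak
# into the calling shell, like the parenthesised one-liner.
requeue_delay() (
    jobid=$1
    delay=${2:-1day}
    # '&&' skips the update when the requeue fails (e.g. a mistyped ID),
    # whereas ';' would run it regardless.
    scontrol requeue "$jobid" &&
        scontrol update JobID="$jobid" StartTime="now+$delay"
)
```

For instance, requeue_delay followed by a job ID and 7days would requeue that job and hold it back for a week.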
If you have many jobs (with unique job-ids), you'll want to type out a list of jobs to requeue and delay using a for loop:
for jobid in <SPACE SEPARATED LIST OF JOB IDS>; do scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day; done
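Typing the list by hand gets tedious, so one option is to read the job IDs from standard input instead. The sketch below assumes one job ID per line; requeue_delay_stdin is a hypothetical name, and the 1day default is an assumption.

```shell
# requeue_delay_stdin is a hypothetical helper name, not part of SLURM.
# Reads one job ID per line from stdin and requeues + delays each of them.
# Usage: <command printing job IDs> | requeue_delay_stdin [DELAY]
requeue_delay_stdin() (
    delay=${1:-1day}
    while read -r jobid; do
        [ -n "$jobid" ] || continue    # skip blank lines
        # </dev/null keeps scontrol from consuming the remaining job IDs.
        scontrol requeue "$jobid" </dev/null &&
            scontrol update JobID="$jobid" StartTime="now+$delay" </dev/null
    done
)
```

The IDs could come from a file, or from squeue itself, e.g. squeue -u "$USER" -t RUNNING -h -o %i piped into requeue_delay_stdin 1day. Since that would touch every one of your running jobs, it is worth inspecting the squeue output on its own first.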
If many of your jobs share a common prefix which you don't want to retype, export it!
(export prefix=<COMMON JOB ID PREFIX>; for suffix in <SPACE SEPARATED LIST OF JOB ID SUFFIXES>; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+1day; done)
For example, of the following job id list...
1234567_10
1234567_11
1234567_12
1234567_13
1234567_14
if you want to requeue + delay jobs 1234567_11 and 1234567_12 for 2 days, you'd call
(export prefix=1234567_1; for suffix in 1 2; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+2days; done)
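When the suffixes form a contiguous range, a counting loop avoids listing them at all. A minimal sketch, assuming job IDs of the form shown above (an ID, an underscore, then an index); requeue_delay_range is a hypothetical name.

```shell
# requeue_delay_range is a hypothetical helper name, not part of SLURM.
# Usage: requeue_delay_range <ID BEFORE UNDERSCORE> <FIRST> <LAST> [DELAY]
requeue_delay_range() (
    array_id=$1
    first=$2
    last=$3
    delay=${4:-1day}
    i=$first
    while [ "$i" -le "$last" ]; do
        # A failed requeue skips that job's update but not the later jobs.
        scontrol requeue "${array_id}_${i}" &&
            scontrol update JobID="${array_id}_${i}" StartTime="now+$delay"
        i=$((i + 1))
    done
)
```

With this, requeue_delay_range 1234567 11 12 2days issues the same scontrol calls as the example above.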
Note that SLURM will often not list the re-queued jobs in squeue, but rest assured, they're still enqueued!
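If a job is missing from squeue and you want reassurance, scontrol show job can be asked about it directly. The sketch below filters its one-line output down to a few fields; job_status is a hypothetical name, and the field names (JobId, JobState, Reason, StartTime) are those printed by scontrol show job.

```shell
# job_status is a hypothetical helper name, not part of SLURM.
# Prints the JobId, JobState, Reason and StartTime fields of one job.
job_status() {
    scontrol -o show job "$1" |
        tr ' ' '\n' |
        grep -E '^(JobId|JobState|Reason|StartTime)='
}
```

A requeued and delayed job would be expected to show JobState=PENDING and Reason=BeginTime, with StartTime set to the delayed time.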
Take care to ensure your jobs have everything they need (e.g. files) when they're eventually re-run.
Keep in mind re-queued jobs may behave differently when re-run. Think carefully e.g. about your random seeding!
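One way to handle seeding inside a batch script is to derive the seed from values SLURM provides. The sketch below is just one option, not a recommendation: attempt_seed is a hypothetical name, and it assumes SLURM_JOB_ID is numeric and that SLURM_RESTART_COUNT is set for requeued jobs, as described in the sbatch documentation.

```shell
# attempt_seed is a hypothetical helper name for use inside a batch script.
# Combines the job ID with the restart count so that each attempt gets its
# own seed, which can be recomputed later from the same two values.
# Assumes fewer than 100 restarts; both variables default to 0 so the
# function also works outside of SLURM.
attempt_seed() {
    job=${SLURM_JOB_ID:-0}
    restarts=${SLURM_RESTART_COUNT:-0}
    echo $(( job * 100 + restarts ))
}
```

Drop the restart term if a re-run should reproduce the first attempt exactly rather than draw fresh random numbers.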