
A list of useful SLURM commands

 2 years ago
source link: https://gist.github.com/TysonRayJones/34ebca7056cadc60c32dd3d138388a14

In addition to Harvard's fantastic list, we list some other convenient SLURM commands.


Delay an enqueued job from running

Useful for letting other enqueued jobs run without having to kill/re-run already running jobs. To delay for 7 days:

scontrol update JobID=<JOB ID> StartTime=now+7days

The NODELIST(REASON) field reported by squeue for the delayed jobs will become (BeginTime).
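A minimal sketch of delaying a job and then checking the result, assuming a POSIX-like shell. The function name `delay_job` and the `DRY_RUN` switch are hypothetical helpers introduced here for illustration and are not part of SLURM. The `squeue` format string prints the job ID, state, reason and expected start time.

```shell
# Hypothetical helper (not part of SLURM): delay a pending job,
# then print its ID, state, reason and expected start time.
# Set DRY_RUN=1 to print the commands instead of running them.
delay_job() (
    jobid="$1"
    delay="${2:-7days}"   # default delay is an assumption, matching the example above
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    $run scontrol update "JobID=${jobid}" "StartTime=now+${delay}"
    $run squeue -j "${jobid}" -o "%i %T %r %S"
)
```

Calling `DRY_RUN=1 delay_job <JOB ID> 7days` first is a cheap way to see exactly what would be run.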


Requeue and immediately delay running jobs

For when suspend and hold don't seem to do anything!

You may want to stop running jobs and requeue them further down the queue (i.e. avoid immediately re-running them). This is useful for freeing up nodes to let other jobs run without having to resubmit your running jobs.

To requeue a job and delay it for one day:

(export jobid=<JOB ID>; scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day)

If you have many jobs (each with a unique job ID), list the IDs to requeue and delay in a for loop:

for jobid in <SPACE SEPARATED LIST OF JOB IDS>; do scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day; done
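If you run the loop above often, it could be wrapped in a small function. This is only a sketch: `requeue_and_delay` and `DRY_RUN` are hypothetical names introduced here, and the function simply repeats the two `scontrol` calls for each job ID it is given.

```shell
# Hypothetical wrapper (not part of SLURM): requeue each given job and
# push its start time back by the given delay (e.g. 1day, 2days).
# Set DRY_RUN=1 to print the commands instead of running them.
requeue_and_delay() (
    delay="$1"
    shift                 # remaining arguments are job IDs
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    for jobid in "$@"; do
        $run scontrol requeue "${jobid}"
        $run scontrol update "JobID=${jobid}" "StartTime=now+${delay}"
    done
)
```

For instance, `requeue_and_delay 1day $(squeue -u "$USER" -t RUNNING -h -o %i)` would apply it to all of your currently running jobs. Try it with `DRY_RUN=1` first, since requeueing stops the running job.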

If many of your jobs share a common prefix that you don't want to retype, export it!

(export prefix=<COMMON JOB ID PREFIX>; for suffix in <SPACE SEPARATED LIST OF JOB ID SUFFIXES>; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+1day; done)

For example, given the following list of job IDs...

1234567_10
1234567_11
1234567_12
1234567_13
1234567_14

if you want to requeue + delay jobs 1234567_11 and 1234567_12 for 2 days, you'd call

(export prefix=1234567_1; for suffix in 1 2; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+2days; done)
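When the jobs to requeue form a contiguous range of array tasks, the ID list can also be generated rather than typed. A minimal sketch, assuming IDs of the form `<ARRAY JOB ID>_<TASK ID>` as in the list above. The helper name `array_task_ids` is hypothetical.

```shell
# Hypothetical helper: print the space-separated IDs base_first ... base_last.
array_task_ids() (
    base="$1"
    i="$2"
    last="$3"
    ids=""
    while [ "$i" -le "$last" ]; do
        # Append the next ID, adding a separating space after the first one.
        ids="${ids}${ids:+ }${base}_${i}"
        i=$((i + 1))
    done
    echo "$ids"
)
```

Here `array_task_ids 1234567 11 12` prints `1234567_11 1234567_12`, which can be used as the list in `for jobid in $(array_task_ids 1234567 11 12); do ...; done`.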

Note that SLURM will often not list the re-queued jobs in squeue, but rest assured, they're still enqueued!

Take care to ensure your jobs have everything they need (e.g. files) when they're eventually re-run.

Keep in mind that re-queued jobs may behave differently when re-run. Think carefully about your random seeding, for example!
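One possible way to keep a re-run reproducible is to derive the seed from something that does not change on requeue and to detect the restart inside the job script. The sketch below assumes the standard SLURM job environment variables `SLURM_JOB_ID` and `SLURM_RESTART_COUNT`. The function names are hypothetical, and whether a fixed seed is what you want depends on your workload.

```shell
# Succeeds if SLURM reports that this job has been restarted or requeued.
is_requeued() {
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

# Prints a seed that stays the same across requeues,
# since a requeued job keeps its job ID. Falls back to 0 outside SLURM.
job_seed() {
    echo "${SLURM_JOB_ID:-0}"
}
```

Inside a batch script this could be used as `if is_requeued; then echo "re-run of job ${SLURM_JOB_ID}"; fi`, with `$(job_seed)` passed to your program as its seed.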


