Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
417 views
in Technique[技术] by (71.8m points)

cluster computing - How to hold up a script until a slurm job (start with srun) is completely finished?

I am running a job array with SLURM, with the following job array script (that I run with sbatch job_array_script.sh [args]:

#!/bin/bash

#SBATCH ... other options ...

#SBATCH --array=0-1000%200

srun ./job_slurm_script.py $1 $2 $3 $4

echo 'open' > status_file.txt

To explain, I want job_slurm_script.py to be run as an array job 1000 times with 200 tasks maximum in parallel. And when all of those are done, I want to write 'open' to status_file.txt. This is because in reality I have more than 10,000 jobs, and this is above my cluster's MaxSubmissionLimit, so I need to split it into smaller chunks (at 1000-element job arrays) and run them one after the other (only when the previous one is finished).

However, for this to work, the echo statement can only trigger once the entire job array is finished (outside of this, I have a loop which checks status_file.txt so see if the job is finished, i.e when the contents are the string 'open').

Up to now I thought that srun holds the script up until the whole job array is finished. However, sometimes srun "returns" and the script goes to the echo statement before the jobs are finished, so all the subsequent jobs bounce off the cluster since it goes above the submission limit.

So how do I make srun "hold up" until the whole job array is finished?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can add the flag --wait to sbatch.

Check the manual page of sbatch for information about --wait.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...