slurm - How to start n tasks with one GPU each?

I have a large cluster of compute nodes, each with 6 GPUs, and I want to start, say, 100 workers on it, each having access to exactly one GPU.

What I currently do is this:

sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh

And inside main.sh:

srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
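
For reference, the same setup written with #SBATCH directives instead of command-line flags would look roughly like this (a sketch, not my exact main.sh; worker.sh is where the actual GPU work happens):

#!/bin/bash
#SBATCH --ntasks=100         # 100 workers in total
#SBATCH --gres=gpu:6         # 6 GPUs on each allocated node
#SBATCH --gpus-per-task=1    # intended: exactly one GPU per task

# launch one worker per task; each worker should only see its own GPU
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh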

This way I do get 100 workers started (fully using about 17 nodes), but I have a problem: CUDA_VISIBLE_DEVICES is not set properly.

sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
# CUDA_VISIBLE_DEVICES in main.sh: 0,1,2,3,4,5 (that's fine)
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
# CUDA_VISIBLE_DEVICES in worker.sh: 0,1,2,3,4,5 (this is my problem: how to assign exactly 1 GPU to each worker and to that worker alone?)
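
To make this easier to see, the worker can be swapped for a small diagnostic command; SLURM_PROCID (global task rank) and SLURM_LOCALID (task rank on its node) are set by srun for every task:

srun --gpus-per-task=1 --gres=gpu:1 -n 100 bash -c \
    'echo "task $SLURM_PROCID local $SLURM_LOCALID on $(hostname): CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

Each task prints the full 0,1,2,3,4,5 list instead of a single device, matching what I see inside worker.sh.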

It might be a misunderstanding on my part of how Slurm actually works, since I'm quite new to programming on such HPC systems. But does anyone have a clue how to achieve what I want: each worker having exactly 1 GPU assigned to it, and to it alone?

We use SLURM 20.02.2.
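
One idea I had but have not verified: ignore the CUDA_VISIBLE_DEVICES that Slurm exports and let each worker pick its GPU from its node-local rank. This assumes Slurm never places more than 6 tasks on a node here (which seems to match the roughly 17 nodes I see used):

#!/bin/bash
# worker.sh (sketch): map the node-local task rank (0..5) to one GPU
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
echo "worker $SLURM_PROCID on $(hostname) pinned to GPU $CUDA_VISIBLE_DEVICES"
exec ./my_worker_binary    # hypothetical placeholder for the real work

But I would prefer a built-in Slurm mechanism for this, if one exists.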



1 Reply

Waiting for an expert to reply.
