c - Nested Parallelism : Why only the main thread runs and executes the parallel for loop four times?

Question

Welcome To Ask or Share your Answers For Others

c - Nested Parallelism : Why only the main thread runs and executes the parallel for loop four times?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - Nested Parallelism : Why only the main thread runs and executes the parallel for loop four times?

My code:

#include <cstdio>
#include "omp.h"

int main() {
    omp_set_num_threads(4);  
    #pragma omp parallel
    {
        #pragma omp parallel for 
        for (int i = 0; i < 6; i++)
        {
            printf("i = %d, I am Thread %d
", i, omp_get_thread_num());
        }    
    }
    return 0;
}

The output that I am getting:

i = 0, I am Thread 0
i = 1, I am Thread 0
i = 2, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 2, I am Thread 0
i = 2, I am Thread 0
i = 3, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 2, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0

Adding "parallel" is the cause of the problem, but I don't know how to explain it.

My Question is: Why is there only the main thread and runs the for loop four times?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:07:51+0000

By default, nested parallelism is disabled. Nonetheless, you can explicitly enable nested parallelism, by either:

   omp_set_nested(1);

or by setting the OMP_NESTED environment variable to true.

also from the OpenMP standard we know that:

When a thread encounters a parallel construct, a team of threads is created to execute the parallel region. The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region. All threads in the new team, including the master thread, execute the region. Once the team is created, the number of threads in the team remains constant for the duration of that parallel region.

From source you can read the following.

OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.

This explains the reason why when you add the second parallel region there is only one thread per team executing the enclosing code (i.e., the for loop). In other words, from the first parallel region, 4 threads are created, each of those threads when encountering the second parallel region will create a new team and become the master of that team (i.e., will have the ID=0 within the newly created team). However, because you did not explicitly enable the nested parallelism, each of those teams is only composed of a single thread. Hence, 4 teams with a thread each will execute the for loop. Consequently, you will have the following statement:

printf("i = %d, I am Thread %d
", i, omp_get_thread_num());

being printed 6 x 4 = 24 times (i.e., the total number of loop iterations multiple by the total number of threads across the 4 teams). The image below provides a visualization of that flow:

If you add a printf statement between the first and the second parallel region, as follows:

int main() {

    omp_set_num_threads(4);
    #pragma omp parallel
    {
       printf("Before nested parallel region:  I am Thread{%d}
", omp_get_thread_num());
       #pragma omp parallel for // Adding "parallel" is the cause of the problem, but I don't know how to explain it.
       for (int i = 0; i < 6; i++)
       {
           printf("i = %d, I am Thread %d
", i, omp_get_thread_num());
       }
    }
    return 0;
}

You would get something similar to the following output (bear in mind that the order in which the first 4 lines are outputted is nondeterministic).

Before nested parallel region:  I am Thread{1}
Before nested parallel region:  I am Thread{0}
Before nested parallel region:  I am Thread{2}
Before nested parallel region:  I am Thread{3}
i = 0, I am Thread 0
i = 0, I am Thread 0
i = 0, I am Thread 0
(...)
i = 5, I am Thread 0

Meaning that within the first parallel region (but still outside of the second parallel region) there is a single team of 4 threads -- with IDs varying from 0 to 3 -- executing in parallel. Hence, each of those threads will execute the printf statement:

printf("I am Thread outside the nested region {%d}
", omp_get_thread_num());

and display a different value for the omp_get_thread_num() method call.

As previously mentioned, the nested parallelism is disabled. Thus, when each of those threads encounters the second parallel region, each will create a new team and becomes the master (i.e., will have the ID=0 within the newly created team). -- and the only member -- of that team. Hence, why the statement

 printf("i = %d, I am Thread %d
", i, omp_get_thread_num());

inside the loop, outputs always (..) I am Thread 0, since the method omp_get_thread_num() in this context will return always 0. However, even though the method omp_get_thread_num() is returning 0, it does not imply that the code is being executed sequently (by the thread with ID=0), but rather that each master of each of the 4 teams is returning their ID=0.

If you enabled the nested parallelism you will have a flow like shown in the image below:

The execution of threads 1 to 3 was omitted for simplicity sake, nonetheless it would have been the same as thread 0.

So, from the first parallel region, a team with 4 threads is created. After encountering the next parallel region each thread from the previous team, will create a new team of 4 threads each, so at the moment we have a total of 16 threads across 4 teams. Finally, each team will execute the entire for loop. However, because you have a #pragma omp parallel for constructor, the iterations of the for loop will be divided among the threads within each team.

Bear in mind that in the image above, I am assuming a certain static loop distribution of iterations among loops, I am not implying that the loop iterations will always be divided like this across all the implementations of the OpenMP standard.

Categories

c - Nested Parallelism : Why only the main thread runs and executes the parallel for loop four times?

c - Nested Parallelism : Why only the main thread runs and executes the parallel for loop four times?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags