Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
133 views
in Technique[技术] by (71.8m points)

parallel processing - Grid size in phase #4 of Harris' reduction optimization

I am learning about unrolling loops to optimize kernel computation.

This is a code snippet from the book Professional CUDA C Programming:

if (idx + 4 * blockDim.x <= n)
{
    int a1 = g_idata[idx];
    int a2 = g_idata[idx + blockDim.x];
    int a3 = g_idata[idx + 2 * blockDim.x];
    int a4 = g_idata[idx + 3 * blockDim.x];
    tmpSum = a1 + a2 + a3 + a4;
}

In my understanding, each thread works on 4 data blocks and processes a single element from each data block. So, when we launch kernel, compared with kernel w/o unrolling grid.x, the configuration is changed to reduceSmemUnroll<<<grid.x / 4, block>>>.

Then I have a question about the code snippet from Mark Harris's presentation on parallel reduction on page 32:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;

while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();

My question is about how to determine the size of grid when launching the kernel? Should it be grid.x/2 compared to configuration w/o multiple load?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Yes, it should be half the number of blocks; it says so on the slide with the first occurrence of the code snippet you quoted from in Mark's presentation - already on slide 18:

Halve the number of blocks, and replace single load:

[code snippet]

with two loads and [the] first add of the reduction

Of course, you need to be careful about the sizes. The presentation assumes, for simplicity, that your overall length is a power of 2, so you can always safely divide by 2 while there are multiple elements left. In real life that is not the case, so you may need to allow for slack (e.g. "half the grid size plus one if it was odd").


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...