parallel processing - Grid size in phase #4 of Harris' reduction optimization

Question

posted Jan 31, 2022 in Technique by 深蓝 (71.8m points)

I am learning about unrolling loops to optimize kernel computation.

This is a code snippet from the book Professional CUDA C Programming:

if (idx + 4 * blockDim.x <= n)
{
    int a1 = g_idata[idx];
    int a2 = g_idata[idx + blockDim.x];
    int a3 = g_idata[idx + 2 * blockDim.x];
    int a4 = g_idata[idx + 3 * blockDim.x];
    tmpSum = a1 + a2 + a3 + a4;
}

As I understand it, each thread block now covers 4 data blocks, and each thread processes a single element from each of them. So, compared with the kernel without unrolling, which is launched with grid.x blocks, the launch configuration becomes reduceSmemUnroll<<<grid.x / 4, block>>>.

Then I have a question about the code snippet from Mark Harris's presentation on parallel reduction on page 32:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;

while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();

My question is how to determine the grid size when launching this kernel. Should it be grid.x / 2, compared with the configuration without the multiple loads?

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:15:55+0000

Yes, it should be half the number of blocks. The slide on which the two-load snippet you quoted first appears in Mark's presentation (slide 18) says so:

Halve the number of blocks, and replace single load:

[code snippet]

with two loads and [the] first add of the reduction

Of course, you need to be careful about the sizes. The presentation assumes, for simplicity, that your overall length is a power of 2, so you can always safely divide by 2 while there are multiple elements left. In real life that is often not the case, so you may need to round up to allow for slack (e.g. "half the grid size, plus one if it was odd").
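To make the "allow for slack" point concrete, here is a minimal host-side sketch in plain C++ (it also compiles as host code in a .cu file). The names blocksFor and emulateReduce are hypothetical, not from the book or the slides. blocksFor rounds the block count up with a ceiling division. emulateReduce replays the indexing of the loop quoted in the question on the CPU, with one assumption added: a bounds check on the second load, which the slide code omits because it assumes power-of-2 sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of blocks needed when every thread loads
// `loadsPerThread` elements per pass. Ceiling division keeps a partially
// filled last block instead of dropping the tail of the input.
unsigned int blocksFor(std::size_t n, unsigned int blockSize,
                       unsigned int loadsPerThread)
{
    const std::size_t perBlock =
        static_cast<std::size_t>(blockSize) * loadsPerThread;
    return static_cast<unsigned int>((n + perBlock - 1) / perBlock);
}

// CPU emulation of the indexing in the quoted loop (two loads per thread,
// stride of blockSize * 2 * gridDim.x). It only sums what each thread would
// read; it does not model shared memory or the in-block tree reduction.
long long emulateReduce(const std::vector<int>& g_idata,
                        unsigned int blockSize, unsigned int gridDimX)
{
    const std::size_t n = g_idata.size();
    const std::size_t gridSize =
        static_cast<std::size_t>(blockSize) * 2 * gridDimX;
    long long total = 0;
    for (unsigned int blockIdxX = 0; blockIdxX < gridDimX; ++blockIdxX) {
        for (unsigned int tid = 0; tid < blockSize; ++tid) {
            std::size_t i =
                static_cast<std::size_t>(blockIdxX) * blockSize * 2 + tid;
            while (i < n) {
                total += g_idata[i];
                // Assumption: guard added here; the slide reads
                // g_idata[i + blockSize] unconditionally.
                if (i + blockSize < n) {
                    total += g_idata[i + blockSize];
                }
                i += gridSize;
            }
        }
    }
    return total;
}
```

For example, with n = 1000 and a block size of 256, blocksFor(1000, 256, 1) gives 4 blocks for single loads and blocksFor(1000, 256, 2) gives 2 for two loads per thread, while n = 1025 needs 3 rather than 2. Because the quoted loop strides by gridSize, the emulation also visits every element exactly once when given fewer blocks than that; the rounded-up value is the count at which each thread runs the loop body at most once.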
