I am learning about unrolling loops to optimize kernel computation.
This is a code snippet from the book Professional CUDA C Programming:
if (idx + 4 * blockDim.x <= n)
{
    int a1 = g_idata[idx];
    int a2 = g_idata[idx + blockDim.x];
    int a3 = g_idata[idx + 2 * blockDim.x];
    int a4 = g_idata[idx + 3 * blockDim.x];
    tmpSum = a1 + a2 + a3 + a4;
}
In my understanding, each thread works on 4 data blocks and processes a single element from each data block.
So, when we launch the kernel, the configuration changes from grid.x blocks (for the kernel without unrolling) to reduceSmemUnroll<<<grid.x / 4, block>>>.
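The launch-configuration arithmetic described above can be sketched on the host. This is only an illustration: gridSizeForUnroll is a hypothetical helper name, not something from the book, and the ceiling division is an assumption so that a trailing partial chunk still gets a block (the book's launch simply uses grid.x / 4).

```cpp
#include <cassert>

// Hypothetical helper (not from the book): number of blocks to launch when
// each block consumes `unroll` consecutive data blocks of `blockSize` elements.
// Ceiling division is an assumption so a partial trailing chunk is not dropped.
inline unsigned int gridSizeForUnroll(unsigned int n,
                                      unsigned int blockSize,
                                      unsigned int unroll)
{
    assert(blockSize > 0 && unroll > 0);
    const unsigned int elemsPerBlock = blockSize * unroll;
    return (n + elemsPerBlock - 1) / elemsPerBlock;
}
```

With n = 1 << 24 and a block size of 512, an unroll factor of 1 gives 32768 blocks and an unroll factor of 4 gives 8192, which matches the grid.x / 4 launch above.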
Then I have a question about the code snippet from Mark Harris's presentation on parallel reduction on page 32:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
My question is: how should the grid size be determined when launching this kernel? Should it be grid.x/2, compared to the configuration without multiple loads per thread?
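One way to explore this without a GPU is to replay the loop's index arithmetic on the host for a candidate grid size. The sketch below is an assumption-laden illustration, not code from the presentation: coversExactlyOnce and the parameter name gridDimX are hypothetical, and the function only mirrors the indexing of the snippet above (it does not perform the reduction).

```cpp
#include <vector>

// Hypothetical host-side check (not from the presentation): replays the index
// arithmetic of the while loop for every (block, thread) pair and reports
// whether each of the n input elements is read exactly once, with no read
// past the end of the array.
inline bool coversExactlyOnce(unsigned int n,
                              unsigned int blockSize,
                              unsigned int gridDimX)
{
    std::vector<unsigned int> hits(n, 0);
    const unsigned int gridSize = blockSize * 2 * gridDimX;
    for (unsigned int b = 0; b < gridDimX; ++b) {
        for (unsigned int t = 0; t < blockSize; ++t) {
            unsigned int i = b * (blockSize * 2) + t;
            while (i < n) {
                // The snippet reads g_idata[i+blockSize] without a guard.
                if (i + blockSize >= n) return false;
                ++hits[i];
                ++hits[i + blockSize];
                i += gridSize;
            }
        }
    }
    for (unsigned int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

For example, with n = 4096 and blockSize = 256, the check passes for gridDimX = 8 (that is, grid.x/2) and also for smaller values such as 4 or 1, because the i += gridSize stride lets each thread pick up further chunks. It fails for n = 4000, where the unguarded second load would run past the end.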