memory alignment - CUDA kernel launch fails when using various offsets into input data

Question

Welcome To Ask or Share your Answers For Others

memory alignment - CUDA kernel launch fails when using various offsets into input data

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

memory alignment - CUDA kernel launch fails when using various offsets into input data

My code is giving an error message and I am trying to track down the cause of it. To make it easier to find the problem, I have stripped away code that apparently is not relevant to causing the error message. If you can tell me why the following simple code produces an error message, then I think I should be able to fix my original code:

#include "cuComplex.h"
#include <cutil.h>

__device__ void compute_energy(void *data, int isample, int nsamples) {
  cuDoubleComplex * const nminusarray          = (cuDoubleComplex*)data;
  cuDoubleComplex * const f                    = (cuDoubleComplex*)(nminusarray+101);
  double          * const abs_est_errorrow_all = (double*)(f+3);
  double          * const rel_est_errorrow_all = (double*)(abs_est_errorrow_all+nsamples*51);
  int             * const iid_all              = (int*)(rel_est_errorrow_all+nsamples*51);
  int             * const iiu_all              = (int*)(iid_all+nsamples*21);
  int             * const piv_all              = (int*)(iiu_all+nsamples*21);
  cuDoubleComplex * const energyrow_all        = (cuDoubleComplex*)(piv_all+nsamples*12);
  cuDoubleComplex * const refinedenergyrow_all = (cuDoubleComplex*)(energyrow_all+nsamples*51);
  cuDoubleComplex * const btplus_all           = (cuDoubleComplex*)(refinedenergyrow_all+nsamples*51);

  cuDoubleComplex * const btplus           = btplus_all+isample*21021;

  btplus[0] = make_cuDoubleComplex(0.0, 0.0);
}

__global__ void computeLamHeight(void *data, int nlambda) {
  compute_energy(data, blockIdx.x, nlambda);
}

int main(int argc, char *argv[]) {
  void *device_data;

  CUT_DEVICE_INIT(argc, argv);
  CUDA_SAFE_CALL(cudaMalloc(&device_data, 184465640));
  computeLamHeight<<<dim3(101, 1, 1), dim3(512, 1, 1), 45000>>>(device_data, 101);
  CUDA_SAFE_CALL(cudaThreadSynchronize());
}

I am using a GeForce GTX 480 and I am compiling the code like so:

nvcc -L /soft/cuda-sdk/4.0.17/C/lib -I /soft/cuda-sdk/4.0.17/C/common/inc -lcutil_x86_64 -arch sm_13 -O3 -Xopencc "-Wall" Main.cu

The output is:

Using device 0: GeForce GTX 480
Cuda error in file 'Main.cu' in line 31 : unspecified launch failure.

EDIT: I have now further simplified the code. The following simpler code still produces the error message:

#include <cutil.h>

__global__ void compute_energy(void *data) {
  *(double*)((int*)data+101) = 0.0;
}

int main(int argc, char *argv[]) {
  void *device_data;

  CUT_DEVICE_INIT(argc, argv);
  CUDA_SAFE_CALL(cudaMalloc(&device_data, 101*sizeof(int)+sizeof(double)));
  compute_energy<<<dim3(1, 1, 1), dim3(1, 1, 1)>>>(device_data);
  CUDA_SAFE_CALL(cudaThreadSynchronize());
}

Now it is easy to see that the offset should be valid. I tried running cuda-memcheck and it says the following:

========= CUDA-MEMCHECK
Using device 0: GeForce GTX 480
Cuda error in file 'Main.cu' in line 13 : unspecified launch failure.
========= Invalid __global__ write of size 8
=========     at 0x00000020 in compute_energy
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x200200194 is misaligned
=========
========= ERROR SUMMARY: 1 error

I tried searching the internet to find what is meant by the address being misaligned, but I failed to find an explanation. What is the deal?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:55:44+0000

It was very hard to parse your original code with all of those magic constants, but your updated repro case makes the problem immediately obvious. The GPU architecture requires all pointers to be aligned to word boundaries. Your kernel contains a pointer access which is not correctly word aligned. Doubles are an 64 bit type, and your addressing is not aligned to an even 64 bit boundary. This:

*(double*)((int*)data+100) = 0.0; // 50th double

or this:

*(double*)((int*)data+102) = 0.0; // 51st double

are both legal. This:

*(double*)((int*)data+101) = 0.0; // not aligned to a 64 bit boundary

is not.

Categories

memory alignment - CUDA kernel launch fails when using various offsets into input data

memory alignment - CUDA kernel launch fails when using various offsets into input data

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags