Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

cuda - About error code "invalid device function" by nvcc with compute_ and sm_ compile option

I hope you can help me figure out the correct compiler options for the card below:

> ./deviceQuery Starting...
> 
>  CUDA Device Query (Runtime API) version (CUDART static linking)
> 
> Detected 1 CUDA Capable device(s)
> 
> Device 0: "GeForce GTX 780 Ti"   
> CUDA Driver Version / Runtime Version 7.0 / 6.5
> CUDA Capability Major/Minor version number:    3.5  
> Total amount of global memory:                 3072 MBytes (3220897792 bytes)
> (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
> GPU Clock rate:                                1020 MHz (1.02GHz)
> Memory Clock rate:                             3500 Mhz  
> Memory Bus Width:                              384-bit
> L2 Cache Size:                                 1572864 bytes
...
> Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
> Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
> Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
> Total amount of constant memory:               65536 bytes
> Total amount of shared memory per block:       49152 bytes
> Total number of registers available per block: 65536
> Warp size:                                     32
> Maximum number of threads per multiprocessor:  2048
> Maximum number of threads per block:           1024
> Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
> Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
> Maximum memory pitch:                          2147483647 bytes
> Texture alignment:                             512 bytes
> Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
> Run time limit on kernels:                     Yes
> Integrated GPU sharing Host Memory:            No
> Support host page-locked memory mapping:       Yes
> Alignment requirement for Surfaces:            Yes
> Device has ECC support:                        Disabled
> Device supports Unified Addressing (UVA):      Yes
> Device PCI Bus ID / PCI location ID:           3 / 0
> Compute Mode:
>      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> 
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce GTX 780 Ti
> Result = PASS

I have a piece of CUDA code that I build and debug with nvcc (CUDA 6.5). When I add these options:

-arch compute_20 -code sm_20

the program gives me this error:

error code invalid device function

If I remove those options (nvcc source -o exe), the program runs fine. Can anyone help me figure out which compute_ and sm_ values suit my card, based on the ./deviceQuery output above? I read in the NVIDIA manual that using the correct compute_ and sm_ options for the card results in a significant speedup. Has anyone measured this speedup quantitatively?

Thanks


1 Reply

"Invalid Device Function" error in CUDA generally means you have compiled with GPU architecture settings that don't match or are not compatible with the GPU you are running on.

The general process to solve this is to run the deviceQuery sample code on your GPU, determine the compute capability major and minor versions from the output, and use that to select compile architecture settings for your GPU.
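Before changing any flags, it can help to confirm that the failure really is an architecture mismatch reported at kernel launch. The following is a minimal sketch, not taken from the original code: the kernel `noop` and the helper `isArchMismatch` are hypothetical names introduced for illustration, and only standard CUDA runtime calls are used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the runtime reports "invalid device function",
// i.e. no usable device code for this GPU was found in the executable.
bool isArchMismatch(cudaError_t err) {
    return err == cudaErrorInvalidDeviceFunction;
}

// Hypothetical empty kernel; any kernel would do.
__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();

    // Launch-time problems are reported by cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        if (isArchMismatch(err)) {
            fprintf(stderr, "check the -arch/-code settings against the GPU\n");
        }
        return 1;
    }

    // Errors raised while the kernel runs show up at synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("kernel ran\n");
    return 0;
}
```

Built with `-arch compute_20 -code sm_20` and run on a cc 3.5 card, a program like this would be expected to take the "launch failed" branch, which matches the behavior described in the question.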

If your GPU has compute capability X.Y, then a very simple choice would be:

-arch=sm_XY
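If the deviceQuery sample is not at hand, the same major/minor numbers can be read through the runtime API. The sketch below prints the matching shorthand flag for each device; `archFlag` is a hypothetical helper written for this example, not part of CUDA.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: build the shorthand nvcc flag for compute
// capability major.minor, e.g. 3.5 gives "-arch=sm_35".
std::string archFlag(int major, int minor) {
    char buf[32];
    snprintf(buf, sizeof(buf), "-arch=sm_%d%d", major, minor);
    return std::string(buf);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // prop.major and prop.minor are the "Capability Major/Minor" numbers.
        printf("Device %d: \"%s\", cc %d.%d, suggested flag: %s\n",
               dev, prop.name, prop.major, prop.minor,
               archFlag(prop.major, prop.minor).c_str());
    }
    return 0;
}
```

This program contains no kernels, so it should run regardless of which architecture flags it was built with.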

> Can anyone help me figure out which compute_ and sm_ is suitable for my card by looking at the output of ./deviceQuery?

Following your example, the correct settings for GTX 780 Ti are:

-arch compute_35 -code sm_35

The above will generate code that will run on a cc3.5 device (only). I think it's better just to specify:

-arch=sm_35

which is shorthand for the slightly more complicated version:

-gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35

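This will generate code that will run on a cc3.5 or newer device, because the compute_35 PTX embedded alongside the sm_35 machine code can be JIT-compiled by the driver for newer architectures. The 3.5/35 number arises from this line in your deviceQuery output: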

> CUDA Capability Major/Minor version number:    3.5
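To check which architecture the device code was actually built for, a kernel can report the `__CUDA_ARCH__` macro, which nvcc defines during device compilation (350 for compute_35). This is only a rough sketch: `reportArch` and `archMacroValue` are hypothetical names, and device 0 is assumed to be the GPU of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the value __CUDA_ARCH__ takes for compute
// capability major.minor, e.g. 3.5 gives 350.
int archMacroValue(int major, int minor) {
    return major * 100 + minor * 10;
}

// Hypothetical kernel: stores the architecture the device code was compiled for.
__global__ void reportArch(int *out) {
#ifdef __CUDA_ARCH__
    *out = __CUDA_ARCH__;  // only defined while compiling device code
#endif
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumes device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int *dOut = NULL;
    int hOut = 0;
    err = cudaMalloc((void **)&dOut, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    reportArch<<<1, 1>>>(dOut);
    err = cudaGetLastError();  // "invalid device function" would appear here
    if (err == cudaSuccess) {
        err = cudaMemcpy(&hOut, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(dOut);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("GPU is cc %d.%d (%d); kernel was compiled for %d\n",
           prop.major, prop.minor,
           archMacroValue(prop.major, prop.minor), hOut);
    return 0;
}
```

Built with `-arch=sm_35`, the two numbers should agree on a cc 3.5 card. On a newer GPU the kernel would still be expected to report 350, since it is JIT-compiled from the compute_35 PTX.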

If you want to understand the switch options/differences better, I suggest you review the nvcc manual and this question/answer.

For more description of the behavior of the -arch switch, see here.

