Monday, June 8, 2009

CUDA errors propagate

Sometimes, when a kernel or a cudaMemcpy fails, CUDA takes some time to recover. This often causes the kernels or memcpys that follow the failing call to fail as well, or at least not to execute.

It is therefore important to always include error checking after all CUDA calls (as mentioned before here: http://www.herikstad.net/2009/05/cuda-kernel-errors.html), or alternatively to use the CUDA Utility Library (cutil). Also, if you get several failures in a row, fix the first error first, since the others might only be a consequence of the first call failing.
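A minimal sketch of what manual checking after every call could look like, using only the CUDA runtime API. The helper reportCuda(), the counter g_failureCount and the kernel scaleKernel are hypothetical names made up for this illustration. The sketch also assumes a toolkit that provides cudaDeviceSynchronize(); older toolkits used cudaThreadSynchronize() for the same purpose.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of any CUDA library): reports a failed call
// and marks whether it is the first failure seen or a possible follow-on.
static int g_failureCount = 0;

static bool reportCuda(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    ++g_failureCount;
    fprintf(stderr, "[%s] %s: %s\n",
            g_failureCount == 1 ? "FIRST ERROR" : "possible follow-on",
            what, cudaGetErrorString(err));
    return false;
}

// Trivial illustrative kernel: doubles each element.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_Data[n];
    for (int i = 0; i < n; ++i) h_Data[i] = 1.0f;

    float *d_Data = NULL;
    reportCuda(cudaMalloc((void **)&d_Data, bytes), "cudaMalloc");
    reportCuda(cudaMemcpy(d_Data, h_Data, bytes, cudaMemcpyHostToDevice),
               "cudaMemcpy host to device");

    scaleKernel<<<(n + 255) / 256, 256>>>(d_Data, n);
    // Errors from the launch itself (for example a bad configuration).
    reportCuda(cudaGetLastError(), "scaleKernel launch");
    // Errors raised while the kernel was running.
    reportCuda(cudaDeviceSynchronize(), "scaleKernel execution");

    reportCuda(cudaMemcpy(h_Data, d_Data, bytes, cudaMemcpyDeviceToHost),
               "cudaMemcpy device to host");
    reportCuda(cudaFree(d_Data), "cudaFree");

    return g_failureCount == 0 ? 0 : 1;
}
```

When several lines are printed, the one tagged FIRST ERROR is the one to look at first; the rest may disappear once it is fixed.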


To use cutil you need to include the cutil_inline.h file, located in /NVIDIA_CUDA_SDK/common/inc in your home directory. After doing so, you can wrap all your communication with the CUDA device in inline functions that catch errors. The functions are:
cutilDrvSafeCallNoSync(err)
cutilDrvSafeCall(err)
cutilDrvCtxSync()
cutilSafeCallNoSync(err)
cutilSafeCall(err)
cutilSafeThreadSync()
cufftSafeCall(err)
cutilCheckError(err)
cutilCheckMsg(msg)
cutilSafeMalloc(mallocCall)
cutilCondition(val)
cutilExit(argc, argv)


To use, simply do the following for a kernel:
kernelCall<<<gridSize, blockSize>>>(d_Data, dataSize); // gridSize and blockSize are placeholders for your own launch configuration
cutilCheckMsg("kernelCall failed");


Or for a memcpy:
cutilSafeCall( cudaMemcpy(h_Data, d_Data, dataSize, cudaMemcpyDeviceToHost) );


Note: The CUDA Utility Library is a "wrapper" that makes it easier to use functions such as __cudaSafeCall().
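To illustrate the wrapper idea, here is a rough sketch of how a macro can pass the call site to a checking function. This is not the SDK's source: mySafeCall() and MY_SAFE_CALL are hypothetical names, and the real cutil helpers may differ in detail.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for a __cudaSafeCall()-style helper.
// Prints the file and line of the failing call, then aborts the program.
inline void mySafeCall(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s(%d): CUDA runtime error: %s\n",
                file, line, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// The macro captures the call site so the helper can report it.
#define MY_SAFE_CALL(call) mySafeCall((call), __FILE__, __LINE__)

int main()
{
    float *d_Data = NULL;
    MY_SAFE_CALL(cudaMalloc((void **)&d_Data, 256 * sizeof(float)));
    MY_SAFE_CALL(cudaMemset(d_Data, 0, 256 * sizeof(float)));
    MY_SAFE_CALL(cudaFree(d_Data));
    return 0;
}
```

Aborting on the first failure is a design choice that fits the advice above: the program stops at the call that actually broke, so no follow-on errors get reported.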
