Summary
In this post, I will introduce how to debug async kernels or async APIs in CUDA.
The async operations will not block CPU codes. When we check the return type of the functions calls, it may be SUCCESS but there are bugs like "illegal memory access". On the other hand, when we find the return type is FAILED, it may mean that previous operations failed. But we do not know which operation has bugs.
Conclusion
Insert the following codes into the suspicious positions:
{
cudaMemoryTest();
kernel_A<<<...>>>;
......
cudaMemoryTest();
kernel_B<<<...>>>;
......
}
The macros are:
#define cudaSafeCall(call) \
do {\
cudaError_t err = call;\
if (cudaSuccess != err) \
{\
std::cerr << "CUDA error in " << __FILE__ << "(" << __LINE__ << "): " \
<< cudaGetErrorString(err);\
exit(EXIT_FAILURE);\
}\
} while(0)
void cudaMemoryTest()
{
const unsigned int N = 1048576;
const unsigned int bytes = N * sizeof(int);
int *h_a = (int*)malloc(bytes);
int *d_a;
cudaSafeCall(cudaMalloc((int**)&d_a, bytes));
memset(h_a, 0, bytes);
cudaSafeCall(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
cudaSafeCall(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost));
}
Details
The above codes work since there are specific behaviors for the streams. In short, the function cudaMemoryTest()
uses the default stream. The default stream will block all previous operations in all other streams (except for non-blocking streams). See my another post for details.
So when we call the function cudaMemoryTest()
, it will block until all previous async codes finish. So if there are bugs, the function cudaMemoryTest()
will fail. Thus, we can easily locate the bugs.