Summary
In this post, I will introduce the synchronous and asynchronous behaviors of CUDA kernels and streams.
Conclusion
The following is handy code for testing the behavior of the CPU and CUDA streams.
// simple busy-wait so a kernel takes a noticeable amount of GPU time
__device__ void clock_block(clock_t clock_count) {
    clock_t start = clock();
    while (clock() - start < clock_count) ;
}

__global__ void cuda_hello1() {
    clock_block(10000);
    printf("Hello World from GPU1!\n");
}

__global__ void cuda_hello2() {
    printf("Hello World from GPU2!\n");
    clock_block(10000);
}

void cpu_hello() {
    printf("hello world from cpu?\n");
}
/*
hello world from cpu?
hello world from cpu?
Hello World from GPU1!
Hello World from GPU2!
*/
int main1() {
    cudaStream_t streams;
    cudaStreamCreate(&streams);  // a blocking stream (default flags)
    cpu_hello();
    cuda_hello1<<<1, 1, 0, streams>>>();
    // the legacy default stream blocks other blocking streams
    cuda_hello2<<<1, 1>>>();
    cpu_hello();
    cudaDeviceSynchronize();
    return 0;
}
/*
hello world from cpu?
hello world from cpu?
Hello World from GPU2!
Hello World from GPU1!
*/
int main2() {
    cudaStream_t streams;
    cudaStreamCreateWithFlags(&streams, cudaStreamNonBlocking);
    // the legacy default stream does not block non-blocking streams
    cpu_hello();
    cuda_hello1<<<1, 1, 0, streams>>>();
    cuda_hello2<<<1, 1>>>();
    cpu_hello();
    cudaDeviceSynchronize();
    return 0;
}
Details
There are two aspects: kernels and streams.
1. Kernels
Some of my conclusions are:
- All kernel launches return to the host immediately, whether we use the default stream or customized streams;
- Kernels launched in different streams may run concurrently; kernels in the same stream (including the default stream) run sequentially;
- Different host threads share the same default stream (the legacy stream) unless the per-thread option is turned on;
- A kernel launched into the legacy stream will not start on the device until all previously launched work in all other blocking streams has finished (non-blocking streams are exempt). Use
cudaStreamCreateWithFlags(&streams, cudaStreamNonBlocking)
to create non-blocking streams.
2. Default Stream (aka Stream ‘0’, the legacy stream)
Usually, we use APIs without the “Async” suffix. These APIs are synchronous w.r.t. both host and device: the host returns only after the API has finished. However, there are some exceptions.
Exceptions – asynchronous w.r.t. the host, returning immediately:
- Kernel launches in the default stream;
- cudaMemcpyAsync and cudaMemsetAsync;
- cudaMemcpy within the same device (device-to-device within a single GPU); cudaMemcpy from host to host, however, is synchronous.
For example,
// synchronous with respect to the device: the device executes, in order,
// cudaMalloc, cudaMemcpy, kernel2, kernel3, cudaMemcpy
cudaMalloc ( &dev1, size ) ;
double* host1 = (double*) malloc ( size ) ;
…
cudaMemcpy ( dev1, host1, size, H2D ) ;
// GPU kernels are asynchronous w.r.t. the host by default: the launch returns immediately
kernel2 <<< grid, block, 0 >>> ( …, dev2, … );
// potentially overlapped: the CPU executes this function concurrently with kernel2
some_CPU_method ();
kernel3 <<< grid, block, 0 >>> ( …, dev3, … );
cudaMemcpy ( host4, dev4, size, D2H ) ;
Per-thread stream
CUDA 7 introduced the per-thread default stream. It has two effects.
- It gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently;
- These default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams.
Note, however, that the per-thread default stream is not a non-blocking stream: it will still synchronize with the legacy default stream if both are used in the same program.
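A minimal sketch of this behavior (assuming nvcc's `--default-stream per-thread` compile flag; the kernel and helper names here are made up for illustration):

```cuda
// build: nvcc --default-stream per-thread per_thread.cu -o per_thread
#include <cstdio>
#include <thread>

__global__ void busy_hello(int id) {
    // busy-wait so that overlap between the two launches is observable
    long long start = clock64();
    while (clock64() - start < 100000000LL) ;
    printf("Hello from host thread %d's default stream\n", id);
}

void launch(int id) {
    // with --default-stream per-thread, each host thread's NULL-stream
    // launches go to that thread's own default stream and may overlap
    busy_hello<<<1, 1>>>(id);
    cudaStreamSynchronize(0);  // waits only on this thread's default stream
}

int main() {
    std::thread t1(launch, 1), t2(launch, 2);
    t1.join();
    t2.join();
    return 0;
}
```

Without the compile flag, the two launches would serialize on the shared legacy stream.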
3. Customized Stream
Usually, we will use APIs with the “Async” suffix. Async APIs return to the host immediately, with the following exceptions.
Exceptions – synchronous w.r.t. the host, blocking:
- cudaMemcpy*Async from device memory to pageable host memory: the call blocks until the copy has completed;
- cudaMemcpy*Async from any host memory to any host memory: the call blocks until the copy has completed;
- For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this is handled asynchronously by a worker thread.
For example,
cudaStream_t stream1, stream2, stream3, stream4 ;
cudaStreamCreate ( &stream1) ;
...
cudaMalloc ( &dev1, size ) ;
// pinned memory required on the host
cudaMallocHost ( &host1, size ) ;
// pageable memory
double* host2 = (double*) malloc ( size ) ;
// returns to the CPU immediately
cudaMemcpyAsync ( dev1, host1, size, H2D, stream1 ) ;
kernel2 <<< grid, block, 0, stream2 >>> ( …, dev2, … ) ;
kernel3 <<< grid, block, 0, stream3 >>> ( …, dev3, … ) ;
cudaMemcpyAsync ( host4, dev4, size, D2H, stream4 ) ;
// blocks until the copy has completed, since host2 is pageable memory
cudaMemcpyAsync ( host2, dev1, size, D2H, stream1 ) ;
some_CPU_method ();
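To wait on an individual stream instead of the whole device, `cudaStreamSynchronize` or events can be used. A minimal sketch, continuing with the stream names declared above (the event name `ev` is made up):

```cuda
// block the host until only stream1's work has finished;
// work in stream2/stream3/stream4 keeps running
cudaStreamSynchronize ( stream1 ) ;

// make stream3 wait for stream2 without blocking the host:
cudaEvent_t ev ;
cudaEventCreate ( &ev ) ;
cudaEventRecord ( ev, stream2 ) ;        // marks a point in stream2
cudaStreamWaitEvent ( stream3, ev, 0 ) ; // stream3 waits for that point

// drain all streams before exiting
cudaDeviceSynchronize ( ) ;
cudaEventDestroy ( ev ) ;
```

The event-based form expresses a dependency between two streams entirely on the device side, so the CPU is free to continue.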